5.3: Perception Pipeline

End-to-End Perception Architecture

Lane Detection Pipeline

System: Camera → Preprocessing → CNN → Postprocessing → LKA Control

The following diagram details the lane detection perception pipeline stages, showing data flow from raw camera frames through preprocessing, neural network inference, postprocessing, and output to the LKA controller.

[Figure: Lane Detection Perception Pipeline]

Latency Breakdown:

  • Camera capture: 33ms (hardware constraint, 30 FPS = 33ms per frame)
  • Preprocessing: 3ms (GPU-accelerated resize, normalize)
  • CNN inference: 15ms (TensorRT optimized, INT8 quantization)
  • Postprocessing: 5ms (curve fitting, confidence scoring)
  • Control input: 2ms (data transfer to LKA controller)

Total: 58ms end-to-end (exceeds the 30ms inference target once the 33ms camera frame time is included, but fits comfortably within the 100ms LKA control loop)
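As a sanity check, the budget above can be verified mechanically; a minimal sketch using the stage timings from the breakdown (the dictionary keys are illustrative names):

```python
# Perception-pipeline latency budget (values from the breakdown above)
PIPELINE_MS = {
    "camera_capture": 33.0,   # hardware constraint: 30 FPS ≈ 33 ms/frame
    "preprocessing": 3.0,
    "cnn_inference": 15.0,
    "postprocessing": 5.0,
    "control_input": 2.0,
}
CONTROL_LOOP_MS = 100.0       # LKA control-loop period

total_ms = sum(PIPELINE_MS.values())
assert total_ms <= CONTROL_LOOP_MS, "perception exceeds the control-loop budget"
print(f"End-to-end latency: {total_ms:.0f} ms "
      f"({total_ms / CONTROL_LOOP_MS:.0%} of the control loop)")
# → End-to-end latency: 58 ms (58% of the control loop)
```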


CNN Deployment: PyTorch → TensorRT

Model Optimization Pipeline

Challenge: PyTorch model (FP32, 48 MB) too slow for real-time (45ms latency)

Solution: Optimize with NVIDIA TensorRT (INT8 quantization, kernel fusion)

Optimization Steps:

Step 1: PyTorch → ONNX Export

import torch
import torch.onnx

# Load trained PyTorch model
model = torch.load("lane_detection_cnn.pth", map_location="cuda")
model.eval()

# Dummy input: one 640x480 RGB frame in NCHW layout (batch, channels, H, W)
dummy_input = torch.randn(1, 3, 480, 640).cuda()

# Export to ONNX format (interoperable with TensorRT)
torch.onnx.export(
    model,
    dummy_input,
    "lane_detection_cnn.onnx",
    export_params=True,
    opset_version=13,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},   # Support variable batch size
        "output": {0: "batch_size"}
    }
)

print("ONNX model exported: lane_detection_cnn.onnx")

Output: ONNX model (48 MB, FP32 precision)


Step 2: ONNX → TensorRT Optimization

Tool: TensorRT (NVIDIA inference optimizer)

Optimizations:

  1. INT8 Quantization: FP32 (32-bit float) → INT8 (8-bit integer), 4× size reduction
  2. Kernel Fusion: Combine Conv + BatchNorm + ReLU into single GPU kernel (3× speedup)
  3. Layer Profiling: Auto-select fastest GPU kernel for each layer
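The INT8 idea behind optimization 1 can be illustrated with NumPy; a toy symmetric, per-tensor scheme (TensorRT's entropy calibration is more sophisticated, but the storage arithmetic is the same):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8: x ≈ scale * q with q in [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(64, 3, 3, 3)).astype(np.float32)

q, scale = quantize_int8(weights)
error = float(np.abs(q.astype(np.float32) * scale - weights).max())
print(f"max round-trip error: {error:.6f} (quantization step: {scale:.6f})")
# Storage drops from 4 bytes to 1 byte per weight — the 4x reduction above
```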

Calibration (INT8 quantization requires calibration dataset):

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

# TensorRT logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Load ONNX model
def build_engine(onnx_file_path, calibration_dataset):
    """
    Build TensorRT engine with INT8 quantization
    """
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse ONNX model
    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    # Builder config (INT8 quantization)
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)  # Enable INT8
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB

    # INT8 Calibration (using 1,000 images from training set)
    calibrator = Int8Calibrator(calibration_dataset)
    config.int8_calibrator = calibrator

    # Build TensorRT engine (optimized for Jetson Orin)
    engine = builder.build_serialized_network(network, config)

    # Save engine to file
    with open("lane_detection_cnn.trt", "wb") as f:
        f.write(engine)

    return engine

class Int8Calibrator(trt.IInt8EntropyCalibrator2):
    """
    INT8 calibration: Find optimal quantization scales
    """
    def __init__(self, calibration_dataset):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.dataset = calibration_dataset  # 1,000 images
        self.batch_size = 1
        self.current_index = 0
        # Allocate GPU memory for one calibration batch up front
        self.device_input = cuda.mem_alloc(self.dataset[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.current_index < len(self.dataset):
            # Load next image from calibration set
            image = np.ascontiguousarray(self.dataset[self.current_index])
            self.current_index += 1

            # Copy to GPU memory
            cuda.memcpy_htod(self.device_input, image)
            return [int(self.device_input)]
        else:
            return None  # No more batches: calibration finished

    # Required by the calibrator interface; returning None forces
    # recalibration instead of reusing a cached table
    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        pass

# Build TensorRT engine
calibration_dataset = load_calibration_images("data/calibration/", 1000)
engine = build_engine("lane_detection_cnn.onnx", calibration_dataset)

Output: TensorRT engine file (lane_detection_cnn.trt, 12 MB, INT8 quantized)


Step 3: TensorRT Inference (C++ Deployment)

Integration: C++ wrapper for AUTOSAR/ROS2 integration

/**
 * @file lane_detection_inference.cpp
 * @brief TensorRT inference wrapper for lane detection CNN
 * @safety_class ASIL-B (perception critical)
 */

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <opencv2/opencv.hpp>
#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

using namespace nvinfer1;

// Minimal logger implementation required by the TensorRT runtime
class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) {
            std::cerr << msg << std::endl;
        }
    }
};

class LaneDetectionInference {
public:
    LaneDetectionInference(const std::string& engine_file) {
        // Load TensorRT engine
        std::ifstream file(engine_file, std::ios::binary);
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();
        file.seekg(0, std::ios::beg);
        std::vector<char> buffer(size);
        file.read(buffer.data(), size);

        // Deserialize engine
        IRuntime* runtime = createInferRuntime(gLogger);
        engine_ = runtime->deserializeCudaEngine(buffer.data(), size);
        context_ = engine_->createExecutionContext();

        // Allocate GPU buffers
        cudaMalloc(&input_buffer_, 640 * 480 * 3 * sizeof(float));
        cudaMalloc(&output_buffer_, 640 * 480 * sizeof(float));
    }

    /**
     * @brief Run inference on input image
     * @param[in] image Input RGB image (640x480)
     * @param[out] segmentation_mask Output binary mask (640x480)
     * @return Latency (ms)
     */
    float infer(const cv::Mat& image, cv::Mat& segmentation_mask) {
        auto start = std::chrono::high_resolution_clock::now();

        // Preprocess: Normalize [0, 255] → [0.0, 1.0]
        // (assumes the network consumes interleaved HWC data in OpenCV's BGR
        //  order; insert cv::cvtColor / HWC→CHW conversion if it does not)
        cv::Mat normalized;
        image.convertTo(normalized, CV_32F, 1.0 / 255.0);

        // Copy input to GPU
        cudaMemcpy(input_buffer_, normalized.data,
                   640 * 480 * 3 * sizeof(float),
                   cudaMemcpyHostToDevice);

        // Run inference
        void* buffers[] = {input_buffer_, output_buffer_};
        context_->executeV2(buffers);

        // Copy output from GPU
        std::vector<float> output(640 * 480);
        cudaMemcpy(output.data(), output_buffer_,
                   640 * 480 * sizeof(float),
                   cudaMemcpyDeviceToHost);

        // Threshold output (0.5) to binary mask
        segmentation_mask = cv::Mat(480, 640, CV_8U);
        for (int i = 0; i < 640 * 480; i++) {
            segmentation_mask.data[i] = (output[i] > 0.5) ? 255 : 0;
        }

        auto end = std::chrono::high_resolution_clock::now();
        float latency_ms = std::chrono::duration<float, std::milli>(end - start).count();

        return latency_ms;
    }

private:
    IExecutionContext* context_;
    ICudaEngine* engine_;
    void* input_buffer_;
    void* output_buffer_;
    Logger gLogger;
};

// Example usage
int main() {
    LaneDetectionInference detector("lane_detection_cnn.trt");

    // Load test image
    cv::Mat image = cv::imread("test_image.jpg");
    cv::resize(image, image, cv::Size(640, 480));

    // Run inference
    cv::Mat segmentation_mask;
    float latency = detector.infer(image, segmentation_mask);

    std::cout << "Inference latency: " << latency << " ms" << std::endl;
    // Output: Inference latency: 15.2 ms

    // Visualize result
    cv::imshow("Segmentation Mask", segmentation_mask);
    cv::waitKey(0);

    return 0;
}

Performance:

  • FP32 (PyTorch): 45ms latency
  • INT8 (TensorRT): 15ms latency [PASS] (3× speedup)
  • Model Size: 48 MB → 12 MB (4× reduction)
  • Accuracy: 95.2% IoU → 94.8% IoU (0.4% loss, acceptable)

Postprocessing: Classical Computer Vision

Polynomial Curve Fitting

Goal: Convert pixel segmentation mask → Lane line polynomial (for steering control)

Algorithm: RANSAC + 3rd-order polynomial fitting

import numpy as np
from sklearn.linear_model import RANSACRegressor
from sklearn.preprocessing import PolynomialFeatures

def fit_lane_polynomial(segmentation_mask):
    """
    Fit 3rd-order polynomial to lane line pixels

    Input:
      - segmentation_mask: Binary mask (640x480, lane pixels = 255)

    Output:
      - left_poly: ascending coefficients [a0, a1, a2, a3] of the left lane,
        fitted as x = a0 + a1*y + a2*y² + a3*y³ (x as a function of image row y)
      - right_poly: same, for the right lane
      - confidence: 0.0-1.0
    """
    # Extract lane pixels (white pixels in mask)
    lane_pixels = np.argwhere(segmentation_mask == 255)  # (y, x) coordinates

    if len(lane_pixels) < 100:
        return None, None, 0.0  # Too few pixels, low confidence

    # Split left and right lanes (assume center x=320)
    left_pixels = lane_pixels[lane_pixels[:, 1] < 320]
    right_pixels = lane_pixels[lane_pixels[:, 1] > 320]

    # Fit left lane (RANSAC for robustness to outliers)
    if len(left_pixels) > 50:
        X_left = left_pixels[:, 0].reshape(-1, 1)  # y coordinates
        y_left = left_pixels[:, 1]                 # x coordinates

        # Polynomial features (3rd order)
        poly = PolynomialFeatures(degree=3)
        X_left_poly = poly.fit_transform(X_left)

        # RANSAC regression (reject outliers)
        ransac = RANSACRegressor(residual_threshold=10.0, max_trials=1000)
        ransac.fit(X_left_poly, y_left)

        # coef_[0] belongs to the constant PolynomialFeatures column and comes
        # back as 0 when the estimator fits its own intercept; fold it back in
        left_poly = ransac.estimator_.coef_.copy()
        left_poly[0] += ransac.estimator_.intercept_
    else:
        left_poly = None

    # Fit right lane (same procedure)
    if len(right_pixels) > 50:
        X_right = right_pixels[:, 0].reshape(-1, 1)
        y_right = right_pixels[:, 1]

        poly = PolynomialFeatures(degree=3)
        X_right_poly = poly.fit_transform(X_right)

        ransac = RANSACRegressor(residual_threshold=10.0, max_trials=1000)
        ransac.fit(X_right_poly, y_right)

        right_poly = ransac.estimator_.coef_.copy()
        right_poly[0] += ransac.estimator_.intercept_
    else:
        right_poly = None

    # Calculate confidence (based on lane continuity, width)
    confidence = calculate_confidence_from_fit(left_poly, right_poly, lane_pixels)

    return left_poly, right_poly, confidence


def calculate_lateral_offset(left_poly, right_poly, image_height=480):
    """
    Calculate lateral offset from lane center (meters)

    Camera calibration: 1 pixel = 0.01 meters (at 50m distance)
    """
    if left_poly is None or right_poly is None:
        return None, 0.0  # Missing lane, no offset

    # Evaluate polynomials at bottom of image (y=480, closest to vehicle)
    x_left = np.polyval(left_poly[::-1], image_height)
    x_right = np.polyval(right_poly[::-1], image_height)

    # Lane center (x coordinate)
    lane_center_x = (x_left + x_right) / 2.0

    # Vehicle center (assume camera at image center)
    vehicle_center_x = 320.0  # pixels

    # Lateral offset (pixels)
    offset_pixels = vehicle_center_x - lane_center_x

    # Convert to meters (calibration: 1 pixel = 0.01m)
    offset_meters = offset_pixels * 0.01

    return offset_meters, 1.0  # High confidence (both lanes detected)

Example:

  • Input: Segmentation mask (left lane at x=150-170, right lane at x=470-490)
  • Output:
    • Left polynomial: x(y) = 0.002y³ - 0.5y² + 60y + 150
    • Right polynomial: x(y) = 0.002y³ - 0.5y² + 60y + 470
    • Lane center: (150 + 470) / 2 = 310 pixels
    • Vehicle center: 320 pixels
    • Lateral offset: (320 - 310) × 0.01 = +0.10 meters (10cm right of center)

Latency: 5ms (NumPy, RANSAC on CPU)
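A quick NumPy sanity check of the pixel-to-meter conversion, reproducing the worked example's +0.10 m offset with synthetic straight lanes (np.polyfit stands in for the RANSAC pipeline; the 0.01 m/pixel calibration is the figure quoted above):

```python
import numpy as np

METERS_PER_PIXEL = 0.01    # camera calibration figure from the text
VEHICLE_CENTER_X = 320.0   # image center, pixels

# Synthetic straight lanes at the example positions (x = 150 and x = 470)
ys = np.arange(480, dtype=np.float64)
left_poly = np.polyfit(ys, np.full_like(ys, 150.0), deg=3)
right_poly = np.polyfit(ys, np.full_like(ys, 470.0), deg=3)

# Evaluate at the bottom image row (closest to the vehicle)
x_left = np.polyval(left_poly, 479)
x_right = np.polyval(right_poly, 479)

lane_center = (x_left + x_right) / 2.0           # (150 + 470) / 2 = 310 px
offset_m = (VEHICLE_CENTER_X - lane_center) * METERS_PER_PIXEL
print(f"lateral offset: {offset_m:+.2f} m")      # matches the +0.10 m example
```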


Lessons Learned

What Worked Well [PASS]

1. TensorRT Optimization (3× Speedup)

Success: INT8 quantization + kernel fusion reduced latency 45ms → 15ms

Evidence:

  • Meets the 30ms latency requirement with a 15ms margin
  • 0.4% accuracy loss (95.2% → 94.8% IoU, acceptable)
  • 4× smaller model (48 MB → 12 MB, fits in ECU flash)

Recommendation: TensorRT mandatory for production ML deployment on embedded GPUs


2. Confidence Scoring (SOTIF Safe Degradation)

Success: 78% of ODD exit scenarios correctly detected (confidence <0.5) → LKA disabled

Example: Heavy rain scenario

  • CNN outputs low-quality segmentation (many false positives)
  • Confidence score: 0.42 (below 0.5 threshold)
  • LKA disables within 2 seconds (safe degradation)
  • Driver alerted: "LKA unavailable - Lane markings not detected"

Lesson: Confidence scoring is critical for ML in safety-critical systems (ISO 21448 requirement)
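The disable behavior described above amounts to a debounced threshold check; a minimal sketch (the class name and 60-frame window are illustrative — 60 frames at 30 FPS approximates the 2-second disable time; the 0.5 threshold is from the text):

```python
class LkaConfidenceMonitor:
    """Disable LKA once perception confidence stays below threshold.

    Debounced over consecutive frames so one noisy frame does not
    disengage the function; re-engagement is left to the driver.
    """
    def __init__(self, threshold=0.5, disable_after_frames=60):
        self.threshold = threshold
        self.disable_after_frames = disable_after_frames
        self.low_conf_frames = 0
        self.lka_enabled = True

    def update(self, confidence):
        if confidence < self.threshold:
            self.low_conf_frames += 1
            if self.low_conf_frames >= self.disable_after_frames:
                self.lka_enabled = False  # safe degradation: alert the driver
        else:
            self.low_conf_frames = 0      # recovery resets the debounce
        return self.lka_enabled

monitor = LkaConfidenceMonitor()
for _ in range(60):                       # sustained heavy-rain confidence
    enabled = monitor.update(0.42)
print("LKA enabled:", enabled)            # → LKA enabled: False
```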


3. Shadow Mode Field Testing (100 Vehicles, 12 Months)

Success: Discovered 542 "Unknown" scenarios, moved to "Known" quadrant

Example Discoveries:

  • Road art mistaken for lanes (Berlin, Month 8)
  • Construction zone temporary markings (Munich, Month 5)
  • Wet road glare (France, Month 9)

Benefit: Continuous improvement (dataset v1.1 includes discovered scenarios, model retrained)

Recommendation: Shadow mode essential for ML validation (real-world >> simulation)


What Didn't Work [WARN]

1. Initial Dataset Bias (Underrepresented ODD Boundaries)

Problem: Training dataset had 70% well-marked highways, only 4% faded markings

Impact: Model accuracy at ODD boundary (faded lanes): 78% IoU (vs 95% nominal)

Root Cause: Data collection focused on nominal scenarios, neglected edge cases

Fix (Month 16):

  • Collected 10,000 additional faded lane images (targeted data collection)
  • Retrained model (dataset v1.1) → Faded lane IoU improved to 88%

Lesson: ODD boundaries need deliberate oversampling (not just nominal scenarios)

Recommendation: Allocate 20% of dataset to edge cases (vs 80% nominal)
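The 80/20 recommendation can be enforced when composing a training set; a minimal sketch using only the standard library (file names and pool sizes are illustrative):

```python
import random

def compose_dataset(nominal, edge_cases, edge_fraction=0.2, size=1000, seed=42):
    """Sample a training set with a fixed share of edge-case images.

    Samples with replacement, so a small pool of rare ODD-boundary
    images can still fill its 20% quota (deliberate oversampling).
    """
    rng = random.Random(seed)
    n_edge = int(size * edge_fraction)
    sample = ([rng.choice(edge_cases) for _ in range(n_edge)] +
              [rng.choice(nominal) for _ in range(size - n_edge)])
    rng.shuffle(sample)
    return sample

nominal = [f"highway_{i:05d}.png" for i in range(5000)]  # well-marked lanes
edge = [f"faded_{i:04d}.png" for i in range(200)]        # rare boundary cases
train = compose_dataset(nominal, edge)
print(sum(n.startswith("faded_") for n in train), "edge-case images of", len(train))
# → 200 edge-case images of 1000
```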


2. INT8 Quantization Accuracy Loss Underestimated

Problem: Initial INT8 quantization: 95.2% → 92.1% IoU (3.1% loss, exceeded 1% target)

Root Cause: Naive quantization (symmetric, per-tensor) not suitable for CNN with outliers

Fix:

  • Asymmetric quantization (separate zero-point for negative/positive)
  • Per-channel quantization (different scales for each conv filter)
  • Calibration dataset: 1,000 → 5,000 images (better statistical coverage)

Result: 95.2% → 94.8% IoU (0.4% loss, acceptable) [PASS]

Lesson: INT8 quantization is non-trivial (requires careful calibration, not automatic)

Time Cost: 3 weeks to optimize quantization (vs 1 day expected)

Quantization Best Practices: For safety-critical ML: (1) Always compare quantized vs full-precision accuracy on held-out test set, (2) Test quantized model on corner cases (edge cases may be more sensitive to precision loss), (3) Document quantization methodology in MLE.5 deployment artifacts for safety assessment.
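The per-channel fix can be demonstrated in NumPy: when one conv filter has a much larger weight range than the rest (the outliers mentioned above), a single per-tensor scale wastes most of the INT8 range on the small filters. A toy illustration of the effect, not TensorRT's actual calibrator:

```python
import numpy as np

def mean_roundtrip_error(w, scales):
    """Mean absolute INT8 round-trip error for weights w at the given scales."""
    q = np.clip(np.round(w / scales), -127, 127)
    return float(np.abs(q * scales - w).mean())

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.01, size=(8, 64))   # 8 conv filters, 64 weights each
w[0] *= 50.0                              # filter 0 is an outlier

per_tensor = np.abs(w).max() / 127.0                        # one scale for all
per_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one per filter

err_t = mean_roundtrip_error(w, per_tensor)
err_c = mean_roundtrip_error(w, per_channel)
print(f"per-tensor:  {err_t:.6f}")
print(f"per-channel: {err_c:.6f}")        # far smaller for filters 1..7
```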


3. Postprocessing Latency Underestimated (5ms Target, 12ms Actual)

Problem: Polynomial curve fitting (RANSAC) slower than expected

Root Cause:

  • RANSAC: 1,000 trials per lane (overkill, could reduce to 200)
  • CPU-based (not GPU-accelerated, bottleneck)

Fix:

  • Reduced RANSAC trials: 1,000 → 200 (still robust, 3× speedup)
  • Moved polynomial fitting to GPU (cuRANSAC, CUDA library)
  • Result: 12ms → 5ms latency [PASS]

Lesson: Profile early (don't assume classical CV is "cheap")


Project Metrics Summary

Final Results

| Metric | Target | Achieved | Notes |
|--------|--------|----------|-------|
| Timeline | 24 months | 24 months | [PASS] On time |
| Budget | €3.5M | €3.48M | [PASS] Under budget (€20k savings) |
| Model Accuracy | ≥92% IoU | 95.2% IoU | [PASS] Exceeds target |
| Latency | ≤30ms | 15ms | [PASS] 2× faster than target |
| SOTIF Scenarios | 10,000 | 10,542 | [PASS] 97.9% pass rate |
| Field Performance | ≥95% success | 98.7% | [PASS] 5,000 km proving ground |
| TÜV Certification | ASIL-B + SOTIF | Certified | [PASS] ISO 26262 + ISO 21448 |

AI Contribution Summary

| Activity | Traditional Time | AI-Assisted Time | Improvement |
|----------|------------------|------------------|-------------|
| Dataset Annotation | 20,000 hours | 20,000 hours | 0% (manual labor) [FAIL] |
| Hyperparameter Tuning | 300 hours | 150 hours | 50% faster (Optuna) [PASS] |
| Model Architecture Search | 80 hours | 40 hours | 50% faster (literature review, ChatGPT-4) [PASS] |
| SOTIF Scenario Generation | 120 hours | 60 hours | 50% faster (ChatGPT-4 brainstorming) [PASS] |
| Code Generation (C++ inference) | 40 hours | 25 hours | 38% faster (GitHub Copilot) [PASS] |

Overall: 20,540 hours → 20,275 hours (1.3% reduction)

Why So Small?: Dataset annotation (20,000 hours) dominates, no AI replacement available yet

Future: Semi-automated annotation (SAM, Segment Anything Model) could reduce to 10,000 hours (50% savings)

Emerging Annotation Tools: Foundation models like Segment Anything (SAM), CLIP, and domain-specific auto-labeling tools are rapidly improving. For future projects, evaluate semi-automated pipelines where AI generates initial annotations and humans verify/correct. This human-in-the-loop approach can reduce annotation time by 30-50% while maintaining quality.


Recommendations for Future ML ADAS Projects

For MLE Process

  1. Dataset is King

    • Invest 50% of budget in dataset (collection, annotation, quality)
    • Oversample ODD boundaries (20% of dataset) → Improves edge case performance
    • Version datasets with DVC (reproducibility essential for safety assessment)
  2. INT8 Quantization is Non-Trivial

    • Budget 3 weeks for quantization optimization (not 1 day)
    • Use per-channel, asymmetric quantization (better accuracy preservation)
    • Calibration dataset: 5,000+ images (vs 1,000 minimum)
  3. Shadow Mode Essential

    • Field testing (100 vehicles, 12 months) → Discover "Unknown Unsafe" scenarios
    • Budget €200k for fleet instrumentation (cameras, data logging, cloud storage)
    • ROI: 542 scenarios discovered (invaluable for SOTIF compliance)

For SOTIF Compliance

  1. Confidence Scoring from Day 1

    • Design confidence metric during architecture phase (Month 5, not Month 18)
    • Validate thresholds on 10,000 scenarios (tune 0.5 threshold empirically)
    • Integrate with LKA state machine (safe degradation)
  2. 10,000+ Scenarios is Real

    • Don't underestimate SOTIF test effort (10,542 scenarios = 3 months testing)
    • Simulate 60%, real-world 40% (proving ground, instrumented vehicles)
    • TÜV assessor will audit scenario catalog (ensure representative coverage)
  3. ODD Definition is Contractual

    • ODD = promise to customer ("we guarantee safety within these conditions")
    • Conservative ODD (highway only) easier to validate than ambitious (urban)
    • Document ODD exit strategy (what happens when vehicle leaves ODD?)
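The ODD exit question can be made concrete as a guard over the monitored conditions; a minimal sketch (condition names and limits are illustrative, not from the project's ODD definition):

```python
from dataclasses import dataclass

@dataclass
class OddState:
    """Conditions monitored at runtime (illustrative names and limits)."""
    road_type: str        # e.g. "highway", "urban"
    speed_kmh: float
    lane_confidence: float

def within_odd(s):
    """True while the vehicle remains inside the declared ODD."""
    return (s.road_type == "highway"
            and 60.0 <= s.speed_kmh <= 130.0
            and s.lane_confidence >= 0.5)

print(within_odd(OddState("highway", 100.0, 0.9)))   # → True
print(within_odd(OddState("urban", 100.0, 0.9)))     # → False (road type exit)
print(within_odd(OddState("highway", 100.0, 0.42)))  # → False (perception exit)
```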

For TensorRT Deployment

  1. Start Optimization Early (Month 10, Not Month 20)

    • Don't wait until integration phase to optimize model
    • Prototype TensorRT conversion at Month 10 (identify quantization issues early)
    • Budget 1 month for TensorRT optimization (kernel fusion, calibration)
  2. GPU Selection Matters

    • NVIDIA Jetson Orin: 254 TOPS (15ms latency) [PASS]
    • Alternative (Xavier): 30 TOPS (60ms latency) [FAIL] (doesn't meet 30ms requirement)
    • Don't cheap out on GPU (€2,500 vs €1,000, but 4× performance difference)
  3. C++ Integration (Not Python)

    • Production deployment: C++ only (Python too slow, 100ms overhead)
    • Use PyTorch C++ API (libtorch) or TensorRT C++ API
    • Budget 2 weeks for C++ integration (vs 1 week Python prototype)

Conclusion

ML-Enabled ADAS Development: Cutting Edge, Highly Regulated

  • [PASS] ASIL-B + SOTIF Certified: ISO 26262 + ISO 21448 compliance
  • [PASS] 95.2% IoU Accuracy: Exceeds 92% target
  • [PASS] 15ms Latency: 2× faster than 30ms requirement
  • [PASS] 98.7% Field Success: 5,000 km proving ground validation

Key Differences from Traditional ADAS (Chapters 25-26):

  • ML Perception: Non-deterministic, requires SOTIF (ISO 21448)
  • Dataset Management: 250,000 images, 20,000 hours annotation (vs 0 for traditional code)
  • MLE Process: New lifecycle (MLE.1-6) extends ASPICE (SWE.1-6)
  • Continuous Validation: Shadow mode (12 months) discovers 542 scenarios (vs one-time testing)

Message: Machine learning in safety-critical ADAS is achievable but challenging. ASPICE provides 40% foundation (traditional control code), but ML requires MLE process extension (dataset, training, SOTIF). ISO 21448 (SOTIF) is non-negotiable for ML safety argumentation. Dataset quality is king (50% of budget). TensorRT optimization is mandatory (3× speedup). Shadow mode field testing is essential (discover "Unknown Unsafe" scenarios). AI tools help (hyperparameter tuning, scenario generation), but core work (annotation, quantization) still manual.


Chapter 28 Complete: ML-enabled ADAS development demonstrates MLE + SOTIF + TensorRT deployment for production CNN.

Part V Complete: Industry applications across automotive, industrial, medical, ML domains show ASPICE + AI integration in practice.