5.1: MLE Process Application

MLE (Machine Learning Engineering) Process Overview

MLE Lifecycle for Safety-Critical ML

MLE Process: Extension of traditional software engineering for ML systems

6-Phase MLE Lifecycle (adapted for ASPICE context):

  1. MLE.1: ML Requirements Analysis
  2. MLE.2: Dataset Management
  3. MLE.3: Model Development
  4. MLE.4: Model Verification
  5. MLE.5: Model Deployment
  6. MLE.6: Model Monitoring

MLE.1: ML Requirements Analysis

Operational Design Domain (ODD)

ODD: Conditions under which ML model is designed to operate safely

IEC 62304 Analogy: Similar to "intended use" for medical devices

LKA ODD Definition:

Operational Design Domain (ODD) - Lane Keeping Assist
─────────────────────────────────────────────────────────

Geographic:
  - Road Type: Highway, rural roads (NOT urban city streets)
  - Lane Markings: Visible lane lines (white/yellow, solid/dashed)
  - Lane Width: 2.5 - 3.7 meters (standard lane widths)

Environmental:
  - Weather: Dry, light rain (NOT heavy rain, snow, fog)
  - Lighting: Daytime (6am-8pm), dusk (NOT night with poor visibility)
  - Visibility: ≥100 meters (NOT heavy fog, smoke)

Operational:
  - Speed: 60-130 km/h (NOT stop-and-go traffic, parking)
  - Curvature: Radius ≥150 meters (NOT sharp curves, roundabouts)
  - Traffic: Light-moderate (NOT construction zones, lane merges)

Exclusions (Out of ODD):
  [FAIL] Tunnel entrances (lighting transition)
  [FAIL] Faded/missing lane markings
  [FAIL] Snow-covered roads
  [FAIL] Unpaved roads
  [FAIL] Parking lots
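
The ODD table above can be turned into a runtime check. The sketch below is a minimal illustration only, not a production interface: the `OddState` fields and thresholds simply mirror the table, and how each quantity is estimated on the vehicle is out of scope here.

```python
from dataclasses import dataclass

@dataclass
class OddState:
    speed_kmh: float        # vehicle speed
    curve_radius_m: float   # current road curvature radius
    visibility_m: float     # estimated visibility range
    lane_width_m: float     # detected lane width

def in_odd(s: OddState) -> bool:
    """Return True only if every ODD condition from the table holds."""
    return (
        60.0 <= s.speed_kmh <= 130.0
        and s.curve_radius_m >= 150.0
        and s.visibility_m >= 100.0
        and 2.5 <= s.lane_width_m <= 3.7
    )

# Nominal highway driving -> inside ODD
print(in_odd(OddState(110, 900, 300, 3.5)))   # True
# Sharp curve (R < 150 m) -> outside ODD, LKA must hand back control
print(in_odd(OddState(80, 120, 300, 3.5)))    # False
```

In practice each field would come from a separate estimator (localization, map data, camera-based visibility estimation), and the exclusion cases (tunnels, snow, parking lots) would need dedicated detectors.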

Traceability: ODD → ML Requirements → Test Cases

Example Requirement Derivation:

[MLE-REQ-001] Lane Detection Accuracy within ODD

Description:
  The lane detection model shall achieve ≥92% IoU (Intersection over Union)
  when tested on images captured within the defined ODD.

Rationale:
  92% IoU ensures the lane line segmentation is accurate enough for steering
  control (lateral offset error ≤0.15 meters, acceptable for ASIL-B).

Acceptance Criteria:
  1. Test on 25,000 held-out images (sampled from ODD conditions)
  2. Calculate IoU for each image (true lane pixels ∩ predicted / union)
  3. Mean IoU ≥ 92% (95% confidence interval)
  4. Worst-case IoU ≥ 70% (no catastrophic failures)

Traceability:
  - Derived from: [SYS-REQ-LKA-003] "LKA shall keep vehicle in lane center ±0.2m"
  - Verified by: [TC-MLE-001-1] "Test set evaluation (25,000 images)"

Safety Class: ASIL-B (lane detection critical to LKA safety function)
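
The acceptance criteria of [MLE-REQ-001] can be sketched as follows; the function names are illustrative, not part of any specified test harness.

```python
import numpy as np

def iou(gt_mask: np.ndarray, pred_mask: np.ndarray) -> float:
    """Intersection over Union of two binary lane masks (criterion 2)."""
    gt = gt_mask.astype(bool)
    pred = pred_mask.astype(bool)
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        return 1.0  # both masks empty: trivially perfect agreement
    return float(np.logical_and(gt, pred).sum() / union)

def check_mle_req_001(per_image_ious: np.ndarray) -> bool:
    """Criteria 3 and 4: mean IoU >= 0.92 AND worst-case IoU >= 0.70."""
    return bool(per_image_ious.mean() >= 0.92 and per_image_ious.min() >= 0.70)

# Toy example: ground truth has 2 lane pixels, prediction finds 1 of them
gt = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [0, 0]])
print(iou(gt, pred))  # 0.5 (intersection 1 pixel / union 2 pixels)
```

A full evaluation would compute `iou` for each of the 25,000 held-out images and pass the resulting array to `check_mle_req_001`; the 95% confidence interval on the mean (criterion 3) would additionally require a bootstrap or normal-approximation step not shown here.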

[MLE-REQ-002] Inference Latency Requirement

Description:
  The lane detection model shall process camera frames with latency ≤30ms
  on NVIDIA Jetson AGX Orin target hardware.

Rationale:
  At 130 km/h (36 m/s), a 30 ms latency corresponds to 1.08 meters traveled.
  Acceptable for steering control (the PID controller can compensate for ~1 m of delay).

Acceptance Criteria:
  1. Measure end-to-end latency: Image capture → CNN inference → Output
  2. Average latency ≤ 25ms (target), max latency ≤ 30ms (requirement)
  3. Test under CPU load (other ECU tasks running)

Verification Method:
  - Benchmark on Jetson Orin (TensorRT optimized model)
  - 1,000 frames, log timestamps, calculate average, P95, and max latency

Traceability:
  - Derived from: [SYS-REQ-LKA-012] "LKA response time ≤100ms end-to-end"
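
The latency benchmark can be sketched as below. This is a generic measurement harness, not the project's actual benchmark script: `predict` stands in for the TensorRT-optimized model call and `frames` for a recorded camera sequence.

```python
import time
import numpy as np

def benchmark_latency(predict, frames, warmup=50):
    """Measure per-frame inference latency in milliseconds.

    A warm-up phase excludes one-time effects (kernel compilation,
    cache warming) from the measurement.
    """
    for frame in frames[:warmup]:
        predict(frame)

    latencies = []
    for frame in frames[warmup:]:
        t0 = time.perf_counter()
        predict(frame)
        latencies.append((time.perf_counter() - t0) * 1000.0)

    lat = np.array(latencies)
    return {
        "avg_ms": float(lat.mean()),            # target: <= 25 ms
        "p95_ms": float(np.percentile(lat, 95)),
        "max_ms": float(lat.max()),             # requirement: <= 30 ms
    }
```

Per the acceptance criteria, the run would be repeated under representative CPU load (other ECU tasks active), since contention typically shows up in the P95/max figures rather than the average.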

MLE.2: Dataset Management

Dataset Collection Strategy

Goal: 250,000 annotated images covering ODD conditions

Data Sources:

  1. Public Datasets (40%, 100,000 images):

    • TuSimple (USA highways, 6,408 images)
    • CULane (China urban/rural, 88,880 images)
    • BDD100K (USA diverse, 5,712 lane-marked images)
    • Advantage: Pre-labeled, diverse scenarios
    • Disadvantage: May not match our ODD (e.g., Chinese road markings differ from European ones)
  2. Proprietary Data Collection (60%, 150,000 images):

    • Method: 3 test vehicles, 500,000 km driven over 6 months
    • Routes: German Autobahn (40%), French highways (30%), Italian rural (30%)
    • Conditions: Dry (70%), light rain (20%), dusk (10%)
    • Sampling: Extract 1 frame per 10 meters (avoid redundant similar frames)
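
The distance-based sampling step could be sketched as follows, assuming cumulative odometry is available per frame (the function and parameter names are illustrative):

```python
def sample_by_distance(frames, odometer_m, step_m=10.0):
    """Keep roughly one frame per `step_m` meters of travel.

    `frames` is any frame sequence; `odometer_m[i]` is the cumulative
    distance in meters at which frame i was captured (assumed to come
    from vehicle odometry). Avoids redundant near-identical frames in
    slow traffic while keeping spatial coverage on fast stretches.
    """
    kept, next_at = [], 0.0
    for frame, dist in zip(frames, odometer_m):
        if dist >= next_at:
            kept.append(frame)
            next_at = dist + step_m
    return kept

# Frames captured every 2 m -> every 5th frame is kept
print(sample_by_distance(list(range(11)), [i * 2.0 for i in range(11)]))  # [0, 5, 10]
```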

Dataset Composition (ensuring ODD coverage):

| Condition | Images | % | Purpose |
|---|---|---|---|
| Daytime, dry, straight highway | 150,000 | 60% | Nominal operation (most common) |
| Daytime, light rain | 40,000 | 16% | Robustness (lane lines less visible) |
| Dusk lighting | 25,000 | 10% | Edge case (shadows, glare) |
| Curved roads (R≥150m) | 20,000 | 8% | ODD boundary (test curvature limit) |
| Faded lane markings | 10,000 | 4% | Corner case (near ODD exit condition) |
| Construction zones (with markings) | 5,000 | 2% | Rare scenario (temporary lane lines) |

Total: 250,000 images


Data Annotation Process

Tool: CVAT (Computer Vision Annotation Tool) - open-source, web-based

Annotation Task: Pixel-wise lane line segmentation (binary mask)

Annotation Guidelines (55-page manual):

## Lane Line Annotation Guidelines v2.3

### Objective
Create binary segmentation masks where:
  - White pixels (255): Lane line (left/right boundaries)
  - Black pixels (0): Everything else (road, sky, vehicles)

### Rules
1. **Lane Line Definition**: Paint road markings delineating lane boundaries
   - Solid white/yellow lines: Full width (typically 10-15 cm)
   - Dashed lines: Paint dashed segments only (not gaps)
   - Double lines: Paint both lines

2. **Occlusions**: If vehicle/shadow partially occludes lane line:
   - Paint visible portions only
   - Do NOT interpolate occluded segments (let model learn to handle occlusions)

3. **Faded Markings**: If lane line barely visible:
   - Paint what you can see (even if low contrast)
   - Quality control: 2nd annotator reviews faded cases

4. **Edge Cases**:
   - Construction zone temporary markings: Paint if visible
   - Road repairs (black patches over lines): Do NOT paint (line not visible)
   - Reflective markers (Botts' dots): Do NOT paint (not painted lines)

### Quality Control
- Each image reviewed by 2nd annotator
- Inter-annotator agreement target: ≥95% (IoU between 2 annotators)
- Disputed cases escalated to senior annotator

Annotation Metrics:

  • Time per Image: 4.8 minutes (average)
  • Total Annotation Effort: 250,000 images × 4.8 min = 20,000 hours
  • Cost: 20,000 hours × €15/hour = €300,000
  • Team: 10 annotators (2,000 hours each over 6 months)

Quality Control:

  • Inter-Annotator Agreement: 96.2% IoU (exceeds 95% target)
  • Re-annotation Rate: 8% (20,000 images re-annotated after QC review)
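
The inter-annotator agreement metric can be sketched as a mean IoU over pairs of masks for the same images (an illustrative helper, not the project's QC tooling):

```python
import numpy as np

def annotator_agreement(masks_a, masks_b):
    """Mean IoU between two annotators' binary masks for the same images.

    Images where both annotators painted nothing count as full agreement.
    Target per the guidelines: >= 0.95.
    """
    ious = []
    for a, b in zip(masks_a, masks_b):
        a, b = a.astype(bool), b.astype(bool)
        union = np.logical_or(a, b).sum()
        ious.append(1.0 if union == 0 else float(np.logical_and(a, b).sum() / union))
    return float(np.mean(ious))
```

Images falling below the threshold would be flagged for the senior-annotator escalation path described in the guidelines.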

Dataset Versioning (DVC)

Tool: DVC (Data Version Control) - Git-like for ML datasets

Why DVC?:

  • [FAIL] Git doesn't scale for 250,000 images (50 GB dataset; Git repositories degrade beyond ~1 GB)
  • [PASS] DVC tracks data in remote storage (S3), Git tracks metadata (hashes, versions)
  • [PASS] Reproducibility: Checkout dataset v1.0 → Train model → Reproduce results

DVC Workflow:

# Initialize DVC in Git repository
cd lane-detection-project
dvc init

# Add dataset to DVC (stores in S3, tracks hash in Git)
dvc add data/train_images/
dvc add data/train_annotations/

# Commit DVC metadata to Git (not actual images)
git add data/train_images.dvc data/train_annotations.dvc
git commit -m "Dataset v1.0: 250k images, 96% inter-annotator agreement"
git tag dataset-v1.0

# Push data to S3 (DVC remote storage)
dvc remote add -d s3-storage s3://lane-detection-datasets/
dvc push

# Later: Reproduce training with exact same dataset
git checkout dataset-v1.0
dvc pull  # Downloads data from S3
python train.py  # Uses dataset v1.0

Dataset Versioning History:

| Version | Date | Images | Changes | Model Trained |
|---|---|---|---|---|
| v0.1 | 2024-04 | 50,000 | Initial pilot dataset | Baseline (85% IoU) |
| v0.5 | 2024-07 | 150,000 | Added rain, dusk scenarios | Improved (91% IoU) |
| v1.0 | 2024-10 | 250,000 | Full ODD coverage, QC pass | Final (95.2% IoU) [PASS] |

Traceability: Model performance → Dataset version (reproducibility for safety assessment)

Dataset Lineage for Safety Assessment: TÜV assessors require complete dataset lineage: (1) Source provenance (public datasets vs proprietary), (2) Annotation methodology (guidelines, QC process), (3) Data splits (train/val/test), (4) Augmentation applied. Document this in the MLE.2 Dataset Management work product.


MLE.3: Model Development

Architecture Selection

Task: Pixel-wise lane line segmentation (semantic segmentation)

Candidate Architectures:

| Model | Params | Latency (Jetson Orin) | IoU (val) | Decision |
|---|---|---|---|---|
| U-Net | 31M | 45ms | 93.1% | [FAIL] Too slow (>30ms) |
| FCN-ResNet50 | 35M | 50ms | 92.8% | [FAIL] Too slow |
| DeepLabV3-MobileNetV2 | 5M | 18ms | 90.2% | [WARN] Fast but low accuracy |
| EfficientNet-Lite4 + DeepLabV3 | 12M | 25ms | 95.2% | [PASS] Selected (best accuracy/latency trade-off) |

Rationale: EfficientNet-Lite4 is optimized for mobile/embedded targets (Jetson Orin); DeepLabV3 provides a state-of-the-art segmentation head


Training Process

Hyperparameter Search (150 experiments tracked in MLflow):

Experiment Tracker: MLflow

Hyperparameters Tuned:

  • Learning rate: {1e-4, 5e-4, 1e-3, 5e-3}
  • Batch size: {16, 32, 64}
  • Optimizer: {Adam, AdamW, SGD with momentum}
  • Data augmentation: {Random brightness/contrast, Gaussian blur, synthetic rain}
  • Loss function: {Dice loss, Focal loss, Combo (Dice + Focal)}
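
As an illustration, the combo loss that won the search (0.5 Dice + 0.5 Focal) can be sketched in NumPy. This is a simplified stand-in for the PyTorch implementation actually used in training; the focal parameter `gamma=2.0` is an assumed common default, not taken from the experiment logs.

```python
import numpy as np

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss on predicted lane probabilities vs binary targets.

    Optimizes region overlap directly, which helps with the extreme
    class imbalance of thin lane lines against background.
    """
    inter = (probs * targets).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + targets.sum() + eps)

def focal_loss(probs, targets, gamma=2.0, eps=1e-6):
    """Focal loss: down-weights easy pixels, focuses on hard ones."""
    p = np.clip(probs, eps, 1.0 - eps)
    pt = np.where(targets == 1, p, 1.0 - p)  # probability of the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))

def combo_loss(probs, targets):
    """0.5 * Dice + 0.5 * Focal, as in the selected configuration."""
    return 0.5 * dice_loss(probs, targets) + 0.5 * focal_loss(probs, targets)
```

Confident correct predictions should score near zero, while confident wrong ones are heavily penalized by the focal term.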

Best Configuration (Experiment #127):

# MLflow experiment tracking
import mlflow
import mlflow.pytorch
import torch

mlflow.start_run(run_name="exp-127-efficientnet-deeplabv3")

config = {
    "architecture": "EfficientNet-Lite4 + DeepLabV3",
    "learning_rate": 5e-4,
    "batch_size": 32,
    "optimizer": "AdamW",
    "epochs": 200,
    "loss_function": "Combo (0.5 Dice + 0.5 Focal)",
    "data_augmentation": {
        "brightness": [-0.2, +0.2],
        "contrast": [0.8, 1.2],
        "gaussian_blur": 0.1,   # applied to 10% of images
        "synthetic_rain": 0.05  # applied to 5% of images
    }
}

mlflow.log_params(config)

# Training loop (200 epochs, ~120 hours on 8x A100 GPUs)
best_iou = 0.0
for epoch in range(config["epochs"]):
    train_loss, train_iou = train_one_epoch(model, train_loader)
    val_loss, val_iou = validate(model, val_loader)

    mlflow.log_metrics({
        "train_loss": train_loss,
        "train_iou": train_iou,
        "val_loss": val_loss,
        "val_iou": val_iou
    }, step=epoch)

    # Save best model (checkpoint) whenever validation IoU improves
    if val_iou > best_iou:
        best_iou = val_iou
        torch.save(model.state_dict(), "best_model.pth")
        mlflow.pytorch.log_model(model, "lane_detection_cnn")

mlflow.end_run()

Training Results (Experiment #127):

  • Final Validation IoU: 95.2%
  • Training Time: 120 hours (8x NVIDIA A100 GPUs)
  • Convergence: Epoch 180 (plateaued, early stopping at epoch 200)

MLE.4: Model Verification

Test Set Evaluation

Test Set: 25,000 images (held-out, never seen during training)

Evaluation Metrics:

import numpy as np
from sklearn.metrics import jaccard_score, precision_score, recall_score

# Load test set (25,000 images + ground truth masks)
test_images, test_masks = load_test_set("data/test/")

# Inference on test set
predictions = []
for image in test_images:
    pred_mask = model.predict(image)  # Output: binary mask (lane pixels)
    predictions.append(pred_mask)

# Flatten all masks into 1-D label vectors before scoring
y_true = np.concatenate([m.flatten() for m in test_masks])
y_pred = np.concatenate([p.flatten() for p in predictions])

# Calculate metrics
iou = jaccard_score(y_true, y_pred, average='binary')
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Test Set IoU: {iou:.3f}")  # 0.952 (95.2%)
print(f"Precision: {precision:.3f}")  # 0.968 (96.8%)
print(f"Recall: {recall:.3f}")  # 0.937 (93.7%)

Results:

  • IoU: 95.2% [PASS] (exceeds 92% requirement)
  • Precision: 96.8% (few false positives, predicted lane pixels are correct)
  • Recall: 93.7% (6.3% of actual lane pixels missed, acceptable)

Failure Analysis (worst 100 images, IoU <70%):

| Failure Mode | Count | IoU Range | Root Cause |
|---|---|---|---|
| Severe occlusion (truck blocks view) | 35 | 50-65% | Model can't see lane lines (expected failure) |
| Heavy shadows (trees, overpasses) | 28 | 55-70% | Low contrast, model confuses shadows with lanes |
| Construction zone (temporary yellow lines) | 22 | 60-68% | Yellow lines not in training data (dataset bias) |
| Worn markings (barely visible) | 15 | 58-70% | At ODD boundary (faded lanes), marginal detection |

Mitigation:

  • Occlusion: Temporal smoothing (use previous frames to interpolate)
  • Shadows: Data augmentation (add synthetic shadows during training)
  • Construction zones: Add 5,000 construction images to dataset v1.1 (future retraining)
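
The occlusion mitigation (temporal smoothing) could be sketched as an exponential moving average over per-frame lane probability maps. This is a simplified illustration under assumed parameters (`alpha=0.6` is not from the project); a production system would likely use model-based lane tracking instead.

```python
import numpy as np

class TemporalSmoother:
    """Exponential moving average over per-frame lane probability maps.

    Carries lane evidence from previous frames across short occlusions
    (e.g. a truck briefly blocking the lane line), at the cost of a
    small lag when the true lane geometry changes.
    """
    def __init__(self, alpha=0.6):
        self.alpha = alpha   # weight of the newest frame
        self.state = None

    def update(self, prob_map: np.ndarray) -> np.ndarray:
        if self.state is None:
            self.state = prob_map.astype(float)
        else:
            self.state = self.alpha * prob_map + (1.0 - self.alpha) * self.state
        return self.state
```

After a fully occluded frame (all zeros), the smoothed map still retains 40% of the previous frame's evidence, so a momentary dropout does not immediately erase the lane estimate.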

Failure Mode Categorization: Document failure modes by root cause category: (1) Dataset gaps (missing scenarios), (2) Model architecture limitations (receptive field too small), (3) Sensor limitations (camera dynamic range), (4) Labeling errors (annotation mistakes). This categorization guides targeted improvements.


Corner Case Testing (SOTIF)

Goal: Test model on 10,000 edge cases (ISO 21448 SOTIF requirement)

Corner Case Categories:

| Category | Scenarios | Purpose |
|---|---|---|
| ODD Boundary | Near-faded markings, max curvature (R=150m) | Test model at ODD limits |
| Rare Events | Roadkill on lane line, glare from wet road | Low-probability scenarios |
| Adversarial | Synthetic perturbations (add noise to image) | Robustness to attacks |
| Out-of-ODD | Snow, night, construction | Verify model degrades gracefully |

Example Corner Case Test: ODD Boundary (Faded Lane Markings)

# Test Case: TC-MLE-SOTIF-042
# Description: Faded lane markings (visibility ≈30%, near ODD exit)
import numpy as np

# Load faded lane test images (collected from rural roads, poor maintenance)
faded_test_images = load_images("data/corner_cases/faded_lanes/")  # 500 images

# Inference
confidence_scores = []
for image in faded_test_images:
    pred_mask = model.predict(image)
    confidence = calculate_confidence(pred_mask)  # 0.0-1.0 score
    confidence_scores.append(confidence)

# Acceptance Criteria:
# 1. Model outputs low confidence (<0.5) for faded lanes → Triggers LKA disable
# 2. No high-confidence false detections (precision ≥ 80%)

mean_confidence = np.mean(confidence_scores)  # 0.42 (low, as expected)
low_conf_rate = np.sum(np.array(confidence_scores) < 0.5) / len(confidence_scores)

assert mean_confidence < 0.6, f"Expected low confidence for faded lanes, got {mean_confidence}"
assert low_conf_rate > 0.7, f"Expected 70%+ low-confidence predictions, got {low_conf_rate}"

print("[PASS] Model correctly identifies faded lanes as low-confidence")

Result: 78% of faded lane images → low confidence (<0.5) → LKA disables (safe degradation) [PASS]


Summary

MLE Process Deliverables:

| MLE Phase | Work Product | Tool | ASPICE Mapping |
|---|---|---|---|
| MLE.1 Requirements | ML Requirements Spec (120 reqs) | Jama Connect | SWE.1 (extended for ML) |
| MLE.2 Dataset | Versioned dataset (250k images) | DVC, CVAT | - (ML-specific) |
| MLE.3 Development | Trained model (95.2% IoU) | PyTorch, MLflow | SWE.3 (model as "code") |
| MLE.4 Verification | Model Verification Report | Python (test scripts) | SWE.4 (extended for ML) |
| MLE.5 Deployment | TensorRT optimized model | NVIDIA TensorRT | SWE.5 Integration |
| MLE.6 Monitoring | Field performance dashboard | MLflow, Prometheus | - (post-market) |

AI Contribution to MLE:

  • Dataset annotation: 20,000 hours (manual, no AI replacement yet)
  • Hyperparameter tuning: Optuna (Bayesian optimization) saved 50% trial-and-error time
  • Model architecture selection: Literature review (ChatGPT-4 summarized 30 papers)

Next: SOTIF considerations for ML-based perception (28.02).