6.5: Machine Learning Engineering (MLE) Process Checklists


Overview

Machine Learning Engineering (MLE) processes (MLE.1-5) introduced in ASPICE 4.0 address the unique challenges of integrating ML models into safety-critical embedded systems. Unlike traditional software with deterministic behavior, ML models exhibit probabilistic outputs, dataset dependencies, and emergent behaviors that require specialized verification approaches.

This chapter provides comprehensive, actionable checklists for each MLE process, enabling development teams to systematically verify ML model quality, safety, and ASPICE compliance. Checklists cover:

  • MLE.1 (ML Requirements Analysis): Dataset requirements, performance metrics, safety constraints
  • MLE.2 (ML Architectural Design): Model architecture selection, inference constraints, safety mechanisms
  • MLE.3 (ML Model Training): Hyperparameter tuning, convergence criteria, overfitting prevention
  • MLE.4 (ML Model Validation): Accuracy metrics, adversarial robustness, edge case testing
  • MLE.5 (ML Model Deployment): Performance monitoring, drift detection, fallback strategies
  • Safety-Critical ML Checklist: ASIL/SIL allocation, failure modes, fault tolerance
  • Real Example: Autonomous vehicle perception model (object detection) checklist walkthrough
  • Tool Integration: CI/CD automation with MLflow, Weights & Biases, DVC

ASPICE Alignment: Checklists directly support MLE base practices (BP), output work products (13-series), and traceability requirements.


MLE.1: ML Requirements Analysis Checklist

Process Purpose (ASPICE 4.0): Establish ML-specific requirements including dataset characteristics, model performance targets, and operational constraints.

Checklist: MLE.1 Requirements Definition

1. Dataset Requirements

1.1 Dataset Scope and Representativeness

  • Data sources identified: Enumerate all data collection methods (e.g., vehicle sensors, simulation, public datasets, augmented data)
  • Operational Design Domain (ODD) coverage: Dataset represents all intended operating conditions (weather, lighting, road types, geographic regions)
  • Edge case inclusion: Rare but critical scenarios included (e.g., emergency vehicles, construction zones, pedestrians with unusual clothing)
  • Sample size justified: Statistical power analysis confirms dataset size sufficient for target accuracy (rule of thumb: 1000+ samples per class for image classification)
  • Class balance assessed: Imbalanced classes addressed via oversampling, undersampling, or weighted loss functions
  • Temporal coverage: Dataset spans multiple time periods to capture seasonal variations, sensor aging effects

Example:

# Dataset Requirements Specification (ASIL-C Pedestrian Detection)
dataset:
  name: "Urban_Pedestrian_Detection_v2.0"
  size: 250,000 images
  sources:
    - vehicle_camera: 180,000 images (real-world)
    - simulation: 50,000 images (CARLA simulator)
    - augmentation: 20,000 images (synthetic rain, fog, night)
  odd_coverage:
    weather: [clear, rain, fog, snow]
    lighting: [day, dusk, night, tunnel]
    regions: [urban_dense, suburban, highway, rural]
  class_distribution:
    pedestrian: 80,000 instances
    background: 170,000 instances
  edge_cases: 5,000 images (children, wheelchairs, occluded pedestrians, reflective clothing)

1.2 Dataset Quality Requirements

  • Annotation accuracy target: Inter-annotator agreement ≥ 95% (Cohen's kappa or IoU for bounding boxes)
  • Annotation guidelines documented: Clear criteria for edge cases (e.g., "Label partially occluded pedestrian if > 30% visible")
  • Annotation verification process: Random sample (10%) reviewed by independent annotators
  • Data integrity checks: Automated validation for corrupt images, missing labels, duplicate samples
  • Sensor calibration requirements: Camera intrinsic/extrinsic parameters verified, lens distortion corrected
  • Data provenance tracked: Metadata records collection time, location, sensor ID for traceability

Verification Checklist:

# Automated Dataset Quality Checks
import glob
import os

import cv2

def verify_dataset_quality(dataset_path):
    """
    MLE.1 Dataset Quality Verification Script
    Assumes a compute_phash() helper (e.g., wrapping the `imagehash` library).
    """
    checks = {
        "corrupt_images": 0,
        "missing_labels": 0,
        "duplicate_images": 0,
        "annotation_errors": 0
    }

    image_paths = glob.glob(f"{dataset_path}/images/*.jpg")

    # Check 1: Image integrity (cv2.imread returns None for unreadable files)
    for img_path in image_paths:
        img = cv2.imread(img_path)
        if img is None or img.size == 0:
            checks["corrupt_images"] += 1

    # Check 2: Label completeness
    for img_path in image_paths:
        label_path = img_path.replace("/images/", "/labels/").replace(".jpg", ".txt")
        if not os.path.exists(label_path):
            checks["missing_labels"] += 1

    # Check 3: Duplicate detection (perceptual hashing)
    image_hashes = {}
    for img_path in image_paths:
        img_hash = compute_phash(img_path)
        if img_hash in image_hashes:
            checks["duplicate_images"] += 1
        else:
            image_hashes[img_hash] = img_path

    # Check 4: Annotation plausibility (bounding box sanity checks)
    for label_path in glob.glob(f"{dataset_path}/labels/*.txt"):
        with open(label_path, 'r') as f:
            for line in f:
                # Format: class_id x_center y_center width height
                # (YOLO format, all values normalized to [0, 1])
                parts = line.strip().split()
                if len(parts) != 5:
                    checks["annotation_errors"] += 1
                    continue
                x, y, w, h = map(float, parts[1:5])
                if not (0 <= x <= 1 and 0 <= y <= 1 and 0 < w <= 1 and 0 < h <= 1):
                    checks["annotation_errors"] += 1

    # Report
    print("┌─────────────────────────────────────────────┐")
    print("│ MLE.1 Dataset Quality Verification Report   │")
    print("├─────────────────────────────────────────────┤")
    for check, count in checks.items():
        status = "[OK] PASS" if count == 0 else f"[FAIL] FAIL ({count} issues)"
        print(f"│ {check:30s}{status:13s} │")
    print("└─────────────────────────────────────────────┘")

    return all(count == 0 for count in checks.values())

1.3 Performance Requirements

  • Accuracy target specified: Quantitative metric (e.g., mAP ≥ 95% for object detection, F1 ≥ 0.98 for classification)
  • Latency constraint defined: Maximum inference time (e.g., ≤ 50ms for real-time perception)
  • Throughput requirement: Frames per second (e.g., 20 FPS for camera processing)
  • Resource constraints: Memory budget (≤ 2GB RAM), compute budget (≤ 10 GFLOPS on embedded GPU)
  • Robustness requirements: Performance degradation limits under noise, occlusion, adversarial perturbations
  • Failure rate target: Acceptable false positive/negative rates (e.g., false negative ≤ 0.1% for ASIL-D)

Example Requirements Table:

┌────────────────────────────────────────────────────────────────────────┐
│ MLE Requirements: Pedestrian Detection System (ASIL-C)                 │
├────────────────────────────────────────────────────────────────────────┤
│ Requirement ID │ Description                            │ Acceptance  │
│                │                                        │ Criteria    │
├────────────────┼────────────────────────────────────────┼─────────────┤
│ MLE-REQ-001    │ Detection accuracy (nominal)           │ mAP ≥ 95%   │
│ MLE-REQ-002    │ Detection accuracy (rain)              │ mAP ≥ 90%   │
│ MLE-REQ-003    │ Detection accuracy (night)             │ mAP ≥ 85%   │
│ MLE-REQ-004    │ False negative rate (ASIL-C)           │ ≤ 1%        │
│ MLE-REQ-005    │ Inference latency (max)                │ ≤ 50ms      │
│ MLE-REQ-006    │ Model size (embedded deployment)       │ ≤ 100MB     │
│ MLE-REQ-007    │ Memory usage (runtime)                 │ ≤ 2GB RAM   │
│ MLE-REQ-008    │ Adversarial robustness                 │ ≥ 80% acc   │
│                │ (FGSM, ε=0.01)                         │ under attack│
└────────────────────────────────────────────────────────────────────────┘

2. Traceability Requirements

MLE.1-BP6: Ensure bidirectional traceability between system requirements and ML requirements.

  • Parent requirements linked: Each ML requirement traces to system/software requirement
  • Rationale documented: Justification for each performance target (e.g., "mAP ≥ 95% required to meet ISO 26262 ASIL-C braking distance safety goal")
  • Stakeholder approval: Requirements reviewed by safety engineer, system architect, ML engineer

Traceability Matrix Template:

SYS-REQ-123 (Pedestrian detection for AEB)
    └─→ MLE-REQ-001 (Detection accuracy ≥ 95%)
    └─→ MLE-REQ-004 (False negative ≤ 1%)
    └─→ MLE-REQ-005 (Latency ≤ 50ms)

SYS-REQ-124 (Operate in degraded weather)
    └─→ MLE-REQ-002 (Rain performance ≥ 90%)
    └─→ MLE-REQ-003 (Night performance ≥ 85%)
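The bidirectional check behind a matrix like this can be automated. The sketch below uses hypothetical in-memory dicts as stand-ins for links that a real project would export from its requirements tool (e.g., via ReqIF); it flags ML requirements with no valid parent and system requirements not covered by any ML requirement:

```python
# Hypothetical trace data; real projects would load these links
# from a requirements-management tool export.
SYSTEM_REQS = {"SYS-REQ-123", "SYS-REQ-124"}
ML_REQS = {
    "MLE-REQ-001": "SYS-REQ-123",
    "MLE-REQ-004": "SYS-REQ-123",
    "MLE-REQ-005": "SYS-REQ-123",
    "MLE-REQ-002": "SYS-REQ-124",
    "MLE-REQ-003": "SYS-REQ-124",
}

def check_traceability(system_reqs, ml_reqs):
    """Return (orphaned ML requirements, uncovered system requirements)."""
    # Downward check: every ML requirement must trace to an existing parent
    orphans = [r for r, parent in ml_reqs.items() if parent not in system_reqs]
    # Upward check: every system requirement must be covered by >= 1 ML requirement
    covered = set(ml_reqs.values())
    uncovered = sorted(system_reqs - covered)
    return orphans, uncovered
```

Running this in CI on every requirements change keeps MLE.1-BP6 traceability evidence continuously up to date.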

MLE.2: ML Architectural Design Checklist

Process Purpose: Define ML model architecture, data pipeline, inference deployment strategy.

Checklist: MLE.2 Architecture Definition

1. Model Architecture Selection

1.1 Architecture Justification

  • Model family evaluated: Compared alternatives (e.g., YOLO, Faster R-CNN, RetinaNet for object detection)
  • Trade-off analysis documented: Accuracy vs. latency vs. model size
  • Benchmark results: Performance on representative dataset for each candidate architecture
  • Embedded compatibility: Architecture deployable on target hardware (e.g., NVIDIA Jetson, Intel Myriad, Qualcomm Snapdragon)
  • Quantization support: Model supports INT8 quantization for edge deployment (accuracy drop < 2%)

Example Architecture Selection Matrix:

┌────────────────────────────────────────────────────────────────────────┐
│ Model Architecture Comparison (Pedestrian Detection)                   │
├────────────────────┬─────────┬─────────┬──────────┬──────────┬─────────┤
│ Architecture       │ mAP (%) │ Latency │ Size (MB)│ GPU Mem  │ Selected│
│                    │         │ (ms)    │          │ (GB)     │         │
├────────────────────┼─────────┼─────────┼──────────┼──────────┼─────────┤
│ YOLOv8n (nano)     │ 89.2    │ 12      │ 6        │ 0.5      │         │
│ YOLOv8s (small)    │ 93.7    │ 18      │ 22       │ 1.2      │ [OK]    │
│ YOLOv8m (medium)   │ 96.1    │ 35      │ 52       │ 2.8      │         │
│ Faster R-CNN       │ 95.8    │ 78      │ 108      │ 3.5      │         │
│ RetinaNet          │ 94.5    │ 52      │ 145      │ 4.1      │         │
└────────────────────────────────────────────────────────────────────────┘

Selection Rationale:
- YOLOv8s is expected to meet the mAP ≥ 95% requirement (93.7% off-the-shelf baseline plus ~2% anticipated gain from fine-tuning on the project dataset)
- Latency 18ms is well under the 50ms requirement (nearly 3× safety margin)
- Size 22MB fits the 100MB embedded constraint
- GPU memory 1.2GB is within the 2GB budget

1.2 Architecture Components

  • Input preprocessing defined: Image normalization, resizing, augmentation strategy
  • Backbone network specified: Feature extractor (e.g., CSPDarknet, ResNet, EfficientNet)
  • Neck/head design: Detection head, classification head, regression head (bounding boxes)
  • Loss function defined: Multi-task loss (classification + localization + objectness)
  • Anchor strategy: Anchor-based vs. anchor-free detection (justify choice)
  • Post-processing: Non-maximum suppression (NMS) thresholds, confidence filtering
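The NMS post-processing step above can be sketched as a minimal greedy implementation. This is illustrative only (production pipelines would use an optimized library kernel such as torchvision's NMS), and the 0.45/0.25 thresholds are assumed example values:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_threshold=0.45, conf_threshold=0.25):
    """Greedy NMS over (x1, y1, x2, y2, confidence) tuples."""
    kept = []
    # Confidence filtering first, then process highest-confidence boxes first
    candidates = sorted((d for d in detections if d[4] >= conf_threshold),
                        key=lambda d: d[4], reverse=True)
    for det in candidates:
        # Keep a box only if it does not overlap an already-kept box too much
        if all(iou(det[:4], k[:4]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```

Both thresholds are tunable architecture parameters and should be documented in the MLE.2 design record, since they directly trade recall against duplicate detections.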

2. Data Pipeline Architecture

2.1 Training Pipeline

  • Data loading strategy: Efficient data loaders (e.g., PyTorch DataLoader with num_workers tuning)
  • Data augmentation: Random crops, flips, color jitter, mixup/cutout (specify hyperparameters)
  • Batch size selection: Justified based on GPU memory and convergence stability
  • Distributed training: Multi-GPU strategy if needed (DataParallel, DistributedDataParallel)
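The mixup augmentation mentioned above can be sketched in a few lines. This is a minimal numpy sketch, not the full data-loader integration; `alpha=0.2` is an illustrative hyperparameter, and a real pipeline would apply the blend inside the loader's batch-collation step:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two samples with lam ~ Beta(alpha, alpha), per the mixup recipe."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    # Convex combination of inputs and (soft) labels
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

The chosen `alpha` (and any cutout patch sizes) belongs in the training configuration so the augmentation policy is reproducible and reviewable.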

2.2 Inference Pipeline

  • Preprocessing optimization: Fast resize/normalization (OpenCV, TensorRT optimizations)
  • Batch inference: Support for batching multiple frames (if applicable)
  • Output format: Bounding boxes in vehicle coordinate system (camera → world transform)
  • Latency profiling: End-to-end latency measured (camera capture → detection output)

3. Safety Mechanisms

MLE.2 Safety Architecture Requirements (ASIL-Dependent):

  • Confidence thresholding: Reject low-confidence predictions (e.g., confidence < 0.7)
  • Plausibility checks: Sanity checks on model outputs (e.g., pedestrian cannot be 10m tall)
  • Multi-sensor fusion: Combine camera with radar/lidar for redundancy (ASIL-C/D)
  • Fallback strategy: Degraded mode if ML model fails (e.g., fall back to radar-only detection)
  • Watchdog monitoring: Detect inference timeouts or crashes
  • Output diversity: Ensemble of multiple models for critical decisions (ASIL-D)

Safety Mechanism Example:

def safe_inference_wrapper(image, model, camera_calibration, confidence_threshold=0.7):
    """
    MLE.2 Safety Wrapper for Inference
    Implements plausibility checks and fallback logic.
    Assumes project-specific helpers: timeout() context manager,
    fallback_radar_detection(), bbox_to_real_world_height(),
    log_error(), log_warning().
    """
    # Timeout watchdog (50ms max per requirement MLE-REQ-005)
    try:
        with timeout(0.05):  # 50ms timeout
            detections = model.predict(image)
    except TimeoutError:
        log_error("Inference timeout - activating fallback")
        return fallback_radar_detection()  # Fallback to radar

    # Plausibility checks
    valid_detections = []
    for det in detections:
        # Check 1: Confidence threshold
        if det.confidence < confidence_threshold:
            continue

        # Check 2: Bounding box sanity (typical pedestrian height 1.0-2.5m;
        # reject anything outside a wider 0.5-3.0m tolerance band)
        height_m = bbox_to_real_world_height(det.bbox, camera_calibration)
        if not (0.5 < height_m < 3.0):
            log_warning(f"Implausible pedestrian height: {height_m}m, rejecting detection")
            continue

        # Check 3: Velocity plausibility (typical pedestrian speed < 15 km/h;
        # reject above a 20 km/h tolerance threshold)
        if det.velocity_kmh > 20:
            log_warning(f"Implausible pedestrian velocity: {det.velocity_kmh} km/h")
            continue

        valid_detections.append(det)

    return valid_detections

MLE.3: ML Model Training Checklist

Process Purpose: Train ML model according to architectural design, validate convergence, prevent overfitting.

Checklist: MLE.3 Training Process

1. Training Configuration

1.1 Hyperparameter Selection

  • Learning rate schedule: Initial LR, warmup, decay strategy (cosine, step, exponential)
  • Optimizer choice: Justified (SGD, Adam, AdamW) with momentum/weight decay settings
  • Batch size: Documented and justified (larger batch → stable gradients, smaller → better generalization)
  • Regularization: Dropout, weight decay, label smoothing (if applicable)
  • Training epochs: Maximum epochs defined, early stopping criteria specified
  • Loss weighting: Multi-task loss component weights (e.g., classification:localization = 1.0:2.0)

Example Training Configuration:

# MLE.3 Training Configuration (Pedestrian Detection YOLOv8s)
training:
  epochs: 300
  batch_size: 32
  optimizer: AdamW
  learning_rate:
    initial: 0.001
    warmup_epochs: 5
    schedule: cosine
    min_lr: 0.00001
  regularization:
    weight_decay: 0.0005
    dropout: 0.1
  loss_weights:
    box_loss: 7.5       # Bounding box regression
    cls_loss: 0.5       # Classification
    dfl_loss: 1.5       # Distribution focal loss
  early_stopping:
    patience: 50        # Stop if no improvement for 50 epochs
    metric: val_mAP
    min_delta: 0.001

1.2 Training Data Management

  • Train/validation/test split: Ratios defined (typical: 70%/15%/15%), stratified by class
  • Cross-validation strategy: K-fold CV for small datasets (k=5 typical), hold-out set for large datasets
  • Data versioning: Dataset version tracked (DVC, Git LFS, MLflow)
  • Reproducibility: Random seeds fixed, hardware configuration documented
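The split and reproducibility bullets above can be combined into one deterministic helper. A minimal sketch, assuming class labels are hashable and using illustrative defaults (`seed=42`, 70/15/15 ratios); real projects would also persist the resulting sample lists under data versioning:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Deterministic train/val/test split, stratified by class label."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)  # fixed seed for reproducibility (MLE.3)
    train, val, test = [], [], []
    # Sorted iteration keeps the output independent of dict insertion order
    for label, items in sorted(by_class.items()):
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Because the seed and ratios are explicit arguments, the exact split can be regenerated from the experiment metadata alone.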

2. Convergence Monitoring

2.1 Training Metrics Tracking

  • Loss curves: Training loss and validation loss monitored (divergence indicates overfitting)
  • Accuracy metrics: mAP, F1, precision/recall tracked per epoch
  • Convergence criteria: Training converged (validation loss plateau for ≥ 50 epochs)
  • Gradient monitoring: Gradient norms tracked (vanishing/exploding gradients detected)
  • Learning rate schedule effectiveness: LR adjustments correlated with loss improvements

Convergence Validation Script:

import numpy as np

def validate_training_convergence(training_log):
    """
    MLE.3 Convergence Validation
    """
    checks = {}

    # Check 1: Training loss reduced by at least 50% from its initial value
    train_loss = training_log["train_loss"]
    if train_loss[-1] > train_loss[0] * 0.5:
        checks["loss_reduction"] = "[FAIL] FAIL: Training loss did not reduce significantly"
    else:
        checks["loss_reduction"] = "[OK] PASS"

    # Check 2: Validation loss stable (not increasing)
    val_loss = training_log["val_loss"]
    # Compare the mean of the last 20 epochs to a 20-epoch window around the midpoint
    mid = len(val_loss) // 2
    mid_val_loss = np.mean(val_loss[mid - 10:mid + 10])
    final_val_loss = np.mean(val_loss[-20:])
    if final_val_loss > mid_val_loss * 1.1:
        checks["overfitting"] = "[WARN] WARNING: Validation loss increasing (overfitting)"
    else:
        checks["overfitting"] = "[OK] PASS"

    # Check 3: Validation metric meets target
    final_mAP = training_log["val_mAP"][-1]
    if final_mAP < 0.95:  # MLE-REQ-001 target
        checks["accuracy_target"] = f"[FAIL] FAIL: mAP {final_mAP:.2%} < 95% target"
    else:
        checks["accuracy_target"] = f"[OK] PASS: mAP {final_mAP:.2%}"

    # Check 4: No gradient pathologies
    grad_norms = training_log["gradient_norm"]
    if max(grad_norms) > 1000 or min(grad_norms) < 1e-6:
        checks["gradient_health"] = "[WARN] WARNING: Gradient explosion/vanishing detected"
    else:
        checks["gradient_health"] = "[OK] PASS"

    return checks

2.2 Overfitting Prevention

  • Train/val gap monitored: Validation accuracy within 5% of training accuracy
  • Regularization effectiveness: Dropout/weight decay reducing overfitting
  • Data augmentation sufficiency: Augmentation diversity validated (visual inspection)
  • Model capacity appropriate: Not over-parameterized for dataset size (rule: 10× samples per parameter)
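The train/val gap criterion above can be encoded as a simple automated gate. A sketch, assuming accuracies in [0, 1]; `max_gap=0.05` reflects the 5-percentage-point rule of thumb stated above:

```python
def check_generalization_gap(train_acc, val_acc, max_gap=0.05):
    """Flag overfitting when validation accuracy trails training accuracy too far."""
    gap = train_acc - val_acc
    return {"gap": gap, "pass": gap <= max_gap}
```

Wired into the training loop, this check can trigger early stopping or a regularization review before a model is promoted to MLE.4 validation.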

3. Experiment Tracking

MLE.3-BP5: Document all training experiments for reproducibility and traceability.

  • Experiment metadata logged: Model version, dataset version, hyperparameters, hardware config
  • Artifact storage: Trained model checkpoints saved (every N epochs + best model)
  • Metrics logged: All training/validation metrics stored (MLflow, W&B, TensorBoard)
  • Code version tracked: Git commit SHA of training code recorded

Example: MLflow Integration

from datetime import datetime

import mlflow
import mlflow.pytorch

def train_with_mlflow(config, dataset):
    """
    MLE.3 Training with MLflow Experiment Tracking
    """
    # Start MLflow run
    with mlflow.start_run(run_name=f"yolov8s_pedestrian_{datetime.now().isoformat()}"):
        # Log hyperparameters
        mlflow.log_params({
            "model": "YOLOv8s",
            "dataset_version": dataset.version,
            "batch_size": config.batch_size,
            "learning_rate": config.learning_rate,
            "optimizer": config.optimizer,
            "epochs": config.epochs
        })

        # Training loop (build_model, train_one_epoch, validate are
        # project-specific helpers)
        model = build_model(config)
        best_mAP = 0.0
        for epoch in range(config.epochs):
            train_loss = train_one_epoch(model, dataset.train_loader)
            val_metrics = validate(model, dataset.val_loader)

            # Log metrics
            mlflow.log_metrics({
                "train_loss": train_loss,
                "val_loss": val_metrics["loss"],
                "val_mAP": val_metrics["mAP"],
                "val_precision": val_metrics["precision"],
                "val_recall": val_metrics["recall"]
            }, step=epoch)

            # Save checkpoint
            if val_metrics["mAP"] > best_mAP:
                best_mAP = val_metrics["mAP"]
                mlflow.pytorch.log_model(model, "best_model")

        # Log final artifacts
        mlflow.log_artifact("training_curves.png")
        mlflow.log_artifact("confusion_matrix.png")

        return model

MLE.4: ML Model Validation Checklist

Process Purpose: Verify trained model meets performance, robustness, and safety requirements.

Checklist: MLE.4 Validation Process

1. Accuracy and Performance Validation

1.1 Test Set Evaluation

  • Test set independent: No overlap with train/validation sets (verified via data provenance)
  • Test set representative: Covers all ODD scenarios (weather, lighting, edge cases)
  • Quantitative metrics computed: mAP, precision, recall, F1, confusion matrix
  • Per-class performance: Metrics broken down by scenario (day vs. night, clear vs. rain)
  • Acceptance criteria met: All MLE.1 requirements satisfied (e.g., mAP ≥ 95%)

Test Report Template:

┌────────────────────────────────────────────────────────────────────────┐
│ MLE.4 Validation Report: Pedestrian Detection Model v2.1               │
│ Test Date: 2026-01-04                                                  │
│ Test Set: Urban_Pedestrian_Test_v2.0 (25,000 images, unseen data)      │
├────────────────────────────┬──────────┬──────────────┬─────────────────┤
│ Metric                     │ Value    │ Requirement  │ Status          │
├────────────────────────────┼──────────┼──────────────┼─────────────────┤
│ Overall mAP                │ 95.8%    │ ≥ 95%        │ [OK] PASS       │
│ Precision                  │ 94.2%    │ N/A          │ [OK] INFO       │
│ Recall                     │ 97.1%    │ N/A          │ [OK] INFO       │
│ F1 Score                   │ 95.6%    │ N/A          │ [OK] INFO       │
│ False Negative Rate        │ 0.8%     │ ≤ 1%         │ [OK] PASS       │
├────────────────────────────┴──────────┴──────────────┴─────────────────┤
│ Scenario-Specific Performance:                                         │
│ - Clear day:     mAP 97.2% ([OK] exceeds 95%)                          │
│ - Rain:          mAP 92.4% ([OK] exceeds 90% degraded requirement)     │
│ - Night:         mAP 88.1% ([OK] exceeds 85% night requirement)        │
│ - Edge cases:    mAP 79.3% ([WARN] below 80% target, needs review)     │
├────────────────────────────────────────────────────────────────────────┤
│ Inference Performance:                                                 │
│ - Latency (mean):  18.2ms ([OK] < 50ms requirement)                    │
│ - Latency (95th):  23.1ms ([OK] < 50ms requirement)                    │
│ - Throughput:      54.9 FPS ([OK] exceeds 20 FPS requirement)          │
│ - Memory usage:    1.8 GB ([OK] < 2 GB requirement)                    │
└────────────────────────────────────────────────────────────────────────┘

RECOMMENDATION: APPROVE with caveat - improve edge case performance in next iteration

1.2 Latency and Resource Validation

  • Inference latency measured: Mean, median, 95th percentile, max latency on target hardware
  • Latency requirement met: All percentiles < target (e.g., 95th < 50ms)
  • Throughput validated: FPS meets real-time constraint
  • Memory profiling: Peak RAM usage within budget
  • GPU utilization: Compute efficiency measured (GFLOPS, % GPU utilization)
  • Model size confirmed: Checkpoint file size ≤ target (e.g., 100MB)
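A latency percentile measurement along these lines can be sketched with a generic harness. Here `infer_fn` is a stand-in for the deployed model's predict call, and the warm-up count is an illustrative default; meaningful numbers must of course come from runs on the target hardware:

```python
import statistics
import time

def benchmark_latency(infer_fn, inputs, warmup=10):
    """Report per-call latency in milliseconds: mean, median, p95, max."""
    for x in inputs[:warmup]:
        infer_fn(x)  # warm-up calls (caches, JIT, clocks) excluded from stats
    samples = []
    for x in inputs:
        t0 = time.perf_counter()
        infer_fn(x)
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }
```

Comparing the returned p95 and max against the requirement (e.g., MLE-REQ-005's 50ms) gives a pass/fail record for the validation report.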

2. Robustness and Adversarial Testing

2.1 Noise Robustness

  • Gaussian noise: Performance degradation < 10% under σ=0.05 noise
  • Salt-and-pepper noise: Robust to sensor artifacts
  • JPEG compression: Accuracy stable under quality=50-90 compression
  • Motion blur: Simulated vehicle motion (blur kernel validation)
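A noise-robustness sweep like the Gaussian check above can be sketched as follows. This is a sketch: `eval_fn` is a stand-in for the project's accuracy evaluation, and images are assumed normalized to [0, 1]:

```python
import numpy as np

def noise_robustness(eval_fn, images, sigmas=(0.0, 0.05, 0.1), seed=0):
    """Evaluate accuracy under additive Gaussian noise; report degradation vs clean."""
    rng = np.random.default_rng(seed)
    results = {}
    for sigma in sigmas:
        # Add zero-mean Gaussian noise and clip back to the valid pixel range
        noisy = [np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)
                 for img in images]
        results[sigma] = eval_fn(noisy)
    clean = results[sigmas[0]]
    degradation = {sigma: clean - acc for sigma, acc in results.items()}
    return results, degradation
```

The degradation at σ=0.05 can then be checked against the "< 10%" criterion in the bullet list above.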

2.2 Adversarial Robustness (ASIL-C/D Requirement)

  • FGSM attack: Accuracy under ε=0.01 perturbation ≥ 80% (MLE-REQ-008)
  • PGD attack: Multi-step attack resistance validated
  • Physical perturbations: Validated against adversarial patches (if applicable to use case)
  • Certified robustness: Formal verification for critical scenarios (optional, research-grade tools)

Adversarial Test Script:

import torch
from torchattacks import FGSM, PGD

def adversarial_robustness_test(model, test_loader, epsilon=0.01):
    """
    MLE.4 Adversarial Robustness Validation
    Illustrated with classification-style accuracy for simplicity;
    evaluate_model() is a project-specific helper.
    """
    model.eval()

    # Clean accuracy baseline
    clean_acc = evaluate_model(model, test_loader)

    # FGSM attack (single-step)
    fgsm_attack = FGSM(model, eps=epsilon)
    fgsm_correct = 0
    for images, labels in test_loader:
        adv_images = fgsm_attack(images, labels)
        with torch.no_grad():
            preds = model(adv_images)
        fgsm_correct += (preds.argmax(1) == labels).sum().item()
    fgsm_acc = fgsm_correct / len(test_loader.dataset)

    # PGD attack (stronger, multi-step)
    pgd_attack = PGD(model, eps=epsilon, alpha=epsilon / 4, steps=10)
    pgd_correct = 0
    for images, labels in test_loader:
        adv_images = pgd_attack(images, labels)
        with torch.no_grad():
            preds = model(adv_images)
        pgd_correct += (preds.argmax(1) == labels).sum().item()
    pgd_acc = pgd_correct / len(test_loader.dataset)

    print(f"Clean Accuracy: {clean_acc:.2%}")
    print(f"FGSM Accuracy (ε={epsilon}): {fgsm_acc:.2%}")
    print(f"PGD Accuracy (ε={epsilon}): {pgd_acc:.2%}")

    # Requirement check (MLE-REQ-008: ≥ 80% under FGSM ε=0.01)
    if fgsm_acc >= 0.80:
        print("[OK] PASS: Adversarial robustness requirement met")
        return True
    else:
        print("[FAIL] FAIL: Adversarial robustness below 80% threshold")
        return False

3. Edge Case and Failure Mode Analysis

3.1 Edge Case Testing

  • Rare scenarios validated: Model tested on edge case subset (5,000 images in example dataset)
  • Failure analysis: Top-K failure cases reviewed (manual inspection of false negatives/positives)
  • Root cause documented: Why did model fail? (occlusion, lighting, novel object appearance)
  • Mitigation strategies: Recommendations for dataset augmentation or architecture improvements

3.2 Out-of-Distribution (OOD) Detection

  • OOD test set: Validate on data outside training distribution (e.g., different geographic region)
  • Confidence calibration: Model confidence scores correlate with accuracy (calibration plot)
  • Uncertainty quantification: Epistemic uncertainty estimated (Bayesian methods, ensemble variance)
  • Rejection threshold: Low-confidence predictions flagged for human review
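Confidence calibration can be quantified with Expected Calibration Error (ECE): the weighted gap between mean confidence and observed accuracy per confidence bin. A minimal sketch over per-prediction confidences and correctness flags (10 bins is a common default):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins in (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |mean confidence - accuracy| in this bin, weighted by bin population
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```

An ECE near zero supports using raw confidence scores for the rejection threshold above; a large ECE argues for recalibration (e.g., temperature scaling) before deployment.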

MLE.5: ML Model Deployment Checklist

Process Purpose: Deploy trained model to production, monitor performance, detect drift.

Checklist: MLE.5 Deployment Process

1. Model Deployment Preparation

1.1 Model Optimization

  • Quantization: Model converted to INT8 (TensorRT, ONNX Runtime) with < 2% accuracy drop
  • Pruning: Optional - remove redundant weights (if applicable)
  • Compilation: Model compiled for target hardware (TensorRT engine, OpenVINO IR, Core ML)
  • Benchmark on target: Latency re-validated on embedded platform (NVIDIA Jetson, Intel Myriad)

Deployment Optimization Example:

import tensorrt as trt

def optimize_for_deployment(onnx_model_path, target="jetson_xavier"):
    """
    MLE.5 Model Optimization for Embedded Deployment
    Assumes project-specific helpers create_calibrator() and
    validate_quantized_model(), plus calibration_dataset and
    test_dataset in scope.
    """
    # Step 1: PyTorch → ONNX export (already done upstream via torch.onnx.export)

    # Step 2: ONNX → TensorRT (INT8 quantization)
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_model_path, 'rb') as f:
        if not parser.parse(f.read()):
            raise RuntimeError(f"ONNX parse failed: {parser.get_error(0)}")

    # Configure INT8 quantization
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = create_calibrator(calibration_dataset)  # Uses subset of training data

    # Build serialized engine
    engine = builder.build_serialized_network(network, config)

    # Save optimized engine
    with open(f"model_{target}_int8.trt", 'wb') as f:
        f.write(engine)

    # Validate accuracy post-quantization
    accuracy_drop = validate_quantized_model(engine, test_dataset)
    assert accuracy_drop < 0.02, f"Quantization accuracy drop {accuracy_drop:.2%} exceeds 2% limit"

    return engine

1.2 Integration Testing

  • End-to-end system test: Model integrated with perception pipeline (sensor input → model → vehicle control)
  • Hardware-in-loop (HiL) testing: Deployed on target ECU, tested in vehicle simulator
  • Fault injection: Sensor failures, communication errors, model timeout scenarios tested
  • Safety mechanism validation: Confidence thresholding, plausibility checks working as designed

2. Performance Monitoring (Runtime)

2.1 Monitoring Infrastructure

  • Inference metrics logged: Latency, throughput, confidence scores per frame
  • Resource usage tracked: CPU/GPU utilization, memory consumption, power draw
  • Error rate monitoring: False positive/negative rates (if ground truth available)
  • Alerting configured: Alerts for latency spikes, accuracy degradation, crashes

Monitoring Dashboard (Prometheus + Grafana):

# Prometheus metrics for ML model monitoring
metrics:
  - name: inference_latency_ms
    type: histogram
    buckets: [10, 20, 30, 50, 100]
    labels: [model_version, hardware]

  - name: inference_throughput_fps
    type: gauge
    labels: [model_version]

  - name: detection_confidence
    type: histogram
    buckets: [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
    labels: [model_version, scenario]

  - name: model_error_rate
    type: counter
    labels: [error_type]  # timeout, crash, low_confidence

alerts:
  - name: HighLatency
    condition: inference_latency_ms_p95 > 50
    severity: critical
    message: "Model inference latency exceeds 50ms requirement"

  - name: LowConfidence
    condition: avg(detection_confidence) < 0.7 over 5min
    severity: warning
    message: "Model confidence degraded - possible distribution shift"

2.2 Drift Detection

  • Input drift monitoring: Statistical tests (KS test, MMD) detect changes in input distribution
  • Output drift monitoring: Prediction distribution shifts tracked
  • Performance drift: Accuracy degrades over time (if ground truth labels available)
  • Retraining trigger: Automated retraining when drift detected (threshold: > 5% accuracy drop)

Drift Detection Example:

from scipy.stats import ks_2samp

def detect_input_drift(reference_data, production_data, threshold=0.05):
    """
    MLE.5 Input Drift Detection (Kolmogorov-Smirnov Test)
    """
    # Compare distributions of image statistics (e.g., mean pixel intensity)
    ref_means = [img.mean() for img in reference_data]
    prod_means = [img.mean() for img in production_data]

    # KS test
    statistic, p_value = ks_2samp(ref_means, prod_means)

    if p_value < threshold:
        print(f"[WARN] INPUT DRIFT DETECTED: p-value={p_value:.4f} (distributions differ significantly)")
        trigger_retraining_pipeline()  # project-specific hook into the MLOps pipeline
        return True
    else:
        print(f"[OK] No drift: p-value={p_value:.4f}")
        return False

3. Fallback and Safe State Management

3.1 Degraded Mode Operation

  • Fallback sensors: If camera ML fails, system falls back to radar/lidar
  • Reduced ODD: Limit operating conditions (e.g., disable in heavy rain if performance < threshold)
  • Driver warnings: Inform driver of degraded perception capability
  • Safe stop: Vehicle can execute safe stop maneuver if critical perception lost

3.2 Model Update Strategy

  • A/B testing: New model version deployed to subset of fleet (10%) before full rollout
  • Canary deployment: Gradual rollout with performance monitoring (rollback if issues)
  • Versioning: Model version tracked in production (Git tag, MLflow model registry)
  • Rollback plan: Procedure to revert to previous model version (max rollback time: 1 hour)

Safety-Critical ML Checklist (ASIL/SIL-Specific)

Additional Requirements for ASIL-C / ASIL-D Models

1. Hazard Analysis and Risk Assessment

  • HARA completed: Hazards associated with ML model failures identified (ISO 26262-3)
  • ASIL assigned: Model components classified (e.g., pedestrian detection = ASIL-C)
  • Safety goals derived: Quantitative safety targets (e.g., false negative ≤ 0.1% for ASIL-D)

2. Fault Tolerance and Redundancy

  • Redundant sensors: Multi-modal perception (camera + radar + lidar) for ASIL-D
  • Model diversity: Ensemble of different architectures (mitigate correlated failures)
  • Voting logic: Majority voting for safety-critical decisions
  • Graceful degradation: System remains safe with single-point failures
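The voting logic above might look like the following sketch for a pedestrian-detection ensemble. The class labels and the fail-safe tie-breaking rule are assumptions for illustration; a real system would define these in its safety concept:

```python
from collections import Counter

def majority_vote(decisions):
    """N-channel majority voting over redundant perception channels.

    Each element is one channel's classification of the same object
    (e.g., from the camera-ML, radar, and lidar pipelines). When no
    strict majority exists, resolve to the hazardous class so a split
    vote still triggers braking (fail-safe bias).
    """
    counts = Counter(decisions)
    top, top_count = counts.most_common(1)[0]
    if top_count > len(decisions) / 2:
        return top
    return "PEDESTRIAN"  # no majority: assume the hazardous class
```

Biasing ties toward the hazardous class trades availability (more spurious braking) for safety, a choice that should itself be justified in the HARA.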

3. Verification and Validation (V&V)

  • Independent V&V: Model validated by team not involved in development (ASIL-C/D)
  • Formal verification: Critical components verified (e.g., input sanitization, output bounds)
  • Back-to-back testing: Compare ML model output with physics-based fallback (e.g., radar)
  • Field testing: Validation on real vehicle (X million km driven for ASIL-D)
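Back-to-back testing can start as a per-object plausibility comparison. A simplified illustration, assuming each pipeline reports a range-to-pedestrian estimate; the function name and 2 m tolerance are hypothetical:

```python
def back_to_back_check(ml_range_m: float, radar_range_m: float,
                       tolerance_m: float = 2.0) -> bool:
    """Flag disagreement between the ML camera pipeline and the
    physics-based radar fallback (V&V back-to-back testing)."""
    return abs(ml_range_m - radar_range_m) <= tolerance_m
```

Disagreements beyond the tolerance are logged for review; a persistent pattern of disagreement is itself evidence of a model or sensor fault.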

4. Tool Qualification (DO-330 / ISO 26262-8)

  • Training framework qualified: TensorFlow/PyTorch tool confidence level (TCL) assessed
  • Inference toolchain qualified: TensorRT / ONNX Runtime assessed for safety use
  • Testing tools qualified: Coverage tools, test harness validated

Real-World Example: Autonomous Vehicle Perception Model

Case Study: ASIL-C Pedestrian Detection System

System Overview:

  • Function: Detect pedestrians for Autonomous Emergency Braking (AEB)
  • ASIL Level: ASIL-C (inadvertent collision hazard)
  • Target Platform: NVIDIA Jetson AGX Xavier (embedded GPU)
  • Model: YOLOv8s (22MB, 18ms latency, 95.8% mAP)

Full Checklist Walkthrough (Abbreviated):

MLE.1: Requirements

  • [OK] Dataset: 250K images (urban pedestrians, weather variations)
  • [OK] Performance: mAP ≥ 95%, latency ≤ 50ms, false negative ≤ 1%
  • [OK] Traceability: SYS-REQ-123 → MLE-REQ-001/004/005

MLE.2: Architecture

  • [OK] Model: YOLOv8s (selected via benchmarking)
  • [OK] Safety mechanisms: Confidence thresholding (0.7), plausibility checks (height, velocity)
  • [OK] Fallback: Radar-only detection if camera ML fails

MLE.3: Training

  • [OK] Hyperparameters: AdamW, LR 0.001→0.00001 cosine, 300 epochs
  • [OK] Convergence: Val loss plateaued at epoch 280, mAP 95.8%
  • [OK] Experiment tracking: MLflow logged all runs

MLE.4: Validation

  • [OK] Test set: 25K independent images (mAP 95.8%, latency 18ms)
  • [OK] Adversarial: FGSM ε=0.01 accuracy 82.1% (exceeds 80% requirement)
  • [WARN] Edge cases: 79.3% mAP (needs improvement, added to backlog)

MLE.5: Deployment

  • [OK] Optimization: TensorRT INT8 quantization (mAP 95.1%, latency 12ms)
  • [OK] HiL testing: 1,000 scenarios passed (AEB activation validated)
  • [OK] Monitoring: Grafana dashboard tracking latency, confidence, drift
  • [OK] Rollout: Canary deployment to 10% fleet → full rollout after 2 weeks

Result: Model approved for production deployment (ASIL-C compliance achieved).


Tool Integration for Automated Checklist Enforcement

CI/CD Pipeline with Checklist Gates

Example: GitLab CI/CD for ML Model Validation

# .gitlab-ci.yml - ML Model CI/CD Pipeline
stages:
  - data_validation
  - training
  - model_validation
  - deployment

# MLE.1: Data Quality Checks
data_quality_check:
  stage: data_validation
  script:
    - python scripts/mle1_dataset_validation.py --dataset $DATASET_PATH
  artifacts:
    reports:
      junit: data_quality_report.xml
  rules:
    - if: '$CI_PIPELINE_SOURCE == "push" && $DATASET_VERSION != null'

# MLE.3: Training with Tracking
model_training:
  stage: training
  script:
    - python train.py --config configs/yolov8s_pedestrian.yaml
    - mlflow run . --experiment-name pedestrian_detection
  artifacts:
    paths:
      - models/best_model.pt
      - training_log.json
  only:
    - main

# MLE.4: Validation Checklist
model_validation:
  stage: model_validation
  script:
    - python scripts/mle4_validation_suite.py --model models/best_model.pt --test-set $TEST_SET
    - python scripts/adversarial_robustness_test.py --model models/best_model.pt --epsilon 0.01
  artifacts:
    reports:
      junit: validation_report.xml
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  # Gate: Block deployment if validation fails
  allow_failure: false

# MLE.5: Deployment to Staging
deploy_to_staging:
  stage: deployment
  script:
    - python scripts/optimize_for_tensorrt.py --model models/best_model.pt --target jetson_xavier
    - ansible-playbook deploy_model.yml --limit staging_fleet
  only:
    - tags  # Only deploy on version tags (e.g., v2.1.0)
  when: manual  # Require human approval
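The model_validation stage gates deployment through its exit code: with allow_failure set to false, any nonzero exit blocks the pipeline. A minimal sketch of what a script like mle4_validation_suite.py might check; the thresholds mirror the case-study requirements, and the function names and metric keys are illustrative:

```python
import sys

# MLE.4 release thresholds from the pedestrian-detection requirements
THRESHOLDS = {"map": 0.95, "latency_ms": 50.0, "false_negative_rate": 0.01}

def check_metrics(metrics):
    """Return a list of human-readable failures; an empty list means PASS."""
    failures = []
    if metrics["map"] < THRESHOLDS["map"]:
        failures.append(f"mAP {metrics['map']:.3f} below {THRESHOLDS['map']}")
    if metrics["latency_ms"] > THRESHOLDS["latency_ms"]:
        failures.append(f"latency {metrics['latency_ms']}ms above {THRESHOLDS['latency_ms']}ms")
    if metrics["false_negative_rate"] > THRESHOLDS["false_negative_rate"]:
        failures.append("false-negative rate above limit")
    return failures

def main(metrics):
    failures = check_metrics(metrics)
    for f in failures:
        print(f"[FAIL] {f}")
    return 1 if failures else 0  # nonzero exit code fails the CI stage

# In the real script: sys.exit(main(load_metrics_from_test_run()))
```

Because the gate is expressed as exit codes, the same script works unchanged under GitLab CI, Jenkins, or a local pre-release check.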

Conclusion and Best Practices

Key Takeaways

  1. Checklists enforce rigor: Systematic validation prevents ML-specific pitfalls (overfitting, drift, adversarial failures)
  2. Automation is essential: CI/CD integration ensures checklists are enforced on every model update
  3. Traceability throughout: Requirements → Data → Model → Tests linkage is mandatory for ASPICE compliance
  4. Safety mechanisms are non-negotiable: ASIL-C/D models require redundancy, fallback, and monitoring
  5. Continuous monitoring: Deployment is not the end; drift detection and retraining are critical for long-term safety

Adoption Roadmap

Phase 1 (Months 1-2): Implement MLE.1-3 checklists

  • Define ML requirements template
  • Establish training pipeline with MLflow tracking
  • Create convergence validation scripts

Phase 2 (Months 3-4): Implement MLE.4-5 checklists

  • Build automated validation test suite
  • Set up adversarial robustness testing
  • Deploy monitoring infrastructure (Prometheus/Grafana)

Phase 3 (Months 5-6): Full ASPICE compliance

  • Conduct independent V&V for ASIL-C/D models
  • Complete traceability matrix (requirements ↔ code ↔ tests)
  • Pass ASPICE assessment (target CL2)

Next Chapter: Chapter 11.1 AI-Powered Requirements Tools - Leverage AI agents to automate requirements analysis and traceability.


References

  • VDA: Automotive SPICE PAM 4.0 (2023) - MLE.1-5 Process Descriptions
  • ISO 26262-6:2018: Road Vehicles - Functional Safety - Part 6: Product Development at the Software Level
  • ISO/PAS 21448 (SOTIF): Safety of the Intended Functionality (ML verification requirements)
  • UL 4600: Standard for Safety for the Evaluation of Autonomous Products (ML safety assurance)
  • Goodfellow, Ian et al.: "Explaining and Harnessing Adversarial Examples" (ICLR 2015)
  • Guo, Chuan et al.: "On Calibration of Modern Neural Networks" (ICML 2017)
  • Rabanser, Stephan et al.: "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift" (NeurIPS 2019)
  • TensorRT Documentation: NVIDIA TensorRT INT8 Quantization Guide (2025)
  • MLflow: "MLflow: A Platform for the Machine Learning Lifecycle" (2023)