4.3: MLE.3 ML Training and Learning


Process Definition

Purpose

MLE.3 Purpose: To train the ML model according to the architecture and data requirements.

Outcomes

| Outcome | Description |
|---|---|
| O1 | An ML training and validation approach is specified |
| O2 | The data set for ML training and ML validation is created |
| O3 | The ML model, including hyperparameter values, is optimized to meet the defined ML requirements |
| O4 | Consistency and bidirectional traceability are established between the ML training and validation data set and the ML data requirements |
| O5 | Results of optimization are summarized, and the trained ML model is agreed and communicated to all affected parties |

Base Practices with AI Integration

| BP | Base Practice | AI Level | AI Application |
|---|---|---|---|
| BP1 | Specify ML training and validation approach | L1-L2 | Approach definition |
| BP2 | Create ML training and validation data set | L2 | Data selection, augmentation |
| BP3 | Create and optimize ML model | L2-L3 | Automated training, hyperparameter tuning |
| BP4 | Ensure consistency and establish bidirectional traceability | L2 | Data-to-requirements tracing |
| BP5 | Summarize and communicate agreed trained ML model | L1 | Results documentation |
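BP4's data-to-requirements tracing can be expressed as a simple bidirectional check: every training sample must trace to at least one ML data requirement, and every requirement must be covered by at least one sample. The record shapes and the `MLE-DATA-REQ-*` IDs below are illustrative assumptions, not prescribed by the process:

```python
def check_traceability(samples, requirements):
    """Bidirectional trace check (BP4): report samples with no valid
    requirement trace, and requirements not covered by any sample.

    `samples` and `requirements` are lists of dicts; the field names
    used here are illustrative, not a mandated schema.
    """
    req_ids = {r["id"] for r in requirements}
    # forward direction: each sample traces to at least one known requirement
    untraced = [s["id"] for s in samples
                if not set(s.get("traces", [])) & req_ids]
    # backward direction: each requirement is covered by some sample
    covered = {t for s in samples for t in s.get("traces", [])}
    uncovered = sorted(req_ids - covered)
    return untraced, uncovered
```

Both result lists must be empty before the data set can be considered consistent with the ML data requirements.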

Training Pipeline

End-to-End Training Process

The following diagram illustrates the end-to-end ML training pipeline, from data ingestion and augmentation through model training, validation, and artifact versioning.

ML Training Pipeline


Data Augmentation Strategy

Automotive-Specific Augmentation

# Data augmentation configuration
augmentation:
  geometric:
    horizontal_flip:
      enabled: true
      probability: 0.5

    scale:
      enabled: true
      range: [0.8, 1.2]

    rotation:
      enabled: true
      range: [-5, 5]  # degrees - limited for driving context

    crop:
      enabled: true
      min_area: 0.7

  photometric:
    brightness:
      enabled: true
      range: [-0.2, 0.2]

    contrast:
      enabled: true
      range: [0.8, 1.2]

    saturation:
      enabled: true
      range: [0.8, 1.2]

    hue:
      enabled: true
      range: [-0.1, 0.1]

  weather_simulation:
    rain:
      enabled: true
      intensity_range: [0.1, 0.5]

    fog:
      enabled: true
      density_range: [0.1, 0.3]

    snow:
      enabled: false  # Use real snow data instead

  automotive_specific:
    sun_flare:
      enabled: true
      probability: 0.1

    headlight_glare:
      enabled: true
      probability: 0.1  # Night scenes only

    sensor_noise:
      enabled: true
      noise_level: 0.01

  mosaic:
    enabled: true
    probability: 0.5
    grid_size: 2

  mixup:
    enabled: true
    probability: 0.1
    alpha: 0.5
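A minimal sketch of how the geometric portion of this configuration could be applied to an HxWxC numpy image. The function name and config-dict shape are assumptions for illustration; a production pipeline would typically use a dedicated augmentation library rather than hand-rolled transforms:

```python
import random

import numpy as np


def apply_geometric(image, cfg, rng=None):
    """Apply the enabled geometric augmentations from a dict shaped
    like the `geometric:` section above (illustrative sketch only)."""
    rng = rng or random.Random()

    flip = cfg.get("horizontal_flip", {})
    if flip.get("enabled") and rng.random() < flip.get("probability", 0.5):
        image = image[:, ::-1, :]  # mirror left-right

    scale = cfg.get("scale", {})
    if scale.get("enabled"):
        lo, hi = scale.get("range", [0.8, 1.2])
        factor = rng.uniform(lo, hi)
        h, w = image.shape[:2]
        # nearest-neighbour resize keeps the sketch dependency-free
        ys = (np.arange(int(h * factor)) / factor).astype(int).clip(0, h - 1)
        xs = (np.arange(int(w * factor)) / factor).astype(int).clip(0, w - 1)
        image = image[ys][:, xs]

    return image
```

Rotation and crop follow the same pattern and are omitted for brevity; for detection tasks each geometric transform must also be applied to the bounding-box labels.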

Training Configuration

Hyperparameter Specification

# Training configuration
training:
  id: MLE-TRAIN-001
  model: YOLOv8-nano
  dataset: MLE-DATA-001

  # Base training parameters
  epochs: 300
  batch_size: 64
  image_size: [640, 384]

  # Optimizer configuration
  optimizer:
    type: AdamW
    lr: 0.001
    weight_decay: 0.0005
    momentum: 0.937  # AdamW has no classical momentum; used as beta1 here

  # Learning rate schedule
  scheduler:
    type: cosine
    warmup_epochs: 3
    warmup_bias_lr: 0.1
    warmup_momentum: 0.8
    min_lr: 0.0001

  # Transfer learning
  transfer:
    pretrained: "yolov8n.pt"
    freeze_backbone: false
    freeze_epochs: 0

  # Loss function
  loss:
    box_loss: 7.5
    cls_loss: 0.5
    dfl_loss: 1.5

  # Early stopping
  early_stopping:
    patience: 50
    metric: "mAP@0.5"
    mode: "max"

  # Checkpointing
  checkpointing:
    save_period: 10  # epochs
    save_best: true
    save_last: true

  # Hardware
  hardware:
    device: "cuda:0"
    workers: 8
    pin_memory: true
    amp: true  # Automatic Mixed Precision

  # Reproducibility
  reproducibility:
    seed: 42
    deterministic: true
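The scheduler block above (linear warmup followed by cosine decay to `min_lr`) implies the learning-rate curve sketched below. This is a hand-rolled illustration for clarity; an actual run would use the training framework's scheduler implementation:

```python
import math


def lr_at_epoch(epoch, total_epochs=300, base_lr=0.001,
                min_lr=0.0001, warmup_epochs=3):
    """Learning rate for a given epoch under linear warmup plus
    cosine decay, using the values from the configuration above."""
    if epoch < warmup_epochs:
        # linear ramp up to base_lr over the warmup epochs
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay from base_lr down to min_lr
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The curve peaks at `base_lr` at the end of warmup and approaches `min_lr` in the final epochs.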

Training Monitoring

Metrics Dashboard

The diagram below shows a training monitoring dashboard, displaying real-time loss curves, accuracy metrics, and early stopping indicators that enable engineers to detect training issues promptly.

Training Monitoring Dashboard
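The early stopping indicator on such a dashboard can be driven by a small tracker mirroring the `early_stopping` block in the training configuration (patience 50, mode max on mAP@0.5). The class below is an illustrative sketch, not the trainer's actual implementation:

```python
class EarlyStopping:
    """Track a validation metric and signal when training should stop,
    mirroring the `early_stopping` block in the training configuration."""

    def __init__(self, patience=50, mode="max"):
        self.patience = patience
        self.mode = mode
        self.best = None
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's metric; return True when patience is exhausted."""
        improved = (self.best is None
                    or (self.mode == "max" and value > self.best)
                    or (self.mode == "min" and value < self.best))
        if improved:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Combined with `save_best: true` in the checkpointing block, the checkpoint from the best epoch is retained even when later epochs trigger the stop.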


Model Versioning

Version Control for ML

# Model version specification (illustrative example)
model_version:
  id: MLE-MODEL-001-v2.3.1
  created: "(timestamp)"
  training_run: MLE-TRAIN-001

  artifacts:
    weights: "models/yolov8n_adas_v2.3.1.pt"
    config: "configs/train_v2.3.1.yaml"
    onnx: "models/yolov8n_adas_v2.3.1.onnx"
    trt_int8: "models/yolov8n_adas_v2.3.1.engine"

  metrics:
    validation:
      mAP_50: 0.894
      precision: 0.965
      recall: 0.989
      f1: 0.977

    test:  # Held-out test set
      mAP_50: 0.887
      precision: 0.958
      recall: 0.984
      f1: 0.971

  training_data:
    dataset_version: MLE-DATA-001-v1.2
    train_samples: 700000
    val_samples: 150000
    augmentation: "aug_config_v2.yaml"

  training_params:
    epochs: 287  # Early stopped
    best_epoch: 237
    training_time: "36h 42m"
    gpu: "NVIDIA A100"

  dependencies:
    pytorch: "2.1.0"
    ultralytics: "8.0.200"
    cuda: "12.1"

  traceability:
    requirements:
      - MLE-ADAS-001
      - MLE-ADAS-002
    architecture: MLE-ARCH-001
    previous_version: MLE-MODEL-001-v2.2.0

  status: "validated"  # draft, training, validated, released
  approver: "(Approver Name)"
  approval_date: "(Approval Date)"
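One practical use of such version records is an automated regression gate: before a candidate version is approved, its validation metrics are compared against the `previous_version` record. The helper below is an illustrative sketch assuming the record structure above; the tolerance is an assumed project-specific threshold:

```python
def check_regression(candidate, baseline,
                     metrics=("mAP_50", "precision", "recall"),
                     tolerance=0.005):
    """Compare validation metrics of a candidate model version record
    against the previous release; return metrics that regressed by
    more than `tolerance` with their deltas."""
    regressions = {}
    for m in metrics:
        delta = (candidate["metrics"]["validation"][m]
                 - baseline["metrics"]["validation"][m])
        if delta < -tolerance:
            regressions[m] = delta
    return regressions
```

An empty result would be a precondition for moving the record's `status` from "training" to "validated".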

Quantization-Aware Training

QAT Process

Note: Python code examples are illustrative and require project-specific implementation (optimizer, loss functions, etc.).

"""
Quantization-Aware Training for Automotive Deployment
"""

import torch
from torch.quantization import QuantStub, DeQuantStub

class QuantizedModel(torch.nn.Module):
    """Model wrapper for quantization-aware training."""

    def __init__(self, base_model):
        super().__init__()
        self.quant = QuantStub()
        self.model = base_model
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.model(x)
        x = self.dequant(x)
        return x


def train_qat(model, train_loader, val_loader, config):
    """Quantization-aware training loop.

    `compute_loss` and `validate` are placeholders for the project's
    loss function and validation routine.
    """

    # Prepare model for QAT: insert fake-quantization observers
    # ('fbgemm' targets x86; choose the qconfig per deployment backend)
    model.train()
    model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    torch.quantization.prepare_qat(model, inplace=True)

    # Create the optimizer after prepare_qat so it sees the prepared model
    optimizer = torch.optim.AdamW(model.parameters(), lr=config['lr'])

    # Training loop
    for epoch in range(config['qat_epochs']):
        for batch in train_loader:
            images, labels = batch

            # Forward pass
            outputs = model(images)
            loss = compute_loss(outputs, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Validation
        val_metrics = validate(model, val_loader)
        print(f"Epoch {epoch}: mAP={val_metrics['mAP']:.4f}")

        # Check quantization degradation
        if epoch == 0:
            baseline_map = val_metrics['mAP']
        if baseline_map - val_metrics['mAP'] > 0.01:
            print("WARNING: QAT causing >1% accuracy drop")

    # Convert to quantized model
    model.eval()
    quantized_model = torch.quantization.convert(model, inplace=False)

    return quantized_model

Work Products

| WP ID | Work Product | AI Role |
|---|---|---|
| 11-06 | Trained model | Training output |
| 11-07 | Training dataset | Data preparation |
| 13-65 | Training report | Metrics tracking |
| 04-11 | Training configuration | Setup documentation |

Summary

MLE.3 ML Training and Learning:

  • AI Level: L2-L3 (high automation for training)
  • Primary AI Value: Automated training, hyperparameter optimization
  • Human Essential: Data quality, final model selection
  • Key Outputs: Trained model, training report
  • Focus: Reproducibility, versioning, traceability