4.4: MLE.4 ML Model Testing


Process Definition

Purpose

MLE.4 Purpose: To verify that the ML model meets the ML requirements and performs within the defined operational domain.

Outcomes

| Outcome | Description |
|---------|-------------|
| O1 | An ML test approach is defined |
| O2 | An ML test data set is created |
| O3 | The trained ML model is tested |
| O4 | The deployed ML model is derived from the trained ML model and tested |
| O5 | Consistency and bidirectional traceability are established between the ML test approach and the ML requirements, and between the ML test data set and the ML data requirements |
| O6 | Results of the ML model testing are summarized and communicated, together with the deployed ML model, to all affected parties |

Base Practices with AI Integration

| BP | Base Practice | AI Level | AI Application |
|----|---------------|----------|----------------|
| BP1 | Specify an ML test approach | L1-L2 | Test strategy suggestions |
| BP2 | Create ML test data set | L2 | Data selection, coverage analysis |
| BP3 | Test trained ML model | L2-L3 | Automated test execution |
| BP4 | Derive deployed ML model | L2 | Model conversion, optimization |
| BP5 | Test deployed ML model | L2-L3 | Automated test execution |
| BP6 | Ensure consistency and establish bidirectional traceability | L2 | Trace generation, coverage tracking |
| BP7 | Summarize and communicate results | L1 | Report generation |
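BP6 calls for bidirectional traceability between tests and requirements. A minimal sketch of an automated coverage check is shown below; the `trace_coverage` function and its report keys are illustrative, and the IDs in the usage example reuse the hypothetical identifiers from the test specifications in this section.

```python
"""Illustrative bidirectional trace coverage check (BP6 sketch)."""

from typing import Dict, List


def trace_coverage(requirement_to_tests: Dict[str, List[str]],
                   test_to_requirements: Dict[str, List[str]]) -> Dict:
    """Flag requirements without tests, tests without requirement links,
    and check that the forward and backward trace directions agree."""
    untraced_reqs = [r for r, tests in requirement_to_tests.items() if not tests]
    untraced_tests = [t for t, reqs in test_to_requirements.items() if not reqs]
    # Every forward link requirement -> test must have a matching backward link
    consistent = all(
        t in test_to_requirements and r in test_to_requirements[t]
        for r, tests in requirement_to_tests.items() for t in tests
    )
    return {
        "untraced_requirements": untraced_reqs,
        "untraced_tests": untraced_tests,
        "bidirectionally_consistent": consistent,
    }


forward = {"MLE-ADAS-001": ["MLE-TEST-001", "MLE-TEST-002"]}
backward = {"MLE-TEST-001": ["MLE-ADAS-001"], "MLE-TEST-002": ["MLE-ADAS-001"]}
print(trace_coverage(forward, backward))
```

In practice the two mappings would be exported from the project's requirements management tool rather than hand-maintained in code.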

ML Testing Framework

Test Categories

The following diagram categorizes ML model testing into functional, performance, robustness, and safety testing, showing the scope and purpose of each category.

ML Model Testing Framework


Test Specification

Statistical Validation Test

```yaml
# ML Test Specification
test_specification:
  id: MLE-TEST-001
  name: Vehicle Detection Statistical Validation
  model: MLE-MODEL-001-v2.3.1
  requirement: MLE-ADAS-001

  test_dataset:
    id: MLE-DATA-002
    type: held_out_test
    samples: 150000
    never_used_in_training: true
    geographic_diversity: true

  metrics:
    primary:
      - metric: recall
        threshold: 0.999
        confidence_level: 0.95

      - metric: precision
        threshold: 0.98
        confidence_level: 0.95

      - metric: mAP_50
        threshold: 0.85
        confidence_level: 0.95

    secondary:
      - metric: f1_score
        threshold: 0.97

      - metric: inference_time_ms
        threshold: 50
        percentile: 99

  stratification:
    - by: weather
      strata: [clear, rain, fog]
      min_samples_per_stratum: 10000

    - by: lighting
      strata: [day, twilight, night]
      min_samples_per_stratum: 10000

    - by: vehicle_type
      strata: [car, truck, motorcycle, bus]
      min_samples_per_stratum: 5000

  pass_criteria:
    all_primary_metrics: true
    statistical_significance: "p < 0.05"  # Standard threshold; see statistical testing literature
    stratified_performance: "No stratum > 5% below average"
```

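The primary metrics above pair each threshold with a 95% confidence level, meaning a metric should only pass when its entire confidence interval clears the threshold. A minimal sketch of such a check using the Wilson score interval for a binomial proportion (the sample counts in the example are hypothetical):

```python
"""Illustrative threshold check with a Wilson score confidence interval."""

import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion
    (z = 1.96 corresponds to a 95% confidence level)."""
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = p_hat + z**2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom


def metric_passes(successes: int, trials: int, threshold: float) -> bool:
    """Pass only if the whole 95% interval lies at or above the threshold."""
    return wilson_lower_bound(successes, trials) >= threshold


# Hypothetical counts: 149,950 of 150,000 ground-truth vehicles detected
print(metric_passes(149_950, 150_000, 0.999))  # True
```

Note that with a 0.999 recall threshold even a point estimate above 0.999 can fail: the interval's lower bound, not the observed rate, must clear the threshold.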
Edge Case Test Specification

```yaml
# Edge Case Testing
edge_case_tests:
  id: MLE-TEST-002
  name: Vehicle Detection Edge Cases
  model: MLE-MODEL-001-v2.3.1

  scenarios:
    - id: EC-001
      name: "Partial occlusion"
      description: "Vehicle partially hidden by barrier"
      expected: "Detection with >50% visible"
      samples: 1000

    - id: EC-002
      name: "Unusual vehicle appearance"
      description: "Vehicles with unusual paint/wrap"
      expected: "Detection regardless of appearance"
      samples: 500

    - id: EC-003
      name: "Distance boundary"
      description: "Vehicles at maximum detection range"
      expected: "Detection at 100m distance"
      samples: 1000

    - id: EC-004
      name: "High-speed scenario"
      description: "Approaching vehicle at high relative speed"
      expected: "Consistent detection across frames"
      samples: 500

    - id: EC-005
      name: "Multi-vehicle cluster"
      description: "Group of vehicles close together"
      expected: "Individual detection of each vehicle"
      samples: 800

    - id: EC-006
      name: "Vehicle transition"
      description: "Vehicle entering/exiting frame"
      expected: "No false positives at boundaries"
      samples: 600
```
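A possible shape for the runner that evaluates these scenarios is sketched below. The `run_scenario` hook, the per-scenario thresholds, and the report keys are all hypothetical; in a real project the thresholds would live in the test specification, not in code.

```python
"""Illustrative edge-case test runner (scenario IDs follow the spec above)."""

from typing import Callable, Dict, Tuple

# Hypothetical per-scenario pass-rate thresholds
EDGE_CASE_THRESHOLDS = {
    "EC-001": 0.98,  # partial occlusion
    "EC-003": 0.99,  # distance boundary
}


def evaluate_edge_cases(run_scenario: Callable[[str], Tuple[int, int]]) -> Dict:
    """Aggregate (passed, total) counts per scenario into per-scenario
    verdicts and an overall pass/fail flag."""
    report = {}
    for scenario_id, threshold in EDGE_CASE_THRESHOLDS.items():
        passed, total = run_scenario(scenario_id)
        rate = passed / total if total else 0.0
        report[scenario_id] = {"pass_rate": rate, "passed": rate >= threshold}
    report["overall_pass"] = all(
        entry["passed"] for key, entry in report.items() if key.startswith("EC-"))
    return report


# Stub hook standing in for real model execution on a scenario's samples
print(evaluate_edge_cases(lambda scenario_id: (990, 1000)))
```

The overall verdict here is purely mechanical; per MLE.4 the final pass/fail judgment on edge cases remains a human responsibility.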

Test Results Report

The diagram below presents a structured ML model test report, summarizing pass/fail results across all test categories along with key performance metrics and confidence intervals.

ML Model Test Report


Robustness Testing

Adversarial Testing

Note: the Python code examples are illustrative; the `compute_loss` and `evaluate_*` helper functions require project-specific implementations.

"""
Adversarial robustness testing for ML model
"""

import torch
import numpy as np
from typing import Tuple

def fgsm_attack(model, image: torch.Tensor, label: torch.Tensor,
                epsilon: float = 0.03) -> torch.Tensor:
    """Fast Gradient Sign Method attack."""
    image.requires_grad = True

    output = model(image)
    loss = compute_loss(output, label)

    model.zero_grad()
    loss.backward()

    # Generate adversarial example
    perturbed = image + epsilon * image.grad.sign()
    perturbed = torch.clamp(perturbed, 0, 1)

    return perturbed


def run_robustness_tests(model, test_loader, config) -> dict:
    """Run comprehensive robustness tests."""

    results = {
        'clean_accuracy': 0,
        'fgsm_accuracy': {},
        'noise_robustness': {},
        'blur_robustness': {},
        'brightness_robustness': {}
    }

    # Clean accuracy
    results['clean_accuracy'] = evaluate_accuracy(model, test_loader)

    # FGSM attack at various epsilon values
    for epsilon in [0.01, 0.03, 0.05, 0.1]:
        perturbed_acc = evaluate_adversarial(model, test_loader, epsilon)
        results['fgsm_accuracy'][epsilon] = perturbed_acc

    # Gaussian noise robustness
    for sigma in [0.01, 0.05, 0.1]:
        noisy_acc = evaluate_with_noise(model, test_loader, sigma)
        results['noise_robustness'][sigma] = noisy_acc

    # Motion blur robustness
    for kernel_size in [3, 5, 7]:
        blur_acc = evaluate_with_blur(model, test_loader, kernel_size)
        results['blur_robustness'][kernel_size] = blur_acc

    # Brightness variation robustness
    for factor in [0.5, 0.8, 1.2, 1.5]:
        bright_acc = evaluate_with_brightness(model, test_loader, factor)
        results['brightness_robustness'][factor] = bright_acc

    return results
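One possible shape for the project-specific `evaluate_with_noise` helper referenced above is sketched here, assuming a classification model whose forward pass returns a logits tensor; a detection model would instead need an IoU-based match criterion. The accuracy definition (argmax agreement with labels) is an assumption.

```python
"""Illustrative Gaussian-noise robustness helper (one possible evaluate_* shape)."""

import torch


@torch.no_grad()
def evaluate_with_noise(model, test_loader, sigma: float) -> float:
    """Accuracy on inputs perturbed with N(0, sigma^2) noise, clamped to [0, 1]."""
    model.eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        noisy = torch.clamp(images + sigma * torch.randn_like(images), 0, 1)
        preds = model(noisy).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total if total else 0.0
```

The `evaluate_with_blur` and `evaluate_with_brightness` helpers would follow the same pattern, swapping the noise line for the corresponding perturbation.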

Work Products

| WP ID | Work Product | AI Role |
|-------|--------------|---------|
| 08-62 | ML test specification | Test generation |
| 13-62 | ML test report | Result analysis |
| 13-66 | Edge case analysis | Scenario identification |
| 17-11 | Traceability record | Coverage tracking |

Summary

MLE.4 ML Model Testing:

  • AI Level: L2 (automated testing, human validation)
  • Primary AI Value: Test generation, result analysis
  • Human Essential: Pass/fail judgment, safety assessment
  • Key Outputs: Test report, coverage analysis
  • Focus: Statistical validation, edge cases, robustness