4.4: MLE.4 ML Model Testing
Process Definition
Purpose
MLE.4 Purpose: To verify that the ML model meets the ML requirements and performs within the defined operational domain.
Outcomes
| Outcome | Description |
|---|---|
| O1 | An ML test approach is defined |
| O2 | An ML test data set is created |
| O3 | The trained ML model is tested |
| O4 | The deployed ML model is derived from the trained ML model and tested |
| O5 | Consistency and bidirectional traceability are established between the ML test approach and the ML requirements, and the ML test data set and the ML data requirements |
| O6 | Results of the ML model testing are summarized and communicated with the deployed ML model to all affected parties |
Base Practices with AI Integration
| BP | Base Practice | AI Level | AI Application |
|---|---|---|---|
| BP1 | Specify an ML test approach | L1-L2 | Test strategy suggestions |
| BP2 | Create ML test data set | L2 | Data selection, coverage analysis |
| BP3 | Test trained ML model | L2-L3 | Automated test execution |
| BP4 | Derive deployed ML model | L2 | Model conversion, optimization |
| BP5 | Test deployed ML model | L2-L3 | Automated test execution |
| BP6 | Ensure consistency and establish bidirectional traceability | L2 | Trace generation, coverage tracking |
| BP7 | Summarize and communicate results | L1 | Report generation |
ML Testing Framework
Test Categories
The following diagram categorizes ML model testing into functional, performance, robustness, and safety testing, showing the scope and purpose of each category.
Test Specification
Statistical Validation Test
```yaml
# ML Test Specification
test_specification:
  id: MLE-TEST-001
  name: Vehicle Detection Statistical Validation
  model: MLE-MODEL-001-v2.3.1
  requirement: MLE-ADAS-001

  test_dataset:
    id: MLE-DATA-002
    type: held_out_test
    samples: 150000
    never_used_in_training: true
    geographic_diversity: true

  metrics:
    primary:
      - metric: recall
        threshold: 0.999
        confidence_level: 0.95
      - metric: precision
        threshold: 0.98
        confidence_level: 0.95
      - metric: mAP_50
        threshold: 0.85
        confidence_level: 0.95
    secondary:
      - metric: f1_score
        threshold: 0.97
      - metric: inference_time_ms
        threshold: 50
        percentile: 99

  stratification:
    - by: weather
      strata: [clear, rain, fog]
      min_samples_per_stratum: 10000
    - by: lighting
      strata: [day, twilight, night]
      min_samples_per_stratum: 10000
    - by: vehicle_type
      strata: [car, truck, motorcycle, bus]
      min_samples_per_stratum: 5000

  pass_criteria:
    all_primary_metrics: true
    statistical_significance: "p < 0.05"  # Standard threshold; see statistical testing literature
    stratified_performance: "No stratum > 5% below average"
```
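The primary-metric pass criteria above pair each threshold with a 95% confidence level, meaning an observed value should only pass if its confidence bound clears the threshold. The sketch below illustrates one common way to do this for a proportion metric such as recall, using a Wilson score lower bound; the function names are illustrative and not part of the specification.

```python
"""Check a proportion metric (e.g. recall) against its threshold using a
Wilson score lower confidence bound (z = 1.96 for the 95% confidence level
in the test specification). Illustrative sketch, not a normative check."""
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom


def metric_passes(successes: int, n: int, threshold: float) -> bool:
    """Pass only if the conservative lower bound clears the threshold."""
    return wilson_lower_bound(successes, n) >= threshold


# Example: 149,900 of 150,000 vehicles detected (observed recall ~0.99933)
print(metric_passes(149_900, 150_000, threshold=0.999))  # True: lower bound ~0.9992
```

Using the lower bound rather than the point estimate makes the pass decision conservative: with the 150,000-sample test set, an observed recall only slightly above 0.999 could still fail if its interval dips below the threshold.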
Edge Case Test Specification
```yaml
# Edge Case Testing
edge_case_tests:
  id: MLE-TEST-002
  name: Vehicle Detection Edge Cases
  model: MLE-MODEL-001-v2.3.1
  scenarios:
    - id: EC-001
      name: "Partial occlusion"
      description: "Vehicle partially hidden by barrier"
      expected: "Detection with >50% visible"
      samples: 1000
    - id: EC-002
      name: "Unusual vehicle appearance"
      description: "Vehicles with unusual paint/wrap"
      expected: "Detection regardless of appearance"
      samples: 500
    - id: EC-003
      name: "Distance boundary"
      description: "Vehicles at maximum detection range"
      expected: "Detection at 100m distance"
      samples: 1000
    - id: EC-004
      name: "High-speed scenario"
      description: "Approaching vehicle at high relative speed"
      expected: "Consistent detection across frames"
      samples: 500
    - id: EC-005
      name: "Multi-vehicle cluster"
      description: "Group of vehicles close together"
      expected: "Individual detection of each vehicle"
      samples: 800
    - id: EC-006
      name: "Vehicle transition"
      description: "Vehicle entering/exiting frame"
      expected: "No false positives at boundaries"
      samples: 600
```
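Each scenario's results must ultimately be rolled up into a per-scenario verdict. The following sketch shows one way to summarize edge-case results against a minimum pass rate; the 0.95 threshold and the `summarize_edge_cases` helper are assumed project choices, not part of the specification above.

```python
"""Summarize edge-case scenario results into per-scenario verdicts.
Scenario IDs mirror the edge-case specification; the 0.95 minimum
pass rate is an assumed project choice (illustrative only)."""


def summarize_edge_cases(results: dict, min_pass_rate: float = 0.95) -> dict:
    """results maps scenario id -> (passed_samples, total_samples)."""
    summary = {}
    for scenario_id, (passed, total) in results.items():
        rate = passed / total
        summary[scenario_id] = {
            "pass_rate": round(rate, 4),
            "verdict": "PASS" if rate >= min_pass_rate else "FAIL",
        }
    return summary


# Example: hypothetical results for three of the scenarios above
results = {"EC-001": (978, 1000), "EC-002": (470, 500), "EC-003": (951, 1000)}
print(summarize_edge_cases(results))
```

A per-scenario verdict table of this shape feeds directly into the test report (WP 13-62) and keeps edge-case failures visible rather than averaged away in aggregate metrics.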
Test Results Report
The diagram below presents a structured ML model test report, summarizing pass/fail results across all test categories along with key performance metrics and confidence intervals.
Robustness Testing
Adversarial Testing
Note: the Python code below is illustrative; `compute_loss` and the `evaluate_*` helper functions require project-specific implementations.
"""
Adversarial robustness testing for ML model
"""
import torch
import numpy as np
from typing import Tuple
def fgsm_attack(model, image: torch.Tensor, label: torch.Tensor,
epsilon: float = 0.03) -> torch.Tensor:
"""Fast Gradient Sign Method attack."""
image.requires_grad = True
output = model(image)
loss = compute_loss(output, label)
model.zero_grad()
loss.backward()
# Generate adversarial example
perturbed = image + epsilon * image.grad.sign()
perturbed = torch.clamp(perturbed, 0, 1)
return perturbed
def run_robustness_tests(model, test_loader, config) -> dict:
"""Run comprehensive robustness tests."""
results = {
'clean_accuracy': 0,
'fgsm_accuracy': {},
'noise_robustness': {},
'blur_robustness': {},
'brightness_robustness': {}
}
# Clean accuracy
results['clean_accuracy'] = evaluate_accuracy(model, test_loader)
# FGSM attack at various epsilon values
for epsilon in [0.01, 0.03, 0.05, 0.1]:
perturbed_acc = evaluate_adversarial(model, test_loader, epsilon)
results['fgsm_accuracy'][epsilon] = perturbed_acc
# Gaussian noise robustness
for sigma in [0.01, 0.05, 0.1]:
noisy_acc = evaluate_with_noise(model, test_loader, sigma)
results['noise_robustness'][sigma] = noisy_acc
# Motion blur robustness
for kernel_size in [3, 5, 7]:
blur_acc = evaluate_with_blur(model, test_loader, kernel_size)
results['blur_robustness'][kernel_size] = blur_acc
# Brightness variation robustness
for factor in [0.5, 0.8, 1.2, 1.5]:
bright_acc = evaluate_with_brightness(model, test_loader, factor)
results['brightness_robustness'][factor] = bright_acc
return results
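As one possible shape for the project-specific helpers referenced above, the sketch below implements a Gaussian-noise accuracy evaluation for a classification-style model. The signature matches the `evaluate_with_noise` call above, but the body is an assumption about a hypothetical project, not a normative implementation.

```python
"""One possible implementation of the evaluate_with_noise helper
(illustrative assumption; a detection model would score IoU-matched
boxes rather than top-1 labels)."""
import torch


def evaluate_with_noise(model, test_loader, sigma: float) -> float:
    """Top-1 accuracy after adding zero-mean Gaussian noise to each input."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            noisy = torch.clamp(images + sigma * torch.randn_like(images), 0, 1)
            preds = model(noisy).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Smoke test with a toy classifier (hypothetical stand-in for the real model)
    toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4, 2))
    batch = [(torch.rand(8, 2, 2), torch.randint(0, 2, (8,)))]
    print(evaluate_with_noise(toy_model, batch, sigma=0.05))
```

The blur and brightness helpers would follow the same pattern, swapping the noise line for the corresponding image transform.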
Work Products
| WP ID | Work Product | AI Role |
|---|---|---|
| 08-62 | ML test specification | Test generation |
| 13-62 | ML test report | Result analysis |
| 13-66 | Edge case analysis | Scenario identification |
| 17-11 | Traceability record | Coverage tracking |
Summary
MLE.4 ML Model Testing:
- AI Level: L2 (automated testing, human validation)
- Primary AI Value: Test generation, result analysis
- Human Essential: Pass/fail judgment, safety assessment
- Key Outputs: Test report, coverage analysis
- Focus: Statistical validation, edge cases, robustness