4.4: MLE.4 ML Model Testing
Process Definition
Purpose
MLE.4 Purpose: To verify that the ML model meets the ML requirements and performs within the defined operational domain.
Outcomes
| Outcome | Description |
|---|---|
| O1 | An ML test approach is defined |
| O2 | An ML test data set is created |
| O3 | The trained ML model is tested |
| O4 | The deployed ML model is derived from the trained ML model and tested |
| O5 | Consistency and bidirectional traceability are established between the ML test approach and the ML requirements, and the ML test data set and the ML data requirements |
| O6 | Results of the ML model testing are summarized and communicated with the deployed ML model to all affected parties |
Base Practices with AI Integration
| BP | Base Practice | AI Level | AI Application |
|---|---|---|---|
| BP1 | Specify an ML test approach | L1-L2 | Test strategy suggestions |
| BP2 | Create ML test data set | L2 | Data selection, coverage analysis |
| BP3 | Test trained ML model | L2-L3 | Automated test execution |
| BP4 | Derive deployed ML model | L2 | Model conversion, optimization |
| BP5 | Test deployed ML model | L2-L3 | Automated test execution |
| BP6 | Ensure consistency and establish bidirectional traceability | L2 | Trace generation, coverage tracking |
| BP7 | Summarize and communicate results | L1 | Report generation |
ML Testing Framework
Test Categories
The following diagram categorizes ML model testing into functional, performance, robustness, and safety testing, showing the scope and purpose of each category.
Test Specification
Statistical Validation Test
```yaml
# ML Test Specification
test_specification:
  id: MLE-TEST-001
  name: Vehicle Detection Statistical Validation
  model: MLE-MODEL-001-v2.3.1
  requirement: MLE-ADAS-001

  test_dataset:
    id: MLE-DATA-002
    type: held_out_test
    samples: 150000
    never_used_in_training: true
    geographic_diversity: true

  metrics:
    primary:
      - metric: recall
        threshold: 0.999
        confidence_level: 0.95
      - metric: precision
        threshold: 0.98
        confidence_level: 0.95
      - metric: mAP_50
        threshold: 0.85
        confidence_level: 0.95
    secondary:
      - metric: f1_score
        threshold: 0.97
      - metric: inference_time_ms
        threshold: 50
        percentile: 99

  stratification:
    - by: weather
      strata: [clear, rain, fog]
      min_samples_per_stratum: 10000
    - by: lighting
      strata: [day, twilight, night]
      min_samples_per_stratum: 10000
    - by: vehicle_type
      strata: [car, truck, motorcycle, bus]
      min_samples_per_stratum: 5000

  pass_criteria:
    all_primary_metrics: true
    statistical_significance: "p < 0.05"  # Standard threshold; see statistical testing literature
    stratified_performance: "No stratum > 5% below average"
```
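The primary-metric pass criteria above pair each threshold with a 95% confidence level, meaning an observed value should only pass if its confidence bound clears the threshold. The sketch below illustrates one common way to do this for a proportion metric such as recall, using a Wilson score lower bound; the function names are illustrative and not part of the specification.

```python
"""Check a proportion metric (e.g. recall) against its threshold using a
Wilson score lower confidence bound (z = 1.96 for the 95% confidence level
in the test specification). Illustrative sketch, not a normative check."""
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom


def metric_passes(successes: int, n: int, threshold: float) -> bool:
    """Pass only if the conservative lower bound clears the threshold."""
    return wilson_lower_bound(successes, n) >= threshold


# Example: 149,900 of 150,000 vehicles detected (observed recall ~0.99933)
print(metric_passes(149_900, 150_000, threshold=0.999))  # True: lower bound ~0.9992
```

Using the lower bound rather than the point estimate makes the pass decision conservative: with the 150,000-sample test set, an observed recall only slightly above 0.999 could still fail if its interval dips below the threshold.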
Edge Case Test Specification
```yaml
# Edge Case Testing
edge_case_tests:
  id: MLE-TEST-002
  name: Vehicle Detection Edge Cases
  model: MLE-MODEL-001-v2.3.1
  scenarios:
    - id: EC-001
      name: "Partial occlusion"
      description: "Vehicle partially hidden by barrier"
      expected: "Detection with >50% visible"
      samples: 1000
    - id: EC-002
      name: "Unusual vehicle appearance"
      description: "Vehicles with unusual paint/wrap"
      expected: "Detection regardless of appearance"
      samples: 500
    - id: EC-003
      name: "Distance boundary"
      description: "Vehicles at maximum detection range"
      expected: "Detection at 100m distance"
      samples: 1000
    - id: EC-004
      name: "High-speed scenario"
      description: "Approaching vehicle at high relative speed"
      expected: "Consistent detection across frames"
      samples: 500
    - id: EC-005
      name: "Multi-vehicle cluster"
      description: "Group of vehicles close together"
      expected: "Individual detection of each vehicle"
      samples: 800
    - id: EC-006
      name: "Vehicle transition"
      description: "Vehicle entering/exiting frame"
      expected: "No false positives at boundaries"
      samples: 600
```
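Each scenario's results must ultimately be rolled up into a per-scenario verdict. The following sketch shows one way to summarize edge-case results against a minimum pass rate; the 0.95 threshold and the `summarize_edge_cases` helper are assumed project choices, not part of the specification above.

```python
"""Summarize edge-case scenario results into per-scenario verdicts.
Scenario IDs mirror the edge-case specification; the 0.95 minimum
pass rate is an assumed project choice (illustrative only)."""


def summarize_edge_cases(results: dict, min_pass_rate: float = 0.95) -> dict:
    """results maps scenario id -> (passed_samples, total_samples)."""
    summary = {}
    for scenario_id, (passed, total) in results.items():
        rate = passed / total
        summary[scenario_id] = {
            "pass_rate": round(rate, 4),
            "verdict": "PASS" if rate >= min_pass_rate else "FAIL",
        }
    return summary


# Example: hypothetical results for three of the scenarios above
results = {"EC-001": (978, 1000), "EC-002": (470, 500), "EC-003": (951, 1000)}
print(summarize_edge_cases(results))
```

A per-scenario verdict table of this shape feeds directly into the test report (WP 13-62) and keeps edge-case failures visible rather than averaged away in aggregate metrics.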
Test Results Report
The diagram below presents a structured ML model test report, summarizing pass/fail results across all test categories along with key performance metrics and confidence intervals.
Robustness Testing
Adversarial Testing
Note: the Python code below is illustrative; `compute_loss` and the `evaluate_*` helper functions require project-specific implementations.
"""
Adversarial robustness testing for ML model
"""
import torch
import numpy as np
from typing import Tuple
def fgsm_attack(model, image: torch.Tensor, label: torch.Tensor,
epsilon: float = 0.03) -> torch.Tensor:
"""Fast Gradient Sign Method attack."""
image.requires_grad = True
output = model(image)
loss = compute_loss(output, label)
model.zero_grad()
loss.backward()
# Generate adversarial example
perturbed = image + epsilon * image.grad.sign()
perturbed = torch.clamp(perturbed, 0, 1)
return perturbed
def run_robustness_tests(model, test_loader, config) -> dict:
"""Run comprehensive robustness tests."""
results = {
'clean_accuracy': 0,
'fgsm_accuracy': {},
'noise_robustness': {},
'blur_robustness': {},
'brightness_robustness': {}
}
# Clean accuracy
results['clean_accuracy'] = evaluate_accuracy(model, test_loader)
# FGSM attack at various epsilon values
for epsilon in [0.01, 0.03, 0.05, 0.1]:
perturbed_acc = evaluate_adversarial(model, test_loader, epsilon)
results['fgsm_accuracy'][epsilon] = perturbed_acc
# Gaussian noise robustness
for sigma in [0.01, 0.05, 0.1]:
noisy_acc = evaluate_with_noise(model, test_loader, sigma)
results['noise_robustness'][sigma] = noisy_acc
# Motion blur robustness
for kernel_size in [3, 5, 7]:
blur_acc = evaluate_with_blur(model, test_loader, kernel_size)
results['blur_robustness'][kernel_size] = blur_acc
# Brightness variation robustness
for factor in [0.5, 0.8, 1.2, 1.5]:
bright_acc = evaluate_with_brightness(model, test_loader, factor)
results['brightness_robustness'][factor] = bright_acc
return results
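As one possible shape for the project-specific helpers referenced above, the sketch below implements a Gaussian-noise accuracy evaluation for a classification-style model. The signature matches the `evaluate_with_noise` call above, but the body is an assumption about a hypothetical project, not a normative implementation.

```python
"""One possible implementation of the evaluate_with_noise helper
(illustrative assumption; a detection model would score IoU-matched
boxes rather than top-1 labels)."""
import torch


def evaluate_with_noise(model, test_loader, sigma: float) -> float:
    """Top-1 accuracy after adding zero-mean Gaussian noise to each input."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            noisy = torch.clamp(images + sigma * torch.randn_like(images), 0, 1)
            preds = model(noisy).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Smoke test with a toy classifier (hypothetical stand-in for the real model)
    toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4, 2))
    batch = [(torch.rand(8, 2, 2), torch.randint(0, 2, (8,)))]
    print(evaluate_with_noise(toy_model, batch, sigma=0.05))
```

The blur and brightness helpers would follow the same pattern, swapping the noise line for the corresponding image transform.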
Work Products
| WP ID | Work Product | AI Role |
|---|---|---|
| 08-62 | ML test specification | Test generation |
| 13-62 | ML test report | Result analysis |
| 13-66 | Edge case analysis | Scenario identification |
| 17-11 | Traceability record | Coverage tracking |
Summary
MLE.4 ML Model Testing:
- AI Level: L2 (automated testing, human validation)
- Primary AI Value: Test generation, result analysis
- Human Essential: Pass/fail judgment, safety assessment
- Key Outputs: Test report, coverage analysis
- Focus: Statistical validation, edge cases, robustness