# 4.5 MLE.5 ML Model Deployment

## Process Definition

### Purpose

The purpose of MLE.5 is to deploy the validated ML model to the target hardware platform.
### Outcomes
| Outcome | Description |
|---|---|
| O1 | Deployment strategy is defined |
| O2 | Model is optimized for target |
| O3 | Model is integrated with software |
| O4 | Deployment is verified |
| O5 | Runtime monitoring is established |
| O6 | Traceability is maintained |
## Deployment Pipeline

The ML model deployment pipeline proceeds from model export and optimization through target integration, on-device validation, and runtime monitoring setup.
## Model Optimization

### Optimization Techniques
| Technique | Description | Typical Accuracy Loss | Typical Speedup |
|---|---|---|---|
| INT8 Quantization | Reduce precision to 8-bit | < 1% | 2-4x |
| FP16 Quantization | Reduce precision to 16-bit | < 0.1% | 1.5-2x |
| Pruning | Remove low-weight connections | 1-3% | 1.3-2x |
| Knowledge Distillation | Train smaller model | 1-2% | 2-5x |
| Operator Fusion | Combine operations | None | 1.2-1.5x |
| Layer Optimization | Platform-specific layer tuning | None | 1.1-1.3x |
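To illustrate the first technique, the following is a minimal, pure-Python sketch of symmetric INT8 post-training quantization. The weight values are hypothetical; a production flow would use framework tooling (e.g. a TensorRT calibrator) rather than hand-rolled code.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric INT8 quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate FP32 weights from INT8 values."""
    return [v * scale for v in q]

# Hypothetical weight vector
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Quantization error is bounded by half a quantization step (scale / 2)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

The bounded per-weight error is why the table lists under 1% accuracy loss for INT8: the loss comes from rounding each value to the nearest of 255 levels, which calibration tunes to the observed activation ranges.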
### TensorRT Deployment Example

Note: This example requires the `pycuda` and `cv2` packages; the postprocessing implementation is project-specific.
"""
TensorRT deployment for automotive ECU
"""
import tensorrt as trt
import numpy as np
import cv2 # Required for preprocessing
import pycuda.driver as cuda # Required for GPU memory management
class TRTInference:
"""TensorRT inference engine for NVIDIA Jetson."""
def __init__(self, engine_path: str):
self.logger = trt.Logger(trt.Logger.WARNING)
# Load serialized engine
with open(engine_path, 'rb') as f:
runtime = trt.Runtime(self.logger)
self.engine = runtime.deserialize_cuda_engine(f.read())
self.context = self.engine.create_execution_context()
# Allocate buffers
self._allocate_buffers()
def _allocate_buffers(self):
"""Allocate GPU memory for input/output."""
import pycuda.driver as cuda
self.inputs = []
self.outputs = []
self.bindings = []
for binding in self.engine:
shape = self.engine.get_binding_shape(binding)
size = trt.volume(shape)
dtype = trt.nptype(self.engine.get_binding_dtype(binding))
# Allocate host and device memory
host_mem = cuda.pagelocked_empty(size, dtype)
device_mem = cuda.mem_alloc(host_mem.nbytes)
self.bindings.append(int(device_mem))
if self.engine.binding_is_input(binding):
self.inputs.append({'host': host_mem, 'device': device_mem})
else:
self.outputs.append({'host': host_mem, 'device': device_mem})
def infer(self, image: np.ndarray) -> dict:
"""Run inference on input image."""
import pycuda.driver as cuda
# Preprocess
preprocessed = self._preprocess(image)
# Copy to device
np.copyto(self.inputs[0]['host'], preprocessed.ravel())
cuda.memcpy_htod(self.inputs[0]['device'], self.inputs[0]['host'])
# Execute inference
self.context.execute_v2(bindings=self.bindings)
# Copy results back
cuda.memcpy_dtoh(self.outputs[0]['host'], self.outputs[0]['device'])
# Postprocess
return self._postprocess(self.outputs[0]['host'])
def _preprocess(self, image: np.ndarray) -> np.ndarray:
"""Preprocess image for model input."""
# Resize to model input size
resized = cv2.resize(image, (640, 384))
# Normalize to [0, 1]
normalized = resized.astype(np.float32) / 255.0
# Convert HWC to CHW
transposed = normalized.transpose(2, 0, 1)
# Add batch dimension
batched = np.expand_dims(transposed, axis=0)
return batched
def _postprocess(self, output: np.ndarray) -> dict:
"""Postprocess model output to detections.
Note: Implementation is project-specific based on model output format.
"""
# Parse YOLO output format
detections = []
# Project-specific postprocessing logic here
return {'detections': detections}
def convert_to_trt(onnx_path: str, engine_path: str, config: dict):
"""Convert ONNX model to TensorRT engine."""
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
# Parse ONNX
with open(onnx_path, 'rb') as f:
parser.parse(f.read())
# Build config
build_config = builder.create_builder_config()
build_config.max_workspace_size = 1 << 30 # 1GB
# Enable INT8 quantization
if config.get('int8', False):
build_config.set_flag(trt.BuilderFlag.INT8)
calibrator = Int8Calibrator(config['calibration_data'])
build_config.int8_calibrator = calibrator
# Build engine
engine = builder.build_engine(network, build_config)
# Serialize
with open(engine_path, 'wb') as f:
f.write(engine.serialize())
return engine_path
## Deployment Verification

### Deployment Test Specification
```yaml
# Deployment verification tests
deployment_tests:
  id: MLE-DEPLOY-001
  model: MLE-MODEL-001-v2.3.1
  target: "NVIDIA Jetson Orin Nano"

  functional_equivalence:
    description: "Verify deployed model matches training model"
    test_samples: 1000
    tolerance:
      output_difference: 0.01   # Max L2 difference
      detection_iou: 0.95       # Min IoU overlap

  performance_tests:
    latency:
      requirement: 50ms
      measurement: p99
      warm_up_iterations: 100
      test_iterations: 1000
    throughput:
      requirement: 30 fps
      duration: 60 seconds
    memory:
      peak_requirement: 128 MB
      steady_state_requirement: 64 MB

  stress_tests:
    continuous_operation:
      duration: 24 hours
      expected: "No memory leaks, no crashes"
    thermal_throttling:
      max_temperature: 80   # Celsius, measured at GPU die
      expected: "Graceful degradation, no failure"

  integration_tests:
    input_validation:
      invalid_size: "Reject with error code"
      null_pointer: "Reject with error code"
    output_validation:
      detection_format: "Valid bounding boxes"
      confidence_range: "[0, 1]"
```
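The functional-equivalence check above reduces to two generic computations; the following is a minimal sketch of both. The IoU and L2-difference functions are standard; the example boxes and scores are hypothetical, while the 0.95 and 0.01 thresholds come from the test specification.

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def l2_difference(ref: list, deployed: list) -> float:
    """L2 norm of the difference between reference and deployed outputs."""
    return sum((r - d) ** 2 for r, d in zip(ref, deployed)) ** 0.5

# Hypothetical reference vs. deployed detection on the same frame
ref_box = (100, 100, 200, 200)
dep_box = (102, 101, 201, 199)
assert iou(ref_box, dep_box) >= 0.95                          # detection_iou
assert l2_difference([0.91, 0.05], [0.905, 0.052]) <= 0.01    # output_difference
```

In the actual test, both checks would run over the full set of 1000 samples, pairing each reference output from the training environment with the corresponding output of the deployed engine.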
## Runtime Monitoring

### Model Performance Monitoring
```c
/**
 * @file ml_runtime_monitor.h
 * @brief ML Model Runtime Monitoring
 * @trace MLE-DEPLOY-001
 */
#ifndef ML_RUNTIME_MONITOR_H
#define ML_RUNTIME_MONITOR_H

#include "Std_Types.h"

/*===========================================================================*/
/* MONITORING STRUCTURES                                                     */
/*===========================================================================*/

/** @brief Inference statistics */
typedef struct {
    uint32 inference_count;        /**< Total inferences */
    uint32 inference_time_avg_us;  /**< Average inference time */
    uint32 inference_time_max_us;  /**< Maximum inference time */
    uint32 inference_time_min_us;  /**< Minimum inference time */
    uint32 timeout_count;          /**< Number of timeouts */
    uint32 error_count;            /**< Number of errors */
} ML_InferenceStats_t;

/** @brief Detection statistics */
typedef struct {
    uint32  total_detections;          /**< Total objects detected */
    uint32  avg_detections_per_frame;  /**< Average per frame */
    float32 avg_confidence;            /**< Average confidence */
    uint32  low_confidence_count;      /**< Detections below threshold */
} ML_DetectionStats_t;

/*===========================================================================*/
/* MONITORING API                                                            */
/*===========================================================================*/

/**
 * @brief Get inference statistics
 * @param stats Pointer to store statistics
 * @return E_OK on success
 */
Std_ReturnType ML_Monitor_GetInferenceStats(ML_InferenceStats_t* stats);

/**
 * @brief Get detection statistics
 * @param stats Pointer to store statistics
 * @return E_OK on success
 */
Std_ReturnType ML_Monitor_GetDetectionStats(ML_DetectionStats_t* stats);

/**
 * @brief Check for anomalies
 * @return TRUE if anomaly detected
 */
boolean ML_Monitor_CheckAnomaly(void);

/**
 * @brief Reset monitoring counters
 */
void ML_Monitor_Reset(void);

#endif /* ML_RUNTIME_MONITOR_H */
```
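One plausible rule behind `ML_Monitor_CheckAnomaly` is to flag an anomaly when the worst-case latency breaches the budget or the combined timeout/error rate exceeds the configured threshold. The sketch below expresses that rule in Python over the fields of `ML_InferenceStats_t`; the 50 ms latency budget and 0.001 error-rate threshold match the deployment specification, while the function itself is illustrative, not the production C implementation.

```python
def check_anomaly(stats: dict,
                  latency_threshold_us: int = 50_000,
                  error_rate_threshold: float = 0.001) -> bool:
    """Mirror of ML_Monitor_CheckAnomaly over ML_InferenceStats_t fields."""
    if stats['inference_count'] == 0:
        return False  # nothing observed yet, nothing to flag
    failures = stats['timeout_count'] + stats['error_count']
    error_rate = failures / stats['inference_count']
    # Anomalous if worst-case latency breaches the budget or the
    # combined timeout/error rate exceeds the allowed threshold
    return (stats['inference_time_max_us'] > latency_threshold_us
            or error_rate > error_rate_threshold)

healthy = {'inference_count': 10_000, 'timeout_count': 2, 'error_count': 3,
           'inference_time_max_us': 42_000}
degraded = {'inference_count': 10_000, 'timeout_count': 40, 'error_count': 0,
            'inference_time_max_us': 61_000}
assert check_anomaly(healthy) is False
assert check_anomaly(degraded) is True
```

Keeping the rule this simple lets the monitor run every frame with negligible overhead; more elaborate drift detection on the detection statistics can be layered on top offline.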
## Deployment Specification
```yaml
# Deployment specification document
deployment:
  id: MLE-DEPLOY-001
  model: MLE-MODEL-001-v2.3.1
  version: 1.0

  target_hardware:
    platform: "NVIDIA Jetson Orin Nano"
    compute: "GPU (Ampere, 1024 CUDA cores)"
    memory: "8GB LPDDR5"
    storage: "NVMe SSD"

  model_format:
    source: "models/yolov8n_adas_v2.3.1.pt"
    intermediate: "models/yolov8n_adas_v2.3.1.onnx"
    deployed: "models/yolov8n_adas_v2.3.1.engine"
    precision: INT8

  runtime_configuration:
    framework: TensorRT 8.6
    workspace_size: 1GB
    dla_enabled: false
    streams: 1

  resource_allocation:
    gpu_memory: 512MB
    system_memory: 128MB
    cpu_threads: 2

  integration:
    sw_module: "ML_Detection"
    api_header: "ml_detection_interface.h"
    callback: "ML_Detection_ResultCallback"

  monitoring:
    latency_threshold: 50ms
    error_rate_threshold: 0.001
    logging: enabled

  fallback:
    on_timeout: "Use previous detection"
    on_error: "Request driver attention"
    on_anomaly: "Log and continue"

  traceability:
    requirements:
      - MLE-ADAS-001
      - MLE-ADAS-002
    model: MLE-MODEL-001-v2.3.1
    tests:
      - MLE-TEST-001
      - MLE-TEST-002
```
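The `fallback` section of the specification can be realized as a small dispatcher in the integration layer. The sketch below assumes hypothetical action names and a `handle_inference_result` function; only the three policy cases themselves come from the specification.

```python
# Fallback actions keyed by failure mode, mirroring the spec's fallback section
FALLBACK_POLICY = {
    'timeout': 'use_previous_detection',
    'error': 'request_driver_attention',
    'anomaly': 'log_and_continue',
}

def handle_inference_result(status: str, result, previous):
    """Return (detections, fallback_action) for one inference cycle."""
    if status == 'ok':
        return result, None
    action = FALLBACK_POLICY[status]
    if action == 'use_previous_detection':
        return previous, action   # reuse the last good detection set
    if action == 'request_driver_attention':
        return None, action       # escalate: no trustworthy output
    return result, action         # log_and_continue: pass result through

# A timeout falls back to the previous frame's detections
prev = {'detections': [(10, 10, 50, 50)]}
out, action = handle_inference_result('timeout', None, prev)
assert out is prev and action == 'use_previous_detection'
```

Holding the previous detection on a timeout is only safe for a bounded number of consecutive frames; a production dispatcher would also count repeated fallbacks and escalate once the stale data exceeds its validity window.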
## Work Products
| WP ID | Work Product | AI Role |
|---|---|---|
| 04-09 | Deployment specification | Documentation |
| 11-08 | Deployed model binary | Optimization |
| 13-67 | Deployment verification report | Testing |
| 17-11 | Traceability record | Link generation |
## Summary
MLE.5 ML Model Deployment:
- AI Level: L2 (automated optimization, human verification)
- Primary AI Value: Model optimization, performance tuning
- Human Essential: Target selection, deployment approval
- Key Outputs: Optimized model, deployment verification
- Focus: Target hardware constraints, runtime monitoring