2.6: AI Agent Error Detection and Recovery
Overview
AI agents — whether generating code, analyzing requirements, creating test cases, or reviewing designs — are powerful but fallible. Unlike traditional software tools with deterministic behavior, AI agents can produce hallucinations (confident but incorrect outputs), miss edge cases, misinterpret requirements, or fail due to tool integration issues. For safety-critical ASPICE-compliant development, robust error detection and recovery mechanisms are mandatory.
This chapter establishes a comprehensive framework for:
- Error Taxonomy: Classify agent errors by type and severity
- Detection Mechanisms: Confidence scoring, output validation, human review protocols
- Recovery Strategies: Retry logic, fallback options, escalation paths, human-in-the-loop (HITL) intervention
- Monitoring and Logging: Observability for agent operations
- HITL Error Handling Protocol: When and how to escalate to human engineers
- Real Examples: Error patterns from requirements analysis, code generation, test creation
- ASPICE Alignment: Verification and validation (SWE.4, SWE.5), error management (SUP.9)
- Metrics: Error rate, recovery success rate, false positive rate
Key Principle: Humans own decisions; AI assists execution. All agent outputs must be validated before integration into safety-critical work products.
ASPICE Processes Supported:
- SWE.4 (Software Unit Verification): Detect defects in AI-generated code
- SWE.5 (Software Integration Test): Validate AI-generated test cases
- SUP.1 (Quality Assurance): Independent review of AI outputs
- SUP.9 (Problem Resolution Management): Track and resolve agent errors
Error Taxonomy
Classification Dimensions
AI agent errors can be categorized along three dimensions:
- Error Type: What went wrong?
- Severity: Impact on development process and product quality
- Detectability: How easily can the error be caught?
Error Type Classification
| Error Type | Description | Example | Detection Method |
|---|---|---|---|
| Hallucination | Agent invents information not present in input context | Generating a requirement "System shall support 5G connectivity" when none exists in specification | Traceability check (requirement → source document) |
| Omission | Agent misses critical information from input | Code generation skips error handling for null pointer | Coverage analysis (requirements → code mapping) |
| Misinterpretation | Agent correctly extracts information but assigns wrong meaning | Interpreting "brake pressure > 100 bar" as nominal value instead of fault threshold | Domain expert review, semantic validation |
| Format Violation | Output does not match required structure | Generating C++ code when C99 required, or missing Doxygen headers | Schema validation, linting, coding standards check |
| Logic Error | Code or test case is syntactically correct but logically flawed | Test case checks speed > 0 instead of speed > 5 per requirement | Unit test execution, assertion validation |
| Tool Failure | Agent fails to execute external tools (compiler, linter, test runner) | Git command fails due to network timeout | Return code checking, exception handling |
| Context Limit Exceeded | Input exceeds agent's token limit, causing truncation | Requirements document > 100K tokens → agent only processes first half | Token counting, chunking strategy validation |
| Knowledge Gap | Agent lacks domain-specific knowledge | Generating AUTOSAR-incompatible code due to unfamiliarity with standard | Static analysis with domain-specific rules (e.g., AUTOSAR checker) |
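The Context Limit row above implies a pre-flight size check before every agent invocation. A minimal sketch, assuming a rough 4-characters-per-token heuristic and an illustrative 100K-token limit; in practice, use your model's actual tokenizer and documented context window:

```python
# Illustrative limit; substitute your model's documented context window.
MAX_CONTEXT_TOKENS = 100_000

def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English prose)."""
    return len(text) // 4

def check_context_fits(document: str, max_tokens: int = MAX_CONTEXT_TOKENS):
    """Return (fits, estimate); callers should chunk the input when fits is False."""
    estimate = estimate_tokens(document)
    return estimate <= max_tokens, estimate
```

If the document does not fit, a chunking strategy (e.g. per-section splits with overlap) should be applied and validated rather than silently truncating.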
Severity Levels (ASPICE-Aligned)
| Severity | Impact | Example | Response Time |
|---|---|---|---|
| Critical | Safety impact, violates ASIL requirements, or prevents build | Generated code disables watchdog timer (ASIL-D violation) | Immediate HITL escalation, block PR merge |
| Major | Functional defect, violates ASPICE work product requirements | Missing bidirectional traceability for requirements | Fix before sprint end, manual correction |
| Minor | Quality issue, does not block progress | Inconsistent variable naming (violates style guide) | Fix in next iteration, automated cleanup |
| Cosmetic | Documentation or formatting issue | Missing code comment for non-critical function | Optional fix, nice-to-have |
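The severity levels above can be enforced as a simple routing table so the response is never left to ad-hoc judgment. A minimal sketch; the response strings are illustrative placeholders for your pipeline's real hooks (PR blocking, sprint backlog, cleanup jobs):

```python
# Maps the ASPICE-aligned severity levels to their required responses.
SEVERITY_RESPONSE = {
    "Critical": "immediate_hitl_escalation",  # also block PR merge
    "Major": "fix_before_sprint_end",
    "Minor": "fix_next_iteration",
    "Cosmetic": "optional_fix",
}

def route_error(severity: str) -> str:
    """Map a severity level to its response; unknown levels fail safe."""
    # Treat anything unrecognized as Critical rather than dropping it.
    return SEVERITY_RESPONSE.get(severity, "immediate_hitl_escalation")
```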
Detectability Assessment
| Detectability | Description | Mitigation |
|---|---|---|
| High | Error detected by automated tools (linter, compiler, unit tests) | Integrate tools into CI/CD pipeline, run on every agent output |
| Medium | Error detected by human review within 1-2 iterations | Mandatory peer review, checklist-based validation |
| Low | Error only discovered in integration testing or production | Increase test coverage, add fault injection tests, independent safety review |
Detection Mechanisms
1. Confidence Scoring (Agent Self-Assessment)
Approach: Agent reports confidence level for each output.
Implementation (Claude API):
import re

import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")
def generate_code_with_confidence(requirement_text):
"""
Generate code from requirement and assess confidence
"""
prompt = f"""
You are an embedded software engineer generating C code from requirements.
Requirement:
{requirement_text}
Generate C code implementing this requirement. Then assess your confidence:
- Confidence level (0-100%): How certain are you this code is correct?
- Assumptions: What assumptions did you make?
- Risks: What could go wrong with this implementation?
Format:
```c
// Code here
```
Confidence: X%
Assumptions: [List]
Risks: [List]
"""
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}]
)
response_text = message.content[0].text
# Parse confidence score
confidence_match = re.search(r'Confidence:\s*(\d+)%', response_text)
confidence = int(confidence_match.group(1)) if confidence_match else 50 # Default to medium
return {
"code": extract_code_block(response_text),
"confidence": confidence,
"assumptions": extract_section(response_text, "Assumptions"),
"risks": extract_section(response_text, "Risks"),
"full_response": response_text
}
# Usage example
result = generate_code_with_confidence("REQ-SWE-123: Brake pressure shall be monitored every 10ms")

if result["confidence"] < 70:
    print("[WARN] Low confidence - escalate to human review")
    escalate_to_human(result)
else:
    print("[OK] High confidence - proceed with automated validation")
    run_automated_tests(result["code"])
**Confidence Threshold Policy**:
- **< 50%**: Automatic rejection, escalate to human immediately
- **50-70%**: Require peer review before merge
- **70-90%**: Standard automated validation (linting, unit tests)
- **> 90%**: Fast-track review (post-merge validation acceptable for non-safety code)
**Limitations**:
- Agents can be overconfident (hallucinating with high confidence)
- Confidence is not calibrated consistently across different agent models
- **Mitigation**: Combine self-assessment with objective validation (see below)
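The threshold policy above lends itself to policy-as-code, so the thresholds cannot drift between teams. A minimal sketch of one possible dispatcher; the route names are illustrative, and the safety-critical flag reflects the rule that fast-tracking applies to non-safety code only:

```python
def confidence_route(confidence: int, is_safety_critical: bool = False) -> str:
    """Apply the confidence threshold policy to a reported score (0-100)."""
    if confidence < 50:
        return "reject_and_escalate"          # automatic rejection
    if confidence < 70:
        return "peer_review_required"         # human review before merge
    if confidence <= 90:
        return "automated_validation"         # linting, unit tests
    # > 90%: fast-track only for non-safety code
    return "automated_validation" if is_safety_critical else "fast_track_review"
```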
---
### 2. Output Validation (Automated Checks)
#### A. Schema Validation
**Purpose**: Ensure output matches expected structure.
**Example: Validate Generated Requirements**
```python
import json
import re
from typing import List, Optional
from pydantic import BaseModel, ValidationError, validator
class Requirement(BaseModel):
"""Schema for software requirements"""
id: str # Format: REQ-SWE-NNN
text: str # Requirement description
asil: str # ASIL level: QM, A, B, C, D
source: str # Parent system requirement ID
verification_method: str # Test, Review, Analysis
@validator('id')
def validate_id_format(cls, v):
if not re.match(r'^REQ-SWE-\d{3}$', v):
raise ValueError(f"Invalid ID format: {v} (expected REQ-SWE-NNN)")
return v
@validator('asil')
def validate_asil(cls, v):
if v not in ['QM', 'A', 'B', 'C', 'D']:
raise ValueError(f"Invalid ASIL level: {v}")
return v
# Validate agent-generated requirements
def validate_agent_requirements(agent_output: str) -> List[Requirement]:
"""Parse and validate requirements from agent output"""
requirements = []
errors = []
# Parse agent output (assuming JSON format)
try:
raw_reqs = json.loads(agent_output)
except json.JSONDecodeError as e:
raise ValueError(f"Agent output is not valid JSON: {e}")
for raw_req in raw_reqs:
try:
req = Requirement(**raw_req) # Pydantic validation
requirements.append(req)
except ValidationError as e:
errors.append({
"requirement": raw_req,
"error": str(e)
})
if errors:
# Log errors for debugging
log_validation_errors(errors)
# Escalate to human if critical fields missing
if any("id" in e["error"] or "asil" in e["error"] for e in errors):
escalate_to_human(f"Schema validation failed: {len(errors)} requirements have critical errors")
    return requirements
```
B. Static Analysis
Purpose: Detect code defects, security vulnerabilities, coding standard violations.
Tools:
- MISRA C Checker (PC-lint, LDRA, Polyspace): Automotive coding standards
- Cppcheck: Open-source static analyzer for C/C++
- SonarQube: Code quality and security analysis
- Coverity: Commercial static analysis (safety-certified)
Example: Validate AI-Generated Code
import subprocess
def static_analysis_check(code_file_path):
"""
Run static analysis on AI-generated code
Returns (pass: bool, violations: List[dict])
"""
violations = []
# Run MISRA C checker (PC-lint)
result = subprocess.run(
["lint-nt", "-w3", "misra_required.lnt", code_file_path],
capture_output=True,
text=True
)
if result.returncode != 0:
# Parse violations from lint output
for line in result.stdout.split('\n'):
if "error" in line.lower() or "warning" in line.lower():
violations.append({
"file": code_file_path,
"line": extract_line_number(line),
"rule": extract_rule_id(line),
"message": line.strip()
})
# Classification
critical_violations = [v for v in violations if "error" in v["message"].lower()]
warning_violations = [v for v in violations if "warning" in v["message"].lower()]
if critical_violations:
print(f"[FAIL] Critical violations found: {len(critical_violations)}")
print("Escalating to human review...")
escalate_to_human({
"code_file": code_file_path,
"violations": critical_violations
})
return False, violations
    elif len(warning_violations) > 10:
        print(f"[WARN] Many warnings ({len(warning_violations)}), recommend review")
        notify_reviewer(violations)

    return True, violations
C. Unit Test Execution
Purpose: Validate functional correctness of AI-generated code.
Example: Auto-Generated Test Execution
def validate_generated_code_with_tests(code_file, test_file):
"""
Compile and run unit tests for AI-generated code
"""
# Step 1: Compile code
compile_result = subprocess.run(
["gcc", "-Wall", "-Werror", "-c", code_file, "-o", "temp.o"],
capture_output=True,
text=True
)
if compile_result.returncode != 0:
print("[FAIL] Compilation failed:")
print(compile_result.stderr)
return {
"status": "FAIL",
"stage": "compilation",
"error": compile_result.stderr
}
# Step 2: Compile and link tests
test_compile = subprocess.run(
["gcc", "-Wall", "temp.o", test_file, "-o", "test_runner", "-lcheck"],
capture_output=True,
text=True
)
if test_compile.returncode != 0:
print("[FAIL] Test compilation failed:")
print(test_compile.stderr)
return {
"status": "FAIL",
"stage": "test_compilation",
"error": test_compile.stderr
}
# Step 3: Run tests
test_run = subprocess.run(
["./test_runner"],
capture_output=True,
text=True,
timeout=30 # Prevent infinite loops
)
# Parse test results (assuming Check framework output)
passed = test_run.stdout.count("PASSED")
failed = test_run.stdout.count("FAILED")
if failed > 0:
print(f"[FAIL] Tests failed: {failed}/{passed + failed}")
print(test_run.stdout)
return {
"status": "FAIL",
"stage": "test_execution",
"passed": passed,
"failed": failed,
"output": test_run.stdout
}
print(f"[OK] All tests passed: {passed}/{passed}")
return {
"status": "PASS",
"tests_run": passed
}
3. Human Review Protocols
When to Trigger Human Review:
- Agent confidence < 70%
- Static analysis finds critical violations
- Unit tests fail
- Output modifies safety-critical code (ASIL C/D)
- Traceability gaps detected (requirement → code link missing)
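The triggers above can be checked mechanically before any merge so that no reviewable output slips through. A minimal sketch, assuming the agent result has been collected into a dict with illustrative keys (`confidence`, `critical_violations`, `tests_failed`, `asil`, `traceability_gaps`):

```python
def needs_human_review(result: dict) -> list:
    """Return the list of triggered review conditions (empty list -> no review)."""
    triggers = []
    if result.get("confidence", 0) < 70:
        triggers.append("low_confidence")
    if result.get("critical_violations"):
        triggers.append("static_analysis_critical")
    if result.get("tests_failed", 0) > 0:
        triggers.append("unit_test_failure")
    if result.get("asil") in ("C", "D"):
        triggers.append("safety_critical_asil")
    if result.get("traceability_gaps"):
        triggers.append("traceability_gap")
    return triggers
```

Any non-empty result routes the output to the checklist-based review below.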
Review Checklist Template:
# AI-Generated Code Review Checklist
## Meta Information
- Agent: [Claude Sonnet 4.6 / GPT-4o / Custom Agent]
- Task: [Code generation / Test creation / Requirement analysis]
- Confidence Score: [X%]
- Auto-Validation Results: [PASS / FAIL with details]
## Functional Review
- [ ] Code implements all requirements (check traceability)
- [ ] Edge cases handled (null pointers, boundary values, timeouts)
- [ ] Error handling complete (return codes checked, exceptions caught)
- [ ] ASIL requirements met (safety mechanisms, redundancy for ASIL C/D)
## Code Quality Review
- [ ] MISRA C compliance (no critical violations)
- [ ] Naming conventions followed (project coding standard)
- [ ] Comments adequate (Doxygen headers, complex logic explained)
- [ ] Cyclomatic complexity acceptable (< 15 per function)
## Safety & Security Review (if ASIL > QM)
- [ ] Memory safety (no buffer overflows, use-after-free)
- [ ] Integer safety (no overflows, division by zero checks)
- [ ] Timing determinism (no unbounded loops, WCET analysis compatible)
- [ ] Security considerations (input validation, no hardcoded secrets)
## Traceability Review
- [ ] Bidirectional trace established (requirement ↔ code)
- [ ] Test cases linked to requirements
- [ ] Design rationale documented
## Recommendation
- [ ] Approve as-is
- [ ] Approve with minor changes (list below)
- [ ] Request major revisions (specify issues)
- [ ] Reject (escalate to architect/safety manager)
---
Reviewer: [Name]
Date: [YYYY-MM-DD]
Time Spent: [HH:MM]
Recovery Strategies
Strategy 1: Retry with Refined Prompt
Use Case: Agent misunderstands vague requirement.
Example:
def generate_code_with_retry(requirement, max_retries=3):
"""
Generate code with automatic retry on validation failure
"""
for attempt in range(max_retries):
print(f"Attempt {attempt + 1}/{max_retries}")
# Generate code
result = agent.generate_code(requirement)
# Validate
validation = validate_generated_code(result["code"])
if validation["status"] == "PASS":
return result
else:
# Refine prompt with error feedback
error_context = f"""
Previous attempt failed validation:
{validation['error']}
Common mistakes to avoid:
- Ensure all edge cases are handled (null pointers, buffer boundaries)
- Follow MISRA C rules (avoid pointer arithmetic, use explicit casts)
- Add Doxygen comments for all functions
Please regenerate the code addressing these issues.
"""
requirement = requirement + "\n\n" + error_context
# All retries exhausted
print("[FAIL] Max retries exceeded - escalating to human")
escalate_to_human({
"requirement": requirement,
"attempts": max_retries,
"last_error": validation["error"]
})
return None
Strategy 2: Fallback to Conservative Implementation
Use Case: Agent struggles with complex requirement → generate simpler, safer version.
Example:
def generate_code_with_fallback(requirement, asil_level):
"""
If agent fails to generate optimized code, fall back to conservative implementation
"""
# Attempt 1: Optimized code
prompt_optimized = f"""
Generate highly optimized C code for: {requirement}
Minimize CPU cycles and memory usage.
"""
result = agent.generate_code(prompt_optimized)
if validate_generated_code(result["code"])["status"] == "PASS":
return result
# Fallback: Conservative, safety-focused code
print("[WARN] Optimized generation failed, falling back to conservative approach")
prompt_conservative = f"""
Generate C code for: {requirement}
CRITICAL SAFETY REQUIREMENTS:
- ASIL {asil_level}: Prioritize correctness over performance
- Use defensive programming (check all inputs, validate ranges)
- Avoid pointer arithmetic (use array indexing)
- Add assertions for all preconditions
- Use static memory allocation (no malloc)
- Implement timeout for all loops
"""
result_conservative = agent.generate_code(prompt_conservative)
if validate_generated_code(result_conservative["code"])["status"] == "PASS":
return result_conservative
else:
escalate_to_human("Both optimized and conservative code generation failed")
return None
Strategy 3: Partial Acceptance with Manual Completion
Use Case: Agent generates 80% correct code, but specific section needs human expertise.
Example:
def partial_acceptance_workflow(requirement):
"""
Accept AI-generated code skeleton, flag sections needing human input
"""
result = agent.generate_code(requirement)
# Parse code for uncertainty markers
uncertain_sections = extract_todo_comments(result["code"])
if uncertain_sections:
print(f"[WARN] Agent flagged {len(uncertain_sections)} sections for human completion:")
for section in uncertain_sections:
print(f" - Line {section['line']}: {section['comment']}")
# Create task for engineer
create_jira_task(
title=f"Complete AI-generated code for {requirement['id']}",
description=f"""
Agent generated partial implementation but needs human expertise for:
{format_uncertain_sections(uncertain_sections)}
Code file: {result['file_path']}
Requirement: {requirement['text']}
""",
assignee=get_domain_expert(requirement['module']),
priority="High"
)
return {
"status": "PARTIAL",
"code": result["code"],
"pending_tasks": uncertain_sections
}
else:
return {
"status": "COMPLETE",
"code": result["code"]
}
# Example agent output with uncertainty markers
"""
void brake_pressure_monitor(void) {
// TODO(AI): Verify correct sensor scaling factor (assumed 0.1 bar/bit)
float pressure_bar = read_adc_sensor() * 0.1;
// TODO(AI-HUMAN): Confirm threshold with safety engineer (currently 100 bar from spec)
if (pressure_bar > 100.0) {
trigger_fault_handler(FAULT_OVERPRESSURE);
}
}
"""
Strategy 4: Human-in-the-Loop Escalation
Use Case: Agent error is critical or unrecoverable.
Escalation Triggers:
- Safety violation (ASIL C/D code modifies safety mechanism)
- Repeated validation failures (3+ retry attempts)
- Traceability break (cannot link generated code to requirement)
- Tool crash (agent cannot execute compiler/linter)
Escalation Workflow:
def escalate_to_human(error_context):
"""
Notify human engineer and pause agent workflow
"""
# Log error
logger.error(f"Agent error requiring human intervention: {error_context}")
# Create incident ticket
incident_id = create_jira_incident(
summary=f"AI Agent Error: {error_context['type']}",
description=f"""
**Error Type**: {error_context['type']}
**Severity**: {error_context['severity']}
**Agent Task**: {error_context['task']}
**Failure Details**:
{error_context['details']}
**Recommended Actions**:
{error_context['recommendations']}
**Context**:
- Requirement ID: {error_context.get('requirement_id', 'N/A')}
- Code File: {error_context.get('file_path', 'N/A')}
- Agent Confidence: {error_context.get('confidence', 'N/A')}%
""",
priority="Critical" if error_context['severity'] == "Critical" else "High",
labels=["ai-agent-error", "hitl-required"]
)
# Notify on-call engineer
send_slack_alert(
channel="#ai-agents-escalation",
message=f"[ESCALATION] Agent escalation: {incident_id}\nRequires human review within 2 hours.",
mention=get_oncall_engineer()
)
# Pause agent workflow
return {
"status": "ESCALATED",
"incident_id": incident_id,
"awaiting_human": True
}
Monitoring and Logging
Observability Requirements
Key Metrics:
- Error Rate: Errors per 100 agent invocations
- Recovery Success Rate: % of errors resolved by automated retry/fallback
- Mean Time to Recovery (MTTR): Time from error detection to resolution
- False Positive Rate: % of escalations that were not actual errors
- Human Intervention Rate: % of tasks requiring HITL
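Given structured events like those emitted by the logging framework that follows, three of these metrics can be computed directly. A minimal sketch, assuming events arrive as a list of dicts with an `event` field matching the logger's event names:

```python
def compute_agent_metrics(events: list) -> dict:
    """Compute error rate, recovery success rate, and HITL rate from log events."""
    starts = sum(1 for e in events if e["event"] == "task_start")
    errors = sum(1 for e in events if e["event"] == "task_error")
    recoveries = [e for e in events if e["event"] == "recovery_attempt"]
    escalations = sum(1 for e in events if e["event"] == "human_escalation")
    recovered = sum(1 for e in recoveries if e.get("success"))
    return {
        "error_rate_pct": 100.0 * errors / starts if starts else 0.0,
        "recovery_success_rate_pct": (
            100.0 * recovered / len(recoveries) if recoveries else 0.0
        ),
        "human_intervention_rate_pct": (
            100.0 * escalations / starts if starts else 0.0
        ),
    }
```

MTTR and false positive rate additionally require timestamps and escalation outcomes, which the structured logs below also carry.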
Logging Framework:
import logging
from datetime import datetime
class AgentLogger:
"""Structured logging for AI agent operations"""
def __init__(self, agent_name):
self.agent_name = agent_name
self.logger = logging.getLogger(f"agent.{agent_name}")
def log_task_start(self, task_id, task_type, input_data):
"""Log agent task initiation"""
self.logger.info({
"event": "task_start",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"task_type": task_type,
"input_size": len(str(input_data))
})
def log_task_complete(self, task_id, output_data, confidence, duration_ms):
"""Log successful task completion"""
self.logger.info({
"event": "task_complete",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"output_size": len(str(output_data)),
"confidence": confidence,
"duration_ms": duration_ms
})
def log_error(self, task_id, error_type, error_details, severity):
"""Log agent error"""
self.logger.error({
"event": "task_error",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"error_type": error_type,
"severity": severity,
"details": error_details
})
def log_recovery(self, task_id, recovery_strategy, success):
"""Log recovery attempt"""
self.logger.warning({
"event": "recovery_attempt",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"strategy": recovery_strategy,
"success": success
})
def log_escalation(self, task_id, escalation_reason, incident_id):
"""Log human escalation"""
self.logger.critical({
"event": "human_escalation",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"reason": escalation_reason,
"incident_id": incident_id
})
# Usage example
logger = AgentLogger("code_generator")
task_id = "TASK-12345"
logger.log_task_start(task_id, "code_generation", requirement)
try:
result = generate_code(requirement)
logger.log_task_complete(task_id, result["code"], result["confidence"], duration_ms=1234)
except ValidationError as e:
logger.log_error(task_id, "validation_failure", str(e), "Major")
logger.log_recovery(task_id, "retry_with_refined_prompt", success=False)
incident = escalate_to_human(error_context)
logger.log_escalation(task_id, "repeated_validation_failure", incident["incident_id"])
Metrics Dashboard
Example: Grafana Dashboard Panels
- Error Rate Over Time (Line chart)
  - Query: count(log_event == "task_error") / count(log_event == "task_start") * 100
  - Alert: Error rate > 10% for 1 hour
- Recovery Success Rate (Gauge)
  - Query: count(recovery_success == true) / count(recovery_attempt) * 100
  - Target: > 80%
- Top Error Types (Bar chart)
  - Query: count(log_event == "task_error") group by error_type
- Mean Time to Recovery (Single stat)
  - Query: avg(escalation_timestamp - error_timestamp)
  - Target: < 2 hours
- Human Intervention Rate (Pie chart)
  - Query: count(log_event == "human_escalation") / count(log_event == "task_start") * 100
  - Target: < 5%
Real-World Error Examples
Example 1: Code Generation Hallucination
Scenario: Agent generates code referencing non-existent API.
Requirement:
REQ-SWE-234: Motor controller shall read encoder position every 1ms
AI-Generated Code (Incorrect):
// HALLUCINATION: read_encoder_position() does not exist in HAL
void motor_control_task(void) {
int32_t position = read_encoder_position(); // [FAIL] Undefined function
update_pid_controller(position);
}
Detection:
- Compilation Error: undefined reference to 'read_encoder_position'
- Static Analysis: Function not declared in any header
Recovery:
- Retry with Context: Provide HAL API documentation to agent
prompt_retry = f"""
{requirement}
Available HAL APIs:
- HAL_Encoder_Init(encoder_id)
- HAL_Encoder_GetPosition(encoder_id) -> returns int32_t
- HAL_Encoder_GetSpeed(encoder_id) -> returns int32_t
Generate code using ONLY these APIs.
"""
- Validation: Verify function calls against HAL header file
def validate_function_calls(code, allowed_apis):
"""Check that all function calls are in allowed API list"""
function_calls = extract_function_calls(code) # Regex or AST parsing
invalid_calls = [f for f in function_calls if f not in allowed_apis]
if invalid_calls:
return {
"status": "FAIL",
"error": f"Undefined functions: {invalid_calls}"
}
return {"status": "PASS"}
Example 2: Test Case Omission
Scenario: Agent generates test cases but misses critical edge case.
Requirement:
REQ-SWE-456: Door lock shall remain engaged if speed sensor fails while vehicle is moving
AI-Generated Tests (Incomplete):
// Test 1: Normal operation
void test_door_lock_normal(void) {
set_speed(50); // 50 km/h
door_lock_control();
assert(get_lock_state() == LOCKED);
}
// [FAIL] MISSING: Test for sensor fault during motion
Detection:
- Coverage Analysis: Requirement REQ-SWE-456 not covered by any test case
- Traceability Check: No test linked to safety requirement
Recovery:
- Gap Analysis: Identify missing test scenarios
def analyze_test_coverage(requirements, test_cases):
"""Map requirements to test cases, flag gaps"""
coverage_map = {}
for req in requirements:
# Extract test cases that reference this requirement
linked_tests = [t for t in test_cases if req["id"] in t["requirement_refs"]]
if not linked_tests:
coverage_map[req["id"]] = {
"status": "NOT_COVERED",
"tests": []
}
else:
coverage_map[req["id"]] = {
"status": "COVERED",
"tests": [t["id"] for t in linked_tests]
}
# Report gaps
gaps = [req_id for req_id, cov in coverage_map.items() if cov["status"] == "NOT_COVERED"]
return gaps
# Generate missing tests
gaps = analyze_test_coverage(requirements, generated_tests)
for req_id in gaps:
print(f"[WARN] No test coverage for {req_id}, regenerating...")
additional_test = agent.generate_test_case(get_requirement(req_id))
generated_tests.append(additional_test)
Example 3: Requirements Misinterpretation
Scenario: Agent interprets ambiguous requirement incorrectly.
Requirement (Ambiguous):
REQ-SYS-789: System shall respond to brake pedal input quickly
AI Interpretation:
Software Requirement: Brake control loop shall execute every 10ms
Correct Interpretation (From safety engineer):
Software Requirement: Brake control loop shall execute every 1ms (latency < 5ms per ISO 26262 brake-by-wire requirements)
Detection:
- Domain Expert Review: Safety engineer flags incorrect timing requirement
- Semantic Validation: Cross-check against ISO 26262 guidelines (automated checker)
Recovery:
- Clarification Request: Agent asks for missing details
prompt_clarify = """
Requirement: "System shall respond to brake pedal input quickly"
This requirement is ambiguous. Please clarify:
1. What is the acceptable latency? (e.g., < 5ms, < 50ms)
2. What is the execution frequency? (e.g., 1kHz, 100Hz, 10Hz)
3. What is the ASIL level? (determines criticality)
4. Are there regulatory constraints? (ISO 26262, UN R13H)
If information is not available, I will flag this requirement for human review.
"""
- Escalation to Requirement Owner: Create Jira task for requirement clarification
ASPICE Alignment
SWE.4 (Software Unit Verification)
ASPICE Requirement: Verify that software units meet requirements (BP1, BP2, BP3)
Agent Error Detection Integration:
- BP1: Develop unit test cases → AI agent generates tests, human reviews for completeness
- BP2: Test software units → Automated execution of AI-generated tests, coverage analysis
- BP3: Achieve consistency → Traceability checks (requirement ↔ code ↔ test)
Work Products:
- Unit test report (13-04): Include AI agent test generation metadata (confidence, validation status)
- Test coverage report: Flag gaps detected by agent error analysis
SUP.9 (Problem Resolution Management)
ASPICE Requirement: Track and resolve problems/defects (BP1-BP5)
Agent Error as "Problem":
- BP1: Define problem management strategy → Treat agent errors as defects, assign severity
- BP2: Record problems → Log all agent errors in Jira/ALM tool
- BP3: Implement corrective actions → Retry, fallback, or escalate per recovery strategy
- BP4: Track problems to closure → Monitor incident resolution
- BP5: Analyze trends → Monthly review of agent error metrics (top error types, recovery success rate)
Example: Jira Workflow for Agent Errors
Agent Error Detected
↓
[Create Jira Issue: "AI Agent Error"]
↓
Assign to: AI Agent Team Lead
↓
Priority: Based on severity (Critical/Major/Minor)
↓
Resolution Actions:
- Update agent prompt template (prevent recurrence)
- Enhance validation rules
- Improve training data
- Document in agent error knowledge base
↓
Close Issue (track resolution time)
Metrics and KPIs
Error Detection Metrics
| Metric | Definition | Target | Measurement Frequency |
|---|---|---|---|
| Error Rate | (Agent errors / Total tasks) × 100 | < 5% | Daily |
| Critical Error Rate | (Critical errors / Total errors) × 100 | < 10% | Weekly |
| Detection Latency | Time from error occurrence to detection | < 5 minutes | Per incident |
| False Positive Rate | (False escalations / Total escalations) × 100 | < 15% | Monthly |
Recovery Metrics
| Metric | Definition | Target | Measurement Frequency |
|---|---|---|---|
| Recovery Success Rate | (Successful auto-recoveries / Total errors) × 100 | > 80% | Weekly |
| Mean Time to Recovery (MTTR) | Avg time from error detection to resolution | < 2 hours | Monthly |
| Retry Effectiveness | (Errors fixed by retry / Retry attempts) × 100 | > 60% | Monthly |
| Human Intervention Rate | (HITL escalations / Total tasks) × 100 | < 5% | Weekly |
Quality Impact Metrics
| Metric | Definition | Target | Measurement Frequency |
|---|---|---|---|
| Agent-Introduced Defects | Defects found in agent output during QA | < 2% of total defects | Sprint retrospective |
| Requirement Traceability Gap Rate | (Missing traces / Total requirements) × 100 | 0% | Before release |
| Test Coverage Gap Rate | (Uncovered requirements / Total requirements) × 100 | < 5% | Sprint |
Best Practices and Lessons Learned
What Works
- Confidence scoring + automated validation: Combine agent self-assessment with objective checks
- Progressive validation: Fast checks first (schema, linting), expensive checks later (compilation, unit tests)
- Graceful degradation: Fallback to conservative implementations when optimization fails
- Traceability-first: Always validate requirement ↔ code ↔ test linkage
- Transparent logging: Structured logs enable root cause analysis and process improvement
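The "progressive validation" practice above — run cheap checks first, expensive ones later — can be expressed as an ordered check pipeline that short-circuits on the first failure. A minimal sketch; the check functions in the usage example are illustrative stand-ins for real schema, lint, compile, and unit-test steps:

```python
def progressive_validate(output: str, checks: list) -> dict:
    """Run (name, check) pairs cheapest-first; stop at the first failure.

    Each check takes the output string and returns (ok: bool, detail: str).
    """
    for name, check in checks:
        ok, detail = check(output)
        if not ok:
            return {"status": "FAIL", "failed_check": name, "detail": detail}
    return {"status": "PASS", "failed_check": None, "detail": None}

# Illustrative ordering: schema validation before a length-based lint stand-in.
checks = [
    ("schema", lambda s: (s.startswith("{"), "not JSON-like")),
    ("length", lambda s: (len(s) < 100, "too long")),
]
```

Ordering checks by cost means a malformed output never reaches the compiler or test runner, keeping feedback loops fast.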
Common Pitfalls
- Over-trusting high confidence: Agents can hallucinate with 95%+ confidence — always validate
- Ignoring partial failures: Accepting 80% correct code without flagging gaps leads to integration bugs
- Alert fatigue: Too many low-severity escalations cause engineers to ignore critical alerts
- Insufficient context: Agent errors are often caused by missing domain knowledge — provide comprehensive context
- No feedback loop: Not tracking error patterns allows the same mistakes to recur
Conclusion and Recommendations
Key Takeaways
- AI agents are assistants, not oracles: Always validate outputs before integration
- Layered validation: Combine confidence scoring, automated checks, and human review
- Fast feedback loops: Detect errors immediately (CI/CD integration)
- HITL is mandatory for safety-critical code: ASIL C/D requires human sign-off
- Continuous improvement: Track error patterns, refine prompts, enhance validation rules
Implementation Roadmap
Phase 1 (Month 1): Basic error detection
- Implement schema validation for agent outputs
- Set up structured logging (AgentLogger class)
- Define escalation workflow (Jira integration)
Phase 2 (Months 2-3): Automated validation
- Integrate static analysis (MISRA C checker)
- Add unit test auto-execution
- Implement confidence threshold policies
Phase 3 (Months 4-6): Recovery automation
- Develop retry with refined prompt logic
- Build fallback strategies (conservative code generation)
- Create metrics dashboard (Grafana)
Phase 4 (Ongoing): Optimization
- Analyze error trends, update prompt templates
- Reduce false positive rate (tune validation rules)
- Improve recovery success rate (better fallback heuristics)
Next Chapter: Chapter 31 - Workflow Instructions (Git, Pull Requests, Testing, Releases)
References
- VDA: Automotive SPICE PAM 4.0 (2023) - SUP.1, SUP.9, SWE.4, SWE.5
- ISO 26262-6:2018: Software verification and validation requirements
- Anthropic: "Claude Model Documentation" (2025) - Confidence calibration, prompt engineering
- OpenAI: "GPT-4 System Card" (2023) - Known limitations and mitigation strategies
- MISRA: "MISRA C:2012 Guidelines for the use of the C language in critical systems"
- Lewis, Chris et al.: "Measuring Calibration in Neural Networks" (NeurIPS 2019)
- Amodei, Dario et al.: "Concrete Problems in AI Safety" (2016) - Safe exploration, robustness to distributional shift