2.6: AI Agent Error Detection and Recovery
Overview
AI agents — whether generating code, analyzing requirements, creating test cases, or reviewing designs — are powerful but fallible. Unlike traditional software tools with deterministic behavior, AI agents can produce hallucinations (confident but incorrect outputs), miss edge cases, misinterpret requirements, or fail due to tool integration issues. For safety-critical ASPICE-compliant development, robust error detection and recovery mechanisms are mandatory.
This chapter establishes a comprehensive framework for:
- Error Taxonomy: Classify agent errors by type and severity
- Detection Mechanisms: Confidence scoring, output validation, human review protocols
- Recovery Strategies: Retry logic, fallback options, escalation paths, human-in-the-loop (HITL) intervention
- Monitoring and Logging: Observability for agent operations
- HITL Error Handling Protocol: When and how to escalate to human engineers
- Real Examples: Error patterns from requirements analysis, code generation, test creation
- ASPICE Alignment: Verification and validation (SWE.4, SWE.5), error management (SUP.9)
- Metrics: Error rate, recovery success rate, false positive rate
Key Principle: Humans own decisions; AI assists execution. All agent outputs must be validated before integration into safety-critical work products.
ASPICE Processes Supported:
- SWE.4 (Software Unit Verification): Detect defects in AI-generated code
- SWE.5 (Software Integration Test): Validate AI-generated test cases
- SUP.1 (Quality Assurance): Independent review of AI outputs
- SUP.9 (Problem Resolution Management): Track and resolve agent errors
Error Taxonomy
Classification Dimensions
AI agent errors can be categorized along three dimensions:
- Error Type: What went wrong?
- Severity: Impact on development process and product quality
- Detectability: How easily can the error be caught?
Error Type Classification
| Error Type | Description | Example | Detection Method |
|---|---|---|---|
| Hallucination | Agent invents information not present in input context | Generating a requirement "System shall support 5G connectivity" when none exists in specification | Traceability check (requirement → source document) |
| Omission | Agent misses critical information from input | Code generation skips error handling for null pointer | Coverage analysis (requirements → code mapping) |
| Misinterpretation | Agent correctly extracts information but assigns wrong meaning | Interpreting "brake pressure > 100 bar" as nominal value instead of fault threshold | Domain expert review, semantic validation |
| Format Violation | Output does not match required structure | Generating C++ code when C99 required, or missing Doxygen headers | Schema validation, linting, coding standards check |
| Logic Error | Code or test case is syntactically correct but logically flawed | Test case checks speed > 0 instead of speed > 5 per requirement | Unit test execution, assertion validation |
| Tool Failure | Agent fails to execute external tools (compiler, linter, test runner) | Git command fails due to network timeout | Return code checking, exception handling |
| Context Limit Exceeded | Input exceeds agent's token limit, causing truncation | Requirements document > 100K tokens → agent only processes first half | Token counting, chunking strategy validation |
| Knowledge Gap | Agent lacks domain-specific knowledge | Generating AUTOSAR-incompatible code due to unfamiliarity with standard | Static analysis with domain-specific rules (e.g., AUTOSAR checker) |
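The Context Limit row above implies a pre-flight size check before every agent invocation. A minimal sketch, assuming a rough 4-characters-per-token heuristic and an illustrative 100K-token limit; in practice, use your model's actual tokenizer and documented context window:

```python
# Illustrative limit; substitute your model's documented context window.
MAX_CONTEXT_TOKENS = 100_000

def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English prose)."""
    return len(text) // 4

def check_context_fits(document: str, max_tokens: int = MAX_CONTEXT_TOKENS):
    """Return (fits, estimate); callers should chunk the input when fits is False."""
    estimate = estimate_tokens(document)
    return estimate <= max_tokens, estimate
```

If the document does not fit, a chunking strategy (e.g. per-section splits with overlap) should be applied and validated rather than silently truncating.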
Severity Levels (ASPICE-Aligned)
| Severity | Impact | Example | Response Time |
|---|---|---|---|
| Critical | Safety impact, violates ASIL requirements, or prevents build | Generated code disables watchdog timer (ASIL-D violation) | Immediate HITL escalation, block PR merge |
| Major | Functional defect, violates ASPICE work product requirements | Missing bidirectional traceability for requirements | Fix before sprint end, manual correction |
| Minor | Quality issue, does not block progress | Inconsistent variable naming (violates style guide) | Fix in next iteration, automated cleanup |
| Cosmetic | Documentation or formatting issue | Missing code comment for non-critical function | Optional fix, nice-to-have |
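The severity levels above can be enforced as a simple routing table so the response is never left to ad-hoc judgment. A minimal sketch; the response strings are illustrative placeholders for your pipeline's real hooks (PR blocking, sprint backlog, cleanup jobs):

```python
# Maps the ASPICE-aligned severity levels to their required responses.
SEVERITY_RESPONSE = {
    "Critical": "immediate_hitl_escalation",  # also block PR merge
    "Major": "fix_before_sprint_end",
    "Minor": "fix_next_iteration",
    "Cosmetic": "optional_fix",
}

def route_error(severity: str) -> str:
    """Map a severity level to its response; unknown levels fail safe."""
    # Treat anything unrecognized as Critical rather than dropping it.
    return SEVERITY_RESPONSE.get(severity, "immediate_hitl_escalation")
```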
Detectability Assessment
| Detectability | Description | Mitigation |
|---|---|---|
| High | Error detected by automated tools (linter, compiler, unit tests) | Integrate tools into CI/CD pipeline, run on every agent output |
| Medium | Error detected by human review within 1-2 iterations | Mandatory peer review, checklist-based validation |
| Low | Error only discovered in integration testing or production | Increase test coverage, add fault injection tests, independent safety review |
Detection Mechanisms
1. Confidence Scoring (Agent Self-Assessment)
Approach: Agent reports confidence level for each output.
Implementation (Claude API):
import re

import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")
def generate_code_with_confidence(requirement_text):
"""
Generate code from requirement and assess confidence
"""
prompt = f"""
You are an embedded software engineer generating C code from requirements.
Requirement:
{requirement_text}
Generate C code implementing this requirement. Then assess your confidence:
- Confidence level (0-100%): How certain are you this code is correct?
- Assumptions: What assumptions did you make?
- Risks: What could go wrong with this implementation?
Format:
```c
// Code here
```
Confidence: X%
Assumptions: [List]
Risks: [List]
"""
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}]
)
response_text = message.content[0].text
# Parse confidence score
confidence_match = re.search(r'Confidence:\s*(\d+)%', response_text)
confidence = int(confidence_match.group(1)) if confidence_match else 50 # Default to medium
return {
"code": extract_code_block(response_text),
"confidence": confidence,
"assumptions": extract_section(response_text, "Assumptions"),
"risks": extract_section(response_text, "Risks"),
"full_response": response_text
}
# Usage example
result = generate_code_with_confidence("REQ-SWE-123: Brake pressure shall be monitored every 10ms")

if result["confidence"] < 70:
    print("[WARN] Low confidence - escalate to human review")
    escalate_to_human(result)
else:
    print("[OK] High confidence - proceed with automated validation")
    run_automated_tests(result["code"])
**Confidence Threshold Policy**:
- **< 50%**: Automatic rejection, escalate to human immediately
- **50-70%**: Require peer review before merge
- **70-90%**: Standard automated validation (linting, unit tests)
- **> 90%**: Fast-track review (post-merge validation acceptable for non-safety code)
**Limitations**:
- Agents can be overconfident (hallucinating with high confidence)
- Confidence is not calibrated consistently across different agent models
- **Mitigation**: Combine self-assessment with objective validation (see below)
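The threshold policy above lends itself to policy-as-code, so the thresholds cannot drift between teams. A minimal sketch of one possible dispatcher; the route names are illustrative, and the safety-critical flag reflects the rule that fast-tracking applies to non-safety code only:

```python
def confidence_route(confidence: int, is_safety_critical: bool = False) -> str:
    """Apply the confidence threshold policy to a reported score (0-100)."""
    if confidence < 50:
        return "reject_and_escalate"          # automatic rejection
    if confidence < 70:
        return "peer_review_required"         # human review before merge
    if confidence <= 90:
        return "automated_validation"         # linting, unit tests
    # > 90%: fast-track only for non-safety code
    return "automated_validation" if is_safety_critical else "fast_track_review"
```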
---
### 2. Output Validation (Automated Checks)
#### A. Schema Validation
**Purpose**: Ensure output matches expected structure.
**Example: Validate Generated Requirements**
```python
import json
import re
from typing import List, Optional
from pydantic import BaseModel, ValidationError, validator
class Requirement(BaseModel):
"""Schema for software requirements"""
id: str # Format: REQ-SWE-NNN
text: str # Requirement description
asil: str # ASIL level: QM, A, B, C, D
source: str # Parent system requirement ID
verification_method: str # Test, Review, Analysis
@validator('id')
def validate_id_format(cls, v):
if not re.match(r'^REQ-SWE-\d{3}$', v):
raise ValueError(f"Invalid ID format: {v} (expected REQ-SWE-NNN)")
return v
@validator('asil')
def validate_asil(cls, v):
if v not in ['QM', 'A', 'B', 'C', 'D']:
raise ValueError(f"Invalid ASIL level: {v}")
return v
# Validate agent-generated requirements
def validate_agent_requirements(agent_output: str) -> List[Requirement]:
"""Parse and validate requirements from agent output"""
requirements = []
errors = []
# Parse agent output (assuming JSON format)
try:
raw_reqs = json.loads(agent_output)
except json.JSONDecodeError as e:
raise ValueError(f"Agent output is not valid JSON: {e}")
for raw_req in raw_reqs:
try:
req = Requirement(**raw_req) # Pydantic validation
requirements.append(req)
except ValidationError as e:
errors.append({
"requirement": raw_req,
"error": str(e)
})
if errors:
# Log errors for debugging
log_validation_errors(errors)
# Escalate to human if critical fields missing
if any("id" in e["error"] or "asil" in e["error"] for e in errors):
escalate_to_human(f"Schema validation failed: {len(errors)} requirements have critical errors")
    return requirements
```
B. Static Analysis
Purpose: Detect code defects, security vulnerabilities, coding standard violations.
Tools:
- MISRA C Checker (PC-lint, LDRA, Polyspace): Automotive coding standards
- Cppcheck: Open-source static analyzer for C/C++
- SonarQube: Code quality and security analysis
- Coverity: Commercial static analysis (safety-certified)
Example: Validate AI-Generated Code
import subprocess
def static_analysis_check(code_file_path):
"""
Run static analysis on AI-generated code
Returns (pass: bool, violations: List[dict])
"""
violations = []
# Run MISRA C checker (PC-lint)
result = subprocess.run(
["lint-nt", "-w3", "misra_required.lnt", code_file_path],
capture_output=True,
text=True
)
if result.returncode != 0:
# Parse violations from lint output
for line in result.stdout.split('\n'):
if "error" in line.lower() or "warning" in line.lower():
violations.append({
"file": code_file_path,
"line": extract_line_number(line),
"rule": extract_rule_id(line),
"message": line.strip()
})
# Classification
critical_violations = [v for v in violations if "error" in v["message"].lower()]
warning_violations = [v for v in violations if "warning" in v["message"].lower()]
if critical_violations:
print(f"[FAIL] Critical violations found: {len(critical_violations)}")
print("Escalating to human review...")
escalate_to_human({
"code_file": code_file_path,
"violations": critical_violations
})
return False, violations
    elif len(warning_violations) > 10:
        print(f"[WARN] Many warnings ({len(warning_violations)}), recommend review")
        notify_reviewer(violations)

    return True, violations
C. Unit Test Execution
Purpose: Validate functional correctness of AI-generated code.
Example: Auto-Generated Test Execution
def validate_generated_code_with_tests(code_file, test_file):
"""
Compile and run unit tests for AI-generated code
"""
# Step 1: Compile code
compile_result = subprocess.run(
["gcc", "-Wall", "-Werror", "-c", code_file, "-o", "temp.o"],
capture_output=True,
text=True
)
if compile_result.returncode != 0:
print("[FAIL] Compilation failed:")
print(compile_result.stderr)
return {
"status": "FAIL",
"stage": "compilation",
"error": compile_result.stderr
}
# Step 2: Compile and link tests
test_compile = subprocess.run(
["gcc", "-Wall", "temp.o", test_file, "-o", "test_runner", "-lcheck"],
capture_output=True,
text=True
)
if test_compile.returncode != 0:
print("[FAIL] Test compilation failed:")
print(test_compile.stderr)
return {
"status": "FAIL",
"stage": "test_compilation",
"error": test_compile.stderr
}
# Step 3: Run tests
test_run = subprocess.run(
["./test_runner"],
capture_output=True,
text=True,
timeout=30 # Prevent infinite loops
)
# Parse test results (assuming Check framework output)
passed = test_run.stdout.count("PASSED")
failed = test_run.stdout.count("FAILED")
if failed > 0:
print(f"[FAIL] Tests failed: {failed}/{passed + failed}")
print(test_run.stdout)
return {
"status": "FAIL",
"stage": "test_execution",
"passed": passed,
"failed": failed,
"output": test_run.stdout
}
print(f"[OK] All tests passed: {passed}/{passed}")
return {
"status": "PASS",
"tests_run": passed
}
3. Human Review Protocols
When to Trigger Human Review:
- Agent confidence < 70%
- Static analysis finds critical violations
- Unit tests fail
- Output modifies safety-critical code (ASIL C/D)
- Traceability gaps detected (requirement → code link missing)
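The triggers above can be checked mechanically before any merge so that no reviewable output slips through. A minimal sketch, assuming the agent result has been collected into a dict with illustrative keys (`confidence`, `critical_violations`, `tests_failed`, `asil`, `traceability_gaps`):

```python
def needs_human_review(result: dict) -> list:
    """Return the list of triggered review conditions (empty list -> no review)."""
    triggers = []
    if result.get("confidence", 0) < 70:
        triggers.append("low_confidence")
    if result.get("critical_violations"):
        triggers.append("static_analysis_critical")
    if result.get("tests_failed", 0) > 0:
        triggers.append("unit_test_failure")
    if result.get("asil") in ("C", "D"):
        triggers.append("safety_critical_asil")
    if result.get("traceability_gaps"):
        triggers.append("traceability_gap")
    return triggers
```

Any non-empty result routes the output to the checklist-based review below.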
Review Checklist Template:
# AI-Generated Code Review Checklist
## Meta Information
- Agent: [Claude Sonnet 4.6 / GPT-4o / Custom Agent]
- Task: [Code generation / Test creation / Requirement analysis]
- Confidence Score: [X%]
- Auto-Validation Results: [PASS / FAIL with details]
## Functional Review
- [ ] Code implements all requirements (check traceability)
- [ ] Edge cases handled (null pointers, boundary values, timeouts)
- [ ] Error handling complete (return codes checked, exceptions caught)
- [ ] ASIL requirements met (safety mechanisms, redundancy for ASIL C/D)
## Code Quality Review
- [ ] MISRA C compliance (no critical violations)
- [ ] Naming conventions followed (project coding standard)
- [ ] Comments adequate (Doxygen headers, complex logic explained)
- [ ] Cyclomatic complexity acceptable (< 15 per function)
## Safety & Security Review (if ASIL > QM)
- [ ] Memory safety (no buffer overflows, use-after-free)
- [ ] Integer safety (no overflows, division by zero checks)
- [ ] Timing determinism (no unbounded loops, WCET analysis compatible)
- [ ] Security considerations (input validation, no hardcoded secrets)
## Traceability Review
- [ ] Bidirectional trace established (requirement ↔ code)
- [ ] Test cases linked to requirements
- [ ] Design rationale documented
## Recommendation
- [ ] Approve as-is
- [ ] Approve with minor changes (list below)
- [ ] Request major revisions (specify issues)
- [ ] Reject (escalate to architect/safety manager)
---
Reviewer: [Name]
Date: [YYYY-MM-DD]
Time Spent: [HH:MM]
Recovery Strategies
Strategy 1: Retry with Refined Prompt
Use Case: Agent misunderstands vague requirement.
Example:
def generate_code_with_retry(requirement, max_retries=3):
"""
Generate code with automatic retry on validation failure
"""
for attempt in range(max_retries):
print(f"Attempt {attempt + 1}/{max_retries}")
# Generate code
result = agent.generate_code(requirement)
# Validate
validation = validate_generated_code(result["code"])
if validation["status"] == "PASS":
return result
else:
# Refine prompt with error feedback
error_context = f"""
Previous attempt failed validation:
{validation['error']}
Common mistakes to avoid:
- Ensure all edge cases are handled (null pointers, buffer boundaries)
- Follow MISRA C rules (avoid pointer arithmetic, use explicit casts)
- Add Doxygen comments for all functions
Please regenerate the code addressing these issues.
"""
requirement = requirement + "\n\n" + error_context
# All retries exhausted
print("[FAIL] Max retries exceeded - escalating to human")
escalate_to_human({
"requirement": requirement,
"attempts": max_retries,
"last_error": validation["error"]
})
return None
Strategy 2: Fallback to Conservative Implementation
Use Case: Agent struggles with complex requirement → generate simpler, safer version.
Example:
def generate_code_with_fallback(requirement, asil_level):
"""
If agent fails to generate optimized code, fall back to conservative implementation
"""
# Attempt 1: Optimized code
prompt_optimized = f"""
Generate highly optimized C code for: {requirement}
Minimize CPU cycles and memory usage.
"""
result = agent.generate_code(prompt_optimized)
if validate_generated_code(result["code"])["status"] == "PASS":
return result
# Fallback: Conservative, safety-focused code
print("[WARN] Optimized generation failed, falling back to conservative approach")
prompt_conservative = f"""
Generate C code for: {requirement}
CRITICAL SAFETY REQUIREMENTS:
- ASIL {asil_level}: Prioritize correctness over performance
- Use defensive programming (check all inputs, validate ranges)
- Avoid pointer arithmetic (use array indexing)
- Add assertions for all preconditions
- Use static memory allocation (no malloc)
- Implement timeout for all loops
"""
result_conservative = agent.generate_code(prompt_conservative)
if validate_generated_code(result_conservative["code"])["status"] == "PASS":
return result_conservative
else:
escalate_to_human("Both optimized and conservative code generation failed")
return None
Strategy 3: Partial Acceptance with Manual Completion
Use Case: Agent generates 80% correct code, but specific section needs human expertise.
Example:
def partial_acceptance_workflow(requirement):
"""
Accept AI-generated code skeleton, flag sections needing human input
"""
result = agent.generate_code(requirement)
# Parse code for uncertainty markers
uncertain_sections = extract_todo_comments(result["code"])
if uncertain_sections:
print(f"[WARN] Agent flagged {len(uncertain_sections)} sections for human completion:")
for section in uncertain_sections:
print(f" - Line {section['line']}: {section['comment']}")
# Create task for engineer
create_jira_task(
title=f"Complete AI-generated code for {requirement['id']}",
description=f"""
Agent generated partial implementation but needs human expertise for:
{format_uncertain_sections(uncertain_sections)}
Code file: {result['file_path']}
Requirement: {requirement['text']}
""",
assignee=get_domain_expert(requirement['module']),
priority="High"
)
return {
"status": "PARTIAL",
"code": result["code"],
"pending_tasks": uncertain_sections
}
else:
return {
"status": "COMPLETE",
"code": result["code"]
}
# Example agent output with uncertainty markers
"""
void brake_pressure_monitor(void) {
// TODO(AI): Verify correct sensor scaling factor (assumed 0.1 bar/bit)
float pressure_bar = read_adc_sensor() * 0.1;
// TODO(AI-HUMAN): Confirm threshold with safety engineer (currently 100 bar from spec)
if (pressure_bar > 100.0) {
trigger_fault_handler(FAULT_OVERPRESSURE);
}
}
"""
Strategy 4: Human-in-the-Loop Escalation
Use Case: Agent error is critical or unrecoverable.
Escalation Triggers:
- Safety violation (ASIL C/D code modifies safety mechanism)
- Repeated validation failures (3+ retry attempts)
- Traceability break (cannot link generated code to requirement)
- Tool crash (agent cannot execute compiler/linter)
Escalation Workflow:
def escalate_to_human(error_context):
"""
Notify human engineer and pause agent workflow
"""
# Log error
logger.error(f"Agent error requiring human intervention: {error_context}")
# Create incident ticket
incident_id = create_jira_incident(
summary=f"AI Agent Error: {error_context['type']}",
description=f"""
**Error Type**: {error_context['type']}
**Severity**: {error_context['severity']}
**Agent Task**: {error_context['task']}
**Failure Details**:
{error_context['details']}
**Recommended Actions**:
{error_context['recommendations']}
**Context**:
- Requirement ID: {error_context.get('requirement_id', 'N/A')}
- Code File: {error_context.get('file_path', 'N/A')}
- Agent Confidence: {error_context.get('confidence', 'N/A')}%
""",
priority="Critical" if error_context['severity'] == "Critical" else "High",
labels=["ai-agent-error", "hitl-required"]
)
# Notify on-call engineer
send_slack_alert(
channel="#ai-agents-escalation",
message=f"[ESCALATION] Agent escalation: {incident_id}\nRequires human review within 2 hours.",
mention=get_oncall_engineer()
)
# Pause agent workflow
return {
"status": "ESCALATED",
"incident_id": incident_id,
"awaiting_human": True
}
Monitoring and Logging
Observability Requirements
Key Metrics:
- Error Rate: Errors per 100 agent invocations
- Recovery Success Rate: % of errors resolved by automated retry/fallback
- Mean Time to Recovery (MTTR): Time from error detection to resolution
- False Positive Rate: % of escalations that were not actual errors
- Human Intervention Rate: % of tasks requiring HITL
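Given structured events like those emitted by the logging framework that follows, three of these metrics can be computed directly. A minimal sketch, assuming events arrive as a list of dicts with an `event` field matching the logger's event names:

```python
def compute_agent_metrics(events: list) -> dict:
    """Compute error rate, recovery success rate, and HITL rate from log events."""
    starts = sum(1 for e in events if e["event"] == "task_start")
    errors = sum(1 for e in events if e["event"] == "task_error")
    recoveries = [e for e in events if e["event"] == "recovery_attempt"]
    escalations = sum(1 for e in events if e["event"] == "human_escalation")
    recovered = sum(1 for e in recoveries if e.get("success"))
    return {
        "error_rate_pct": 100.0 * errors / starts if starts else 0.0,
        "recovery_success_rate_pct": (
            100.0 * recovered / len(recoveries) if recoveries else 0.0
        ),
        "human_intervention_rate_pct": (
            100.0 * escalations / starts if starts else 0.0
        ),
    }
```

MTTR and false positive rate additionally require timestamps and escalation outcomes, which the structured logs below also carry.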
Logging Framework:
import logging
from datetime import datetime
class AgentLogger:
"""Structured logging for AI agent operations"""
def __init__(self, agent_name):
self.agent_name = agent_name
self.logger = logging.getLogger(f"agent.{agent_name}")
def log_task_start(self, task_id, task_type, input_data):
"""Log agent task initiation"""
self.logger.info({
"event": "task_start",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"task_type": task_type,
"input_size": len(str(input_data))
})
def log_task_complete(self, task_id, output_data, confidence, duration_ms):
"""Log successful task completion"""
self.logger.info({
"event": "task_complete",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"output_size": len(str(output_data)),
"confidence": confidence,
"duration_ms": duration_ms
})
def log_error(self, task_id, error_type, error_details, severity):
"""Log agent error"""
self.logger.error({
"event": "task_error",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"error_type": error_type,
"severity": severity,
"details": error_details
})
def log_recovery(self, task_id, recovery_strategy, success):
"""Log recovery attempt"""
self.logger.warning({
"event": "recovery_attempt",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"strategy": recovery_strategy,
"success": success
})
def log_escalation(self, task_id, escalation_reason, incident_id):
"""Log human escalation"""
self.logger.critical({
"event": "human_escalation",
"timestamp": datetime.utcnow().isoformat(),
"agent": self.agent_name,
"task_id": task_id,
"reason": escalation_reason,
"incident_id": incident_id
})
# Usage example
logger = AgentLogger("code_generator")
task_id = "TASK-12345"
logger.log_task_start(task_id, "code_generation", requirement)
try:
result = generate_code(requirement)
logger.log_task_complete(task_id, result["code"], result["confidence"], duration_ms=1234)
except ValidationError as e:
logger.log_error(task_id, "validation_failure", str(e), "Major")
logger.log_recovery(task_id, "retry_with_refined_prompt", success=False)
incident = escalate_to_human(error_context)
logger.log_escalation(task_id, "repeated_validation_failure", incident["incident_id"])
Metrics Dashboard
Example: Grafana Dashboard Panels
- Error Rate Over Time (Line chart)
  - Query: count(log_event == "task_error") / count(log_event == "task_start") * 100
  - Alert: Error rate > 10% for 1 hour
- Recovery Success Rate (Gauge)
  - Query: count(recovery_success == true) / count(recovery_attempt) * 100
  - Target: > 80%
- Top Error Types (Bar chart)
  - Query: count(log_event == "task_error") group by error_type
- Mean Time to Recovery (Single stat)
  - Query: avg(escalation_timestamp - error_timestamp)
  - Target: < 2 hours
- Human Intervention Rate (Pie chart)
  - Query: count(log_event == "human_escalation") / count(log_event == "task_start") * 100
  - Target: < 5%
Real-World Error Examples
Example 1: Code Generation Hallucination
Scenario: Agent generates code referencing non-existent API.
Requirement:
REQ-SWE-234: Motor controller shall read encoder position every 1ms
AI-Generated Code (Incorrect):
// HALLUCINATION: read_encoder_position() does not exist in HAL
void motor_control_task(void) {
int32_t position = read_encoder_position(); // [FAIL] Undefined function
update_pid_controller(position);
}
Detection:
- Compilation Error: undefined reference to 'read_encoder_position'
- Static Analysis: Function not declared in any header
Recovery:
- Retry with Context: Provide HAL API documentation to agent
prompt_retry = f"""
{requirement}
Available HAL APIs:
- HAL_Encoder_Init(encoder_id)
- HAL_Encoder_GetPosition(encoder_id) -> returns int32_t
- HAL_Encoder_GetSpeed(encoder_id) -> returns int32_t
Generate code using ONLY these APIs.
"""
- Validation: Verify function calls against HAL header file
def validate_function_calls(code, allowed_apis):
"""Check that all function calls are in allowed API list"""
function_calls = extract_function_calls(code) # Regex or AST parsing
invalid_calls = [f for f in function_calls if f not in allowed_apis]
if invalid_calls:
return {
"status": "FAIL",
"error": f"Undefined functions: {invalid_calls}"
}
return {"status": "PASS"}
Example 2: Test Case Omission
Scenario: Agent generates test cases but misses critical edge case.
Requirement:
REQ-SWE-456: Door lock shall remain engaged if speed sensor fails while vehicle is moving
AI-Generated Tests (Incomplete):
// Test 1: Normal operation
void test_door_lock_normal(void) {
set_speed(50); // 50 km/h
door_lock_control();
assert(get_lock_state() == LOCKED);
}
// [FAIL] MISSING: Test for sensor fault during motion
Detection:
- Coverage Analysis: Requirement REQ-SWE-456 not covered by any test case
- Traceability Check: No test linked to safety requirement
Recovery:
- Gap Analysis: Identify missing test scenarios
def analyze_test_coverage(requirements, test_cases):
"""Map requirements to test cases, flag gaps"""
coverage_map = {}
for req in requirements:
# Extract test cases that reference this requirement
linked_tests = [t for t in test_cases if req["id"] in t["requirement_refs"]]
if not linked_tests:
coverage_map[req["id"]] = {
"status": "NOT_COVERED",
"tests": []
}
else:
coverage_map[req["id"]] = {
"status": "COVERED",
"tests": [t["id"] for t in linked_tests]
}
# Report gaps
gaps = [req_id for req_id, cov in coverage_map.items() if cov["status"] == "NOT_COVERED"]
return gaps
# Generate missing tests
gaps = analyze_test_coverage(requirements, generated_tests)
for req_id in gaps:
print(f"[WARN] No test coverage for {req_id}, regenerating...")
additional_test = agent.generate_test_case(get_requirement(req_id))
generated_tests.append(additional_test)
Example 3: Requirements Misinterpretation
Scenario: Agent interprets ambiguous requirement incorrectly.
Requirement (Ambiguous):
REQ-SYS-789: System shall respond to brake pedal input quickly
AI Interpretation:
Software Requirement: Brake control loop shall execute every 10ms
Correct Interpretation (From safety engineer):
Software Requirement: Brake control loop shall execute every 1ms (latency < 5ms per ISO 26262 brake-by-wire requirements)
Detection:
- Domain Expert Review: Safety engineer flags incorrect timing requirement
- Semantic Validation: Cross-check against ISO 26262 guidelines (automated checker)
Recovery:
- Clarification Request: Agent asks for missing details
prompt_clarify = """
Requirement: "System shall respond to brake pedal input quickly"
This requirement is ambiguous. Please clarify:
1. What is the acceptable latency? (e.g., < 5ms, < 50ms)
2. What is the execution frequency? (e.g., 1kHz, 100Hz, 10Hz)
3. What is the ASIL level? (determines criticality)
4. Are there regulatory constraints? (ISO 26262, UN R13H)
If information is not available, I will flag this requirement for human review.
"""
- Escalation to Requirement Owner: Create Jira task for requirement clarification
ASPICE Alignment
SWE.4 (Software Unit Verification)
ASPICE Requirement: Verify that software units meet requirements (BP1, BP2, BP3)
Agent Error Detection Integration:
- BP1: Develop unit test cases → AI agent generates tests, human reviews for completeness
- BP2: Test software units → Automated execution of AI-generated tests, coverage analysis
- BP3: Achieve consistency → Traceability checks (requirement ↔ code ↔ test)
Work Products:
- Unit test report (13-04): Include AI agent test generation metadata (confidence, validation status)
- Test coverage report: Flag gaps detected by agent error analysis
SUP.9 (Problem Resolution Management)
ASPICE Requirement: Track and resolve problems/defects (BP1-BP5)
Agent Error as "Problem":
- BP1: Define problem management strategy → Treat agent errors as defects, assign severity
- BP2: Record problems → Log all agent errors in Jira/ALM tool
- BP3: Implement corrective actions → Retry, fallback, or escalate per recovery strategy
- BP4: Track problems to closure → Monitor incident resolution
- BP5: Analyze trends → Monthly review of agent error metrics (top error types, recovery success rate)
Example: Jira Workflow for Agent Errors
Agent Error Detected
↓
[Create Jira Issue: "AI Agent Error"]
↓
Assign to: AI Agent Team Lead
↓
Priority: Based on severity (Critical/Major/Minor)
↓
Resolution Actions:
- Update agent prompt template (prevent recurrence)
- Enhance validation rules
- Improve training data
- Document in agent error knowledge base
↓
Close Issue (track resolution time)
Metrics and KPIs
Error Detection Metrics
| Metric | Definition | Target | Measurement Frequency |
|---|---|---|---|
| Error Rate | (Agent errors / Total tasks) × 100 | < 5% | Daily |
| Critical Error Rate | (Critical errors / Total errors) × 100 | < 10% | Weekly |
| Detection Latency | Time from error occurrence to detection | < 5 minutes | Per incident |
| False Positive Rate | (False escalations / Total escalations) × 100 | < 15% | Monthly |
Recovery Metrics
| Metric | Definition | Target | Measurement Frequency |
|---|---|---|---|
| Recovery Success Rate | (Successful auto-recoveries / Total errors) × 100 | > 80% | Weekly |
| Mean Time to Recovery (MTTR) | Avg time from error detection to resolution | < 2 hours | Monthly |
| Retry Effectiveness | (Errors fixed by retry / Retry attempts) × 100 | > 60% | Monthly |
| Human Intervention Rate | (HITL escalations / Total tasks) × 100 | < 5% | Weekly |
Quality Impact Metrics
| Metric | Definition | Target | Measurement Frequency |
|---|---|---|---|
| Agent-Introduced Defects | Defects found in agent output during QA | < 2% of total defects | Sprint retrospective |
| Requirement Traceability Gap Rate | (Missing traces / Total requirements) × 100 | 0% | Before release |
| Test Coverage Gap Rate | (Uncovered requirements / Total requirements) × 100 | < 5% | Sprint |
Best Practices and Lessons Learned
What Works
- Confidence scoring + automated validation: Combine agent self-assessment with objective checks
- Progressive validation: Fast checks first (schema, linting), expensive checks later (compilation, unit tests)
- Graceful degradation: Fallback to conservative implementations when optimization fails
- Traceability-first: Always validate requirement ↔ code ↔ test linkage
- Transparent logging: Structured logs enable root cause analysis and process improvement
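The "progressive validation" practice above — run cheap checks first, expensive ones later — can be expressed as an ordered check pipeline that short-circuits on the first failure. A minimal sketch; the check functions in the usage example are illustrative stand-ins for real schema, lint, compile, and unit-test steps:

```python
def progressive_validate(output: str, checks: list) -> dict:
    """Run (name, check) pairs cheapest-first; stop at the first failure.

    Each check takes the output string and returns (ok: bool, detail: str).
    """
    for name, check in checks:
        ok, detail = check(output)
        if not ok:
            return {"status": "FAIL", "failed_check": name, "detail": detail}
    return {"status": "PASS", "failed_check": None, "detail": None}

# Illustrative ordering: schema validation before a length-based lint stand-in.
checks = [
    ("schema", lambda s: (s.startswith("{"), "not JSON-like")),
    ("length", lambda s: (len(s) < 100, "too long")),
]
```

Ordering checks by cost means a malformed output never reaches the compiler or test runner, keeping feedback loops fast.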
Common Pitfalls
- Over-trusting high confidence: Agents can hallucinate with 95%+ confidence — always validate
- Ignoring partial failures: Accepting 80% correct code without flagging gaps leads to integration bugs
- Alert fatigue: Too many low-severity escalations cause engineers to ignore critical alerts
- Insufficient context: Agent errors are often caused by missing domain knowledge — provide comprehensive context
- No feedback loop: Not tracking error patterns allows the same mistakes to recur
Conclusion and Recommendations
Key Takeaways
- AI agents are assistants, not oracles: Always validate outputs before integration
- Layered validation: Combine confidence scoring, automated checks, and human review
- Fast feedback loops: Detect errors immediately (CI/CD integration)
- HITL is mandatory for safety-critical code: ASIL C/D requires human sign-off
- Continuous improvement: Track error patterns, refine prompts, enhance validation rules
Implementation Roadmap
Phase 1 (Month 1): Basic error detection
- Implement schema validation for agent outputs
- Set up structured logging (AgentLogger class)
- Define escalation workflow (Jira integration)
Phase 2 (Months 2-3): Automated validation
- Integrate static analysis (MISRA C checker)
- Add unit test auto-execution
- Implement confidence threshold policies
Phase 3 (Months 4-6): Recovery automation
- Develop retry with refined prompt logic
- Build fallback strategies (conservative code generation)
- Create metrics dashboard (Grafana)
Phase 4 (Ongoing): Optimization
- Analyze error trends, update prompt templates
- Reduce false positive rate (tune validation rules)
- Improve recovery success rate (better fallback heuristics)
Next Chapter: Chapter 31 - Workflow Instructions (Git, Pull Requests, Testing, Releases)
References
- VDA: Automotive SPICE PAM 4.0 (2023) - SUP.1, SUP.9, SWE.4, SWE.5
- ISO 26262-6:2018: Software verification and validation requirements
- Anthropic: "Claude Model Documentation" (2025) - Confidence calibration, prompt engineering
- OpenAI: "GPT-4 System Card" (2023) - Known limitations and mitigation strategies
- MISRA: "MISRA C:2012 Guidelines for the use of the C language in critical systems"
- Lewis, Chris et al.: "Measuring Calibration in Neural Networks" (NeurIPS 2019)
- Amodei, Dario et al.: "Concrete Problems in AI Safety" (2016) - Safe exploration, robustness to distributional shift