5.5: SUP.9 Problem Resolution Management
Process Definition
Purpose
SUP.9 Purpose: To ensure that problems are identified, analyzed, managed, and controlled to resolution. The process establishes a disciplined approach for recording every problem detected across all lifecycle phases, performing systematic root cause analysis, implementing verified corrective actions, and feeding lessons learned back into the development process to prevent recurrence.
AI Value Proposition: AI transforms problem resolution from a reactive, labor-intensive activity into a proactive, pattern-driven discipline. Machine learning models trained on historical problem repositories can suggest root causes within minutes rather than days, detect duplicates before investigation effort is wasted, and predict emerging problem clusters before they escalate into release blockers.
Outcomes
| Outcome | Description | AI Contribution |
|---|---|---|
| O1 | A problem resolution management strategy is developed | AI recommends strategy parameters based on project risk profile and historical defect density |
| O2 | Problems are recorded, uniquely identified, and classified | AI auto-classifies severity, category, and affected component with confidence scores |
| O3 | Problems are analyzed to determine root causes | AI performs pattern matching against historical problem repositories to suggest probable root causes |
| O4 | A resolution strategy is determined and implemented for each problem | AI recommends resolution approaches drawn from similar past resolutions |
| O5 | Problems are tracked to closure and status communicated to affected parties | AI monitors resolution progress and escalates stalled items automatically |
| O6 | Trends in problem reports are analyzed to prevent future occurrences | AI runs statistical and ML-based trend analysis to predict emerging problem categories |
Base Practices with AI Integration
| BP | Base Practice | AI Level | AI Application | Human Responsibility |
|---|---|---|---|---|
| BP1 | Develop a problem resolution management strategy | L1 | Template generation, strategy parameter suggestions based on project type | Approve strategy, define escalation paths |
| BP2 | Identify and record the problem | L2 | Auto-populate fields from test logs, classify category and component, detect duplicates | Verify classification, confirm non-duplicate status |
| BP3 | Analyze problems for their root cause | L2 | Historical pattern matching, similarity search across prior RCA records, suggest probable root causes | Validate suggested root cause, perform physical investigation where needed |
| BP4 | Determine a resolution strategy | L2 | Recommend fix approaches from historical resolution database, estimate effort and risk | Select resolution approach, approve implementation plan |
| BP5 | Implement problem resolution | L1 | Generate fix templates, link to affected configuration items, update traceability | Implement the actual code or design change |
| BP6 | Track problems to closure | L2-L3 | Automated status monitoring, regression verification tracking, closure criteria checking | Approve closure, sign-off on verification evidence |
| BP7 | Analyze problem trends | L2-L3 | Statistical trend analysis, anomaly detection, predictive clustering | Interpret trends, initiate preventive actions |
AI Level Definitions: L1 = AI assists with templates and suggestions; L2 = AI performs analysis, human validates; L3 = AI executes autonomously with human oversight on exceptions.
AI-Assisted Problem Resolution
The following diagram illustrates the AI-assisted problem resolution workflow, from initial detection and classification through root cause analysis, corrective action, and closure verification.
Problem Report Template
Note: This automotive example demonstrates temperature-related timing investigation; adapt for project-specific domains.
```yaml
# Problem Report (illustrative automotive example)
problem:
  id: PR-(year)-(number)
  title: "Door lock timing exceeds requirement at cold temperature"
  status: in_progress
  created: (creation date)
  reporter: Test Engineer
  classification:
    category: timing
    component: DoorLockControl
    severity: high
    priority: P1
    ai_confidence: 0.85
  description: |
    During HIL testing at -40C, door lock timing measured at 11.2ms,
    exceeding the 10ms requirement specified in SWE-BCM-103.
  reproduction:
    steps:
      - Set climate chamber to -40C
      - Wait for ECU temperature stabilization (30 min)
      - Send lock command via CAN
      - Measure actuator output timing
    rate: "100% reproducible at -40C"
    environment: "HIL system with climate chamber"
  ai_analysis:
    root_cause_suggestion: |
      Pattern match with 3 similar historical issues indicates
      motor driver slew rate degradation at cold temperature.
      Likely cause: GPIO driver configuration does not account
      for temperature-dependent propagation delay.
      Related issues: PR-2023-156, PR-2024-089
    confidence: medium
    suggested_investigation:
      - "Check motor driver IC datasheet for cold temp specs"
      - "Review GPIO driver strength configuration"
      - "Analyze timing margin at component level"
  investigation:
    actual_cause: |
      Motor driver transistor turn-on time increases from 50ns
      to 120ns at -40C due to reduced carrier mobility.
      Combined with 4 sequential actuator commands, adds ~0.3ms.
    root_cause_type: design
    verified: true
    verified_by: HW Engineer
  resolution:
    approach: "Increase GPIO drive strength for cold temperature"
    fix_description: |
      Added temperature-dependent GPIO configuration:
      - T > -20C: Standard drive strength (8mA)
      - T <= -20C: High drive strength (16mA)
    changed_files:
      - src/driver/gpio_driver.c
      - config/gpio_config.c
    change_request: CR-2025-015
  verification:
    test_cases:
      - id: SWE-QT-BCM-005
        result: pass
        measured: 9.8ms
    regression:
      status: pass
      scope: "Full DoorLock test suite"
  closure:
    closed_date: 2025-01-22
    closed_by: SW Lead
    lessons_learned: |
      Temperature compensation should be considered for all
      timing-critical functions during design phase.
```
AI-Powered Root Cause Analysis
ML Techniques for Root Cause Identification
AI-powered root cause analysis draws on multiple machine learning techniques applied to the corpus of historical problem reports. The objective is to reduce mean time to root cause identification by surfacing the most probable causes before an engineer begins manual investigation.
| ML Technique | Application to RCA | Typical Accuracy | Training Data Required |
|---|---|---|---|
| TF-IDF + Cosine Similarity | Match new problem descriptions against historical reports to find textually similar past problems | 70-80% top-5 recall | 500+ resolved problem reports |
| Sentence Embeddings (SBERT) | Semantic similarity search that captures meaning beyond keyword overlap | 75-85% top-5 recall | 500+ resolved reports with root cause fields |
| Random Forest Classifier | Classify root cause category (design, requirements, tooling, process) from structured fields | 78-88% accuracy | 1,000+ labeled reports |
| Gradient Boosted Trees (XGBoost) | Predict affected component and likely fix location from symptom description and test context | 72-82% accuracy | 1,000+ reports with component labels |
| Clustering (HDBSCAN) | Group problem reports into clusters to reveal systemic issues affecting multiple components | N/A (unsupervised) | 200+ reports for meaningful clusters |
| Bayesian Networks | Model causal chains between symptoms, root causes, and environmental conditions | 65-75% causal accuracy | Expert-defined structure + 500+ data points |
Data Quality Warning: ML model accuracy depends directly on the quality and consistency of historical problem data. Organizations beginning AI-assisted RCA should invest in structured problem report templates and retrospective labeling of legacy data before expecting reliable predictions.
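As a minimal sketch of the first technique in the table, the following implements TF-IDF + cosine similarity by hand with a whitespace tokenizer; a real deployment would typically use a library such as scikit-learn, and the report texts below are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict) for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical historical report texts plus one incoming report
reports = [
    "door lock timing exceeds requirement at cold temperature",
    "window motor timing violation at cold temperature",
    "door lock fails to engage intermittently",
]
incoming = "actuator timing exceeds spec at cold temperature"

tokenized = [r.split() for r in reports + [incoming]]
vecs = tfidf_vectors(tokenized)
query = vecs[-1]
# Rank historical reports by similarity to the incoming one
ranked = sorted(range(len(reports)),
                key=lambda i: cosine(query, vecs[i]), reverse=True)
print("Most similar historical report:", reports[ranked[0]])
```

In practice the top-k ranked reports feed the "top-5 recall" measurement from the table: the suggestion counts as a hit if the confirmed root-cause report appears among the first five results.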
AI-Assisted RCA
```
AI Root Cause Analysis Report:
------------------------------
Problem: PR-2025-042 (Door lock timing at cold temp)

Historical Pattern Analysis:
----------------------------
Searching historical issues with similar characteristics...

Match 1: PR-2023-156 (87% similarity)
  - Project: Window Controller
  - Symptom: Timing violation at cold temperature
  - Root Cause: Transistor switching time degradation
  - Resolution: Increased driver current

Match 2: PR-2024-089 (72% similarity)
  - Project: Seat Controller
  - Symptom: Motor response delay at -30C
  - Root Cause: Power transistor thermal characteristics
  - Resolution: Temperature compensation

Match 3: PR-2022-234 (65% similarity)
  - Project: Mirror Controller
  - Symptom: Actuator timing inconsistent
  - Root Cause: PWM timing affected by temperature
  - Resolution: Temperature-adjusted timing

Common Pattern Identified:
--------------------------
Category: Temperature-dependent timing degradation
Component: Power driver / Output stage
Physics: Semiconductor carrier mobility reduction at cold

Suggested Investigation Order:
------------------------------
1. [HIGH PROBABILITY] Check output driver temperature specs
2. [MEDIUM PROBABILITY] Review timing margin analysis
3. [LOW PROBABILITY] Verify power supply stability at cold

Confidence: 82%

Human Action Required:
[ ] Validate AI analysis against actual investigation
[ ] Confirm root cause
[ ] Approve resolution approach
```
Problem Classification
AI-Assisted Severity, Type, and Component Classification
When a new problem report is submitted, the AI classification engine analyzes the description text, attached logs, and structured metadata to automatically assign severity, problem type, and affected component. The engineer reviews the AI assignments and overrides where necessary; every override is fed back into the model as a training signal.
| Classification Dimension | AI Method | Input Features | Confidence Threshold |
|---|---|---|---|
| Severity (critical / high / medium / low) | Multi-class text classifier (fine-tuned transformer) | Description, test type, failure mode, affected requirement ASIL level | >= 0.80 for auto-assignment; below 0.80 requires human review |
| Problem Type (functional / timing / resource / interface / safety) | Keyword-boosted gradient classifier | Description, component, test environment, error codes | >= 0.75 for auto-assignment |
| Affected Component | Named entity recognition + component taxonomy lookup | Description, file paths in stack traces, test case IDs | >= 0.70 for auto-assignment |
| Priority (P1-P4) | Rule engine combining severity + project phase + customer impact | Severity, release proximity, customer-facing flag | Deterministic (rule-based); no confidence score |
Override Feedback Loop: Every human override of an AI classification is stored as a labeled training sample. Models are retrained on a quarterly cadence or when override rate exceeds 25% for any classification dimension.
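Unlike the learned dimensions, priority is assigned by a deterministic rule engine. A sketch of such an engine follows; the specific rules are hypothetical illustrations of combining severity, release phase, and customer impact, not the project's actual rule set.

```python
def assign_priority(severity, release_phase, customer_facing):
    """Deterministic priority assignment (hypothetical rules for illustration).

    Mirrors the rule-based dimension in the table: every input combination
    maps to exactly one priority, and no confidence score is produced.
    """
    if severity == "critical":
        return "P1"
    if severity == "high" and (release_phase == "integration" or customer_facing):
        return "P1"
    if severity == "high":
        return "P2"
    if severity == "medium":
        return "P2" if customer_facing else "P3"
    return "P4"  # low severity

# Reproduces the rule-trace pattern: high severity, integration phase,
# customer-facing problem escalates to P1
print(assign_priority("high", "integration", True))
```

Because the mapping is deterministic, the rule trace (which conditions fired) can be logged verbatim for audit, as shown in the classification output below.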
```yaml
# AI Classification Output (illustrative)
classification_result:
  problem_id: PR-2025-042
  model_version: "cls-v3.2.1"
  timestamp: "2025-01-10T08:32:00Z"
  severity:
    predicted: high
    confidence: 0.87
    reasoning: "Timing violation on safety-relevant function at boundary temp"
    human_override: null
  problem_type:
    predicted: timing
    confidence: 0.91
    reasoning: "Keywords 'timing', 'exceeds', 'ms requirement' with HIL context"
    human_override: null
  component:
    predicted: DoorLockControl
    confidence: 0.85
    reasoning: "Test case SWE-QT-BCM-005 mapped to DoorLockControl module"
    alternatives:
      - component: GPIODriver
        confidence: 0.62
      - component: ActuatorControl
        confidence: 0.41
    human_override: null
  priority:
    assigned: P1
    rule_trace: "severity=high AND release_phase=integration AND customer_facing=true -> P1"
```
Duplicate Detection
NLP-Based Duplicate Problem Report Detection
Duplicate problem reports waste investigation effort and obscure true defect counts. The duplicate detection engine compares each incoming report against all open and recently closed reports using a two-stage pipeline.
Stage 1 -- Candidate Retrieval: A fast retrieval model (TF-IDF or BM25) identifies the top-50 most textually similar existing reports.
Stage 2 -- Semantic Re-ranking: A cross-encoder transformer model re-ranks candidates by semantic similarity, accounting for paraphrasing and domain synonyms (e.g., "timing violation" vs. "deadline exceeded").
| Detection Parameter | Value | Rationale |
|---|---|---|
| Similarity threshold (definite duplicate) | >= 0.92 | Reports above this threshold are auto-linked as duplicates pending human confirmation |
| Similarity threshold (potential duplicate) | 0.75 - 0.91 | Reports in this range are flagged for human review |
| Search scope | All open reports + reports closed within last 180 days | Balances recall against false positives from aged reports |
| Feature inputs | Title, description, component, failure mode, test environment | Multi-field comparison reduces false matches from generic descriptions |
| Retraining trigger | Precision drops below 85% on monthly validation set | Ensures model adapts to evolving project vocabulary |
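The two-stage pipeline can be sketched as follows. A cheap lexical Jaccard score stands in for both the Stage 1 retriever and the Stage 2 cross-encoder (which in a real deployment would be a trained model); the thresholds follow the table above, and the report texts are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap score used here as a stand-in similarity measure."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def detect_duplicates(incoming, corpus, retrieve_k=50,
                      dup_threshold=0.92, review_band=0.75):
    """Two-stage duplicate detection: retrieve top-k candidates, then re-rank.

    Stage 2 reuses the lexical score for illustration; a real pipeline
    substitutes a cross-encoder. Thresholds match the parameter table.
    """
    # Stage 1: fast candidate retrieval
    candidates = sorted(corpus, key=lambda r: jaccard(incoming, r["text"]),
                        reverse=True)[:retrieve_k]
    # Stage 2: re-rank and apply thresholds
    results = []
    for r in candidates:
        score = jaccard(incoming, r["text"])
        if score >= dup_threshold:
            verdict = "definite_duplicate"   # still needs human confirmation
        elif score >= review_band:
            verdict = "potential_duplicate"  # flagged for human review
        else:
            verdict = "not_duplicate"
        results.append((r["id"], round(score, 2), verdict))
    return results

corpus = [
    {"id": "PR-2025-038", "text": "window motor timing exceeds spec at -40C"},
    {"id": "PR-2025-029", "text": "door lock fails to engage intermittently"},
]
incoming = "door lock timing exceeds requirement at cold temperature"
for pr_id, score, verdict in detect_duplicates(incoming, corpus):
    print(pr_id, score, verdict)
```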
```
Duplicate Detection Report:
---------------------------
Incoming: PR-2025-042 "Door lock timing exceeds requirement at cold temperature"

Candidate 1: PR-2025-038 (similarity: 0.68) -- NOT DUPLICATE
  Title: "Window motor timing exceeds spec at -40C"
  Verdict: Different component (WindowControl vs DoorLockControl),
           similar symptom pattern. Linked as RELATED, not duplicate.

Candidate 2: PR-2025-029 (similarity: 0.54) -- NOT DUPLICATE
  Title: "Door lock fails to engage intermittently"
  Verdict: Same component but different failure mode
           (functional vs timing). No link.

Candidate 3: PR-2024-089 (similarity: 0.71) -- RELATED
  Title: "Seat motor response delay at cold temperature"
  Verdict: Different project, similar root cause pattern.
           Linked as RELATED for cross-project learning.

Result: No duplicate found. PR-2025-042 confirmed as NEW.
Related issues linked: PR-2025-038, PR-2024-089
```
Human Confirmation Required: Even when the AI declares a definite duplicate (similarity >= 0.92), a human must confirm the linkage before the incoming report is merged. Auto-closure of duplicates without human review is prohibited in safety-relevant projects.
Resolution Recommendation
AI-Suggested Fixes Based on Historical Resolutions
Once a root cause category is established, the resolution recommendation engine searches the historical resolution database for verified fixes applied to similar problems. Recommendations are ranked by relevance, success rate, and applicability to the current project context.
| Recommendation Factor | Weight | Description |
|---|---|---|
| Root cause similarity | 0.35 | How closely the root cause of the historical problem matches the current one |
| Component similarity | 0.25 | Whether the fix was applied to the same or analogous component |
| Resolution success rate | 0.20 | Percentage of times this fix type resolved the problem without recurrence |
| Project context match | 0.10 | Same MCU family, same RTOS, similar safety level |
| Recency | 0.10 | More recent resolutions weighted higher to reflect current codebase state |
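The ranking reduces to a weighted sum of the five factors. A minimal sketch with the weights from the table; the per-factor scores for the two candidate fixes are invented for illustration and each is assumed to be normalized to [0, 1].

```python
# Weights taken from the recommendation factor table (sum to 1.0)
WEIGHTS = {
    "root_cause_similarity": 0.35,
    "component_similarity": 0.25,
    "success_rate": 0.20,
    "context_match": 0.10,
    "recency": 0.10,
}

def relevance_score(factors):
    """Weighted sum of factor scores, each normalized to [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

# Hypothetical factor scores for two candidate resolutions
fix_a = {"root_cause_similarity": 0.9, "component_similarity": 0.8,
         "success_rate": 1.0, "context_match": 0.7, "recency": 0.9}
fix_b = {"root_cause_similarity": 0.6, "component_similarity": 0.4,
         "success_rate": 1.0, "context_match": 0.5, "recency": 0.3}
print(round(relevance_score(fix_a), 3))  # higher score ranks first
print(round(relevance_score(fix_b), 3))
```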
```yaml
# Resolution Recommendation Output (illustrative)
resolution_recommendations:
  problem_id: PR-2025-042
  root_cause_category: "temperature-dependent-timing-degradation"
  recommendations:
    - rank: 1
      approach: "Temperature-dependent drive strength adjustment"
      confidence: 0.88
      source_problems: [PR-2023-156, PR-2024-089]
      success_rate: "2/2 (100%) -- no recurrence in 12 months"
      estimated_effort: "16-24 hours"
      risk: low
      description: |
        Implement conditional GPIO drive strength configuration
        based on temperature sensor reading. Switch to high drive
        strength below a calibrated threshold (typically -20C).
      files_likely_affected:
        - "src/driver/gpio_driver.c"
        - "config/gpio_config.c"
    - rank: 2
      approach: "Timing margin increase via requirement relaxation"
      confidence: 0.52
      source_problems: [PR-2022-234]
      success_rate: "1/1 (100%) -- but different context"
      estimated_effort: "8-12 hours"
      risk: medium
      description: |
        Negotiate requirement relaxation from 10ms to 12ms at
        extreme cold temperatures. Requires customer agreement
        and safety impact analysis.
      files_likely_affected:
        - "docs/requirements/SWE-BCM-103.md"
    - rank: 3
      approach: "Hardware modification -- driver IC substitution"
      confidence: 0.31
      source_problems: []
      success_rate: "N/A -- inferred from datasheet analysis"
      estimated_effort: "80-120 hours (HW change)"
      risk: high
      description: |
        Replace current motor driver IC with automotive-grade
        part rated for extended cold temperature operation.
        Requires HW redesign and requalification.
  ai_recommendation: |
    Rank 1 approach (temperature-dependent drive strength) is
    strongly recommended based on 100% historical success rate,
    low implementation risk, and direct applicability to the
    current DoorLockControl architecture.
  human_action_required:
    - "Review recommended approach for technical feasibility"
    - "Approve resolution strategy before implementation"
    - "Create change request CR linked to this problem report"
```
Trend Analysis
Predicting Future Problems from Patterns
Trend analysis moves problem resolution from reactive firefighting to proactive prevention. The AI trend engine operates on three time horizons.
| Time Horizon | Technique | Output | Action Trigger |
|---|---|---|---|
| Short-term (1-4 weeks) | Moving average of daily problem inflow by category; spike detection via z-score | Alert when problem inflow exceeds 2 standard deviations above rolling mean | Immediate investigation of spike category |
| Medium-term (1-3 months) | Regression analysis of defect density per component over release cycles | Predicted defect count per component for next release | Targeted code review and testing for high-density components |
| Long-term (6-12 months) | Seasonal decomposition + ARIMA forecasting of problem volumes by type | Forecasted problem load for capacity planning | Staff allocation adjustments, process improvement initiatives |
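The short-term horizon's spike detection can be sketched with the standard library; the weekly counts below are hypothetical, and the 2-sigma trigger matches the table.

```python
import statistics

def spike_alert(history, current, z_threshold=2.0):
    """Short-term spike detection: alert when the current weekly inflow
    exceeds the rolling mean by more than z_threshold standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = (current - mean) / stdev if stdev else float("inf")
    return z > z_threshold, round(z, 2)

# Hypothetical weekly problem counts for one category
weekly_timing = [8, 10, 9, 11, 10, 9, 10, 11]
alert, z = spike_alert(weekly_timing, current=18)
print(f"alert={alert}, z={z}")
```

A triggered alert starts immediate investigation of the spiking category, per the action trigger column.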
Trend Indicators and Thresholds
| Trend Indicator | Green | Yellow | Red |
|---|---|---|---|
| Problem inflow rate (per week) | Stable or declining | Rising > 15% week-over-week for 2 consecutive weeks | Rising > 30% or absolute count exceeds capacity threshold |
| Recurrence rate (same root cause) | < 5% | 5-15% | > 15% -- systemic fix ineffective |
| Mean age of open problems | < 10 days | 10-20 days | > 20 days -- resolution bottleneck |
| Component concentration (top component share) | < 30% of total problems | 30-50% | > 50% -- single component driving majority of problems |
| Escape rate (problems found post-release) | < 2% of total problems | 2-5% | > 5% -- verification process gap |
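The traffic-light bands can be applied mechanically. A small sketch over three of the indicators, with band boundaries taken from the table (values in percent) and the sample inputs hypothetical:

```python
# (green_upper, yellow_upper) band boundaries in percent, from the table above
BANDS = {
    "recurrence_rate": (5, 15),
    "component_concentration": (30, 50),
    "escape_rate": (2, 5),
}

def indicator_status(name, value_pct):
    """Map an indicator value to its green/yellow/red band."""
    green_upper, yellow_upper = BANDS[name]
    if value_pct < green_upper:
        return "green"
    if value_pct <= yellow_upper:
        return "yellow"
    return "red"

print(indicator_status("recurrence_rate", 12))          # within yellow band
print(indicator_status("component_concentration", 23))  # below 30% -> green
```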
```
Trend Analysis Report (Q4 2024):
--------------------------------
Analysis period: October - December 2024
Total problems: 147 (vs 112 in Q3 -- +31%)

Category Breakdown:
  Timing:     42 (29%) -- UP from 18% in Q3 [ALERT]
  Functional: 38 (26%) -- stable
  Interface:  29 (20%) -- stable
  Resource:   21 (14%) -- DOWN from 19%
  Safety:     17 (12%) -- stable

Component Hot Spots:
  DoorLockControl: 34 problems (23%) [ALERT -- concentration]
  WindowControl:   22 problems (15%)
  SeatController:  19 problems (13%)

Prediction for Q1 2025:
  Expected problem count: 155-175 (continued upward trend)
  Highest risk category: Timing (predicted 35-40% share)
  Highest risk component: DoorLockControl

Recommended Preventive Actions:
  1. Conduct timing margin review across all DoorLockControl functions
  2. Add temperature boundary tests to CI pipeline
  3. Schedule architectural review of timing-critical paths
  4. Increase code review depth for timing-related changes

Confidence: 76%
```
Knowledge Base Building
Automatic Knowledge Extraction from Resolved Problems
Every resolved and closed problem report contains valuable engineering knowledge. The knowledge extraction engine automatically distills lessons learned, reusable fix patterns, and design guidelines from closed reports and organizes them into a searchable knowledge base.
| Extraction Type | Method | Knowledge Base Article Structure |
|---|---|---|
| Root Cause Pattern | Cluster resolved problems by root cause category; extract common causal chain | Title, symptom pattern, causal mechanism, affected component types, verification approach |
| Fix Pattern | Group successful resolutions by approach type; generalize into reusable templates | Fix category, applicability conditions, implementation steps, expected effort, success rate |
| Design Guideline | Aggregate lessons learned from multiple related problems into preventive rules | Guideline statement, rationale, applicability scope, source problem IDs |
| Test Gap | Identify problem categories not caught by existing test suites; generate test recommendations | Gap description, recommended test type, priority, estimated coverage improvement |
| Environmental Sensitivity | Detect problems correlated with specific environmental conditions (temperature, voltage, load) | Environmental factor, affected parameters, safe operating range, boundary test recommendations |
```yaml
# Auto-Generated Knowledge Base Article (illustrative)
knowledge_article:
  id: KB-2025-017
  title: "Temperature-Dependent Timing Degradation in Motor Drivers"
  generated_from: [PR-2025-042, PR-2024-089, PR-2023-156, PR-2022-234]
  auto_generated: true
  human_reviewed: true
  reviewed_by: "SW Architect"
  review_date: "2025-02-01"
  symptom_pattern: |
    Actuator or motor timing measurements exceed requirements at
    temperatures below -20C. Degradation is proportional to
    temperature drop and affects all output stages using standard
    MOSFET drivers without temperature compensation.
  root_cause_mechanism: |
    Semiconductor carrier mobility decreases at low temperatures,
    increasing transistor turn-on and turn-off times. For MOSFET
    drivers, Rds(on) temperature coefficient is positive but
    switching time coefficient is negative, leading to slower
    edge rates at cold extremes.
  recommended_design_practice: |
    For all timing-critical output stages:
    1. Include temperature sensor reading in driver configuration
    2. Implement at least two drive strength levels (standard / high)
    3. Define temperature threshold for switching (typically -20C)
    4. Add timing margin analysis at -40C during design review
    5. Include cold temperature boundary tests in verification plan
  applicability:
    domains: [body-control, seat-control, window-control, mirror-control]
    mcu_families: [all]
    safety_levels: [QM, ASIL-A, ASIL-B]
  related_articles: [KB-2024-008, KB-2023-042]
  tags: [timing, temperature, motor-driver, GPIO, cold-start]
```
Knowledge Base Maintenance: Articles are automatically flagged for review when new problem reports match the article pattern but the documented fix did not prevent recurrence. This feedback loop ensures the knowledge base remains current and trustworthy.
HITL Protocol for Problem Resolution Decisions
Human-in-the-Loop Decision Points
Every AI action in the problem resolution process has a defined HITL gate. The table below specifies which decisions AI can make autonomously, which require human confirmation, and which are exclusively human.
| Decision Point | AI Authority | Human Authority | Escalation Rule |
|---|---|---|---|
| Problem recording | AI may auto-create report from test failure logs | Human confirms report is valid and not a test environment issue | If AI confidence < 0.70, flag for manual triage |
| Classification (severity) | AI assigns severity when confidence >= 0.80 | Human reviews all critical/safety-related severity assignments | All ASIL-related problems require human classification regardless of AI confidence |
| Duplicate detection | AI links definite duplicates (>= 0.92 similarity) for confirmation | Human confirms or rejects all duplicate links | Auto-merged duplicates prohibited; human confirmation mandatory |
| Root cause suggestion | AI provides ranked suggestions with confidence scores | Human validates root cause through investigation | If no suggestion exceeds 0.50 confidence, report is queued for senior engineer |
| Resolution recommendation | AI recommends approaches ranked by historical success | Human selects approach and approves implementation plan | Safety-related fixes require safety manager approval |
| Closure | AI verifies closure criteria (tests passed, regression clean, traceability updated) | Human approves closure and signs off on verification evidence | Closure without human sign-off is blocked by tooling |
| Trend escalation | AI generates trend alerts automatically | Human decides whether to initiate preventive action | Red-level trends auto-escalate to project management |
Guiding Principle: Humans own decisions; AI accelerates analysis. No problem report may be closed, reclassified to a lower severity, or marked as duplicate without explicit human approval. AI outputs are advisory inputs to human decision-making, not autonomous actions.
Override and Feedback Protocol
```
HITL Override Workflow:
-----------------------
1. AI presents classification / RCA / resolution recommendation
2. Engineer reviews AI output:
   a. ACCEPT --> AI assignment stands; logged as confirmed
   b. MODIFY --> Engineer adjusts; delta logged as training feedback
   c. REJECT --> Engineer provides correct value; logged as override
3. All overrides stored in feedback database:
   - Problem ID
   - AI prediction (field, value, confidence)
   - Human decision (field, value, rationale)
   - Timestamp, engineer ID
4. Quarterly model retraining incorporates override data
5. Override rate monitored per classification dimension:
   - Target: < 15% override rate
   - Action trigger: > 25% override rate initiates model review
```
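Step 5 of the workflow can be sketched as an override-rate monitor; the decision records and field names below are hypothetical, while the 15%/25% thresholds come from the workflow itself.

```python
def override_review_needed(decisions, dimension,
                           target=0.15, action_trigger=0.25):
    """Compute the override rate for one classification dimension and
    apply the target / action-trigger thresholds from the workflow."""
    relevant = [d for d in decisions if d["dimension"] == dimension]
    # MODIFY and REJECT both count as overrides; ACCEPT confirms the AI
    overrides = sum(1 for d in relevant if d["outcome"] in ("MODIFY", "REJECT"))
    rate = overrides / len(relevant) if relevant else 0.0
    return {
        "dimension": dimension,
        "override_rate": round(rate, 2),
        "within_target": rate < target,
        "model_review_required": rate > action_trigger,
    }

# Hypothetical feedback-database entries
decisions = [
    {"dimension": "severity", "outcome": "ACCEPT"},
    {"dimension": "severity", "outcome": "ACCEPT"},
    {"dimension": "severity", "outcome": "MODIFY"},
    {"dimension": "severity", "outcome": "ACCEPT"},
]
print(override_review_needed(decisions, "severity"))
```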
Problem Metrics Dashboard
The diagram below shows the problem resolution dashboard, presenting open/closed problem counts, resolution time trends, severity distribution, and aging analysis.
Metrics and KPIs
Core Problem Resolution Metrics
| Metric | Definition | Target | Measurement Method |
|---|---|---|---|
| MTTR (Mean Time to Resolve) | Average elapsed time from problem report creation to verified closure | < 5 business days for P1; < 10 for P2; < 20 for P3 | Calculated from issue tracker timestamps |
| MTTRC (Mean Time to Root Cause) | Average elapsed time from report creation to confirmed root cause | < 2 business days for P1; < 5 for P2 | Timestamp delta: created to root_cause_confirmed |
| First-Time Fix Rate | Percentage of problems resolved without reopening | > 90% | (Problems closed once) / (Total problems closed) |
| Recurrence Rate | Percentage of problems with the same root cause as a previously resolved problem | < 5% | Root cause pattern matching against closed problems |
| Escape Rate | Percentage of problems found after release vs. total problems | < 2% for safety-relevant; < 5% overall | (Post-release problems) / (Total problems in release) |
| Problem Backlog Age | Average age of open problem reports | < 10 days | Mean of (today - created_date) for all open reports |
| Duplicate Rate | Percentage of reports identified as duplicates | < 10% (lower indicates better initial triage) | (Duplicates detected) / (Total reports submitted) |
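A minimal sketch of the MTTR computation from issue-tracker timestamps; the field names and dates are hypothetical. Note the table's targets are in business days, while this sketch uses calendar days for brevity.

```python
from datetime import date

def mttr_days(problems):
    """Mean time to resolve (created -> closed), over closed problems only.

    A production version would count business days and filter by priority
    to compare against the per-priority targets in the table.
    """
    durations = [(p["closed"] - p["created"]).days
                 for p in problems if p.get("closed")]
    return sum(durations) / len(durations) if durations else None

problems = [
    {"id": "PR-1", "created": date(2025, 1, 10), "closed": date(2025, 1, 14)},
    {"id": "PR-2", "created": date(2025, 1, 12), "closed": date(2025, 1, 20)},
    {"id": "PR-3", "created": date(2025, 1, 15), "closed": None},  # still open
]
print(mttr_days(problems))
```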
AI-Specific Performance Metrics
| Metric | Definition | Target | Measurement Method |
|---|---|---|---|
| RCA Suggestion Accuracy | Percentage of AI root cause suggestions confirmed as correct by human investigation | > 75% (top-1); > 90% (top-3) | (Confirmed suggestions) / (Total suggestions provided) |
| Classification Accuracy | Percentage of AI severity/type/component classifications accepted without override | > 85% per dimension | (Accepted classifications) / (Total classifications) |
| Duplicate Detection Precision | Percentage of AI-flagged duplicates confirmed as true duplicates | > 90% | (True duplicates) / (Flagged duplicates) |
| Duplicate Detection Recall | Percentage of actual duplicates caught by AI | > 80% | (Caught duplicates) / (Total actual duplicates) |
| Resolution Recommendation Relevance | Percentage of AI-recommended resolution approaches selected by engineer | > 60% (top-1); > 85% (top-3) | (Selected recommendations) / (Total recommendations) |
| Trend Prediction Accuracy | Percentage of AI trend alerts that corresponded to actual problem spikes | > 70% | (True alerts) / (Total alerts) |
| Override Rate | Percentage of AI decisions overridden by humans (lower is better after initial calibration) | < 15% per dimension | (Overrides) / (Total AI decisions) |
Calibration Period: During the first 6 months of AI deployment, accuracy targets are relaxed by 10 percentage points to allow model calibration. Monthly accuracy reviews determine when full targets apply.
Tool Integration
Issue Tracking and AI Plugin Architecture
The problem resolution process relies on tight integration between the issue tracking system and AI analysis services. The following table maps supported tools to their AI integration points.
| Tool | AI Integration Method | Supported AI Features | Configuration Notes |
|---|---|---|---|
| Jira | Atlassian Intelligence + custom webhook to AI service | Auto-classification on issue create; duplicate detection via JQL + embedding search; RCA suggestion panel; trend dashboard widget | Requires Jira Cloud Premium or Data Center with Atlassian Intelligence enabled; custom fields for AI confidence scores |
| Bugzilla | REST API webhook to external AI microservice | Classification on bug submission; duplicate search via Bug.search API + AI re-ranking; RCA suggestion as comment attachment | Webhook extension required; AI service consumes Bugzilla REST API; results posted as structured comments |
| Polarion | LiveDoc extension + external AI REST service | Embedded classification widget in work item form; traceability-aware RCA (leverages Polarion link graph); trend analysis integrated into Polarion dashboard | Requires Polarion ALM 2024 or later; extension deployed via Polarion SDK; OSLC links feed AI context |
| Azure DevOps | Azure ML endpoint + custom pipeline task | Classification via service hook on work item create; RCA via Azure Cognitive Search over historical items; trend Power BI integration | Requires Azure ML workspace; service connection configured in project settings |
| GitLab Issues | GitLab webhook + external AI service container | Classification on issue open; duplicate detection via GitLab search API + AI; RCA suggestion as issue note | AI service deployed as GitLab CI service; results posted via GitLab API |
Integration Architecture Pattern
```yaml
# AI Problem Analysis Service Configuration (illustrative)
problem_analysis_service:
  name: "ai-problem-analyzer"
  version: "2.1.0"
  endpoints:
    classify:
      path: "/api/v1/classify"
      method: POST
      input: problem_report_json
      output: classification_result_json
      timeout: 10s
    detect_duplicates:
      path: "/api/v1/duplicates"
      method: POST
      input: problem_report_json
      output: duplicate_candidates_json
      timeout: 15s
    suggest_root_cause:
      path: "/api/v1/rca"
      method: POST
      input: problem_report_json
      output: rca_suggestions_json
      timeout: 30s
    recommend_resolution:
      path: "/api/v1/resolution"
      method: POST
      input: problem_with_rca_json
      output: resolution_recommendations_json
      timeout: 20s
    analyze_trends:
      path: "/api/v1/trends"
      method: GET
      params: [project, date_range, categories]
      output: trend_report_json
      timeout: 60s

issue_tracker_webhooks:
  jira:
    event: "jira:issue_created"
    actions: [classify, detect_duplicates, suggest_root_cause]
  bugzilla:
    event: "bug.create"
    actions: [classify, detect_duplicates, suggest_root_cause]
  polarion:
    event: "workitem.created"
    actions: [classify, detect_duplicates, suggest_root_cause]

authentication:
  method: "OAuth2 client credentials"
  token_endpoint: "https://auth.example.com/oauth/token"

data_privacy:
  pii_scrubbing: enabled
  data_retention: "24 months for analysis; 7 years for audit trail"
  gdpr_compliance: true
```
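To make the endpoint contract concrete, the following sketch assembles a call to the classify endpoint described in the service configuration. The host name and token value are placeholders; token acquisition via the OAuth2 client-credentials flow and the HTTP transport itself are deliberately stubbed out:

```python
import json

BASE_URL = "https://ai-problem-analyzer.example.com"  # assumed host, not normative

def build_classify_request(problem_report: dict, token: str) -> dict:
    """Assemble the POST /api/v1/classify call (method, URL, headers, body).

    Returns a plain dict describing the request so it can be handed to any
    HTTP client; nothing is sent on the wire here.
    """
    return {
        "method": "POST",
        "url": f"{BASE_URL}/api/v1/classify",
        "headers": {
            # Bearer token obtained via the OAuth2 client-credentials flow
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps(problem_report),
        "timeout": 10,  # seconds, matching the classify endpoint's configured timeout
    }

req = build_classify_request(
    {"id": "PRJ-1042", "summary": "CAN bus timeout"}, "test-token")
```

The same builder pattern applies to the other four endpoints; only the path, payload type, and timeout change.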
Work Products
| WP ID | Work Product | ASPICE Reference | AI Role | AI Level | Human Sign-Off Required |
|---|---|---|---|---|---|
| 08-27 | Problem report | SUP.9 BP2 | Auto-classification, duplicate detection, RCA suggestion | L2 | Yes -- reporter confirms classification |
| 08-28 | Root cause analysis record | SUP.9 BP3 | Historical pattern matching, causal chain suggestion | L2 | Yes -- investigator validates root cause |
| 08-29 | Resolution record | SUP.9 BP4-BP5 | Resolution recommendation, fix template generation | L2 | Yes -- SW lead approves resolution approach |
| 13-07 | Problem status report | SUP.9 BP6 | Automated status aggregation, stale report detection | L2-L3 | Yes -- project manager reviews before distribution |
| 13-26 | Trend analysis report | SUP.9 BP7 | Statistical trend computation, anomaly detection, forecasting | L2-L3 | Yes -- QA lead validates and interprets |
| 15-03 | Problem resolution strategy | SUP.9 BP1 | Strategy template generation, parameter recommendation | L1 | Yes -- process owner approves |
| 13-27 | Knowledge base article | Derived from SUP.9 | Auto-extraction from resolved problems | L2 | Yes -- domain expert reviews before publication |
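The sign-off column above means a human always confirms AI output, but the AI confidence score can decide how much work that confirmation takes. A sketch of one such HITL gate for problem reports (WP 08-27); the result fields and the 0.80 threshold are assumptions, not a normative schema:

```python
# Assumed project-specific cut-off below which AI output is not pre-filled.
CONFIDENCE_THRESHOLD = 0.80

def route_classification(result: dict) -> str:
    """Decide how an AI classification is presented to the human reporter.

    The reporter signs off either way (per the work-product table): a
    high-confidence result is pre-filled for one-click confirmation, a
    low-confidence one falls back to full manual classification.
    """
    scores = (result["severity_confidence"], result["category_confidence"])
    if min(scores) >= CONFIDENCE_THRESHOLD:
        return "prefill_for_confirmation"
    return "manual_classification"

high = {"severity": "major", "severity_confidence": 0.93,
        "category": "runtime", "category_confidence": 0.88}
low = {"severity": "minor", "severity_confidence": 0.95,
       "category": "build", "category_confidence": 0.61}
```

Gating on the minimum of the per-field confidences is deliberately conservative: one uncertain field is enough to demand manual classification.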
Implementation Checklist
Phase 1: Foundation (Months 1-3)
| Step | Action | Responsible | Deliverable | Status |
|---|---|---|---|---|
| 1.1 | Define problem resolution strategy aligned with ASPICE SUP.9 | Process Owner | Problem Resolution Plan (WP 15-03) | [ ] |
| 1.2 | Configure issue tracker with structured problem report template | DevOps / Tool Admin | Configured Jira/Bugzilla/Polarion template | [ ] |
| 1.3 | Define severity, type, and component taxonomies | QA Lead + SW Architect | Classification taxonomy document | [ ] |
| 1.4 | Establish HITL decision matrix and approval workflows | Process Owner | HITL protocol document | [ ] |
| 1.5 | Label historical problem reports (minimum 500) with root cause category, component, resolution type | Engineering team | Labeled training dataset | [ ] |
| 1.6 | Define KPI targets and dashboard layout | QA Lead | Metrics specification | [ ] |
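Step 1.5 requires a consistent record format for the labeled training data. A minimal sketch of one plausible schema, with the three labels named in that step; field names and the CSV layout are assumptions:

```python
import csv
import io

# Assumed minimal schema: one row per historical problem report,
# carrying the three labels required by step 1.5.
SCHEMA = ["problem_id", "description",
          "root_cause_category", "component", "resolution_type"]

row = {
    "problem_id": "BUG-0815",  # hypothetical example report
    "description": "Sporadic watchdog reset during flash write",
    "root_cause_category": "timing",
    "component": "bootloader",
    "resolution_type": "code_fix",
}

# Serialize to CSV so the dataset can be versioned alongside the models.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=SCHEMA)
writer.writeheader()
writer.writerow(row)
```

Whatever the exact schema, the label vocabularies should come from the taxonomy document of step 1.3 so model output and tracker fields stay aligned.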
Phase 2: AI Deployment (Months 4-6)
| Step | Action | Responsible | Deliverable | Status |
|---|---|---|---|---|
| 2.1 | Deploy AI classification service (severity, type, component) | ML Engineer | Classification microservice v1.0 | [ ] |
| 2.2 | Deploy duplicate detection service | ML Engineer | Duplicate detection endpoint v1.0 | [ ] |
| 2.3 | Integrate AI services with issue tracker via webhooks | DevOps | Webhook configuration, integration tests | [ ] |
| 2.4 | Train RCA suggestion model on labeled historical data | ML Engineer | RCA model v1.0, validation report | [ ] |
| 2.5 | Deploy resolution recommendation engine | ML Engineer | Resolution recommendation endpoint v1.0 | [ ] |
| 2.6 | Build metrics dashboard (MTTR, accuracy, override rate) | DevOps / QA Lead | Live dashboard | [ ] |
| 2.7 | Conduct pilot with 1-2 engineering teams; collect feedback | QA Lead | Pilot evaluation report | [ ] |
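Two of the dashboard KPIs from step 2.6 can be computed directly from closed problem reports. A sketch under assumed record fields (`created`/`resolved` ISO timestamps, `ai_class`/`final_class` labels):

```python
from datetime import datetime

def mttr_hours(reports: list[dict]) -> float:
    """Mean time to resolution in hours, over reports that are resolved."""
    deltas = [
        (datetime.fromisoformat(r["resolved"])
         - datetime.fromisoformat(r["created"])).total_seconds() / 3600
        for r in reports if r.get("resolved")
    ]
    return sum(deltas) / len(deltas) if deltas else 0.0

def override_rate(reports: list[dict]) -> float:
    """Fraction of AI classifications changed by a human before sign-off."""
    classified = [r for r in reports if "ai_class" in r and "final_class" in r]
    if not classified:
        return 0.0
    overridden = sum(1 for r in classified if r["ai_class"] != r["final_class"])
    return overridden / len(classified)

reports = [
    {"created": "2025-01-01T08:00", "resolved": "2025-01-02T08:00",
     "ai_class": "major", "final_class": "major"},   # 24 h, AI confirmed
    {"created": "2025-01-01T08:00", "resolved": "2025-01-01T20:00",
     "ai_class": "minor", "final_class": "major"},   # 12 h, AI overridden
]
# mttr_hours(reports) -> 18.0, override_rate(reports) -> 0.5
```

A rising override rate is the earliest signal that a model needs the retraining scheduled in Phase 3.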
Phase 3: Optimization (Months 7-12)
| Step | Action | Responsible | Deliverable | Status |
|---|---|---|---|---|
| 3.1 | Retrain models with pilot feedback and override data | ML Engineer | Models v2.0, accuracy comparison report | [ ] |
| 3.2 | Deploy trend analysis engine (short/medium/long-term) | ML Engineer | Trend analysis service v1.0 | [ ] |
| 3.3 | Activate automatic knowledge base extraction from closed reports | ML Engineer + QA Lead | Knowledge base seeded with initial articles | [ ] |
| 3.4 | Roll out AI-assisted problem resolution to all teams | Process Owner | Organization-wide deployment confirmation | [ ] |
| 3.5 | Establish quarterly model retraining cadence | ML Engineer | Retraining schedule and automation pipeline | [ ] |
| 3.6 | Conduct first ASPICE SUP.9 internal assessment with AI evidence | QA Lead | Assessment report demonstrating AI integration | [ ] |
| 3.7 | Review and update KPI targets based on 6 months of operational data | QA Lead + Process Owner | Updated metrics specification | [ ] |
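The statistical half of the trend analysis engine in step 3.2 can start as simply as a z-score check over weekly problem counts; the 2-sigma threshold below is an assumption, and ML-based forecasting is out of scope for this sketch:

```python
from statistics import mean, stdev

def anomalous_weeks(weekly_counts: list[int], sigma: float = 2.0) -> list[int]:
    """Return indices of weeks whose problem count deviates more than
    `sigma` sample standard deviations from the mean."""
    mu, sd = mean(weekly_counts), stdev(weekly_counts)
    if sd == 0:
        return []  # flat series: nothing to flag
    return [i for i, c in enumerate(weekly_counts) if abs(c - mu) / sd > sigma]

# Week 5 spikes well above the baseline and is flagged.
counts = [12, 14, 11, 13, 12, 30, 13, 12]
```

Flagged weeks feed the trend analysis report (WP 13-26), where the QA lead interprets whether a spike reflects a genuine emerging problem category or, for example, a one-off test campaign.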
Continuous Improvement: After Phase 3, the implementation enters a continuous improvement cycle. Quarterly reviews assess model accuracy, override rates, and KPI trends. Annual process audits verify that AI integration continues to satisfy ASPICE SUP.9 outcomes.
Summary
SUP.9 Problem Resolution:
- AI Level: L2 (AI analysis, human validation)
- Primary AI Value: Root cause suggestion, pattern matching, duplicate detection, resolution recommendation
- Human Essential: Cause validation, fix implementation, closure approval
- Key Outputs: Problem reports, RCA records, resolution records, trend reports, knowledge base articles
- AI Accuracy: ~78% root cause suggestion accuracy (illustrative target; calibrate based on historical data quality)
- Critical HITL Gates: Classification override, RCA validation, duplicate confirmation, closure sign-off