5.5: SUP.9 Problem Resolution Management


Process Definition

Purpose

SUP.9 Purpose: To ensure that problems are identified, analyzed, managed, and controlled to resolution. The process establishes a disciplined approach for recording every problem detected across all lifecycle phases, performing systematic root cause analysis, implementing verified corrective actions, and feeding lessons learned back into the development process to prevent recurrence.

AI Value Proposition: AI transforms problem resolution from a reactive, labor-intensive activity into a proactive, pattern-driven discipline. Machine learning models trained on historical problem repositories can suggest root causes within minutes rather than days, detect duplicates before investigation effort is wasted, and predict emerging problem clusters before they escalate into release blockers.

Outcomes

| Outcome | Description | AI Contribution |
|---|---|---|
| O1 | A problem resolution management strategy is developed | AI recommends strategy parameters based on project risk profile and historical defect density |
| O2 | Problems are recorded, uniquely identified, and classified | AI auto-classifies severity, category, and affected component with confidence scores |
| O3 | Problems are analyzed to determine root causes | AI performs pattern matching against historical problem repositories to suggest probable root causes |
| O4 | A resolution strategy is determined and implemented for each problem | AI recommends resolution approaches drawn from similar past resolutions |
| O5 | Problems are tracked to closure and status communicated to affected parties | AI monitors resolution progress and escalates stalled items automatically |
| O6 | Trends in problem reports are analyzed to prevent future occurrences | AI runs statistical and ML-based trend analysis to predict emerging problem categories |

Base Practices with AI Integration

| BP | Base Practice | AI Level | AI Application | Human Responsibility |
|---|---|---|---|---|
| BP1 | Develop a problem resolution management strategy | L1 | Template generation, strategy parameter suggestions based on project type | Approve strategy, define escalation paths |
| BP2 | Identify and record the problem | L2 | Auto-populate fields from test logs, classify category and component, detect duplicates | Verify classification, confirm non-duplicate status |
| BP3 | Analyze problems for their root cause | L2 | Historical pattern matching, similarity search across prior RCA records, suggest probable root causes | Validate suggested root cause, perform physical investigation where needed |
| BP4 | Determine a resolution strategy | L2 | Recommend fix approaches from historical resolution database, estimate effort and risk | Select resolution approach, approve implementation plan |
| BP5 | Implement problem resolution | L1 | Generate fix templates, link to affected configuration items, update traceability | Implement the actual code or design change |
| BP6 | Track problems to closure | L2-L3 | Automated status monitoring, regression verification tracking, closure criteria checking | Approve closure, sign off on verification evidence |
| BP7 | Analyze problem trends | L2-L3 | Statistical trend analysis, anomaly detection, predictive clustering | Interpret trends, initiate preventive actions |

AI Level Definitions: L1 = AI assists with templates and suggestions; L2 = AI performs analysis, human validates; L3 = AI executes autonomously with human oversight on exceptions.


AI-Assisted Problem Resolution

The following diagram illustrates the AI-assisted problem resolution workflow, from initial detection and classification through root cause analysis, corrective action, and closure verification.

Problem Resolution Flow


Problem Report Template

Note: This automotive example demonstrates a temperature-related timing investigation; adapt the template for project-specific domains.

# Problem Report (illustrative automotive example)
problem:
  id: PR-(year)-(number)
  title: "Door lock timing exceeds requirement at cold temperature"
  status: in_progress
  created: (creation date)
  reporter: Test Engineer

  classification:
    category: timing
    component: DoorLockControl
    severity: high
    priority: P1
    ai_confidence: 0.85

  description: |
    During HIL testing at -40C, door lock timing measured at 11.2ms,
    exceeding the 10ms requirement specified in SWE-BCM-103.

  reproduction:
    steps:
      - Set climate chamber to -40C
      - Wait for ECU temperature stabilization (30 min)
      - Send lock command via CAN
      - Measure actuator output timing
    rate: "100% reproducible at -40C"
    environment: "HIL system with climate chamber"

  ai_analysis:
    root_cause_suggestion: |
      Pattern match with 3 similar historical issues indicates
      motor driver slew rate degradation at cold temperature.

      Likely cause: GPIO driver configuration does not account
      for temperature-dependent propagation delay.

      Related issues: PR-2023-156, PR-2024-089
    confidence: medium
    suggested_investigation:
      - "Check motor driver IC datasheet for cold temp specs"
      - "Review GPIO driver strength configuration"
      - "Analyze timing margin at component level"

  investigation:
    actual_cause: |
      Motor driver transistor turn-on time increases from 50ns
      to 120ns at -40C due to reduced carrier mobility.
      Combined with 4 sequential actuator commands, adds ~0.3ms.
    root_cause_type: design
    verified: true
    verified_by: HW Engineer

  resolution:
    approach: "Increase GPIO drive strength for cold temperature"
    fix_description: |
      Added temperature-dependent GPIO configuration:
      - T > -20C: Standard drive strength (8mA)
      - T <= -20C: High drive strength (16mA)
    changed_files:
      - src/driver/gpio_driver.c
      - config/gpio_config.c
    change_request: CR-2025-015

  verification:
    test_cases:
      - id: SWE-QT-BCM-005
        result: pass
        measured: 9.8ms
    regression:
      status: pass
      scope: "Full DoorLock test suite"

  closure:
    closed_date: 2025-01-22
    closed_by: SW Lead
    lessons_learned: |
      Temperature compensation should be considered for all
      timing-critical functions during design phase.

AI-Powered Root Cause Analysis

ML Techniques for Root Cause Identification

AI-powered root cause analysis draws on multiple machine learning techniques applied to the corpus of historical problem reports. The objective is to reduce mean time to root cause identification by surfacing the most probable causes before an engineer begins manual investigation.

| ML Technique | Application to RCA | Typical Accuracy | Training Data Required |
|---|---|---|---|
| TF-IDF + Cosine Similarity | Match new problem descriptions against historical reports to find textually similar past problems | 70-80% top-5 recall | 500+ resolved problem reports |
| Sentence Embeddings (SBERT) | Semantic similarity search that captures meaning beyond keyword overlap | 75-85% top-5 recall | 500+ resolved reports with root cause fields |
| Random Forest Classifier | Classify root cause category (design, requirements, tooling, process) from structured fields | 78-88% accuracy | 1,000+ labeled reports |
| Gradient Boosted Trees (XGBoost) | Predict affected component and likely fix location from symptom description and test context | 72-82% accuracy | 1,000+ reports with component labels |
| Clustering (HDBSCAN) | Group problem reports into clusters to reveal systemic issues affecting multiple components | N/A (unsupervised) | 200+ reports for meaningful clusters |
| Bayesian Networks | Model causal chains between symptoms, root causes, and environmental conditions | 65-75% causal accuracy | Expert-defined structure + 500+ data points |

Data Quality Warning: ML model accuracy depends directly on the quality and consistency of historical problem data. Organizations beginning AI-assisted RCA should invest in structured problem report templates and retrospective labeling of legacy data before expecting reliable predictions.
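
As a concrete illustration of the first technique in the table, the TF-IDF retrieval step can be prototyped with nothing but the Python standard library. The report IDs and token lists below are hypothetical stand-ins for a real problem repository; a production system would use a proper tokenizer and an indexed search rather than a brute-force scan.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical historical problem descriptions (pre-tokenized)
history = {
    "PR-2023-156": "timing violation cold temperature window motor".split(),
    "PR-2024-089": "motor response delay cold temperature seat".split(),
    "PR-2025-029": "door lock fails engage intermittently".split(),
}
incoming = "door lock timing exceeds requirement cold temperature".split()

vecs = tfidf_vectors(list(history.values()) + [incoming])
query = vecs[-1]
ranked = sorted(
    ((pid, cosine(query, vecs[i])) for i, pid in enumerate(history)),
    key=lambda kv: kv[1], reverse=True,
)
```

The top-ranked historical report becomes the first RCA candidate presented to the engineer; in this toy corpus the cold-temperature timing report ranks highest.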

AI-Assisted RCA

AI Root Cause Analysis Report:
------------------------------

Problem: PR-2025-042 (Door lock timing at cold temp)

Historical Pattern Analysis:
----------------------------
Searching historical issues with similar characteristics...

Match 1: PR-2023-156 (87% similarity)
- Project: Window Controller
- Symptom: Timing violation at cold temperature
- Root Cause: Transistor switching time degradation
- Resolution: Increased driver current

Match 2: PR-2024-089 (72% similarity)
- Project: Seat Controller
- Symptom: Motor response delay at -30C
- Root Cause: Power transistor thermal characteristics
- Resolution: Temperature compensation

Match 3: PR-2022-234 (65% similarity)
- Project: Mirror Controller
- Symptom: Actuator timing inconsistent
- Root Cause: PWM timing affected by temperature
- Resolution: Temperature-adjusted timing

Common Pattern Identified:
--------------------------
Category: Temperature-dependent timing degradation
Component: Power driver / Output stage
Physics: Semiconductor carrier mobility reduction at cold

Suggested Investigation Order:
------------------------------
1. [HIGH PROBABILITY] Check output driver temperature specs
2. [MEDIUM PROBABILITY] Review timing margin analysis
3. [LOW PROBABILITY] Verify power supply stability at cold

Confidence: 82%

Human Action Required:
[ ] Validate AI analysis against actual investigation
[ ] Confirm root cause
[ ] Approve resolution approach

Problem Classification

AI-Assisted Severity, Type, and Component Classification

When a new problem report is submitted, the AI classification engine analyzes the description text, attached logs, and structured metadata to automatically assign severity, problem type, and affected component. The engineer reviews the AI assignments and overrides where necessary; every override is fed back into the model as a training signal.

| Classification Dimension | AI Method | Input Features | Confidence Threshold |
|---|---|---|---|
| Severity (critical / high / medium / low) | Multi-class text classifier (fine-tuned transformer) | Description, test type, failure mode, affected requirement ASIL level | >= 0.80 for auto-assignment; below 0.80 requires human review |
| Problem Type (functional / timing / resource / interface / safety) | Keyword-boosted gradient classifier | Description, component, test environment, error codes | >= 0.75 for auto-assignment |
| Affected Component | Named entity recognition + component taxonomy lookup | Description, file paths in stack traces, test case IDs | >= 0.70 for auto-assignment |
| Priority (P1-P4) | Rule engine combining severity + project phase + customer impact | Severity, release proximity, customer-facing flag | Deterministic (rule-based); no confidence score |

Override Feedback Loop: Every human override of an AI classification is stored as a labeled training sample. Models are retrained on a quarterly cadence or when override rate exceeds 25% for any classification dimension.
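
A minimal sketch of the auto-assignment gate implied by the thresholds above; the function name is hypothetical, and the ASIL guard reflects the HITL rule in this section that safety-related problems always go to a human.

```python
# Auto-assignment confidence thresholds per dimension (from the table above)
THRESHOLDS = {"severity": 0.80, "problem_type": 0.75, "component": 0.70}

def route_classification(dimension, confidence, asil_related=False):
    """Return 'auto_assign' or 'human_review' for one AI prediction.

    ASIL-related problems always require human classification,
    regardless of model confidence.
    """
    if asil_related or confidence < THRESHOLDS[dimension]:
        return "human_review"
    return "auto_assign"
```

For the example report above, severity at 0.87 confidence would auto-assign, while a component prediction at 0.65 would be queued for human review.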

# AI Classification Output (illustrative)
classification_result:
  problem_id: PR-2025-042
  model_version: "cls-v3.2.1"
  timestamp: "2025-01-10T08:32:00Z"

  severity:
    predicted: high
    confidence: 0.87
    reasoning: "Timing violation on safety-relevant function at boundary temp"
    human_override: null

  problem_type:
    predicted: timing
    confidence: 0.91
    reasoning: "Keywords 'timing', 'exceeds', 'ms requirement' with HIL context"
    human_override: null

  component:
    predicted: DoorLockControl
    confidence: 0.85
    reasoning: "Test case SWE-QT-BCM-005 mapped to DoorLockControl module"
    alternatives:
      - component: GPIODriver
        confidence: 0.62
      - component: ActuatorControl
        confidence: 0.41
    human_override: null

  priority:
    assigned: P1
    rule_trace: "severity=high AND release_phase=integration AND customer_facing=true -> P1"
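
The deterministic priority assignment traced in `rule_trace` above can be sketched as a small Python function. The rules shown are an illustrative subset assumed for this example, not a complete project policy.

```python
def assign_priority(severity, release_phase, customer_facing):
    """Deterministic priority rules (illustrative subset).

    Mirrors the rule_trace format above; a real rule engine would
    enumerate all severity/phase/impact combinations.
    """
    if severity == "critical":
        return "P1"
    if severity == "high" and (release_phase == "integration" or customer_facing):
        return "P1"
    if severity == "high":
        return "P2"
    if severity == "medium":
        return "P3"
    return "P4"
```

Because the rule set is deterministic, the full decision path can be logged verbatim as the `rule_trace`, which is why this dimension carries no confidence score.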

Duplicate Detection

NLP-Based Duplicate Problem Report Detection

Duplicate problem reports waste investigation effort and obscure true defect counts. The duplicate detection engine compares each incoming report against all open and recently closed reports using a two-stage pipeline.

Stage 1 -- Candidate Retrieval: A fast retrieval model (TF-IDF or BM25) identifies the top-50 most textually similar existing reports.

Stage 2 -- Semantic Re-ranking: A cross-encoder transformer model re-ranks candidates by semantic similarity, accounting for paraphrasing and domain synonyms (e.g., "timing violation" vs. "deadline exceeded").
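
After stage-2 re-ranking, each candidate is assigned a verdict against the similarity thresholds used in this section. A minimal sketch, with hypothetical cross-encoder scores:

```python
# Similarity thresholds: >= 0.92 definite duplicate, 0.75-0.91 potential
DEFINITE_DUP = 0.92
POTENTIAL_DUP = 0.75

def triage_candidates(scored):
    """Assign a verdict to each re-ranked candidate.

    `scored` is a list of (report_id, similarity) pairs as produced
    by the stage-2 cross-encoder. Even definite duplicates still
    require human confirmation before any merge.
    """
    verdicts = []
    for report_id, sim in scored:
        if sim >= DEFINITE_DUP:
            verdicts.append((report_id, "definite_duplicate"))
        elif sim >= POTENTIAL_DUP:
            verdicts.append((report_id, "potential_duplicate"))
        else:
            verdicts.append((report_id, "not_duplicate"))
    return verdicts

verdicts = triage_candidates([("PR-2025-038", 0.68), ("PR-2025-029", 0.54)])
```

Sub-threshold candidates with similar symptom patterns can still be linked as RELATED for cross-project learning, as the worked example below shows.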

| Detection Parameter | Value | Rationale |
|---|---|---|
| Similarity threshold (definite duplicate) | >= 0.92 | Reports above this threshold are auto-linked as duplicates pending human confirmation |
| Similarity threshold (potential duplicate) | 0.75 - 0.91 | Reports in this range are flagged for human review |
| Search scope | All open reports + reports closed within last 180 days | Balances recall against false positives from aged reports |
| Feature inputs | Title, description, component, failure mode, test environment | Multi-field comparison reduces false matches from generic descriptions |
| Retraining trigger | Precision drops below 85% on monthly validation set | Ensures model adapts to evolving project vocabulary |

Duplicate Detection Report:
----------------------------
Incoming: PR-2025-042 "Door lock timing exceeds requirement at cold temperature"

Candidate 1: PR-2025-038 (similarity: 0.68) -- NOT DUPLICATE
  Title: "Window motor timing exceeds spec at -40C"
  Verdict: Different component (WindowControl vs DoorLockControl),
           similar symptom pattern. Linked as RELATED, not duplicate.

Candidate 2: PR-2025-029 (similarity: 0.54) -- NOT DUPLICATE
  Title: "Door lock fails to engage intermittently"
  Verdict: Same component but different failure mode
           (functional vs timing). No link.

Candidate 3: PR-2024-089 (similarity: 0.71) -- RELATED
  Title: "Seat motor response delay at cold temperature"
  Verdict: Different project, similar root cause pattern.
           Linked as RELATED for cross-project learning.

Result: No duplicate found. PR-2025-042 confirmed as NEW.
Related issues linked: PR-2025-038, PR-2024-089

Human Confirmation Required: Even when the AI declares a definite duplicate (similarity >= 0.92), a human must confirm the linkage before the incoming report is merged. Auto-closure of duplicates without human review is prohibited in safety-relevant projects.


Resolution Recommendation

AI-Suggested Fixes Based on Historical Resolutions

Once a root cause category is established, the resolution recommendation engine searches the historical resolution database for verified fixes applied to similar problems. Recommendations are ranked by relevance, success rate, and applicability to the current project context.

| Recommendation Factor | Weight | Description |
|---|---|---|
| Root cause similarity | 0.35 | How closely the root cause of the historical problem matches the current one |
| Component similarity | 0.25 | Whether the fix was applied to the same or analogous component |
| Resolution success rate | 0.20 | Percentage of times this fix type resolved the problem without recurrence |
| Project context match | 0.10 | Same MCU family, same RTOS, similar safety level |
| Recency | 0.10 | More recent resolutions weighted higher to reflect current codebase state |

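
Assuming each factor is first normalized to a 0..1 score, the ranking reduces to a weighted sum with the weights above. The candidate values below are hypothetical.

```python
# Factor weights from the table above (sum to 1.0)
WEIGHTS = {
    "root_cause_similarity": 0.35,
    "component_similarity": 0.25,
    "success_rate": 0.20,
    "context_match": 0.10,
    "recency": 0.10,
}

def relevance_score(factors):
    """Weighted relevance for one historical resolution candidate.

    `factors` maps each factor name to a normalized 0..1 value.
    """
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

# Hypothetical normalized scores for one candidate fix
candidate = {
    "root_cause_similarity": 0.90,
    "component_similarity": 0.80,
    "success_rate": 1.00,
    "context_match": 0.70,
    "recency": 0.60,
}
score = relevance_score(candidate)  # 0.315 + 0.20 + 0.20 + 0.07 + 0.06 = 0.845
```

Candidates are then sorted by this score to produce the ranked recommendations shown in the output below.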
# Resolution Recommendation Output (illustrative)
resolution_recommendations:
  problem_id: PR-2025-042
  root_cause_category: "temperature-dependent-timing-degradation"

  recommendations:
    - rank: 1
      approach: "Temperature-dependent drive strength adjustment"
      confidence: 0.88
      source_problems: [PR-2023-156, PR-2024-089]
      success_rate: "2/2 (100%) -- no recurrence in 12 months"
      estimated_effort: "16-24 hours"
      risk: low
      description: |
        Implement conditional GPIO drive strength configuration
        based on temperature sensor reading. Switch to high drive
        strength below a calibrated threshold (typically -20C).
      files_likely_affected:
        - "src/driver/gpio_driver.c"
        - "config/gpio_config.c"

    - rank: 2
      approach: "Timing margin increase via requirement relaxation"
      confidence: 0.52
      source_problems: [PR-2022-234]
      success_rate: "1/1 (100%) -- but different context"
      estimated_effort: "8-12 hours"
      risk: medium
      description: |
        Negotiate requirement relaxation from 10ms to 12ms at
        extreme cold temperatures. Requires customer agreement
        and safety impact analysis.
      files_likely_affected:
        - "docs/requirements/SWE-BCM-103.md"

    - rank: 3
      approach: "Hardware modification -- driver IC substitution"
      confidence: 0.31
      source_problems: []
      success_rate: "N/A -- inferred from datasheet analysis"
      estimated_effort: "80-120 hours (HW change)"
      risk: high
      description: |
        Replace current motor driver IC with automotive-grade
        part rated for extended cold temperature operation.
        Requires HW redesign and requalification.

  ai_recommendation: |
    Rank 1 approach (temperature-dependent drive strength) is
    strongly recommended based on 100% historical success rate,
    low implementation risk, and direct applicability to the
    current DoorLockControl architecture.

  human_action_required:
    - "Review recommended approach for technical feasibility"
    - "Approve resolution strategy before implementation"
    - "Create change request CR linked to this problem report"

Trend Analysis

Predicting Future Problems from Patterns

Trend analysis moves problem resolution from reactive firefighting to proactive prevention. The AI trend engine operates on three time horizons.

| Time Horizon | Technique | Output | Action Trigger |
|---|---|---|---|
| Short-term (1-4 weeks) | Moving average of daily problem inflow by category; spike detection via z-score | Alert when problem inflow exceeds 2 standard deviations above rolling mean | Immediate investigation of spike category |
| Medium-term (1-3 months) | Regression analysis of defect density per component over release cycles | Predicted defect count per component for next release | Targeted code review and testing for high-density components |
| Long-term (6-12 months) | Seasonal decomposition + ARIMA forecasting of problem volumes by type | Forecasted problem load for capacity planning | Staff allocation adjustments, process improvement initiatives |
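
The short-term spike rule (inflow more than two standard deviations above the rolling mean) can be sketched with the standard library; the weekly counts below are hypothetical.

```python
import statistics

def spike_alert(weekly_counts, k=2.0):
    """Flag the latest week if inflow exceeds k standard deviations
    above the mean of the preceding weeks (the short-term z-score
    rule above)."""
    history, latest = weekly_counts[:-1], weekly_counts[-1]
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        return latest > mean, None  # degenerate case: flat history
    z = (latest - mean) / sd
    return z > k, round(z, 2)

# Hypothetical weekly inflow for one category; the last week spikes
alert, z = spike_alert([8, 10, 9, 11, 10, 22])
```

A triggered alert initiates immediate investigation of the spiking category, per the action trigger in the table.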

Trend Indicators and Thresholds

| Trend Indicator | Green | Yellow | Red |
|---|---|---|---|
| Problem inflow rate (per week) | Stable or declining | Rising > 15% week-over-week for 2 consecutive weeks | Rising > 30% or absolute count exceeds capacity threshold |
| Recurrence rate (same root cause) | < 5% | 5-15% | > 15% -- systemic fix ineffective |
| Mean age of open problems | < 10 days | 10-20 days | > 20 days -- resolution bottleneck |
| Component concentration (top component share) | < 30% of total problems | 30-50% | > 50% -- single component driving majority of problems |
| Escape rate (problems found post-release) | < 2% of total problems | 2-5% | > 5% -- verification process gap |

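
Mapping a measured indicator onto these bands is a plain threshold check; the sketch below covers the recurrence-rate row, and the other indicators follow the same pattern with their own thresholds.

```python
def recurrence_status(rate_pct):
    """Green/yellow/red band for the recurrence-rate indicator.

    Thresholds from the table above: < 5% green, 5-15% yellow,
    > 15% red (systemic fix ineffective).
    """
    if rate_pct < 5:
        return "green"
    if rate_pct <= 15:
        return "yellow"
    return "red"
```

Red-level indicators auto-escalate to project management, as defined in the HITL protocol later in this section.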
Trend Analysis Report (Q4 2024):
---------------------------------
Analysis period: October - December 2024
Total problems: 147 (vs 112 in Q3 -- +31%)

Category Breakdown:
  Timing:        42 (29%) -- UP from 18% in Q3 [ALERT]
  Functional:    38 (26%) -- stable
  Interface:     29 (20%) -- stable
  Resource:      21 (14%) -- DOWN from 19%
  Safety:        17 (12%) -- stable

Component Hot Spots:
  DoorLockControl:   34 problems (23%) [ALERT -- concentration]
  WindowControl:     22 problems (15%)
  SeatController:    19 problems (13%)

Prediction for Q1 2025:
  Expected problem count: 155-175 (continued upward trend)
  Highest risk category: Timing (predicted 35-40% share)
  Highest risk component: DoorLockControl

Recommended Preventive Actions:
  1. Conduct timing margin review across all DoorLockControl functions
  2. Add temperature boundary tests to CI pipeline
  3. Schedule architectural review of timing-critical paths
  4. Increase code review depth for timing-related changes

Confidence: 76%

Knowledge Base Building

Automatic Knowledge Extraction from Resolved Problems

Every resolved and closed problem report contains valuable engineering knowledge. The knowledge extraction engine automatically distills lessons learned, reusable fix patterns, and design guidelines from closed reports and organizes them into a searchable knowledge base.

| Extraction Type | Method | Knowledge Base Article Structure |
|---|---|---|
| Root Cause Pattern | Cluster resolved problems by root cause category; extract common causal chain | Title, symptom pattern, causal mechanism, affected component types, verification approach |
| Fix Pattern | Group successful resolutions by approach type; generalize into reusable templates | Fix category, applicability conditions, implementation steps, expected effort, success rate |
| Design Guideline | Aggregate lessons learned from multiple related problems into preventive rules | Guideline statement, rationale, applicability scope, source problem IDs |
| Test Gap | Identify problem categories not caught by existing test suites; generate test recommendations | Gap description, recommended test type, priority, estimated coverage improvement |
| Environmental Sensitivity | Detect problems correlated with specific environmental conditions (temperature, voltage, load) | Environmental factor, affected parameters, safe operating range, boundary test recommendations |

# Auto-Generated Knowledge Base Article (illustrative)
knowledge_article:
  id: KB-2025-017
  title: "Temperature-Dependent Timing Degradation in Motor Drivers"
  generated_from: [PR-2025-042, PR-2024-089, PR-2023-156, PR-2022-234]
  auto_generated: true
  human_reviewed: true
  reviewed_by: "SW Architect"
  review_date: "2025-02-01"

  symptom_pattern: |
    Actuator or motor timing measurements exceed requirements at
    temperatures below -20C. Degradation is proportional to
    temperature drop and affects all output stages using standard
    MOSFET drivers without temperature compensation.

  root_cause_mechanism: |
    Semiconductor carrier mobility decreases at low temperatures,
    increasing transistor turn-on and turn-off times. For MOSFET
    drivers, Rds(on) temperature coefficient is positive but
    switching time coefficient is negative, leading to slower
    edge rates at cold extremes.

  recommended_design_practice: |
    For all timing-critical output stages:
    1. Include temperature sensor reading in driver configuration
    2. Implement at least two drive strength levels (standard / high)
    3. Define temperature threshold for switching (typically -20C)
    4. Add timing margin analysis at -40C during design review
    5. Include cold temperature boundary tests in verification plan

  applicability:
    domains: [body-control, seat-control, window-control, mirror-control]
    mcu_families: [all]
    safety_levels: [QM, ASIL-A, ASIL-B]

  related_articles: [KB-2024-008, KB-2023-042]
  tags: [timing, temperature, motor-driver, GPIO, cold-start]

Knowledge Base Maintenance: Articles are automatically flagged for review when new problem reports match the article pattern but the documented fix did not prevent recurrence. This feedback loop ensures the knowledge base remains current and trustworthy.



HITL Protocol for Problem Resolution Decisions

Human-in-the-Loop Decision Points

Every AI action in the problem resolution process has a defined HITL gate. The table below specifies which decisions AI can make autonomously, which require human confirmation, and which are exclusively human.

| Decision Point | AI Authority | Human Authority | Escalation Rule |
|---|---|---|---|
| Problem recording | AI may auto-create report from test failure logs | Human confirms report is valid and not a test environment issue | If AI confidence < 0.70, flag for manual triage |
| Classification (severity) | AI assigns severity when confidence >= 0.80 | Human reviews all critical/safety-related severity assignments | All ASIL-related problems require human classification regardless of AI confidence |
| Duplicate detection | AI links definite duplicates (>= 0.92 similarity) for confirmation | Human confirms or rejects all duplicate links | Auto-merged duplicates prohibited; human confirmation mandatory |
| Root cause suggestion | AI provides ranked suggestions with confidence scores | Human validates root cause through investigation | If no suggestion exceeds 0.50 confidence, report is queued for senior engineer |
| Resolution recommendation | AI recommends approaches ranked by historical success | Human selects approach and approves implementation plan | Safety-related fixes require safety manager approval |
| Closure | AI verifies closure criteria (tests passed, regression clean, traceability updated) | Human approves closure and signs off on verification evidence | Closure without human sign-off is blocked by tooling |
| Trend escalation | AI generates trend alerts automatically | Human decides whether to initiate preventive action | Red-level trends auto-escalate to project management |

Guiding Principle: Humans own decisions; AI accelerates analysis. No problem report may be closed, reclassified to a lower severity, or marked as duplicate without explicit human approval. AI outputs are advisory inputs to human decision-making, not autonomous actions.

Override and Feedback Protocol

HITL Override Workflow:
-----------------------

1. AI presents classification / RCA / resolution recommendation
2. Engineer reviews AI output:
   a. ACCEPT  --> AI assignment stands; logged as confirmed
   b. MODIFY  --> Engineer adjusts; delta logged as training feedback
   c. REJECT  --> Engineer provides correct value; logged as override

3. All overrides stored in feedback database:
   - Problem ID
   - AI prediction (field, value, confidence)
   - Human decision (field, value, rationale)
   - Timestamp, engineer ID

4. Quarterly model retraining incorporates override data
5. Override rate monitored per classification dimension:
   - Target: < 15% override rate
   - Action trigger: > 25% override rate initiates model review
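
Step 5 of the workflow above reduces to a rate check against the 15% target and the 25% review trigger. A minimal sketch (function name hypothetical):

```python
def override_review_needed(overrides, total, target=0.15, trigger=0.25):
    """Apply the override-rate thresholds above to one dimension.

    Returns the rate and the resulting action: at or below the 15%
    target is fine, above the 25% trigger initiates a model review,
    and the range in between is monitored.
    """
    rate = overrides / total if total else 0.0
    if rate > trigger:
        action = "initiate_model_review"
    elif rate > target:
        action = "monitor"
    else:
        action = "ok"
    return round(rate, 3), action
```

Run per classification dimension after each reporting period, this check feeds the quarterly retraining decision.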

Problem Metrics Dashboard

The diagram below shows the problem resolution dashboard, presenting open/closed problem counts, resolution time trends, severity distribution, and aging analysis.

Problem Resolution Dashboard


Metrics and KPIs

Core Problem Resolution Metrics

| Metric | Definition | Target | Measurement Method |
|---|---|---|---|
| MTTR (Mean Time to Resolve) | Average elapsed time from problem report creation to verified closure | < 5 business days for P1; < 10 for P2; < 20 for P3 | Calculated from issue tracker timestamps |
| MTTRC (Mean Time to Root Cause) | Average elapsed time from report creation to confirmed root cause | < 2 business days for P1; < 5 for P2 | Timestamp delta: created to root_cause_confirmed |
| First-Time Fix Rate | Percentage of problems resolved without reopening | > 90% | (Problems closed once) / (Total problems closed) |
| Recurrence Rate | Percentage of problems with the same root cause as a previously resolved problem | < 5% | Root cause pattern matching against closed problems |
| Escape Rate | Percentage of problems found after release vs. total problems | < 2% for safety-relevant; < 5% overall | (Post-release problems) / (Total problems in release) |
| Problem Backlog Age | Average age of open problem reports | < 10 days | Mean of (today - created_date) for all open reports |
| Duplicate Rate | Percentage of reports identified as duplicates | < 10% (lower indicates better initial triage) | (Duplicates detected) / (Total reports submitted) |
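
As a minimal illustration of the first row, MTTR can be computed from issue tracker timestamps. The sketch below uses calendar days and hypothetical dates; a real implementation would count business days, per the stated targets.

```python
from datetime import date

def mean_time_to_resolve(reports):
    """MTTR in days over (created, closed) date pairs.

    Simplification: calendar days rather than the business days
    used by the targets, which would need a working-day calendar.
    """
    durations = [(closed - created).days for created, closed in reports]
    return sum(durations) / len(durations)

# Hypothetical P1 reports: resolved in 4 and 6 days respectively
mttr = mean_time_to_resolve([
    (date(2025, 1, 10), date(2025, 1, 14)),
    (date(2025, 1, 12), date(2025, 1, 18)),
])
```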

AI-Specific Performance Metrics

| Metric | Definition | Target | Measurement Method |
|---|---|---|---|
| RCA Suggestion Accuracy | Percentage of AI root cause suggestions confirmed as correct by human investigation | > 75% (top-1); > 90% (top-3) | (Confirmed suggestions) / (Total suggestions provided) |
| Classification Accuracy | Percentage of AI severity/type/component classifications accepted without override | > 85% per dimension | (Accepted classifications) / (Total classifications) |
| Duplicate Detection Precision | Percentage of AI-flagged duplicates confirmed as true duplicates | > 90% | (True duplicates) / (Flagged duplicates) |
| Duplicate Detection Recall | Percentage of actual duplicates caught by AI | > 80% | (Caught duplicates) / (Total actual duplicates) |
| Resolution Recommendation Relevance | Percentage of AI-recommended resolution approaches selected by engineer | > 60% (top-1); > 85% (top-3) | (Selected recommendations) / (Total recommendations) |
| Trend Prediction Accuracy | Percentage of AI trend alerts that corresponded to actual problem spikes | > 70% | (True alerts) / (Total alerts) |
| Override Rate | Percentage of AI decisions overridden by humans (lower is better after initial calibration) | < 15% per dimension | (Overrides) / (Total AI decisions) |
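
The duplicate-detection rows are simple ratios over confirmed triage outcomes; the counts below are hypothetical.

```python
def duplicate_detection_metrics(true_dups_flagged, flagged_total, actual_dups):
    """Precision and recall exactly as defined in the rows above."""
    precision = true_dups_flagged / flagged_total if flagged_total else 0.0
    recall = true_dups_flagged / actual_dups if actual_dups else 0.0
    return precision, recall

# Hypothetical month: 20 reports flagged, 18 were real duplicates,
# and 22 actual duplicates existed in total
p, r = duplicate_detection_metrics(18, 20, 22)
```

In this example precision meets the > 90% target while recall (about 82%) just clears the > 80% target.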

Calibration Period: During the first 6 months of AI deployment, accuracy targets are relaxed by 10 percentage points to allow model calibration. Monthly accuracy reviews determine when full targets apply.


Tool Integration

Issue Tracking and AI Plugin Architecture

The problem resolution process relies on tight integration between the issue tracking system and AI analysis services. The following table maps supported tools to their AI integration points.

| Tool | AI Integration Method | Supported AI Features | Configuration Notes |
|---|---|---|---|
| Jira | Atlassian Intelligence + custom webhook to AI service | Auto-classification on issue create; duplicate detection via JQL + embedding search; RCA suggestion panel; trend dashboard widget | Requires Jira Cloud Premium or Data Center with Atlassian Intelligence enabled; custom fields for AI confidence scores |
| Bugzilla | REST API webhook to external AI microservice | Classification on bug submission; duplicate search via Bug.search API + AI re-ranking; RCA suggestion as comment attachment | Webhook extension required; AI service consumes Bugzilla REST API; results posted as structured comments |
| Polarion | LiveDoc extension + external AI REST service | Embedded classification widget in work item form; traceability-aware RCA (leverages Polarion link graph); trend analysis integrated into Polarion dashboard | Requires Polarion ALM 2024 or later; extension deployed via Polarion SDK; OSLC links feed AI context |
| Azure DevOps | Azure ML endpoint + custom pipeline task | Classification via service hook on work item create; RCA via Azure Cognitive Search over historical items; trend Power BI integration | Requires Azure ML workspace; service connection configured in project settings |
| GitLab Issues | GitLab webhook + external AI service container | Classification on issue open; duplicate detection via GitLab search API + AI; RCA suggestion as issue note | AI service deployed as GitLab CI service; results posted via GitLab API |

Integration Architecture Pattern

# AI Problem Analysis Service Configuration (illustrative)
problem_analysis_service:
  name: "ai-problem-analyzer"
  version: "2.1.0"

  endpoints:
    classify:
      path: "/api/v1/classify"
      method: POST
      input: problem_report_json
      output: classification_result_json
      timeout: 10s

    detect_duplicates:
      path: "/api/v1/duplicates"
      method: POST
      input: problem_report_json
      output: duplicate_candidates_json
      timeout: 15s

    suggest_root_cause:
      path: "/api/v1/rca"
      method: POST
      input: problem_report_json
      output: rca_suggestions_json
      timeout: 30s

    recommend_resolution:
      path: "/api/v1/resolution"
      method: POST
      input: problem_with_rca_json
      output: resolution_recommendations_json
      timeout: 20s

    analyze_trends:
      path: "/api/v1/trends"
      method: GET
      params: [project, date_range, categories]
      output: trend_report_json
      timeout: 60s

  issue_tracker_webhooks:
    jira:
      event: "jira:issue_created"
      actions: [classify, detect_duplicates, suggest_root_cause]
    bugzilla:
      event: "bug.create"
      actions: [classify, detect_duplicates, suggest_root_cause]
    polarion:
      event: "workitem.created"
      actions: [classify, detect_duplicates, suggest_root_cause]

  authentication:
    method: "OAuth2 client credentials"
    token_endpoint: "https://auth.example.com/oauth/token"

  data_privacy:
    pii_scrubbing: enabled
    data_retention: "24 months for analysis; 7 years for audit trail"
    gdpr_compliance: true
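The webhook-to-endpoint mapping declared above can be sketched as a small dispatcher on the issue-tracker side. This is a minimal Python sketch, not a published API: the base URL, bearer-token handling, and payload fields are assumptions; the event names, endpoint paths, and timeouts are taken directly from the configuration.

```python
# Minimal webhook dispatcher mirroring the configuration above (illustrative).
# Event names, paths, and timeouts come from the YAML; the base URL and
# payload handling are assumptions for this sketch.
import json
from urllib import request

# Webhook event -> AI actions, as declared under issue_tracker_webhooks.
WEBHOOK_ACTIONS = {
    "jira:issue_created": ["classify", "detect_duplicates", "suggest_root_cause"],
    "bug.create": ["classify", "detect_duplicates", "suggest_root_cause"],
    "workitem.created": ["classify", "detect_duplicates", "suggest_root_cause"],
}

# Action -> (endpoint path, timeout in seconds), as declared under endpoints.
ENDPOINTS = {
    "classify": ("/api/v1/classify", 10),
    "detect_duplicates": ("/api/v1/duplicates", 15),
    "suggest_root_cause": ("/api/v1/rca", 30),
}

def plan_calls(event: str, base_url: str = "https://ai-problem-analyzer.example.com"):
    """Return the (url, timeout) pairs to invoke for an incoming webhook event."""
    return [(base_url + ENDPOINTS[a][0], ENDPOINTS[a][1])
            for a in WEBHOOK_ACTIONS.get(event, [])]

def post_problem_report(url: str, report: dict, token: str, timeout: int) -> dict:
    """POST a problem report as JSON to one AI endpoint and decode the response."""
    req = request.Request(
        url,
        data=json.dumps(report).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For a `jira:issue_created` event, `plan_calls` yields three calls (classify, duplicates, RCA) with their configured timeouts, which the tracker-side webhook handler would then execute in sequence or in parallel.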

Work Products

WP ID Work Product ASPICE Reference AI Role AI Level Human Sign-Off Required
08-27 Problem report SUP.9 BP2 Auto-classification, duplicate detection, RCA suggestion L2 Yes -- reporter confirms classification
08-28 Root cause analysis record SUP.9 BP3 Historical pattern matching, causal chain suggestion L2 Yes -- investigator validates root cause
08-29 Resolution record SUP.9 BP4-BP5 Resolution recommendation, fix template generation L2 Yes -- SW lead approves resolution approach
13-07 Problem status report SUP.9 BP6 Automated status aggregation, stale report detection L2-L3 Yes -- project manager reviews before distribution
13-26 Trend analysis report SUP.9 BP7 Statistical trend computation, anomaly detection, forecasting L2-L3 Yes -- QA lead validates and interprets
15-03 Problem resolution strategy SUP.9 BP1 Strategy template generation, parameter recommendation L1 Yes -- process owner approves
13-27 Knowledge base article Derived from SUP.9 Auto-extraction from resolved problems L2 Yes -- domain expert reviews before publication

Implementation Checklist

Phase 1: Foundation (Months 1-3)

Step Action Responsible Deliverable Status
1.1 Define problem resolution strategy aligned with ASPICE SUP.9 Process Owner Problem Resolution Plan (WP 15-03) [ ]
1.2 Configure issue tracker with structured problem report template DevOps / Tool Admin Configured Jira/Bugzilla/Polarion template [ ]
1.3 Define severity, type, and component taxonomies QA Lead + SW Architect Classification taxonomy document [ ]
1.4 Establish HITL decision matrix and approval workflows Process Owner HITL protocol document [ ]
1.5 Label historical problem reports (minimum 500) with root cause category, component, resolution type Engineering team Labeled training dataset [ ]
1.6 Define KPI targets and dashboard layout QA Lead Metrics specification [ ]
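Step 1.5 depends on a consistent record format for the labeled training data, validated against the taxonomies from step 1.3. The sketch below shows one possible schema; the field names and taxonomy values are assumptions for illustration, since the real taxonomies come from the project's classification taxonomy document.

```python
# Illustrative schema for one labeled historical problem report (step 1.5).
# Field names and taxonomy values are assumed examples; real taxonomies are
# defined in the classification taxonomy document from step 1.3.
from dataclasses import dataclass

SEVERITIES = {"critical", "major", "minor", "trivial"}  # example severity taxonomy
ROOT_CAUSE_CATEGORIES = {"requirements", "design", "coding",
                         "integration", "configuration"}  # example RCA taxonomy

@dataclass
class LabeledProblemReport:
    report_id: str
    title: str
    description: str
    severity: str             # label used to train the classifier
    component: str            # label used to train the classifier
    root_cause_category: str  # label used to train the RCA model
    resolution_type: str      # label used to train the recommender

def validate(record: LabeledProblemReport) -> list[str]:
    """Return a list of labeling errors; an empty list means the record is usable."""
    errors = []
    if record.severity not in SEVERITIES:
        errors.append(f"unknown severity: {record.severity}")
    if record.root_cause_category not in ROOT_CAUSE_CATEGORIES:
        errors.append(f"unknown root cause category: {record.root_cause_category}")
    if not record.description.strip():
        errors.append("empty description")
    return errors
```

Running such a validator over the full set of 500+ labeled reports before training catches taxonomy drift early, which is cheaper than discovering mislabeled categories after the model in step 2.4 has been trained.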

Phase 2: AI Deployment (Months 4-6)

Step Action Responsible Deliverable Status
2.1 Deploy AI classification service (severity, type, component) ML Engineer Classification microservice v1.0 [ ]
2.2 Deploy duplicate detection service ML Engineer Duplicate detection endpoint v1.0 [ ]
2.3 Integrate AI services with issue tracker via webhooks DevOps Webhook configuration, integration tests [ ]
2.4 Train RCA suggestion model on labeled historical data ML Engineer RCA model v1.0, validation report [ ]
2.5 Deploy resolution recommendation engine ML Engineer Resolution recommendation endpoint v1.0 [ ]
2.6 Build metrics dashboard (MTTR, accuracy, override rate) DevOps / QA Lead Live dashboard [ ]
2.7 Conduct pilot with 1-2 engineering teams; collect feedback QA Lead Pilot evaluation report [ ]
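The duplicate detection service from step 2.2 typically ranks historical reports by embedding similarity. The following sketch shows only the ranking core: the embedding function here is a toy bag-of-words stand-in, whereas a deployed service would use a trained sentence-embedding model as described in the tool table above.

```python
# Minimal duplicate-detection sketch (step 2.2): rank historical reports by
# cosine similarity of their text embeddings. The embedding function is a
# stand-in bag-of-words vectorizer, not a production model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lower-cased token counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def duplicate_candidates(new_report: str, history: dict[str, str],
                         threshold: float = 0.6) -> list[tuple[str, float]]:
    """Return (report_id, score) pairs above the threshold, best match first."""
    q = embed(new_report)
    scored = [(rid, cosine(q, embed(text))) for rid, text in history.items()]
    return sorted([s for s in scored if s[1] >= threshold],
                  key=lambda s: s[1], reverse=True)
```

The threshold is the key HITL tuning knob: set too low, reporters waste time dismissing false duplicates; set too high, genuine duplicates slip through to redundant investigation. Pilot feedback in step 2.7 is the natural place to calibrate it.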

Phase 3: Optimization (Months 7-12)

Step Action Responsible Deliverable Status
3.1 Retrain models with pilot feedback and override data ML Engineer Models v2.0, accuracy comparison report [ ]
3.2 Deploy trend analysis engine (short/medium/long-term) ML Engineer Trend analysis service v1.0 [ ]
3.3 Activate automatic knowledge base extraction from closed reports ML Engineer + QA Lead Knowledge base seeded with initial articles [ ]
3.4 Roll out AI-assisted problem resolution to all teams Process Owner Organization-wide deployment confirmation [ ]
3.5 Establish quarterly model retraining cadence ML Engineer Retraining schedule and automation pipeline [ ]
3.6 Conduct first ASPICE SUP.9 internal assessment with AI evidence QA Lead Assessment report demonstrating AI integration [ ]
3.7 Review and update KPI targets based on 6 months of operational data QA Lead + Process Owner Updated metrics specification [ ]
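At the heart of the trend analysis engine in step 3.2 is an anomaly rule over per-category report counts. The sketch below shows one simple variant, flagging a category whose latest weekly count exceeds the historical mean by more than two standard deviations; a deployed engine would add seasonality handling and forecasting on top of this.

```python
# Illustrative trend check (step 3.2): flag a problem category whose weekly
# report count jumps above the historical mean by more than `sigmas` standard
# deviations. A deployed engine would add seasonality handling and forecasting.
import statistics

def is_anomalous(weekly_counts: list[int], latest: int, sigmas: float = 2.0) -> bool:
    """True if `latest` exceeds mean(history) + sigmas * stdev(history)."""
    if len(weekly_counts) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.mean(weekly_counts)
    stdev = statistics.stdev(weekly_counts)
    return latest > mean + sigmas * stdev
```

A flagged category is an escalation trigger for the QA lead (per O6 and BP7), not an automatic conclusion: the human reviewer decides whether the spike reflects a genuine emerging problem cluster or, for example, a reporting-process change.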

Continuous Improvement: After Phase 3, the implementation enters an ongoing improvement cycle. Quarterly reviews assess model accuracy, override rates, and KPI trends; annual process audits verify that AI integration continues to satisfy the ASPICE SUP.9 outcomes.


Summary

SUP.9 Problem Resolution:

  • AI Level: L2 (AI analysis, human validation)
  • Primary AI Value: Root cause suggestion, pattern matching, duplicate detection, resolution recommendation
  • Human Essential: Cause validation, fix implementation, closure approval
  • Key Outputs: Problem reports, RCA records, resolution records, trend reports, knowledge base articles
  • AI Accuracy: ~78% root cause suggestion accuracy (illustrative target; calibrate based on historical data quality)
  • Critical HITL Gates: Classification override, RCA validation, duplicate confirmation, closure sign-off