4.3: Flaky Test Detection
What You'll Learn
- Understand what flaky tests are and why they are problematic.
- Learn how AI and ML techniques can be used to detect and manage flaky tests.
- Explore practical tools and CI/CD integration patterns for flaky test detection.
15.05.1 Overview
In safety-critical systems governed by ASPICE 4.0 and ISO 26262, flaky tests directly undermine evidence integrity and traceability requirements. Undetected flakiness compromises test certification and creates liability exposure during audits.
What Are Flaky Tests?
Flaky tests are tests that exhibit non-deterministic behavior, passing or failing without code changes. They represent a significant threat to verification processes required by:
| Standard | Impact of Flaky Tests |
|---|---|
| ASPICE SWE.4/5/6 | Invalidates unit, integration, and qualification test evidence |
| ISO 26262-6 Clause 9 | Compromises verification of safety requirements |
| ISO 26262-8 Clause 11 | Undermines tool confidence level (TCL) claims |
| DO-178C Objective A-7 | Prevents "Testing of High-Level Requirements" compliance |
ML techniques can identify and categorize flaky tests, protecting evidence integrity.
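To make the idea concrete, a short simulation shows how an intermittently failing test looks over repeated runs, scored with the same fail_rate × (1 − fail_rate) × 4 heuristic used by the detector examples below. The failure probability and run count are arbitrary illustration values:

```python
import random


def simulate_test_history(failure_probability, runs, seed=42):
    """Simulate pass/fail outcomes for a test with a fixed failure probability."""
    rng = random.Random(seed)  # Seeded so the simulation is reproducible
    return ['failed' if rng.random() < failure_probability else 'passed'
            for _ in range(runs)]


def flake_score(outcomes):
    """Heuristic flakiness score: 0 for stable tests, peaks at 1.0
    when exactly half the runs fail."""
    fail_rate = outcomes.count('failed') / len(outcomes)
    if 0.01 < fail_rate < 0.99:
        return fail_rate * (1 - fail_rate) * 4
    return 0.0


history = simulate_test_history(failure_probability=0.3, runs=100)
score = flake_score(history)
print(f"Failures: {history.count('failed')}/100, flake score: {score:.2f}")
```

A consistently failing test (a genuine bug) scores 0 here, which is the point: the heuristic separates non-determinism from plain breakage.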
15.05.2 Tools and Frameworks
Google's Flaky Test Detection
Approach:
- Analyzes test execution history
- Statistical modeling of failure patterns
- Confidence scoring for flakiness
Implementation:
```python
# Example using historical test data
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class FlakyTestDetector:
    """
    Note: The flake_score calculation is a simplified heuristic.
    Calibrate thresholds based on your project's historical data and
    acceptable false positive/negative rates.
    """

    def __init__(self):
        self.model = RandomForestClassifier(n_estimators=100)

    def extract_features(self, test_history):
        """Extract features from a test execution history DataFrame."""
        features = {
            'failure_rate': test_history['failed'].mean(),
            'pass_fail_alternations': self._count_alternations(test_history),
            'time_variance': test_history['duration'].std(),
            'flake_score': self._calculate_flake_score(test_history)
        }
        return features

    def _count_alternations(self, history):
        """Count how often the test result changes between runs."""
        changes = 0
        for i in range(1, len(history)):
            # Positional access via .iloc; label-based indexing would break
            # on DataFrames with non-default indices
            if history['status'].iloc[i] != history['status'].iloc[i - 1]:
                changes += 1
        return changes

    def _calculate_flake_score(self, history):
        """Calculate a flakiness score in [0, 1]."""
        # Tests that fail occasionally but not consistently
        fail_rate = history['failed'].mean()
        if 0.01 < fail_rate < 0.99:
            return fail_rate * (1 - fail_rate) * 4  # Peaks at fail_rate = 0.5
        return 0
```
Official Resource: https://testing.googleblog.com/
Microsoft's Flakiness Detection (Azure Test Plans)
Features:
- Automatic flaky test detection in Azure DevOps
- Historical analysis of test runs
- Flakiness percentage calculation
- Integration with Azure Pipelines
Configuration:
```yaml
# Azure Pipeline with flaky test detection
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.9'
  - script: |
      pip install pytest pytest-azurepipelines pytest-repeat
      pytest --junitxml=test-results.xml --count=5
    displayName: 'Run tests with repetition'
  - task: PublishTestResults@2
    inputs:
      testResultsFiles: 'test-results.xml'
      testRunTitle: 'Flaky Test Detection'
      publishRunAttachments: true
# Note: repetition uses the pytest-repeat plugin's --count option.
# Azure DevOps flaky test detection itself is enabled per project under
# Project Settings > Test management, not via a task input.
```
DeflakeML (Open Source)
Overview: Machine learning-based flaky test predictor
Features:
- Predicts flakiness before test execution
- Uses code metrics and test history
- Provides confidence scores
Installation and Usage:
```bash
# Install
pip install deflakeml

# Train model on historical data
deflakeml train --data test_history.csv --output model.pkl

# Predict flakiness
deflakeml predict --model model.pkl --test-suite ./tests/
```
GitHub: https://github.com/TestingResearch/DeflakeML
Facebook's Predictive Test Selection
Approach:
- Predicts which tests are likely to be flaky
- Uses code change analysis and test history
- Integrates with continuous testing infrastructure
Key Techniques:
- Bayesian inference for flakiness probability
- Time-series analysis of test results
- Code coverage correlation
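The Bayesian idea can be sketched with a Beta-Bernoulli model: treat each run as a draw with an unknown failure probability and update a Beta prior with observed passes and failures. This is a simplified illustration of the technique, not Facebook's actual implementation; the prior parameters are arbitrary:

```python
def posterior_failure_rate(failures, runs, alpha=1.0, beta=1.0):
    """
    Beta-Bernoulli update: with a Beta(alpha, beta) prior on the failure
    probability, observing `failures` out of `runs` gives a
    Beta(alpha + failures, beta + runs - failures) posterior.
    Returns the posterior mean failure rate.
    """
    return (alpha + failures) / (alpha + beta + runs)


# A test that failed 3 times in 10 runs, under a uniform Beta(1, 1) prior
mean = posterior_failure_rate(failures=3, runs=10)
print(f"Posterior mean failure rate: {mean:.3f}")  # (1 + 3) / (2 + 10) = 0.333
```

The prior keeps estimates sane for tests with few recorded runs: with no data at all, the uniform prior yields 0.5 rather than a division-by-zero or an overconfident 0% or 100%.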
15.05.3 ML Techniques for Flaky Test Detection
Supervised Learning Approach
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


class FlakyTestClassifier:
    """
    ML model to classify tests as flaky or stable.

    Note: GradientBoostingClassifier hyperparameters are reasonable defaults.
    For production use, tune parameters via cross-validation on your dataset.
    """

    def __init__(self):
        self.model = GradientBoostingClassifier(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=5
        )

    def prepare_features(self, test_data):
        """Build a feature matrix from per-test metric dictionaries."""
        features = []
        for test in test_data:
            feature_vector = [
                test['execution_time_variance'],
                test['failure_rate'],
                test['consecutive_passes'],
                test['consecutive_failures'],
                test['dependency_count'],
                test['uses_threading'],
                test['uses_network'],
                test['uses_filesystem'],
                test['test_age_days'],
                test['code_churn_rate']
            ]
            features.append(feature_vector)
        return np.array(features)

    def train(self, test_data, labels):
        """Train the classifier and report held-out performance."""
        X = self.prepare_features(test_data)
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.2, random_state=42
        )
        self.model.fit(X_train, y_train)
        predictions = self.model.predict(X_test)
        print(classification_report(y_test, predictions))
        return self.model

    def predict_flaky(self, test_data):
        """Predict the probability that each test is flaky."""
        X = self.prepare_features(test_data)
        probabilities = self.model.predict_proba(X)
        return probabilities[:, 1]  # Probability of being flaky
```
Feature Engineering for Flaky Detection
Common features used in ML models:
- Execution Metrics: Duration variance, timeout frequency
- Code Characteristics: Threading, async operations, external dependencies
- Historical Patterns: Failure rate, pass/fail transitions
- Environmental Factors: Time of day effects, resource contention
- Dependency Analysis: Shared state, execution order dependencies
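Several of these features can be computed directly from a test's run history. A minimal pure-Python sketch (the field names are illustrative, not a standard schema):

```python
from statistics import pstdev


def extract_history_features(outcomes, durations):
    """Compute simple flakiness features from one test's run history.

    outcomes:  list of 'passed'/'failed' strings, oldest run first
    durations: list of run durations in seconds, same order
    """
    # A pass/fail transition is any run whose result differs from the previous run
    transitions = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return {
        'failure_rate': outcomes.count('failed') / len(outcomes),
        'pass_fail_transitions': transitions,
        'duration_stddev': pstdev(durations),  # population std dev of run times
    }


features = extract_history_features(
    ['passed', 'failed', 'passed', 'passed', 'failed'],
    [1.0, 1.2, 1.0, 1.1, 1.3],
)
print(features)
```

Code characteristics (threading, network, filesystem use) typically require static analysis of the test source rather than history mining, which is why tools combine both signal types.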
15.05.4 CI/CD Integration for Flaky Detection
Jenkins Configuration
```groovy
pipeline {
    agent any
    stages {
        stage('Test with Flaky Detection') {
            steps {
                script {
                    // Run tests multiple times
                    def results = []
                    for (int i = 0; i < 5; i++) {
                        def result = sh(
                            // Double quotes so Groovy interpolates ${i};
                            // single-quoted Groovy strings do not interpolate
                            script: "pytest --json-report --json-report-file=results_${i}.json",
                            returnStatus: true
                        )
                        results.add(result)
                    }
                    // Analyze for flakiness. The heredoc body is kept
                    // flush-left so the EOF terminator is recognized.
                    sh '''
python3 << EOF
import json
import glob
from collections import defaultdict

# Expected JSON schema (pytest --json-report format):
# {
#   "tests": [
#     {"nodeid": "test_module::test_name", "outcome": "passed", "duration": 0.1},
#     ...
#   ]
# }

# Aggregate results from multiple runs
test_outcomes = defaultdict(list)
for file in sorted(glob.glob('results_*.json')):
    try:
        with open(file) as f:
            data = json.load(f)
        # Handle pytest-json-report format (list of tests)
        tests = data.get('tests', [])
        if isinstance(tests, list):
            for test in tests:
                test_name = test.get('nodeid', test.get('name', 'unknown'))
                outcome = test.get('outcome', 'unknown')
                test_outcomes[test_name].append(outcome)
    except (json.JSONDecodeError, KeyError) as e:
        print(f"Warning: Could not parse {file}: {e}")
        continue

# Identify flaky tests
flaky_tests = []
for test_name, outcomes in test_outcomes.items():
    if len(set(outcomes)) > 1:  # Different outcomes across runs
        flaky_tests.append({
            'test': test_name,
            'outcomes': outcomes,
            'flake_rate': outcomes.count('failed') / len(outcomes)
        })

if flaky_tests:
    print(f"Found {len(flaky_tests)} flaky tests:")
    for test in flaky_tests:
        print(f"  - {test['test']}: {test['outcomes']}")
else:
    print("No flaky tests detected.")
EOF
'''
                }
            }
        }
    }
}
```
15.05.5 Effectiveness and ROI
Research Findings (reported figures vary by study and codebase):
- ML-based flaky detection accuracy: 80-95%
- False positive rate: 5-15%
- Time to identify flaky tests: reduced by 70-90%
- CI/CD stability: 40-60% fewer false failures
ROI Metrics:
- Developer time saved: 15-25% (less time investigating false failures)
- CI/CD reliability: 50-70% improvement
- Build time: 20-30% reduction (by quarantining flaky tests)
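The developer-time figure can be turned into a rough back-of-the-envelope estimate. Every input below is an illustrative assumption for your own calculation, not measured data:

```python
def wasted_minutes_per_week(false_failures_per_week, minutes_per_triage):
    """Estimate developer time lost to investigating false CI failures."""
    return false_failures_per_week * minutes_per_triage


# Illustrative assumptions: 20 false failures/week, 15 minutes of triage each
before = wasted_minutes_per_week(20, 15)
# Assume flaky detection halves false failures (the midpoint of the range above)
after = wasted_minutes_per_week(20 * (1 - 0.5), 15)
print(f"Triage time: {before} -> {after:.0f} minutes/week")
```

Plugging in your own CI failure counts and triage times gives a defensible baseline before committing to tooling investment.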
15.05.6 Implementation Examples
Python Script: detect_flaky_tests.py
```python
# scripts/detect_flaky_tests.py
import argparse
import json
import sys
from collections import defaultdict
from pathlib import Path


class FlakyTestDetector:
    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.test_results = defaultdict(list)

    def add_results(self, result_file):
        """
        Add test results from a JSON file.

        Expected JSON schema (pytest --json-report format):
        {
          "tests": [
            {"nodeid": "test_module::test_name", "outcome": "passed", "duration": 0.1},
            ...
          ]
        }
        """
        with open(result_file) as f:
            data = json.load(f)
        tests = data.get('tests', [])
        if not isinstance(tests, list):
            raise ValueError(f"Expected 'tests' to be a list, got {type(tests)}")
        for test in tests:
            # Handle different test result formats
            test_name = test.get('nodeid') or test.get('name', 'unknown')
            outcome = test.get('outcome', 'unknown')  # 'passed', 'failed', 'skipped'
            self.test_results[test_name].append({
                'outcome': outcome,
                'duration': test.get('duration', 0)
            })

    def analyze(self):
        """Analyze test results for flakiness."""
        flaky_tests = []
        for test_name, results in self.test_results.items():
            if len(results) < 2:
                continue
            outcomes = [r['outcome'] for r in results]
            if len(set(outcomes)) > 1:
                # Test has inconsistent results
                failure_rate = outcomes.count('failed') / len(outcomes)
                # Only consider flaky if not always failing/passing
                if 0 < failure_rate < 1:
                    flake_score = failure_rate * (1 - failure_rate) * 4
                    if flake_score >= self.threshold:
                        flaky_tests.append({
                            'name': test_name,
                            'flake_rate': failure_rate,
                            'flake_score': flake_score,
                            'total_runs': len(results),
                            'failures': outcomes.count('failed'),
                            'passes': outcomes.count('passed'),
                            'outcomes': outcomes
                        })
        # Sort by flake score, most flaky first
        flaky_tests.sort(key=lambda x: x['flake_score'], reverse=True)
        return flaky_tests


def main():
    parser = argparse.ArgumentParser(description='Detect flaky tests')
    parser.add_argument('--results', nargs='+', help='Result JSON files (globs allowed)')
    parser.add_argument('--output', default='flaky_tests.json', help='Output file')
    parser.add_argument('--threshold', type=float, default=0.2, help='Flake threshold')
    args = parser.parse_args()

    detector = FlakyTestDetector(threshold=args.threshold)

    # Load all result files with error handling
    files_loaded = 0
    for result_file in args.results:
        # Path.glob only yields existing files, so no separate existence check is needed
        for path in Path('.').glob(result_file):
            try:
                print(f"Loading {path}")
                detector.add_results(path)
                files_loaded += 1
            except json.JSONDecodeError as e:
                print(f"Error: Malformed JSON in {path}: {e}")
            except Exception as e:
                print(f"Error loading {path}: {e}")

    if files_loaded == 0:
        print("Error: No valid result files found. Exiting.")
        return 1

    # Analyze
    flaky = detector.analyze()

    # Save results
    with open(args.output, 'w') as f:
        json.dump(flaky, f, indent=2)

    # Print summary
    print(f"\nFound {len(flaky)} flaky tests:")
    for test in flaky[:10]:  # Top 10
        print(f"  {test['name']}")
        print(f"    Flake rate: {test['flake_rate']:.1%}")
        print(f"    Score: {test['flake_score']:.2f}")
        print(f"    Runs: {test['total_runs']} ({test['failures']}F / {test['passes']}P)")
    return 0


if __name__ == '__main__':
    sys.exit(main())
```
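Once the detector's report exists, quarantining can be as simple as maintaining a skip list derived from it. A minimal sketch of that consumer side; the helper names and threshold are hypothetical, not part of the script above:

```python
import json


def load_quarantine(flaky_report='flaky_tests.json', min_score=0.5):
    """Build a quarantine set from the detector's JSON report.

    Tests at or above min_score are candidates for skipping in CI
    until their root cause is fixed.
    """
    with open(flaky_report) as f:
        flaky = json.load(f)
    return {t['name'] for t in flaky if t['flake_score'] >= min_score}


def filter_tests(test_ids, quarantined):
    """Split collected test ids into runnable and quarantined groups."""
    run = [t for t in test_ids if t not in quarantined]
    skip = [t for t in test_ids if t in quarantined]
    return run, skip
```

In a pytest setup this set could feed a `--deselect` list or a skip marker applied during collection; the key design point is that quarantine decisions stay versioned data, reviewable like any other change.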
GitHub Actions Integration
```yaml
flaky-test-detection:
  name: Flaky Test Detection
  needs: unit-tests
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: ${{ env.PYTHON_VERSION }}
    - name: Install dependencies
      run: pip install pytest pytest-json-report
    - name: Run tests multiple times
      run: |
        for i in {1..5}; do
          pytest tests/ --json-report --json-report-file=results_$i.json || true
        done
    - name: Analyze for flaky tests
      run: |
        python scripts/detect_flaky_tests.py \
          --results results_*.json \
          --output flaky_tests.json \
          --threshold 0.2
    - name: Comment on PR if flaky tests found
      if: github.event_name == 'pull_request'
      uses: actions/github-script@v6
      with:
        script: |
          const fs = require('fs');
          const flaky = JSON.parse(fs.readFileSync('flaky_tests.json', 'utf8'));
          if (flaky.length > 0) {
            // Note: flake_rate is stored as a decimal (0.0-1.0)
            const comment = `## ⚠️ Flaky Tests Detected\n\n` +
              `The following tests showed flaky behavior:\n\n` +
              flaky.map(t => `- \`${t.name}\` (${(t.flake_rate * 100).toFixed(1)}% flake rate)`).join('\n');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
          }
```
GitLab CI Integration
```yaml
flaky-detection:
  stage: test
  extends: .python-base
  needs: [unit-tests]
  script:
    - |
      # Run tests 5 times
      for i in {1..5}; do
        pytest tests/ --json-report --json-report-file=results_$i.json || true
      done

      # Analyze for flakiness
      python scripts/detect_flaky_tests.py \
        --results results_*.json \
        --output flaky_tests.json \
        --threshold 0.2

      # Create MR comment if any flaky tests were found. The report file
      # always exists (it contains [] when clean), so check that the
      # JSON array is non-empty rather than testing file size.
      if [ "$(python -c "import json; print(len(json.load(open('flaky_tests.json'))))")" -gt 0 ]; then
        python scripts/create_mr_comment.py \
          --flaky-tests flaky_tests.json \
          --mr-iid $CI_MERGE_REQUEST_IID
      fi
  artifacts:
    paths:
      - flaky_tests.json
    expire_in: 7 days
  only:
    - merge_requests
```
Summary
Flaky tests pose a significant challenge to the reliability and efficiency of continuous integration pipelines. By leveraging AI and machine learning techniques, development teams can effectively detect, analyze, and manage these inconsistent tests. Approaches range from statistical analysis of test history to supervised learning models that predict flakiness based on various code and execution features. Integrating flaky test detection into CI/CD pipelines through custom scripts or platform-specific features (like Azure DevOps) helps automate the process, providing early warnings and enabling teams to quarantine or address flaky tests proactively. This ultimately leads to more stable and trustworthy test suites, saving developer time and improving overall release confidence.
References
- Google Testing Blog: https://testing.googleblog.com/
- TestingResearch/DeflakeML GitHub: https://github.com/TestingResearch/DeflakeML
- Relevant documentation for pytest and CI/CD platforms (GitHub Actions, GitLab CI).