4.3: Flaky Test Detection
What You'll Learn
- Understand what flaky tests are and why they are problematic.
- Learn how AI and ML techniques can be used to detect and manage flaky tests.
- Explore practical tools and CI/CD integration patterns for flaky test detection.
15.05.1 Overview
In safety-critical systems governed by ASPICE 4.0 and ISO 26262, flaky tests directly undermine evidence integrity and traceability requirements. Undetected flakiness compromises test certification and creates liability exposure during audits.
What Are Flaky Tests?
Flaky tests are tests that exhibit non-deterministic behavior, passing or failing without code changes. They represent a significant threat to verification processes required by:
| Standard | Impact of Flaky Tests |
|---|---|
| ASPICE SWE.4/5/6 | Invalidates unit, integration, and qualification test evidence |
| ISO 26262-6 Clause 9 | Compromises verification of safety requirements |
| ISO 26262-8 Clause 11 | Undermines tool confidence level (TCL) claims |
| DO-178C Objective A-7 | Prevents "Testing of High-Level Requirements" compliance |
ML techniques can identify and categorize flaky tests, protecting evidence integrity.
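To make the idea concrete, a short simulation shows how an intermittently failing test looks over repeated runs, scored with the same fail_rate × (1 − fail_rate) × 4 heuristic used by the detector examples below. The failure probability and run count are arbitrary illustration values:

```python
import random


def simulate_test_history(failure_probability, runs, seed=42):
    """Simulate pass/fail outcomes for a test with a fixed failure probability."""
    rng = random.Random(seed)  # Seeded so the simulation is reproducible
    return ['failed' if rng.random() < failure_probability else 'passed'
            for _ in range(runs)]


def flake_score(outcomes):
    """Heuristic flakiness score: 0 for stable tests, peaks at 1.0
    when exactly half the runs fail."""
    fail_rate = outcomes.count('failed') / len(outcomes)
    if 0.01 < fail_rate < 0.99:
        return fail_rate * (1 - fail_rate) * 4
    return 0.0


history = simulate_test_history(failure_probability=0.3, runs=100)
score = flake_score(history)
print(f"Failures: {history.count('failed')}/100, flake score: {score:.2f}")
```

A consistently failing test (a genuine bug) scores 0 here, which is the point: the heuristic separates non-determinism from plain breakage.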
15.05.2 Tools and Frameworks
Google's Flaky Test Detection
Approach:
- Analyzes test execution history
- Statistical modeling of failure patterns
- Confidence scoring for flakiness
Implementation:
```python
# Example using historical test data
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class FlakyTestDetector:
    """
    Note: The flake_score calculation is a simplified heuristic.
    Calibrate thresholds based on your project's historical data and
    acceptable false positive/negative rates.
    """

    def __init__(self):
        self.model = RandomForestClassifier(n_estimators=100)

    def extract_features(self, test_history):
        """Extract features from a test execution history DataFrame."""
        features = {
            'failure_rate': test_history['failed'].mean(),
            'pass_fail_alternations': self._count_alternations(test_history),
            'time_variance': test_history['duration'].std(),
            'flake_score': self._calculate_flake_score(test_history)
        }
        return features

    def _count_alternations(self, history):
        """Count how often the test result changes between runs."""
        changes = 0
        for i in range(1, len(history)):
            # Positional access via .iloc; label-based indexing would break
            # on DataFrames with non-default indices
            if history['status'].iloc[i] != history['status'].iloc[i - 1]:
                changes += 1
        return changes

    def _calculate_flake_score(self, history):
        """Calculate a flakiness score in [0, 1]."""
        # Tests that fail occasionally but not consistently
        fail_rate = history['failed'].mean()
        if 0.01 < fail_rate < 0.99:
            return fail_rate * (1 - fail_rate) * 4  # Peaks at fail_rate = 0.5
        return 0
```
Official Resource: https://testing.googleblog.com/
Microsoft's Flakiness Detection (Azure Test Plans)
Features:
- Automatic flaky test detection in Azure DevOps
- Historical analysis of test runs
- Flakiness percentage calculation
- Integration with Azure Pipelines
Configuration:
```yaml
# Azure Pipeline with flaky test detection
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.9'
  - script: |
      pip install pytest pytest-azurepipelines pytest-repeat
      pytest --junitxml=test-results.xml --count=5
    displayName: 'Run tests with repetition'
  - task: PublishTestResults@2
    inputs:
      testResultsFiles: 'test-results.xml'
      testRunTitle: 'Flaky Test Detection'
      publishRunAttachments: true
# Note: repetition uses the pytest-repeat plugin's --count option.
# Azure DevOps flaky test detection itself is enabled per project under
# Project Settings > Test management, not via a task input.
```
DeflakeML (Open Source)
Overview: Machine learning-based flaky test predictor
Features:
- Predicts flakiness before test execution
- Uses code metrics and test history
- Provides confidence scores
Installation and Usage:
```bash
# Install
pip install deflakeml

# Train model on historical data
deflakeml train --data test_history.csv --output model.pkl

# Predict flakiness
deflakeml predict --model model.pkl --test-suite ./tests/
```
GitHub: https://github.com/TestingResearch/DeflakeML
Facebook's Predictive Test Selection
Approach:
- Predicts which tests are likely to be flaky
- Uses code change analysis and test history
- Integrates with continuous testing infrastructure
Key Techniques:
- Bayesian inference for flakiness probability
- Time-series analysis of test results
- Code coverage correlation
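The Bayesian idea can be sketched with a Beta-Bernoulli model: treat each run as a draw with an unknown failure probability and update a Beta prior with observed passes and failures. This is a simplified illustration of the technique, not Facebook's actual implementation; the prior parameters are arbitrary:

```python
def posterior_failure_rate(failures, runs, alpha=1.0, beta=1.0):
    """
    Beta-Bernoulli update: with a Beta(alpha, beta) prior on the failure
    probability, observing `failures` out of `runs` gives a
    Beta(alpha + failures, beta + runs - failures) posterior.
    Returns the posterior mean failure rate.
    """
    return (alpha + failures) / (alpha + beta + runs)


# A test that failed 3 times in 10 runs, under a uniform Beta(1, 1) prior
mean = posterior_failure_rate(failures=3, runs=10)
print(f"Posterior mean failure rate: {mean:.3f}")  # (1 + 3) / (2 + 10) = 0.333
```

The prior keeps estimates sane for tests with few recorded runs: with no data at all, the uniform prior yields 0.5 rather than a division-by-zero or an overconfident 0% or 100%.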
15.05.3 ML Techniques for Flaky Test Detection
Supervised Learning Approach
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


class FlakyTestClassifier:
    """
    ML model to classify tests as flaky or stable.

    Note: GradientBoostingClassifier hyperparameters are reasonable defaults.
    For production use, tune parameters via cross-validation on your dataset.
    """

    def __init__(self):
        self.model = GradientBoostingClassifier(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=5
        )

    def prepare_features(self, test_data):
        """Build a feature matrix from per-test metric dictionaries."""
        features = []
        for test in test_data:
            feature_vector = [
                test['execution_time_variance'],
                test['failure_rate'],
                test['consecutive_passes'],
                test['consecutive_failures'],
                test['dependency_count'],
                test['uses_threading'],
                test['uses_network'],
                test['uses_filesystem'],
                test['test_age_days'],
                test['code_churn_rate']
            ]
            features.append(feature_vector)
        return np.array(features)

    def train(self, test_data, labels):
        """Train the classifier and report held-out performance."""
        X = self.prepare_features(test_data)
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.2, random_state=42
        )
        self.model.fit(X_train, y_train)
        predictions = self.model.predict(X_test)
        print(classification_report(y_test, predictions))
        return self.model

    def predict_flaky(self, test_data):
        """Predict the probability that each test is flaky."""
        X = self.prepare_features(test_data)
        probabilities = self.model.predict_proba(X)
        return probabilities[:, 1]  # Probability of being flaky
```
Feature Engineering for Flaky Detection
Common features used in ML models:
- Execution Metrics: Duration variance, timeout frequency
- Code Characteristics: Threading, async operations, external dependencies
- Historical Patterns: Failure rate, pass/fail transitions
- Environmental Factors: Time of day effects, resource contention
- Dependency Analysis: Shared state, execution order dependencies
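Several of these features can be computed directly from a test's run history. A minimal pure-Python sketch (the field names are illustrative, not a standard schema):

```python
from statistics import pstdev


def extract_history_features(outcomes, durations):
    """Compute simple flakiness features from one test's run history.

    outcomes:  list of 'passed'/'failed' strings, oldest run first
    durations: list of run durations in seconds, same order
    """
    # A pass/fail transition is any run whose result differs from the previous run
    transitions = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return {
        'failure_rate': outcomes.count('failed') / len(outcomes),
        'pass_fail_transitions': transitions,
        'duration_stddev': pstdev(durations),  # population std dev of run times
    }


features = extract_history_features(
    ['passed', 'failed', 'passed', 'passed', 'failed'],
    [1.0, 1.2, 1.0, 1.1, 1.3],
)
print(features)
```

Code characteristics (threading, network, filesystem use) typically require static analysis of the test source rather than history mining, which is why tools combine both signal types.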
15.05.4 CI/CD Integration for Flaky Detection
Jenkins Configuration
```groovy
pipeline {
    agent any
    stages {
        stage('Test with Flaky Detection') {
            steps {
                script {
                    // Run tests multiple times
                    def results = []
                    for (int i = 0; i < 5; i++) {
                        def result = sh(
                            // Double quotes so Groovy interpolates ${i};
                            // single-quoted Groovy strings do not interpolate
                            script: "pytest --json-report --json-report-file=results_${i}.json",
                            returnStatus: true
                        )
                        results.add(result)
                    }
                    // Analyze for flakiness. The heredoc body is kept
                    // flush-left so the EOF terminator is recognized.
                    sh '''
python3 << EOF
import json
import glob
from collections import defaultdict

# Expected JSON schema (pytest --json-report format):
# {
#   "tests": [
#     {"nodeid": "test_module::test_name", "outcome": "passed", "duration": 0.1},
#     ...
#   ]
# }

# Aggregate results from multiple runs
test_outcomes = defaultdict(list)
for file in sorted(glob.glob('results_*.json')):
    try:
        with open(file) as f:
            data = json.load(f)
        # Handle pytest-json-report format (list of tests)
        tests = data.get('tests', [])
        if isinstance(tests, list):
            for test in tests:
                test_name = test.get('nodeid', test.get('name', 'unknown'))
                outcome = test.get('outcome', 'unknown')
                test_outcomes[test_name].append(outcome)
    except (json.JSONDecodeError, KeyError) as e:
        print(f"Warning: Could not parse {file}: {e}")
        continue

# Identify flaky tests
flaky_tests = []
for test_name, outcomes in test_outcomes.items():
    if len(set(outcomes)) > 1:  # Different outcomes across runs
        flaky_tests.append({
            'test': test_name,
            'outcomes': outcomes,
            'flake_rate': outcomes.count('failed') / len(outcomes)
        })

if flaky_tests:
    print(f"Found {len(flaky_tests)} flaky tests:")
    for test in flaky_tests:
        print(f"  - {test['test']}: {test['outcomes']}")
else:
    print("No flaky tests detected.")
EOF
'''
                }
            }
        }
    }
}
```
15.05.5 Effectiveness and ROI
Research Findings (reported figures vary by study and codebase):
- ML-based flaky detection accuracy: 80-95%
- False positive rate: 5-15%
- Time to identify flaky tests: reduced by 70-90%
- CI/CD stability: 40-60% fewer false failures
ROI Metrics:
- Developer time saved: 15-25% (less time investigating false failures)
- CI/CD reliability: 50-70% improvement
- Build time: 20-30% reduction (by quarantining flaky tests)
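The developer-time figure can be turned into a rough back-of-the-envelope estimate. Every input below is an illustrative assumption for your own calculation, not measured data:

```python
def wasted_minutes_per_week(false_failures_per_week, minutes_per_triage):
    """Estimate developer time lost to investigating false CI failures."""
    return false_failures_per_week * minutes_per_triage


# Illustrative assumptions: 20 false failures/week, 15 minutes of triage each
before = wasted_minutes_per_week(20, 15)
# Assume flaky detection halves false failures (the midpoint of the range above)
after = wasted_minutes_per_week(20 * (1 - 0.5), 15)
print(f"Triage time: {before} -> {after:.0f} minutes/week")
```

Plugging in your own CI failure counts and triage times gives a defensible baseline before committing to tooling investment.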
15.05.6 Implementation Examples
Python Script: detect_flaky_tests.py
```python
# scripts/detect_flaky_tests.py
import argparse
import json
import sys
from collections import defaultdict
from pathlib import Path


class FlakyTestDetector:
    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.test_results = defaultdict(list)

    def add_results(self, result_file):
        """
        Add test results from a JSON file.

        Expected JSON schema (pytest --json-report format):
        {
          "tests": [
            {"nodeid": "test_module::test_name", "outcome": "passed", "duration": 0.1},
            ...
          ]
        }
        """
        with open(result_file) as f:
            data = json.load(f)
        tests = data.get('tests', [])
        if not isinstance(tests, list):
            raise ValueError(f"Expected 'tests' to be a list, got {type(tests)}")
        for test in tests:
            # Handle different test result formats
            test_name = test.get('nodeid') or test.get('name', 'unknown')
            outcome = test.get('outcome', 'unknown')  # 'passed', 'failed', 'skipped'
            self.test_results[test_name].append({
                'outcome': outcome,
                'duration': test.get('duration', 0)
            })

    def analyze(self):
        """Analyze test results for flakiness."""
        flaky_tests = []
        for test_name, results in self.test_results.items():
            if len(results) < 2:
                continue
            outcomes = [r['outcome'] for r in results]
            if len(set(outcomes)) > 1:
                # Test has inconsistent results
                failure_rate = outcomes.count('failed') / len(outcomes)
                # Only consider flaky if not always failing/passing
                if 0 < failure_rate < 1:
                    flake_score = failure_rate * (1 - failure_rate) * 4
                    if flake_score >= self.threshold:
                        flaky_tests.append({
                            'name': test_name,
                            'flake_rate': failure_rate,
                            'flake_score': flake_score,
                            'total_runs': len(results),
                            'failures': outcomes.count('failed'),
                            'passes': outcomes.count('passed'),
                            'outcomes': outcomes
                        })
        # Sort by flake score, most flaky first
        flaky_tests.sort(key=lambda x: x['flake_score'], reverse=True)
        return flaky_tests


def main():
    parser = argparse.ArgumentParser(description='Detect flaky tests')
    parser.add_argument('--results', nargs='+', help='Result JSON files (globs allowed)')
    parser.add_argument('--output', default='flaky_tests.json', help='Output file')
    parser.add_argument('--threshold', type=float, default=0.2, help='Flake threshold')
    args = parser.parse_args()

    detector = FlakyTestDetector(threshold=args.threshold)

    # Load all result files with error handling
    files_loaded = 0
    for result_file in args.results:
        # Path.glob only yields existing files, so no separate existence check is needed
        for path in Path('.').glob(result_file):
            try:
                print(f"Loading {path}")
                detector.add_results(path)
                files_loaded += 1
            except json.JSONDecodeError as e:
                print(f"Error: Malformed JSON in {path}: {e}")
            except Exception as e:
                print(f"Error loading {path}: {e}")

    if files_loaded == 0:
        print("Error: No valid result files found. Exiting.")
        return 1

    # Analyze
    flaky = detector.analyze()

    # Save results
    with open(args.output, 'w') as f:
        json.dump(flaky, f, indent=2)

    # Print summary
    print(f"\nFound {len(flaky)} flaky tests:")
    for test in flaky[:10]:  # Top 10
        print(f"  {test['name']}")
        print(f"    Flake rate: {test['flake_rate']:.1%}")
        print(f"    Score: {test['flake_score']:.2f}")
        print(f"    Runs: {test['total_runs']} ({test['failures']}F / {test['passes']}P)")
    return 0


if __name__ == '__main__':
    sys.exit(main())
```
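Once the detector's report exists, quarantining can be as simple as maintaining a skip list derived from it. A minimal sketch of that consumer side; the helper names and threshold are hypothetical, not part of the script above:

```python
import json


def load_quarantine(flaky_report='flaky_tests.json', min_score=0.5):
    """Build a quarantine set from the detector's JSON report.

    Tests at or above min_score are candidates for skipping in CI
    until their root cause is fixed.
    """
    with open(flaky_report) as f:
        flaky = json.load(f)
    return {t['name'] for t in flaky if t['flake_score'] >= min_score}


def filter_tests(test_ids, quarantined):
    """Split collected test ids into runnable and quarantined groups."""
    run = [t for t in test_ids if t not in quarantined]
    skip = [t for t in test_ids if t in quarantined]
    return run, skip
```

In a pytest setup this set could feed a `--deselect` list or a skip marker applied during collection; the key design point is that quarantine decisions stay versioned data, reviewable like any other change.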
GitHub Actions Integration
```yaml
flaky-test-detection:
  name: Flaky Test Detection
  needs: unit-tests
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: ${{ env.PYTHON_VERSION }}
    - name: Install dependencies
      run: pip install pytest pytest-json-report
    - name: Run tests multiple times
      run: |
        for i in {1..5}; do
          pytest tests/ --json-report --json-report-file=results_$i.json || true
        done
    - name: Analyze for flaky tests
      run: |
        python scripts/detect_flaky_tests.py \
          --results results_*.json \
          --output flaky_tests.json \
          --threshold 0.2
    - name: Comment on PR if flaky tests found
      if: github.event_name == 'pull_request'
      uses: actions/github-script@v6
      with:
        script: |
          const fs = require('fs');
          const flaky = JSON.parse(fs.readFileSync('flaky_tests.json', 'utf8'));
          if (flaky.length > 0) {
            // Note: flake_rate is stored as a decimal (0.0-1.0)
            const comment = `## ⚠️ Flaky Tests Detected\n\n` +
              `The following tests showed flaky behavior:\n\n` +
              flaky.map(t => `- \`${t.name}\` (${(t.flake_rate * 100).toFixed(1)}% flake rate)`).join('\n');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
          }
```
GitLab CI Integration
```yaml
flaky-detection:
  stage: test
  extends: .python-base
  needs: [unit-tests]
  script:
    - |
      # Run tests 5 times
      for i in {1..5}; do
        pytest tests/ --json-report --json-report-file=results_$i.json || true
      done

      # Analyze for flakiness
      python scripts/detect_flaky_tests.py \
        --results results_*.json \
        --output flaky_tests.json \
        --threshold 0.2

      # Create MR comment if any flaky tests were found. The report file
      # always exists (it contains [] when clean), so check that the
      # JSON array is non-empty rather than testing file size.
      if [ "$(python -c "import json; print(len(json.load(open('flaky_tests.json'))))")" -gt 0 ]; then
        python scripts/create_mr_comment.py \
          --flaky-tests flaky_tests.json \
          --mr-iid $CI_MERGE_REQUEST_IID
      fi
  artifacts:
    paths:
      - flaky_tests.json
    expire_in: 7 days
  only:
    - merge_requests
```
Summary
Flaky tests pose a significant challenge to the reliability and efficiency of continuous integration pipelines. By leveraging AI and machine learning techniques, development teams can effectively detect, analyze, and manage these inconsistent tests. Approaches range from statistical analysis of test history to supervised learning models that predict flakiness based on various code and execution features. Integrating flaky test detection into CI/CD pipelines through custom scripts or platform-specific features (like Azure DevOps) helps automate the process, providing early warnings and enabling teams to quarantine or address flaky tests proactively. This ultimately leads to more stable and trustworthy test suites, saving developer time and improving overall release confidence.
References
- Google Testing Blog: https://testing.googleblog.com/
- TestingResearch/DeflakeML GitHub: https://github.com/TestingResearch/DeflakeML
- Relevant documentation for pytest and CI/CD platforms (GitHub Actions, GitLab CI).