Automated Regression Detection
A fixed threshold answers one question — "is this run above the line?" — and answers it badly, because a noisy metric crosses any line by chance and a slowly drifting one never does. Statistical regression detection asks a better question: "is this run's distribution different from the baseline's, beyond what noise explains?" That distinction is the difference between a gate engineers trust and one they learn to re-run until it goes green. This is the change-detection layer of the Threshold Calibration & Baseline Management reference: it compares a candidate run against the distribution captured in your rolling baseline and gates only when the shift is statistically significant.
Detection has three coupled parts — choosing a method (what statistic decides "significant"), sizing the window and sensitivity (how much evidence is enough), and enforcing the decision in CI without drowning the team in false alarms. Tune the method too hot and every wobble blocks a merge; too cold and a real 10% regression sails through. This page is the authoritative spec for all three.
Core Concept: Change Detection vs Fixed Thresholds
A fixed threshold treats each run as a single point against a constant. Change detection treats the candidate as a sample drawn from a distribution and asks whether that distribution has moved relative to the baseline window. The decision flows in three steps: a new run is compared against the baseline distribution, a significance test runs, and only a significant shift alerts or gates.
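To make the contrast concrete, here is a minimal sketch. The helper names and sample values are hypothetical and chosen only for illustration. It compares a ceiling check on a single value with a measure of how far the candidate mean has moved, expressed in standard errors of the difference.

```javascript
// Hypothetical helpers for illustration; not part of any script named on this page.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Fixed threshold: one number against a constant, blind to spread.
function exceedsCeiling(value, ceiling) {
  return value > ceiling;
}

// Change detection: distance between candidate and baseline means,
// measured in standard errors of the difference.
function shiftInStandardErrors(baseline, candidate) {
  const se = Math.sqrt(
    sampleVariance(baseline) / baseline.length +
      sampleVariance(candidate) / candidate.length
  );
  return (mean(candidate) - mean(baseline)) / se;
}

const baseline = [2100, 2300, 2150, 2250, 2050, 2350, 2200, 2200];
const noisyRun = [2320, 2080, 2260, 2140, 2200]; // same mean as the baseline, one high sample
const shiftedRun = [2420, 2380, 2450, 2400, 2430]; // every run is slower

console.log(exceedsCeiling(Math.max(...noisyRun), 2300)); // true: one noisy sample trips the line
console.log(shiftInStandardErrors(baseline, noisyRun).toFixed(2)); // 0.00: no shift in the mean
console.log(shiftInStandardErrors(baseline, shiftedRun) > 4); // true: a shift well beyond noise
```

The ceiling fires on the noisy run even though its mean matches the baseline, while the distribution comparison stays quiet there and reacts only to the run that is consistently slower.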
Prerequisites & Environment
Change detection consumes the rolling baseline distribution, so it inherits everything the baseline needs first. Establish the baseline series and its outlier filtering through Historical Baseline Calibration before enabling detection, and clean the input samples per Statistical Noise & Flakiness Reduction — a detector fed raw noise produces either constant false alarms or a band so wide it is blind.
- A baseline window of samples, not just a number: the detector needs the full set of recent values per metric to estimate the baseline mean and spread, not a single median.
- Deterministic collection, with throttlingMethod: simulate and a fixed numberOfRuns, so that the candidate's spread is comparable to the baseline's.
- A defined per-series identity: detection runs per (metric, device, route); never compare a mobile candidate against a desktop baseline.
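The per-series rule can be enforced with a small guard. This is a sketch under assumptions: the key format and the function names are hypothetical, and only the (metric, device, route) tuple comes from the text above.

```javascript
// Hypothetical series identity: one baseline window per (metric, device, route).
function seriesKey({ metric, device, route }) {
  return [metric, device, route].join("|");
}

// Refuse to compare a candidate against a baseline from a different series.
function assertSameSeries(baselineSeries, candidateSeries) {
  const a = seriesKey(baselineSeries);
  const b = seriesKey(candidateSeries);
  if (a !== b) {
    throw new Error(`series mismatch: baseline ${a} vs candidate ${b}`);
  }
  return a;
}
```

Calling the guard before any statistical test keeps a mobile candidate from ever being scored against a desktop window.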
Map the inputs through environment variables: DETECT_BASELINE_URL for the baseline window source, DETECT_METHOD to select the test, and DETECT_ALPHA for the significance level.
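A script might read and validate those three variables roughly as follows. The function name, the defaults (welch, 0.01), and the validation rules are assumptions for this sketch; only the variable names and the three method identifiers come from this page.

```javascript
// Reads the three variables named above. Defaults and validation are assumptions.
function readDetectEnv(env = process.env) {
  const baselineUrl = env.DETECT_BASELINE_URL;
  if (!baselineUrl) throw new Error("DETECT_BASELINE_URL is required");

  const method = env.DETECT_METHOD || "welch";
  if (!["welch", "mannwhitney", "cusum"].includes(method)) {
    throw new Error(`unknown DETECT_METHOD: ${method}`);
  }

  // Environment variables are strings, so alpha must be parsed and range-checked.
  const alpha = Number(env.DETECT_ALPHA ?? "0.01");
  if (!(alpha > 0 && alpha < 1)) {
    throw new Error(`DETECT_ALPHA must be between 0 and 1, got ${env.DETECT_ALPHA}`);
  }
  return { baselineUrl, method, alpha };
}
```

Failing fast on a malformed DETECT_ALPHA matters because a value that parses to NaN would otherwise make every comparison against alpha false and silently disable the gate.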
Configuration Reference
The detection config below is the authoritative spec. It selects the statistical method, the comparison window, and the sensitivity that trades false positives against missed regressions. Every field is explained in the notes that follow it.
{
  "detection": {
    "method": "welch",
    "window": { "baselineSize": 60, "candidateRuns": 5, "minBaseline": 30 },
    "sensitivity": { "alpha": 0.01, "minEffect": { "type": "rel", "value": 0.05 } },
    "direction": "regression-only",
    "metrics": {
      "lcp": { "enabled": true, "level": "error" },
      "inp": { "enabled": true, "level": "error" },
      "cls": { "enabled": true, "level": "warn" },
      "script_bytes": { "method": "cusum", "level": "error" }
    }
  }
}
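method chooses the statistic: welch (Welch's t-test) for roughly normally distributed timings, mannwhitney for skewed metrics, and cusum for slow drift that a single-run test misses; a per-metric method, as on script_bytes, takes the place of the default for that metric. window sets how much evidence is compared: baselineSize is the number of recent baseline samples, candidateRuns is the number of runs collected for the candidate, and minBaseline is the minimum baseline sample count the detector needs (below it, fall back to a static ceiling). alpha is the per-test false-positive rate: at 0.01, a test on pure noise flags about 1% of the time. minEffect is the floor on practical significance: a difference must be both statistically significant and at least 5% (relative) to gate, which suppresses tiny-but-real shifts nobody cares about. direction: regression-only ignores improvements, so a faster run never fails the build. level sets the consequence: error blocks the merge and warn annotates without blocking.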
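The interaction of alpha, minEffect, and direction can be sketched with a one-sided Welch test. This is an illustrative sketch, not the implementation behind detect-run.js: all function names are hypothetical, it assumes higher values are worse (timings, bytes), and it computes the t-distribution tail by numerical integration to stay dependency-free.

```javascript
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Upper-tail probability P(T > t) for Student's t with df degrees of freedom.
// Substituting x = sqrt(df) * tan(theta) turns the density into cos(theta)^(df - 1),
// which Simpson's rule integrates on a finite interval (valid for df >= 1).
function tUpperTail(t, df) {
  const integrate = (upper) => {
    const n = 2000; // even number of Simpson intervals
    const h = upper / n;
    let sum = 0;
    for (let i = 0; i <= n; i++) {
      const weight = i === 0 || i === n ? 1 : i % 2 === 1 ? 4 : 2;
      // Clamp at 0 so rounding near pi/2 cannot produce a negative base.
      sum += weight * Math.pow(Math.max(0, Math.cos(i * h)), df - 1);
    }
    return (sum * h) / 3;
  };
  const ratio =
    integrate(Math.atan(Math.abs(t) / Math.sqrt(df))) / integrate(Math.PI / 2);
  return t >= 0 ? 0.5 * (1 - ratio) : 0.5 * (1 + ratio);
}

// One-sided test (regression-only): gate only when the candidate is significantly
// slower AND the relative difference reaches the minEffect floor.
function welchRegression(baseline, candidate, { alpha, minEffect }) {
  const mb = mean(baseline);
  const mc = mean(candidate);
  const vb = sampleVariance(baseline) / baseline.length;
  const vc = sampleVariance(candidate) / candidate.length;
  const relEffect = (mc - mb) / mb;
  let p;
  if (vb + vc === 0) {
    p = mc > mb ? 0 : 1; // no spread at all: any increase is unambiguous
  } else {
    const t = (mc - mb) / Math.sqrt(vb + vc);
    // Welch-Satterthwaite degrees of freedom.
    const df =
      (vb + vc) ** 2 /
      (vb ** 2 / (baseline.length - 1) + vc ** 2 / (candidate.length - 1));
    p = tUpperTail(t, df);
  }
  return { relEffect, p, regression: p < alpha && relEffect >= minEffect };
}

const baseline = Array.from({ length: 30 }, (_, i) => (i % 2 === 0 ? 2080 : 2280)); // mean 2180
const candidate = [2440, 2480, 2460, 2450, 2470]; // mean 2460
const result = welchRegression(baseline, candidate, { alpha: 0.01, minEffect: 0.05 });
console.log((result.relEffect * 100).toFixed(1), result.regression); // 12.8 true
```

Because the test is one-sided, a candidate that is faster than the baseline yields a p-value above 0.5 and can never gate, which is the behaviour direction: regression-only describes.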
Step-by-Step Implementation
- Pull the baseline window and the candidate. Fetch the recent baseline samples per metric and the candidate's runs.

  node scripts/detect-fetch.js --branch main --metric lcp --out baseline_lcp.json

  Expected output:

  baseline lcp: 60 samples, mean=2180ms sd=140ms

  This confirms enough samples and a usable spread.

- Run the significance test. Compare candidate against baseline with the configured method, applying both alpha and minEffect.

  node scripts/detect-run.js --config detection.json --baseline baseline_lcp.json --run .lighthouseci/

  Expected tail:

  lcp: candidate mean=2460ms Δ=+12.8% p=0.004 → REGRESSION (error)

  The output has one line per metric, with the verdict.

- Wire the exit code into the gate. A REGRESSION at error level exits non-zero; warn annotates without blocking. Calibrate sensitivity against your own false-positive rate before promoting any metric to error.
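The mapping from verdicts to an exit code in the last step can be sketched as below. The result shape and the function name are hypothetical; the page does not specify how detect-run.js represents verdicts internally.

```javascript
// Hypothetical result shape:
// { metric: string, verdict: "REGRESSION" | "OK", level: "error" | "warn" }
function gate(results) {
  const regressions = results.filter((r) => r.verdict === "REGRESSION");
  const blocking = regressions.filter((r) => r.level === "error");
  const warnings = regressions.filter((r) => r.level === "warn");
  // Only error-level regressions fail the build; warn-level ones are reported.
  return { exitCode: blocking.length > 0 ? 1 : 0, blocking, warnings };
}

const outcome = gate([
  { metric: "lcp", verdict: "REGRESSION", level: "error" },
  { metric: "cls", verdict: "REGRESSION", level: "warn" },
  { metric: "inp", verdict: "OK", level: "error" },
]);
console.log(outcome.exitCode); // 1
console.log(outcome.warnings.map((w) => w.metric)); // [ 'cls' ]
// In a real script, finish with: process.exitCode = outcome.exitCode;
```

Setting process.exitCode rather than calling process.exit() lets pending output flush before Node terminates, which keeps the per-metric lines visible in the CI log.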
Threshold Calibration
The single dial that matters is sensitivity: lower alpha and higher minEffect mean fewer false alarms but slower detection of real regressions; the reverse catches small shifts fast at the cost of noise. Calibrate by replaying the detector over your last few weeks of known-good runs and counting how often it would have fired — that empirical false-positive rate, not theory, sets the dial. The matrix gives defensible starting points by environment.
| Device class | Connection profile | Method | alpha | minEffect | candidateRuns |
|---|---|---|---|---|---|
| Desktop | Cable / Fiber | Welch t-test | 0.01 | 4% | 5 |
| High-end mobile | 4G / LTE | Welch t-test | 0.01 | 5% | 5 |
| Mid-range mobile | Fast 3G | Mann–Whitney | 0.02 | 6% | 7 |
Mid-range mobile metrics are skewed and noisier, so a rank-based test and a looser alpha keep the false-positive rate tolerable. Keep every metric at warn until its replayed false-positive rate sits below roughly one alarm per two weeks, then promote to error. The detailed method-by-method tuning lives in Configuring Statistical Regression Alerts.
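The replay-and-promote loop described in this section can be sketched as follows. Everything here is hypothetical scaffolding: the history shape, the function names, and the requirement of at least 14 days of history before promotion are assumptions. The detector is passed in as a callback so that any method can be replayed.

```javascript
// `history`: known-good runs ordered oldest first, each { day: number, samples: number[] }.
// `detect(baselineSamples, candidateSamples)` returns true when it would flag a regression.
function replayFalsePositives(history, detect, { baselineRuns, minBaseline }) {
  let alarms = 0;
  let evaluated = 0;
  for (let i = 1; i < history.length; i++) {
    // Baseline window = samples from the runs immediately before run i.
    const window = history
      .slice(Math.max(0, i - baselineRuns), i)
      .flatMap((r) => r.samples);
    if (window.length < minBaseline) continue; // not enough baseline yet
    evaluated++;
    if (detect(window, history[i].samples)) alarms++;
  }
  const days =
    history.length > 0 ? history[history.length - 1].day - history[0].day + 1 : 0;
  return { alarms, evaluated, days };
}

// Promotion rule from the text: fewer than about one alarm per two weeks.
// Requiring at least 14 days of replayed history is an added assumption.
function readyForError({ alarms, days }) {
  return days >= 14 && alarms / days < 1 / 14;
}
```

Since every replayed run is known-good, each alarm counted here is a false positive by construction; if the rate is too high, adjust alpha or minEffect and replay again before promoting the metric.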
CI Enforcement Snippet
This GitHub Actions job runs detection against the baseline window and gates the merge on a significant regression. It is copy-paste ready once the two scripts from the steps above exist in your repository, and it produces a status check that you can mark as required.
name: Regression Detection
on:
  pull_request:
    branches: [main]
jobs:
  detect:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20", cache: "npm" }
      - run: npm ci
      - name: Fetch baseline window
        run: node scripts/detect-fetch.js --branch main --out baseline_window.json
        env: { DETECT_BASELINE_URL: "${{ secrets.DETECT_BASELINE_URL }}" }
      - name: Collect candidate
        run: npx lhci collect && npx lhci upload --target=filesystem
      - name: Run change detection
        run: node scripts/detect-run.js --config detection.json --baseline baseline_window.json --run .lighthouseci/
The detect-run.js step exits non-zero only on a significant, large-enough regression, so the job is safe to make a required check. For the alert side — firing on real shifts without paging on noise — see Configuring Statistical Regression Alerts.
Troubleshooting & Edge Cases
- Constant false positives → alpha too high or the baseline window too noisy. Lower alpha to 0.01, raise minEffect, and verify outliers are trimmed upstream per Statistical Noise & Flakiness Reduction.
- Real regressions slip through → minEffect set above the regression size, or too few candidateRuns to reach significance. Raise the run count and lower the effect floor.
- Skewed metric flagged constantly by a t-test → Welch assumes roughly normal data; switch that metric to mannwhitney.
- Slow drift never fires → single-run tests compare one point against the window; add a cusum detector that accumulates small shifts over successive runs.
- Baseline window too small after a reset → fewer than minBaseline samples. Fall back to a static ceiling until the window refills.
- Improvement fails the build → direction not set to regression-only; a two-sided test flags faster runs too.
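For the slow-drift case, a one-sided tabular CUSUM shows how small per-run excesses accumulate. This is a sketch: the function name is hypothetical, and the slack k = 0.5 and decision limit h = 5 (both in baseline standard deviations) are common textbook starting values rather than values specified on this page.

```javascript
// One-sided (upper) tabular CUSUM on standardized values.
// k: slack subtracted each run; h: decision limit. Both in baseline standard deviations.
function cusumUpper(values, { mean, sd, k = 0.5, h = 5 }) {
  let s = 0;
  for (let i = 0; i < values.length; i++) {
    // Accumulate the excess over (mean + k * sd); never drop below zero.
    s = Math.max(0, s + (values[i] - mean) / sd - k);
    if (s > h) return { fired: true, index: i, statistic: s };
  }
  return { fired: false, index: -1, statistic: s };
}

// script_bytes sitting 1 sd above baseline on every run: no single run looks
// alarming, but the accumulated excess crosses the limit on the 11th run.
const drift = cusumUpper(Array(12).fill(301000), { mean: 300000, sd: 1000 });
console.log(drift.fired, drift.index); // true 10

// A stable series that wobbles around the mean never accumulates.
const stable = cusumUpper([299500, 300500, 299500, 300500], { mean: 300000, sd: 1000 });
console.log(stable.fired); // false
```

Raising h delays detection but lowers false alarms, and raising k makes the detector ignore larger sustained offsets, so both belong in the same replay-based calibration as alpha and minEffect.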
Frequently Asked Questions
Why not just use a fixed threshold?
A fixed threshold ignores variance: a noisy metric crosses any line by chance and a slowly drifting one stays under it for weeks. Change detection compares the candidate against the baseline distribution and fires only when the shift exceeds what noise explains, which is why it produces a gate engineers trust. Fixed ceilings still have a role as a hard backstop alongside detection, set in Historical Baseline Calibration.
Which method should I start with?
Welch's t-test for roughly normally distributed timing metrics like LCP and TBT, Mann–Whitney for skewed or mobile metrics, and CUSUM when you need to catch slow accumulating drift. A reasonable starting point is Welch at alpha=0.01, with CUSUM added for size metrics. The full comparison is in Configuring Statistical Regression Alerts.
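For the skewed-metric case, a rank-based comparison can be sketched as a one-sided Mann–Whitney U test. This sketch makes several assumptions: the function names are hypothetical, higher values are treated as worse, and the p-value uses the normal approximation with tie and continuity corrections, which is only approximate for very small samples.

```javascript
// Upper-tail probability of the standard normal, using the Abramowitz and Stegun
// 7.1.26 approximation of erf (absolute error about 1.5e-7).
function normalUpperTail(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 - erf) : 0.5 * (1 + erf);
}

// One-sided test: is the candidate stochastically larger than the baseline?
function mannWhitneyUpper(baseline, candidate) {
  const all = [
    ...baseline.map((v) => ({ v, isCandidate: false })),
    ...candidate.map((v) => ({ v, isCandidate: true })),
  ].sort((a, b) => a.v - b.v);

  let rankSumCandidate = 0;
  let tieTerm = 0;
  for (let i = 0; i < all.length; ) {
    let j = i;
    while (j < all.length && all[j].v === all[i].v) j++;
    const midrank = (i + 1 + j) / 2; // average of the 1-based ranks i+1..j
    const tied = j - i;
    tieTerm += tied ** 3 - tied;
    for (let k = i; k < j; k++) {
      if (all[k].isCandidate) rankSumCandidate += midrank;
    }
    i = j;
  }

  const nb = baseline.length;
  const nc = candidate.length;
  const n = nb + nc;
  const u = rankSumCandidate - (nc * (nc + 1)) / 2;
  const variance = ((nb * nc) / 12) * (n + 1 - tieTerm / (n * (n - 1)));
  if (variance === 0) return { u, p: 1 }; // every value tied: no evidence of a shift
  const z = (u - (nb * nc) / 2 - 0.5) / Math.sqrt(variance); // continuity correction
  return { u, p: normalUpperTail(z) };
}
```

Because only ranks enter the statistic, a single extreme sample in either group moves the result far less than it would move a mean-based test.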
How do I keep detection from blocking on pure noise?
Set a low alpha (0.01), require a minimum practical effect size (4–6%), and keep new metrics at warn until you have replayed the detector over known-good runs and confirmed its false-positive rate is below roughly one alarm per two weeks.