Automated Regression Detection

A fixed threshold answers one question — "is this run above the line?" — and answers it badly, because a noisy metric crosses any line by chance and a slowly drifting one never does. Statistical regression detection asks a better question: "is this run's distribution different from the baseline's, beyond what noise explains?" That distinction is the difference between a gate engineers trust and one they learn to re-run until it goes green. This is the change-detection layer of the Threshold Calibration & Baseline Management reference: it compares a candidate run against the distribution captured in your rolling baseline and gates only when the shift is statistically significant.

Detection has three coupled parts — choosing a method (what statistic decides "significant"), sizing the window and sensitivity (how much evidence is enough), and enforcing the decision in CI without drowning the team in false alarms. Tune the method too hot and every wobble blocks a merge; too cold and a real 10% regression sails through. This page is the authoritative spec for all three.

Core Concept: Change Detection vs Fixed Thresholds

A fixed threshold treats each run as a single point against a constant. Change detection treats the candidate as a sample drawn from a distribution and asks whether that distribution has moved relative to the baseline window. The flow below shows the decision: a new run is compared against the baseline distribution, a significance test runs, and only a significant shift alerts or gates.

[Figure: Regression detection decision flow. A new run (candidate samples) is compared against the baseline distribution (window of N runs); the significance test checks p < α; "no" passes the merge, "yes" gates the build (alert + block).]
Each candidate is tested against the baseline distribution; a non-significant difference passes, and only a statistically significant shift alerts and gates the build.

Prerequisites & Environment

Change detection consumes the rolling baseline distribution, so it inherits everything the baseline needs first. Establish the baseline series and its outlier filtering through Historical Baseline Calibration before enabling detection, and clean the input samples per Statistical Noise & Flakiness Reduction — a detector fed raw noise produces either constant false alarms or a band so wide it is blind.

  • A baseline window of samples, not just a number — the detector needs the full set of recent values per metric to estimate the baseline mean and spread, not a single median.
  • Deterministic collection (throttlingMethod: simulate, fixed numberOfRuns), so the candidate's spread is comparable to the baseline's.
  • A defined per-series identity — detection runs per (metric, device, route); never compare a mobile candidate against a desktop baseline.
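The per-series rule can be enforced with a small guard before any test runs. The sketch below is illustrative only: seriesKey, assertSameSeries, and the field names on the objects are assumptions, not part of the scripts referenced on this page.

```javascript
// Hypothetical helper: build a stable identity for one metric series.
function seriesKey({ metric, device, route }) {
  return [metric, device, route].join("|");
}

// Refuse to compare a candidate against a baseline from a different series.
function assertSameSeries(baseline, candidate) {
  const b = seriesKey(baseline);
  const c = seriesKey(candidate);
  if (b !== c) {
    throw new Error(`series mismatch: baseline=${b} candidate=${c}`);
  }
}

// Usage: passes silently because metric, device, and route all match.
assertSameSeries(
  { metric: "lcp", device: "mobile", route: "/checkout" },
  { metric: "lcp", device: "mobile", route: "/checkout" }
);
```

Failing loudly here is a deliberate choice: a cross-device comparison would otherwise surface as a large, "significant" shift that is really a labeling mistake.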

Map the inputs through environment variables: DETECT_BASELINE_URL for the baseline window source, DETECT_METHOD to select the test, and DETECT_ALPHA for the significance level.
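One way to read those variables is sketched below. The loader name, the defaults, and the choice to treat DETECT_BASELINE_URL as required are assumptions made for illustration; the real scripts may resolve them differently.

```javascript
// Hypothetical loader: overlay DETECT_* environment variables on defaults.
const KNOWN_METHODS = ["welch", "mannwhitney", "cusum"];

function loadDetectionEnv(env, defaults = { method: "welch", alpha: 0.01 }) {
  const baselineUrl = env.DETECT_BASELINE_URL;
  if (!baselineUrl) {
    throw new Error("DETECT_BASELINE_URL is required");
  }

  const method = env.DETECT_METHOD || defaults.method;
  if (!KNOWN_METHODS.includes(method)) {
    throw new Error(`unknown DETECT_METHOD: ${method}`);
  }

  // Environment values are strings, so alpha is parsed and range-checked.
  const alpha =
    env.DETECT_ALPHA === undefined ? defaults.alpha : Number(env.DETECT_ALPHA);
  if (!(alpha > 0 && alpha < 1)) {
    throw new Error(`DETECT_ALPHA must be between 0 and 1, got ${env.DETECT_ALPHA}`);
  }

  return { baselineUrl, method, alpha };
}
```

In a script this would be called as loadDetectionEnv(process.env). Taking the environment as a parameter keeps the function easy to test without mutating process state.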

Configuration Reference

The detection config below is the authoritative spec. It selects the statistical method, the comparison window, and the sensitivity that trades false positives against missed regressions. Every field is explained inline.

{
  "detection": {
    "method": "welch",
    "window": { "baselineSize": 60, "candidateRuns": 5, "minBaseline": 30 },
    "sensitivity": { "alpha": 0.01, "minEffect": { "type": "rel", "value": 0.05 } },
    "direction": "regression-only",
    "metrics": {
      "lcp": { "enabled": true, "level": "error" },
      "inp": { "enabled": true, "level": "error" },
      "cls": { "enabled": true, "level": "warn" },
      "script_bytes": { "method": "cusum", "level": "error" }
    }
  }
}

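method chooses the statistic: welch (Welch's t-test) for roughly normally distributed timings, mannwhitney for skewed metrics, cusum for catching slow drift that a single-run test misses. A per-metric method, as on script_bytes, overrides the top-level one. window sets how much evidence is compared: up to baselineSize recent baseline samples against candidateRuns candidate runs, with minBaseline as the smallest baseline the detector will accept. alpha is the per-test false-positive rate: 0.01 means each test has a 1% chance of flagging pure noise, so the chance of at least one false alarm per run grows with the number of metrics tested. minEffect is the floor on practical significance: a difference must be both statistically significant and at least 5% (relative to the baseline) to gate, which suppresses tiny-but-real shifts nobody cares about. direction: regression-only ignores improvements so a faster run never fails the build.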
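The welch path with both gates applied can be sketched as follows. This is a minimal illustration, not the code in detect-run.js: the function names are hypothetical, it assumes higher values are worse for every metric, and it assumes both samples have non-zero variance. The one-sided p-value is obtained by numerically integrating the Student's t density, which is accurate enough for gating when the degrees of freedom are 2 or more.

```javascript
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function variance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Composite Simpson's rule on [a, b] with n (even) intervals.
function simpson(f, a, b, n = 2000) {
  const h = (b - a) / n;
  let sum = f(a) + f(b);
  for (let i = 1; i < n; i++) sum += f(a + i * h) * (i % 2 === 1 ? 4 : 2);
  return (sum * h) / 3;
}

// Upper-tail probability P(T > t) for Student's t with df >= 1.
// Substituting x = sqrt(df) * tan(theta) turns the density into
// cos(theta)^(df - 1) on a finite interval, so no gamma function is needed.
function studentTUpperTail(t, df) {
  const f = (theta) => Math.max(Math.cos(theta), 0) ** (df - 1);
  const theta0 = Math.atan(t / Math.sqrt(df));
  const tail = simpson(f, theta0, Math.PI / 2);
  const total = 2 * simpson(f, 0, Math.PI / 2);
  return tail / total;
}

// One-sided Welch test (regression-only) plus a relative minEffect floor.
function welchRegression(baseline, candidate, { alpha = 0.01, minEffectRel = 0.05 } = {}) {
  const mb = mean(baseline);
  const mc = mean(candidate);
  const vb = variance(baseline) / baseline.length; // squared standard errors
  const vc = variance(candidate) / candidate.length;
  const t = (mc - mb) / Math.sqrt(vb + vc);
  // Welch-Satterthwaite degrees of freedom.
  const df =
    (vb + vc) ** 2 /
    (vb ** 2 / (baseline.length - 1) + vc ** 2 / (candidate.length - 1));
  const p = studentTUpperTail(t, df);
  const relDelta = (mc - mb) / mb;
  return { relDelta, t, df, p, regression: p < alpha && relDelta >= minEffectRel };
}
```

Because the test is one-sided, an improvement produces a negative t and a p-value near 1, which is how regression-only falls out of the arithmetic rather than needing a separate branch. A candidate that is significantly but only 2% slower returns a small p yet regression: false, because it fails the minEffect floor.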

Step-by-Step Implementation

  1. Pull the baseline window and the candidate. Fetch the recent baseline samples per metric and the candidate's runs.

    node scripts/detect-fetch.js --branch main --metric lcp --out baseline_lcp.json

    Expected output: baseline lcp: 60 samples, mean=2180ms sd=140ms. This confirms there are enough samples and a usable spread.

  2. Run the significance test. Compare candidate against baseline with the configured method, applying both alpha and minEffect.

    node scripts/detect-run.js --config detection.json --baseline baseline_lcp.json --run .lighthouseci/

    Expected tail: lcp: candidate mean=2460ms Δ=+12.8% p=0.004 → REGRESSION (error) — one line per metric, with the verdict.

  3. Wire the exit code into the gate. A REGRESSION at error level exits non-zero; warn annotates without blocking. Calibrate sensitivity against your own false-positive rate before promoting any metric to error.
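The mapping from verdicts to an exit code in step 3 can be sketched as below. The verdict shape ({ metric, regression, level }) and the function name are assumptions for illustration; the ::warning:: and ::error:: prefixes are GitHub Actions workflow commands that turn a log line into an annotation.

```javascript
// Hypothetical verdict shape: { metric: string, regression: boolean, level: "error" | "warn" }.
function gateExitCode(verdicts) {
  const regressed = verdicts.filter((v) => v.regression);
  const blocking = regressed.filter((v) => v.level === "error");
  const warnings = regressed.filter((v) => v.level === "warn");

  // warn-level regressions annotate the run but never change the exit code.
  for (const v of warnings) {
    console.log(`::warning::${v.metric} regressed (warn level, not blocking)`);
  }
  for (const v of blocking) {
    console.log(`::error::${v.metric} regressed (error level, blocking)`);
  }
  return blocking.length > 0 ? 1 : 0;
}
```

A script would finish with process.exitCode = gateExitCode(verdicts), which lets pending output flush before Node exits with the chosen code.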

Threshold Calibration

The single dial that matters is sensitivity: lower alpha and higher minEffect mean fewer false alarms but slower detection of real regressions; the reverse catches small shifts fast at the cost of noise. Calibrate by replaying the detector over your last few weeks of known-good runs and counting how often it would have fired — that empirical false-positive rate, not theory, sets the dial. The matrix gives defensible starting points by environment.

Device class      Connection profile  Method        alpha  minEffect  candidateRuns
Desktop           Cable / Fiber       Welch t-test  0.01   4%         5
High-end mobile   4G / LTE            Welch t-test  0.01   5%         5
Mid-range mobile  Fast 3G             Mann–Whitney  0.02   6%         7

Mid-range mobile metrics are skewed and noisier, so a rank-based test and a looser alpha keep the false-positive rate tolerable. Keep every metric at warn until its replayed false-positive rate sits below roughly one alarm per two weeks, then promote to error. The detailed method-by-method tuning lives in Configuring Statistical Regression Alerts.
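The replay itself is a short loop. The sketch below assumes a hypothetical history format (runs ordered oldest to newest, each with a millisecond timestamp and its samples) and takes the detector as a callback, so any of the methods can be calibrated the same way. It treats baselineSize as a count of sample values, which is an assumption about how the window is measured.

```javascript
// Replay a detector over known-good history and report how often it would have fired.
// runs: non-empty array of { timestamp: number (ms), samples: number[] }, oldest first.
// detect(baselineSamples, candidateSamples): returns true when it would gate.
function replayFalsePositives(runs, detect, { baselineSize = 60, minBaseline = 30 } = {}) {
  let evaluated = 0;
  let alarms = 0;

  for (let i = 1; i < runs.length; i++) {
    // Baseline is the most recent baselineSize samples before run i.
    const prior = runs
      .slice(0, i)
      .flatMap((r) => r.samples)
      .slice(-baselineSize);
    if (prior.length < minBaseline) continue; // window not yet usable
    evaluated++;
    if (detect(prior, runs[i].samples)) alarms++;
  }

  // Every alarm on known-good history is a false positive by construction.
  const spanDays = (runs[runs.length - 1].timestamp - runs[0].timestamp) / 86_400_000;
  const alarmsPerTwoWeeks = spanDays > 0 ? (alarms / spanDays) * 14 : NaN;

  // Promote to error only when the rate is below one alarm per two weeks.
  return { evaluated, alarms, alarmsPerTwoWeeks, promote: alarmsPerTwoWeeks < 1 };
}
```

When the history is too short to compute a rate, alarmsPerTwoWeeks is NaN and promote stays false, which keeps an uncalibrated metric at warn by default.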

CI Enforcement Snippet

This GitHub Actions job runs detection against the baseline window and gates the merge on a significant regression. It is copy-paste ready, and its detect job reports a status check that can be marked as required in branch protection.

name: Regression Detection
on:
  pull_request:
    branches: [main]

jobs:
  detect:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20", cache: "npm" }
      - run: npm ci
      - name: Fetch baseline window
        run: node scripts/detect-fetch.js --branch main --out baseline_window.json
        env: { DETECT_BASELINE_URL: "${{ secrets.DETECT_BASELINE_URL }}" }
      - name: Collect candidate
        run: npx lhci collect && npx lhci upload --target=filesystem
      - name: Run change detection
        run: node scripts/detect-run.js --config detection.json --baseline baseline_window.json --run .lighthouseci/

The detect-run.js step exits non-zero only on a significant, large-enough regression, so the job is safe to make a required check. For the alert side — firing on real shifts without paging on noise — see Configuring Statistical Regression Alerts.

Troubleshooting & Edge Cases

  • Constant false positives → alpha too high or the baseline window too noisy. Lower alpha to 0.01, raise minEffect, and verify outliers are trimmed upstream per Statistical Noise & Flakiness Reduction.
  • Real regressions slip through → minEffect set above the regression size, or too few candidateRuns to reach significance. Raise the run count and lower the effect floor.
  • Skewed metric flagged constantly by a t-test → Welch assumes roughly normal data; switch that metric to mannwhitney.
  • Slow drift never fires → single-run tests compare one point against the window; add a cusum detector that accumulates small shifts over successive runs.
  • Baseline window too small after a reset → fewer than minBaseline samples. Fall back to a static ceiling until the window refills.
  • Improvement fails the build → direction not set to regression-only; a two-sided test flags faster runs too.
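For the slow-drift case, a one-sided CUSUM can be sketched in a few lines. This is the textbook tabular form on standardized values, not necessarily what the cusum method in detect-run.js does; the function name is hypothetical, and the slack k = 0.5 and decision limit h = 5 are common starting values rather than settings taken from this page.

```javascript
// One-sided (upper) CUSUM over successive run values.
// mean and sd come from the baseline window; k is the slack absorbed per run
// and h is the decision limit, both in units of the baseline sd.
function cusumUpper(values, { mean, sd, k = 0.5, h = 5 }) {
  let s = 0;
  for (let i = 0; i < values.length; i++) {
    // Accumulate only the part of each deviation that exceeds the slack.
    s = Math.max(0, s + (values[i] - mean) / sd - k);
    if (s > h) return { alarm: true, index: i, s };
  }
  return { alarm: false, index: -1, s };
}

// Usage: every run is just one sd above the baseline, too small to stand out
// on its own, but the accumulated sum crosses h on the 11th run.
const drift = cusumUpper(new Array(12).fill(102), { mean: 100, sd: 2 });
console.log(drift); // → { alarm: true, index: 10, s: 5.5 }
```

Resetting to zero whenever the sum goes negative is what makes the detector regression-only: runs that come in under the baseline cannot bank credit against later drift.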

Frequently Asked Questions

Why not just use a fixed threshold?

A fixed threshold ignores variance: a noisy metric crosses any line by chance and a slowly drifting one stays under it for weeks. Change detection compares the candidate against the baseline distribution and fires only when the shift exceeds what noise explains, which is why it produces a gate engineers trust. Fixed ceilings still have a role as a hard backstop alongside detection, set in Historical Baseline Calibration.

Which method should I start with?

Welch's t-test for normally distributed timing metrics like LCP and TBT, Mann–Whitney for skewed or mobile metrics, and CUSUM when you need to catch slow accumulating drift. Most teams start with Welch at alpha=0.01 and add CUSUM for size metrics. The full comparison is in Configuring Statistical Regression Alerts.
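For the rank-based option, a one-sided Mann–Whitney U test can be sketched as below. It is an illustration under stated simplifications: it uses the normal approximation without tie or continuity correction, assumes higher values are worse, and uses hypothetical function names. A minEffect floor would still be applied separately before gating.

```javascript
// Standard normal upper tail P(Z > z), using the Abramowitz-Stegun 7.1.26
// polynomial approximation of erf (absolute error around 1e-7).
function normalUpperTail(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  const upper = 0.5 * (1 - erf); // P(Z > |z|)
  return z >= 0 ? upper : 1 - upper;
}

function mannWhitneyRegression(baseline, candidate, alpha = 0.02) {
  // Pool and sort both samples, remembering which values are candidates.
  const all = [
    ...baseline.map((v) => ({ v, isCandidate: false })),
    ...candidate.map((v) => ({ v, isCandidate: true })),
  ].sort((a, b) => a.v - b.v);

  // Sum the candidate ranks, giving tied values their average rank.
  let rankSum = 0;
  for (let i = 0; i < all.length; ) {
    let j = i;
    while (j < all.length && all[j].v === all[i].v) j++;
    const avgRank = (i + 1 + j) / 2; // mean of 1-based ranks i+1 .. j
    for (let k = i; k < j; k++) if (all[k].isCandidate) rankSum += avgRank;
    i = j;
  }

  const nb = baseline.length;
  const nc = candidate.length;
  // U counts (baseline, candidate) pairs where the candidate is larger.
  const u = rankSum - (nc * (nc + 1)) / 2;
  const z = (u - (nb * nc) / 2) / Math.sqrt((nb * nc * (nb + nc + 1)) / 12);
  const p = normalUpperTail(z);
  return { u, z, p, regression: p < alpha };
}
```

Because the test works on ranks, a single extreme sample moves U by at most one baseline's worth of pairs, which is why it holds up better than a t-test on skewed mobile timings.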

How do I keep detection from blocking on pure noise?

Set a low alpha (0.01), require a minimum practical effect size (4–6%), and keep new metrics at warn until you have replayed the detector over known-good runs and confirmed its false-positive rate is below roughly one alarm per two weeks.