Configuring Statistical Regression Alerts
An alert that fires on every noisy run is worse than no alert — the team mutes the channel and the one real regression a month goes unread. The fix is to make the alert statistical: it fires only when a candidate run differs from the baseline distribution beyond what run-to-run variance explains. This guide, part of the Automated Regression Detection reference, shows how to choose between a z-score, a CUSUM, and a Mann–Whitney test, compute the significance, and wire the result into a CI alert that stays quiet until something actually moved.
The choice of test is not cosmetic — each answers a different question about how the metric is allowed to move, and picking the wrong one is the most common reason an alert is either deaf or hysterical.
Method Comparison
The three tests cover the situations a budget metric actually presents: a single suspicious run, slow accumulating drift, and a skewed metric that breaks normality assumptions.
| Method | Detects | Assumes | Best for | Watch out for |
|---|---|---|---|---|
| Z-score | A single run far from the baseline mean | Roughly normal, known baseline mean & sd | LCP, TBT, FCP — symmetric timings | Sensitive to outliers in the baseline |
| CUSUM | Small persistent shifts accumulating over runs | Stable baseline mean | Slow drift in bundle size or LCP | Needs a tuned slack k to ignore noise |
| Mann–Whitney U | A distribution shift, ignoring shape | Nothing about normality | INP, mid-range mobile, skewed data | Needs enough samples per side (≥8) |
A practical default: z-score for desktop timings, Mann–Whitney for mobile and INP, and a CUSUM running alongside on bundle size to catch the drift that single-run tests miss. The full per-environment dials are in Automated Regression Detection.
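The CUSUM row above can be made concrete with a one-sided upper CUSUM. What follows is a minimal sketch, not this guide's implementation: the function name `cusumUpper` is hypothetical, and the slack `k = 0.5` and threshold `h = 4` (both in units of baseline sd) are common textbook starting points that would need tuning per metric.

```javascript
// One-sided (upper) CUSUM over successive runs, in units of baseline sd.
// k (slack) and h (decision threshold) are illustrative defaults, not tuned values.
function cusumUpper(values, baselineMean, baselineSd, k = 0.5, h = 4) {
  let s = 0;          // running cumulative sum, clamped at zero
  let alarmAt = -1;   // index of the first run where s exceeds h, or -1
  const path = [];
  values.forEach((v, i) => {
    const z = (v - baselineMean) / baselineSd; // standardized deviation of this run
    s = Math.max(0, s + z - k);                // deviations smaller than k are absorbed
    path.push(+s.toFixed(2));
    if (alarmAt === -1 && s > h) alarmAt = i;
  });
  return { alarmAt, path };
}

// Nine runs, each only +1 sd above a 2180ms / 140ms baseline
const drift = cusumUpper(Array(9).fill(2320), 2180, 140);
console.log(drift.alarmAt, drift.path);

// Noise around the baseline never accumulates
const stable = cusumUpper([2180, 2200, 2160, 2190], 2180, 140);
console.log(stable.alarmAt);
```

In the drifting series no single run is more than one standard deviation from the mean, so a per-run z-score gate would stay quiet; the cumulative sum grows by 0.5 per run and crosses `h` on the ninth run. A larger `k` ignores more noise at the cost of reacting later.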
Diagnostic Steps
Before wiring an alert, check that the baseline has enough samples and a stable spread, and see what each test would say about the latest candidate.
# 1. Inspect the baseline distribution for the metric
node scripts/detect-fetch.js --branch main --metric lcp --stats
# → lcp n=60 mean=2180ms sd=140ms skew=0.31 (normal-ish → z-score ok)
# 2. Dry-run all three tests against the candidate without alerting
node scripts/alert-eval.js --metric lcp --baseline baseline_lcp.json --run .lighthouseci/ --dry
# → z=3.1 (p=0.002) cusum=+0.4σ (below h) mw p=0.01 → z-score & MW agree: REGRESSION
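When two tests with different assumptions agree, the signal is solid. They run on the same samples, so they are not statistically independent, but they fail in different ways, which is what makes agreement informative. A low skew (under ~0.5) confirms the z-score is appropriate; a higher skew is the cue to lean on Mann–Whitney instead.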
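The skew check can be sketched in a few lines. This assumes the moment coefficient of skewness (third central moment over the 1.5 power of the second); how `detect-fetch.js --stats` actually computes its `skew` figure is not stated here, and the names `skewness` and `pickTest` are hypothetical.

```javascript
// Moment coefficient of skewness: m3 / m2^1.5 (0 for a symmetric sample).
function skewness(samples) {
  const n = samples.length;
  const m = samples.reduce((s, x) => s + x, 0) / n;
  const m2 = samples.reduce((s, x) => s + (x - m) ** 2, 0) / n;
  const m3 = samples.reduce((s, x) => s + (x - m) ** 3, 0) / n;
  return m2 === 0 ? 0 : m3 / m2 ** 1.5;
}

// Pick the per-run test from the baseline shape; 0.5 mirrors the rule of thumb above.
function pickTest(samples, maxSkew = 0.5) {
  return Math.abs(skewness(samples)) < maxSkew ? "z-score" : "mann-whitney";
}

const symmetric = [2000, 2100, 2200, 2300, 2400]; // evenly spread timings
const longTail = [180, 190, 200, 210, 900];       // one slow interaction drags the tail
console.log(pickTest(symmetric), pickTest(longTail));
```

With only a handful of samples the skew estimate is itself noisy, so it is worth computing over the full baseline window rather than a single run.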
Implementation
The snippet below is runnable and self-contained: it computes a z-score and a Mann–Whitney U significance for a candidate against a baseline array, applies the alpha and minimum-effect gates, and emits an alert payload only when a real shift is found.
// scripts/alert-eval.js: fire only on statistically real regressions
const ALPHA = 0.01, MIN_EFFECT = 0.05; // 1% significance level per test, 5% practical floor

const mean = a => a.reduce((s, x) => s + x, 0) / a.length;
// population standard deviation (divides by n)
const sd = a => { const m = mean(a); return Math.sqrt(mean(a.map(x => (x - m) ** 2))); };

// z-score of the candidate mean against the baseline distribution
function zTest(baseline, candidate) {
  const z = (mean(candidate) - mean(baseline)) / (sd(baseline) || 1);
  const p = 0.5 * (1 - erf(Math.abs(z) / Math.SQRT2)); // one-sided tail; direction is gated by effect in evaluate()
  return { z: +z.toFixed(2), p: +p.toFixed(4) };
}

// Mann–Whitney U (normal approximation): distribution-free, robust to skew
function mannWhitney(a, b) {
  const all = [...a.map(v => [v, "a"]), ...b.map(v => [v, "b"])].sort((x, y) => x[0] - y[0]);
  let rankSumA = 0;
  for (let i = 0; i < all.length;) {
    let j = i;
    while (j < all.length && all[j][0] === all[i][0]) j++; // [i, j) is one group of tied values
    const avgRank = (i + 1 + j) / 2; // tied values share the mean of ranks i+1..j
    for (let k = i; k < j; k++) if (all[k][1] === "a") rankSumA += avgRank;
    i = j;
  }
  const U = rankSumA - (a.length * (a.length + 1)) / 2;
  const mu = (a.length * b.length) / 2;
  // no tie correction on sigma, so p is slightly conservative when many values tie
  const sigma = Math.sqrt((a.length * b.length * (a.length + b.length + 1)) / 12);
  const z = (U - mu) / sigma;
  return { p: +(0.5 * (1 - erf(Math.abs(z) / Math.SQRT2))).toFixed(4) };
}

// Abramowitz–Stegun approximation; valid for x >= 0 (callers pass Math.abs(z))
function erf(x) {
  const t = 1 / (1 + 0.3275911 * x);
  const y = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
  return y;
}

export function evaluate(baseline, candidate) {
  const effect = (mean(candidate) - mean(baseline)) / mean(baseline); // positive = slower or larger
  const z = zTest(baseline, candidate);
  const mw = mannWhitney(baseline, candidate);
  const significant = z.p < ALPHA && mw.p < ALPHA && effect >= MIN_EFFECT;
  return { significant, effect: +(effect * 100).toFixed(1), z: z.z, pZ: z.p, pMW: mw.p };
}
A regression is reported only when both tests clear alpha and the effect is at least 5% — requiring agreement plus a practical floor is what keeps the alert quiet on noise.
CI Gating Assertion
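This GitHub Actions step runs the evaluator; on a real regression it writes alert.json and exits non-zero, and the follow-up step posts the alert. The continue-on-error: false is the assertion: a significant shift fails the required check. Note that if: failure() matches any earlier failed step, so the notifier should tolerate a missing alert.json.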
- name: Statistical regression alert
  continue-on-error: false
  run: |
    node -e '
      import("./scripts/alert-eval.js").then(async ({ evaluate }) => {
        const fs = await import("node:fs");
        const base = JSON.parse(fs.readFileSync("baseline_lcp.json"));
        const cand = JSON.parse(fs.readFileSync("candidate_lcp.json"));
        const r = evaluate(base.samples, cand.samples);
        console.log(`lcp Δ=${r.effect}% pZ=${r.pZ} pMW=${r.pMW}`);
        if (r.significant) {
          fs.writeFileSync("alert.json", JSON.stringify({ metric: "lcp", ...r }));
          process.exit(1);
        }
      });
    '
- name: Notify on regression
  if: failure()
  run: node scripts/alert-notify.js --in alert.json --channel "#perf-alerts"
Verification
Confirm the alert is calibrated by replaying it over known-good and known-bad runs. A correctly tuned alert stays silent on clean history and fires on the seeded regression.
# Replay 30 known-good main runs: expect zero alerts
node scripts/alert-replay.js --metric lcp --window known-good/ --expect 0
# → 30 runs evaluated, 0 alerts ✓ false-positive rate within budget
# Replay a run with a seeded +12% regression: expect one alert
node scripts/alert-replay.js --metric lcp --run seeded-regression.json --expect 1
# → REGRESSION lcp Δ=+12.0% pZ=0.001 pMW=0.004 ✓ alert fired
Zero alerts over clean history and a clean fire on the seeded case mean the alpha and effect floor are right for this metric. If clean history produces even one or two alerts, lower ALPHA (a stricter threshold) or raise MIN_EFFECT. The deeper significance-testing methodology behind these choices is covered in Statistical Significance Testing for Noisy CI.
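How much a clean replay actually proves depends on its length. As a rough sketch, assuming runs are independent and each clean run has the same chance of a false alert (both simplifications), the probability of seeing zero alerts in n runs is (1 − rate)^n. The helper name `pZeroAlerts` is hypothetical.

```javascript
// Probability that n independent clean runs produce zero alerts
// when each run has false-positive probability `rate`.
const pZeroAlerts = (rate, n) => (1 - rate) ** n;

for (const rate of [0.01, 0.05, 0.10]) {
  console.log(`rate=${rate} P(0 alerts in 30 runs)=${pZeroAlerts(rate, 30).toFixed(2)}`);
}
```

Under these assumptions a 30-run window comes back clean about 74% of the time at a 1% rate, but still about 21% of the time at a 5% rate, so a longer known-good window separates a well-calibrated alert from a merely lucky one more reliably.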
Frequently Asked Questions
Z-score or Mann–Whitney — how do I choose?
Check the skew of the baseline. If it is roughly symmetric (skew under about 0.5), a z-score is simpler and slightly more powerful. If the metric is skewed — INP and mid-range mobile timings usually are — use Mann–Whitney, which assumes nothing about shape. When both agree, treat the signal as solid. See Automated Regression Detection for the per-environment defaults.
What does CUSUM add over a per-run test?
A z-score or Mann–Whitney test looks at one candidate against the baseline. CUSUM accumulates small deviations across successive runs, so it catches a slow drift where each individual run looks fine but the trend is clearly upward — the classic bundle-size creep. Run it alongside, not instead of, a per-run test.
Why require both a p-value and an effect size?
With a large baseline window, a statistically significant difference can be tiny and meaningless. Requiring a minimum effect (5%) alongside alpha ensures the alert fires only on shifts that are both real and big enough to act on, which is what keeps the channel readable.
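A small worked example shows the two gates pulling apart. This is illustrative arithmetic only: it assumes a z statistic on a mean of n samples using the standard error sd/√n, which is a different formulation from the evaluator in the Implementation section, and the name `shiftCheck` is hypothetical. The baseline figures (2180ms mean, 140ms sd) are the ones from the diagnostic output.

```javascript
// Illustrative only: mean-shift z statistic using the standard error sd / sqrt(n).
const Z_CRIT = 2.326; // one-sided critical value for alpha = 0.01

function shiftCheck(baselineMean, baselineSd, candidateMean, n, minEffect = 0.05) {
  const z = (candidateMean - baselineMean) / (baselineSd / Math.sqrt(n));
  const effect = (candidateMean - baselineMean) / baselineMean;
  const significant = z > Z_CRIT;
  return {
    z: +z.toFixed(2),
    effectPct: +(effect * 100).toFixed(1),
    significant,                                     // statistically real
    actionable: significant && effect >= minEffect,  // real and big enough to act on
  };
}

const small = shiftCheck(2180, 140, 2200, 400); // +20ms over 400 samples
const big = shiftCheck(2180, 140, 2310, 400);   // +130ms over 400 samples
console.log(small, big);
```

The +20ms shift clears the significance threshold (z ≈ 2.86) yet is under 1% of the baseline, so only the effect floor stops it from paging anyone; the +130ms shift passes both gates.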