Configuring Statistical Regression Alerts

An alert that fires on every noisy run is worse than no alert — the team mutes the channel and the one real regression a month goes unread. The fix is to make the alert statistical: it fires only when a candidate run differs from the baseline distribution beyond what run-to-run variance explains. This guide, part of the Automated Regression Detection reference, shows how to choose between a z-score, a CUSUM, and a Mann–Whitney test, compute the significance, and wire the result into a CI alert that stays quiet until something actually moved.

The choice of test is not cosmetic — each answers a different question about how the metric is allowed to move, and picking the wrong one is the most common reason an alert is either deaf or hysterical.

Method Comparison

The three tests cover the situations a budget metric actually presents: a single suspicious run, slow accumulating drift, and a skewed metric that breaks normality assumptions.

Method	Detects	Assumes	Best for	Watch out for
Z-score	A single run far from the baseline mean	Roughly normal, known baseline mean & sd	LCP, TBT, FCP — symmetric timings	Sensitive to outliers in the baseline
CUSUM	Small persistent shifts accumulating over runs	Stable baseline mean	Slow drift in bundle size or LCP	Needs a tuned slack `k` to ignore noise
Mann–Whitney U	A distribution shift, ignoring shape	Nothing about normality	INP, mid-range mobile, skewed data	Needs enough samples per side (≥8)

A practical default: z-score for desktop timings, Mann–Whitney for mobile and INP, and a CUSUM running alongside on bundle size to catch the drift that single-run tests miss. The full per-environment dials are in Automated Regression Detection.

Diagnostic Steps

Before wiring an alert, check that the baseline has enough samples and a stable spread, and see what each test would say about the latest candidate.

# 1. Inspect the baseline distribution for the metric
node scripts/detect-fetch.js --branch main --metric lcp --stats
# → lcp  n=60  mean=2180ms  sd=140ms  skew=0.31  (normal-ish → z-score ok)

# 2. Dry-run all three tests against the candidate without alerting
node scripts/alert-eval.js --metric lcp --baseline baseline_lcp.json --run .lighthouseci/ --dry
# → z=3.1 (p=0.002)  cusum=+0.4σ (below h)  mw p=0.01  → z-score & MW agree: REGRESSION

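When two tests that rest on different assumptions agree, the signal is solid. They are computed from the same samples, so treat the agreement as a robustness check rather than independent confirmation. A low skew (under ~0.5) confirms the z-score is appropriate; a higher skew is the cue to lean on Mann–Whitney instead.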
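
The skew figure from step 1 can be reproduced by hand. The sketch below is a minimal illustration, assuming the moment-based sample skewness (third central moment divided by the 1.5 power of the second) and the ~0.5 cutoff from this guide. The names `skewness` and `pickTest` are hypothetical helpers, not part of detect-fetch.js, and the tooling may use a different skewness estimator.

```javascript
// Moment-based sample skewness: g1 = m3 / m2^1.5 (assumed definition).
function skewness(samples) {
  const n = samples.length;
  const m = samples.reduce((s, x) => s + x, 0) / n;
  const m2 = samples.reduce((s, x) => s + (x - m) ** 2, 0) / n;
  const m3 = samples.reduce((s, x) => s + (x - m) ** 3, 0) / n;
  return m2 === 0 ? 0 : m3 / Math.pow(m2, 1.5);
}

// Hypothetical helper: apply the ~0.5 rule of thumb from this guide.
function pickTest(samples, cutoff = 0.5) {
  return Math.abs(skewness(samples)) < cutoff ? "z-score" : "mann-whitney";
}

const symmetric = [2040, 2110, 2180, 2250, 2320]; // evenly spread timings (ms)
const longTail = [180, 190, 200, 210, 640];       // one slow interaction stretches the tail
console.log(pickTest(symmetric), pickTest(longTail)); // prints: z-score mann-whitney
```

The absolute value is deliberate: a strong left skew breaks the normality assumption just as a right skew does.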

Implementation

The snippet below is runnable and self-contained: it computes a z-score and a Mann–Whitney U significance for a candidate against a baseline array, applies the alpha and minimum-effect gates, and returns a result whose significant flag is true only when a real shift is found. Writing the alert payload is left to the CI step.

// scripts/alert-eval.js  —  fire only on statistically real regressions
const ALPHA = 0.01, MIN_EFFECT = 0.05; // 1% false-positive rate, 5% practical floor

const mean = a => a.reduce((s, x) => s + x, 0) / a.length;
const sd = a => { const m = mean(a); return Math.sqrt(mean(a.map(x => (x - m) ** 2))); };

// z-score of the candidate mean against the baseline distribution
function zTest(baseline, candidate) {
  const z = (mean(candidate) - mean(baseline)) / (sd(baseline) || 1);
  const p = 0.5 * (1 - erf(Math.abs(z) / Math.SQRT2)); // one-sided
  return { z: +z.toFixed(2), p: +p.toFixed(4) };
}

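// Mann–Whitney U: distribution-free, robust to skew (normal approximation)
function mannWhitney(a, b) {
  const all = [...a.map(v => [v, "a"]), ...b.map(v => [v, "b"])].sort((x, y) => x[0] - y[0]);
  let Ra = 0; // rank sum of group a; tied values share their average rank
  for (let i = 0; i < all.length; ) {
    let j = i;
    while (j < all.length && all[j][0] === all[i][0]) j++; // [i, j) is one run of ties
    const avgRank = (i + 1 + j) / 2; // mean of ranks i+1 .. j
    for (let k = i; k < j; k++) if (all[k][1] === "a") Ra += avgRank;
    i = j;
  }
  const U = Ra - (a.length * (a.length + 1)) / 2;
  const mu = (a.length * b.length) / 2;
  const sigma = Math.sqrt((a.length * b.length * (a.length + b.length + 1)) / 12); // no tie correction
  const z = (U - mu) / sigma;
  return { p: +(0.5 * (1 - erf(Math.abs(z) / Math.SQRT2))).toFixed(4) }; // one-sided
}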

function erf(x) { // Abramowitz–Stegun approximation
  const t = 1 / (1 + 0.3275911 * x);
  const y = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
  return y;
}

export function evaluate(baseline, candidate) {
  const effect = (mean(candidate) - mean(baseline)) / mean(baseline);
  const z = zTest(baseline, candidate);
  const mw = mannWhitney(baseline, candidate);
  const significant = z.p < ALPHA && mw.p < ALPHA && effect >= MIN_EFFECT;
  return { significant, effect: +(effect * 100).toFixed(1), z: z.z, pZ: z.p, pMW: mw.p };
}

A regression is reported only when both tests clear alpha and the effect is at least 5% — requiring agreement plus a practical floor is what keeps the alert quiet on noise.
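
The evaluator above covers the two per-run tests but not the CUSUM from the comparison table. As a rough sketch of how the slack `k` and an alarm threshold behave, here is a one-sided (upper) tabular CUSUM on deviations standardized by the baseline mean and sd. The function name `cusumUpper`, the values k = 0.5 and h = 4, and the bundle sizes are illustrative assumptions, not settings from this guide.

```javascript
// One-sided (upper) tabular CUSUM on standardized deviations.
// k = slack per run (in sd units), h = alarm threshold; both are illustrative values.
function cusumUpper(runs, baselineMean, baselineSd, k = 0.5, h = 4) {
  let s = 0;
  let alarmAt = -1; // index of the first run that crosses h, or -1 if none does
  const path = [];
  runs.forEach((x, i) => {
    const z = (x - baselineMean) / baselineSd;
    s = Math.max(0, s + z - k); // deviations smaller than the slack drain the sum toward 0
    path.push(+s.toFixed(2));
    if (alarmAt === -1 && s > h) alarmAt = i;
  });
  return { alarmAt, path };
}

// Bundle size in KB creeping upward; no single run is more than 1.5 sd above the baseline mean.
const bundleKb = [502, 505, 508, 507, 509, 511, 510, 512, 513, 515];
const drift = cusumUpper(bundleKb, 500, 10);
console.log(drift.alarmAt, drift.path);
```

With these numbers the sum first crosses h on the tenth run (index 9), even though a per-run z-score never exceeds 1.5. A larger `k` ignores more noise but reacts more slowly to drift.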

CI Gating Assertion

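This GitHub Actions step runs the evaluator and, only on a real regression, writes an alert payload and exits non-zero. continue-on-error: false is the default, stated explicitly here because it is the assertion: a significant shift fails the required check. Because alert-eval.js uses export, Node must treat it as an ES module, for example via "type": "module" in package.json.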

- name: Statistical regression alert
  continue-on-error: false
  run: |
    node -e '
      import("./scripts/alert-eval.js").then(async ({ evaluate }) => {
        const fs = await import("node:fs");
        const base = JSON.parse(fs.readFileSync("baseline_lcp.json"));
        const cand = JSON.parse(fs.readFileSync("candidate_lcp.json"));
        const r = evaluate(base.samples, cand.samples);
        console.log(`lcp Δ=${r.effect}%  pZ=${r.pZ}  pMW=${r.pMW}`);
        if (r.significant) {
          fs.writeFileSync("alert.json", JSON.stringify({ metric: "lcp", ...r }));
          process.exit(1);
        }
      });
    '
- name: Notify on regression
  if: failure()
  run: node scripts/alert-notify.js --in alert.json --channel "#perf-alerts"

Verification

Confirm the alert is calibrated by replaying it over known-good and known-bad runs. A correctly tuned alert stays silent on clean history and fires on the seeded regression.

# Replay 30 known-good main runs: expect zero alerts
node scripts/alert-replay.js --metric lcp --window known-good/ --expect 0
# → 30 runs evaluated, 0 alerts  ✓ false-positive rate within budget

# Replay a run with a seeded +12% regression: expect one alert
node scripts/alert-replay.js --metric lcp --run seeded-regression.json --expect 1
# → REGRESSION lcp Δ=+12.0% pZ=0.001 pMW=0.004  ✓ alert fired

Zero alerts over clean history and a clean fire on the seeded case means the alpha and effect floor are right for this metric. If clean history produces even one or two alerts, lower alpha or raise MIN_EFFECT. The deeper significance-testing methodology behind these choices is covered in Statistical Significance Testing for Noisy CI.
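
The replay check reduces to counting alerts over a set of runs and comparing that count with an expected number. The sketch below shows the loop, assuming an evaluator that returns the same { significant } shape as evaluate. The `replay` function and the stand-in `effectOnly` evaluator are hypothetical and do not reproduce alert-replay.js; the sample timings are made up.

```javascript
const avg = a => a.reduce((s, x) => s + x, 0) / a.length;

// Stand-in evaluator (assumption): flags a run on the 5% effect floor alone.
// In practice this would be the evaluate() function from the Implementation section.
const effectOnly = (baseline, candidate) => ({
  significant: (avg(candidate) - avg(baseline)) / avg(baseline) >= 0.05,
});

// Hypothetical replay loop: count alerts and compare with the expected number.
function replay(baseline, runs, evaluator, expected) {
  const alerts = runs.filter(run => evaluator(baseline, run).significant).length;
  return { evaluated: runs.length, alerts, pass: alerts === expected };
}

const baselineLcp = [2100, 2150, 2180, 2200, 2240, 2210, 2160, 2190];
const knownGood = [[2170, 2200, 2190], [2150, 2210, 2230], [2120, 2180, 2160]];
const seeded = [baselineLcp.map(x => x * 1.12)]; // one run with a +12% shift

console.log(replay(baselineLcp, knownGood, effectOnly, 0));
console.log(replay(baselineLcp, seeded, effectOnly, 1));
```

Passing the evaluator in as an argument keeps the replay loop independent of the test being calibrated, so the same loop can be rerun after changing alpha or the effect floor.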

Frequently Asked Questions

Z-score or Mann–Whitney — how do I choose?

Check the skew of the baseline. If it is roughly symmetric (skew under about 0.5), a z-score is simpler and slightly more powerful. If the metric is skewed — INP and mid-range mobile timings usually are — use Mann–Whitney, which assumes nothing about shape. When both agree, treat the signal as solid. See Automated Regression Detection for the per-environment defaults.

What does CUSUM add over a per-run test?

A z-score or Mann–Whitney test looks at one candidate against the baseline. CUSUM accumulates small deviations across successive runs, so it catches a slow drift where each individual run looks fine but the trend is clearly upward — the classic bundle-size creep. Run it alongside, not instead of, a per-run test.

Why require both a p-value and an effect size?

With a large baseline window, a statistically significant difference can be tiny and meaningless. Requiring a minimum effect (5%) alongside alpha ensures the alert fires only on shifts that are both real and big enough to act on, which is what keeps the channel readable.
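
To make this concrete, consider a 1% shift measured over a large window. The sketch below is an illustration under stated assumptions: it uses a two-sample z-test on the standard error of the means, which is a different statistic from the zTest in the Implementation section, and it runs on synthetic data. The names `gate` and `normalUpperTail` are hypothetical.

```javascript
const ALPHA_DEMO = 0.01, MIN_EFFECT_DEMO = 0.05;
const avgOf = a => a.reduce((s, x) => s + x, 0) / a.length;
const varOf = a => { const m = avgOf(a); return a.reduce((s, x) => s + (x - m) ** 2, 0) / (a.length - 1); };

// Upper-tail normal probability via the Abramowitz-Stegun erf approximation.
function normalUpperTail(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const erf = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
  const tail = 0.5 * (1 - erf);
  return z >= 0 ? tail : 1 - tail;
}

// Significance and effect size are checked separately, then combined.
function gate(baseline, candidate) {
  const diff = avgOf(candidate) - avgOf(baseline);
  const effect = diff / avgOf(baseline);
  const se = Math.sqrt(varOf(baseline) / baseline.length + varOf(candidate) / candidate.length);
  const p = normalUpperTail(diff / se);
  const significant = p < ALPHA_DEMO;
  return { effect, p, significant, alert: significant && effect >= MIN_EFFECT_DEMO };
}

// Synthetic window of 400 runs spread about ±150 ms around 2000 ms.
const big = Array.from({ length: 400 }, (_, i) => 2000 + ((i % 21) - 10) * 15);
const shifted = big.map(x => x + 20); // +20 ms, roughly a 1% shift
console.log(gate(big, shifted));
```

Here the shift clears alpha because 400 samples per side shrink the standard error, yet the effect is about 1%, so the combined gate stays quiet.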
