Statistical Significance Testing for Noisy CI

A median LCP that ticks up 80 ms on a pull request might be a real regression or it might be the same noise that moves it 80 ms on a no-op commit. Failing the build on the raw delta makes the gate flaky; ignoring it lets regressions through. The fix is to ask a sharper question — is this change larger than the run-to-run noise predicts? — and answer it with a significance test. This guide is part of the Statistical Noise & Flakiness Reduction reference and shows how to compare a baseline sample against a PR sample and only fail when the difference is statistically real.

Choosing a Test Method

You are comparing two small samples — N runs from the baseline and N from the PR — and deciding whether their central tendency differs. The right test depends on the distribution shape.

| Method | Compares | Assumes | Use when |
| --- | --- | --- | --- |
| Welch's t-test | Means | Roughly normal data; variances may differ | LCP/TBT with N ≥ 5, few outliers |
| Mann-Whitney U | Rank distributions | None (non-parametric) | Skewed metrics or visible outliers |
| Confidence interval on Δ | Difference of means | Roughly normal data | You want an effect size, not just pass/fail |
| Bootstrap CI | Any statistic | None | Tiny N or unusual distributions |

For Lighthouse metrics with five or more runs, Welch's t-test combined with a confidence interval on the difference is the pragmatic default; reach for Mann-Whitney when a metric is visibly skewed or an outlier survives your median aggregation. The key principle: never fail on a point estimate alone. Require that the confidence interval of the regression excludes zero and that the effect exceeds a minimum size you care about.
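
For the skewed-metric case, a rank-based comparison can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `mannWhitneyU` and `normalCdf` are invented here, the p-value uses the normal approximation with a tie correction (no continuity correction), and that approximation is rough below about eight runs per side.

```javascript
// mann-whitney.js: rank-based two-sample test (illustrative sketch).
const BASELINE = [2310, 2280, 2350, 2300, 2330, 2290, 2360, 2320];
const PR = [2390, 2410, 2380, 2440, 2400, 2370, 2420, 2400];

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation.
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

function mannWhitneyU(a, b) {
  // Pool both samples and sort so ranks can be assigned.
  const pooled = [
    ...a.map((v) => ({ v, group: 0 })),
    ...b.map((v) => ({ v, group: 1 })),
  ].sort((p, q) => p.v - q.v);

  let rankSumA = 0;
  let tieTerm = 0; // sum of (t^3 - t) over groups of tied values
  for (let i = 0; i < pooled.length; ) {
    let j = i;
    while (j < pooled.length && pooled[j].v === pooled[i].v) j++;
    const avgRank = (i + 1 + j) / 2; // average of 1-based ranks i+1..j
    const ties = j - i;
    tieTerm += ties ** 3 - ties;
    for (let k = i; k < j; k++) {
      if (pooled[k].group === 0) rankSumA += avgRank;
    }
    i = j;
  }

  const n1 = a.length;
  const n2 = b.length;
  const n = n1 + n2;
  const uA = rankSumA - (n1 * (n1 + 1)) / 2;
  const u = Math.min(uA, n1 * n2 - uA); // smaller of the two U statistics
  const meanU = (n1 * n2) / 2;
  const varU = ((n1 * n2) / 12) * (n + 1 - tieTerm / (n * (n - 1)));
  const z = (u - meanU) / Math.sqrt(varU); // z <= 0 because u is the smaller U
  const p = Math.min(1, 2 * normalCdf(z)); // two-sided
  return { u, z, p };
}

console.log(mannWhitneyU(BASELINE, PR));
```

In the example data every PR run is slower than every baseline run, so U is 0 and the test agrees with Welch's verdict; the two tests diverge mainly when one or two extreme runs dominate the mean.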

Diagnostic Steps

Collect matched samples from the baseline branch and the PR branch under identical settings, then look at the raw spread before testing.

git checkout main && npm run build && npx lhci collect --url=http://localhost:8080/ --numberOfRuns=8
node -e "const fs=require('fs');console.log(fs.readdirSync('.lighthouseci').filter(f=>f.startsWith('lhr-')&&f.endsWith('.json')).map(f=>JSON.parse(fs.readFileSync('.lighthouseci/'+f)).audits['largest-contentful-paint'].numericValue.toFixed(0)).join(' '))"

Example baseline output: 2310 2280 2350 2300 2330 2290 2360 2320. Repeat on the PR branch into a separate directory; an example PR sample might be 2390 2410 2380 2440 2400 2370 2420 2400. The medians differ by 85 ms (2315 vs 2400), but whether that is significant depends on the spread, which the test below decides.
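
To look at the spread before testing, a small summary of each sample is usually enough. The sketch below is illustrative only (the `summarize` helper is not part of Lighthouse CI); it reports the median, sample standard deviation, and coefficient of variation for the two example samples.

```javascript
// spread.js: summarise each sample before running any test.
const BASELINE = [2310, 2280, 2350, 2300, 2330, 2290, 2360, 2320];
const PR = [2390, 2410, 2380, 2440, 2400, 2370, 2420, 2400];

function summarize(sample) {
  const sorted = [...sample].sort((a, b) => a - b);
  const n = sorted.length;
  const mid = Math.floor(n / 2);
  const median = n % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = sorted.reduce((s, x) => s + x, 0) / n;
  // Sample standard deviation (n - 1 in the denominator).
  const sd = Math.sqrt(
    sorted.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1)
  );
  return { n, median, mean, sd, cv: sd / mean, min: sorted[0], max: sorted[n - 1] };
}

const base = summarize(BASELINE);
const pr = summarize(PR);
console.log("baseline", base);
console.log("pr      ", pr);
console.log(`median delta: ${pr.median - base.median} ms`);
```

For the example samples the standard deviations come out near 28 ms and 22 ms, roughly 1% of the mean, which is why an 85 ms shift stands out so clearly here. With a noisier page the same shift could be indistinguishable from noise.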

Implementation

This runnable script takes two arrays of metric values, computes Welch's t-test and a 95% confidence interval on the difference of means, and prints a verdict. It uses only Node's standard library.

// significance.js — node significance.js
// Decide if a PR metric sample is a significant regression vs baseline.

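// Optional: node significance.js baseline.json pr.json
// Each file is assumed to hold a JSON array of numbers; with no arguments
// the example samples below are used.
const load = (file, fallback) =>
  file ? JSON.parse(require("fs").readFileSync(file, "utf8")) : fallback;
const BASELINE = load(process.argv[2], [2310, 2280, 2350, 2300, 2330, 2290, 2360, 2320]);
const PR = load(process.argv[3], [2390, 2410, 2380, 2440, 2400, 2370, 2420, 2400]);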
const MIN_EFFECT_MS = 50; // ignore differences smaller than this
const ALPHA = 0.05; // 95% confidence

const mean = (a) => a.reduce((s, x) => s + x, 0) / a.length;
const variance = (a) => {
  const m = mean(a);
  return a.reduce((s, x) => s + (x - m) ** 2, 0) / (a.length - 1);
};

function welch(a, b) {
  const ma = mean(a), mb = mean(b);
  const va = variance(a), vb = variance(b);
  const se = Math.sqrt(va / a.length + vb / b.length);
  const t = (mb - ma) / se;
  // Welch–Satterthwaite degrees of freedom
  const df =
    (va / a.length + vb / b.length) ** 2 /
    ((va / a.length) ** 2 / (a.length - 1) +
      (vb / b.length) ** 2 / (b.length - 1));
  return { diff: mb - ma, se, t, df };
}

// Two-sided p-value for Student's t: p = I_x(df/2, 1/2) with x = df / (df + t^2).
function tPValue(t, df) {
  const x = df / (df + t * t);
  // Regularized incomplete beta, evaluated by the continued fraction below.
  const ib = incBeta(x, df / 2, 0.5);
  return Math.max(0, Math.min(1, ib)); // clamp rounding error into [0, 1]
}

function incBeta(x, a, b) {
  if (x <= 0) return 0;
  if (x >= 1) return 1;
  const lbeta =
    lgamma(a) + lgamma(b) - lgamma(a + b);
  const front =
    Math.exp(Math.log(x) * a + Math.log(1 - x) * b - lbeta) / a;
  let c = 1, d = 0, f = 1;
  for (let i = 0; i <= 200; i++) {
    const m = Math.floor(i / 2);
    let num;
    if (i === 0) num = 1;
    else if (i % 2 === 0)
      num = (m * (b - m) * x) / ((a + 2 * m - 1) * (a + 2 * m));
    else
      num =
        -((a + m) * (a + b + m) * x) /
        ((a + 2 * m) * (a + 2 * m + 1));
    d = 1 + num * d;
    if (Math.abs(d) < 1e-30) d = 1e-30;
    d = 1 / d;
    c = 1 + num / c;
    if (Math.abs(c) < 1e-30) c = 1e-30;
    f *= d * c;
  }
  return front * (f - 1);
}

function lgamma(z) {
  const g = [
    676.5203681218851, -1259.1392167224028, 771.32342877765313,
    -176.61502916214059, 12.507343278686905, -0.13857109526572012,
    9.9843695780195716e-6, 1.5056327351493116e-7,
  ];
  z -= 1;
  let x = 0.99999999999980993;
  for (let i = 0; i < g.length; i++) x += g[i] / (z + i + 1);
  const t = z + g.length - 0.5;
  return (
    0.5 * Math.log(2 * Math.PI) +
    (z + 0.5) * Math.log(t) -
    t +
    Math.log(x)
  );
}

const { diff, se, t, df } = welch(BASELINE, PR);
const p = tPValue(t, df);
// Approximate two-sided 95% critical t (tied to ALPHA = 0.05);
// it runs slightly narrow when df drops below about 10.
const tCrit = 1.96 + 2.4 / df;
const ciLow = diff - tCrit * se;
const ciHigh = diff + tCrit * se;
// Fail only on a slowdown: significant, positive, and at least MIN_EFFECT_MS.
const significant = p < ALPHA && diff >= MIN_EFFECT_MS;

console.log(`Δ mean        : ${diff.toFixed(1)} ms`);
console.log(`95% CI        : [${ciLow.toFixed(1)}, ${ciHigh.toFixed(1)}] ms`);
console.log(`p-value       : ${p.toFixed(4)}`);
console.log(`min effect    : ${MIN_EFFECT_MS} ms`);
console.log(`verdict       : ${significant ? "REGRESSION (fail)" : "noise (pass)"}`);
process.exit(significant ? 1 : 0);

Run it after collecting both samples. With the example data it prints a Δ of about +84 ms, a 95% CI of roughly [57, 111] ms that excludes zero, and a p-value far below 0.05, then exits 1: a confirmed regression.
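
The arithmetic behind that verdict can be checked by hand. The following worked version uses the example samples with rounded values; the subscripts B (baseline) and P (PR) are notation introduced here for illustration, and 2.14 is the script's approximate critical value for these degrees of freedom.

```latex
\begin{align*}
\bar{x}_B &= 2317.5, \quad s_B^2 \approx 792.9, \quad n_B = 8 \\
\bar{x}_P &= 2401.25, \quad s_P^2 \approx 498.2, \quad n_P = 8 \\
\Delta &= \bar{x}_P - \bar{x}_B = 83.75\ \text{ms} \\
\mathrm{SE} &= \sqrt{\frac{s_B^2}{n_B} + \frac{s_P^2}{n_P}}
  \approx \sqrt{99.1 + 62.3} \approx 12.70\ \text{ms} \\
t &= \frac{\Delta}{\mathrm{SE}} \approx 6.59 \\
\nu &= \frac{\left(\dfrac{s_B^2}{n_B} + \dfrac{s_P^2}{n_P}\right)^2}
  {\dfrac{(s_B^2/n_B)^2}{n_B - 1} + \dfrac{(s_P^2/n_P)^2}{n_P - 1}} \approx 13.3 \\
\text{95\% CI} &\approx \Delta \pm 2.14 \cdot \mathrm{SE} \approx [56.6,\ 110.9]\ \text{ms}
\end{align*}
```

A t statistic above 6 with about 13 degrees of freedom corresponds to a two-sided p-value far below 0.05, and the mean difference is above the 50 ms minimum effect, so both conditions of the gate are met.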

CI Gating Assertion

Wire the script as the gate so the build fails only on a statistically significant, large-enough regression. The baseline sample is fetched from your trend store rather than recollected, which is faster and more stable, provided the stored runs were collected on the same runner type and Lighthouse settings as the PR sample.

- name: Collect PR sample
  run: |
    npm run build
    npx lhci collect --url=http://localhost:8080/ --numberOfRuns=8
    node scripts/extract-lcp.js .lighthouseci > pr-sample.json
- name: Fetch baseline sample
  run: curl -s "$BASELINE_API/lcp?branch=main&n=8" -o baseline-sample.json
- name: Significance gate
  run: node scripts/significance.js baseline-sample.json pr-sample.json

Because the gate fails only when the difference is statistically significant and the mean slowdown is at least the minimum effect size, a PR that merely catches a bad noise day passes, which is exactly the flakiness this removes. Feed the gate's results, passes as well as confirmed failures, into your trend history through Automated Regression Detection so a slow drift across many small-but-insignificant PRs is still caught at the baseline level.

Verification

Validate the gate against two known cases before trusting it. First, run the script with the PR sample set equal to the baseline — it must print a Δ near zero and noise (pass), exiting 0. Second, inject a deliberate +300 ms regression into the PR sample — it must print a CI that excludes zero, a p-value under 0.05, and REGRESSION (fail), exiting 1.

node scripts/significance.js   # with PR === BASELINE → "noise (pass)", exit 0

A correctly tuned gate fails rarely, if ever, across a week of no-op commits (the false-positive rate should sit under the chosen alpha of 5%, and well under it once the minimum effect size applies) while still failing the injected regression. If no-op commits trip it, raise the run count or the minimum effect size; if the injected regression passes, the minimum effect is set too high or the run count is too low for the noise level.
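
Before spending a week of real no-op builds, the expected false-positive rate can be estimated with a quick simulation. The sketch below rests on stated assumptions rather than measurements: LCP is modelled as normal noise around 2300 ms with a 28 ms standard deviation, both sides get 8 runs, and the gate uses a fixed critical value of 2.145 (two-sided 95% at 14 degrees of freedom) instead of the script's p-value. All function names are illustrative.

```javascript
// aa-check.js: estimate the gate's false-positive rate on simulated no-op runs.
const T_CRIT = 2.145; // two-sided 95% critical t for df = 14 (simplification)

// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Normal variate via the Box-Muller transform.
function normal(rand, mean, sd) {
  const u1 = 1 - rand(); // in (0, 1], avoids log(0)
  const u2 = rand();
  return mean + sd * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

const mean = (a) => a.reduce((s, x) => s + x, 0) / a.length;
const variance = (a) => {
  const m = mean(a);
  return a.reduce((s, x) => s + (x - m) ** 2, 0) / (a.length - 1);
};

// Same rule as the gate: a significant slowdown of at least minEffect.
function gateFails(base, pr, minEffect) {
  const diff = mean(pr) - mean(base);
  const se = Math.sqrt(variance(base) / base.length + variance(pr) / pr.length);
  return diff / se > T_CRIT && diff >= minEffect;
}

function falsePositiveRate({
  trials = 2000, runs = 8, center = 2300, sd = 28, minEffect = 50, seed = 42,
} = {}) {
  const rand = mulberry32(seed);
  const sample = () => Array.from({ length: runs }, () => normal(rand, center, sd));
  let failures = 0;
  for (let i = 0; i < trials; i++) {
    // Both samples come from the same distribution, so any failure is false.
    if (gateFails(sample(), sample(), minEffect)) failures++;
  }
  return failures / trials;
}

console.log("rate without min effect:", falsePositiveRate({ minEffect: 0 }));
console.log("rate with 50 ms min effect:", falsePositiveRate());
```

Replacing the simulated samples with pairs drawn from real no-op history gives a more trustworthy number, since real Lighthouse noise is often skewed rather than normal.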

Frequently Asked Questions

Why not just fail when the median goes up?

Because the median moves on its own from run-to-run noise even with identical code. Failing on the raw delta means the gate fires on bad-luck builds and gets ignored. A significance test asks whether the change is larger than the noise predicts — it only fails when the confidence interval of the regression excludes zero and the effect exceeds a minimum you set, so no-op commits pass.

t-test or Mann-Whitney for Lighthouse metrics?

Use Welch's t-test for LCP and TBT with five or more runs and few outliers — it is simple and gives a confidence interval as an effect size. Switch to Mann-Whitney U when a metric is visibly skewed or an outlier survives median aggregation, since it ranks values and assumes no particular distribution.
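
When neither test's assumptions feel comfortable, the bootstrap CI from the method table is a third option. The sketch below is a minimal percentile bootstrap on the difference of medians; the function names, the 5000 resamples, and the fixed seed are illustrative choices, and with only eight runs per side the interval is coarse.

```javascript
// bootstrap-ci.js: percentile bootstrap CI on the difference of medians.
const BASELINE = [2310, 2280, 2350, 2300, 2330, 2290, 2360, 2320];
const PR = [2390, 2410, 2380, 2440, 2400, 2370, 2420, 2400];

// Seeded PRNG (mulberry32) so repeated CI runs give the same interval.
function mulberry32(seed) {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function median(values) {
  const s = [...values].sort((x, y) => x - y);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function bootstrapMedianDiffCI(base, pr, { resamples = 5000, level = 0.95, seed = 1 } = {}) {
  const rand = mulberry32(seed);
  // Draw a sample of the same size, with replacement.
  const resample = (a) => a.map(() => a[Math.floor(rand() * a.length)]);
  const diffs = [];
  for (let i = 0; i < resamples; i++) {
    diffs.push(median(resample(pr)) - median(resample(base)));
  }
  diffs.sort((x, y) => x - y);
  const tail = (1 - level) / 2;
  return {
    lo: diffs[Math.floor(tail * (resamples - 1))],
    hi: diffs[Math.ceil((1 - tail) * (resamples - 1))],
  };
}

const result = bootstrapMedianDiffCI(BASELINE, PR);
console.log(`95% bootstrap CI on median delta: [${result.lo}, ${result.hi}] ms`);
```

The decision rule stays the same as for the t-test: treat the change as a regression only when the lower bound is above zero and the observed difference exceeds the minimum effect.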

How many runs does significance testing need?

More than a simple median gate. With a coefficient of variation near 4%, eight runs per side detect a ~5% regression only about two times in three at 95% confidence, and roughly eleven per side are needed for 80% power. Fewer runs widen the confidence interval and let real regressions slip through; if that many runs is too slow, reduce noise first via variance reduction in staging so fewer runs suffice.
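
A rough run count can be estimated from the noise level and the smallest regression worth catching. The sketch below uses the standard normal-approximation formula for a two-sample comparison; the `runsNeeded` name and the default of 80% power are assumptions made for illustration, and because it ignores the heavier tails of the t-distribution it slightly understates the count for very small samples.

```javascript
// runs-needed.js: rough per-side sample size for a two-sample comparison.
// Normal approximation: n = 2 * (zAlpha + zBeta)^2 * (cv / effect)^2
//   cv      coefficient of variation of the metric (e.g. 0.04 for 4%)
//   effect  relative regression to detect (e.g. 0.05 for 5%)
//   zAlpha  1.96 for a two-sided 95% test
//   zBeta   0.8416 for 80% power
function runsNeeded(cv, effect, zAlpha = 1.96, zBeta = 0.8416) {
  const n = 2 * (zAlpha + zBeta) ** 2 * (cv / effect) ** 2;
  return Math.ceil(n);
}

console.log(runsNeeded(0.04, 0.05)); // 11
console.log(runsNeeded(0.02, 0.05)); // 3
```

Halving the coefficient of variation cuts the required runs by roughly a factor of four, which is why reducing noise usually pays off faster than adding runs.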