Statistical Noise & Flakiness Reduction

A performance gate that fails one build in five for no code reason is worse than no gate at all — engineers learn to re-run until green, and the signal is gone. This is the variance-control layer of the Threshold Calibration & Baseline Management reference: it turns a jittery lab measurement into a stable number you can assert against, by collecting multiple runs, reducing them to a median, controlling the runner environment, and sizing tolerances to the residual noise you cannot remove.

The core problem is that a single Lighthouse or WebPageTest run is a sample, not a measurement. V8 garbage-collection timing, CPU contention from noisy neighbours, cold CDN caches, and background network activity all push individual metrics around by 10–30% even when the code under test is byte-identical. The job here is to shrink that spread until it is smaller than the regressions you care about, then set the gate just above the residual.

How Variance Becomes a Stable Signal

Every noise source feeds the raw spread of a metric. Mitigations attack those sources, and what survives is collapsed by a median-of-N into a single gateable value. The diagram traces that flow from cause to a narrowed distribution.

[Figure: noise sources (CPU contention, cold CDN cache, network jitter, GC / V8 timing) produce a wide raw spread; mitigations (pinned environment, warm cache, simulated throttling, N runs with a median) narrow it; a median-of-N collapses the remainder to a single stable value (CV < 3%) that the gate asserts against. Inset: raw single-run range vs median-of-5 range.]
Each noise source widens the raw spread; mitigations shrink it and a median-of-N collapses what remains into a single value with a coefficient of variation low enough to gate on.
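The narrowing effect can be illustrated with a small simulation. This is a sketch under an invented noise model (a 2300 ms LCP with ±10% uniform jitter and a 10% chance of an 800 ms stall), not a measurement of any real runner; `mulberry32`, `simulateRun`, and the numbers are all assumptions chosen for illustration.

```javascript
// Small seeded PRNG (mulberry32) so the sketch gives the same numbers each time.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function stdDev(values) {
  const m = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.sqrt(values.reduce((a, b) => a + (b - m) ** 2, 0) / (values.length - 1));
}

const rand = mulberry32(42);

// Hypothetical noise model: uniform jitter plus an occasional stall
// standing in for a cold cache or a GC pause.
function simulateRun() {
  const jitter = (rand() - 0.5) * 0.2 * 2300;
  const stall = rand() < 0.1 ? 800 : 0;
  return 2300 + jitter + stall;
}

const singles = Array.from({ length: 2000 }, simulateRun);
const mediansOf5 = Array.from({ length: 2000 }, () =>
  median(Array.from({ length: 5 }, simulateRun))
);

console.log("single-run SD :", stdDev(singles).toFixed(0), "ms");
console.log("median-of-5 SD:", stdDev(mediansOf5).toFixed(0), "ms");
```

Under this model the median-of-5 spread comes out well below the single-run spread, because an isolated stall rarely reaches the middle of five sorted values.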

Prerequisites & Environment

Variance reduction starts with a runner you control. The default GitHub-hosted runner is acceptable for simulate throttling, but it is a shared VM with as few as 2 vCPUs, so never use devtools or provided throttling on it: both depend on the runner's real CPU and bandwidth. The settings below are the deterministic baseline; tighten the CPU multiplier against your real device target using Device & Network Emulation Weighting.

  • @lhci/cli ≥ 0.13 with the version pinned in package-lock.json so scoring weights do not shift mid-quarter.
  • Pinned Chrome major version — a Chromium bump can move LCP by 100–200 ms on its own; bake the binary into a container image when reproducibility matters.
  • A warmable target URL — a staging deploy whose cache and database you can prime before collection, so the first run is not penalised for a cold start.
  • A baseline sample — at least 20 historical runs of the unchanged target so you can measure the coefficient of variation (CV) before choosing a gate tolerance.

Configuration Reference

The block below is the authoritative noise-control configuration. numberOfRuns raises the sample size, throttlingMethod: simulate models the network in software so timings do not depend on the runner's real bandwidth, and the aggregation method tells Lighthouse CI which run to keep.

{
  "ci": {
    "collect": {
      "url": ["https://staging.example.com/"],
      "numberOfRuns": 5,
      "settings": {
        "preset": "desktop",
        "throttlingMethod": "simulate",
        "throttling": { "cpuSlowdownMultiplier": 4, "rttMs": 150 },
        "disableStorageReset": false,
        "chromeFlags": "--no-sandbox --disable-dev-shm-usage --disable-background-networking --disable-extensions"
      }
    },
    "assert": {
      "assertions": {
        "largest-contentful-paint": ["error", { "maxNumericValue": 2600, "aggregationMethod": "median-run" }],
        "cumulative-layout-shift": ["error", { "maxNumericValue": 0.1, "aggregationMethod": "median-run" }],
        "total-blocking-time": ["error", { "maxNumericValue": 220, "aggregationMethod": "median-run" }]
      }
    }
  }
}

numberOfRuns: 5 is the comfortable default for noisy shared runners; 3 is the floor. aggregationMethod: median-run asserts against a single internally consistent report rather than combining an LCP taken from one run with a TBT taken from another. The --disable-background-networking flag removes a frequent source of late-run jitter. Note that the assertion ceilings are deliberately set a little above the calibrated P75 target to absorb residual noise; sizing that gap is the calibration step below.

Step-by-Step Implementation

  1. Collect a baseline sample of the unchanged target so you can measure its noise floor.

    npx lhci collect --url=https://staging.example.com/ --numberOfRuns=10

    Expected tail: ten progress lines from Run #1...done. to Run #10...done., then a single Done running Lighthouse!, with ten lhr-*.json reports written to .lighthouseci/.

  2. Compute the coefficient of variation for the metric you intend to gate. CV is the standard deviation divided by the mean, expressed as a percentage.

    # Reuses the reports from step 1; reads only the lhr-*.json files and uses the sample standard deviation (n - 1).
    node -e "const fs=require('fs');const v=fs.readdirSync('.lighthouseci').filter(f=>f.startsWith('lhr-')&&f.endsWith('.json')).map(f=>JSON.parse(fs.readFileSync('.lighthouseci/'+f,'utf8')).audits['largest-contentful-paint'].numericValue);const m=v.reduce((a,b)=>a+b,0)/v.length;const sd=Math.sqrt(v.reduce((a,b)=>a+(b-m)**2,0)/(v.length-1));console.log('mean',m.toFixed(0),'CV%',(100*sd/m).toFixed(1))"

    Expected output resembles mean 2310 CV% 4.2. A CV above 8% means the environment is too noisy to gate tightly — fix the runner before lowering thresholds.

  3. Apply mitigations and re-measure. Switch to simulate throttling, warm the cache, pin Chrome, and re-run step 2. A well-controlled run should land under 3% CV for LCP and under 5% for TBT.

  4. Set the assertion ceiling at the median-of-5 target plus a margin equal to two standard deviations, then commit lighthouserc.json.
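Steps 2 to 4 can be sketched as one small Node helper. This is a minimal illustration, not part of Lighthouse CI: `summarise` is a hypothetical name, the sample values are invented, and the standard deviation is the sample form (dividing by n - 1).

```javascript
// Summarise a baseline sample and derive a gate ceiling of median + k * sigma.
function summarise(values, k = 2) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  // Sample standard deviation, since the runs are a sample of the noise.
  const sd = Math.sqrt(values.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1));
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(n / 2);
  const median = n % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return {
    mean,
    median,
    sd,
    cvPercent: (100 * sd) / mean,
    ceiling: Math.round(median + k * sd), // candidate maxNumericValue
  };
}

// Hypothetical LCP values (ms) from ten runs of an unchanged target.
const sample = [2300, 2350, 2250, 2400, 2200, 2320, 2280, 2340, 2260, 2300];
const stats = summarise(sample);

console.log("CV%", stats.cvPercent.toFixed(1), "ceiling", stats.ceiling);
```

For this invented sample the CV is about 2.5% and the two-sigma ceiling rounds to 2414 ms; with real data, feed in the numericValue list extracted in step 2.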

Threshold Calibration

How many runs you need depends on the CV you measured and how tight a gate you want. The relationship is that the standard error of the median shrinks roughly with the square root of the run count, so doubling precision costs four times the runs. The matrix gives practical starting points by measured noise level.

Measured CV (raw)   Runs for a stable median   Recommended gate margin   Notes
< 3%                3                          median + 1.5σ             Quiet dedicated runner
3–6%                5                          median + 2σ               Typical hosted runner with simulate
6–10%               7–9                        median + 2.5σ             Shared runner; fix environment first
> 10%               n/a                        do not gate               Environmental fault; diagnose before asserting

Keep a metric at warn while its CV is still above target and promote it to error only after the noise floor holds for two consecutive weeks of baselines, mirroring the promotion discipline in Percentile-Based Threshold Tuning. For the field-driven side of choosing the underlying target, see the staging-specific walkthrough in Reducing Lighthouse CI Variance in Staging.
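The matrix and the promotion rule can be expressed as two small lookups. This is a sketch only: `gatePlan` and `assertionLevel` are hypothetical helpers, the boundary value of exactly 6% is assigned to the five-run row, and the 6–10% band uses the conservative end of "7–9" runs; all of these are assumptions rather than rules stated above.

```javascript
// Map a measured CV (percent) to a starting run count and sigma multiplier.
function gatePlan(cvPercent) {
  if (cvPercent < 3) return { gate: true, runs: 3, sigmaMultiplier: 1.5 };
  if (cvPercent <= 6) return { gate: true, runs: 5, sigmaMultiplier: 2 };
  if (cvPercent <= 10) return { gate: true, runs: 9, sigmaMultiplier: 2.5 };
  // Above 10% the environment is the problem; do not assert yet.
  return { gate: false, runs: null, sigmaMultiplier: null };
}

// Promote to "error" only when the most recent weekly baselines all met the
// CV target; otherwise stay at "warn".
function assertionLevel(weeklyCvPercent, targetCv = 3, weeksRequired = 2) {
  if (weeklyCvPercent.length < weeksRequired) return "warn";
  const recent = weeklyCvPercent.slice(-weeksRequired);
  return recent.every((cv) => cv <= targetCv) ? "error" : "warn";
}

console.log(gatePlan(4.2));                   // five runs, median + 2 sigma
console.log(assertionLevel([5.1, 2.8, 2.6])); // "error"
console.log(assertionLevel([2.5, 3.4]));      // "warn"
```

Keeping the thresholds in one function makes it easier to review a change to the gate policy separately from a change to the measured noise.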

CI Enforcement Snippet

This GitHub Actions job warms the target, runs five collections, and asserts against the median — the warm-up step removes the cold-cache outlier that otherwise dominates the first run.

name: Performance Gating
on:
  pull_request:
    branches: [main]

jobs:
  lighthouse-ci:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    concurrency:
      group: lhci-${{ github.ref }}
      cancel-in-progress: true
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - run: npm ci
      - run: npm run build
      - name: Warm the target cache
        run: curl -s -o /dev/null https://staging.example.com/ || true
      - name: Run Lighthouse CI (5 runs, median)
        run: npx lhci autorun
        env:
          LHCI_TOKEN: ${{ secrets.LHCI_TOKEN }}
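      - name: Upload reports
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: lighthouse-reports
          path: .lighthouseci/
          # .lighthouseci is a dot-directory; recent upload-artifact v4 releases skip hidden files unless this is set
          include-hidden-files: true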

When a residual breach is still ambiguous after five runs, decide whether it is real before failing the build using Statistical Significance Testing for Noisy CI, and route confirmed shifts into your trend store via Automated Regression Detection.

Troubleshooting & Edge Cases

  • One run is always 2× slower than the rest → it is the cold-cache first run; warm the target with a curl before lhci collect or discard the first sample explicitly.
  • CV good locally, terrible in CI → the runner is the noise source; switch from provided/devtools throttling to simulate, which does not depend on real bandwidth.
  • LCP stable but TBT swings wildly → CPU contention from a noisy neighbour; isolate to a dedicated runner or container with a guaranteed CPU quota.
  • Variance crept up after a dependency bump → an unpinned Chromium or Lighthouse version changed the measurements or scoring; pin Lighthouse in the lockfile and Chrome in the runner image, then rebuild the baseline.
  • Median still drifts week to week with no code change → real baseline movement, not noise; recalibrate against your historical store rather than widening tolerances.
  • Background networking spikes late runs → add --disable-background-networking and --disable-sync to chromeFlags.
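For the cold-cache case in the first bullet, discarding the first sample explicitly might look like the sketch below. `dropColdStart` is a hypothetical helper and the 1.5× factor is an arbitrary assumption; tune it against your own baseline before relying on it.

```javascript
// Drop the first run only when it is clearly slower than the median of the rest.
function dropColdStart(values, factor = 1.5) {
  if (values.length < 3) return values; // too few runs to judge an outlier
  const rest = values.slice(1);
  const sorted = [...rest].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const restMedian =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return values[0] > factor * restMedian ? rest : values;
}

// Hypothetical LCP samples (ms): the first run hit a cold cache.
console.log(dropColdStart([4600, 2300, 2320, 2290, 2310]));
// A first run in line with the others is kept.
console.log(dropColdStart([2350, 2300, 2320]));
```

Warming the target remains the better fix, since it keeps the sample size intact; the filter is a fallback for targets you cannot prime.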

Frequently Asked Questions

How many runs do I need to gate reliably?

It depends on the measured coefficient of variation. At under 3% CV, three runs give a stable median; at 3–6% CV use five; above 6% raise to seven or nine and fix the environment first. The standard error of the median falls roughly with the square root of the run count, so each doubling of precision costs four times the runs.
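As a rough worked example of the square-root rule, assuming approximately normal run-to-run noise (an idealisation; real timings are skewed) and a hypothetical per-run standard deviation of 100 ms, the large-sample approximation for the standard error of the median gives:

```latex
\[
\mathrm{SE}(\tilde{x}) \approx \sqrt{\tfrac{\pi}{2}}\,\frac{\sigma}{\sqrt{N}}
\approx 1.2533\,\frac{\sigma}{\sqrt{N}}
\]
\[
\sigma = 100\ \text{ms}:\qquad
N = 5 \;\Rightarrow\; \mathrm{SE} \approx 56\ \text{ms},
\qquad
N = 20 \;\Rightarrow\; \mathrm{SE} \approx 28\ \text{ms}
\]
```

Going from 5 to 20 runs quadruples the collection time and only halves the uncertainty, which is why fixing the environment usually pays off sooner than adding runs.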

Should I use the median or the mean of my runs?

Use the median. Performance distributions are right-skewed — a single slow run from a garbage-collection pause or a cold cache drags the mean up but barely moves the median. Set aggregationMethod: median-run so Lighthouse CI keeps one internally consistent report rather than mixing best metrics across runs.
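A small numeric illustration of the difference, using invented LCP values in which one run is slowed by a stall:

```javascript
// Hypothetical LCP samples in ms; the 4100 models one cold-cache or GC-stalled run.
const runs = [2290, 2310, 2330, 2350, 4100];

function mean(values) {
  return values.reduce((a, b) => a + b, 0) / values.length;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log("mean  ", mean(runs));   // 2676
console.log("median", median(runs)); // 2330
```

The single slow run moves the mean by more than 300 ms past the typical run, while the median stays on it.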

My CV is above 10% — can I still set a gate?

Not reliably. A CV above 10% means environmental noise is larger than most regressions you care about, so any tight threshold will flake. Treat it as a fault to diagnose: switch to simulate throttling, isolate the runner, warm the cache, and pin Chrome. See Device & Network Emulation Weighting for the throttling side.