Reducing Lighthouse CI Variance in Staging

Staging is where most flaky gates are born: shared host CPU, cold CDN caches, and live-changing seed data push LCP and TBT around far more than the production-like numbers you actually want to assert against. This guide is a practical walkthrough within the Statistical Noise & Flakiness Reduction reference — it isolates the specific variance sources that staging adds, measures them, removes what it can, and shows how to gate around the residual instead of fighting it.

The goal is concrete: get the coefficient of variation (CV) of your gated metrics below roughly 3% for LCP and 5% for TBT, measured on a mid-tier mobile / 4G profile at the median of five runs, so that a 10% regression is unmistakably larger than the noise.
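To make "larger than the noise" concrete, here is a minimal sketch that expresses a regression in units of run-to-run standard deviation. The function name is hypothetical, and it assumes the regression and the CV are both given as percentages of the same mean.

```javascript
// How many standard deviations of run-to-run noise a regression spans,
// given the regression size and the CV, both as percentages of the mean.
// (Hypothetical helper for illustration; not part of Lighthouse CI.)
function regressionInSigmas(regressionPercent, cvPercent) {
  if (cvPercent <= 0) throw new RangeError("cvPercent must be positive");
  return regressionPercent / cvPercent;
}

console.log(regressionInSigmas(10, 3).toFixed(1)); // LCP target CV: 3.3
console.log(regressionInSigmas(10, 5).toFixed(1)); // TBT target CV: 2.0
console.log(regressionInSigmas(10, 9.7).toFixed(1)); // a noisy 9.7% CV: 1.0
```

At the target CVs a 10% regression sits two to three standard deviations out; at a CV near 10% it is indistinguishable from a single noisy run.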

Variance Sources in Staging

| Source | Typical impact | Why staging makes it worse | Mitigation |
| --- | --- | --- | --- |
| Host CPU contention | TBT ±20%, LCP ±8% | Shared runners, noisy neighbours | Dedicated runner, simulate throttling |
| Cold CDN / app cache | First-run LCP +50–100% | Caches purged between deploys | Warm with curl before collect |
| Live seed data | LCP ±15% | Row counts change row-render cost | Pin a fixed dataset snapshot |
| Real-network throttling | TTFB ±300 ms | Variable runner bandwidth | throttlingMethod: simulate |
| Single-run sampling | Whole-metric jitter | One sample is not a measurement | numberOfRuns: 5, median |

Diagnostic Steps

First, measure the noise floor on the unchanged staging target so you know what you are dealing with before touching thresholds.

npx lhci collect --url=https://staging.example.com/ --numberOfRuns=10
node -e "const fs=require('fs');const v=fs.readdirSync('.lighthouseci').filter(f=>f.startsWith('lhr-')&&f.endsWith('.json')).map(f=>JSON.parse(fs.readFileSync('.lighthouseci/'+f,'utf8')).audits['largest-contentful-paint'].numericValue);const m=v.reduce((a,b)=>a+b,0)/v.length;const sd=Math.sqrt(v.reduce((a,b)=>a+(b-m)**2,0)/v.length);console.log('LCP mean',m.toFixed(0),'ms  CV%',(100*sd/m).toFixed(1))"

Expected output before mitigation looks like LCP mean 2480 ms CV% 9.7 — too noisy to gate. Next, confirm whether the first run is a cold-cache outlier by inspecting the per-run spread.

node -e "const fs=require('fs');const v=fs.readdirSync('.lighthouseci').filter(f=>f.startsWith('lhr-')&&f.endsWith('.json')).sort().map(f=>JSON.parse(fs.readFileSync('.lighthouseci/'+f,'utf8')).audits['largest-contentful-paint'].numericValue);console.log(v.map(x=>x.toFixed(0)).join('  '))"

If the output is something like 3910 2410 2380 2350 2400, the first run is cold — a warm-up step will recover most of your CV.
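The eyeball check above can also be automated. The following is a rough sketch, not a Lighthouse CI feature: the function names are hypothetical, and the 1.25 ratio is an assumed cut-off you would tune to your own spread.

```javascript
// Median of a numeric array (does not mutate the input).
function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Flags the first run as cold when it exceeds the median of the remaining
// runs by more than `ratio`. The 1.25 default is an assumption, not a standard.
function firstRunIsCold(values, ratio = 1.25) {
  if (values.length < 3) return false; // too few runs to judge
  const [first, ...rest] = values;
  return first > ratio * median(rest);
}

// Per-run LCP values from the example output above, in collection order.
console.log(firstRunIsCold([3910, 2410, 2380, 2350, 2400])); // true
```

Comparing against the median of the remaining runs, rather than the mean of all runs, keeps the cold outlier from inflating its own baseline.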

Implementation

Apply the four mitigations together: warm the cache, pin the dataset, simulate throttling, and collect five runs. The lighthouserc.js below resolves the staging URL from CI context and bakes in the deterministic settings.

// lighthouserc.js
module.exports = {
  ci: {
    collect: {
      url: [process.env.STAGING_URL || "https://staging.example.com/"],
      numberOfRuns: 5,
      settings: {
        preset: "perf",
        formFactor: "mobile",
        throttlingMethod: "simulate",
        throttling: {
          cpuSlowdownMultiplier: 4,
          rttMs: 150,
          throughputKbps: 1638,
        },
        chromeFlags:
          "--no-sandbox --disable-dev-shm-usage --disable-background-networking --disable-extensions --disable-sync",
      },
    },
    assert: {
      aggregationMethod: "median-run",
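      assertions: {
        // Keys are Lighthouse audit IDs, matching the audits read in the diagnostic step.
        "largest-contentful-paint": ["error", { maxNumericValue: 2900 }],
        "cumulative-layout-shift": ["error", { maxNumericValue: 0.1 }],
        // TBT stays at warn until its CV has held under 5%.
        "total-blocking-time": ["warn", { maxNumericValue: 300 }],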
      },
    },
  },
};

Wire the warm-up and data pin into the job. Seeding a fixed snapshot before collection removes the data-driven swing that no Chrome flag can fix.

- name: Pin staging dataset
  run: ./scripts/seed-staging.sh --snapshot fixtures/perf-baseline.sql
- name: Warm cache (discard cold run)
  run: |
    curl -s -o /dev/null "${STAGING_URL:-https://staging.example.com/}"
    sleep 2
- name: Collect and assert (5 runs, median)
  run: npx lhci autorun

The CPU multiplier of 4 here is a starting point; calibrate it to your real device target with Calibrating CPU Throttling for CI Runners so the emulated machine matches the hardware your users actually carry.

CI Gating Assertion

The assertion ceilings are set above the calibrated median to absorb the CV that survives mitigation. With a 2480 ms median LCP and a post-mitigation CV near 3%, one standard deviation is about 74 ms and two are about 150 ms, so a 2900 ms ceiling (420 ms above the median) leaves comfortable headroom while still catching any regression larger than ~17%.
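The headroom arithmetic can be checked with a short sketch. The helper name and the shape of the returned object are assumptions made for illustration; the inputs are the median, CV, and ceiling quoted above.

```javascript
// Headroom of an assertion ceiling over the calibrated median.
// (Hypothetical helper; the CV is a percentage of the median.)
function ceilingHeadroom(medianMs, cvPercent, ceilingMs) {
  const sd = (medianMs * cvPercent) / 100; // one standard deviation in ms
  const gap = ceilingMs - medianMs; // distance from median to ceiling in ms
  return {
    twoSigmaMs: 2 * sd,
    gapMs: gap,
    gapPercent: (100 * gap) / medianMs, // smallest regression the gate catches
    gapInSigmas: gap / sd,
  };
}

const h = ceilingHeadroom(2480, 3, 2900);
console.log(
  h.twoSigmaMs.toFixed(0), // 149
  h.gapMs, // 420
  h.gapPercent.toFixed(1), // 16.9
  h.gapInSigmas.toFixed(1) // 5.6
);
```

If you want the gate to catch smaller regressions, lower the ceiling toward the two-sigma line and re-check that the gap still spans more noise than you are willing to fail on.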

{
  "ci": {
    "assert": {
      "aggregationMethod": "median-run",
      "assertions": {
        "largest-contentful-paint": ["error", { "maxNumericValue": 2900 }],
        "total-blocking-time": ["warn", { "maxNumericValue": 300 }],
        "cumulative-layout-shift": ["error", { "maxNumericValue": 0.1 }]
      }
    }
  }
}

Keep TBT at warn until its CV holds under 5% for two weeks, then promote to error.
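One way to make that promotion rule mechanical is sketched below. It assumes you record one TBT CV reading per day (oldest first) and reads "holds" as every reading in the trailing window being under the limit; both the function name and that interpretation are assumptions.

```javascript
// Returns true when every daily TBT CV reading in the trailing window
// is under the limit. Expects readings ordered oldest to newest.
// (Hypothetical helper; defaults mirror the two-week, 5% rule above.)
function readyToPromote(dailyCvPercents, { days = 14, limit = 5 } = {}) {
  if (dailyCvPercents.length < days) return false; // not enough history yet
  return dailyCvPercents.slice(-days).every((cv) => cv < limit);
}

// Illustrative readings only, not measured data.
console.log(readyToPromote(Array(14).fill(4.2))); // true
console.log(readyToPromote([...Array(13).fill(4.2), 5.3])); // false
```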

Verification

Re-run the collection below, then the CV one-liner from Diagnostic Steps, after applying the mitigations and confirm the CV dropped into the gateable band.

npx lhci collect --url=https://staging.example.com/ --numberOfRuns=10

A passing result shows LCP mean 2460 ms CV% 2.8 — under 3%, with no single run more than ~6% from the median. The assertion summary should then read All results processed! with no metric within its noise margin of the ceiling. If the first run is still an outlier, your warm-up did not take; check that the curl target matches the audited URL exactly, including trailing slash.
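The "no single run more than ~6% from the median" check is easy to script. This is a minimal sketch with hypothetical function names and made-up sample values, so treat the numbers as illustration rather than expected results.

```javascript
// Median of a numeric array (does not mutate the input).
function medianOf(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Largest absolute deviation of any single run from the median,
// expressed as a percentage of the median.
function maxDeviationPercent(values) {
  if (values.length === 0) throw new RangeError("need at least one run");
  const med = medianOf(values);
  return Math.max(...values.map((v) => (100 * Math.abs(v - med)) / med));
}

// Hypothetical post-mitigation LCP runs in ms (illustrative, not measured).
const runs = [2440, 2390, 2530, 2460, 2480];
console.log(maxDeviationPercent(runs).toFixed(1)); // 2.8
```

A value above roughly 6 suggests an outlier run is still present, most often the cold first run.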

Frequently Asked Questions

Why is staging noisier than production for Lighthouse?

Three reasons stack up: staging usually runs on smaller, shared infrastructure so CPU contention is higher; its caches are purged on every deploy so the first run is cold; and its seed data changes, which alters render cost. None of these reflect a real code regression, so they have to be controlled before you gate. Pin the dataset, warm the cache, and use simulate throttling.

Should I throw away the cold first run or warm the cache?

Warm the cache — it is more honest. Discarding the first run hides a real cold-start cost and can mask a regression in cache configuration. A curl to the exact audited URL before lhci collect primes the CDN and app cache so all five measured runs start warm, which is what your repeat visitors experience.

What coefficient of variation is low enough to gate?

Aim for under 3% CV on LCP and under 5% on TBT at the median of five runs. At that level a 10% regression sits well outside the noise band, so an error assertion fails on real changes rather than jitter. Above 8% CV, fix the environment before tightening any threshold.