Historical Baseline Calibration

A budget threshold copied from a blog post is a guess; a threshold derived from your own last ninety days of runs is a measurement. Historical baseline calibration replaces arbitrary static limits with thresholds computed from the longitudinal distribution of your own metrics, so the gate reflects how the site actually behaves rather than how someone hoped it would. This is the baseline layer of the Threshold Calibration & Baseline Management reference: it turns a stream of past Lighthouse and RUM samples into a rolling baseline, wraps that baseline in a tolerance band, and fails the build only when a new run breaks out of the band.

The work splits into three coupled concerns — ingesting clean historical samples (windowing and outlier removal), deriving a baseline and tolerance from that window (percentiles plus deltas), and enforcing a new run against the baseline in CI. Get the first wrong and the baseline inherits the noise; get the second wrong and the gate either nags constantly or never fires. This page is the authoritative spec for all three.

Core Concept: The Rolling Baseline

A rolling baseline is not a single number frozen at a point in time. It is a value recomputed continuously over a sliding history window, surrounded by a tolerance band that absorbs normal run-to-run variance. A new run is compared against that band, not against a hand-picked constant. The diagram below shows how a history window collapses into a baseline, how the tolerance band widens it, and how each incoming run lands inside or outside.

[Figure: Rolling baseline, from history window to baseline to tolerance band to new run. Labels: history window (90 runs); outliers trimmed (IQR); baseline = P75; tolerance band (baseline ± δ); pass; gate.]
The history window is trimmed of outliers and collapsed to a P75 baseline; runs inside the tolerance band pass, and a run that breaks above the band is gated as a regression.

Prerequisites & Environment

Calibration needs a durable store of past runs and a deterministic collection pipeline feeding it. Without determinism the band has to be so wide it never catches anything, so pin your collection settings first against the Lighthouse CI Configuration & Storage reference before you trust any baseline derived from the data.

  • A time-series or artifact store — TimescaleDB, InfluxDB, or even a versioned JSON artifact in object storage. It must retain at least one full window of runs keyed by Git SHA and branch, partitioned by day for fast window queries.
  • Deterministic collection: set throttlingMethod: simulate and a fixed numberOfRuns ≥ 3 so each stored sample is a stable median, not a single noisy draw.
  • Per-environment separation — never blend mobile and desktop, or staging and production, into one baseline. Each (device class, connection profile, route) tuple gets its own baseline series.
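One way to enforce that separation is to derive a series key from the tuple at ingest and reject any sample that is missing part of it. The sketch below is illustrative only: the helper names seriesKey and groupBySeries, the "|" separator, and the sample field names are assumptions, not part of any store's API.

```javascript
// Hypothetical helpers: names and sample shape are assumptions for illustration.
const KEY_FIELDS = ["device", "connection", "route"];

// Build a stable series key; throw if any part of the tuple is missing
// so an unlabeled sample can never bleed into another series.
function seriesKey(sample) {
  for (const field of KEY_FIELDS) {
    if (typeof sample[field] !== "string" || sample[field] === "") {
      throw new Error(`sample is missing series field "${field}"`);
    }
  }
  return KEY_FIELDS.map((field) => sample[field]).join("|");
}

// Partition samples into one array per series key.
function groupBySeries(samples) {
  const series = new Map();
  for (const sample of samples) {
    const key = seriesKey(sample);
    if (!series.has(key)) series.set(key, []);
    series.get(key).push(sample);
  }
  return series;
}

const grouped = groupBySeries([
  { device: "mobile", connection: "4g", route: "/checkout", lcp: 2200 },
  { device: "desktop", connection: "cable", route: "/checkout", lcp: 1300 },
  { device: "mobile", connection: "4g", route: "/checkout", lcp: 2150 },
]);
console.log([...grouped.keys()]);
```

Throwing on an incomplete key is a deliberate choice here: a dropped sample costs one data point, while a mislabeled one contaminates a whole series.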

Map the moving parts through environment variables so nothing is hardcoded: BASELINE_STORE_URL for the store endpoint, BASELINE_BRANCH for the series key (usually main), and BASELINE_WINDOW for the rolling window size in runs or days.
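A small loader can validate these variables once at startup. This is a minimal sketch under stated assumptions: the function name loadBaselineConfig, the defaults, and the choice to read BASELINE_WINDOW as a plain count of runs are all illustrative, and the example URL is a placeholder.

```javascript
// Hypothetical loader: the variable names come from the text above; the
// defaults and the runs-only reading of BASELINE_WINDOW are assumptions.
function loadBaselineConfig(env = process.env) {
  const storeUrl = env.BASELINE_STORE_URL;
  if (!storeUrl) {
    throw new Error("BASELINE_STORE_URL is required");
  }
  const branch = env.BASELINE_BRANCH || "main";
  const rawWindow = env.BASELINE_WINDOW ?? "90";
  // Accept only a positive integer so a typo fails loudly instead of
  // silently producing an unexpected window.
  if (!/^[1-9]\d*$/.test(rawWindow)) {
    throw new Error(`BASELINE_WINDOW must be a positive integer, got "${rawWindow}"`);
  }
  return { storeUrl, branch, windowSize: Number(rawWindow) };
}

const exampleConfig = loadBaselineConfig({
  BASELINE_STORE_URL: "https://baseline.example.internal", // placeholder URL
  BASELINE_WINDOW: "60",
});
console.log(exampleConfig);
```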

Configuration Reference

The baseline store schema below is the authoritative spec. Each metric series carries the derived baseline, the tolerance that defines its band, the window it was computed over, and the provenance needed to audit it. Storing provenance — the Git SHA and sample count — is what lets you later trust or reject a baseline.

{
  "schemaVersion": 2,
  "key": { "branch": "main", "device": "mobile", "route": "/checkout" },
  "window": { "type": "rolling", "size": 90, "unit": "runs", "minSamples": 30 },
  "outlierFilter": { "method": "iqr", "k": 1.5 },
  "metrics": {
    "lcp":  { "baseline": "p75", "tolerance": { "type": "abs", "value": 200 }, "unit": "ms" },
    "inp":  { "baseline": "p75", "tolerance": { "type": "abs", "value": 50 },  "unit": "ms" },
    "cls":  { "baseline": "p75", "tolerance": { "type": "rel", "value": 0.10 } },
    "tbt":  { "baseline": "p75", "tolerance": { "type": "abs", "value": 75 },  "unit": "ms" },
    "script_bytes": { "baseline": "p90", "tolerance": { "type": "rel", "value": 0.05 } }
  },
  "provenance": { "computedAt": "2026-06-20T02:00:00Z", "sampleCount": 87, "gitSha": "4f1c9ad" }
}

window.minSamples is the safety valve: if the trimmed window holds fewer than 30 clean samples the baseline is marked stale and the gate falls back to warn rather than blocking on thin data. The interquartile-range filter (k: 1.5) removes any sample outside Q1 − 1.5·IQR to Q3 + 1.5·IQR so a transient network spike cannot drag the baseline. tolerance is either absolute (abs, in metric units) or relative (rel, a fraction of the baseline) — use absolute for metrics with a meaningful floor like LCP and relative for metrics that scale, like script bytes.
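The filter and the minSamples fallback can be sketched as follows. This is an illustration, not the contents of any script named on this page: the function names are hypothetical, and the quartiles use linear interpolation, which is one of several common conventions and may differ slightly from your statistics library.

```javascript
// Linear-interpolation quantile over an ascending-sorted array (q in [0, 1]).
// The interpolation convention is an assumption; other tools may differ.
function quantileSorted(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Keep only samples inside [Q1 - k*IQR, Q3 + k*IQR].
function iqrTrim(samples, k = 1.5) {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantileSorted(sorted, 0.25);
  const q3 = quantileSorted(sorted, 0.75);
  const iqr = q3 - q1;
  const lower = q1 - k * iqr;
  const upper = q3 + k * iqr;
  return samples.filter((v) => v >= lower && v <= upper);
}

// Below minSamples the baseline is treated as stale and the gate only warns.
function gateLevel(cleanCount, minSamples = 30) {
  return cleanCount >= minSamples ? "error" : "warn";
}

const lcpSamples = [2100, 2150, 2180, 2200, 2120, 2160, 2190, 2140, 5400];
const cleanLcp = iqrTrim(lcpSamples);
console.log(cleanLcp.length, gateLevel(cleanLcp.length)); // 8 warn
```

In this toy window the 5400 ms spike is dropped, and because only 8 clean samples remain the gate level is warn rather than error.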

Step-by-Step Implementation

  1. Ingest and trim the window. Pull the last N runs for each series and remove outliers with an interquartile-range filter so a single bad runner cannot shift the baseline. This preprocessing is the same discipline covered in Statistical Noise & Flakiness Reduction.

    node scripts/baseline-ingest.js --branch main --window 90 --filter iqr

    Expected output: window=90 raw=90 trimmed=87 (3 outliers removed), confirming that the filter ran and enough clean samples remain.

  2. Derive the baseline and band. Compute the configured percentile per metric and attach the tolerance to form the band. Derive the percentile with the methodology in Percentile-Based Threshold Tuning so the baseline reflects the experience you actually want to hold.

    node scripts/baseline-derive.js --in trimmed.json --out baseline_store.json

    Expected tail: lcp p75=2180ms band=[1980,2380] inp p75=164ms band=[114,214] — one line per gated metric.

  3. Compare a candidate run. Run the PR build, then diff its median against the band. Commit baseline_store.json only from the promotion job, never from a feature branch, so the baseline never absorbs an unmerged regression.
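The derivation in step 2 can be sketched roughly as below. It is not the contents of baseline-derive.js: the function names are hypothetical, the metric config shape follows the schema in the Configuration Reference, and the nearest-rank percentile is an assumed convention.

```javascript
// Nearest-rank percentile: smallest value with at least p% of samples
// at or below it. The convention is an assumption for this sketch.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Turn a schema tolerance ({ type: "abs" | "rel", value }) into a band half-width.
function halfWidth(baseline, tolerance) {
  if (tolerance.type === "abs") return tolerance.value;
  if (tolerance.type === "rel") return baseline * tolerance.value;
  throw new Error(`unknown tolerance type: ${tolerance.type}`);
}

// Derive the baseline and band for one metric from its trimmed window.
function deriveMetric(trimmedValues, metricConfig) {
  const p = Number(metricConfig.baseline.slice(1)); // "p75" -> 75
  const baseline = percentile(trimmedValues, p);
  const delta = halfWidth(baseline, metricConfig.tolerance);
  return { baseline, band: [baseline - delta, baseline + delta] };
}

const lcpWindow = [2050, 2100, 2120, 2150, 2160, 2180, 2210, 2240];
const lcpDerived = deriveMetric(lcpWindow, {
  baseline: "p75",
  tolerance: { type: "abs", value: 200 },
});
console.log(lcpDerived); // { baseline: 2180, band: [ 1980, 2380 ] }
```

The toy window is chosen so the result matches the lcp line in the expected tail above; a real window would hold the trimmed samples from step 1.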

Threshold Calibration

Two knobs govern whether the gate is useful: the window size and the tolerance. Too short a window tracks every wobble and the baseline chases noise; too long and it lags real improvements for weeks. Too tight a tolerance fires on normal variance; too loose and a real 15% regression slips through. The matrix below shows defensible starting points by environment — calibrate, then hold for two weeks before tightening.

Device class     | Connection profile | Window size | Baseline percentile | Tolerance (band half-width)
-----------------|--------------------|-------------|---------------------|----------------------------
Desktop          | Cable / Fiber      | 60 runs     | P75                 | LCP ±150 ms, scripts ±4%
High-end mobile  | 4G / LTE           | 90 runs     | P75                 | LCP ±200 ms, scripts ±5%
Mid-range mobile | Fast 3G            | 90 runs     | P90                 | LCP ±300 ms, scripts ±6%

Size the tolerance from the measured run-to-run standard deviation of each metric, not by feel: a band of roughly two standard deviations around the baseline catches real shifts while tolerating normal jitter. Keep new metrics at warn until the band has held for two consecutive weeks, then promote them to error so the gate earns trust before it blocks merges.
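As a rough illustration of the two-standard-deviation rule, the sketch below computes the sample standard deviation of recent runs and suggests an absolute tolerance. The function names and the rounding step are assumptions made for readability, not a prescribed method.

```javascript
// Sample standard deviation (n - 1 denominator) of run-to-run values.
function sampleStdDev(values) {
  if (values.length < 2) throw new Error("need at least two samples");
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (values.length - 1);
  return Math.sqrt(variance);
}

// Suggest an absolute tolerance of `sigmas` standard deviations, rounded
// up to `step` so the config value stays readable (rounding is an assumption).
function suggestTolerance(values, sigmas = 2, step = 10) {
  const raw = sigmas * sampleStdDev(values);
  return Math.ceil(raw / step) * step;
}

const recentLcp = [2100, 2200, 2150, 2250, 2050, 2300, 2180, 2120];
console.log(suggestTolerance(recentLcp)); // 170
```

Here the standard deviation is about 81 ms, so two standard deviations round up to a 170 ms half-width.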

CI Enforcement Snippet

This GitHub Actions job fetches the current baseline, runs the candidate, and fails the build when any gated metric breaks out of its band. Adjust the script paths to your repository; the job then surfaces a status check that branch protection can require.

name: Baseline Gate
on:
  pull_request:
    branches: [main]

jobs:
  baseline-compare:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - run: npm ci
      - name: Fetch current baseline
        run: node scripts/baseline-fetch.js --branch main --out baseline_store.json
        env:
          BASELINE_STORE_URL: ${{ secrets.BASELINE_STORE_URL }}
      - name: Collect candidate run
        run: npx lhci collect && npx lhci upload --target=filesystem
      - name: Compare against band
        run: node scripts/baseline-compare.js --baseline baseline_store.json --run .lighthouseci/

The baseline-compare.js step exits non-zero when a metric's median falls outside [baseline − tolerance, baseline + tolerance], which fails the job. Mark the job as a required status check in branch protection so that a failure actually blocks the merge. To close the loop and keep the baseline current, promote a fresh baseline after every clean main-branch run with Automating Baseline Promotion Workflows. For statistical change detection that adapts to the distribution rather than a fixed band, pair this with Automated Regression Detection.
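The comparison described above might look roughly like this. It is a sketch, not the actual baseline-compare.js: the function names and input shapes are hypothetical, tolerances are assumed to be already resolved to metric units, and metrics absent from the candidate are skipped rather than failed.

```javascript
// Median of the candidate's repeated runs for one metric.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// bands: { metric: { baseline, tolerance } }, tolerance in metric units.
// candidate: { metric: [value from each run] }.
function compareToBands(bands, candidate) {
  const violations = [];
  for (const [metric, { baseline, tolerance }] of Object.entries(bands)) {
    const runs = candidate[metric];
    if (!runs || runs.length === 0) continue; // metric not collected: skipped here
    const value = median(runs);
    const lower = baseline - tolerance;
    const upper = baseline + tolerance;
    if (value < lower || value > upper) {
      violations.push({ metric, value, lower, upper });
    }
  }
  return violations;
}

const found = compareToBands(
  { lcp: { baseline: 2180, tolerance: 200 }, inp: { baseline: 164, tolerance: 50 } },
  { lcp: [2390, 2450, 2410], inp: [150, 170, 160] }
);
for (const v of found) {
  console.log(`${v.metric}=${v.value} outside [${v.lower}, ${v.upper}]`);
}
// A CI script would finish by setting a non-zero exit code when `found`
// is non-empty, for example: process.exitCode = found.length ? 1 : 0;
```

In this example the LCP median of 2410 ms is above the 2380 ms upper edge and is reported, while INP at 160 ms stays inside its band.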

Troubleshooting & Edge Cases

  • Baseline poisoning → a regression that merged to main gets folded into the next baseline, ratcheting the band up so the regression looks normal. Gate promotion on a clean run (see the promotion workflow) and store the Git SHA so a poisoned baseline can be rolled back.
  • Slow drift → tiny per-run regressions each stay inside the band but accumulate. Add an absolute ceiling derived from your field P75 alongside the rolling band, so the baseline can never drift past a hard limit.
  • Cold start / thin window → fewer than minSamples clean runs after trimming. Fall back to warn and a static ceiling until the window fills, rather than blocking on noise.
  • Bimodal metrics → a route with two code paths produces two clusters; a single percentile sits in the empty middle. Split the series by path or segment before deriving the baseline.
  • Window discontinuity after a deploy → an intentional architecture change shifts the true baseline. Reset the window (clear history before the change) so the band re-forms around the new normal instead of straddling both.
  • Cross-environment bleed → mobile samples leaking into a desktop series widen the band uselessly. Verify the series key (device, route, connection) on ingest.
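The slow-drift guard can be illustrated by capping the band's upper edge with a fixed ceiling. This is a minimal sketch: the function name, the 2500 ms ceiling, and the 40 ms per-promotion creep are made-up values for illustration.

```javascript
// Effective upper limit: the rolling band's upper edge, capped by a fixed ceiling.
function effectiveUpperLimit(baseline, tolerance, ceiling) {
  return Math.min(baseline + tolerance, ceiling);
}

// Simulate a baseline creeping up 40 ms per promotion with a 200 ms band.
const ceilingMs = 2500; // assumed hard limit, e.g. derived from field data
let driftingBaseline = 2180;
const limits = [];
for (let i = 0; i < 5; i++) {
  limits.push(effectiveUpperLimit(driftingBaseline, 200, ceilingMs));
  driftingBaseline += 40;
}
console.log(limits); // [ 2380, 2420, 2460, 2500, 2500 ]
```

Once the rolling band would exceed the ceiling, the limit stops moving, so accumulated drift eventually fails the gate even though each step stayed inside the band.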

Frequently Asked Questions

How long should the rolling window be?

Long enough to hold at least 30 clean samples after outlier trimming, with headroom. In practice that means roughly 60–90 runs, which on a daily pipeline is two to three months of history. Shorter windows chase noise; much longer windows lag real improvements. Start at 90 runs for mobile and 60 for desktop, then shorten only if the baseline reacts too slowly to deliberate wins. See Percentile-Based Threshold Tuning for picking the baseline percentile.

What stops a regression from quietly becoming the new baseline?

Only promote a new baseline from a run that already passed the gate on main, never from an arbitrary build. Storing the Git SHA with each baseline lets you audit and roll back if a bad one slips through. The mechanics live in Automating Baseline Promotion Workflows.

Should the tolerance be absolute or relative?

Use an absolute tolerance (in milliseconds) for time metrics with a meaningful floor like LCP and INP, and a relative tolerance (a percentage of the baseline) for metrics that scale with page size like script bytes. Size either one from the measured run-to-run standard deviation — roughly two standard deviations is a good first band.