Percentile-Based Threshold Tuning
A mean hides the failure that matters: it absorbs a handful of fast loads and a handful of slow outliers into one number that no real user ever experiences, so a budget gated on the average passes while a quarter of your sessions degrade. Percentile-driven gating replaces brittle mean and median baselines with distribution-aware thresholds that track the tail behaviour users actually feel. This is the distribution layer of the Threshold Calibration & Baseline Management reference: it specifies how to compute P75, P90, and P99 from your runs, decide which percentile belongs to which metric, and assert on that percentile in CI so a tail regression is unmergeable.
The work splits into three coupled concerns — which percentile expresses the contract for a given metric, how you compute that percentile stably from a finite sample of runs, and what assertion fails the build when the percentile drifts. Choose the percentile too low and you ship tail pain; compute it from too few runs and the gate flaps; assert on a noisy percentile and the team mutes it. This page is the authoritative spec for all three.
Core Concept: From a Distribution to a Gate
Every metric you collect (LCP, INP, CLS, a custom beacon) is a distribution, not a point. A percentile is the value below which that share of observations falls: P75 is the slowest experience of the fastest three-quarters of users, P90 the slowest of the fastest nine-tenths. The budget line is a single horizontal threshold; the gate fails when the chosen percentile of the run distribution crosses it. The diagram below shows where each marker sits and how the budget line relates to them.
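To make the difference between a mean gate and a percentile gate concrete, the sketch below runs both against one small sample. The LCP values and the 2,500 ms budget are made up for illustration, and the `percentile` helper assumes linear interpolation between the two closest ranks.

```javascript
// Illustrative sketch: the sample values and the 2,500 ms budget are made up.
// Percentile by linear interpolation between the two closest ranks.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return NaN;
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical LCP samples in ms: five fast loads, three slow ones.
const lcpMs = [900, 1000, 1100, 1200, 1300, 2600, 3000, 3400];
const budgetMs = 2500;

const meanGatePasses = mean(lcpMs) <= budgetMs;          // mean is 1812.5
const p75GatePasses = percentile(lcpMs, 75) <= budgetMs; // P75 is 2700

console.log({ meanGatePasses, p75GatePasses });
// { meanGatePasses: true, p75GatePasses: false }
```

Three of the eight loads exceed the budget, yet the mean sits comfortably under it; only the P75 assertion reports the breach.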
Prerequisites & Environment
Percentile tuning consumes a sample of runs, not a single audit. You need enough collection to estimate the percentile and enough storage to track it over time.
- A multi-run collector: Lighthouse CI with numberOfRuns ≥ 5 per URL, or a RUM stream feeding an aggregation store. A single run cannot produce a percentile. Pin collection determinism per the Lighthouse CI Configuration & Storage reference so the spread you measure is real, not runner noise.
- Node.js ≥ 18 for the percentile evaluation script, plus jq for quick CLI inspection of result manifests.
- A field or lab dataset with a known sample size. Record N alongside every percentile; a P90 from 8 samples is not a P90. Below ~20 samples for lab and ~1,000 sessions for field, percentile estimates are too unstable to gate on; see Troubleshooting.
- Environment pinning. Always tag each percentile with its device class and connection profile. A P75 LCP of 2,500 ms on mid-range mobile / Fast 3G is a different contract from P75 on desktop / cable, and mixing them silently corrupts the budget. Calibrate emulation through Device & Network Emulation Weighting.
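The last two requirements (record N, never pool device classes) can be sketched as a grouping step that runs before any percentile is computed. The sample shape `{ valueMs, device, connection }` and the function name `percentilesBySegment` are hypothetical; adapt them to whatever your collector emits.

```javascript
// Percentile by linear interpolation between the two closest ranks.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return NaN;
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
}

// Assumed sample shape: { valueMs, device, connection }.
function percentilesBySegment(samples, p) {
  const groups = new Map();
  for (const s of samples) {
    const key = `${s.device}/${s.connection}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(s.valueMs);
  }
  // Report N with every estimate so an under-sampled percentile stays visible.
  return [...groups].map(([segment, values]) => ({
    segment,
    p,
    n: values.length,
    valueMs: percentile(values, p),
  }));
}

// Hypothetical LCP samples for two environments.
const samples = [
  ...[2400, 2600, 2800, 3000, 3200].map((valueMs) => ({ valueMs, device: 'mobile', connection: 'fast3g' })),
  ...[900, 1000, 1100].map((valueMs) => ({ valueMs, device: 'desktop', connection: 'cable' })),
];

console.log(percentilesBySegment(samples, 75));
```

With these made-up values the mobile P75 is 3,000 ms (N = 5) and the desktop P75 is 1,050 ms (N = 3), while the pooled P75 would be 2,850 ms, a figure that describes neither environment.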
Configuration Reference
Express percentile budgets as data, not code, so the gate is auditable and diff-able. The annotated thresholds.json below is the authoritative spec — each route declares the metric, the percentile that expresses its contract, the ceiling, and the minimum sample size required before the assertion is allowed to fail rather than warn.
    {
      "minimumSamples": 20,
      "routes": {
        "/checkout": {
          "lcp": { "percentile": 75, "maxMs": 2500, "level": "error" },
          "inp": { "percentile": 75, "maxMs": 200, "level": "error" },
          "cls": { "percentile": 90, "max": 0.10, "level": "error" }
        },
        "/landing": {
          "lcp": { "percentile": 75, "maxMs": 2800, "level": "error" },
          "inp": { "percentile": 90, "maxMs": 300, "level": "warn" }
        }
      },
      "gating": { "failBufferPercent": 5 }
    }
percentile names which point of the distribution is the contract — P75 for typical-user metrics, P90 for stricter flows, escalating to P95/P99 only for revenue-critical paths (the decision is spelled out in Choosing Between P75 and P90 Budget Targets). minimumSamples blocks the gate from asserting on an under-sampled percentile: below the floor it downgrades to warn. failBufferPercent adds a small tolerance so a percentile sitting exactly on the line does not flap the build.
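One way to read minimumSamples and failBufferPercent together is sketched below. It assumes the buffer is applied as a percentage added on top of the ceiling, which the spec above does not spell out; `resolveAssertion` is a hypothetical helper name.

```javascript
// Sketch: derive the effective level and fail line for one metric budget.
// Assumption: failBufferPercent widens the ceiling by that percentage.
function resolveAssertion(budget, sampleCount, config) {
  const ceiling = budget.maxMs ?? budget.max;
  const failAbove = ceiling + (ceiling * config.gating.failBufferPercent) / 100;
  // Below the sample floor the assertion may warn but never fail the build.
  const level = sampleCount < config.minimumSamples ? 'warn' : budget.level;
  return { ceiling, failAbove, level };
}

const config = { minimumSamples: 20, gating: { failBufferPercent: 5 } };
const checkoutLcp = { percentile: 75, maxMs: 2500, level: 'error' };

console.log(resolveAssertion(checkoutLcp, 25, config));
// { ceiling: 2500, failAbove: 2625, level: 'error' }
console.log(resolveAssertion(checkoutLcp, 5, config));
// { ceiling: 2500, failAbove: 2625, level: 'warn' }
```

Under this reading a /checkout LCP P75 of 2,600 ms passes with a warning-sized margin, while 2,626 ms fails once at least 20 samples back the estimate.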
Step-by-Step Implementation
- Collect a sample. Run the collector with at least five runs per URL so each metric has a distribution to compute a percentile over.

      npx lhci collect --numberOfRuns=5 --url=https://staging.example.com/checkout

  Expected tail: Run #5 ... Done running Lighthouse! and a .lighthouseci/ directory holding five JSON reports.
- Compute percentiles from the runs. The script below reads every numeric value for a metric, sorts the values, and linearly interpolates between the two closest ranks to get the requested percentile. Because it interpolates rather than picking the nearest rank, the result can lie between two observed runs.

      // scripts/percentile.js
      // Returns the p-th percentile (0-100) of a list of numbers,
      // interpolating linearly between the two closest ranks.
      function percentile(values, p) {
        const sorted = [...values].sort((a, b) => a - b); // copy, then numeric sort
        if (sorted.length === 0) return NaN;
        const rank = (p / 100) * (sorted.length - 1); // fractional index into sorted
        const lo = Math.floor(rank);
        const hi = Math.ceil(rank);
        if (lo === hi) return sorted[lo]; // rank lands exactly on an observation
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
      }

      module.exports = { percentile };

      node -e "const {percentile}=require('./scripts/percentile');\
      console.log(percentile([180,190,205,210,260,195,200],75))"

  Expected output: 207.5, the interpolated P75 INP in milliseconds across those seven runs.
- Assert against the budget. Feed the computed percentile and the thresholds.json contract into an evaluator that exits non-zero on a breach, then commit both thresholds.json and the evaluator.
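The evaluator itself is not shown on this page, so the following is only a minimal sketch of what scripts/evaluate-percentiles.js could look like. It assumes the collector leaves `lhr-*.json` reports in the reports directory, that each report exposes `audits[<id>].numericValue`, and that failBufferPercent widens the ceiling by that percentage. The `AUDIT_IDS` map and all function names are hypothetical. INP is left out of the map on the assumption that a default lab navigation run does not report it; feed it from RUM or a separate flow.

```javascript
// scripts/evaluate-percentiles.js (sketch, not a definitive implementation)
const fs = require('node:fs');
const path = require('node:path');

// Assumed mapping from budget keys to Lighthouse audit ids.
const AUDIT_IDS = {
  lcp: 'largest-contentful-paint',
  cls: 'cumulative-layout-shift',
};

function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return NaN;
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
}

// Reads one numeric value per report for the given metric.
// Assumes report files are named lhr-*.json inside reportsDir.
function readSamples(reportsDir, metric) {
  return fs.readdirSync(reportsDir)
    .filter((f) => f.startsWith('lhr-') && f.endsWith('.json'))
    .map((f) => JSON.parse(fs.readFileSync(path.join(reportsDir, f), 'utf8')))
    .map((lhr) => lhr.audits?.[AUDIT_IDS[metric]]?.numericValue)
    .filter((v) => typeof v === 'number');
}

// samplesFor(metric) returns the numeric samples for that metric.
function evaluateRoute(routeBudget, samplesFor, config) {
  return Object.entries(routeBudget).map(([metric, budget]) => {
    const values = samplesFor(metric);
    const ceiling = budget.maxMs ?? budget.max;
    const failAbove = ceiling + (ceiling * config.gating.failBufferPercent) / 100;
    const value = percentile(values, budget.percentile);
    // Under-sampled percentiles may warn but never block the merge.
    const level = values.length < config.minimumSamples ? 'warn' : budget.level;
    return { metric, n: values.length, value, ceiling, breached: value > failAbove, level };
  });
}

// Non-zero only when an error-level assertion is breached.
function exitCodeFor(results) {
  return results.some((r) => r.breached && r.level === 'error') ? 1 : 0;
}

module.exports = { readSamples, evaluateRoute, exitCodeFor };
```

A command-line wrapper would parse `--reports` and `--thresholds`, call `evaluateRoute` with `(metric) => readSamples(reportsDir, metric)` for each route, print the results, and set `process.exitCode` from `exitCodeFor`.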
Threshold Calibration
Pick the percentile per metric from how the metric behaves and how much risk a slow tail carries, not from habit. Layout shift is near-binary and rare-but-severe, so it earns a stricter percentile than a metric that degrades gracefully. The matrix below is a representative starting point by metric and context; derive the actual ceiling from your own field P75 and set the lab assertion 10–15% tighter to absorb lab-to-field drift.
| Metric | Context | Percentile | Ceiling | Why this percentile |
|---|---|---|---|---|
| LCP | Marketing / content routes | P75 | 2,500 ms | Matches the "Good" field tier; covers typical users without chasing rare stalls |
| INP | Interactive routes | P75 | 200 ms | Tail interaction latency matters, but idle-tab outliers should not gate |
| CLS | All routes | P90 | 0.10 | Shifts are rare but jarring; P90 catches the severe minority a P75 misses |
| LCP / INP | Checkout / payment | P90 | route-specific | Revenue-critical flows justify covering nine in ten users, not three in four |
| Custom long-task beacon | Enterprise SLA paths | P95–P99 | contract value | When an SLA names a tail figure, gate at the contracted percentile |
Set the assertion level to warn for any percentile still being calibrated and promote to error only after the threshold has held for two consecutive weekly baselines, so the gate earns trust before it can block a merge.
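The two calibration rules in this section (tighten the lab ceiling relative to field, promote only after the threshold has held) reduce to a little arithmetic. The sketch below is one possible encoding; the function names and the `{ held }` shape for weekly baseline records are hypothetical.

```javascript
// Lab ceiling set tightenPercent below the field P75 target (10-15% per the guidance above).
function labCeilingFromField(fieldP75, tightenPercent = 10) {
  return fieldP75 - (fieldP75 * tightenPercent) / 100;
}

// weeklyBaselines: oldest first, each { held: boolean } where held means
// the percentile stayed under its ceiling for that weekly baseline.
function nextLevel(currentLevel, weeklyBaselines, requiredWeeks = 2) {
  if (currentLevel === 'error') return 'error'; // already promoted
  const recent = weeklyBaselines.slice(-requiredWeeks);
  const heldThroughout =
    recent.length === requiredWeeks && recent.every((b) => b.held);
  return heldThroughout ? 'error' : 'warn';
}

console.log(labCeilingFromField(2500, 10)); // 2250
console.log(labCeilingFromField(2500, 15)); // 2125
console.log(nextLevel('warn', [{ held: false }, { held: true }, { held: true }])); // error
console.log(nextLevel('warn', [{ held: true }, { held: false }]));                // warn
```

Demotion back to warn is deliberately not modelled here; whether a breached error-level gate should ever relax automatically is a policy decision for the team.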
CI Enforcement
This GitHub Actions job collects five runs, computes the configured percentile per metric, and fails the required status check when any error-level percentile breaches its ceiling.
    name: Percentile Performance Gate
    on:
      pull_request:
        branches: [main]
    jobs:
      percentile-gate:
        runs-on: ubuntu-latest
        timeout-minutes: 20
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with:
              node-version: "20"
              cache: "npm"
          - run: npm ci
          - run: npm run build
          - name: Collect runs
            run: npx lhci collect --numberOfRuns=5 --url=http://localhost:8080/checkout
          - name: Evaluate percentile budgets
            run: node ./scripts/evaluate-percentiles.js
              --reports .lighthouseci
              --thresholds ./config/thresholds.json
          - name: Upload reports
            if: always()
            uses: actions/upload-artifact@v4
            with:
              name: percentile-reports
              path: .lighthouseci/
Require the percentile-gate check in branch protection so a tail regression cannot merge. Wire the same evaluator into Automated Regression Detection to alert when a percentile trends toward its ceiling before it crosses, and stabilise the underlying distribution first via Statistical Noise & Flakiness Reduction so the percentile you assert on is signal, not jitter.
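An early-warning check of the kind described above could look like the sketch below. It is not taken from the Automated Regression Detection reference; the function name, the three-point window, and the 10% headroom default are all hypothetical values chosen for illustration.

```javascript
// history: percentile values from successive baselines, oldest first.
// Flags a percentile that is still under its ceiling but has risen
// steadily and has little headroom left.
function approachingCeiling(history, ceiling, { minHeadroomPercent = 10, window = 3 } = {}) {
  const recent = history.slice(-window);
  if (recent.length < window) return false; // not enough history to call a trend
  const rising = recent.every((v, i) => i === 0 || v > recent[i - 1]);
  const latest = recent[recent.length - 1];
  const headroomPercent = ((ceiling - latest) / ceiling) * 100;
  return rising && latest <= ceiling && headroomPercent < minHeadroomPercent;
}

console.log(approachingCeiling([2000, 2150, 2300, 2400], 2500)); // true: rising, 4% headroom
console.log(approachingCeiling([2400, 2300, 2200], 2500));       // false: falling
console.log(approachingCeiling([2000, 2050, 2100], 2500));       // false: 16% headroom
```

A strictly-rising test is crude and sensitive to noise, so it works best on percentiles that have already been stabilised.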
Troubleshooting & Edge Cases
- Percentile flaps run-to-run → the sample is too small. A P90 over 5 runs is dominated by one observation; raise numberOfRuns to 9+, or aggregate several PR runs into a rolling window before evaluating.
- P99 is wildly unstable in lab → you cannot estimate a P99 from tens of samples. Reserve P95/P99 for field datasets with thousands of sessions; gate lab runs at P75/P90 and watch the high tail in RUM.
- Lab percentile passes but field P75 fails → expected lab-to-field gap. Set lab ceilings 10–15% tighter than the field target you actually care about.
- Mean looks fine, users complain → the average is absorbing the tail. Switch the assertion from mean/median to the percentile; that is the entire point of this layer.
- Mixed device classes in one percentile → segment first. Compute and gate P75 per device class and connection profile, never on a pooled distribution.
- A single slow third-party run poisons P90 → apply IQR or Z-score outlier filtering before the percentile step, and pin vendor versions per Third-Party Script Constraints.
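The IQR filtering mentioned in the last item can be sketched as follows, assuming the conventional 1.5 × IQR fences; the multiplier and the function name `filterIqrOutliers` are illustrative choices, not something this reference prescribes.

```javascript
// Percentile by linear interpolation between the two closest ranks.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return NaN;
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
}

// Drops values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual default.
function filterIqrOutliers(values, k = 1.5) {
  const q1 = percentile(values, 25);
  const q3 = percentile(values, 75);
  const iqr = q3 - q1;
  const lowFence = q1 - k * iqr;
  const highFence = q3 + k * iqr;
  return values.filter((v) => v >= lowFence && v <= highFence);
}

// Hypothetical INP runs in ms with one third-party stall.
const runs = [180, 190, 195, 200, 205, 210, 900];
const cleaned = filterIqrOutliers(runs);

console.log(cleaned);                 // [ 180, 190, 195, 200, 205, 210 ]
console.log(percentile(cleaned, 90)); // 207.5
```

Filtering before a tail percentile cuts both ways: it removes a one-off stall, but it can also hide a genuine tail regression, so log every dropped value and review the filter if it starts firing on most runs.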
Frequently Asked Questions
Why gate on a percentile instead of the average?
The mean blends fast and slow sessions into a value no user experiences, so it stays green while the slow tail degrades. A percentile such as P75 or P90 is an actual point in the distribution — it answers "how bad is the experience for the slowest quarter (or tenth) of users?", which is the question a budget exists to protect.
How many runs do I need to compute a stable percentile?
For lab runs, five is the bare minimum to compute a P75 and nine or more for a P90, but estimates that small still flap; treat roughly 20 samples as the floor for an error-level gate. A P99 needs field data with thousands of sessions, not a handful of CI runs. The rule of thumb: the higher the percentile, the more samples it takes to estimate it without flapping. Below the floor, downgrade the assertion to warn using a minimumSamples guard.
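The arithmetic behind that rule of thumb is easy to inspect. Assuming the linear-interpolation percentile used elsewhere on this page, the sketch below reports which sorted observations a percentile reads from for a given run count; `drivingRanks` is a hypothetical helper.

```javascript
// For n runs, return the 1-based sorted positions the p-th percentile
// interpolates between, and the weight given to the upper position.
function drivingRanks(p, n) {
  const rank = (p / 100) * (n - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return { lower: lo + 1, upper: hi + 1, upperWeight: rank - lo };
}

console.log(drivingRanks(90, 5)); // positions 4 and 5, about 0.6 of the weight on the slowest run
console.log(drivingRanks(90, 9)); // positions 8 and 9, about 0.2 of the weight on the slowest run
console.log(drivingRanks(75, 5)); // position 4 only
```

With five runs, roughly 60% of a P90 comes from the single slowest run, so one noisy run moves the gate; with nine runs that share drops to roughly 20%.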
Should every metric use the same percentile?
No. Match the percentile to the metric's shape and the route's business risk. Typical timing metrics like LCP and INP work well at P75; rare-but-severe metrics like CLS earn P90; revenue-critical or SLA-bound paths justify P90 through P99. See Choosing Between P75 and P90 Budget Targets for the decision procedure.