Percentile-Based Threshold Tuning

A mean hides the failure that matters: it absorbs a handful of fast loads and a handful of slow outliers into one number that no real user ever experiences, so a budget gated on the average passes while a quarter of your sessions degrade. Percentile-driven gating replaces brittle mean and median baselines with distribution-aware thresholds that track the tail behaviour users actually feel. This is the distribution layer of the Threshold Calibration & Baseline Management reference: it specifies how to compute P75, P90, and P99 from your runs, decide which percentile belongs to which metric, and assert on that percentile in CI so a tail regression is unmergeable.

The work splits into three coupled concerns — which percentile expresses the contract for a given metric, how you compute that percentile stably from a finite sample of runs, and what assertion fails the build when the percentile drifts. Choose the percentile too low and you ship tail pain; compute it from too few runs and the gate flaps; assert on a noisy percentile and the team mutes it. This page is the authoritative spec for all three.

Core Concept: From a Distribution to a Gate

Every metric you collect (LCP, INP, CLS, a custom beacon) is a distribution, not a point. A percentile is the value below which that share of observations falls: P75 is the slowest experience of the fastest three-quarters of users, P90 the slowest of the fastest nine-tenths. The budget line is a single horizontal threshold; the gate fails when the chosen percentile of the run distribution crosses it. The diagram below shows where each marker sits and how the budget line relates to them.

[Figure: Latency distribution with P75, P90, and P99 markers against a budget line. A right-skewed distribution of a latency metric, with metric value on the x-axis (slower to the right) and users on the y-axis. Vertical markers fall at the 75th, 90th, and 99th percentiles, each further into the slow tail. A horizontal budget line sits at the chosen percentile, and the area beyond it is the share of users over budget.]
The chosen percentile (here P90) is compared against the flat budget line; everything in the right tail past that marker is the population the gate protects.
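To make the gap between the mean and the tail concrete, the sketch below compares the mean with P75 and P90 for a small right-skewed sample. The ten LCP values and the 2,500 ms budget are invented for illustration, and the `percentile` helper is a stand-in that interpolates linearly between the two closest ranks.

```javascript
// Illustrative only: the sample values and the budget are made up.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return NaN;
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  // When rank is an integer, lo === hi and the second term is zero.
  return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Ten hypothetical LCP samples in ms: mostly fast, with a slow tail.
const lcpMs = [1500, 1600, 1700, 1800, 1900, 2000, 2400, 2600, 3400, 4200];
const budgetMs = 2500;

const summary = {
  mean: mean(lcpMs),          // 2310, under the budget
  p75: percentile(lcpMs, 75), // 2550, over the budget
  p90: percentile(lcpMs, 90), // roughly 3480, well over the budget
};

for (const [name, value] of Object.entries(summary)) {
  console.log(`${name}: ${value.toFixed(0)} ms ${value <= budgetMs ? "pass" : "FAIL"}`);
}
```

With this sample a gate on the mean stays green while three of the ten runs exceed the budget, which is the failure a percentile gate is meant to catch.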

Prerequisites & Environment

Percentile tuning consumes a sample of runs, not a single audit. You need enough collection to estimate the percentile and enough storage to track it over time.

  • A multi-run collector — Lighthouse CI with numberOfRuns ≥ 5 per URL, or a RUM stream feeding an aggregation store. A single run cannot produce a percentile. Pin collection determinism per the Lighthouse CI Configuration & Storage reference so the spread you measure is real, not runner noise.
  • Node.js ≥ 18 for the percentile evaluation script, plus jq for quick CLI inspection of result manifests.
  • A field or lab dataset with a known sample size. Record N alongside every percentile; a P90 from 8 samples is not a P90. Below ~20 samples for lab and ~1,000 sessions for field, percentile estimates are too unstable to gate on — see Troubleshooting.
  • Environment pinning. Always tag each percentile with its device class and connection profile. A P75 LCP of 2,500 ms on mid-range mobile / Fast 3G is a different contract from P75 on desktop / cable, and mixing them silently corrupts the budget. Calibrate emulation through Device & Network Emulation Weighting.
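One way to keep N and the environment attached to every percentile is to compute one record per segment rather than one pooled number. The sketch below assumes each sample carries hypothetical `device`, `connection`, and `value` fields; those names are illustrative and not the output format of any particular collector.

```javascript
// Sketch: one percentile record per (device, connection) segment.
// The sample shape { device, connection, value } is an assumption.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return NaN;
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
}

function percentileBySegment(samples, p) {
  const groups = new Map();
  for (const { device, connection, value } of samples) {
    const key = `${device}|${connection}`;
    if (!groups.has(key)) groups.set(key, { device, connection, values: [] });
    groups.get(key).values.push(value);
  }
  return [...groups.values()].map(({ device, connection, values }) => ({
    device,
    connection,
    percentile: p,
    value: percentile(values, p),
    n: values.length, // stored next to the estimate so it is never read without N
  }));
}

const records = percentileBySegment(
  [
    { device: "mobile", connection: "fast-3g", value: 2400 },
    { device: "mobile", connection: "fast-3g", value: 2600 },
    { device: "mobile", connection: "fast-3g", value: 3000 },
    { device: "desktop", connection: "cable", value: 1100 },
    { device: "desktop", connection: "cable", value: 1300 },
  ],
  75
);
console.log(records);
```

The tiny sample sizes here are only for readability; records with an `n` this low would fall under the sample floor and should not gate.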

Configuration Reference

Express percentile budgets as data, not code, so the gate is auditable and diff-able. The annotated thresholds.json below is the authoritative spec — each route declares the metric, the percentile that expresses its contract, the ceiling, and the minimum sample size required before the assertion is allowed to fail rather than warn.

{
  "minimumSamples": 20,
  "routes": {
    "/checkout": {
      "lcp":  { "percentile": 75, "maxMs": 2500, "level": "error" },
      "inp":  { "percentile": 75, "maxMs": 200,  "level": "error" },
      "cls":  { "percentile": 90, "max":   0.10, "level": "error" }
    },
    "/landing": {
      "lcp":  { "percentile": 75, "maxMs": 2800, "level": "error" },
      "inp":  { "percentile": 90, "maxMs": 300,  "level": "warn"  }
    }
  },
  "gating": { "failBufferPercent": 5 }
}

percentile names which point of the distribution is the contract — P75 for typical-user metrics, P90 for stricter flows, escalating to P95/P99 only for revenue-critical paths (the decision is spelled out in Choosing Between P75 and P90 Budget Targets). minimumSamples blocks the gate from asserting on an under-sampled percentile: below the floor it downgrades to warn. failBufferPercent adds a small tolerance so a percentile sitting exactly on the line does not flap the build.
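Because the budget is data, it can be checked before it is trusted. The following is a minimal validation sketch for the shape shown above. `validateThresholds` is a hypothetical helper, and the rules it enforces (a percentile strictly between 0 and 100, exactly one of `maxMs` or `max`, a `level` of `warn` or `error`) are inferred from the example rather than taken from a published schema.

```javascript
// Sketch: sanity-check a thresholds object before using it to gate.
// Returns a list of human-readable problems; an empty list means it looks valid.
function validateThresholds(config) {
  const errors = [];
  if (!Number.isInteger(config.minimumSamples) || config.minimumSamples < 1) {
    errors.push("minimumSamples must be a positive integer");
  }
  for (const [route, metrics] of Object.entries(config.routes ?? {})) {
    for (const [metric, rule] of Object.entries(metrics)) {
      const where = `${route}.${metric}`;
      if (!(rule.percentile > 0 && rule.percentile < 100)) {
        errors.push(`${where}: percentile must be between 0 and 100`);
      }
      const ceilings = ["maxMs", "max"].filter((k) => typeof rule[k] === "number");
      if (ceilings.length !== 1) {
        errors.push(`${where}: set exactly one of maxMs or max`);
      }
      if (rule.level !== "warn" && rule.level !== "error") {
        errors.push(`${where}: level must be "warn" or "error"`);
      }
    }
  }
  return errors;
}

const problems = validateThresholds({
  minimumSamples: 20,
  routes: {
    "/checkout": {
      lcp: { percentile: 75, maxMs: 2500, level: "error" },
      cls: { percentile: 90, max: 0.1, level: "error" },
    },
  },
  gating: { failBufferPercent: 5 },
});
console.log(problems.length === 0 ? "thresholds look valid" : problems.join("\n"));
```

Running a check like this in CI before the evaluator keeps a typo in the budget file from silently disabling a gate.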

Step-by-Step Implementation

  1. Collect a sample. Run the collector with at least five runs per URL so each metric has a distribution to compute a percentile from.

    npx lhci collect --numberOfRuns=5 --url=https://staging.example.com/checkout

    Expected tail: Run #5 ... Done running Lighthouse! and a .lighthouseci/ directory holding five JSON reports.

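  2. Compute percentiles from the runs. The script below reads every numeric value for a metric, sorts, and interpolates linearly between the two closest ranks to get the requested percentile. This is the same estimator as NumPy's default percentile method; field datasets may use a different estimator, so expect small differences from their reported values.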

    // scripts/percentile.js
    function percentile(values, p) {
      const sorted = [...values].sort((a, b) => a - b);
      if (sorted.length === 0) return NaN;
      const rank = (p / 100) * (sorted.length - 1);
      const lo = Math.floor(rank);
      const hi = Math.ceil(rank);
      if (lo === hi) return sorted[lo];
      return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }
    
    module.exports = { percentile };

    Run it against a sample:

    node -e "const {percentile}=require('./scripts/percentile');\
    console.log(percentile([180,190,205,210,260,195,200],75))"

    Expected output: 207.5 — the interpolated P75 INP in milliseconds across those seven runs.

  3. Assert against the budget. Feed the computed percentile and the thresholds.json contract into an evaluator that exits non-zero on a breach, then commit both thresholds.json and the evaluator.
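The evaluator itself is not shown here, so the following is only a sketch of what its core decision could look like. It assumes percentiles have already been computed into hypothetical `{ route, metric, value, n }` records, reads the ceiling from `maxMs` or `max`, treats `failBufferPercent` as a percentage added on top of the ceiling, and downgrades `error` to `warn` below `minimumSamples`. Those interpretations follow the configuration description above but are assumptions about the real script.

```javascript
// Sketch: decide pass / warn / error for precomputed percentile records.
function evaluate(records, thresholds) {
  const buffer = 1 + (thresholds.gating?.failBufferPercent ?? 0) / 100;
  const findings = [];
  for (const { route, metric, value, n } of records) {
    const rule = thresholds.routes?.[route]?.[metric];
    if (!rule) continue; // no contract declared for this route/metric
    const ceiling = rule.maxMs ?? rule.max;
    if (value <= ceiling * buffer) continue; // within budget plus tolerance
    // Under-sampled percentiles may warn but never block.
    const level = n < thresholds.minimumSamples ? "warn" : rule.level;
    findings.push({ route, metric, value, ceiling, n, level });
  }
  const exitCode = findings.some((f) => f.level === "error") ? 1 : 0;
  return { findings, exitCode };
}

const result = evaluate(
  [
    { route: "/checkout", metric: "lcp", value: 2700, n: 25 }, // breach, enough samples
    { route: "/checkout", metric: "inp", value: 230, n: 5 },   // breach, under-sampled
  ],
  {
    minimumSamples: 20,
    routes: {
      "/checkout": {
        lcp: { percentile: 75, maxMs: 2500, level: "error" },
        inp: { percentile: 75, maxMs: 200, level: "error" },
      },
    },
    gating: { failBufferPercent: 5 },
  }
);
console.log(result.findings.map((f) => `${f.level}: ${f.route} ${f.metric} ${f.value} > ${f.ceiling}`));
```

A command-line wrapper would set `process.exitCode = result.exitCode` after printing the findings, so that only `error`-level breaches fail the build.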

Threshold Calibration

Pick the percentile per metric from how the metric behaves and how much risk a slow tail carries, not from habit. Layout shift is near-binary and rare-but-severe, so it earns a stricter percentile than a metric that degrades gracefully. The matrix below is a representative starting point by metric and context; derive the actual ceiling from your own field P75 and set the lab assertion 10–15% tighter to absorb lab-to-field drift.

Metric | Context | Percentile | Ceiling | Why this percentile
LCP | Marketing / content routes | P75 | 2,500 ms | Matches the "Good" field tier; covers typical users without chasing rare stalls
INP | Interactive routes | P75 | 200 ms | Tail interaction latency matters, but idle-tab outliers should not gate
CLS | All routes | P90 | 0.10 | Shifts are rare but jarring; P90 catches the severe minority a P75 misses
LCP / INP | Checkout / payment | P90 | route-specific | Revenue-critical flows justify covering nine in ten users, not three in four
Custom long-task beacon | Enterprise SLA paths | P95–P99 | contract value | When an SLA names a tail figure, gate at the contracted percentile
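The 10–15% lab tightening described above is simple arithmetic. As a small illustration, the hypothetical helper below derives a lab ceiling from a field P75; the function name and its 10% default are assumptions for this sketch.

```javascript
// Sketch: tighten a field target by a percentage to get the lab ceiling.
function labCeiling(fieldP75, tightenPercent = 10) {
  if (!(tightenPercent >= 0 && tightenPercent < 100)) {
    throw new RangeError("tightenPercent must be in [0, 100)");
  }
  return fieldP75 * (1 - tightenPercent / 100);
}

// A field P75 LCP of 2,500 ms gives a lab ceiling of about 2,250 ms at 10%
// and about 2,125 ms at 15%.
console.log(labCeiling(2500, 10).toFixed(0), labCeiling(2500, 15).toFixed(0));
```

The same helper works for unitless metrics such as CLS, where a field target of 0.10 tightened by 10% gives roughly 0.09.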

Set the assertion level to warn for any percentile still being calibrated and promote to error only after the threshold has held for two consecutive weekly baselines, so the gate earns trust before it can block a merge.
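The promotion rule can be written down as a small check. This sketch assumes the weekly baselines are kept as an oldest-first array of the gated percentile's value, and reads "held" as "at or under the ceiling"; both the storage shape and the `nextLevel` name are hypothetical.

```javascript
// Sketch: promote warn to error once the most recent weekly baselines all held.
function nextLevel(currentLevel, weeklyBaselines, ceiling, requiredWeeks = 2) {
  if (currentLevel === "error") return "error"; // never demote automatically
  const recent = weeklyBaselines.slice(-requiredWeeks);
  const held = recent.length === requiredWeeks && recent.every((v) => v <= ceiling);
  return held ? "error" : "warn";
}

console.log(nextLevel("warn", [2600, 2450, 2400], 2500)); // last two weeks held
console.log(nextLevel("warn", [2400, 2600], 2500));       // latest week breached
```

Requiring the weeks to be consecutive and most recent means a single breached baseline resets the clock.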

CI Enforcement

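This GitHub Actions job collects five runs, computes the configured percentile per metric, and fails the required status check when any error-level percentile breaches its ceiling. Five runs sit below the example minimumSamples of 20, so on their own they can only warn; for the gate to block, evaluate a rolling window of runs or set a lower floor for lab gating.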

name: Percentile Performance Gate
on:
  pull_request:
    branches: [main]

jobs:
  percentile-gate:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - run: npm ci
      - run: npm run build
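      - name: Collect runs
        # Assumes the built app is already being served on localhost:8080;
        # otherwise configure LHCI's startServerCommand or staticDistDir.
        run: npx lhci collect --numberOfRuns=5 --url=http://localhost:8080/checkout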
      - name: Evaluate percentile budgets
        run: >-
          node ./scripts/evaluate-percentiles.js
          --reports .lighthouseci
          --thresholds ./config/thresholds.json
      - name: Upload reports
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: percentile-reports
          path: .lighthouseci/

Require the percentile-gate check in branch protection so a tail regression cannot merge. Wire the same evaluator into Automated Regression Detection to alert when a percentile trends toward its ceiling before it crosses, and stabilise the underlying distribution first via Statistical Noise & Flakiness Reduction so the percentile you assert on is signal, not jitter.
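An early-warning check of this kind could look like the sketch below. It flags a percentile that is still under its ceiling but rising and within a headroom band. The oldest-first `history` array, the 10% default headroom, and the `approachingCeiling` name are all assumptions for illustration, not part of the regression-detection tooling.

```javascript
// Sketch: alert when the gated percentile is rising and close to its ceiling.
function approachingCeiling(history, ceiling, headroomPercent = 10) {
  if (history.length < 2) return false; // need two points to see a direction
  const latest = history[history.length - 1];
  const previous = history[history.length - 2];
  const rising = latest > previous;
  const floorOfBand = ceiling * (1 - headroomPercent / 100);
  // Values above the ceiling are left to the gate itself.
  const inBand = latest >= floorOfBand && latest <= ceiling;
  return rising && inBand;
}

console.log(approachingCeiling([2100, 2300], 2500)); // rising, inside the band
console.log(approachingCeiling([2300, 2280], 2500)); // inside the band but falling
```

A production version would likely look at more than two points to avoid reacting to a single noisy run.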

Troubleshooting & Edge Cases

  • Percentile flaps run-to-run → the sample is too small. A P90 over 5 runs is dominated by one observation; raise numberOfRuns to 9+, or aggregate several PR runs into a rolling window before evaluating.
  • P99 is wildly unstable in lab → you cannot estimate a P99 from tens of samples. Reserve P95/P99 for field datasets with thousands of sessions; gate lab runs at P75/P90 and watch the high tail in RUM.
  • Lab percentile passes but field P75 fails → expected lab-to-field gap. Set lab ceilings 10–15% tighter than the field target you actually care about.
  • Mean looks fine, users complain → the average is absorbing the tail. Switch the assertion from mean/median to the percentile; that is the entire point of this layer.
  • Mixed device classes in one percentile → segment first. Compute and gate P75 per device class and connection profile, never on a pooled distribution.
  • A single slow third-party run poisons P90 → apply IQR or Z-score outlier filtering before the percentile step, and pin vendor versions per Third-Party Script Constraints.
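For the outlier-filtering step in the last item, a minimal IQR filter might look like the following. It uses the conventional 1.5 × IQR fences; the multiplier, the four-sample minimum, and the helper names are assumptions for this sketch rather than values the reference prescribes.

```javascript
// Sketch: drop values outside the 1.5 * IQR fences before taking a percentile.
function quantile(sorted, q) {
  const rank = q * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
}

function filterIqrOutliers(values, k = 1.5) {
  if (values.length < 4) return [...values]; // too few points to estimate quartiles
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lowFence = q1 - k * iqr;
  const highFence = q3 + k * iqr;
  return values.filter((v) => v >= lowFence && v <= highFence);
}

// Seven ordinary INP runs plus one run poisoned by a slow third party.
const runs = [180, 190, 205, 210, 260, 195, 200, 900];
const kept = filterIqrOutliers(runs);
console.log(`kept ${kept.length} of ${runs.length} runs:`, kept);
```

Filtering removes exactly the kind of points a tail percentile is meant to see, so it is worth logging how many runs were dropped and investigating when the count is not small.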

Frequently Asked Questions

Why gate on a percentile instead of the average?

The mean blends fast and slow sessions into a value no user experiences, so it stays green while the slow tail degrades. A percentile such as P75 or P90 is an actual point in the distribution — it answers "how bad is the experience for the slowest quarter (or tenth) of users?", which is the question a budget exists to protect.

How many runs do I need to compute a stable percentile?

For lab runs, five is the bare minimum to compute a P75 and nine or more for a P90, but treat those as floors for warn-level reporting: estimates from fewer than about 20 lab samples are too unstable to gate on at error level. A P99 needs field data with thousands of sessions, not a handful of CI runs. The rule of thumb: the higher the percentile, the more samples it takes to estimate it without flapping. Below the floor, downgrade the assertion to warn using a minimumSamples guard.

Should every metric use the same percentile?

No. Match the percentile to the metric's shape and the route's business risk. Typical timing metrics like LCP and INP work well at P75; rare-but-severe metrics like CLS earn P90; revenue-critical or SLA-bound paths justify P90 through P99. See Choosing Between P75 and P90 Budget Targets for the decision procedure.