Comparing Performance Testing Tools

Teams that adopt three performance tools and run them everywhere end up trusting none of them, because each measures something different and they disagree by design. The fix is not to pick one tool forever — it is to know which question each tool answers and route the right tool to the right job. This guide is part of the Lighthouse CI & WebPageTest Integration reference and lays out a head-to-head decision between the three engines that matter: Lighthouse CI for fast deterministic PR gates, WebPageTest for deep connection-controlled diagnostics, and real-user beacons for the ground truth no lab can produce.

The core tension is lab versus field. Lab tools (Lighthouse CI, WebPageTest) give you a reproducible number on demand, perfect for blocking a merge, but they synthesize a single device and connection. Field tools (RUM) give you the real distribution across every device and network your users actually have, but only after code ships and only as a percentile, never on demand. A mature setup uses lab tools to gate and field tools to keep the lab honest.

Decision Matrix

The three engines sit on different points of a depth-versus-speed and lab-versus-field plane. The matrix below maps each tool against the dimensions that decide which one to reach for.

[Figure: Performance tool comparison matrix. Lighthouse CI (fast lab gate), WebPageTest (deep lab), and RUM beacons (real field data) are rated strong, partial, or weak on speed, determinism, network control, diagnostic depth, real-user fidelity, and CI gating fit.]
Lighthouse CI wins on speed and gating fit; WebPageTest wins on network control and depth; RUM beacons win on real-user fidelity and nothing else, and they cannot gate a merge.

The matrix reads as a routing table: gate with the tool that is strong on speed and gating fit (Lighthouse CI), diagnose with the tool that is strong on depth and network control (WebPageTest), and validate with the tool that is strong on real-user fidelity (RUM beacons). No tool is strong on every dimension, which is exactly why a complete setup uses all three for different jobs.

Prerequisites & Environment

Comparing tools fairly means running them against the same target under the same conditions, or the disagreement you see is an artifact of setup, not a real difference.

  • A stable staging or preview URL all three can hit. Lighthouse CI and WebPageTest run against it directly; RUM compares against production traffic for the same routes.
  • @lhci/cli ≥ 0.13 and Node.js ≥ 18 for the Lighthouse path, detailed in Lighthouse CI Configuration & Storage.
  • A reachable WebPageTest agent and API key. A shared public instance works for a one-off comparison; a dedicated agent removes location and queue variance, covered in WebPageTest Private Instance Setup.
  • A field beacon already deployed so you have a P75 baseline to compare lab numbers against, set up through Custom Performance Beacons & RUM.
  • One shared budget file — the same thresholds expressed once and imported by every engine, so a disagreement is a real signal rather than a typo.

Configuration Reference: One Budget, Three Consumers

The trap is encoding the budget three times and drifting. Define it once and let each engine read it. This annotated module is the single source of truth.

// budget.js — one budget, consumed by every tool
module.exports = {
  // P75 ceilings for high-end mobile on a 4G profile
  lcp: 2500,            // ms — Largest Contentful Paint, lab + field
  cls: 0.1,            // unitless — Cumulative Layout Shift, lab + field
  tbt: 200,            // ms — lab proxy for INP; INP itself is field-only
  inp: 200,            // ms — field gate, enforced via RUM percentiles
  bytesJs: 200000,     // bytes — script transfer budget
  connectivity: "4G",   // WebPageTest connection profile to match the device class
};

Lighthouse CI maps these into assertions, the WebPageTest script maps them into its comparison thresholds, and the RUM pipeline aggregates field values to P75 and compares against the same lcp, cls, and inp. When all three read this file, "Lighthouse passed but WebPageTest failed" means the engines genuinely saw different conditions — which is the signal you want.

Step-by-Step Selection Process

  1. State the question. "Will this PR regress performance?" routes to Lighthouse CI. "Why did this route get slower?" routes to WebPageTest. "Are real users actually affected?" routes to RUM. Write the question down before picking a tool.

    Expected outcome: each task maps to exactly one primary engine.

  2. Check the speed budget of the answer. A PR gate must return in under a couple of minutes, which rules out WebPageTest as a blocking check on every commit. Time a Lighthouse CI run to confirm it fits:

    time npx lhci autorun --collect.url=http://localhost:3000/

    Expected: under ~90 seconds for three runs on a single URL. If you need an answer faster than WebPageTest's multi-minute turnaround, the gate is Lighthouse CI.

  3. Check whether network realism matters. If the regression is connection-sensitive — TTFB, request chains, third-party blocking — only WebPageTest's real shaped connection reproduces it. Confirm the agent responds:

    curl -s "$WPT_SERVER/getLocations.php?f=json" | head -c 200

    Expected: a JSON payload listing your agent location. Empty or error means the agent is unreachable and the comparison would be invalid.

  4. Anchor to field truth. Pull the live P75 for the route and compare it to the lab number. If lab and field disagree by more than ~15%, trust the field and recalibrate the lab, not the other way around.
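The comparison in step 4 can be sketched as a small function. The names labFieldDrift, calibrationVerdict, and DRIFT_LIMIT are hypothetical, and measuring the gap relative to the field value is an assumption; the step only gives ~15% as a rule of thumb.

```javascript
// Relative gap between a lab measurement and the field P75 for the same route.
// ASSUMPTION: the gap is expressed as a fraction of the field value.
function labFieldDrift(labValue, fieldP75) {
  if (!(fieldP75 > 0)) throw new RangeError("fieldP75 must be a positive number");
  return Math.abs(labValue - fieldP75) / fieldP75;
}

// Hypothetical constant encoding the ~15% rule of thumb from step 4.
const DRIFT_LIMIT = 0.15;

function calibrationVerdict(labValue, fieldP75, limit = DRIFT_LIMIT) {
  const drift = labFieldDrift(labValue, fieldP75);
  return {
    drift,
    // The field is treated as truth, so a large gap means the lab needs recalibrating.
    action: drift > limit ? "recalibrate-lab" : "lab-is-representative",
  };
}

// Made-up numbers: lab LCP of 1900 ms against a field P75 LCP of 2600 ms.
console.log(calibrationVerdict(1900, 2600));
```

With these sample values the gap is well above 15%, so the verdict is to recalibrate the lab profile rather than to question the field data.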

Capability Comparison

| Dimension | Lighthouse CI | WebPageTest | RUM beacons |
|---|---|---|---|
| Data source | Synthetic lab | Synthetic lab | Real users (field) |
| Turnaround | Seconds | Minutes | Continuous, post-ship |
| Determinism | High (simulate) | High (shaped line) | None (a distribution) |
| Network control | Simulated only | Real, shaped per-profile | Whatever users have |
| Diagnostic depth | Audit-level | Waterfall, filmstrip, connection view | Aggregate percentiles |
| INP measurement | Proxy via TBT | Proxy via TBT | Direct, real |
| Cost to run | Free, CI minutes | Agent compute, slower | Storage + ingest pipeline |
| Best for | Gating every PR | Explaining a regression | Validating lab budgets |

CI Enforcement Snippet

In a complete pipeline the three engines occupy different stages: Lighthouse CI gates every PR, WebPageTest runs on connection-sensitive routes or nightly, and RUM aggregation runs continuously and feeds back into the budget. This workflow wires the two synthetic gates against the shared budget.

name: Performance Gating
on:
  pull_request:
    branches: [main]

jobs:
  fast-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20", cache: "npm" }
      - run: npm ci && npm run build
      - name: Lighthouse CI (every PR)
        run: npx lhci autorun
        env:
          LHCI_GITHUB_APP_TOKEN: ${{ secrets.LHCI_GITHUB_APP_TOKEN }}

  deep-gate:
    runs-on: ubuntu-latest
    # only on routes where network realism matters
    if: contains(github.event.pull_request.labels.*.name, 'network-sensitive')
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20", cache: "npm" }
      - run: npm ci
      - name: WebPageTest (deep diagnostics)
        run: node wpt-gate.js
        env:
          WPT_SERVER: ${{ secrets.WPT_SERVER }}
          WPT_API_KEY: ${{ secrets.WPT_API_KEY }}

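The fast-gate is always required; the deep-gate runs only when a PR is labeled network-sensitive, so WebPageTest's slower turnaround never blocks routine merges. Because the pull_request trigger is declared without a types list, it fires only on opened, synchronize, and reopened events, so a label added after the PR is opened takes effect on the next push.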
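The workflow calls wpt-gate.js without showing it. The sketch below covers only the comparison step, under stated assumptions: evaluateWptRun and METRIC_KEYS are hypothetical names, the metric keys are ones commonly seen in the median first-view section of a WebPageTest JSON result and should be checked against real output from your own agent, and submitting the test and polling for the result are left out.

```javascript
// wpt-gate.js (sketch): compare one WebPageTest result against the shared budget.
// ASSUMPTION: these keys match what your WebPageTest instance emits under
// data.median.firstView in its JSON result; verify against a real result first.
const METRIC_KEYS = {
  lcp: "chromeUserTiming.LargestContentfulPaint",
  cls: "chromeUserTiming.CumulativeLayoutShift",
  tbt: "TotalBlockingTime",
};

// Returns one entry per metric that is over budget or absent from the result.
function evaluateWptRun(firstView, budget) {
  const failures = [];
  for (const [name, key] of Object.entries(METRIC_KEYS)) {
    const actual = firstView[key];
    if (typeof actual !== "number") {
      failures.push({ metric: name, reason: "missing from result" });
    } else if (actual > budget[name]) {
      failures.push({ metric: name, actual, limit: budget[name] });
    }
  }
  return failures;
}

// Made-up sample shaped like a median first-view record.
const sample = {
  "chromeUserTiming.LargestContentfulPaint": 2710,
  "chromeUserTiming.CumulativeLayoutShift": 0.04,
  TotalBlockingTime: 150,
};
const failures = evaluateWptRun(sample, { lcp: 2500, cls: 0.1, tbt: 200 });
console.log(failures);
```

A real gate script would exit with a non-zero status when the failures list is not empty so that the CI step fails. Treating a missing metric as a failure is a deliberate choice here: a silent pass on an incomplete result would hide exactly the regressions the deep gate exists to catch.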

Troubleshooting & Edge Cases

  • Lighthouse and WebPageTest disagree on LCP → Lighthouse simulated the network while WebPageTest shaped a real one. For connection-sensitive metrics WebPageTest is authoritative; align the connectivity profile to the device class before concluding either is wrong.
  • Lab passes but RUM shows users are slow → the lab device and connection are faster than your real P75 user. The lab budget is stale; retighten it against field data through Custom Performance Beacons & RUM.
  • RUM looks fine but the lab keeps failing → the runner is noisier than production. Fix the environment (raise numberOfRuns, set Lighthouse's throttlingMethod to simulate, isolate the runner) rather than loosening the budget.
  • WebPageTest results swing between runs → public-instance queue contention or location variance. Move to a dedicated agent so the connection and hardware are fixed.
  • INP regressions slip through lab gates → a page-load lab run contains no real user interactions, so neither lab tool measures INP here; both only proxy it via TBT. Enforce the real INP ceiling through RUM percentiles, not the lab.
  • All three drift after a dependency bump → a third-party tag got heavier. This is a real regression, not a tooling artifact; track it against Third-Party Script Constraints.
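Enforcing the INP ceiling through RUM percentiles, as the INP item above recommends, can be sketched as follows. The function names are hypothetical and the nearest-rank percentile is an assumption; a production RUM pipeline may use interpolation or histogram buckets instead.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, input left untouched
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical gate: compare the field P75 INP for a route to the budget ceiling.
function inpGate(inpSamplesMs, ceilingMs) {
  const p75 = percentile(inpSamplesMs, 75);
  return { p75, pass: p75 <= ceilingMs };
}

// Made-up INP samples in ms, as a RUM pipeline might collect for one route.
const samples = [80, 120, 96, 310, 150, 180, 240, 110];
console.log(inpGate(samples, 200));
```

For these eight samples the nearest-rank P75 is the sixth-smallest value, 180 ms, which sits under the 200 ms ceiling from the shared budget even though two individual interactions exceed it.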

Frequently Asked Questions

If I can only run one tool, which should it be?

Lighthouse CI, because it is the only one of the three that gates a merge in seconds with a deterministic number. WebPageTest and RUM make that gate smarter and more trustworthy, but they do not replace the fast verdict. Start with the Lighthouse CI vs WebPageTest decision guide if depth is your concern.

Why do Lighthouse CI and WebPageTest report different numbers for the same page?

They model the network differently. Lighthouse CI simulates a connection in software for determinism; WebPageTest shapes a real connection on the agent. The same page on a simulated 4G and a shaped 4G can differ by 10–20% on connection-sensitive metrics. Match the connectivity profile to the device class before treating the gap as a bug.

Can RUM beacons gate a pull request?

No. RUM measures real users after code ships, so by definition it cannot block a merge. Its job is to validate that your lab budgets still match reality and to enforce field-only metrics like INP. Use lab tools to gate and RUM to keep the gate honest — the tradeoffs are covered in synthetic monitoring vs RUM tradeoffs.