Alerting on Performance Budget Regressions

A dashboard catches regressions only when someone is looking at it; an alert catches them at 2 a.m. when a deploy quietly pushes P75 LCP past budget. This guide configures Grafana unified alerting on the budget threshold lines from Visualizing Budget Trends with Grafana so a sustained breach pages the right team — without flooding the channel every time a single noisy bucket clips the line.

The hard part of performance alerting is not detecting a breach; it is staying quiet when there is no real breach. Field percentiles wobble bucket to bucket, and a naive "alert when LCP > 2500 ms" rule fires and resolves dozens of times an hour. The fix is a for duration that requires the breach to persist, an evaluation window wide enough to smooth single-bucket noise, and notification routing that groups and deduplicates before anyone is paged.
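The difference between a naive threshold and a debounced one can be sketched in a few lines. This is a simplified model rather than Grafana's evaluator: the sample values are hypothetical, and the 5-minute interval and 15-minute sustain are the values this guide uses later.

```python
from typing import List

BUDGET_MS = 2500          # LCP budget used throughout this guide
EVAL_INTERVAL_MIN = 5     # assumed evaluation interval
FOR_MIN = 15              # assumed sustain ("for") duration


def naive_transitions(p75_series: List[float]) -> int:
    """Count state changes of a rule that fires the moment P75 > budget."""
    changes, firing = 0, False
    for value in p75_series:
        breached = value > BUDGET_MS
        if breached != firing:
            changes += 1
            firing = breached
    return changes


def debounced_fires(p75_series: List[float]) -> int:
    """Count Pending -> Firing transitions when the breach must persist."""
    fires, pending_min, firing = 0, None, False
    for value in p75_series:
        if value > BUDGET_MS:
            # The first breaching evaluation starts the pending clock at 0.
            pending_min = 0 if pending_min is None else pending_min + EVAL_INTERVAL_MIN
            if not firing and pending_min >= FOR_MIN:
                fires += 1
                firing = True
        else:
            # Any evaluation under budget resets the rule to Normal.
            pending_min, firing = None, False
    return fires


# Hypothetical P75 samples, one per 5-minute evaluation.
noisy = [2400, 2600, 2450, 2700, 2300, 2550, 2400]
sustained = [2400, 2700, 2800, 2750, 2900, 2850, 2400]

print(naive_transitions(noisy), debounced_fires(noisy))          # 6 0
print(naive_transitions(sustained), debounced_fires(sustained))  # 2 1
```

The noisy series flips the naive rule six times and never fires the debounced one; the sustained series fires it exactly once.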

Alert Rule Plan

Metric	Breach condition (P75)	Evaluate every	For (sustain)	Severity
LCP	> 2500 ms	5 m	15 m	warning
INP	> 200 ms	5 m	15 m	warning
CLS	> 0.10	5 m	30 m	warning
LCP	> 4000 ms	5 m	10 m	critical

CLS gets a longer for because layout-shift spikes are often a single bad deploy or a mis-sized ad slot that self-corrects. LCP gets a second rule at the "poor" boundary (4000 ms) with a shorter 10-minute for, so it escalates to critical and pages sooner than the warning rule. Pair both with the statistical methods in Automated Regression Detection so the rule fires on a real shift rather than on variance.
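As a rough illustration of how the plan behaves, the table can be encoded as data and queried for a P75 held at one level. The BudgetRule class and firing helper are hypothetical names for this sketch, and the model ignores evaluation-interval rounding.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BudgetRule:
    metric: str
    threshold: float   # P75 breach level (ms, or unitless for CLS)
    for_min: int       # minutes the breach must persist
    severity: str


# The four rows of the alert rule plan.
PLAN = [
    BudgetRule("LCP", 2500, 15, "warning"),
    BudgetRule("INP", 200, 15, "warning"),
    BudgetRule("CLS", 0.10, 30, "warning"),
    BudgetRule("LCP", 4000, 10, "critical"),
]


def firing(metric: str, p75: float, sustained_min: int) -> List[str]:
    """Severities firing for a P75 held at one level for sustained_min minutes."""
    return [rule.severity for rule in PLAN
            if rule.metric == metric
            and p75 > rule.threshold
            and sustained_min >= rule.for_min]


print(firing("LCP", 4300, 10))   # ['critical']
print(firing("LCP", 4300, 15))   # ['warning', 'critical']
print(firing("CLS", 0.18, 15))   # []
```

A severe LCP regression reaches critical five minutes before the warning rule fires, while a 15-minute CLS spike never alerts.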

Diagnostic Steps

Before wiring an alert, confirm the query returns the same value the panel shows, evaluated as Grafana's alerting engine will see it (a single reduced number, not a series).

# The JSON body is passed via a quoted heredoc so the single quotes inside the SQL survive the shell.
curl -s -u "admin:$GRAFANA_PW" -X POST http://grafana.internal/api/v1/eval \
  -H 'Content-Type: application/json' \
  --data-binary @- <<'JSON'
{"expr":"SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY value) FROM web_vitals WHERE metric='LCP' AND ts > now() - interval '15 minutes'"}
JSON

Expected output is a single numeric reduction, e.g. 2310, matching the latest LCP panel value within rounding. If it returns a time series instead of one number, add a reduce (Last) expression to the alert query so the threshold compares against a scalar.
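To sanity-check the number by hand, the same reduction can be reproduced outside the database. The sketch below assumes the query uses PostgreSQL-style percentile_cont, which interpolates linearly between neighbouring samples; the sample window is hypothetical.

```python
import math
from typing import Sequence


def percentile_cont(values: Sequence[float], fraction: float) -> float:
    """Linear-interpolated percentile, mirroring SQL percentile_cont."""
    if not values:
        raise ValueError("no samples in the evaluation window")
    ordered = sorted(values)
    # Zero-based fractional rank across the sorted samples.
    rank = fraction * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


window = [1800, 2100, 2250, 2400, 2900]  # hypothetical LCP samples in ms
p75 = percentile_cont(window, 0.75)
print(p75, p75 > 2500)  # 2400.0 False
```

The result is one scalar per window, which is the shape the threshold condition needs.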

Implementation

Provision the rule as code. The block below is a Grafana unified-alerting rule (LCP warning) with a query stage, a reduce stage, and a threshold condition, plus the for debounce.

# /etc/grafana/provisioning/alerting/cwv-rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: core-web-vitals
    folder: Performance
    interval: 5m
    rules:
      - uid: lcp-budget-warning
        title: LCP P75 over budget
        condition: C
        for: 15m
        labels:
          severity: warning
          team: frontend
        annotations:
          summary: "LCP P75 is {{ $values.B }}ms, budget 2500ms"
        data:
          - refId: A
            datasourceUid: PerfTSDB
            model:
              format: time_series
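              # Compute P75 in SQL: the reduce expression has no percentile reducer.
              rawSql: "SELECT now() AS time, percentile_cont(0.75) WITHIN GROUP (ORDER BY value) AS value FROM web_vitals WHERE metric='LCP' AND ts > now() - interval '15 minutes'"
          - refId: B
            datasourceUid: __expr__
            # Collapse the one-point series to the scalar the threshold compares against.
            model: { type: reduce, expression: A, reducer: last }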
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator: { type: gt, params: [2500] }

Wire the contact point and a notification policy that groups by alert name (one rule per metric) and route, waits before the first send, and repeats sparingly so a single ongoing regression is one message, not a stream:

contactPoints:
  - orgId: 1
    name: frontend-perf
    receivers:
      - uid: slack-perf
        type: slack
        settings:
          recipient: "#perf-alerts"
          title: "{{ .CommonLabels.severity }}: budget regression"
policies:
  - orgId: 1
    receiver: frontend-perf
    group_by: ['alertname', 'route']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h

The for: 15m is the single most important fatigue control: it requires the P75 to stay over 2500 ms for fifteen minutes before the rule transitions from Pending to Firing, so transient single-bucket spikes never page anyone. repeat_interval: 4h and group_by collapse an ongoing breach into one alert that re-notifies only every four hours.
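The effect of the timers can be estimated with a small model. This is a simplification of Alertmanager-style grouping, assuming a single alert group whose contents never change while the breach lasts, and it ignores the resolved notification; notifications_sent is a hypothetical helper.

```python
def notifications_sent(breach_minutes: int,
                       group_wait_s: int = 30,
                       group_interval_s: int = 5 * 60,
                       repeat_interval_s: int = 4 * 60 * 60) -> int:
    """Count messages for one unchanged alert group firing for breach_minutes."""
    breach_s = breach_minutes * 60
    if breach_s < group_wait_s:
        return 0  # resolved before the first flush
    sent = 1                      # first message goes out after group_wait
    last_sent = group_wait_s
    tick = group_wait_s + group_interval_s
    while tick <= breach_s:
        # An unchanged group is re-sent only once repeat_interval has elapsed.
        if tick - last_sent >= repeat_interval_s:
            sent += 1
            last_sent = tick
        tick += group_interval_s
    return sent


print(notifications_sent(60))                          # 1
print(notifications_sent(8 * 60))                      # 2
print(notifications_sent(60, repeat_interval_s=300))   # 12
```

With the policy above, an eight-hour regression produces two messages; repeating every five minutes instead would send twelve in the first hour alone.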

CI Gating Assertion

Alerting watches the field; the gate still blocks the merge. Keep the alert threshold and the build assertion on the same value so an engineer is never paged for a breach the gate should have caught:

{
  "ci": {
    "assert": {
      "assertions": {
        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }]
      }
    }
  }
}
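One way to keep the two values from drifting is a small check that reads both configs and compares them. The sketch below works on already-parsed dictionaries shaped like the rule and the assertion file in this guide; the helper names and the "lcp-assertion" key are hypothetical, and loading the real files is left out.

```python
from typing import Any, Dict, List


def alert_threshold(rule: Dict[str, Any]) -> float:
    """Return the first threshold parameter from a rule's threshold stage."""
    for stage in rule["data"]:
        model = stage.get("model", {})
        if model.get("type") == "threshold":
            return model["conditions"][0]["evaluator"]["params"][0]
    raise ValueError("rule has no threshold stage")


def ci_thresholds(rc: Dict[str, Any]) -> List[float]:
    """Return every maxNumericValue found in the CI assertions."""
    assertions = rc["ci"]["assert"]["assertions"]
    return [opts["maxNumericValue"]
            for _level, opts in assertions.values()
            if "maxNumericValue" in opts]


def budgets_match(rule: Dict[str, Any], rc: Dict[str, Any]) -> bool:
    """True when the gate has at least one limit and all equal the alert's."""
    limits = ci_thresholds(rc)
    return bool(limits) and all(limit == alert_threshold(rule) for limit in limits)


# Hypothetical parsed configs; only the fields the check needs are shown.
rule = {"data": [{"refId": "C", "model": {
    "type": "threshold",
    "conditions": [{"evaluator": {"type": "gt", "params": [2500]}}]}}]}
rc = {"ci": {"assert": {"assertions": {
    "lcp-assertion": ["error", {"maxNumericValue": 2500}]}}}}

print(budgets_match(rule, rc))  # True
```

Running a check like this in CI would fail the build as soon as someone edits one budget without the other.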

Verification

Fire a test alert without waiting for a real regression. Insert a sustained over-budget sample series, then check the rule state:

psql "$PERF_DB_URL" -c "INSERT INTO web_vitals(ts, metric, route, session_id, value, source)
  SELECT now() - (g || ' minute')::interval, 'LCP', '/checkout', 'verify-'||g, 4300, 'test'
  FROM generate_series(0,16) g;"
curl -s -u admin:$GRAFANA_PW http://grafana.internal/api/alertmanager/grafana/api/v2/alerts | jq '.[].labels.alertname'

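The rule enters Pending on the first evaluation that sees the breach and moves to Firing once the breach has persisted for the full for: 15m, roughly three further 5-minute evaluations. The contact point then receives one grouped message reading LCP P75 is 4300ms, budget 2500ms. Backdated rows age out of the 15-minute query window before then, so re-run the insert every few minutes until the rule fires, and run the test where the 4300 ms rows dominate the window (a staging database, for example). Confirm the critical rule also fired (since 4300 > 4000), then delete the test rows (WHERE source='test') and verify the alert resolves on the following evaluation.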
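One thing to watch in this test is that the rule's query window slides forward while the inserted rows stay put. A quick sketch, assuming the 15-minute window from the rule's query and the 0 to 16 minute offsets from the insert, shows how many test rows each later evaluation still sees; rows_in_window is a hypothetical helper.

```python
WINDOW_MIN = 15  # query window from the rule's SQL (ts > now() - 15 minutes)


def rows_in_window(minutes_since_insert: int, offsets=range(0, 17)) -> int:
    """Count backdated test rows still inside the sliding query window."""
    # A row inserted with offset g is (g + minutes_since_insert) minutes old.
    return sum(1 for g in offsets if g + minutes_since_insert < WINDOW_MIN)


for elapsed in (0, 5, 10, 15):
    print(elapsed, rows_in_window(elapsed))
# 0 15
# 5 10
# 10 5
# 15 0
```

By the time the 15-minute sustain has elapsed, none of the original rows remain in the window, which is why fresh over-budget rows are needed to hold the breach.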

Frequently Asked Questions

How do I stop a noisy percentile from paging the team repeatedly?

Three controls together: a for duration (15 m here) so the breach must persist before firing, an evaluation window wide enough to smooth single-bucket noise, and a notification policy with group_by plus a long repeat_interval so an ongoing regression is one message every few hours, not a stream. Pair this with statistical detection from Automated Regression Detection.

Should the alert threshold match the CI gate threshold?

Yes. Use the same budget number for the field alert and the lab assertion. If they drift apart, engineers get paged for breaches the gate should have blocked, or the gate blocks merges that the field never alerts on — either way the team stops trusting both signals.
