Normalizing Performance Data Across Device Types

Without device-stratified normalization, a raw LCP of 3 800 ms on a mid-tier Android handset and a raw LCP of 1 100 ms on a desktop Chrome session feed the same scoring function and produce wildly different outputs — even when both devices are hitting identical server-side performance. Health scores diverge, alert thresholds fire on mobile while desktop stays green, and SREs spend sprint capacity chasing phantom regressions. This workflow is part of the broader Metric Scoring & Data Normalization system: it standardizes ingestion, applies statistical alignment per device class, orchestrates ETL scheduling, and stores versioned artifacts so every audit cycle is reproducible.

Prerequisites & Environment Setup

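All commands assume Python 3.11+ and a dedicated virtualenv. Pin every dependency in requirements.txt to avoid baseline drift caused by silent library upgrades. The interpreter itself cannot be installed through pip, so pin it with your environment tooling (for example a .python-version file) rather than in requirements.txt.

pydantic==2.7.1  # imported by classify.py; field_validator requires a 2.x release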
pandas==2.2.2
numpy==1.26.4
scipy==1.13.0
pyarrow==16.1.0
great-expectations==0.18.14
apache-airflow==2.9.1

Required environment variables — export these before running any pipeline step:

Variable	Type	Default	Purpose
`NORMALIZATION_DEVICE_CLASSES`	`str`	`mobile,tablet,desktop,low-tier`	Canonical device segments used throughout the pipeline
`NORMALIZATION_METRICS`	`str`	`LCP,CLS,INP,FCP`	Core Web Vitals to normalize
`NORMALIZATION_PERCENTILES`	`str`	`75,90`	Percentile targets for per-device baseline calculation
`BASELINE_SNAPSHOT_DIR`	`str`	`/data/baselines`	Absolute path for persisting versioned baseline JSON files
`ARTIFACT_PARQUET_DIR`	`str`	`/data/normalized`	Output directory for normalized Parquet partitions
`DRIFT_PSI_THRESHOLD`	`float`	`0.1`	Population Stability Index alert threshold
`PIPELINE_VERSION`	`str`	—	Semver tag injected at deploy time; stored in every artifact
`LIGHTHOUSE_SEMVER`	`str`	—	Auditor version used during collection; changes trigger baseline refresh

Dependency lockfile pattern — run once after activating the virtualenv:

set -euo pipefail
python -m pip install --upgrade pip
pip install -r /opt/audit-pipeline/requirements.txt
pip freeze > /opt/audit-pipeline/requirements.lock

Step 1 — Initialization: Device Classification and Schema Validation

Route all inbound telemetry through a schema validation layer before any transformation. Malformed records that reach the normalization stage introduce silent bias: a record with a null CLS is median-imputed downstream, which pulls the group distribution toward its center and shifts every percentile computed from it.

This initializer parses raw payloads from Lighthouse CI, CrUX API, WebPageTest, and RUM collectors, classifies each record into a canonical device segment, and rejects anything that fails schema validation with a structured error code. It writes clean, classified records to a staging table and logs all rejections for audit review.

#!/usr/bin/env python3
"""
/opt/audit-pipeline/normalize/classify.py
Classify and validate raw performance telemetry before normalization.
"""
import os
import json
import logging
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator

import pandas as pd
from pydantic import BaseModel, Field, ValidationError, field_validator

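# Defaults mirror the environment-variable table; strip() tolerates "mobile, tablet"
DEVICE_CLASSES = [c.strip() for c in os.environ.get(
    "NORMALIZATION_DEVICE_CLASSES", "mobile,tablet,desktop,low-tier").split(",")]
METRICS = [m.strip() for m in os.environ.get(
    "NORMALIZATION_METRICS", "LCP,CLS,INP,FCP").split(",")]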
STAGING_DIR = Path(os.environ.get("STAGING_DIR", "/data/staging"))
STAGING_DIR.mkdir(parents=True, exist_ok=True)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)


class RawMetricRecord(BaseModel):
    url: str
    device_type: str
    LCP: float = Field(ge=0)
    CLS: float = Field(ge=0)
    INP: float = Field(ge=0)
    FCP: float = Field(ge=0)
    viewport_width: int
    cpu_throttling: float = Field(ge=1.0)
    network_rtt_ms: float = Field(ge=0)
    collected_at: datetime
    source: str  # "lighthouse", "crux", "wpt", "rum"

    @field_validator("device_type")
    @classmethod
    def device_must_be_canonical(cls, v: str) -> str:
        if v not in DEVICE_CLASSES:
            raise ValueError(f"Unknown device class: {v!r}. Expected one of {DEVICE_CLASSES}")
        return v


def load_raw_payload(path: Path) -> Iterator[dict]:
    """Stream newline-delimited JSON records from the given file."""
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def classify_and_validate(input_path: Path, run_id: str) -> pd.DataFrame:
    valid_records = []
    rejection_log = []

    for raw in load_raw_payload(input_path):
        try:
            record = RawMetricRecord(**raw)
            valid_records.append(record.model_dump())
        except ValidationError as exc:
            rejection_log.append({"raw": raw, "errors": exc.errors(), "run_id": run_id})

    if rejection_log:
        reject_path = STAGING_DIR / f"rejections_{run_id}.jsonl"
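        with reject_path.open("w") as fh:
            for entry in rejection_log:
                # default=str: Pydantic v2 error dicts can carry exception
                # objects in "ctx", which json cannot serialize natively
                fh.write(json.dumps(entry, default=str) + "\n")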
        log.warning("Rejected %d records — see %s", len(rejection_log), reject_path)

    df = pd.DataFrame(valid_records)
    log.info("Classified %d valid records from %s", len(df), input_path)
    return df


if __name__ == "__main__":
    import sys
    input_path = Path(sys.argv[1])
    run_id = sys.argv[2]
    df = classify_and_validate(input_path, run_id)
    out = STAGING_DIR / f"classified_{run_id}.parquet"
    df.to_parquet(out, engine="pyarrow", index=False)
    log.info("Staged classified records to %s", out)

The cpu_throttling and network_rtt_ms fields are mandatory — without them, you cannot detect when a lab environment silently drops its throttle configuration, which is one of the most common causes of phantom performance improvements.
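As a rough illustration of the check these fields make possible, the sketch below flags lab records whose CPU throttle differs from what the device class is expected to use. The `EXPECTED_CPU_THROTTLE` values, the `LAB_SOURCES` set, and the `find_throttle_mismatches` helper are assumptions made for this example and are not part of the pipeline above; substitute the values from your own lab configuration.

```python
from typing import Iterable

# Hypothetical expected CPU slowdown multipliers per device class; replace
# these with the values your lab configuration actually uses.
EXPECTED_CPU_THROTTLE = {"mobile": 4.0, "low-tier": 6.0, "tablet": 2.0, "desktop": 1.0}
# Assumption: only lab sources are throttled; field data (crux, rum) is skipped.
LAB_SOURCES = {"lighthouse", "wpt"}


def find_throttle_mismatches(records: Iterable[dict], tolerance: float = 0.01) -> list[dict]:
    """Return lab records whose cpu_throttling differs from the expected value."""
    mismatches = []
    for rec in records:
        if rec["source"] not in LAB_SOURCES:
            continue
        expected = EXPECTED_CPU_THROTTLE.get(rec["device_type"])
        if expected is None:
            continue  # unknown class: schema validation handles this elsewhere
        if abs(rec["cpu_throttling"] - expected) > tolerance:
            mismatches.append(rec)
    return mismatches


sample = [
    {"source": "lighthouse", "device_type": "mobile", "cpu_throttling": 4.0},
    {"source": "lighthouse", "device_type": "mobile", "cpu_throttling": 1.0},  # throttle dropped
    {"source": "rum", "device_type": "mobile", "cpu_throttling": 1.0},  # field data, ignored
]
print(find_throttle_mismatches(sample))
```

A non-empty result on a lab run is a hint that an apparent improvement may come from the environment rather than the site.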

Step 2 — Core Configuration: Baseline Computation and Normalization Parameters

The normalization stage reads the classified staging data, computes device-stratified percentiles, and applies either Z-score or Min-Max scaling to map raw metric distributions onto a 0–100 index. This is where standardizing mobile vs desktop performance metrics becomes concrete: the same function handles both, but the baseline snapshots it references are device-specific.
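The baseline snapshot itself is not shown in this article, so the following is a minimal sketch of how per-device percentiles could be computed. The JSON layout (`device -> metric -> p75`) is inferred from the troubleshooting script further down; the `compute_baseline` name and the extra `mean`, `std`, and `n` fields are assumptions for illustration.

```python
import json

import pandas as pd


def compute_baseline(df: pd.DataFrame, metrics: list[str], percentiles: list[int]) -> dict:
    """Compute per-device percentile baselines as a nested, JSON-friendly dict."""
    baseline: dict = {}
    for device, grp in df.groupby("device_type"):
        baseline[device] = {}
        for m in metrics:
            stats = {f"p{p}": float(grp[m].quantile(p / 100)) for p in percentiles}
            stats["mean"] = float(grp[m].mean())
            stats["std"] = float(grp[m].std(ddof=0))  # population stddev
            stats["n"] = int(grp[m].count())
            baseline[device][m] = stats
    return baseline


df = pd.DataFrame({
    "device_type": ["mobile"] * 4 + ["desktop"] * 4,
    "LCP": [3000.0, 3400.0, 3800.0, 4200.0, 900.0, 1000.0, 1100.0, 1200.0],
})
baseline = compute_baseline(df, ["LCP"], [75, 90])
print(json.dumps(baseline, indent=2))
```

Writing the resulting dict to a file under BASELINE_SNAPSHOT_DIR, named with PIPELINE_VERSION and LIGHTHOUSE_SEMVER, would give each snapshot the immutable, versioned identity described in Step 4.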

Key normalization parameters — store these in normalization_config.yaml and inject via environment:

Parameter	Type	Default	Purpose
`method`	`str`	`zscore`	`zscore` or `minmax`; per-device override allowed
`min_samples_per_group`	`int`	`30`	Groups with fewer samples fall back to `minmax`
`baseline_window_days`	`int`	`90`	Rolling window for mean/stddev computation
`outlier_iqr_fence`	`float`	`3.0`	IQR multiplier; records beyond fence are winsorized before scaling
`score_inversion`	`bool`	`true`	Invert metrics where lower is better (LCP, INP, FCP, and CLS)
`cls_weight`	`float`	`0.15`	CLS contribution weight in composite index
`lcp_weight`	`float`	`0.40`	LCP contribution weight
`inp_weight`	`float`	`0.30`	INP contribution weight
`fcp_weight`	`float`	`0.15`	FCP contribution weight
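The composite index only stays within 0 to 100 if the four weights sum to 1.0. A small guard like the one below, run at startup, is one way to catch a mistyped weight early; `validate_weights` is a hypothetical helper and not part of the scripts in this article.

```python
import math


def validate_weights(weights: dict[str, float]) -> None:
    """Raise ValueError if any weight is negative or the weights do not sum to 1.0."""
    negatives = {k: w for k, w in weights.items() if w < 0}
    if negatives:
        raise ValueError(f"Negative weights: {negatives}")
    total = sum(weights.values())
    # isclose absorbs float rounding such as 0.40 + 0.15 + 0.30 + 0.15
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights sum to {total:.6f}, expected 1.0")


# The defaults from the table above pass silently.
validate_weights({"LCP": 0.40, "CLS": 0.15, "INP": 0.30, "FCP": 0.15})
```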

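Production normalization function. The script reads its parameters from environment variables (LCP_WEIGHT, CLS_WEIGHT, INP_WEIGHT, FCP_WEIGHT, MIN_SAMPLES_PER_GROUP, OUTLIER_IQR_FENCE), so export the values from normalization_config.yaml before it runs; it does not parse NORMALIZATION_CONFIG_PATH itself, and it does not consult method, baseline_window_days, or score_inversion: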

#!/usr/bin/env python3
"""
/opt/audit-pipeline/normalize/zscore_norm.py
Device-stratified Z-score normalization for Core Web Vitals.
"""
import os
import json
import logging
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from scipy.stats import zscore

log = logging.getLogger(__name__)
METRICS = os.environ["NORMALIZATION_METRICS"].split(",")
BASELINE_DIR = Path(os.environ["BASELINE_SNAPSHOT_DIR"])
ARTIFACT_DIR = Path(os.environ["ARTIFACT_PARQUET_DIR"])
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

WEIGHTS = {
    "LCP": float(os.environ.get("LCP_WEIGHT", "0.40")),
    "CLS": float(os.environ.get("CLS_WEIGHT", "0.15")),
    "INP": float(os.environ.get("INP_WEIGHT", "0.30")),
    "FCP": float(os.environ.get("FCP_WEIGHT", "0.15")),
}
MIN_SAMPLES = int(os.environ.get("MIN_SAMPLES_PER_GROUP", "30"))
IQR_FENCE = float(os.environ.get("OUTLIER_IQR_FENCE", "3.0"))
PIPELINE_VERSION = os.environ["PIPELINE_VERSION"]


def winsorize_group(series: pd.Series, fence: float) -> pd.Series:
    """Cap values below Q1 - fence*IQR or above Q3 + fence*IQR."""
    q25, q75 = series.quantile(0.25), series.quantile(0.75)
    iqr = q75 - q25
    lower, upper = q25 - fence * iqr, q75 + fence * iqr
    return series.clip(lower, upper)


def normalize_device_group(group: pd.DataFrame, device: str) -> pd.DataFrame:
    group = group.copy()
    scaled_cols = {}

    for metric in METRICS:
        raw = group[metric].fillna(group[metric].median())
        raw = winsorize_group(raw, IQR_FENCE)

        if len(raw) >= MIN_SAMPLES:
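            # Z-score, then rescale to 0-100 using the group's own min and max.
            # Note: this rescaling is an affine map of the raw values, so it
            # yields the same index as Min-Max on the raw data; the z-score step
            # only changes the result if fixed bounds or baseline stats are used.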
            z = zscore(raw, nan_policy="omit")
            z = pd.Series(z, index=raw.index)
            z_min, z_max = z.min(), z.max()
            if z_max > z_min:
                scaled = (z - z_min) / (z_max - z_min) * 100
            else:
                scaled = pd.Series(50.0, index=raw.index)
        else:
            # Fallback: Min-Max when sample count too low for Z-score
            log.warning("Device %s metric %s: %d samples < %d; falling back to Min-Max",
                        device, metric, len(raw), MIN_SAMPLES)
            r_min, r_max = raw.min(), raw.max()
            scaled = (raw - r_min) / (r_max - r_min) * 100 if r_max > r_min else pd.Series(50.0, index=raw.index)

        # Invert: every metric here is lower-is-better (CLS included),
        # so a lower raw value must map to a higher score
        if metric in ("LCP", "INP", "FCP", "CLS"):
            scaled = 100.0 - scaled

        scaled_cols[f"{metric}_norm"] = scaled.round(2)

    group = group.assign(**scaled_cols)

    # Weighted composite index
    group["health_index"] = sum(
        group[f"{m}_norm"] * w for m, w in WEIGHTS.items()
    ).round(2)

    return group


def normalize_all(staged_parquet: Path, run_id: str) -> Path:
    df = pd.read_parquet(staged_parquet, engine="pyarrow")
    frames = []

    for device, group in df.groupby("device_type"):
        frames.append(normalize_device_group(group, device))

    result = pd.concat(frames, ignore_index=True)
    result["pipeline_version"] = PIPELINE_VERSION
    result["run_id"] = run_id

    out_path = ARTIFACT_DIR / f"normalized_{run_id}.parquet"
    pq.write_table(pa.Table.from_pandas(result), str(out_path))
    log.info("Normalized %d records → %s", len(result), out_path)
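    return out_path


if __name__ == "__main__":
    # CLI entry point used by the DAG and the flock wrapper:
    #   python zscore_norm.py <staged_parquet> <run_id>
    import sys
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    normalize_all(Path(sys.argv[1]), sys.argv[2])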

[Pipeline diagram: data flow from raw collection through device classification, per-group normalization, and composite scoring to final storage.]

Step 3 — Execution & Scheduling: Airflow DAG with Concurrency Guard

Configure the DAG to trigger immediately after the raw ingestion job completes. Use ExternalTaskSensor to avoid a fixed time offset that drifts when upstream ingestion slows down. Set max_active_runs=1 to prevent two normalization runs from writing to the same output partition simultaneously — a common cause of corrupted baselines.

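# /opt/audit-pipeline/dags/normalize_performance_data.yaml
# Declarative DAG spec; requires airflow>=2.9. Airflow's DagBag only imports
# Python files, so a loader (a custom one or a tool such as dag-factory) must
# translate this YAML into DAG objects.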
dag:
  dag_id: normalize_performance_data
  schedule_interval: "@daily"
  start_date: "2025-01-01T00:00:00Z"
  catchup: false
  max_active_runs: 1
  default_timezone: UTC
  default_args:
    owner: sre_team
    retries: 2
    retry_delay_seconds: 300
    email_on_failure: true
    email: ["[email protected]"]

tasks:
  - id: wait_for_raw_data
    operator: ExternalTaskSensor
    external_dag_id: ingest_telemetry
    external_task_id: upload_to_staging
    timeout: 3600
    mode: reschedule

  - id: classify_and_validate
    operator: BashOperator
    depends_on: [wait_for_raw_data]
    bash_command: >
      set -euo pipefail &&
      python /opt/audit-pipeline/normalize/classify.py
        /data/raw/{{ ds_nodash }}.jsonl
        {{ run_id }}

  - id: run_zscore_normalization
    operator: BashOperator
    depends_on: [classify_and_validate]
    bash_command: >
      set -euo pipefail &&
      python /opt/audit-pipeline/normalize/zscore_norm.py
        /data/staging/classified_{{ run_id }}.parquet
        {{ run_id }}

  - id: check_psi_drift
    operator: BranchPythonOperator
    depends_on: [run_zscore_normalization]
    python_callable: evaluate_psi_drift
    op_kwargs:
      artifact_path: "/data/normalized/normalized_{{ run_id }}.parquet"
      baseline_dir: "/data/baselines"
      psi_threshold: "{{ var.value.DRIFT_PSI_THRESHOLD }}"

  - id: promote_artifact
    operator: BashOperator
    depends_on: [check_psi_drift]
    bash_command: >
      set -euo pipefail &&
      bash /opt/audit-pipeline/normalize/promote.sh
        {{ run_id }} {{ var.value.PIPELINE_VERSION }}

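  - id: alert_on_drift
    operator: SlackAPIPostOperator
    depends_on: [check_psi_drift]
    # Selected by the branch task when PSI exceeds the threshold. The branch
    # task itself succeeds, so trigger_rule one_failed would never fire here;
    # the default rule is used instead.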
    slack_conn_id: perf_alerts_slack
    channel: "#perf-alerts"
    text: "Normalization drift alert for run {{ run_id }}. PSI exceeded threshold. Artifact quarantined."

The flock-equivalent here is max_active_runs: 1 — Airflow holds subsequent runs in a queued state rather than spawning a second DAG run while the first is active. For teams not running Airflow, wrap the Python scripts in a shell script with flock:

#!/usr/bin/env bash
set -euo pipefail

LOCK_FILE="/var/lock/normalize_performance.lock"
RUN_ID="${1:?RUN_ID required}"

exec 9>"${LOCK_FILE}"
if ! flock -n 9; then
  echo "ERROR: Normalization already running. Exiting." >&2
  exit 1
fi

python /opt/audit-pipeline/normalize/classify.py /data/raw/"${RUN_ID}".jsonl "${RUN_ID}"
python /opt/audit-pipeline/normalize/zscore_norm.py /data/staging/classified_"${RUN_ID}".parquet "${RUN_ID}"

Step 4 — Artifact Capture & Storage: Versioned Parquet Partitions

Store normalized outputs in partitioned Parquet so downstream health-score pipelines (see Designing Custom Health Score Algorithms) and dashboards can query by date and device without scanning full tables.

Partition scheme: s3://audit-artifacts/normalized/date={YYYY-MM-DD}/device_type={class}/run_id={id}/data.parquet
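To make the scheme concrete, here is a minimal sketch of a key builder that follows it. The `partition_key` function and the `ARTIFACT_ROOT` constant are illustrative names, not part of the promotion script.

```python
from datetime import date

ARTIFACT_ROOT = "s3://audit-artifacts/normalized"


def partition_key(run_date: date, device_type: str, run_id: str,
                  root: str = ARTIFACT_ROOT) -> str:
    """Build the Hive-style object key for one device partition of one run."""
    return (
        f"{root}/date={run_date.isoformat()}"
        f"/device_type={device_type}/run_id={run_id}/data.parquet"
    )


print(partition_key(date(2025, 1, 15), "mobile", "run_001"))
# s3://audit-artifacts/normalized/date=2025-01-15/device_type=mobile/run_id=run_001/data.parquet
```

Because the key is built from the run's own date rather than the wall clock, a rerun of an old run lands in the same partition it originally targeted.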

Retention policy:

Raw staging files: 7 days (delete after promotion succeeds)
Normalized Parquet partitions: 90 days rolling
Baseline snapshots: indefinite (immutable, versioned by PIPELINE_VERSION + LIGHTHOUSE_SEMVER)

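Promotion script. It uploads the run's combined Parquet file under its date and run_id prefix and appends the run metadata to the manifest. It does not split the file by device_type, so the device_type level of the partition scheme requires a partitioned write before upload: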

#!/usr/bin/env bash
# /opt/audit-pipeline/normalize/promote.sh
set -euo pipefail

RUN_ID="${1:?RUN_ID required}"
PIPELINE_VERSION="${2:?PIPELINE_VERSION required}"
DATE_PREFIX="$(date -u +%Y-%m-%d)"
SOURCE="/data/normalized/normalized_${RUN_ID}.parquet"
DEST="s3://audit-artifacts/normalized/date=${DATE_PREFIX}/run_id=${RUN_ID}/data.parquet"

aws s3 cp "${SOURCE}" "${DEST}" \
  --metadata "pipeline_version=${PIPELINE_VERSION},run_id=${RUN_ID}"

# Record manifest entry
python - <<PYEOF
import json, os
from pathlib import Path
manifest = Path("/data/baselines/manifest.jsonl")
entry = {
    "run_id": "${RUN_ID}",
    "pipeline_version": "${PIPELINE_VERSION}",
    "date": "${DATE_PREFIX}",
    "s3_path": "${DEST}",
    "lighthouse_semver": os.environ.get("LIGHTHOUSE_SEMVER", "unknown")
}
with manifest.open("a") as fh:
    fh.write(json.dumps(entry) + "\n")
print(f"Manifest updated: {entry}")
PYEOF

Verification Checklist

Confirm the classified staging file exists and that no device group is empty: parquet-tools show /data/staging/classified_<RUN_ID>.parquet | grep device_type | sort | uniq -c
Validate that health_index values fall in [0, 100] for every row: python -c "import pandas as pd; df=pd.read_parquet('/data/normalized/normalized_<RUN_ID>.parquet'); assert df['health_index'].between(0,100).all(), 'Out-of-range health_index found'"
Check that all four device classes have records in the normalized output (absent device type means the classifier dropped an entire segment): python -c "import pandas as pd; print(pd.read_parquet('/data/normalized/normalized_<RUN_ID>.parquet')['device_type'].value_counts())"
Verify the S3 artifact was written: aws s3 ls s3://audit-artifacts/normalized/date=$(date -u +%Y-%m-%d)/
Confirm the manifest entry was appended: tail -1 /data/baselines/manifest.jsonl | python -m json.tool
Spot-check score inversion: for a URL where desktop LCP improved between yesterday and today, health_index for desktop should be higher today, not lower.
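Several of the checks above can be bundled into one function so they run the same way every cycle. The sketch below is one possible shape; `verify_normalized` and `EXPECTED_DEVICES` are illustrative names, and the device set assumes the default NORMALIZATION_DEVICE_CLASSES.

```python
import pandas as pd

# Assumes the default device classes from the environment-variable table.
EXPECTED_DEVICES = frozenset({"mobile", "tablet", "desktop", "low-tier"})


def verify_normalized(df: pd.DataFrame,
                      expected_devices: frozenset = EXPECTED_DEVICES) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df["health_index"].isna().any():
        problems.append("health_index contains NaN")
    if not df["health_index"].dropna().between(0, 100).all():
        problems.append("health_index outside [0, 100]")
    missing = set(expected_devices) - set(df["device_type"].unique())
    if missing:
        problems.append(f"missing device classes: {sorted(missing)}")
    return problems


demo = pd.DataFrame({
    "device_type": ["mobile", "desktop", "tablet"],
    "health_index": [72.5, 101.0, 40.0],
})
print(verify_normalized(demo))
```

Loading the artifact with `pd.read_parquet` and failing the task when the returned list is non-empty would turn the checklist into an automated gate.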

Troubleshooting

Device group missing from normalized output — all records for mobile are absent

Root cause: the raw telemetry source changed its device_type field name or casing (e.g. "Mobile" instead of "mobile"), causing 100% rejection during classification.

# Inspect rejection log for the affected run
python -c "
import json
from pathlib import Path
log_path = Path('/data/staging/rejections_<RUN_ID>.jsonl')
errors = [json.loads(l) for l in log_path.read_text().splitlines() if l]
for e in errors[:5]:
    print(e['errors'])
"

Fix: add a .lower() normalization step in classify.py before the Pydantic validator runs, then re-run the classification step from the failed run ID.
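One way to apply that fix is to canonicalize the raw dict before it reaches the model. The sketch below is an assumption about how this could look: `canonicalize_device_type` and the `DEVICE_ALIASES` map are hypothetical and should be adapted to the labels your sources actually emit.

```python
# Hypothetical alias map for labels that differ by more than casing.
DEVICE_ALIASES = {"phone": "mobile", "smartphone": "mobile"}


def canonicalize_device_type(raw: dict) -> dict:
    """Return a copy of raw with device_type trimmed, lower-cased, and de-aliased."""
    value = raw.get("device_type")
    if not isinstance(value, str):
        return raw  # leave it for schema validation to reject
    cleaned = value.strip().lower()
    return {**raw, "device_type": DEVICE_ALIASES.get(cleaned, cleaned)}


print(canonicalize_device_type({"url": "https://example.com", "device_type": " Mobile "}))
```

In classify_and_validate the call would wrap the payload, for example `RawMetricRecord(**canonicalize_device_type(raw))`, so the existing validator still rejects genuinely unknown classes.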

PSI threshold exceeded — normalization run quarantined

Root cause: a Lighthouse version bump changed throttling defaults, shifting the raw distribution far enough from the stored baseline that PSI > 0.1.

# Compare current p75 values against stored baseline
python - <<'EOF'
import pandas as pd, json
from pathlib import Path

df = pd.read_parquet('/data/normalized/normalized_<RUN_ID>.parquet')
baseline = json.loads(Path('/data/baselines/baseline_latest.json').read_text())

for device in df['device_type'].unique():
    grp = df[df['device_type'] == device]
    for m in ['LCP', 'CLS', 'INP', 'FCP']:
        current_p75 = grp[m].quantile(0.75)
        baseline_p75 = baseline.get(device, {}).get(m, {}).get('p75', None)
        if baseline_p75:
            pct_delta = (current_p75 - baseline_p75) / baseline_p75 * 100
            print(f"{device}/{m}: baseline={baseline_p75:.0f} current={current_p75:.0f} delta={pct_delta:+.1f}%")
EOF

Fix: If the shift is caused by a known Lighthouse upgrade, generate a new baseline using the two-week parallel collection window documented in the FAQ, then update LIGHTHOUSE_SEMVER and re-promote.

Composite health_index is NaN for several rows

Root cause: a metric column (often INP) contains NaN that propagated through the weighted sum because fillna was not applied consistently.

python -c "
import pandas as pd
df = pd.read_parquet('/data/normalized/normalized_<RUN_ID>.parquet')
mask = df['health_index'].isna()
print(df[mask][['url','device_type','LCP','CLS','INP','FCP']].head(10))
"

Fix: in zscore_norm.py, verify that fillna(group[metric].median()) is called before winsorize_group for every metric in METRICS. Add a post-normalization assertion: assert not result['health_index'].isna().any().

Parquet write fails with ArrowInvalid: Schema mismatch

Root cause: a schema-compatible but type-mismatched field (e.g. viewport_width arriving as float64 in one batch and int64 in another) causes PyArrow to reject the concat.

python -c "
import pyarrow.parquet as pq
schema = pq.read_schema('/data/staging/classified_<RUN_ID>.parquet')
print(schema)
"

Fix: add explicit dtype casting in classify_and_validate after building the DataFrame: df['viewport_width'] = df['viewport_width'].astype('int64'). Pin all column dtypes to match the schema declared in the Pydantic model (an int field corresponds to int64).

Artifact promotion to S3 fails intermittently

Root cause: the AWS session token expired mid-pipeline run (common with 1-hour STS tokens in CI environments).

aws sts get-caller-identity  # verify credentials are valid before promotion

Fix: configure the Airflow connection to use IAM role assumption with automatic token refresh, or store credentials as short-lived OIDC tokens injected at task execution time rather than at DAG parse time.

All device groups produce health_index of exactly 50

Root cause: Z-score normalization collapses to constant 50 when all values in the group are identical — this happens in test environments where the telemetry fixture repeats the same record.

python -c "
import pandas as pd
df = pd.read_parquet('/data/staging/classified_<RUN_ID>.parquet')
for device, grp in df.groupby('device_type'):
    for m in ['LCP','CLS','INP','FCP']:
        print(f'{device}/{m}: stddev={grp[m].std():.4f}')
"

Fix: verify that the ingestion fixture contains genuine variance. If running in a staging environment, seed it with synthetic data generated by scipy.stats.norm.rvs with realistic mean/stddev values pulled from CrUX percentile tables.
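A minimal sketch of such a fixture generator follows. It uses NumPy's `Generator.normal` in place of `scipy.stats.norm.rvs` (both draw from the same normal distribution), and the means and standard deviations in `FIXTURE_PARAMS` are placeholders rather than real CrUX figures.

```python
import numpy as np

# Placeholder (mean, stddev) pairs in milliseconds; replace with values
# derived from real CrUX percentile tables.
FIXTURE_PARAMS = {
    ("mobile", "LCP"): (3800.0, 600.0),
    ("desktop", "LCP"): (1100.0, 200.0),
}


def synth_metric(mean: float, std: float, n: int, seed: int) -> np.ndarray:
    """Draw n normally distributed samples, clipped at zero to satisfy the ge=0 schema rule."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=mean, scale=std, size=n)
    return np.clip(samples, 0.0, None)


lcp_mobile = synth_metric(*FIXTURE_PARAMS[("mobile", "LCP")], n=200, seed=42)
print(f"n={lcp_mobile.size} stddev>0: {lcp_mobile.std() > 0}")
```

Fixing the seed keeps the fixture reproducible across test runs while still giving each group non-zero variance.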

FAQ

Why do raw LCP values differ so widely between mobile and desktop even on the same page?

Lighthouse's mobile profile applies CPU throttling (4x slowdown by default) and a throttled network profile (labeled Slow 4G in current releases, Fast 3G in older ones), and the smaller viewport changes which image is the LCP candidate. These three variables compound, producing raw values that are incomparable without stratified normalization. This is also why you cannot apply a single global scaling factor: the relationship between mobile and desktop LCP varies by site architecture and image serving strategy.

When should I use Z-score normalization versus Min-Max scaling?

Z-score works best when the device group has enough samples (n ≥ 30, the min_samples_per_group default) and a roughly normal distribution, because it preserves relative outlier distance. Use Min-Max when you need a hard 0–100 bound and the distribution is bounded or uniform. For sparse device segments (e.g. tablet), Min-Max avoids the inflated z-scores that appear when stddev is tiny. The pipeline above applies this decision automatically via min_samples_per_group.
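One property is worth checking when choosing between the two: if z-scores are rescaled to 0–100 using their own minimum and maximum, the result is identical to Min-Max on the raw values, because both steps are affine maps. The sketch below demonstrates this with NumPy and then shows one possible alternative that scores against stored baseline statistics. `baseline_z_index` and its ±3 clipping bound are assumptions for illustration, not the pipeline's method.

```python
import numpy as np


def minmax_0_100(values) -> np.ndarray:
    """Scale values linearly so the minimum maps to 0 and the maximum to 100."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.full_like(values, 50.0)
    return (values - lo) / (hi - lo) * 100.0


def zscore(values) -> np.ndarray:
    """Population z-score (ddof=0), matching the scipy.stats.zscore default."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def baseline_z_index(values, baseline_mean: float, baseline_std: float,
                     clip: float = 3.0) -> np.ndarray:
    """Map z-scores against a fixed baseline onto 0-100, clipping at +/-clip sigma."""
    z = (np.asarray(values, dtype=float) - baseline_mean) / baseline_std
    return (np.clip(z, -clip, clip) + clip) / (2 * clip) * 100.0


raw = np.array([900.0, 1100.0, 1500.0, 2400.0, 5200.0])
via_z = minmax_0_100(zscore(raw))
via_raw = minmax_0_100(raw)
print(np.allclose(via_z, via_raw))
```

With a fixed baseline mean and standard deviation, the same raw value receives the same index from one run to the next, which a per-run minimum and maximum cannot guarantee.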

How do I prevent normalization drift after a Lighthouse version upgrade?

Version-tag every baseline snapshot against the Lighthouse semver. When the auditor version changes, run a parallel baseline collection for two weeks before switching. Treat PSI below 0.1 as stable, investigate anything above it (the pipeline's default DRIFT_PSI_THRESHOLD of 0.1 quarantines those runs), and freeze the new baseline if PSI exceeds 0.2. The manifest file written by the promotion script records lighthouse_semver in every artifact entry, enabling you to query which version produced any given baseline.
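The DAG references an evaluate_psi_drift callable that is not shown in this article. As a rough sketch of the underlying calculation only, the function below computes PSI between a baseline sample and a current sample of one metric. The name `population_stability_index`, the ten quantile bins, and the `eps` floor are assumptions chosen for this example.

```python
import numpy as np


def population_stability_index(baseline, current, bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI = sum((cur% - base%) * ln(cur% / base%)) over baseline-quantile bins."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    # Bin edges come from baseline quantiles, so each bin holds roughly equal
    # baseline mass; duplicates are dropped for heavily tied data.
    edges = np.unique(np.quantile(baseline, np.linspace(0, 1, bins + 1)))
    inner = edges[1:-1]  # outer bins are open-ended so shifted values still count
    n_bins = len(inner) + 1
    b_counts = np.bincount(np.searchsorted(inner, baseline, side="right"), minlength=n_bins)
    c_counts = np.bincount(np.searchsorted(inner, current, side="right"), minlength=n_bins)
    # Floor the fractions so empty bins do not produce log(0) or division by zero.
    b_frac = np.clip(b_counts / b_counts.sum(), eps, None)
    c_frac = np.clip(c_counts / c_counts.sum(), eps, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))
```

A branch callable could compute this per device and metric and route to the alert task whenever any value exceeds DRIFT_PSI_THRESHOLD.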

Can I mix CrUX field data with Lighthouse synthetic data in the same normalization pipeline?

Yes, but apply calibration weights before aggregation. CrUX p75 values typically run 20–40% higher than Lighthouse synthetic medians because they capture real network variance. Train a per-device regression model against 90 days of overlapping CrUX and synthetic runs, then apply the resulting multipliers before feeding both sources into the shared scoring index. Tag records by source (as the Pydantic model requires) so the calibration step can apply the correct multiplier per source type.
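The simplest form of that regression is a single multiplier per device, fitted by least squares through the origin on paired observations. The sketch below assumes you already have paired arrays (for example, per-URL synthetic medians and CrUX p75 values for the same period); `fit_multiplier` is a hypothetical helper and the numbers are made up for the demonstration.

```python
import numpy as np


def fit_multiplier(synthetic, field) -> float:
    """Least-squares slope through the origin: field ~= k * synthetic."""
    x = np.asarray(synthetic, dtype=float)
    y = np.asarray(field, dtype=float)
    return float(np.dot(x, y) / np.dot(x, x))


# Made-up paired LCP values in milliseconds for one device class.
synthetic_lcp = np.array([1000.0, 2000.0, 3000.0])
field_lcp = np.array([1300.0, 2600.0, 3900.0])

k = fit_multiplier(synthetic_lcp, field_lcp)
calibrated = synthetic_lcp * k  # bring synthetic values onto the field scale
print(round(k, 3))
```

Fitting one multiplier per device class and metric keeps the calibration easy to audit; a richer model is only worth the complexity if the residuals show a clear pattern.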

Related

Metric Scoring & Data Normalization: parent workflow covering the full scoring and normalization pipeline
Standardizing Mobile vs Desktop Performance Metrics: viewport-specific variance compensation in depth
Designing Custom Health Score Algorithms: how normalized device outputs feed weighted scoring matrices
Calibrating Error Thresholds for Different Site Sections: applying section-specific tolerance bands after normalization
Tracking Metric Trends Across Release Cycles: using versioned normalization artifacts to detect regression across deploys
