Integrating Custom Crawlers with CI/CD Pipelines

Without a deterministic execution environment, crawler results drift between runs: different Chromium builds produce different LCP numbers, unguarded concurrency generates duplicate audit records, and ad-hoc manual runs are never reproducible. The impact falls on SREs who can't baseline regressions, SEO engineers whose dashboards show phantom fluctuations, and agency teams who can't defend audit results to clients. This page is part of the Automated Crawling & Pipeline Tooling reference, which covers the full pipeline from runner configuration through artifact versioning.

Prerequisites & Environment Setup

Every pipeline stage depends on a locked, reproducible environment. Floating tags and loose version ranges are a common cause of audit drift.

Dependency	Minimum version	Pin mechanism
Node.js	22.16.0	`.nvmrc` + `engines` field in `package.json`
Playwright	1.49.0	`package-lock.json` (`npm ci`)
Chromium	bundled with Playwright	pinned by the Playwright version; `PLAYWRIGHT_BROWSERS_PATH` only fixes the install location
Docker base image	`node:22.16.0-alpine`	full digest pin in Dockerfile `FROM`
Ubuntu runner	`ubuntu-24.04`	explicit label, not `ubuntu-latest`

Required environment variables:

# .env.ci — committed without secrets; values injected by CI secret manager
CRAWL_SEED_URL=         # e.g. https://staging.example.com
CRAWL_CONCURRENCY=4
CRAWL_RATE_PER_SEC=2
AUDIT_OUTPUT_DIR=/tmp/crawl-output
PLAYWRIGHT_BROWSERS_PATH=/home/node/.cache/ms-playwright
AUDIT_TOKEN=            # injected from CI secrets — never hardcode
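
A crawler that reads these variables can fail fast when one is missing or malformed, rather than starting a crawl with a silent default. The sketch below is one possible loader; the file name `src/config.js` and the function `loadCrawlConfig` are hypothetical, and the fallback values simply mirror the `.env.ci` listing above.

```javascript
// src/config.js (hypothetical helper, not part of the original pipeline)
'use strict';

// Parse a positive number from an environment value, or return the fallback
// when the variable is unset or empty.
function positiveNumber(name, raw, fallback, integerOnly) {
  if (raw === undefined || raw === '') return fallback;
  const n = Number(raw);
  const valid = Number.isFinite(n) && n > 0 && (!integerOnly || Number.isInteger(n));
  if (!valid) {
    const kind = integerOnly ? 'integer' : 'number';
    throw new Error(`${name} must be a positive ${kind}, got "${raw}"`);
  }
  return n;
}

function loadCrawlConfig(env = process.env) {
  const missing = ['CRAWL_SEED_URL', 'AUDIT_TOKEN'].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }

  return {
    // new URL() throws on malformed input, so a bad seed URL fails immediately.
    seedUrl: new URL(env.CRAWL_SEED_URL).href,
    concurrency: positiveNumber('CRAWL_CONCURRENCY', env.CRAWL_CONCURRENCY, 4, true),
    ratePerSec: positiveNumber('CRAWL_RATE_PER_SEC', env.CRAWL_RATE_PER_SEC, 2, false),
    outputDir: env.AUDIT_OUTPUT_DIR || '/tmp/crawl-output',
    auditToken: env.AUDIT_TOKEN,
  };
}

module.exports = { loadCrawlConfig };
```

Calling `loadCrawlConfig()` once at the top of the crawl entry point keeps every later module free of direct `process.env` reads.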

Step 1 — Initialization: Runner Image & Dependency Cache

A multi-stage Dockerfile isolates build artifacts from the runtime layer and keeps the image small enough to pull quickly on cold runners.

# Dockerfile.crawler  (pinned digests truncated for readability)
FROM node:22.16.0-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --ignore-scripts

FROM node:22.16.0-alpine
# System deps for headless Chromium
RUN apk add --no-cache \
    chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-freefont
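# Note: Playwright does not read PUPPETEER_EXECUTABLE_PATH. To use this system
# Chromium, pass the path to chromium.launch({ executablePath }) in the crawler.
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser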
ENV PLAYWRIGHT_BROWSERS_PATH=/home/node/.cache/ms-playwright
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY src/ ./src/
# Never run Chrome as root inside a container
USER node
ENTRYPOINT ["node", "src/crawl.js"]

Wire the image into GitHub Actions with explicit resource constraints and a cache key derived from the lockfile SHA — so the cache busts automatically when any dependency changes:

# .github/workflows/crawl-audit.yml
name: Technical Audit Pipeline
on:
  workflow_dispatch:
  schedule:
    - cron: "0 02 * * 1"   # every Monday 02:00 UTC
  push:
    branches: [main, staging]

jobs:
  audit:
    runs-on: ubuntu-24.04
    container:
      image: ghcr.io/your-org/crawler-runner:1.2.0
      options: --shm-size=2g
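    env:
      CRAWL_SEED_URL: ${{ vars.TARGET_DOMAIN }}
      AUDIT_TOKEN:    ${{ secrets.AUDIT_TOKEN }}
      # The env context only sees variables declared in the workflow,
      # not ENV lines baked into the container image.
      AUDIT_OUTPUT_DIR: /tmp/crawl-output
      PLAYWRIGHT_BROWSERS_PATH: /home/node/.cache/ms-playwright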
    steps:
      - uses: actions/checkout@v4

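      - name: Restore Playwright browser cache
        id: playwright-cache
        uses: actions/cache@v4
        with:
          path: ${{ env.PLAYWRIGHT_BROWSERS_PATH }}
          key: ${{ runner.os }}-playwright-${{ hashFiles('package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-playwright-

      - name: Install Playwright browsers (if cache miss)
        # cache-hit is 'true' only on an exact key match, so a partial restore still reinstalls
        if: steps.playwright-cache.outputs.cache-hit != 'true'
        run: npx playwright install chromium --with-deps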

      - name: Execute crawl
        run: node src/crawl.js --output "$AUDIT_OUTPUT_DIR"

      - name: Upload crawl artifacts
        uses: actions/upload-artifact@v4
        with:
          name: crawl-results-${{ github.run_id }}
          path: ${{ env.AUDIT_OUTPUT_DIR }}/
          retention-days: 30
        if: always()   # upload even on failure for post-mortem

Step 2 — Core Configuration: Browser Parameters & Key Parameters Table

For configuring headless browsers on JS-heavy sites, the launch arguments below are the production-safe baseline. Running without --disable-dev-shm-usage causes silent OOM crashes on runners with small /dev/shm.

Parameter	Type	Default	Purpose
`--no-sandbox`	flag	off	Required in containerised environments without a user namespace
`--disable-setuid-sandbox`	flag	off	Companion to `--no-sandbox` for Alpine-based images
`--disable-dev-shm-usage`	flag	off	Prevents `/dev/shm` exhaustion on restricted runners
`--disable-gpu`	flag	off	Eliminates GPU driver errors in headless mode
`--js-flags=--max-old-space-size=4096`	string	V8 default	Raises V8 heap cap to avoid OOM on large SPAs
`waitUntil`	string	`load`	Set to `networkidle` for SPAs; `domcontentloaded` for static sites
`viewport`	object	`1280x720`	Standardise across runs (the example below pins `1280x800`); emulate mobile separately with `deviceScaleFactor`

// src/browser-init.js
'use strict';
const { chromium } = require('playwright');

async function launchCrawler() {
  const browser = await chromium.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-gpu',
      '--js-flags=--max-old-space-size=4096',
    ],
  });

  const context = await browser.newContext({
    viewport: { width: 1280, height: 800 },
    userAgent: 'SiteHealthAuditBot/1.0 (+https://site-health-audit.com/bot)',
    ignoreHTTPSErrors: false,
  });

  return { browser, context };
}

module.exports = { launchCrawler };

Route interceptors block analytics and ad-network scripts before they fire. This keeps payload sizes stable across runs — critical for reliable INP and LCP baselines:

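// src/interceptor.js
'use strict';

// Substrings matched against the full request URL.
const BLOCKED_PATTERNS = [
  'analytics', 'gtm', 'doubleclick', 'facebook',
  'hotjar', 'intercom', 'adsbygoogle',
];

async function setupInterceptors(page) {
  await page.route('**/*', async (route) => {
    const request = route.request();
    // Never abort the top-level navigation itself: a first-party URL such as
    // /blog/analytics-guide would otherwise match a blocked substring.
    if (request.isNavigationRequest() && request.frame() === page.mainFrame()) {
      return route.continue();
    }
    if (BLOCKED_PATTERNS.some((p) => request.url().includes(p))) {
      return route.abort();
    }
    return route.continue();
  });
}

module.exports = { setupInterceptors };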

Step 3 — Execution & Scheduling: Concurrency Guard & Cron Triggers

Before investing effort in managing crawl budget and rate limiting at the application layer, the pipeline itself must prevent overlapping runs. A second crawl starting before the first finishes doubles target server load and produces artifacts with overlapping timestamps that break diff tooling.

Token-bucket rate limiter

// src/rate-limiter.js
'use strict';

class TokenBucket {
  constructor(capacity, refillRatePerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRatePerSec;
    this.lastRefill = Date.now();
  }

  async consume() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return;
    }

    const waitMs = Math.ceil((1 - this.tokens) / this.refillRate * 1000);
    await new Promise((r) => setTimeout(r, waitMs));
    return this.consume();
  }
}

module.exports = { TokenBucket };
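
The rate limiter caps requests per second, while `CRAWL_CONCURRENCY` caps how many pages are in flight at once. One way to combine the two is a small worker pool, sketched below. The name `crawlAll` is hypothetical, and `limiter` is assumed to be any object with an async `consume()` method, such as the `TokenBucket` above.

```javascript
'use strict';

// Run worker(url) for every URL with at most `concurrency` calls in flight.
// Each call first waits for a token from `limiter`.
async function crawlAll(urls, concurrency, limiter, worker) {
  const results = new Array(urls.length);
  let next = 0;

  async function runLane() {
    while (next < urls.length) {
      // Claiming an index is race-free because JavaScript runs this
      // synchronous section to completion before another lane resumes.
      const index = next++;
      await limiter.consume();
      try {
        results[index] = { url: urls[index], value: await worker(urls[index]) };
      } catch (err) {
        // Record the failure and keep the lane alive for the remaining URLs.
        results[index] = { url: urls[index], error: String(err) };
      }
    }
  }

  const laneCount = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: laneCount }, runLane));
  return results;
}

module.exports = { crawlAll };
```

Results keep the order of the input list, which makes artifacts from consecutive runs easier to diff.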

Distributed mutex for scheduled runs

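The script below keeps the lock as a file on the runner's filesystem, so it only guards runs that share that filesystem (a single self-hosted runner or a shared volume). For runners on separate hosts, use a short-TTL key in a KV store instead. The lock is released by a separate cleanup step (scripts/release-lock.sh) rather than by a trap in this script, because an EXIT trap would fire as soon as the acquire step returns and drop the lock before the crawl starts:

#!/usr/bin/env bash
# scripts/acquire-lock.sh
set -euo pipefail

LOCK_KEY="${1:?Lock key required}"
LOCK_FILE="/tmp/${LOCK_KEY}.lock"
LOCK_TTL_SECONDS=3600

# Remove a lock left behind by a run that never reached its cleanup step.
if [[ -f "$LOCK_FILE" ]]; then
  lock_age=$(( $(date +%s) - $(stat -c %Y "$LOCK_FILE") ))
  if (( lock_age >= LOCK_TTL_SECONDS )); then
    echo "Stale lock detected (age ${lock_age}s). Removing." >&2
    rm -f "$LOCK_FILE"
  fi
fi

# noclobber makes the redirect fail if the file already exists, so creation is
# atomic and two concurrent runs cannot both acquire the lock.
if ! ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
  echo "ERROR: Pipeline already running (lock ${LOCK_FILE} is held). Exiting." >&2
  exit 1
fi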
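
When the crawl is started from Node rather than from a shell step, the same acquire-and-release pattern can live in a `try`/`finally` block. The sketch below is a minimal file-based version under the same single-filesystem assumption; `acquireLock` and `withLock` are hypothetical names. It relies on the `wx` open flag, which fails with `EEXIST` when the file already exists.

```javascript
'use strict';
const fs = require('fs');

// Try to create the lock file atomically. Returns true when this process now
// holds the lock, false when a fresh lock is held by someone else.
function acquireLock(lockFile, ttlSeconds) {
  try {
    const fd = fs.openSync(lockFile, 'wx'); // 'wx' = create, fail if it exists
    fs.writeSync(fd, String(process.pid));
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
  }
  // The lock exists: treat it as stale only once it is older than the TTL.
  const ageSeconds = (Date.now() - fs.statSync(lockFile).mtimeMs) / 1000;
  if (ageSeconds < ttlSeconds) return false;
  fs.rmSync(lockFile, { force: true });
  return acquireLock(lockFile, ttlSeconds);
}

// Run `task` while holding the lock and release it even if the task throws.
async function withLock(lockFile, ttlSeconds, task) {
  if (!acquireLock(lockFile, ttlSeconds)) {
    throw new Error(`Lock ${lockFile} is held by another run`);
  }
  try {
    return await task();
  } finally {
    fs.rmSync(lockFile, { force: true });
  }
}

module.exports = { acquireLock, withLock };
```

A call such as `withLock('/tmp/crawl-audit-lock.lock', 3600, () => runCrawl())` would wrap the whole crawl, where `runCrawl` stands in for the pipeline's own entry point. A `finally` block does not run if the process is killed outright, which is why the TTL check remains necessary.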

GitLab CI scheduled trigger with timezone-safe UTC enforcement

For setting up a cron job for weekly site crawls, tag every run with a stable RUN_ID so artifact filenames are sortable and diffable:

# .gitlab-ci.yml
weekly_audit:
  stage: audit
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_PIPELINE_SOURCE == "web"
  variables:
    RUN_ID: "$CI_PIPELINE_ID"
    TZ: "UTC"
    LOCK_KEY: "crawl-audit-lock"
  before_script:
    - bash scripts/acquire-lock.sh "$LOCK_KEY"
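  script:
    - node src/crawl.js --run-id "$RUN_ID" --output "$CI_PROJECT_DIR/crawl-output"
  after_script:
    - bash scripts/release-lock.sh "$LOCK_KEY" || true
  artifacts:
    paths:
      # Artifact paths are resolved relative to $CI_PROJECT_DIR, so the
      # crawl output must be written inside the project directory.
      - crawl-output/
    expire_in: 30 days
    when: always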

Step 4 — Artifact Capture & Storage

Raw crawl data must be serialised, versioned, and checksummed before upload. Unversioned overwrites make historical diffing impossible. For long-term retention strategies, storing and versioning crawl artifacts in cloud storage covers S3/R2 bucket layouts and lifecycle policies.

Output format: newline-delimited JSON (.ndjson) for streaming ingest, or Apache Parquet for columnar analysis. Both are compressed with gzip before upload.

// src/artifact-writer.js
'use strict';
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');
const zlib = require('zlib');

async function writeArtifact(results, outputDir, runId) {
  const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
  const filename = `crawl-${runId}-${timestamp}.ndjson.gz`;
  const filepath = path.join(outputDir, filename);

  fs.mkdirSync(outputDir, { recursive: true });

  // One JSON document per line; the trailing newline keeps `wc -l` equal to the record count
  const ndjson = results.map((r) => JSON.stringify(r) + '\n').join('');
  const compressed = zlib.gzipSync(Buffer.from(ndjson, 'utf8'));

  fs.writeFileSync(filepath, compressed);

  // SHA-256 sidecar in "<digest>  <filename>" form so that `sha256sum -c`
  // can verify it when run from the artifact directory
  const checksum = crypto.createHash('sha256').update(compressed).digest('hex');
  fs.writeFileSync(`${filepath}.sha256`, `${checksum}  ${filename}\n`);

  return { filepath, checksum };
}

module.exports = { writeArtifact };
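
A consumer of the artifact can repeat the checksum and record-count checks before ingesting it. The sketch below is one possible reader-side check; `verifyArtifact` is a hypothetical name, and it assumes the sidecar file sits next to the artifact with a `.sha256` suffix, as in the writer above.

```javascript
'use strict';
const fs = require('fs');
const crypto = require('crypto');
const zlib = require('zlib');

// Recompute the SHA-256 of a gzip artifact, compare it with the digest stored
// in the sidecar file, and count the NDJSON records inside.
function verifyArtifact(gzPath) {
  const compressed = fs.readFileSync(gzPath);
  const actual = crypto.createHash('sha256').update(compressed).digest('hex');

  // Accept both a bare digest and the "<digest>  <filename>" sha256sum layout.
  const sidecar = fs.readFileSync(`${gzPath}.sha256`, 'utf8');
  const expected = sidecar.trim().split(/\s+/)[0];

  // Count non-empty lines so a missing trailing newline does not skew the total.
  const lines = zlib.gunzipSync(compressed).toString('utf8').split('\n');
  const records = lines.filter((line) => line.trim() !== '').length;

  return { checksumOk: actual === expected, records };
}

module.exports = { verifyArtifact };
```

Comparing `records` with the size of the seed URL list gives an early signal that a crawl ended prematurely.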

Validate each record against a JSON Schema before upload, so that a malformed record fails the pipeline instead of producing a corrupted artifact from a crawl that looks healthy:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["url", "status_code", "lcp_ms", "cls_score", "inp_ms", "timestamp_utc"],
  "properties": {
    "url":            { "type": "string",  "format": "uri" },
    "status_code":    { "type": "integer", "minimum": 100, "maximum": 599 },
    "lcp_ms":         { "type": "number",  "minimum": 0 },
    "cls_score":      { "type": "number",  "minimum": 0 },
    "inp_ms":         { "type": "number",  "minimum": 0 },
    "wcag_violations":{ "type": "array",   "items": { "type": "string" } },
    "timestamp_utc":  { "type": "string",  "format": "date-time" }
  }
}
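
Schema validation is most useful when it covers every record, not only the first line of the file. The sketch below is a dependency-free pre-check that mirrors the required fields and lower bounds of the schema above and reports the failing line numbers. It is not a JSON Schema implementation and does not replace a real validator; `checkRecord` and `checkNdjson` are hypothetical names.

```javascript
'use strict';

const REQUIRED = ['url', 'status_code', 'lcp_ms', 'cls_score', 'inp_ms', 'timestamp_utc'];

// Return a list of human-readable problems for one parsed record (empty = OK).
function checkRecord(record) {
  if (record === null || typeof record !== 'object' || Array.isArray(record)) {
    return ['record is not an object'];
  }
  const problems = REQUIRED.filter((key) => !(key in record)).map((key) => `missing ${key}`);

  const status = record.status_code;
  if ('status_code' in record && !(Number.isInteger(status) && status >= 100 && status <= 599)) {
    problems.push('status_code out of range');
  }
  for (const key of ['lcp_ms', 'cls_score', 'inp_ms']) {
    if (key in record && !(typeof record[key] === 'number' && record[key] >= 0)) {
      problems.push(`${key} must be a non-negative number`);
    }
  }
  if ('timestamp_utc' in record && Number.isNaN(Date.parse(record.timestamp_utc))) {
    problems.push('timestamp_utc is not a parseable date');
  }
  return problems;
}

// Check every non-empty line of an NDJSON string; line numbers are 1-based.
function checkNdjson(text) {
  const failures = [];
  text.split('\n').forEach((line, i) => {
    if (line.trim() === '') return;
    let record;
    try {
      record = JSON.parse(line);
    } catch {
      failures.push({ line: i + 1, problems: ['invalid JSON'] });
      return;
    }
    const problems = checkRecord(record);
    if (problems.length > 0) failures.push({ line: i + 1, problems });
  });
  return failures;
}

module.exports = { checkRecord, checkNdjson };
```

An empty array from `checkNdjson` means every line passed; anything else can be logged and used to fail the pipeline step.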

Verification Checklist

Run these checks after every pipeline execution before treating the results as authoritative.

Log tail: grep -E '"level":"error"' $AUDIT_OUTPUT_DIR/pipeline.log | wc -l — expect zero.
Artifact checksum: sha256sum -c crawl-<run-id>-*.ndjson.gz.sha256 — all entries OK.
Record count: zcat crawl-*.ndjson.gz | wc -l — count should match the seed URL list size within ±2%.
Schema validation: npx ajv validate -s schema/crawl-output.json -d <(zcat crawl-*.ndjson.gz | head -1) — exits 0.
Health-score diff: compare lcp_ms p95 against the previous run's artifact. A delta above 500 ms warrants investigation before treating the run as baseline.
Lock file absent: ls /tmp/crawl-audit-lock.lock should report no such file after the pipeline completes; a leftover lock means the release step did not run.
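
The health-score diff in the checklist can be scripted. The sketch below computes a nearest-rank p95 of `lcp_ms` for two runs and flags the pair when the difference exceeds the 500 ms threshold. The names `percentile` and `lcpRegression` are hypothetical, and treating large improvements as suspicious too (an absolute delta) is an assumption rather than something the checklist specifies.

```javascript
'use strict';

// Nearest-rank percentile: the smallest value with at least p% of the samples
// at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new Error('percentile of an empty list');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Compare p95 LCP between two arrays of crawl records.
function lcpRegression(previous, current, thresholdMs = 500) {
  const prevP95 = percentile(previous.map((r) => r.lcp_ms), 95);
  const currP95 = percentile(current.map((r) => r.lcp_ms), 95);
  const deltaMs = currP95 - prevP95;
  return { prevP95, currP95, deltaMs, investigate: Math.abs(deltaMs) > thresholdMs };
}

module.exports = { percentile, lcpRegression };
```

Both inputs are arrays of parsed records, for example the result of reading two `.ndjson.gz` artifacts line by line.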

Troubleshooting

Runner OOM kill during crawl

Root cause: Insufficient /dev/shm or V8 heap exhaustion on pages with large DOM trees.

# Confirm OOM on GitHub Actions
grep -E 'OOMKilled|Killed' /var/log/syslog | tail -20

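# Fix: raise --shm-size in the container options in the workflow YAML,
# and raise the Node.js heap for the crawler process (the browser's own
# V8 heap is raised separately via --js-flags at launch):
node --max-old-space-size=4096 src/crawl.js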

WAF blocks crawler IP and returns 403

Root cause: WAF rate-rule triggered by the runner's egress IP.

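# Capture the response status and request ID (cf-ray is Cloudflare-specific),
# then look up that ID in the WAF event log to see which rule fired
curl -sI -H "X-Audit-Token: $AUDIT_TOKEN" "$CRAWL_SEED_URL" | grep -iE '^HTTP|cf-ray'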

# Fix: allowlist the runner IP range in the WAF dashboard, or
# pass the shared secret header so the WAF exempts audit traffic.
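
If the shared-secret route is chosen, the header should reach only the site under audit. Setting it as a context-wide extra header would also send the token to every third-party host the page loads. The sketch below is one way to scope it: a pure helper that adds the header for same-origin requests only. `auditHeaders` is a hypothetical name, and the usage comment assumes a Playwright route handler like the one in src/interceptor.js.

```javascript
'use strict';

// Return the headers for one outgoing request, adding the audit token only
// when the request targets the same origin as the crawl seed URL.
function auditHeaders(requestUrl, existingHeaders, seedUrl, token) {
  const sameOrigin = new URL(requestUrl).origin === new URL(seedUrl).origin;
  return sameOrigin
    ? { ...existingHeaders, 'x-audit-token': token }
    : { ...existingHeaders };
}

// Possible use inside a Playwright route handler:
//   const request = route.request();
//   return route.continue({
//     headers: auditHeaders(request.url(), request.headers(),
//                           process.env.CRAWL_SEED_URL, process.env.AUDIT_TOKEN),
//   });

module.exports = { auditHeaders };
```

HTTP header names are case-insensitive, so the lowercase `x-audit-token` matches a WAF rule written for `X-Audit-Token`.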

Metrics drift between identical runs

Root cause: Service workers serving stale cache states, or third-party scripts not consistently blocked.

// Disable service workers at context creation time
const context = await browser.newContext({
  serviceWorkers: 'block',
});

Pipeline stalls on lock acquisition

Root cause: A previous run exited without releasing the lock (for example, the runner was preempted before the cleanup step ran).

# Print the lock age in seconds
echo $(( $(date +%s) - $(stat -c %Y /tmp/crawl-audit-lock.lock) ))
# If older than $LOCK_TTL_SECONDS, force-remove:
rm -f /tmp/crawl-audit-lock.lock

JSON Schema validation rejects otherwise valid records

Root cause: timestamp_utc formatted in local time rather than UTC ISO 8601.

// Always generate timestamps with:
new Date().toISOString()  // returns "2026-06-21T02:00:00.000Z" — always UTC
// Never use:
new Date().toLocaleString() // locale-dependent, not ISO 8601

Artifact upload fails on large HAR files

Root cause: Uncompressed HAR exports from Chromium can exceed 500 MB, which is large enough to hit artifact size or storage limits on GitHub Actions (check the current limits for your plan).

# Compress before upload; gzip typically achieves 10:1 on HAR JSON
gzip -9 "$AUDIT_OUTPUT_DIR"/*.har
# Or switch to .ndjson format, which is ~60% smaller than HAR for the same data

FAQ

Should I run crawlers on push events or on a schedule?

Both, separated by purpose. Push-triggered runs catch regressions before merge — keep them fast by crawling a sampled URL list (50–100 representative pages). Scheduled runs establish a continuous health baseline independent of code changes; these should crawl the full seed list. Use separate workflow files and separate artifact retention policies for each.
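
For the push-triggered sample, it helps if every run crawls the same subset, so that differences between runs come from the code change and not from the sample. The sketch below picks a stable subset by sorting on a hash of each URL. `sampleUrls` is a hypothetical name, and hash-based selection is pseudo-random rather than hand-curated, so it may need to be combined with a short list of must-crawl pages.

```javascript
'use strict';
const crypto = require('crypto');

// Pick up to `size` URLs by sorting on a SHA-256 hash of each URL. The same
// set of input URLs always yields the same sample, whatever the input order.
function sampleUrls(urls, size) {
  return [...new Set(urls)]
    .map((url) => ({ url, key: crypto.createHash('sha256').update(url).digest('hex') }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .slice(0, size)
    .map((entry) => entry.url);
}

module.exports = { sampleUrls };
```

Adding or removing a few URLs from the seed list changes only a few entries in the sample, which keeps consecutive push runs comparable.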

How do I prevent parallel crawl runs from fighting over the same target?

Use a distributed mutex — a CI artifact lock file or a short-TTL key in a KV store. Acquire the lock before the crawl starts; release it in a finally block or pipeline cleanup step. Set a TTL slightly longer than your longest expected crawl duration so stale locks from preempted runners auto-expire.

What is the minimum runner spec for a headless Chromium crawl?

4 vCPU, 8 GB RAM, and at least 10 GB of ephemeral storage. Under-spec runners produce flaky LCP measurements because the browser competes with the runner OS for memory. If your site uses many large images or heavy JavaScript bundles, consider 16 GB RAM for reliable p95/p99 Web Vitals capture.

How do I stop a WAF from blocking the CI crawler?

Allowlist the runner's egress IP range in your WAF rules, or pass a shared secret in a custom request header (X-Audit-Token) that the WAF accepts for crawl traffic. Rotate the secret through CI secrets on a monthly cycle. Do not hardcode the token in the Dockerfile — inject it at runtime via the CI secret manager.

Related

Automated Crawling & Pipeline Tooling: parent section covering the full pipeline lifecycle
Configuring Headless Browsers for JS-Heavy Sites: viewport standardisation, SPA routing, and browser flag reference
Managing Crawl Budget & Rate Limiting: token-bucket tuning and robots.txt enforcement
Storing & Versioning Crawl Artifacts in Cloud Storage: S3/R2 bucket layouts, lifecycle policies, and artifact diffing
Setting Up a Cron Job for Weekly Site Crawls: cron expression reference and idempotency patterns
GitHub Actions vs GitLab CI for Crawler Scheduling: parity configs for scheduling audit crawls in either CI system
