Automating Screaming Frog with Python Scripts
Automating Screaming Frog with Python scripts removes manual GUI bottlenecks from technical audit and site health monitoring workflows. Engineering teams require deterministic execution, reproducible environments, and structured data pipelines. This guide outlines a production-ready architecture for CLI-driven crawls.
Headless CLI Environment & License Validation
Establish a deterministic execution environment before initiating automated workflows. Align container standards with the Automated Crawling & Pipeline Tooling baseline to ensure reproducible dependency resolution. CLI execution requires explicit headless mode configuration to bypass display server dependencies and enable silent license validation.
Root Cause: Manual GUI execution fails in automated pipelines due to missing X11 display servers, unhandled Java runtime dependencies, or interactive license prompts blocking subprocess execution.
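Fix: Deploy a minimal containerized runtime with OpenJDK 17+. Install the Screaming Frog CLI binary. Inject the license at runtime through the SF_LICENSE_KEY environment variable instead of baking it into the image; Screaming Frog's own headless setup reads credentials from a licence.txt file in its ~/.ScreamingFrogSEOSpider/ directory, so have the container entrypoint write that file from the variable. Wrap execution in a Python subprocess controller with explicit --headless and --config flags.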
Validation: Run screamingfrogseospider --version and verify exit code 0. Confirm cli.log contains License: Valid (CLI Mode) and zero X11/display errors. Validate Python wrapper returns CompletedProcess with returncode=0.
Rollback: Terminate active subprocess. Clear cached license tokens. Revert to previous Docker image tag. Restore manual GUI execution workflow.
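
# Minimal Alpine image with a Java 17 runtime.
FROM eclipse-temurin:17-jre-alpine
RUN apk add --no-cache curl unzip bash
# Pin the release so rebuilds are reproducible.
ARG SF_VERSION="19.0"
RUN curl -fsSL https://download.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider-${SF_VERSION}.zip \
        -o /tmp/sf.zip && \
    unzip /tmp/sf.zip -d /opt/screamingfrog && \
    rm /tmp/sf.zip && \
    chmod +x /opt/screamingfrog/ScreamingFrogSEOSpider
# Keep the key empty in the image; inject it at runtime (docker run -e SF_LICENSE_KEY=...).
ENV SF_LICENSE_KEY=""
# Create the output directory as root so the unprivileged user can write to it.
RUN adduser -D -s /bin/sh crawler && mkdir -p /data && chown crawler /data
USER crawler
WORKDIR /data
ENTRYPOINT ["/opt/screamingfrog/ScreamingFrogSEOSpider"]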
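
import os
import subprocess


def run_headless_crawl(url: str, config_path: str, output_dir: str) -> subprocess.CompletedProcess:
    """Run one headless crawl and return the completed process."""
    env = os.environ.copy()  # passes SF_LICENSE_KEY through to the child process
    cmd = [
        "ScreamingFrogSEOSpider",  # must be on PATH, or use the absolute binary path
        "--headless",
        "--crawl", url,  # the CLI takes the start URL via --crawl
        "--config", os.path.abspath(config_path),  # absolute paths survive CI working-directory changes
        "--save-crawl",
        "--output-folder", os.path.abspath(output_dir),
    ]
    # check=True raises CalledProcessError on a non-zero exit code;
    # timeout raises TimeoutExpired after one hour.
    return subprocess.run(
        cmd,
        env=env,
        timeout=3600,
        check=True,
        capture_output=True,
        text=True,
    )
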
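The validation step above can be scripted rather than checked by eye. The following is a minimal sketch, assuming the wrapper's CompletedProcess and the contents of the log file are already in hand. The helper name check_headless_run and the marker strings are illustrative: the markers are common Java display errors, not strings documented by Screaming Frog.

```python
import subprocess

# Assumed markers: substrings that typically indicate a missing display server
# in Java applications. Adjust them to what your own logs actually contain.
DISPLAY_ERROR_MARKERS = ("HeadlessException", "Can't connect to X11 window server")


def check_headless_run(result: subprocess.CompletedProcess, log_text: str) -> list[str]:
    """Return a list of problems; an empty list means the run passed."""
    problems = []
    if result.returncode != 0:
        problems.append(f"non-zero exit code: {result.returncode}")
    for marker in DISPLAY_ERROR_MARKERS:
        if marker in log_text:
            problems.append(f"display error marker found in log: {marker!r}")
    return problems
```

Returning a list instead of raising on the first failure lets a CI job report every problem from a single run.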
Common Mistakes:
- Hardcoding license keys in repository history.
- Omitting --headless, causing java.awt.HeadlessException.
- Using relative paths for .seospider configs, causing FileNotFoundError in CI runners.
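Two of the mistakes above can be caught before the crawl starts. The sketch below assumes the key is supplied through the SF_LICENSE_KEY variable used earlier in this guide; the function names load_license_key and resolve_config are hypothetical helpers, not part of any Screaming Frog tooling.

```python
import os
from pathlib import Path


def load_license_key(env=os.environ, var_name: str = "SF_LICENSE_KEY") -> str:
    """Read the license key from the environment instead of source control."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; refusing to start the crawl")
    return key


def resolve_config(config_path: str) -> Path:
    """Return an absolute path to the config file, failing early if it is missing."""
    path = Path(config_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"crawl config not found: {path}")
    return path
```

Failing here, before the subprocess is spawned, gives a clear Python traceback in the CI log instead of an opaque CLI exit code.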
Programmatic Crawl Configuration & Execution
Dynamic configuration generation enables precise control over crawl behavior and resource allocation. When targeting modern SPAs or client-side rendered applications, integrate headless rendering parameters as detailed in Configuring Headless Browsers for JS-Heavy Sites. Python scripts must validate config syntax before passing to the CLI to prevent silent execution failures.
Root Cause: Static .seospider profiles fail to adapt to dynamic sitemaps, variable WAF thresholds, or JavaScript rendering requirements, resulting in incomplete URL coverage or IP throttling.
Fix: Implement Python-driven config generation using configparser or YAML templating. Dynamically inject rate limits, custom user agents, and JS rendering toggles based on target domain heuristics. Pass the generated config to the CLI via --config /tmp/dynamic.seospider.
Validation: Execute a 100-URL test crawl. Verify crawl_log.csv shows 100% URL discovery, correct render_mode application, and HTTP status distribution matching baseline metrics. Confirm stdout contains no Rate limit exceeded warnings.
Rollback: Restore static .seospider backup. Disable dynamic config injection. Revert to default --max-threads=5. Clear temporary config artifacts.
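The baseline comparison in the validation step can be made concrete. This is a minimal sketch, assuming the status codes from the test crawl have already been loaded into a list; the function names and the 5-point tolerance are illustrative choices rather than values from any Screaming Frog documentation.

```python
from collections import Counter


def status_distribution(status_codes: list[int]) -> dict[str, float]:
    """Share of responses per status class ('2xx', '3xx', ...)."""
    if not status_codes:
        return {}
    counts = Counter(f"{code // 100}xx" for code in status_codes)
    total = len(status_codes)
    return {cls: n / total for cls, n in counts.items()}


def drifted_classes(current: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.05) -> list[str]:
    """Status classes whose share differs from the baseline by more than tolerance."""
    classes = set(current) | set(baseline)
    return sorted(
        cls for cls in classes
        if abs(current.get(cls, 0.0) - baseline.get(cls, 0.0)) > tolerance
    )
```

An empty result from drifted_classes means the test crawl matches the baseline within tolerance; any returned class is a reason to inspect the crawl before scaling it up.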
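
import configparser
import signal
import subprocess
import sys


def generate_dynamic_config(target_domain: str, rate_limit: int = 2, js_render: bool = False) -> str:
    """Write a crawl config for target_domain and return its path."""
    config = configparser.ConfigParser()
    config.optionxform = str  # configparser lowercases keys by default; keep 'StartUrl' as written
    config['Crawl'] = {
        'StartUrl': f'https://{target_domain}',
        'MaxThreads': '5',
        'RateLimit': str(rate_limit),
    }
    if js_render:
        config['Rendering'] = {'RenderMode': 'JavaScript'}
    config_path = '/tmp/dynamic.seospider'
    with open(config_path, 'w') as f:
        config.write(f)
    return config_path


def stream_crawl_output(proc: subprocess.Popen) -> None:
    """Watch crawl output line by line and stop the crawl on the first throttling warning."""
    # Requires Popen(..., stdout=subprocess.PIPE, text=True) so that lines arrive as str.
    for line in proc.stdout:
        if "Rate limit exceeded" in line:
            sys.stderr.write("ALERT: WAF throttling detected\n")
            proc.send_signal(signal.SIGTERM)
            break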
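
import signal
import subprocess
import sys
from typing import Optional

# Set this to the running crawl's Popen object so the handler can reach it.
active_crawl: Optional[subprocess.Popen] = None


def safe_termination_handler(signum, frame):
    """Forward the signal to the crawl process, wait for it to exit, then stop."""
    print("Received termination signal. Saving partial crawl state...")
    if active_crawl is not None and active_crawl.poll() is None:
        # Signal the child, not this process: sending SIGINT to our own PID
        # would re-enter this handler recursively.
        active_crawl.send_signal(signal.SIGINT)
        active_crawl.wait(timeout=60)
    sys.exit(0)


signal.signal(signal.SIGINT, safe_termination_handler)
signal.signal(signal.SIGTERM, safe_termination_handler)
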
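The "target domain heuristics" mentioned in the fix can be as simple as a lookup table with a fallback. The sketch below is one possible shape, not a prescribed design: the CrawlProfile class, the override table, and the example host are all hypothetical, and the default values mirror the defaults used in generate_dynamic_config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlProfile:
    rate_limit: int   # same unit as the RateLimit config value
    max_threads: int
    js_render: bool


DEFAULT_PROFILE = CrawlProfile(rate_limit=2, max_threads=5, js_render=False)

# Hypothetical overrides; real values would come from what you know about
# each site's WAF thresholds and rendering stack.
OVERRIDES = {
    "app.example.com": CrawlProfile(rate_limit=1, max_threads=2, js_render=True),
}


def profile_for(domain: str) -> CrawlProfile:
    """Pick the most specific profile: exact host first, then parent domains."""
    labels = domain.lower().strip(".").split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in OVERRIDES:
            return OVERRIDES[candidate]
    return DEFAULT_PROFILE
```

The selected profile's fields can then be passed to the config generator, so a subdomain inherits its parent's throttling settings unless it has an entry of its own.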
Common Mistakes:
- Overwriting active config files mid-crawl causing parser corruption.
- Setting --max-threads > 10 without adjusting RateLimit, triggering WAF blocks.
- Failing to validate config syntax before CLI execution, resulting in silent fallback to defaults.
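Pre-flight validation of the generated config might look like the following. This is a sketch under the assumption that the config uses the INI layout written by generate_dynamic_config (a Crawl section with StartUrl, MaxThreads, and RateLimit); the function name validate_config_text is hypothetical.

```python
import configparser


def validate_config_text(text: str) -> list[str]:
    """Return a list of problems found in an INI-style config; empty means valid."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case exactly as written
    try:
        parser.read_string(text)
    except configparser.Error as exc:
        return [f"syntax error: {exc}"]

    if not parser.has_section("Crawl"):
        return ["missing [Crawl] section"]

    problems = []
    for key in ("StartUrl", "MaxThreads", "RateLimit"):
        if not parser.has_option("Crawl", key):
            problems.append(f"missing Crawl.{key}")
    for key in ("MaxThreads", "RateLimit"):
        raw = parser.get("Crawl", key, fallback=None)
        if raw is None:
            continue  # already reported as missing above
        try:
            valid = int(raw) >= 1
        except ValueError:
            valid = False
        if not valid:
            problems.append(f"Crawl.{key} must be a positive integer, got {raw!r}")
    return problems
```

Running this on the file contents before launching the CLI turns a silent fallback to defaults into an explicit failure with a readable message.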
Output Parsing & Pipeline Ingestion
Post-crawl normalization ensures reliable data handoff to analytics and monitoring systems. Standardize artifact storage using versioned cloud buckets aligned with enterprise data governance practices. Implement strict schema validation before committing to the data warehouse to prevent pipeline corruption. Normalized outputs directly feed LCP, CLS, INP, and WCAG compliance tracking.
Root Cause: Raw .csv and .seospider exports contain inconsistent column ordering, BOM encoding artifacts, or missing schema fields, causing downstream data pipelines to fail on dtype mismatches or ingestion timeouts.
Fix: Deploy a Python post-processing module using pandas or polars to normalize column names, enforce strict dtypes, strip UTF-8 BOM, and push structured Parquet/JSON to cloud storage. Implement schema validation via Pydantic before committing to the warehouse.
Validation: Run schema validation against a predefined Pydantic model. Confirm row counts match crawl_log.csv totals. Verify zero NaN values in critical fields (Address, Status Code, Title 1, Indexability).
Rollback: Archive malformed artifacts. Switch to raw CSV fallback ingestion. Trigger schema drift alerting. Halt downstream ETL jobs.
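
from typing import Optional

import pandas as pd
from pydantic import BaseModel, Field, ValidationError

CRITICAL_COLUMNS = ["Address", "Status Code", "Title 1", "Indexability"]


class CrawlRecord(BaseModel, extra='forbid'):
    address: str = Field(alias="Address")
    status_code: int = Field(alias="Status Code")
    title: Optional[str] = Field(default=None, alias="Title 1")
    indexability: str = Field(alias="Indexability")


def normalize_crawl_output(csv_path: str) -> pd.DataFrame:
    # 'utf-8-sig' strips a leading BOM so the first header parses cleanly.
    df = pd.read_csv(csv_path, encoding='utf-8-sig', dtype={"Status Code": "Int64"})
    df.columns = df.columns.str.strip()
    # Validate only the modelled columns (extra='forbid' rejects any others)
    # and turn NaN/NA into None so the Optional field accepts missing values.
    subset = df[CRITICAL_COLUMNS].astype(object)
    subset = subset.where(subset.notna(), None)
    try:
        records = [CrawlRecord(**row) for row in subset.to_dict(orient="records")]
    except ValidationError as exc:
        raise ValueError(f"Schema validation failed for {csv_path}: {exc}") from exc
    return pd.DataFrame([r.model_dump() for r in records])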

import io

import boto3
import pandas as pd
from botocore.exceptions import ClientError


def upload_to_s3(df: pd.DataFrame, bucket: str, key: str) -> None:
    """Serialize df to Parquet in memory and upload it to s3://bucket/key."""
    parquet_buffer = io.BytesIO()
    df.to_parquet(parquet_buffer, engine='pyarrow')
    s3 = boto3.client('s3')
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=parquet_buffer.getvalue())
    except ClientError as e:
        # Chain the original exception so the AWS error details stay in the traceback.
        raise RuntimeError(f"S3 upload failed: {e.response['Error']['Message']}") from e

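The row-count and missing-value checks from the validation step can run before the upload. The sketch below uses only the standard library and assumes the rows are available as dictionaries keyed by the export's column names; audit_rows and is_missing are hypothetical helper names.

```python
import math

CRITICAL_FIELDS = ("Address", "Status Code", "Title 1", "Indexability")


def is_missing(value) -> bool:
    """Treat None, empty or whitespace-only strings, and float NaN as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    return isinstance(value, float) and math.isnan(value)


def audit_rows(rows: list[dict], expected_count: int) -> list[str]:
    """Compare the row count to the expected total and flag missing critical fields."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"row count {len(rows)} != expected {expected_count}")
    for index, row in enumerate(rows):
        missing = [name for name in CRITICAL_FIELDS if is_missing(row.get(name))]
        if missing:
            problems.append(f"row {index}: missing {', '.join(missing)}")
    return problems
```

With pandas, the same rows can be obtained from df.to_dict(orient="records"), which keeps this check independent of the DataFrame library in use.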
Common Mistakes:
- Assuming consistent column order across Screaming Frog version upgrades.
- Ignoring BOM in exported CSVs, causing header parsing failures and KeyError.
- Blocking the main thread during large file serialization instead of using async I/O.
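The first two mistakes share a remedy: decode with a BOM-aware codec and select columns by name, never by position. A minimal standard-library sketch follows, assuming the export is available as raw bytes; read_crawl_rows is a hypothetical helper name.

```python
import csv
import io


def read_crawl_rows(raw: bytes, wanted: tuple[str, ...]) -> list[dict[str, str]]:
    """Decode an exported CSV, dropping a UTF-8 BOM if present, and select columns by name."""
    text = raw.decode("utf-8-sig")  # 'utf-8-sig' strips a leading BOM, if any
    reader = csv.DictReader(io.StringIO(text, newline=""))
    headers = [h.strip() for h in (reader.fieldnames or [])]
    missing = [name for name in wanted if name not in headers]
    if missing:
        raise KeyError(f"export is missing columns: {missing}")
    reader.fieldnames = headers  # use the stripped header names as row keys
    return [{name: row[name] for name in wanted} for row in reader]
```

Because rows are keyed by header name, a version upgrade that reorders columns produces identical output, and one that drops a required column fails with an explicit message.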