What if the site has nested sitemap indexes with thousands of child sitemaps?

Fetch the root sitemap index first, extract all entries, then dispatch parallel curl jobs for each child. Use GNU parallel or xargs -P to bound concurrency. Merge outputs with sort -u before deduplication.

Can I run the hierarchy mapping script against a staging environment?

Yes, but set TARGET_DOMAIN to the staging hostname and add the staging domain to allowed_domains. Verify that robots.txt on staging does not contain blanket Disallow: / rules before running, or the intersection filter will produce an empty URL list.

12 min read

How to Map URL Hierarchies Before Running a Crawl

Q: How do I handle sitemaps served behind authentication?

Pass a session cookie or bearer token via curl -H 'Cookie: session=...' or curl -H 'Authorization: Bearer ...'. Store credentials in environment variables, never hard-code them in scripts.

Q: How do I know if my parameter normalization removed too many essential URLs?

Diff the normalized list against a known-good reference crawl from a previous audit. If any URLs that historically returned HTTP 200 with unique canonical tags disappear from the normalized set, add those parameter patterns to an allowlist before proceeding.

Deploying a crawler against an unmapped site is the fastest way to exhaust your crawl budget on low-value paths and produce an inventory full of duplicates and redirect noise. The pre-crawl hierarchy mapping stage removes that uncertainty: you extract every declared URL, resolve the chains that inflate apparent depth, strip the parameters that cause combinatorial explosions, and hard-code the scope boundaries before the first real request fires. This guide is scoped to enterprises with multi-tier sitemap indexes, faceted navigation, and mixed-environment routing. It sits under Defining Crawl Depth & Scope for Enterprise Sites, which covers the broader scoping strategy.

Environment Isolation & Dependency Declaration

Isolate the mapping scripts from your production Python environment to avoid dependency conflicts with later pipeline stages.

#!/usr/bin/env bash
set -euo pipefail

export TARGET_DOMAIN="example.com"
export WORK_DIR="/tmp/url-hierarchy-${TARGET_DOMAIN}"
export PYTHON_BIN="/usr/local/bin/python3.12"

mkdir -p "${WORK_DIR}"

# Pin tool versions
"${PYTHON_BIN}" -m venv "${WORK_DIR}/.venv"
"${WORK_DIR}/.venv/bin/pip" install --quiet \
  requests==2.32.3 \
  beautifulsoup4==4.12.3 \
  lxml==5.2.2

Required environment variables:

Variable	Type	Default	Purpose
`TARGET_DOMAIN`	string	—	Apex domain without scheme; used in all scope filters
`WORK_DIR`	string	`/tmp/url-hierarchy-<domain>`	Absolute working directory for all intermediate files
`PYTHON_BIN`	string	`/usr/local/bin/python3.12`	Pinned interpreter path; avoids system Python conflicts
`MAX_REDIRECT_DEPTH`	int	`5`	Maximum 3xx hops before a chain is flagged as broken
`CONCURRENCY`	int	`8`	Parallel workers for the redirect resolution pass

Implementation

The following script executes all four mapping stages end-to-end. Each stage writes to a named intermediate file so any stage can be re-run independently without repeating earlier work.

#!/usr/bin/env bash
# url-hierarchy-map.sh — full pre-crawl URL hierarchy mapping pipeline
# Usage: TARGET_DOMAIN=example.com bash url-hierarchy-map.sh
set -euo pipefail

TARGET_DOMAIN="${TARGET_DOMAIN:?TARGET_DOMAIN must be set}"
WORK_DIR="${WORK_DIR:-/tmp/url-hierarchy-${TARGET_DOMAIN}}"
PYTHON_BIN="${PYTHON_BIN:-/usr/local/bin/python3.12}"
MAX_REDIRECT_DEPTH="${MAX_REDIRECT_DEPTH:-5}"
CONCURRENCY="${CONCURRENCY:-8}"

mkdir -p "${WORK_DIR}"

# ── Stage 1: Sitemap Extraction ──────────────────────────────────────────────
echo "[1/4] Extracting URLs from sitemap index..."

# Fetch root sitemap; follow nested <sitemap> entries up to 2 levels deep
curl -fsSL "https://${TARGET_DOMAIN}/sitemap.xml" \
  | grep -oP '(?<=<loc>)[^<]+' \
  > "${WORK_DIR}/sitemap_raw.txt"

# If the root is a sitemap index, fetch each child sitemap and append
while IFS= read -r child_url; do
  if echo "${child_url}" | grep -qi 'sitemap'; then
    curl -fsSL "${child_url}" \
      | grep -oP '(?<=<loc>)[^<]+' \
      >> "${WORK_DIR}/sitemap_raw.txt"
  fi
done < "${WORK_DIR}/sitemap_raw.txt"

sort -u "${WORK_DIR}/sitemap_raw.txt" > "${WORK_DIR}/raw_urls.txt"
echo "  → $(wc -l < "${WORK_DIR}/raw_urls.txt") unique URLs extracted"

# ── Stage 2: robots.txt Filter ───────────────────────────────────────────────
echo "[2/4] Filtering against robots.txt disallowed paths..."

curl -fsSL "https://${TARGET_DOMAIN}/robots.txt" \
  | grep -i '^Disallow:' \
  | awk '{print $2}' \
  | sed 's|/$||' \
  > "${WORK_DIR}/disallowed_prefixes.txt"

# Remove any URL whose path starts with a disallowed prefix
while IFS= read -r url; do
  path="${url#https://${TARGET_DOMAIN}}"
  blocked=false
  while IFS= read -r prefix; do
    [[ -z "${prefix}" ]] && continue
    if [[ "${path}" == "${prefix}"* ]]; then
      blocked=true
      break
    fi
  done < "${WORK_DIR}/disallowed_prefixes.txt"
  "${blocked}" || echo "${url}"
done < "${WORK_DIR}/raw_urls.txt" > "${WORK_DIR}/allowed_urls.txt"

echo "  → $(wc -l < "${WORK_DIR}/allowed_urls.txt") URLs after robots.txt filter"

# ── Stage 3: Redirect Chain Resolution ───────────────────────────────────────
echo "[3/4] Resolving redirect chains and extracting canonical tags..."

"${PYTHON_BIN}" - <<'PYEOF'
import os, json, sys
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from bs4 import BeautifulSoup

work_dir   = os.environ["WORK_DIR"]
max_hops   = int(os.environ.get("MAX_REDIRECT_DEPTH", 5))
workers    = int(os.environ.get("CONCURRENCY", 8))

session = requests.Session()
session.max_redirects = max_hops

def resolve(url: str) -> dict:
    try:
        r = session.get(url, timeout=15, allow_redirects=True)
        soup = BeautifulSoup(r.text, "lxml")
        tag  = soup.find("link", rel="canonical")
        return {
            "original":   url,
            "final_url":  r.url,
            "status":     r.status_code,
            "canonical":  tag["href"] if tag else None,
            "hops":       len(r.history),
        }
    except requests.TooManyRedirects:
        return {"original": url, "error": "too_many_redirects"}
    except Exception as exc:
        return {"original": url, "error": str(exc)}

input_path  = os.path.join(work_dir, "allowed_urls.txt")
output_path = os.path.join(work_dir, "resolved_urls.jsonl")

with open(input_path) as fh:
    urls = [line.strip() for line in fh if line.strip()]

with open(output_path, "w") as out, ThreadPoolExecutor(max_workers=workers) as pool:
    futures = {pool.submit(resolve, u): u for u in urls}
    for future in as_completed(futures):
        out.write(json.dumps(future.result()) + "\n")

print(f"  → Wrote {len(urls)} resolved entries to {output_path}")
PYEOF

# Extract final URLs that resolved cleanly (HTTP 200) into a flat list
python3 -c "
import json, sys
for line in open('${WORK_DIR}/resolved_urls.jsonl'):
    rec = json.loads(line)
    if rec.get('status') == 200:
        print(rec['final_url'])
" | sort -u > "${WORK_DIR}/resolved_200.txt"

echo "  → $(wc -l < "${WORK_DIR}/resolved_200.txt") URLs resolved to HTTP 200"

# ── Stage 4: Parameter Normalization ─────────────────────────────────────────
echo "[4/4] Stripping non-essential query parameters..."

# Parameters known to generate duplicates: sort, ref, utm_*, color, size, page
# Add domain-specific params to this pattern before running
sed -E \
  -e 's/[?&](utm_source|utm_medium|utm_campaign|utm_term|utm_content)=[^&]*//g' \
  -e 's/[?&](sort|order|ref|affid|gclid|fbclid|color|size|page)=[^&]*//g' \
  -e 's/[?&]+$//' \
  -e 's/\?&/\?/' \
  "${WORK_DIR}/resolved_200.txt" \
  | sort -u \
  > "${WORK_DIR}/normalized_urls.txt"

echo "  → $(wc -l < "${WORK_DIR}/normalized_urls.txt") unique normalized URLs"

# ── Scope Validation ──────────────────────────────────────────────────────────
out_of_scope=$(grep -v "^https://${TARGET_DOMAIN}/" "${WORK_DIR}/normalized_urls.txt" || true)
if [[ -n "${out_of_scope}" ]]; then
  echo "WARNING: out-of-scope URLs detected:" >&2
  echo "${out_of_scope}" >&2
  grep "^https://${TARGET_DOMAIN}/" "${WORK_DIR}/normalized_urls.txt" \
    > "${WORK_DIR}/crawl_ready.txt"
else
  cp "${WORK_DIR}/normalized_urls.txt" "${WORK_DIR}/crawl_ready.txt"
fi

echo ""
echo "Done. Final inventory: ${WORK_DIR}/crawl_ready.txt"
echo "  Total URLs ready for crawl: $(wc -l < "${WORK_DIR}/crawl_ready.txt")"

Line-by-line annotations for key decisions:

set -euo pipefail — exits immediately on any error, unset variable, or failed pipe stage, preventing silent partial inventories.
sort -u after each stage — deduplication at every boundary prevents duplicates from compounding across stages.
session.max_redirects = max_hops — bounds chain traversal; TooManyRedirects is caught and flagged rather than crashing the entire batch.
The sed -E parameter strip uses anchored group matches rather than a blanket \?.* to avoid removing essential query components such as locale tokens (?lang=en) or pagination (?page=2) when those are legitimately distinct pages. Extend the pattern list for your specific site before running.
The final scope validation guard uses grep -v to surface any URLs that slipped past the allowed_domains filter — common when sitemaps include CDN or subdomain URLs.

Verification & Smoke Test

Run these checks immediately after the script completes, before handing the inventory to the crawler. The storing and versioning crawl artifacts cluster covers how to archive these intermediate files for later diff comparisons.

#!/usr/bin/env bash
set -euo pipefail

TARGET_DOMAIN="${TARGET_DOMAIN:?}"
WORK_DIR="${WORK_DIR:-/tmp/url-hierarchy-${TARGET_DOMAIN}}"

echo "=== Hierarchy Mapping Smoke Test ==="

# 1. Inventory size sanity check — warn if fewer than 50 URLs (likely a sitemap parse failure)
url_count=$(wc -l < "${WORK_DIR}/crawl_ready.txt")
if (( url_count < 50 )); then
  echo "WARN: Only ${url_count} URLs in final inventory. Verify sitemap is reachable." >&2
fi
echo "[OK] Final inventory: ${url_count} URLs"

# 2. Confirm zero out-of-scope URLs remain
out_of_scope_count=$(grep -vc "^https://${TARGET_DOMAIN}/" "${WORK_DIR}/crawl_ready.txt" || echo 0)
if (( out_of_scope_count > 0 )); then
  echo "FAIL: ${out_of_scope_count} out-of-scope URLs remain in crawl_ready.txt" >&2
  exit 1
fi
echo "[OK] All URLs scoped to ${TARGET_DOMAIN}"

# 3. Spot-check 5 random URLs return HTTP 200
shuf -n 5 "${WORK_DIR}/crawl_ready.txt" | while IFS= read -r url; do
  http_code=$(curl -o /dev/null -sI -w "%{http_code}" --max-time 10 "${url}")
  if [[ "${http_code}" != "200" ]]; then
    echo "WARN: ${url} returned HTTP ${http_code}" >&2
  else
    echo "[OK] ${url} → 200"
  fi
done

# 4. Confirm no query strings survived that match known duplicate-generating params
if grep -qE '[?&](utm_source|sort|ref|gclid)=' "${WORK_DIR}/crawl_ready.txt"; then
  echo "FAIL: Tracking/faceted parameters found in normalized output." >&2
  grep -E '[?&](utm_source|sort|ref|gclid)=' "${WORK_DIR}/crawl_ready.txt" | head -5 >&2
  exit 1
fi
echo "[OK] No tracking or faceted parameters in normalized URLs"

echo "=== Smoke test passed ==="

Expected output on a healthy run:

=== Hierarchy Mapping Smoke Test ===
[OK] Final inventory: 3847 URLs
[OK] All URLs scoped to example.com
[OK] https://example.com/products/widget-a/ → 200
[OK] https://example.com/blog/post-42/ → 200
...
[OK] No tracking or faceted parameters in normalized URLs
=== Smoke test passed ===

Failure signal: any FAIL: line means do not proceed to the crawl. Either the sitemap parse failed, out-of-scope URLs leaked through, or parameter normalization is incomplete.

Failure Modes

Stage 1 produces fewer than 100 URLs on a known large site

The sitemap index URL may be non-standard or may redirect. Diagnose:

curl -sI "https://${TARGET_DOMAIN}/sitemap.xml" | grep -i 'location\|content-type'

If the response is a redirect, follow it manually and update the script's sitemap.xml path to the final URL. If Content-Type: text/html is returned, the sitemap does not exist at that path — check robots.txt for a Sitemap: directive:

curl -fsSL "https://${TARGET_DOMAIN}/robots.txt" | grep -i '^Sitemap:'

Redirect resolution hangs or times out on a large URL set

The default CONCURRENCY=8 can stall on sites that rate-limit parallel requests. Reduce concurrency and add a polite delay between requests. This is particularly relevant when you need to configure rate limiting to avoid overwhelming the origin during pre-crawl validation:

# Reduce workers and add 200ms inter-request sleep
export CONCURRENCY=3

# In the Python block, add after session.get():
# import time; time.sleep(0.2)

Normalized URL count is lower than expected after parameter stripping

Over-aggressive normalization may have removed essential locale or pagination tokens. Inspect what was removed:

# Find URLs present in resolved_200.txt but absent from normalized_urls.txt
comm -23 \
  <(sed 's/[?#].*//' "${WORK_DIR}/resolved_200.txt" | sort) \
  <(sed 's/[?#].*//' "${WORK_DIR}/normalized_urls.txt" | sort) \
  | head -20

If legitimate unique pages disappeared, add their parameter keys to a PARAM_ALLOWLIST variable and adjust the sed pattern to skip them.

FAQ

What if the site has a nested sitemap index with thousands of child sitemaps?

Fetch the root sitemap index first and extract all <sitemap><loc> entries — these are the child sitemap URLs, not page URLs. Then dispatch parallel curl jobs for each child and append their <loc> entries to sitemap_raw.txt. Use xargs -P 8 or GNU parallel to bound concurrency. After merging, run sort -u once across the entire combined output before proceeding to the robots.txt filter stage.

# Extract child sitemap URLs from a sitemap index
curl -fsSL "https://${TARGET_DOMAIN}/sitemap_index.xml" \
  | grep -oP '(?<=<sitemap><loc>)[^<]+' \
  | xargs -P 8 -I{} bash -c \
      'curl -fsSL "{}" | grep -oP "(?<=<loc>)[^<]+"' \
  | sort -u > "${WORK_DIR}/raw_urls.txt"

How do I handle sitemaps served behind authentication?

Export credentials as environment variables and pass them via curl headers. Never hard-code tokens in the script itself:

export SITEMAP_COOKIE="session=abc123xyz"
export SITEMAP_TOKEN="Bearer eyJhbGci..."

# Cookie-based auth
curl -fsSL -H "Cookie: ${SITEMAP_COOKIE}" "https://${TARGET_DOMAIN}/sitemap.xml" \
  | grep -oP '(?<=<loc>)[^<]+'

# Bearer token auth
curl -fsSL -H "Authorization: ${SITEMAP_TOKEN}" "https://${TARGET_DOMAIN}/sitemap.xml" \
  | grep -oP '(?<=<loc>)[^<]+'

How do I know if parameter normalization removed too many essential URLs?

Diff the normalized list against a reference crawl from a previous audit. If URLs that historically returned HTTP 200 with unique canonical tags are absent from the normalized set, their query parameters are load-bearing and should be added to an allowlist. A quick heuristic: if the normalized count is more than 30% lower than the resolved-200 count, inspect the removed URLs manually before proceeding.

# Count the reduction
before=$(wc -l < "${WORK_DIR}/resolved_200.txt")
after=$(wc -l < "${WORK_DIR}/normalized_urls.txt")
echo "Reduction: $(( (before - after) * 100 / before ))%"

Can I run hierarchy mapping against a staging environment?

Yes. Set TARGET_DOMAIN to the staging hostname and ensure robots.txt on staging does not contain a blanket Disallow: / rule — if it does, the robots filter stage will produce an empty file and every subsequent stage will fail silently. Override by passing SKIP_ROBOTS_FILTER=true and applying your own inclusion regex instead.

Defining Crawl Depth & Scope for Enterprise Sites — parent cluster covering the full scoping strategy this workflow feeds into
Managing Crawl Budget & Rate Limiting — configure token-bucket rate controls after the inventory is locked
Establishing Baseline Health Metrics for New Domains — use the normalized URL list as input when baselining a new domain for the first time
Risk Scoring Frameworks for Technical Debt — prioritize which sections of the hierarchy to audit first based on severity scoring

How to Map URL Hierarchies Before Running a Crawl #

Environment Isolation & Dependency Declaration #

Implementation #

Verification & Smoke Test #

Failure Modes #

FAQ #

Related #

How to Map URL Hierarchies Before Running a Crawl

Environment Isolation & Dependency Declaration

Implementation

Verification & Smoke Test

Failure Modes

FAQ

Related