Anti-bot posture

How we look human to malicious sites — without looking like a scanner.

Phishing kits routinely cloak against well-known scanners by checking for urlscan IP ranges, vanilla Chromium fingerprints, and CDP leaks. This page lays out the stack we use so it can't be misrepresented and so customers know what they're buying.

The stack, layer by layer

Patchright patchright==1.49.0

A maintained fork of Playwright that suppresses the anti-bot tells the upstream client leaves behind: navigator.webdriver, the CDP Runtime.enable probe, leftover Selenium cdc_* artifacts, and closed-shadow-DOM detection. The API is 1:1 with Playwright — pinning patchright + playwright + the worker image's Chromium together is part of our release process.

Coherent fingerprints browserforge==1.2.3

Every scan gets a fresh, self-consistent fingerprint bundle: User-Agent, Accept-Language, sec-ch-ua headers, platform, viewport, and locale, all matched the way a real browser would emit them. Mismatched fingerprints (for example, a Chrome UA that claims Linux while sec-ch-ua-platform says Windows) are exactly what bot detectors look for.

One fingerprint per scan, never reused
Real-world distributions from browserforge's corpus
Hand-built fallback if the library is unavailable
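As an illustration of what "self-consistent" means, here is a minimal coherence check. The bundle keys (user_agent, sec_ch_ua_platform, accept_language, locale) and both function names are hypothetical and are not browserforge's output format; the sketch only shows the kind of cross-field comparison a bot detector might run.

```python
def platform_from_user_agent(user_agent: str) -> str:
    """Guess the platform a User-Agent string claims (coarse, illustrative)."""
    if "Android" in user_agent:          # Android UAs also contain "Linux"
        return "Android"
    if "iPhone" in user_agent or "iPad" in user_agent:
        return "iOS"
    if "Windows NT" in user_agent:
        return "Windows"
    if "Macintosh" in user_agent:
        return "macOS"
    if "Linux" in user_agent:
        return "Linux"
    return "Unknown"


def find_mismatches(bundle: dict) -> list[str]:
    """Return human-readable inconsistencies in a hypothetical fingerprint bundle."""
    problems = []

    # The UA and the client-hint platform header must tell the same story.
    ua_platform = platform_from_user_agent(bundle["user_agent"])
    hinted = bundle["sec_ch_ua_platform"].strip('"')
    if ua_platform != hinted:
        problems.append(f"UA implies {ua_platform}, sec-ch-ua-platform says {hinted}")

    # The first Accept-Language tag should match the locale the page observes.
    primary = bundle["accept_language"].split(",")[0].split(";")[0].strip()
    if primary.lower() != bundle["locale"].lower():
        problems.append(f"Accept-Language starts with {primary}, locale is {bundle['locale']}")

    return problems
```

A bundle that passes returns an empty list; each entry otherwise names one field pair that disagrees.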

WebRTC leak prevention

A common scanner-detection trick is to make the browser reveal its real IP through WebRTC's STUN requests, which travel over UDP and can bypass an HTTP proxy. We pin the WebRTC IP-handling policy to disable_non_proxied_udp and force it via Chromium flags so the renderer can't override it.

Chromium hardening flags

These launch flags bring our process closer to a fresh desktop Chrome and away from the vanilla Playwright defaults. The full list is in scraper/stealth_flags.py; highlights:

--disable-blink-features=AutomationControlled
--fingerprinting-canvas-image-data-noise
--font-render-hinting=none
--webrtc-ip-handling-policy=disable_non_proxied_udp
--force-webrtc-ip-handling-policy
--blink-settings=primaryHoverType=2,primaryPointerType=4
--disable-features=IsolateOrigins,site-per-process
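A minimal sketch of how the highlighted flags could be assembled into a launch argument list. This is not the contents of scraper/stealth_flags.py, which is not shown here; STEALTH_FLAGS and build_launch_args are hypothetical names, and the guard simply refuses any argument set that would weaken the WebRTC policy.

```python
# Only the highlights listed above; the real file may contain more.
STEALTH_FLAGS = [
    "--disable-blink-features=AutomationControlled",
    "--fingerprinting-canvas-image-data-noise",
    "--font-render-hinting=none",
    "--webrtc-ip-handling-policy=disable_non_proxied_udp",
    "--force-webrtc-ip-handling-policy",
    "--blink-settings=primaryHoverType=2,primaryPointerType=4",
    "--disable-features=IsolateOrigins,site-per-process",
]

WEBRTC_POLICY = "--webrtc-ip-handling-policy=disable_non_proxied_udp"


def build_launch_args(extra=()):
    """Merge extra flags into the stealth set, de-duplicated in first-seen order."""
    args = list(dict.fromkeys([*STEALTH_FLAGS, *extra]))

    # Exactly one WebRTC policy flag may be present, and it must be the pinned one.
    policies = [a for a in args if a.startswith("--webrtc-ip-handling-policy=")]
    if policies != [WEBRTC_POLICY] or "--force-webrtc-ip-handling-policy" not in args:
        raise ValueError("WebRTC IP-handling policy must stay pinned and forced")
    return args
```

With Patchright or Playwright the result would typically be passed as the args parameter of chromium.launch().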

Turnstile / challenge handling

Optional, opt-in per scan. When enabled, we detect Cloudflare Turnstile widgets and (where legal and within scope) click through to capture the post-challenge page. We always log what happened — clicked / detected / not detected / error — so the scan record is auditable.
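One way to keep that audit trail unambiguous is to model the four outcomes as a closed set. The sketch below is an assumption about shape, not our actual log schema: the scan_id and turnstile keys and the challenge_log_line helper are hypothetical, and "detected" is taken to mean "widget seen but not clicked".

```python
import json
from enum import Enum
from typing import Optional


class ChallengeOutcome(str, Enum):
    CLICKED = "clicked"
    DETECTED = "detected"          # assumed meaning: widget seen, not clicked
    NOT_DETECTED = "not_detected"
    ERROR = "error"


def challenge_log_line(scan_id: str, widget_found: bool, clicked: bool,
                       error: Optional[str] = None) -> str:
    """Serialize exactly one outcome per scan as a JSON log line."""
    if error is not None:
        outcome = ChallengeOutcome.ERROR
    elif not widget_found:
        outcome = ChallengeOutcome.NOT_DETECTED
    elif clicked:
        outcome = ChallengeOutcome.CLICKED
    else:
        outcome = ChallengeOutcome.DETECTED

    record = {"scan_id": scan_id, "turnstile": outcome.value}
    if error is not None:
        record["error"] = error
    return json.dumps(record, sort_keys=True)
```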

Per-scan ephemeral browser context

Every scan gets a fresh BrowserContext. No cookie carryover, no storage state, no shared service workers between scans — so a sticky fingerprint cookie can't follow our scanner around.
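The isolation pattern can be sketched as below. The function name and the capture callback are hypothetical; browser is assumed to be a Patchright/Playwright sync-API Browser, whose new_context(), new_page(), goto() and close() methods are the only calls used.

```python
def scan_in_fresh_context(browser, url, capture):
    """Run one scan in its own BrowserContext and always dispose of it."""
    # A new context starts with no cookies, no storage and no service workers.
    context = browser.new_context()
    try:
        page = context.new_page()
        page.goto(url)
        return capture(page)   # caller-supplied function that extracts results
    finally:
        # Closing the context discards any state the page managed to set.
        context.close()
```

Because the close happens in a finally block, a scan that raises mid-navigation still cannot leak state into the next one.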

Response-body capture inside the renderer

We intercept HTTP responses inside the browser context and capture JS, XHR, fetch and document bodies under a per-resource (1 MB) and total (8 MB) cap. That gets fed straight to YARA — so obfuscated JS that only materializes after the page renders is still in scope, not just the raw HTML the network would see.
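The two caps can be sketched as a small budget object. Two assumptions are baked in and may differ from our implementation: 1 MB is treated as 1024 × 1024 bytes, and an oversized body is truncated rather than dropped. BodyBudget is a hypothetical name.

```python
PER_RESOURCE_CAP = 1 * 1024 * 1024   # assumed: 1 MB means 1 MiB
TOTAL_CAP = 8 * 1024 * 1024          # assumed: 8 MB means 8 MiB


class BodyBudget:
    """Tracks how many response bytes a single scan may still keep."""

    def __init__(self, per_resource: int = PER_RESOURCE_CAP, total: int = TOTAL_CAP):
        self.per_resource = per_resource
        self.remaining = total

    def admit(self, body: bytes) -> bytes:
        """Return the prefix of body that fits under both caps (may be empty)."""
        kept = body[: min(self.per_resource, self.remaining)]
        self.remaining -= len(kept)
        return kept
```

In a Playwright-style worker, a handler registered with page.on("response", ...) could pass response.body() through admit() before handing the kept bytes to YARA.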

Hosted-API fallback for AI reviews

The on-prem AI is the unit-economics moat in our pitch, but a single Spark outage shouldn't leave gaps in verdicts. When configured, the gateway falls back to a hosted OpenAI-compatible endpoint (Anthropic, OpenAI) the moment the on-prem path fails or returns malformed JSON. Every review carries a served_by field (primary or fallback) so the audit trail is honest about which backend actually answered, and the gateway logs an ai.fallback_invoked line on every hit so the fallback can never become the silent default. Disabled by default; turn on per deployment via env.
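The fallback rule can be sketched as follows. This is not the gateway's code: review_with_fallback, the callable-based backends and the logger name are hypothetical, and only the served_by field and the ai.fallback_invoked log marker come from the description above.

```python
import json
import logging

logger = logging.getLogger("gateway")


def review_with_fallback(prompt, primary, fallback=None):
    """Call the on-prem backend; on failure or malformed JSON, try the fallback.

    primary and fallback are callables that take a prompt and return a JSON string.
    """
    try:
        verdict = json.loads(primary(prompt))
        if not isinstance(verdict, dict):
            raise ValueError("review must be a JSON object")
        return {**verdict, "served_by": "primary"}
    except Exception as exc:
        if fallback is None:       # fallback is disabled by default
            raise
        # Logged on every hit so fallback usage is never silent.
        logger.warning("ai.fallback_invoked reason=%s", type(exc).__name__)
        verdict = json.loads(fallback(prompt))
        return {**verdict, "served_by": "fallback"}
```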

Residential / mobile proxy support — wired, opt-in

The worker accepts SCRAPE_PROXY_URL in the form scheme://user:pass@host:port. When set, every Chromium context routes through it, so the kit's anti-bot stack sees a real consumer egress instead of a datacenter range. Bright Data, Oxylabs, Smartproxy, and any other provider endpoint that speaks HTTP or SOCKS work out of the box. We don't bundle a provider: operators bring their own plan and pin per-country via the URL user prefix the provider issues. SCRAPE_PROXY_ROTATING flags whether we're on a rotating-egress plan (default) or a static one. Empty by default, so LAN dev and unauthenticated demos keep their datacenter fingerprint.
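A minimal sketch of turning that URL into the server/username/password dictionary that Playwright-style APIs accept as a proxy setting. The proxy_settings helper is hypothetical, and SCRAPE_PROXY_ROTATING is left out because its value format is not specified here.

```python
import os
from urllib.parse import unquote, urlsplit


def proxy_settings(env=os.environ):
    """Parse SCRAPE_PROXY_URL into a proxy dict, or None when it is unset."""
    raw = env.get("SCRAPE_PROXY_URL", "").strip()
    if not raw:
        return None   # empty by default: direct egress

    parts = urlsplit(raw)
    if not parts.scheme or not parts.hostname or parts.port is None:
        # Never echo the raw URL here: it may contain credentials.
        raise ValueError("SCRAPE_PROXY_URL must look like scheme://user:pass@host:port")

    settings = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings
```

Keeping credentials out of the server field means the proxy host can be logged without exposing the password.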

Residential egress — opt-in per scan (paid)

Set residential: true on POST /scan to route that scan through a residential proxy pool (via Evomi). The kit sees a real consumer exit IP instead of a datacenter range, which defeats the IP-based cloaking that kits apply to known datacenter and scanner ranges (including urlscan's). The exit IP rotates per request. Per COMPETITORS.md §4 this is non-negotiable for paid-scan traffic. This is a paid feature: non-paying requests silently run direct. It defeats IP-based cloaking, but not cookie-gated cloaking (see cookie injection).

Custom User-Agent — opt-in per scan

Override the scraper's User-Agent with user_agent on POST /scan: either a predefined slug (see GET /user-agents; open to all plans) covering desktop Chrome/Edge/Firefox/Safari, mobile, or Googlebot, or an arbitrary custom string (paid; non-paying requests fall back to the default fingerprint). Useful as a cloaking probe, since many kits serve different content to Googlebot than to a normal browser.
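A cloaking probe amounts to scanning the same URL twice and comparing what comes back. The sketch below only builds the two request bodies; the url field name and the "googlebot" slug are assumptions (check GET /user-agents for the real slugs), and cloaking_probe_bodies is a hypothetical helper.

```python
import json


def cloaking_probe_bodies(url: str, probe_slug: str = "googlebot") -> list[str]:
    """Build two POST /scan bodies: default fingerprint, then an overridden UA."""
    baseline = {"url": url}                          # "url" key is assumed
    probe = {"url": url, "user_agent": probe_slug}   # slug value is assumed
    return [json.dumps(baseline), json.dumps(probe)]
```

If the two scans return materially different pages, the kit is likely keying on the User-Agent.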

Cookie injection — opt-in per scan (paid)

Pass cookies on POST /scan (a name=value; … string or a list of {name,value,domain?,path?}) to set cookies before navigation. This replays a kit's own session cookies to defeat cookie-gated cloaking, where a site serves a benign decoy to a fresh context and the real phish only when a valid session cookie is present. We also capture the cookies a kit sets, so the scan-detail page offers a one-click "Re-scan with captured cookies". Paid feature.
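Both accepted input shapes can be normalized to one list before the cookies are set. The sketch below is an assumption about how that might look: normalize_cookies is a hypothetical helper, and defaulting a missing domain to the scan target's host and a missing path to "/" is our guess at sensible behavior, not documented API semantics.

```python
def normalize_cookies(cookies, default_domain: str) -> list[dict]:
    """Accept a 'name=value; ...' string or a list of dicts; return full dicts."""
    if isinstance(cookies, str):
        items = []
        for pair in cookies.split(";"):
            # partition keeps any '=' that appears inside the value itself
            name, sep, value = pair.strip().partition("=")
            if sep and name:
                items.append({"name": name, "value": value})
    else:
        items = [dict(c) for c in cookies]

    return [
        {
            "name": c["name"],
            "value": c["value"],
            "domain": c.get("domain") or default_domain,   # assumed default
            "path": c.get("path") or "/",                  # assumed default
        }
        for c in items
    ]
```

The resulting name/value/domain/path dictionaries match the shape Playwright's BrowserContext.add_cookies() expects.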

What we deliberately don't do

We don't try to be a real user

We're a scanner. We don't simulate mouse movement, scroll behavior, or sustained user interaction. That arms race is endless and not where scanning ROI lives. Our threshold is "make sure the page renders realistically and doesn't trip lazy bot-checks", not "fool a banking trojan's behavioral analytics".

We don't bypass authenticated paywalls or login walls

The Turnstile clicker is opt-in and only used for the public-facing challenge layer. Anything that requires credentials is out of scope by design — it's a different threat model and a different legal posture.

Why publish this? Because security software that can't show its work is hard to trust, and because phishing-kit authors already know what urlscan looks like. The posture above is what we sign up for — push us on it.

Honesty box. No anti-bot stack is perfect. Determined adversaries with fresh kits will sometimes cloak against us too. When that happens we want to know: a scan shows a "thin scan" warning in the UI whenever the captured JS is suspiciously small. Tell us at hello@scrapesmith.io.

Try it on a sticky page

If our stealth posture matters for your use case, paste a URL that's given other scanners trouble — see what we get back.
