Anti-bot posture
How we look human to malicious sites — without looking like a scanner.
Phishing kits routinely cloak against well-known scanners by keying on the signals those scanners leak: urlscan IP ranges, vanilla Chromium fingerprints, and CDP leaks. This page lays out the stack we use so it can't be misrepresented and so customers know what they're buying.
The stack, layer by layer
Patchright patchright==1.49.0
A maintained fork of Playwright that suppresses the anti-bot tells the
upstream client leaves behind: navigator.webdriver, the
CDP Runtime.enable probe, leftover Selenium cdc_*
artifacts, and closed-shadow-DOM detection. The API is 1:1 with
Playwright — pinning patchright + playwright + the worker image's
Chromium together is part of our release process.
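A minimal sketch of what a release-time pin check could look like, assuming the pins live in a plain mapping. Only the patchright pin is taken from this page; the function name and the injected version lookup are illustrative, not our actual release tooling.

```python
from importlib import metadata
from typing import Callable, Dict, List


def find_pin_mismatches(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return one readable line per package whose installed version
    differs from its pin, or that is not installed at all."""
    problems = []
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (want {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{package}: installed {installed}, want {wanted}")
    return problems


# Only the patchright pin comes from this page; a real release check would
# also list the playwright and Chromium versions it is pinned against.
PINS = {"patchright": "1.49.0"}
```

An empty list means the environment matches the pins; anything else is a reason to stop the release.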
Coherent fingerprints browserforge==1.2.3
Every scan gets a fresh, self-consistent fingerprint bundle: User-Agent, Accept-Language, sec-ch-ua headers, platform, viewport, and locale, all matched the way a real browser would emit them. Mismatched fingerprints (for example, a Chrome User-Agent and navigator platform that say Linux while sec-ch-ua-platform says Windows) are exactly what bot detectors look for.
- One fingerprint per scan, never reused
- Real-world distributions from browserforge's corpus
- Hand-built fallback if the library is unavailable
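To make "self-consistent" concrete, here is a rough sketch of the kind of coherence check a hand-built fallback could run before a fingerprint is used. The Fingerprint class, the field names, and the desktop-only OS table are assumptions for illustration; this is not browserforge's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fingerprint:
    user_agent: str
    navigator_platform: str   # e.g. "Win32", "MacIntel", "Linux x86_64"
    sec_ch_ua_platform: str   # e.g. "Windows", "macOS", "Linux"
    accept_language: str      # e.g. "en-US,en;q=0.9"
    locale: str               # e.g. "en-US"


# Desktop only. OS family -> (token expected in the UA, navigator.platform prefix)
_OS_MARKERS = {
    "Windows": ("Windows NT", "Win"),
    "macOS": ("Macintosh", "Mac"),
    "Linux": ("Linux", "Linux"),
}


def coherence_problems(fp: Fingerprint) -> List[str]:
    """Return the mismatches a bot detector could key on (empty if coherent)."""
    markers = _OS_MARKERS.get(fp.sec_ch_ua_platform)
    if markers is None:
        return [f"unknown sec-ch-ua-platform: {fp.sec_ch_ua_platform!r}"]
    ua_token, platform_prefix = markers
    problems = []
    if ua_token not in fp.user_agent:
        problems.append("User-Agent OS does not match sec-ch-ua-platform")
    if not fp.navigator_platform.startswith(platform_prefix):
        problems.append("navigator.platform does not match sec-ch-ua-platform")
    primary_language = fp.accept_language.split(",")[0].strip()
    if primary_language != fp.locale:
        problems.append("Accept-Language does not lead with the locale")
    return problems
```

A bundle that returns any problem should be discarded and regenerated rather than patched field by field.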
WebRTC leak prevention
A common scanner-detection trick is forcing the browser to expose its
real IP via WebRTC's STUN signaling even when an HTTP proxy is in place.
We pin the WebRTC IP-handling policy to disable_non_proxied_udp
and force it via Chromium flags so the renderer can't override.
Chromium hardening flags
These are the launch flags we use to bring our process closer to a fresh desktop
Chrome and away from the vanilla Playwright defaults. The full list is
in scraper/stealth_flags.py — highlights:
--disable-blink-features=AutomationControlled
--fingerprinting-canvas-image-data-noise
--font-render-hinting=none
--webrtc-ip-handling-policy=disable_non_proxied_udp
--force-webrtc-ip-handling-policy
--blink-settings=primaryHoverType=2,primaryPointerType=4
--disable-features=IsolateOrigins,site-per-process
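As an illustration of how such a list might be assembled for launch, here is a small sketch. It does not reproduce scraper/stealth_flags.py; it only carries the highlighted flags, and the launch_args helper is a hypothetical name.

```python
from typing import List, Sequence

# Highlights copied from the list above; the real module holds more.
STEALTH_FLAGS: List[str] = [
    "--disable-blink-features=AutomationControlled",
    "--fingerprinting-canvas-image-data-noise",
    "--font-render-hinting=none",
    "--webrtc-ip-handling-policy=disable_non_proxied_udp",
    "--force-webrtc-ip-handling-policy",
    "--blink-settings=primaryHoverType=2,primaryPointerType=4",
    "--disable-features=IsolateOrigins,site-per-process",
]


def launch_args(extra: Sequence[str] = ()) -> List[str]:
    """Stealth flags plus per-deployment extras, de-duplicated in order."""
    seen = set()
    merged = []
    for flag in [*STEALTH_FLAGS, *extra]:
        if flag not in seen:
            seen.add(flag)
            merged.append(flag)
    return merged
```

With a Playwright-compatible client the result would typically be handed to the launcher's args parameter, for example chromium.launch(args=launch_args()).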
Turnstile / challenge handling
Optional, opt-in per scan. When enabled, we detect Cloudflare Turnstile widgets and (where legal and within scope) click through to capture the post-challenge page. We always log what happened — clicked / detected / not detected / error — so the scan record is auditable.
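The four logged outcomes map naturally onto a small state function. The sketch below assumes the detection and click steps are passed in as callables and that the record is a JSON line; the names handle_challenge and audit_line are illustrative, not our internal API.

```python
import json
from enum import Enum
from typing import Callable


class ChallengeOutcome(Enum):
    CLICKED = "clicked"
    DETECTED = "detected"
    NOT_DETECTED = "not detected"
    ERROR = "error"


def handle_challenge(
    detect: Callable[[], bool],
    click: Callable[[], bool],
    click_allowed: bool,
) -> ChallengeOutcome:
    """Classify what happened; never let a challenge failure abort the scan."""
    try:
        if not detect():
            return ChallengeOutcome.NOT_DETECTED
        # Clicking only happens when the scan opted in and it is in scope.
        if click_allowed and click():
            return ChallengeOutcome.CLICKED
        return ChallengeOutcome.DETECTED
    except Exception:
        return ChallengeOutcome.ERROR


def audit_line(scan_id: str, outcome: ChallengeOutcome) -> str:
    """Serialize the outcome so it can be attached to the scan record."""
    return json.dumps({"scan_id": scan_id, "turnstile": outcome.value}, sort_keys=True)
```

Returning an outcome for every path, including the error path, is what keeps the scan record auditable.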
Per-scan ephemeral browser context
Every scan gets a fresh BrowserContext. No cookie carryover,
no storage state, no shared service workers between scans — so a sticky
fingerprint cookie can't follow our scanner around.
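In Playwright-style code the isolation described here comes down to creating a context per scan and always closing it. A minimal sketch, assuming the synchronous API and a browser object that is already launched; fresh_context and scan are hypothetical helper names.

```python
from contextlib import contextmanager


@contextmanager
def fresh_context(browser, **context_options):
    """Yield a brand-new BrowserContext and always close it afterwards,
    so cookies, storage and service workers die with the scan."""
    context = browser.new_context(**context_options)
    try:
        yield context
    finally:
        context.close()


def scan(browser, url: str) -> str:
    """Render one URL in its own context and return the final HTML."""
    with fresh_context(browser) as context:
        page = context.new_page()
        page.goto(url)
        return page.content()
```

Because the close sits in a finally block, a navigation error or timeout still tears the context down instead of leaking state into the next scan.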
Response-body capture inside the renderer
We intercept HTTP responses inside the browser context and capture JS, XHR, fetch and document bodies under a per-resource (1 MB) and total (8 MB) cap. That gets fed straight to YARA — so obfuscated JS that only materializes after the page renders is still in scope, not just the raw HTML the network would see.
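The two caps can be sketched as a small collector. This assumes 1 MB means 1024 * 1024 bytes and that an oversized body is truncated rather than skipped; both are assumptions for illustration, and BodyCollector is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PER_RESOURCE_CAP = 1 * 1024 * 1024   # 1 MB per response body (assumed binary MB)
TOTAL_CAP = 8 * 1024 * 1024          # 8 MB across the whole scan


@dataclass
class BodyCollector:
    per_resource_cap: int = PER_RESOURCE_CAP
    total_cap: int = TOTAL_CAP
    total: int = 0
    bodies: List[Tuple[str, bytes]] = field(default_factory=list)

    def add(self, url: str, body: bytes) -> bool:
        """Keep the body, truncated to both caps.
        Returns False once the total budget is already spent."""
        remaining = self.total_cap - self.total
        if remaining <= 0:
            return False
        kept = body[: min(self.per_resource_cap, remaining)]
        self.bodies.append((url, kept))
        self.total += len(kept)
        return True
```

In a Playwright-style client, a handler registered with page.on("response", ...) could call response.body() and pass the bytes to add(); the collected bodies are then what the YARA pass would scan.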
Hosted-API fallback for AI reviews
The on-prem AI is the unit-economics moat in our pitch,
but a single Spark outage shouldn't leave gaps in verdicts. When configured,
the gateway falls back to a hosted OpenAI-compatible endpoint
(Anthropic, OpenAI) the moment the on-prem path fails or returns
malformed JSON. Every review carries a served_by field
(primary or fallback) so the audit trail is
honest about which backend actually answered, and the gateway logs an
ai.fallback_invoked line on every hit so the fallback can
never become the silent default. Disabled by default; turn on per
deployment via env.
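The fallback rule reads roughly as follows in code. This is a sketch under assumptions: both backends are modelled as callables that return a JSON string, a valid review is a JSON object, and the function names are illustrative rather than the gateway's real interface.

```python
import json
import logging
from typing import Callable, Dict, Optional

log = logging.getLogger("gateway")


def _parse(raw: str) -> Dict:
    """Parse a backend reply; anything that is not a JSON object is malformed."""
    review = json.loads(raw)
    if not isinstance(review, dict):
        raise ValueError("review is not a JSON object")
    return review


def run_review(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]] = None,
) -> Dict:
    """Ask the on-prem backend first; use the hosted one only if configured."""
    try:
        review = _parse(primary(prompt))
        served_by = "primary"
    except Exception as exc:
        if fallback is None:   # disabled by default: surface the failure
            raise
        log.warning("ai.fallback_invoked reason=%s", type(exc).__name__)
        review = _parse(fallback(prompt))
        served_by = "fallback"
    review["served_by"] = served_by
    return review
```

Logging on every fallback hit, rather than once per outage, is what makes it possible to alert when the fallback starts carrying routine traffic.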
Residential / mobile proxy support — wired, opt-in
The worker accepts SCRAPE_PROXY_URL in the form
scheme://user:pass@host:port. When set, every Chromium
context routes through it - the kit's anti-bot stack sees a real
consumer egress instead of a datacenter range. Bright Data,
Oxylabs, Smartproxy, and any other provider endpoint that
speaks HTTP/SOCKS work out of the box. We don't bundle a provider:
operators bring their own plan and pin per-country via the URL
user prefix the provider issues. SCRAPE_PROXY_ROTATING
flags whether we're on a rotating-egress plan (default) or a
static one. Empty by default - LAN dev and unauthenticated demos
keep their datacenter fingerprint.
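A sketch of how the two variables could be turned into browser settings. The output shape follows Playwright's proxy option (server, username, password); the function names and the set of values accepted for SCRAPE_PROXY_ROTATING are assumptions.

```python
import os
from typing import Dict, Mapping, Optional
from urllib.parse import unquote, urlsplit


def proxy_settings(env: Mapping[str, str] = os.environ) -> Optional[Dict[str, str]]:
    """Translate SCRAPE_PROXY_URL into a Playwright-style proxy dict.
    Returns None when the variable is empty (direct egress)."""
    raw = env.get("SCRAPE_PROXY_URL", "").strip()
    if not raw:
        return None
    parts = urlsplit(raw)
    if not parts.scheme or not parts.hostname or not parts.port:
        # Deliberately does not echo the value: it contains credentials.
        raise ValueError("SCRAPE_PROXY_URL must look like scheme://user:pass@host:port")
    settings = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


def is_rotating(env: Mapping[str, str] = os.environ) -> bool:
    """Rotating egress is the default; the accepted 'off' spellings are assumed."""
    value = env.get("SCRAPE_PROXY_ROTATING", "true").strip().lower()
    return value not in {"0", "false", "no", "off"}
```

The resulting dict can be passed as the proxy argument when a context is created, which keeps the credentials out of the server string. The per-country prefix a provider issues simply travels in the username.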
Residential egress — opt-in per scan (paid)
Set residential: true on POST /scan to route
that scan through a residential proxy pool (via Evomi) - the kit sees a
real consumer exit IP instead of a datacenter range, defeating the
datacenter/scanner IP cloaking kits use against known ranges (including
urlscan's). Rotating exit IP per request. Per
COMPETITORS.md §4
this is non-negotiable for paid-scan traffic. Paid feature - non-paying
requests silently run direct. Defeats IP-based cloaking, but not
cookie-gated cloaking (see cookie injection).
Custom User-Agent — opt-in per scan
Override the scraper's User-Agent with user_agent on
POST /scan: a predefined slug (see
GET /user-agents; open to all plans) - desktop
Chrome/Edge/Firefox/Safari, mobile, or Googlebot - or an arbitrary custom
string (paid; non-paying falls back to the default fingerprint). Useful as
a cloaking probe, since many kits serve different content to Googlebot than
to a normal browser.
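The plan gating described above can be sketched as a single resolution step. The slug name and the resolve_user_agent helper are hypothetical; the real slug list is whatever GET /user-agents returns.

```python
from typing import Mapping, Optional

# Hypothetical slug table for illustration only.
EXAMPLE_SLUGS = {
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}


def resolve_user_agent(
    requested: Optional[str],
    is_paid: bool,
    slugs: Mapping[str, str],
) -> Optional[str]:
    """Return the User-Agent to force, or None to keep the default fingerprint."""
    if not requested:
        return None
    if requested in slugs:      # predefined slugs are open to all plans
        return slugs[requested]
    if is_paid:                 # arbitrary strings are a paid feature
        return requested
    return None                 # non-paying custom string: default fingerprint
```

Scanning the same URL once with the default fingerprint and once with a crawler slug, then comparing the two results, is the cloaking probe the text describes.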
Cookie injection — opt-in per scan (paid)
Pass cookies on POST /scan (a
name=value; … string or a list of
{name,value,domain?,path?}) to set cookies before navigation.
This replays a kit's own session cookies to defeat cookie-gated
cloaking - a site that serves a benign decoy to a fresh context and the
real phish only when a valid session cookie is present. We also capture the
cookies a kit sets, so the scan-detail page offers a one-click
"Re-scan with captured cookies". Paid feature.
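Both accepted input shapes can be normalized into one list before they are set on the context. In this sketch, defaulting a missing domain to the target host and a missing path to "/" is an assumption, and normalize_cookies is a hypothetical helper name.

```python
from typing import Dict, List, Union
from urllib.parse import urlsplit

CookieInput = Union[str, List[Dict[str, str]]]


def normalize_cookies(cookies: CookieInput, target_url: str) -> List[Dict[str, str]]:
    """Accept 'name=value; name2=value2' or a list of cookie dicts and
    return dicts that always carry name, value, domain and path."""
    host = urlsplit(target_url).hostname or ""
    if isinstance(cookies, str):
        items = []
        for pair in cookies.split(";"):
            name, sep, value = pair.strip().partition("=")
            if sep and name:   # skip empty or malformed fragments
                items.append({"name": name.strip(), "value": value.strip()})
    else:
        items = [dict(cookie) for cookie in cookies]   # do not mutate caller data
    for item in items:
        item.setdefault("domain", host)
        item.setdefault("path", "/")
    return items
```

The output matches the shape Playwright's context.add_cookies() expects, which requires either a url or a domain and path on every cookie, so it can be applied before the first navigation.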
What we deliberately don't do
We don't try to be a real user
We're a scanner. We don't simulate mouse movement, scroll behaviour, or sustained user interaction. That arms race is endless and not where scanning ROI lives. Our threshold is "make sure the page renders realistically and doesn't trip lazy bot-checks" — not "fool a banking trojan's behavioral analytics".
We don't bypass authenticated paywalls or login walls
The Turnstile clicker is opt-in and only used for the public-facing challenge layer. Anything that requires credentials is out of scope by design — it's a different threat model and a different legal posture.
Try it on a sticky page
If our stealth posture matters for your use case, paste a URL that's given other scanners trouble — see what we get back.