Web Scraping Without Headless Browsers: A Practical Guide
Core principle: Before writing a single line of scraping code, spend a minute in the Network tab. Many modern sites load their data from JSON APIs in the background, and calling those endpoints directly is usually faster and simpler than rendering the page, and often harder to block.
Step 1: Find the Real Data Source
- Open your browser and navigate to the target page
- Press F12 to open DevTools
- Go to the Network tab
- Filter by Fetch/XHR
- Refresh the page, scroll, or click buttons that trigger data loading
- Look for requests returning JSON — these are your targets
What to look for:
- Responses with Content-Type: application/json
- Endpoints with patterns like /api/, /graphql, /v1/, /data/
- Requests that fire when you paginate, search, or filter
Tip: Click a request and check the Preview tab. If you can see your data structured there, you’ve found your endpoint.
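On a busy page, the Network tab can also be exported as a HAR file and filtered in code. The sketch below assumes a standard HAR capture already loaded as a dictionary; the function name json_endpoints is illustrative, and ranking by response size is only a heuristic for spotting the request that carries the page's data.

```python
from urllib.parse import urlsplit


def json_endpoints(har: dict) -> list[tuple[str, str, int]]:
    """Return (method, URL without query string, body size) for JSON responses."""
    found = set()
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime = content.get("mimeType", "")
        if "json" not in mime.lower():
            continue  # skip scripts, images, HTML, and so on
        request = entry.get("request", {})
        parts = urlsplit(request.get("url", ""))
        endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
        found.add((request.get("method", "GET"), endpoint, content.get("size", 0)))
    # Largest responses first: they often carry the data shown on the page.
    return sorted(found, key=lambda item: item[2], reverse=True)
```

To use it, load the exported file with json.load() and pass the result in; each returned tuple is a candidate endpoint for the curl test in Step 2.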
Step 2: Test It Directly — The curl Test
Right-click the request → Copy → Copy as cURL, then paste it in your terminal.

    # Example of what gets copied
    curl 'https://example.com/api/products?page=1' \
      -H 'accept: application/json' \
      -H 'user-agent: Mozilla/5.0 ...' \
      -H 'cookie: session=abc123; cf_clearance=xyz'

Did it work?
- Yes → The site is not rejecting clients based on their TLS fingerprint. Move to Step 3.
- No → The site may be fingerprinting your TLS handshake. Move to Step 4.
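If the copied command works, it helps to have its URL and headers as Python data for the later steps. The following is a minimal sketch using only the standard library; parse_curl is a hypothetical helper, and it assumes the bash-style output with the URL first and headers passed through -H. Other flags (request bodies, methods, $'...' quoting) are not handled.

```python
import shlex


def parse_curl(command: str) -> tuple[str, dict[str, str]]:
    """Extract the URL and headers from a bash-style "Copy as cURL" command."""
    # Join backslash-continued lines, then split using shell quoting rules.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    url = ""
    headers: dict[str, str] = {}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip().lower()] = value.strip()
            i += 2
        elif token.startswith("-"):
            i += 1  # other flags are ignored in this sketch
        else:
            url = url or token  # keep the first bare argument as the URL
            i += 1
    return url, headers
```

The returned dictionary can be passed directly as the headers argument of an HTTP client, which makes it easy to remove entries one at a time.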
Step 3: Whittle Down the Request
If the raw curl worked, start removing headers one by one to find the minimum viable request.
Removal order (start with the safest to remove):
- Generic browser headers (sec-ch-ua, sec-fetch-*, upgrade-insecure-requests)
- referer and origin
- x-* custom headers (test each carefully; some are required)
- Cookies (remove one at a time to identify which are essential)
- Auth headers (authorization, x-api-key): keep these if required
Goal: The leanest possible request that still returns data.
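The removal loop can be automated. This is a sketch under stated assumptions: minimize_headers and the still_works callback are hypothetical names, and the callback is expected to send the request with the given headers and report whether the response still contains the data. Headers are tried in the order they appear in the input, so list them in the removal order above.

```python
from typing import Callable, Mapping


def minimize_headers(
    headers: Mapping[str, str],
    still_works: Callable[[dict[str, str]], bool],
) -> dict[str, str]:
    """Greedily drop each header whose removal keeps the request working."""
    kept = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # header was not needed; leave it out from now on
    return kept


# One possible check (assumes the requests library and a JSON endpoint):
#   def still_works(h):
#       r = requests.get(URL, headers=h, timeout=10)
#       return r.ok and "json" in r.headers.get("content-type", "")
```

This treats the cookie header as a single unit; to find the essential cookies, the same loop can be run over the individual cookie names.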
    import requests

    response = requests.get(
        "https://example.com/api/products",
        params={"page": 1},
        headers={"accept": "application/json"},
        timeout=10,  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    data = response.json()

Also check: Does the endpoint require a Bearer token fetched at page load? Look for an earlier request to /auth, /token, or /session; you may need to grab that first.
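When a token is involved, the flow is two requests: one to obtain the token and one to use it. The sketch below is illustrative only. The function name, the token URL you pass in, and the access_token field are all assumptions, since every site names these differently; inspect the real token response in the Network tab first.

```python
from typing import Any


def fetch_with_bearer(session: Any, token_url: str, api_url: str, **params: Any) -> Any:
    """Get a token from token_url, then call api_url with it as a Bearer token.

    `session` is any object with a requests-style .get(), for example
    requests.Session(). The "access_token" field name is an assumption.
    """
    token_response = session.get(token_url, timeout=10)
    token_response.raise_for_status()
    token = token_response.json()["access_token"]  # assumed field name

    response = session.get(
        api_url,
        params=params,
        headers={"accept": "application/json", "authorization": f"Bearer {token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```

Passing the session in as an argument keeps the sketch independent of the HTTP client, so the same function works with a plain requests session or a curl_cffi one.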
Step 4: Handle TLS Fingerprinting with curl_cffi
If the raw curl didn’t work, the server is likely inspecting your TLS handshake to detect non-browser clients. Use curl_cffi to impersonate a real browser’s networking stack.
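    pip install curl_cffi

Then pass an impersonation target with the request:

    from curl_cffi import requests

    response = requests.get(
        "https://example.com/api/products",
        params={"page": 1},
        impersonate="chrome120",  # mimics Chrome's TLS fingerprint
    )
    data = response.json()

Available impersonation targets include chrome110, chrome120, safari17_0, and more. Target names are version-specific, so check the curl_cffi documentation for the list your installed version supports.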
Did it work?
- Yes → TLS fingerprinting is the likely cause. Use curl_cffi for all requests, then go back to Step 3 and minimize your headers.
- No → There is probably session-based logic at play. Move to Step 5.
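The escalation from a plain client to an impersonating one can be written as a small fallback chain. This is a sketch, not a fixed recipe: first_working, the client list, and the looks_ok check are hypothetical names, and what counts as an acceptable response depends on the site.

```python
from typing import Any, Callable, Sequence


def first_working(
    url: str,
    clients: Sequence[tuple[str, Callable[[str], Any]]],
    looks_ok: Callable[[Any], bool],
) -> tuple[str, Any]:
    """Try each (name, fetch) pair in order; return the first acceptable response."""
    errors = []
    for name, fetch in clients:
        try:
            response = fetch(url)
        except Exception as exc:  # network errors, connection resets, timeouts
            errors.append(f"{name}: {exc!r}")
            continue
        if looks_ok(response):
            return name, response
        errors.append(f"{name}: rejected")
    raise RuntimeError("no client worked: " + "; ".join(errors))


# Possible wiring (assumes both libraries are installed):
#   import requests
#   from curl_cffi import requests as cffi_requests
#   clients = [
#       ("requests", lambda u: requests.get(u, timeout=10)),
#       ("curl_cffi", lambda u: cffi_requests.get(u, impersonate="chrome120", timeout=10)),
#   ]
#   name, response = first_working(url, clients, lambda r: r.status_code == 200)
```

The returned name tells you which client succeeded, so you know which one to use for the rest of the scraper.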
Step 5: Emulate Sessions and Cookies
Some sites issue challenge cookies that must be obtained via a valid browser-like session. Strategies in order of complexity:
5a. Respect Set-Cookie headers
Some sites set a cookie on the first request that must be echoed back. Use a session object:

    from curl_cffi import requests

    session = requests.Session()
    session.get("https://example.com/")  # Triggers Set-Cookie
    data = session.get("https://example.com/api/products").json()

5b. Rotating cookies
Some sites invalidate cookies after each use and issue a new one in the response, so the cookie jar must be updated between requests. A session object does this for you as long as you reuse the same session:

    from curl_cffi import requests

    session = requests.Session()
    for page in range(1, 10):
        r = session.get(
            f"https://example.com/api/products?page={page}",
            impersonate="chrome120",
        )
        # The session stores any updated cookies from this response
        data = r.json()

5c. JavaScript challenge cookies (e.g., Cloudflare)
If the site issues a cf_clearance or similar JS-challenge cookie, it requires actual JavaScript execution to solve. This is where you finally need a headless browser, but only to obtain the cookie, not to scrape data:

    from curl_cffi import requests
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/")
        page.wait_for_timeout(3000)  # Let the challenge resolve
        cookies = page.context.cookies()
        browser.close()

    # Raises StopIteration if the challenge did not produce the cookie
    cf_clearance = next(c["value"] for c in cookies if c["name"] == "cf_clearance")

    # Now use the cookie in curl_cffi.
    # Clearance cookies are typically tied to the user agent and IP address
    # that solved the challenge, so send follow-up requests from the same ones.
    response = requests.get(
        "https://example.com/api/products",
        impersonate="chrome120",
        cookies={"cf_clearance": cf_clearance},
    )

Step 6: When There’s No API (Server-Side Rendered Pages)
If the Network tab shows no useful XHR/Fetch requests, the HTML is the data (server-rendered pages). In this case, parse the HTML directly:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://example.com/products", headers={"User-Agent": "Mozilla/5.0"})
    # "lxml" needs the lxml package; "html.parser" works without it
    soup = BeautifulSoup(r.text, "lxml")
    items = soup.select(".product-card .title")

Decision Flowchart

    Start
      │
      ▼
    Find XHR/Fetch request in Network tab
      │
      ├─ No useful request found ──────────────────────► Parse HTML with BeautifulSoup
      │
      ▼
    Copy as cURL → paste in terminal
      │
      ├─ Works ──► Whittle down headers ──► Use requests or httpx
      │
      ├─ Fails ──► Try curl_cffi with impersonate="chrome120"
      │               │
      │               ├─ Works ──► TLS fingerprinting confirmed. Use curl_cffi + minimize headers
      │               │
      │               └─ Fails ──► Session logic required
      │                               │
      │                               ├─ Set-Cookie emulation ──► requests.Session()
      │                               ├─ Rotating cookies ──────► Update jar each request
      │                               └─ JS challenge cookie ───► Playwright to get cookie, then curl_cffi
      │
      └─ All else fails ──────────────────────────────► Full headless browser (Playwright/Selenium)

Practical Reminders
- Rate limiting: Always add delays between requests (for example, time.sleep() for 1 to 3 seconds). Even open APIs may ban aggressive scrapers.
- Endpoint instability: Internal APIs have no stability guarantees, so your scraper can break without warning. Add error handling.
- Auth tokens: Look for an early request to /auth or /session that returns a Bearer token used in subsequent calls.
- Respect robots.txt: Check https://example.com/robots.txt before scraping. Honor it where appropriate.
- Legal: Terms of service vary from site to site. Always check the site’s ToS and applicable laws before scraping at scale.
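Three of these reminders (delays, error handling, and robots.txt) can be combined in a small helper. The following is a minimal sketch using only the standard library. The names allowed_by_robots and polite_fetch, the default delays, and the retry count are illustrative choices; the fetch callable stands in for whichever client worked in the earlier steps.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar
from urllib.robotparser import RobotFileParser

T = TypeVar("T")


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[tuple[str, T]]:
    """Yield (url, result) pairs, pausing between requests and retrying failures."""
    for url in urls:
        for attempt in range(retries):
            try:
                result = fetch(url)
            except Exception:
                if attempt == retries - 1:
                    raise  # give up after the last attempt
                sleep(min_delay * 2 ** attempt)  # exponential backoff before retrying
            else:
                yield url, result
                break
        sleep(random.uniform(min_delay, max_delay))  # randomized delay between URLs
```

The sleep parameter exists so the delays can be replaced in tests; in normal use the default time.sleep applies.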
Quick Reference: Tools by Use Case
| Situation | Tool |
|---|---|
| Simple API, no auth | requests / httpx |
| TLS fingerprinting detected | curl_cffi |
| Session/cookie handling | requests.Session() or curl_cffi Session |
| JS challenge (Cloudflare, etc.) | playwright (for cookie only) + curl_cffi |
| Server-rendered HTML | requests + BeautifulSoup / lxml |
| Full JS rendering required | playwright or selenium |