
Web Scraping Without Headless Browsers: A Practical Guide

Core principle: Before writing a single line of scraping code, spend 1 minute in the Network tab. Most modern sites fetch data from clean JSON APIs in the background — hitting those directly is faster, simpler, and harder to block.


Step 1: Find the XHR/Fetch Request in the Network Tab

  1. Open your browser and navigate to the target page
  2. Press F12 to open DevTools
  3. Go to the Network tab
  4. Filter by Fetch/XHR
  5. Refresh the page, scroll, or click buttons that trigger data loading
  6. Look for requests returning JSON — these are your targets

What to look for:

  • Responses with Content-Type: application/json
  • Endpoints with patterns like /api/, /graphql, /v1/, /data/
  • Requests that fire when you paginate, search, or filter

Tip: Click a request and check the Preview tab. If you can see your data structured there, you’ve found your endpoint.
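
The checklist above can be turned into a quick triage helper for requests you have noted down from the Network tab. This is a minimal sketch, not a definitive classifier: the function name `looks_like_api_request` and the `API_PATH_HINTS` tuple are illustrative, and the path patterns are just the ones listed above.

```python
from urllib.parse import urlparse

# Path fragments the checklist above suggests watching for.
API_PATH_HINTS = ("/api/", "/graphql", "/v1/", "/data/")


def looks_like_api_request(url: str, content_type: str) -> bool:
    """Return True if a captured request looks like a JSON data endpoint."""
    # Content-Type may carry parameters, e.g. "application/json; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return True
    # Fall back to the URL path when the content type is missing or generic.
    path = urlparse(url).path.lower()
    if not path.endswith("/"):
        path += "/"  # so "/graphql" and "/api" match hints that end in "/"
    return any(hint in path for hint in API_PATH_HINTS)
```

A path match alone is only a hint. The Preview tab remains the real test of whether the response holds your data.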


Step 2: Test It Directly — The curl Test


Right-click the request → Copy → Copy as cURL, then paste it into your terminal.

# Example of what gets copied
curl 'https://example.com/api/products?page=1' \
-H 'accept: application/json' \
-H 'user-agent: Mozilla/5.0 ...' \
-H 'cookie: session=abc123; cf_clearance=xyz'

Did the command return the data?

  • Yes → The site is most likely not fingerprinting your TLS handshake. Move to Step 3.
  • No → The site may be fingerprinting your TLS handshake. Move to Step 4.
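
Once the copied command works, it helps to get its URL and headers into Python so they can be trimmed programmatically. The sketch below assumes the bash flavor of "Copy as cURL" with the URL placed before the flags. `parse_curl` is a hypothetical helper, and it deliberately ignores flags it does not recognize (request bodies, methods, and so on).

```python
import shlex


def parse_curl(command: str) -> tuple[str, dict[str, str]]:
    """Extract the URL and headers from a "Copy as cURL" command (bash flavor)."""
    # Join backslash-continued lines, then split using shell quoting rules.
    tokens = shlex.split(command.replace("\\\n", " "))
    url = ""
    headers: dict[str, str] = {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip().lower()] = value.strip()
            i += 2
        elif token in ("-b", "--cookie") and i + 1 < len(tokens):
            headers["cookie"] = tokens[i + 1]  # some browsers emit cookies via -b
            i += 2
        elif not token.startswith("-") and not url:
            url = token
            i += 1
        else:
            i += 1  # ignore flags this sketch does not handle
    return url, headers
```

The returned pair can be passed straight to `requests.get(url, headers=headers)` as a starting point.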

Step 3: Whittle Down the Headers

If the raw curl command worked, remove headers one at a time, re-running the request after each removal, to find the minimum viable request.

Removal order (start with the safest to remove):

  1. Generic browser headers (sec-ch-ua, sec-fetch-*, upgrade-insecure-requests)
  2. referer and origin
  3. x-* custom headers (test each carefully — some are required)
  4. Cookies (remove one at a time — identify which are essential)
  5. Auth headers (authorization, x-api-key) — keep these if required

Goal: The leanest possible request that still returns data.
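
The removal process can be automated with a greedy loop: drop one header, re-test, and keep the removal only if the request still works. This is a sketch under assumptions: `minimize_headers` and the `still_works` callback are illustrative names, and the callback is expected to send the request and decide whether the response still contains your data.

```python
from typing import Callable, Mapping


def minimize_headers(
    headers: Mapping[str, str],
    still_works: Callable[[dict[str, str]], bool],
) -> dict[str, str]:
    """Greedily drop each header whose removal does not break the request."""
    kept = dict(headers)
    for name in list(headers):  # iterate in the caller's removal order
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the header was not needed
    return kept
```

In practice `still_works` might wrap `requests.get(url, headers=trial)` and check both the status code and the presence of an expected JSON key, with a short pause between attempts. Cookies live in a single `cookie` header, so testing them one at a time would require splitting that header first.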

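import requests

response = requests.get(
    "https://example.com/api/products",
    params={"page": 1},
    headers={"accept": "application/json"},
    timeout=10,  # avoid hanging forever on a stalled connection
)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
data = response.json()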

Also check: Does the endpoint require a Bearer token fetched at page load? Look for an earlier request to /auth, /token, or /session — you may need to grab that first.
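
If the endpoint does depend on a token issued at page load, the flow is two requests: one for the token, one for the data. The sketch below is hedged heavily: the `access_token` field name is an assumption (inspect the real token response in the Network tab), and `fetch_with_bearer` is a hypothetical helper that accepts any session-like object with a `get` method, such as `requests.Session()`.

```python
from typing import Any


def fetch_with_bearer(session: Any, token_url: str, data_url: str) -> Any:
    """Fetch a token first, then call the data endpoint with it."""
    token_response = session.get(token_url, timeout=10)
    token_response.raise_for_status()
    # ASSUMPTION: the token sits under an "access_token" key; real sites differ.
    token = token_response.json()["access_token"]

    data_response = session.get(
        data_url,
        headers={"authorization": f"Bearer {token}"},
        timeout=10,
    )
    data_response.raise_for_status()
    return data_response.json()
```

Tokens usually expire, so a long-running scraper would need to repeat the first request when the data endpoint starts returning 401.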


Step 4: Handle TLS Fingerprinting with curl_cffi


If the raw curl didn’t work, the server is likely inspecting your TLS handshake to detect non-browser clients. Use curl_cffi to impersonate a real browser’s networking stack.

pip install curl_cffi

from curl_cffi import requests

response = requests.get(
    "https://example.com/api/products",
    impersonate="chrome120",  # mimics Chrome's TLS fingerprint
    params={"page": 1},
)
data = response.json()

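Impersonation targets include chrome110, chrome120, safari17_0, and more. The exact set depends on the installed curl_cffi version, so check its documentation for the current list.

Did the curl_cffi request return the data?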

  • Yes → TLS fingerprinting is confirmed. Use curl_cffi for all requests, then go back to Step 3 and minimize your headers.
  • No → Session-based logic is at play. Move to Step 5.

Step 5: Session Logic

Some sites issue challenge cookies that must be obtained through a valid browser-like session. The strategies below are listed in order of increasing complexity.

5a. Set-Cookie emulation

Some sites set a cookie on the first request that must be echoed back on later requests. Use a session object:

from curl_cffi import requests
session = requests.Session()
session.get("https://example.com/") # Triggers Set-Cookie
data = session.get("https://example.com/api/products").json()

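5b. Rotating cookies

Some sites invalidate a cookie after each use and issue a new one in the response, so the cookie jar must be updated between requests. A session object does this automatically, as long as every request goes through the same session: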

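from curl_cffi import requests

session = requests.Session()
results = []
for page in range(1, 10):
    r = session.get(f"https://example.com/api/products?page={page}", impersonate="chrome120")
    # The session's cookie jar picks up any rotated cookie from the response automatically
    results.append(r.json())  # keep every page instead of overwriting the previous one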

5c. JavaScript challenge cookies (e.g., Cloudflare)


If the site issues a cf_clearance or similar JS-challenge cookie, it requires actual JavaScript execution to solve. This is where you finally need a headless browser — but only to obtain the cookie, not to scrape data:

from curl_cffi import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")
    page.wait_for_timeout(3000)  # Let the challenge resolve
    cookies = page.context.cookies()
    cf_clearance = next(c["value"] for c in cookies if c["name"] == "cf_clearance")
    browser.close()

# Now use the cookie in curl_cffi.
# Note: challenge cookies are typically bound to the User-Agent and IP that solved
# the challenge, so the request may also need a matching user-agent header.
response = requests.get(
    "https://example.com/api/products",
    impersonate="chrome120",
    cookies={"cf_clearance": cf_clearance},
)
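
When more than one cookie from the browser session turns out to be needed, the list returned by `page.context.cookies()` can be converted into the plain dict that the `cookies=` argument expects. A minimal sketch, assuming each record carries the `name`, `value`, and `domain` keys that Playwright returns; `cookies_for_domain` is a hypothetical helper name.

```python
def cookies_for_domain(cookies: list[dict], domain: str) -> dict[str, str]:
    """Convert Playwright-style cookie records into a name -> value dict for one site."""
    wanted = domain.lstrip(".").lower()
    result: dict[str, str] = {}
    for cookie in cookies:
        cookie_domain = cookie.get("domain", "").lstrip(".").lower()
        # Keep cookies set on the domain itself or on a parent domain of it.
        if wanted == cookie_domain or wanted.endswith("." + cookie_domain):
            result[cookie["name"]] = cookie["value"]
    return result
```

The result can be passed as `cookies=cookies_for_domain(cookies, "example.com")` in place of the single-cookie dict.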

Step 6: When There’s No API (Server-Side Rendered Pages)


If the Network tab shows no useful XHR/Fetch requests, the HTML is the data (server-rendered pages). In this case, parse the HTML directly:

import requests
from bs4 import BeautifulSoup
r = requests.get("https://example.com/products", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "lxml")
items = soup.select(".product-card .title")

Start
│
Find XHR/Fetch request in Network tab
├─ No useful request found ──► Parse HTML with BeautifulSoup
│
Copy as cURL → paste in terminal
├─ Works ──► Whittle down headers ──► Use requests or httpx
├─ Fails ──► Try curl_cffi with impersonate="chrome120"
│            ├─ Works ──► TLS fingerprinting confirmed. Use curl_cffi + minimize headers
│            └─ Fails ──► Session logic required
│                         ├─ Set-Cookie emulation ──► requests.Session()
│                         ├─ Rotating cookies ──────► Update jar each request
│                         └─ JS challenge cookie ───► Playwright to get cookie, then curl_cffi
└─ All else fails ──► Full headless browser (Playwright/Selenium)
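
The same decision tree can be written as a small function, which is handy for recording per-site findings in a scraper's config. This is only a sketch: `choose_tool` and its boolean parameters are illustrative names standing in for the manual checks in the steps above, and the returned strings are labels, not imports.

```python
def choose_tool(
    has_json_endpoint: bool,
    plain_curl_works: bool = False,
    curl_cffi_works: bool = False,
    session_cookies_work: bool = False,
    js_cookie_works: bool = False,
) -> str:
    """Walk the decision tree top to bottom and name the first approach that fits."""
    if not has_json_endpoint:
        return "requests + BeautifulSoup"
    if plain_curl_works:
        return "requests or httpx"
    if curl_cffi_works:
        return "curl_cffi"
    if session_cookies_work:
        return "curl_cffi Session"
    if js_cookie_works:
        return "playwright (cookie only) + curl_cffi"
    return "full headless browser"
```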

  • Rate limiting: Always add delays between requests (for example, time.sleep() for 1 to 3 seconds). Even open APIs may ban aggressive scrapers.
  • Endpoint instability: Internal APIs have no stability guarantees — your scraper can break without warning. Add error handling.
  • Auth tokens: Look for an early request to /auth or /session that returns a Bearer token used in subsequent calls.
  • Respect robots.txt: Check https://example.com/robots.txt before scraping. Honor it where appropriate.
  • Legal: Scraping terms of service vary. Always check the site’s ToS and applicable laws before scraping at scale.
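
The first two points, delays and error handling, can be combined in one small wrapper. The sketch below is one possible shape, not a definitive implementation: `polite_fetch_all` is a hypothetical helper, `fetch` is whatever function performs the request (for example a lambda around `session.get(...).json()`), and the `sleep` parameter exists only so the pacing can be tested without waiting.

```python
import time
from typing import Any, Callable, Iterable, Iterator


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Any],
    delay: float = 1.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[tuple[str, Any]]:
    """Yield (url, result) pairs, pausing between requests and retrying failures."""
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # fixed pause between consecutive URLs
        for attempt in range(max_retries + 1):
            try:
                result = fetch(url)
            except Exception:  # narrow this to your HTTP client's error types
                if attempt == max_retries:
                    raise  # give up after the last retry
                sleep(delay * 2 ** (attempt + 1))  # back off: 2x, 4x, 8x the delay
            else:
                yield url, result
                break
```

Because it is a generator, results arrive as they are fetched, so a scraper can write each page to disk before the next request goes out.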

Situation                        | Tool
---------------------------------|------------------------------------------
Simple API, no auth              | requests / httpx
TLS fingerprinting detected      | curl_cffi
Session/cookie handling          | requests.Session() or curl_cffi Session
JS challenge (Cloudflare, etc.)  | playwright (for cookie only) + curl_cffi
Server-rendered HTML             | requests + BeautifulSoup / lxml
Full JS rendering required       | playwright or selenium