Web Scraping Without Headless Browsers: A Practical Guide

Apr 13, 2026

Core principle: Before writing a single line of scraping code, spend a minute in the Network tab. Most modern sites fetch their data from JSON APIs in the background, and hitting those directly is faster and simpler than driving a browser, and often harder to block.

Step 1: Find the Real Data Source

Open your browser and navigate to the target page
Press F12 to open DevTools
Go to the Network tab
Filter by Fetch/XHR
Refresh the page, scroll, or click buttons that trigger data loading
Look for requests returning JSON — these are your targets

What to look for:

Responses with Content-Type: application/json
Endpoints with patterns like /api/, /graphql, /v1/, /data/
Requests that fire when you paginate, search, or filter

Tip: Click a request and check the Preview tab. If you can see your data structured there, you’ve found your endpoint.
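
If you have noted down a handful of requests from the Network tab, the checklist above can be applied mechanically. The following is a minimal sketch, not a definitive filter: looks_like_data_endpoint and the captured list are hypothetical names, and the path patterns are simply the ones listed above.

```python
import re
from urllib.parse import urlparse

# Path fragments that often indicate a data endpoint (taken from the list above).
API_PATH_HINTS = re.compile(r"/(api|graphql|v\d+|data)(/|$)")


def looks_like_data_endpoint(url: str, content_type: str) -> bool:
    """Return True if a captured request is worth a closer look."""
    # Ignore parameters such as "; charset=utf-8" when comparing the media type.
    is_json = content_type.split(";")[0].strip().lower() == "application/json"
    has_api_path = bool(API_PATH_HINTS.search(urlparse(url).path))
    return is_json or has_api_path


# Hypothetical (url, Content-Type) pairs, as you might copy them from DevTools.
captured = [
    ("https://example.com/api/products?page=1", "application/json; charset=utf-8"),
    ("https://example.com/static/app.js", "application/javascript"),
    ("https://example.com/graphql", "application/json"),
]
candidates = [url for url, ctype in captured if looks_like_data_endpoint(url, ctype)]
print(candidates)
```

A match is only a hint; the Preview tab remains the quickest way to confirm that the response actually contains the data you want.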

Step 2: Test It Directly — The curl Test

Right-click the request → Copy → Copy as cURL, then paste it in your terminal.

# Example of what gets copied
curl 'https://example.com/api/products?page=1' \
  -H 'accept: application/json' \
  -H 'user-agent: Mozilla/5.0 ...' \
  -H 'cookie: session=abc123; cf_clearance=xyz'

Did it work?

Yes → The site is not rejecting curl's TLS fingerprint. Move to Step 3.
No → The site may be fingerprinting your TLS handshake. Move to Step 4.
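
Once the copied command works, it helps to get its URL and headers into Python so they can be trimmed later. Here is a rough sketch using only the standard library; parse_curl is a hypothetical helper, and it assumes the bash-style command that Chrome produces, with the URL directly after curl and headers passed via -H.

```python
import shlex


def parse_curl(command: str) -> tuple[str, dict[str, str]]:
    """Extract the URL and -H headers from a 'Copy as cURL' command."""
    # Join backslash-continued lines before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    url, headers = "", {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip().lower()] = value.strip()
            i += 2
        elif not tok.startswith("-") and not url:
            url = tok  # first bare argument is taken to be the URL
            i += 1
        else:
            i += 1  # other flags (e.g. --compressed) are ignored in this sketch
    return url, headers


copied = r"""curl 'https://example.com/api/products?page=1' \
  -H 'accept: application/json' \
  -H 'user-agent: Mozilla/5.0 ...' \
  -H 'cookie: session=abc123; cf_clearance=xyz'"""

url, headers = parse_curl(copied)
print(url)
print(sorted(headers))
```

The sketch ignores request bodies and methods other than GET, so a copied POST request would need extra handling.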

Step 3: Whittle Down the Request

If the raw curl worked, start removing headers one by one to find the minimum viable request.

Removal order (start with the safest to remove):

Generic browser headers (sec-ch-ua, sec-fetch-*, upgrade-insecure-requests)
referer and origin
x-* custom headers (test each carefully — some are required)
Cookies (remove one at a time — identify which are essential)
Auth headers (authorization, x-api-key) — keep these if required

Goal: The leanest possible request that still returns data.

import requests

response = requests.get(
    "https://example.com/api/products",
    params={"page": 1},
    headers={"accept": "application/json"}
)
data = response.json()
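
The one-by-one removal can also be automated. The sketch below is one possible approach, assuming you supply a still_works callback that sends the request and decides whether the response is still good; minimize_headers and fake_probe are hypothetical names, and the probe here is a stand-in so the example runs without a network.

```python
from typing import Callable


def minimize_headers(
    headers: dict[str, str],
    still_works: Callable[[dict[str, str]], bool],
    removal_order: list[str],
) -> dict[str, str]:
    """Drop headers one at a time, keeping only those the endpoint needs."""
    kept = dict(headers)
    for name in removal_order:
        if name not in kept:
            continue
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the request survived without this header
    return kept


# Stand-in for a real probe: pretend the server only needs these two headers.
def fake_probe(h: dict[str, str]) -> bool:
    return "accept" in h and "x-api-key" in h


full = {
    "sec-fetch-mode": "cors",
    "referer": "https://example.com/",
    "accept": "application/json",
    "x-api-key": "demo-key",
}
lean = minimize_headers(full, fake_probe, list(full))
print(lean)
```

In real use the probe would issue the request with the trial headers and check both the status code and that the body still contains the expected data, with a pause between attempts.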

Also check: Does the endpoint require a Bearer token fetched at page load? Look for an earlier request to /auth, /token, or /session — you may need to grab that first.
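
When a token is involved, the flow is usually two requests: fetch the token, then send it in an Authorization header. A minimal sketch follows, with loud assumptions: the token endpoint URL and the access_token JSON key are hypothetical and will differ per site, and fetch_with_token is an illustrative helper that accepts any requests-style session.

```python
def fetch_with_token(session, token_url: str, data_url: str,
                     token_key: str = "access_token"):
    """Fetch a short-lived token, then call the data endpoint with it."""
    token_resp = session.get(token_url)
    token_resp.raise_for_status()
    token = token_resp.json()[token_key]  # key name is an assumption

    data_resp = session.get(data_url, headers={"Authorization": f"Bearer {token}"})
    data_resp.raise_for_status()
    return data_resp.json()


# Usage with the real library (not executed here; URLs are placeholders):
#   import requests
#   data = fetch_with_token(
#       requests.Session(),
#       "https://example.com/api/token",
#       "https://example.com/api/products",
#   )
```

Check the Network tab for how the browser actually obtains and sends the token; some sites use a POST request or a custom header instead.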

Step 4: Handle TLS Fingerprinting with curl_cffi

If the raw curl didn’t work, the server is likely inspecting your TLS handshake to detect non-browser clients. Use curl_cffi to impersonate a real browser’s networking stack.

pip install curl_cffi

from curl_cffi import requests

response = requests.get(
    "https://example.com/api/products",
    impersonate="chrome120",   # mimics Chrome's TLS fingerprint
    params={"page": 1}
)
data = response.json()

Available impersonation targets include chrome110, chrome120, and safari17_0. The exact list depends on the installed curl_cffi version, so check its documentation.

Did it work?

Yes → TLS fingerprinting is confirmed. Use curl_cffi for all requests, then go back to Step 3 and minimize your headers.
No → There’s session-based logic at play. Move to Step 5.
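
The "Did it work?" check can be turned into a small diagnostic that tries a plain client first and an impersonating one second, and reports which succeeded. This is only a sketch: first_working, plain_fetch, and impersonated_fetch are hypothetical helpers, and the two fetchers import their libraries lazily on the assumption that requests and curl_cffi are installed.

```python
from typing import Callable


def plain_fetch(url: str):
    import requests  # imported lazily so the sketch loads without it
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return r.json()


def impersonated_fetch(url: str, target: str = "chrome120"):
    from curl_cffi import requests as cffi_requests
    r = cffi_requests.get(url, impersonate=target, timeout=10)
    r.raise_for_status()
    return r.json()


def first_working(fetchers: dict[str, Callable], url: str):
    """Try each client in order; return (name, data) for the first success."""
    for name, fetch in fetchers.items():
        try:
            return name, fetch(url)
        except Exception:  # blocked request, network error, non-JSON body
            continue
    return None, None


# Usage (not executed here):
#   name, data = first_working(
#       {"requests": plain_fetch, "curl_cffi": impersonated_fetch},
#       "https://example.com/api/products",
#   )
```

If only the impersonated client succeeds, that points to TLS fingerprinting; if neither does, session logic is the next suspect.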

Step 5: Emulate Sessions and Cookies

Some sites issue challenge cookies that must be obtained via a valid browser-like session. Strategies, in order of increasing complexity:

5a. Respect Set-Cookie headers

The simplest case: the site sets a cookie on the first request and expects it to be echoed back on later ones. Use a session object:

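from curl_cffi import requests

session = requests.Session()
session.get("https://example.com/", impersonate="chrome120")  # first response sets the cookie
# The session sends the stored cookie back automatically
data = session.get("https://example.com/api/products", impersonate="chrome120").json()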

5b. Rotating cookies

Some sites invalidate a cookie after each use and issue a new one in the response, so the cookie jar has to be updated between requests. A session object does this for you:

from curl_cffi import requests

session = requests.Session()
for page in range(1, 10):  # pages 1 through 9
    r = session.get(f"https://example.com/api/products?page={page}", impersonate="chrome120")
    # The session stores any new Set-Cookie value and sends it with the next request
    data = r.json()
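
To see what the jar update amounts to, here is a small standard-library sketch of the same idea done by hand. apply_set_cookie and the cookie names are hypothetical, and the sketch ignores expiry, path, and domain rules that a real cookie jar enforces.

```python
from http.cookies import SimpleCookie


def apply_set_cookie(jar: dict[str, str], set_cookie_headers: list[str]) -> dict[str, str]:
    """Return a new jar with values from Set-Cookie headers replacing old ones."""
    updated = dict(jar)
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            updated[name] = morsel.value
    return updated


jar = {"session": "abc123", "rot": "token-1"}
# Hypothetical Set-Cookie header returned with the first page of results.
jar = apply_set_cookie(jar, ["rot=token-2; Path=/; Max-Age=60"])
print(jar)
```

This is mainly useful for debugging: if requests start failing after the first page, comparing the jar before and after each response shows which cookie is rotating.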

5c. JavaScript challenge cookies (e.g., Cloudflare)

If the site issues a cf_clearance or similar JS-challenge cookie, it requires actual JavaScript execution to solve. This is where you finally need a headless browser — but only to obtain the cookie, not to scrape data:

from curl_cffi import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")
    page.wait_for_timeout(3000)  # Let the challenge resolve
    cookies = page.context.cookies()
    # None if the challenge did not finish in time
    cf_clearance = next((c["value"] for c in cookies if c["name"] == "cf_clearance"), None)
    browser.close()

if cf_clearance is None:
    raise RuntimeError("cf_clearance cookie not found; the challenge may need more time")

# Now use the cookie in curl_cffi
response = requests.get(
    "https://example.com/api/products",
    impersonate="chrome120",
    cookies={"cf_clearance": cf_clearance}
)
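
A site may set more than one cookie during the challenge, so it can be safer to carry over every cookie for the target domain rather than a single named one. The sketch below assumes cookie dicts shaped like the ones Playwright returns (name, value, domain, path); cookies_for_domain is a hypothetical helper and the sample data is made up.

```python
def cookies_for_domain(browser_cookies: list[dict], domain: str) -> dict[str, str]:
    """Flatten Playwright-style cookie dicts into {name: value} for one site."""
    wanted = domain.lstrip(".")
    result = {}
    for cookie in browser_cookies:
        cookie_domain = cookie.get("domain", "").lstrip(".")
        # Keep cookies set for the site itself or for one of its parent domains.
        if wanted == cookie_domain or wanted.endswith("." + cookie_domain):
            result[cookie["name"]] = cookie["value"]
    return result


# Made-up cookies in the shape returned by page.context.cookies().
sample = [
    {"name": "cf_clearance", "value": "xyz", "domain": ".example.com", "path": "/"},
    {"name": "session", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "tracker", "value": "1", "domain": ".ads.example.net", "path": "/"},
]
print(cookies_for_domain(sample, "example.com"))
```

The resulting dict can be passed as the cookies argument of the follow-up request in place of the single-cookie dict.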

Step 6: When There’s No API (Server-Side Rendered Pages)

If the Network tab shows no useful XHR/Fetch requests, the HTML is the data (server-rendered pages). In this case, parse the HTML directly:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com/products", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "lxml")
items = soup.select(".product-card .title")
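
The selector call returns tags, not values, so there is usually one more step to turn them into records. Here is a small offline sketch on an inline HTML string: the markup and the .price class are invented for illustration, and it uses the built-in html.parser backend instead of lxml so that bs4 is the only dependency.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a server-rendered product listing.
html = """
<div class="product-card"><h2 class="title"> Widget </h2><span class="price">9.99</span></div>
<div class="product-card"><h2 class="title">Gadget</h2><span class="price">19.50</span></div>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
products = []
for card in soup.select(".product-card"):
    title = card.select_one(".title")
    price = card.select_one(".price")
    if title is None or price is None:
        continue  # skip cards that do not match the expected layout
    products.append({
        "title": title.get_text(strip=True),
        "price": float(price.get_text(strip=True)),
    })
print(products)
```

Iterating over cards and selecting within each one keeps the fields of a product together, which is harder to guarantee when titles and prices are selected as two separate flat lists.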

Decision Flowchart

Start
  │
  ▼
Find XHR/Fetch request in Network tab
  │
  ├─ No useful request found ──────────────────────► Parse HTML with BeautifulSoup
  │
  ▼
Copy as cURL → paste in terminal
  │
  ├─ Works ──► Whittle down headers ──► Use requests or httpx
  │
  ├─ Fails ──► Try curl_cffi with impersonate="chrome120"
  │               │
  │               ├─ Works ──► TLS fingerprinting confirmed. Use curl_cffi + minimize headers
  │               │
  │               └─ Fails ──► Session logic required
  │                               │
  │                               ├─ Set-Cookie emulation ──► requests.Session()
  │                               ├─ Rotating cookies ──────► Update jar each request
  │                               └─ JS challenge cookie ───► Playwright to get cookie, then curl_cffi
  │
  └─ All else fails ──────────────────────────────► Full headless browser (Playwright/Selenium)

Practical Reminders

Rate limiting: Always add delays between requests (for example, a random pause of 1 to 3 seconds via time.sleep()). Even open APIs may ban aggressive scrapers.
Endpoint instability: Internal APIs have no stability guarantees — your scraper can break without warning. Add error handling.
Auth tokens: Look for an early request to /auth or /session that returns a Bearer token used in subsequent calls.
Respect robots.txt: Check https://example.com/robots.txt before scraping. Honor it where appropriate.
Legal: Scraping terms of service vary. Always check the site’s ToS and applicable laws before scraping at scale.
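
Two of these reminders are easy to wire into a scraper with the standard library alone. The sketch below is illustrative rather than complete: is_allowed and polite_pause are hypothetical helpers, and the robots.txt content is a made-up example passed in as a string so nothing is downloaded.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_pause(low: float = 1.0, high: float = 3.0, sleep=time.sleep) -> float:
    """Sleep for a random interval so requests are not evenly spaced."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Made-up robots.txt content for illustration.
robots = """User-agent: *
Disallow: /admin/
"""
print(is_allowed(robots, "https://example.com/api/products"))
print(is_allowed(robots, "https://example.com/admin/users"))
```

Calling polite_pause() between requests in the pagination loop adds jitter on top of the delay, and the sleep parameter exists only so the pause can be stubbed out in tests.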

Quick Reference: Tools by Use Case

Situation	Tool
Simple API, no auth	`requests` / `httpx`
TLS fingerprinting detected	`curl_cffi`
Session/cookie handling	`requests.Session()` or `curl_cffi` Session
JS challenge (Cloudflare, etc.)	`playwright` (for cookie only) + `curl_cffi`
Server-rendered HTML	`requests` + `BeautifulSoup` / `lxml`
Full JS rendering required	`playwright` or `selenium`
