Web Scraping Without Headless Browsers: A Practical Guide

Core principle: Before writing a single line of scraping code, spend 1 minute in the Network tab. Most modern sites fetch data from clean JSON APIs in the background — hitting those directly is faster, simpler, and harder to block.

Step 1: Find the Request in the Network Tab


  1. Open your browser and navigate to the target page
  2. Press F12 to open DevTools
  3. Go to the Network tab
  4. Filter by Fetch/XHR
  5. Refresh the page, scroll, or click buttons that trigger data loading
  6. Look for requests returning JSON — these are your targets

What to look for:

  • Responses with Content-Type: application/json
  • Endpoints with patterns like /api/, /graphql, /v1/, /data/
  • Requests that fire when you paginate, search, or filter

Tip: Click a request and check the Preview tab. If you can see your data structured there, you’ve found your endpoint.
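If a page fires dozens of requests, it can be easier to export the Network tab as a HAR file and filter it in code. The sketch below assumes the standard HAR layout (`log.entries[].request.url` and `response.content.mimeType`); the function names and the hint list are illustrative, not part of any tool.

```python
import json
from urllib.parse import urlparse

# Path fragments that often indicate a data endpoint (taken from the list above)
API_HINTS = ("/api/", "/graphql", "/v1/", "/data/")


def load_har(path: str) -> dict:
    """Read a HAR file exported from the browser's Network tab."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_json_endpoints(har: dict) -> list[dict]:
    """Return requests whose response is JSON, API-looking paths first."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        url = request.get("url", "")
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" not in mime.lower():
            continue  # skip scripts, images, HTML, and so on
        path = urlparse(url).path
        hits.append({
            "url": url,
            "method": request.get("method", "GET"),
            "api_like": any(hint in path for hint in API_HINTS),
        })
    # sorted() is stable, so capture order is preserved within each group
    return sorted(hits, key=lambda hit: not hit["api_like"])
```

Calling `find_json_endpoints(load_har("page.har"))` (a hypothetical file name) gives a short candidate list to inspect by hand in the Preview tab.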


Step 2: Test It Directly — The curl Test

Right-click the request → Copy → Copy as cURL, then paste it into your terminal.

# Example of what gets copied
curl 'https://example.com/api/products?page=1' \
-H 'accept: application/json' \
-H 'user-agent: Mozilla/5.0 ...' \
-H 'cookie: session=abc123; cf_clearance=xyz'

Did the command return the same data you saw in the browser?

  • Yes → The site is not fingerprinting your TLS handshake. Move to Step 3.
  • No → The site may be fingerprinting your TLS handshake. Move to Step 4.
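
Once the copied command works, it can help to pull its URL and headers into Python so later steps can replay it. This is a minimal sketch that only understands the URL and `-H` flags, as in the example above; `parse_curl` is a hypothetical helper, and any other flags are silently ignored.

```python
import shlex


def parse_curl(command: str) -> dict:
    """Extract the URL and headers from a simple `curl URL -H 'k: v'` command."""
    # Join backslash-continued lines, then split the way a POSIX shell would
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a curl command")

    url = None
    headers = {}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip().lower()] = value.strip()
            i += 2
        elif not token.startswith("-") and url is None:
            url = token  # first bare argument is treated as the URL
            i += 1
        else:
            i += 1  # flags this sketch does not understand are skipped
    return {"url": url, "headers": headers}
```

The result can be passed to an HTTP client, for example `requests.get(parsed["url"], headers=parsed["headers"])`.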

Step 3: Whittle Down the Headers

If the raw curl worked, start removing headers one by one to find the minimum viable request.

Removal order (start with the safest to remove):

  1. Generic browser headers (sec-ch-ua, sec-fetch-*, upgrade-insecure-requests)
  2. referer and origin
  3. x-* custom headers (test each carefully — some are required)
  4. Cookies (remove one at a time — identify which are essential)
  5. Auth headers (authorization, x-api-key) — keep these if required

Goal: The leanest possible request that still returns data.

import requests

response = requests.get(
    "https://example.com/api/products",
    params={"page": 1},
    headers={"accept": "application/json"},
)
data = response.json()
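
The removal process itself can be automated. The sketch below is one possible greedy approach, not a prescribed method: `still_works` is a hypothetical callback you supply that sends the request with the given headers and reports whether the data still comes back.

```python
from typing import Callable, Mapping


def minimize_headers(headers: Mapping[str, str],
                     still_works: Callable[[dict], bool]) -> dict:
    """Greedily drop headers, keeping only those the request fails without.

    Headers are tried in the order given, so list the safest-to-remove first.
    """
    kept = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the request survived without this header
    return kept
```

In practice `still_works` would wrap something like `requests.get(url, headers=trial)` and check for a 200 status with the expected JSON keys. Add a delay inside it, since this sends one probe request per header.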

Also check: Does the endpoint require a Bearer token fetched at page load? Look for an earlier request to /auth, /token, or /session — you may need to grab that first.


Step 4: Handle TLS Fingerprinting with curl_cffi

If the raw curl didn’t work, the server is likely inspecting your TLS handshake to detect non-browser clients. Use curl_cffi to impersonate a real browser’s networking stack.

pip install curl_cffi

from curl_cffi import requests

response = requests.get(
    "https://example.com/api/products",
    impersonate="chrome120",  # mimics Chrome's TLS fingerprint
    params={"page": 1},
)
data = response.json()

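Available impersonation targets include chrome110 and chrome120, plus Safari and Firefox profiles. Exact target names vary by curl_cffi version (Safari profiles, for example, use names like safari17_0), so check the supported list for your installed release.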

Did curl_cffi return the data?

  • Yes → TLS fingerprinting is confirmed. Use curl_cffi for all requests, then go back to Step 3 and minimize your headers.
  • No → There’s session-based logic at play. Move to Step 5.

Step 5: Handle Session Logic

Some sites issue challenge cookies that must be obtained via a valid browser-like session. Strategies in order of complexity:

5a. Set-Cookie emulation

Some sites set a cookie on the first request that must be echoed back. Use a session object:

from curl_cffi import requests
session = requests.Session()
session.get("https://example.com/") # Triggers Set-Cookie
data = session.get("https://example.com/api/products").json()

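5b. Rotating cookies

Some sites invalidate cookies after each use and issue a new one in the response, so the cookie jar has to be updated between requests. A session object does this for you, as long as you reuse the same session for every request: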

from curl_cffi import requests

session = requests.Session()
for page in range(1, 10):
    r = session.get(f"https://example.com/api/products?page={page}", impersonate="chrome120")
    # The session stores any updated cookies from this response automatically
    data = r.json()
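
If you ever manage cookies by hand (for example when passing them between tools), the update step looks roughly like the sketch below. It uses only the standard library; `apply_set_cookie` is a hypothetical helper, and it ignores cookie attributes such as Path, Domain, and expiry.

```python
from http.cookies import SimpleCookie


def apply_set_cookie(jar: dict[str, str],
                     set_cookie_headers: list[str]) -> dict[str, str]:
    """Return a copy of `jar` updated with values from Set-Cookie headers."""
    updated = dict(jar)
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)  # parses "name=value; Path=/; HttpOnly" style strings
        for name, morsel in parsed.items():
            updated[name] = morsel.value  # newer value replaces the old one
    return updated
```

Each response's Set-Cookie values would be fed through this before the next request, which is what a session object does internally in a more complete form.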

5c. JavaScript challenge cookies (e.g., Cloudflare)

If the site issues a cf_clearance or similar JS-challenge cookie, it requires actual JavaScript execution to solve. This is where you finally need a headless browser — but only to obtain the cookie, not to scrape data:

from curl_cffi import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")
    page.wait_for_timeout(3000)  # Let the challenge resolve
    cookies = page.context.cookies()
    cf_clearance = next(c["value"] for c in cookies if c["name"] == "cf_clearance")
    browser.close()

# Now use the cookie in curl_cffi.
# Note: cf_clearance is typically tied to the IP and User-Agent that solved the challenge.
response = requests.get(
    "https://example.com/api/products",
    impersonate="chrome120",
    cookies={"cf_clearance": cf_clearance},
)
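
A fixed three-second wait can be too short on a slow challenge and wasteful on a fast one. One alternative, sketched below under the assumption that the cookie simply appears in the browser context once the challenge passes, is to poll for it. `wait_for_cookie` is a hypothetical helper; the cookie getter and sleep function are injected so it does not depend on Playwright directly.

```python
import time
from typing import Callable, Optional


def wait_for_cookie(get_cookies: Callable[[], list[dict]],
                    name: str,
                    timeout: float = 15.0,
                    interval: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll `get_cookies()` until a cookie called `name` appears.

    Returns the cookie value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        for cookie in get_cookies():
            if cookie.get("name") == name:
                return cookie.get("value")
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
```

With the Playwright code above, this could be called as `wait_for_cookie(page.context.cookies, "cf_clearance", sleep=lambda s: page.wait_for_timeout(s * 1000))`, so the page keeps running while the loop waits.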

Step 6: When There’s No API (Server-Side Rendered Pages)

If the Network tab shows no useful XHR/Fetch requests, the HTML is the data (server-rendered pages). In this case, parse the HTML directly:

import requests
from bs4 import BeautifulSoup
r = requests.get("https://example.com/products", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "lxml")
items = soup.select(".product-card .title")

Start
  │
  ▼
Find XHR/Fetch request in Network tab
  │
  ├─ No useful request found ──────────────────────► Parse HTML with BeautifulSoup
  │
  ▼
Copy as cURL → paste in terminal
  │
  ├─ Works ──► Whittle down headers ──► Use requests or httpx
  │
  ├─ Fails ──► Try curl_cffi with impersonate="chrome120"
  │              │
  │              ├─ Works ──► TLS fingerprinting confirmed. Use curl_cffi + minimize headers
  │              │
  │              └─ Fails ──► Session logic required
  │                             │
  │                             ├─ Set-Cookie emulation ──► requests.Session()
  │                             ├─ Rotating cookies ──────► Update jar each request
  │                             └─ JS challenge cookie ───► Playwright to get cookie, then curl_cffi
  │
  └─ All else fails ──────────────────────────────► Full headless browser (Playwright/Selenium)
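
The same escalation can be expressed in code by trying the cheapest strategy first and falling through on failure. This is only a sketch of the control flow: `first_working` is a hypothetical helper, and each strategy is a zero-argument callable you write yourself (plain requests, curl_cffi, a session, a browser).

```python
from typing import Callable, Optional, Sequence, Tuple

# A strategy is a label plus a callable that returns parsed data or None
Strategy = Tuple[str, Callable[[], Optional[dict]]]


def first_working(strategies: Sequence[Strategy]) -> Tuple[Optional[str], Optional[dict]]:
    """Run strategies in order and return the first one that yields data."""
    for name, attempt in strategies:
        try:
            data = attempt()
        except Exception:
            # Any error (blocked, bad JSON, timeout) counts as a failed attempt
            data = None
        if data is not None:
            return name, data
    return None, None
```

Knowing which label succeeded tells you which branch of the flowchart the site falls into, so later runs can skip straight to it.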

  • Rate limiting: Always add delays between requests (time.sleep(1–3)). Even open APIs will ban aggressive scrapers.
  • Endpoint instability: Internal APIs have no stability guarantees — your scraper can break without warning. Add error handling.
  • Auth tokens: Look for an early request to /auth or /session that returns a Bearer token used in subsequent calls.
  • Respect robots.txt: Check https://example.com/robots.txt before scraping. Honor it where appropriate.
  • Legal: Scraping terms of service vary. Always check the site’s ToS and applicable laws before scraping at scale.
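
The first two points (delays and error handling) can be combined in a small wrapper. The sketch below is one common pattern, exponential backoff with jitter; `fetch_with_retry` and its defaults are assumptions for illustration, not values the target site requires.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(fetch: Callable[[], T],
                     retries: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fetch()`, retrying on any exception with growing delays.

    Waits base_delay * 2**attempt seconds plus up to 0.5 s of jitter
    between attempts, and re-raises the last error when retries run out.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

A typical call would wrap a single page request, for example `fetch_with_retry(lambda: session.get(url).json())`, with a separate short sleep between pages.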

| Situation | Tool |
| --- | --- |
| Simple API, no auth | requests / httpx |
| TLS fingerprinting detected | curl_cffi |
| Session/cookie handling | requests.Session() or curl_cffi Session |
| JS challenge (Cloudflare, etc.) | playwright (for cookie only) + curl_cffi |
| Server-rendered HTML | requests + BeautifulSoup / lxml |
| Full JS rendering required | playwright or selenium |

Scraping AliExpress Ecommerce Data With Cloudflare Browser Rendering

Scraping AliExpress Search Results With Cloudflare Browser Rendering

AliExpress is one of those sites that looks simple on the surface but fights you at every turn when you try to scrape it. It’s a fully client-rendered SPA — a plain HTTP request gives you a loading skeleton with zero product data. On top of that, AliExpress geo-localizes aggressively, so your results might come back in Japanese or Russian depending on which data center handles the request.

Here’s how to get clean, structured product data from AliExpress search results using Cloudflare Browser Rendering — no local browser needed.

Same two-step pattern as any browser-rendering scraping workflow:

  1. Fetch the fully rendered HTML via Cloudflare’s /content API
  2. Parse the embedded JSON data locally

The interesting part with AliExpress is that it doesn’t use Next.js like Walmart does. There’s no convenient __NEXT_DATA__ tag. Instead, AliExpress embeds its page data inside a window._dida_config_._init_data_ JavaScript variable — a massive JSON blob buried in a script tag. Once you know where to look, it’s just as clean to parse.

Start with the usual Cloudflare Browser Rendering setup:

import os
import requests
from dotenv import load_dotenv
load_dotenv()
ACCOUNT_ID = os.environ.get("CF_ACCOUNT_ID")
API_TOKEN = os.environ.get("CF_API_TOKEN")
endpoint = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/browser-rendering/content"

Build the payload. AliExpress search URLs follow the pattern /w/wholesale-{query}.html. The waitForSelector targets .search-item-card-wrapper-gallery, which is the container for the product grid:

SEARCH_QUERY = "sunglasses"

payload = {
    "url": f"https://www.aliexpress.com/w/wholesale-{SEARCH_QUERY}.html?g=y&SearchText={SEARCH_QUERY}",
    "gotoOptions": {
        "waitUntil": "domcontentloaded",
        "timeout": 60000
    },
    "waitForSelector": {
        "selector": ".search-item-card-wrapper-gallery",
        "timeout": 30000
    },
    "viewport": {
        "width": 1280,
        "height": 720
    }
}

Here’s a gotcha that’ll waste your time if you’re not ready for it. Cloudflare’s browser instances run on their edge network, and AliExpress geo-localizes based on the requesting IP. If the edge node is in Tokyo, you’ll get Japanese product titles and yen prices. In São Paulo? Portuguese and reais.

The Accept-Language header alone doesn’t fix this — AliExpress mostly ignores it. What works is setting locale cookies before the page loads. Cloudflare’s API supports a cookies parameter for exactly this:

payload = {
    # ... url, gotoOptions, etc.
    "setExtraHTTPHeaders": {
        "Accept-Language": "en-US,en;q=0.9"
    },
    "cookies": [
        {
            "name": "aep_usuc_f",
            "value": "site=glo&c_tp=USD&region=US&b_locale=en_US",
            "domain": ".aliexpress.com"
        },
        {
            "name": "intl_locale",
            "value": "en_US",
            "domain": ".aliexpress.com"
        }
    ]
}

The aep_usuc_f cookie is the important one — it tells AliExpress which regional site to serve (glo for global), what currency to use (USD), and what locale to render (en_US). Without it, you’re at the mercy of IP geolocation.
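
If you need more than one locale, the two cookies can be generated from their parts. This is a small sketch: `locale_cookies` is a hypothetical helper, the defaults mirror the values above, and whether AliExpress accepts other currency, region, or locale combinations is an assumption you would need to verify.

```python
def locale_cookies(currency: str = "USD",
                   region: str = "US",
                   locale: str = "en_US",
                   site: str = "glo") -> list[dict]:
    """Build the `cookies` list for the payload from locale settings."""
    # Same key order as the aep_usuc_f value shown above
    value = f"site={site}&c_tp={currency}&region={region}&b_locale={locale}"
    return [
        {"name": "aep_usuc_f", "value": value, "domain": ".aliexpress.com"},
        {"name": "intl_locale", "value": locale, "domain": ".aliexpress.com"},
    ]
```

The result drops straight into the payload as `payload["cookies"] = locale_cookies()`.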

Fire it off and save:

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_TOKEN}"
}

resp = requests.post(endpoint, json=payload, headers=headers)

if resp.status_code == 200:
    with open("aliexpress_search_raw.html", "w", encoding="utf-8") as f:
        f.write(resp.text)
    print(f"Saved {len(resp.text):,} chars of rendered HTML")
else:
    print(f"Error {resp.status_code}: {resp.text[:500]}")

AliExpress doesn’t use Next.js, so there’s no __NEXT_DATA__ to grab. Instead, it uses a custom framework (internally called “dida”) that injects page data into a JavaScript variable: window._dida_config_._init_data_.

The tricky part is extracting it. The data is assigned as a JavaScript object literal inside a <script> tag — not a standalone JSON string. You can’t just grab it with a regex because the JSON is hundreds of thousands of characters long and nested deeply. Bracket counting works reliably:

import json
from bs4 import BeautifulSoup

# Load the rendered HTML saved by the fetch script
with open("aliexpress_search_raw.html", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")
init_data = None
for script in soup.find_all("script"):
    text = script.string or ""
    marker = "_init_data_= { data: "
    idx = text.find(marker)
    if idx == -1:
        continue
    idx += len(marker)
    # Find matching closing brace
    depth = 0
    for j in range(idx, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                init_data = json.loads(text[idx:j + 1])
                break
    break  # stop after the first script that contains the marker

You scan every script tag looking for the _init_data_ marker, then count curly braces to find where the JSON object ends. Once you’ve got the boundaries, json.loads handles the rest.
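
One caveat with plain brace counting is that a `{` or `}` inside a string value shifts the count. An alternative, sketched here on the assumption that the text after the marker is strict JSON (which the `json.loads` call above already relies on), is to let the standard library's `JSONDecoder.raw_decode` find the end of the object. `extract_json_after` is a hypothetical helper name.

```python
import json
from typing import Any, Optional


def extract_json_after(text: str, marker: str) -> Optional[Any]:
    """Parse the JSON value that starts right after `marker` in `text`.

    Returns None when the marker is absent. Trailing JavaScript after the
    JSON value is ignored because raw_decode stops at the end of the value.
    """
    idx = text.find(marker)
    if idx == -1:
        return None
    start = idx + len(marker)
    # raw_decode does not skip leading whitespace, so do it here
    while start < len(text) and text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value
```

Inside the script loop, `extract_json_after(text, "_init_data_= { data: ")` would replace the inner depth-counting loop.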

The init data has a clear hierarchy. Product listings live at data.root.fields.mods.itemList.content — an array of 60 items per page:

root_fields = init_data["data"]["root"]["fields"]
page_info = root_fields.get("pageInfo", {})
item_list = root_fields["mods"]["itemList"]["content"]

The pageInfo object gives you pagination metadata — total results, current page, page size. Useful if you want to paginate through all results.

Each item in the list has a consistent shape. Here’s how to extract the useful fields:

products = []
for item in item_list:
    prices = item.get("prices") or {}
    sale_price = prices.get("salePrice") or {}
    original_price = prices.get("originalPrice") or {}
    evaluation = item.get("evaluation") or {}
    trade = item.get("trade") or {}
    image = item.get("image") or {}
    title = item.get("title") or {}

    img_url = image.get("imgUrl", "")
    if img_url and img_url.startswith("//"):
        img_url = "https:" + img_url

    product = {
        "product_id": item.get("productId"),
        "title": title.get("displayTitle"),
        "url": f"https://www.aliexpress.com/item/{item.get('productId')}.html",
        "image": img_url or None,
        "price": sale_price.get("formattedPrice"),
        "price_original": original_price.get("formattedPrice"),
        "discount_pct": sale_price.get("discount"),
        "currency": sale_price.get("currencyCode"),
        "rating": evaluation.get("starRating"),
        "sold": trade.get("tradeDesc"),
        "is_ad": item.get("productType") == "ad",
    }
    products.append(product)

A few things worth noting:

  • Image URLs are protocol-relative — they start with // instead of https://, so you need to prepend the scheme yourself.
  • Prices come in two flavors: salePrice is the discounted price, originalPrice is the pre-discount price. Not every item has both.
  • Ads are identified by productType — sponsored listings have "productType": "ad" while organic results are "natural".
  • The tradeDesc field is a human-readable sold count — strings like "10,000+ sold" rather than raw numbers.
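
Since both the sold count and the prices arrive as display strings, you may want numeric versions for sorting or analytics. The helpers below are a sketch that assumes US-style formatting (comma thousands separator, dot decimal), as in the en_US examples; other locales would need different handling, and both function names are hypothetical.

```python
import re
from typing import Optional


def parse_sold(trade_desc: Optional[str]) -> Optional[int]:
    """Turn a string like "10,000+ sold" into a lower-bound integer."""
    if not trade_desc:
        return None
    match = re.search(r"\d[\d,]*", trade_desc)
    return int(match.group().replace(",", "")) if match else None


def parse_price(formatted: Optional[str]) -> Optional[float]:
    """Turn a string like "US $0.99" into a float."""
    if not formatted:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", formatted)
    return float(match.group().replace(",", "")) if match else None
```

Keep the original strings as well, because "10,000+" is a rounded-down bucket rather than an exact count.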

Deduplicate and save:

seen = set()
unique = []
for p in products:
    pid = p["product_id"]
    if pid and pid not in seen:
        seen.add(pid)
        unique.append(p)

with open("aliexpress_products.json", "w", encoding="utf-8") as f:
    json.dump(unique, f, indent=2, ensure_ascii=False)

After running both scripts, you get a clean JSON file with 60 products per page:

{
  "product_id": "1005007595844944",
  "title": "Men Classical Square Polarized Sports Sunglasses Lightweight PC Frame UV400",
  "url": "https://www.aliexpress.com/item/1005007595844944.html",
  "image": "https://ae-pic-a1.aliexpress-media.com/kf/S0af78966b3614be2b75b067b0c7c51e1z.jpg",
  "price": "US $0.99",
  "price_original": "US $6.32",
  "discount_pct": 84,
  "currency": "USD",
  "rating": 4.7,
  "sold": "5,000+ sold",
  "is_ad": false
}

  • The locale cookie is essential. Without aep_usuc_f, you’ll get results in whatever language matches the Cloudflare edge node’s geography. The intl_locale cookie reinforces it but aep_usuc_f does the heavy lifting.

  • The cookies API parameter — not setCookies. Cloudflare’s Browser Rendering API uses cookies as the parameter name. Using setCookies (which is the Puppeteer method name) will give you a 400 error with an “unrecognized keys” message.

  • The response might be JSON-wrapped. Depending on the endpoint configuration, Cloudflare may return the HTML wrapped in a JSON object like {"success": true, "result": "<html>..."}. Handle both formats:

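with open("aliexpress_search_raw.html", encoding="utf-8") as f:
    raw = f.read()
try:
    html = json.loads(raw)["result"]
except (json.JSONDecodeError, KeyError, TypeError):
    html = raw  # not JSON-wrapped, so the file is already plain HTML
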
  • Bracket counting beats regex for extraction. The _init_data_ blob can be 400KB+ of deeply nested JSON, and a regex cannot reliably match balanced braces at that depth. Counting { and } to find the matching brace is simple, with one caveat: a brace inside a string value would throw off the count.

  • 60 items per page is the default. AliExpress returns 60 products per search page. To get more, you’d need to paginate by adding &page=2, &page=3, etc. to the URL and making additional requests.
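
Generating the paginated URLs might look like the sketch below. The `&page=N` parameter comes from the note above; percent-encoding the query with `quote` is an assumption for multi-word searches that has not been checked against AliExpress, and `search_page_urls` is a hypothetical helper.

```python
from urllib.parse import quote


def search_page_urls(query: str, pages: int) -> list[str]:
    """Return search URLs for pages 1..pages of an AliExpress query."""
    q = quote(query)  # assumption: percent-encoding is accepted in the path
    base = f"https://www.aliexpress.com/w/wholesale-{q}.html?g=y&SearchText={q}"
    # Page 1 is the plain search URL; later pages append &page=N
    return [base if n == 1 else f"{base}&page={n}" for n in range(1, pages + 1)]
```

Each URL would go into its own payload and its own Browser Rendering request, with the product lists merged and deduplicated afterwards.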

AliExpress vs Walmart: Different Frameworks, Same Idea

The core technique is identical — find the embedded data blob, parse it, skip DOM scraping entirely. The difference is just where the data lives:

| | Walmart | AliExpress |
| --- | --- | --- |
| Framework | Next.js | Custom (“dida”) |
| Data location | <script id="__NEXT_DATA__"> | window._dida_config_._init_data_ |
| Extraction method | soup.find("script", id=...) | Bracket counting in script tags |
| Items per page | ~40 | 60 |
| Locale handling | Automatic (US site) | Requires cookies |

Both give you structured JSON with prices, ratings, images, and seller info. Both are far more reliable than parsing HTML elements. And in both cases the browser runs on Cloudflare’s infrastructure, so no local browser is needed.

AliExpress is a trickier target than Walmart — the data is harder to find, the locale problem adds a step, and the extraction requires bracket counting instead of a simple ID lookup. But once you know the pattern, it’s just as clean. You get structured product data from a single API call, with no browser dependencies on your machine.

The _init_data_ blob contains everything the page renders — products, filters, pagination, related searches, SEO metadata. If AliExpress shows it on the page, it’s in that JSON.

Scraping Walmart Ecommerce Data With Cloudflare Browser Rendering

If you’ve ever tried to scrape an ecommerce site like Walmart, you know the pain. Walmart’s storefront is a React app, and the HTML you get from a plain requests.get() is basically an empty shell: no products, no prices, nothing useful. The actual data only shows up after JavaScript runs, hydrates the page, and renders everything client-side.

So you need a real browser. But spinning up Puppeteer or Playwright locally is slow, fragile, and annoying to deploy. That’s where Cloudflare Browser Rendering comes in.

Instead of managing headless browsers yourself, you make an API call to Cloudflare. They spin up a browser instance on their edge network, navigate to the URL, wait for the page to fully render, and hand you back the HTML. It’s like having a browser-as-a-service.

The approach here is dead simple — two steps:

  1. Fetch the fully rendered HTML via Cloudflare’s API
  2. Parse the structured data out of it locally

No browser dependencies on your machine. No Selenium. No Docker containers running Chrome. Just HTTP requests and JSON parsing.

The first script hits Cloudflare’s /content endpoint. You give it a URL and some options, and it gives you back rendered HTML.

First, grab your Cloudflare credentials from environment variables (or a .env file):

import os
import json
import requests
from dotenv import load_dotenv
load_dotenv()
ACCOUNT_ID = os.environ.get("CF_ACCOUNT_ID")
API_TOKEN = os.environ.get("CF_API_TOKEN")
endpoint = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/browser-rendering/content"

Now build the payload. The key detail is the waitForSelector option. Walmart’s product grid doesn’t appear instantly — React needs a moment to mount and render the components. By telling Cloudflare to wait for [data-testid='item-stack'] to appear in the DOM, we make sure the products have actually loaded before the HTML gets captured.

SEARCH_QUERY = "sunglasses"
payload = {
"url": f"https://www.walmart.com/search?q={SEARCH_QUERY}",
"gotoOptions": {
"waitUntil": "domcontentloaded",
"timeout": 60000
},
"waitForSelector": {
"selector": "[data-testid='item-stack']",
"timeout": 30000
},
"viewport": {
"width": 1280,
"height": 720
}
}

Then fire it off and save the result:

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_TOKEN}"
}

resp = requests.post(endpoint, json=payload, headers=headers)

if resp.status_code == 200:
    with open("walmart_search_raw.html", "w", encoding="utf-8") as f:
        f.write(resp.text)
    print(f"Saved {len(resp.text):,} chars of rendered HTML")
else:
    print(f"Error {resp.status_code}: {resp.text[:500]}")

That’s the entire fetch step. One POST request, and you get back the full page HTML as if you’d opened it in Chrome and copied the rendered DOM from the Elements panel after everything loaded.

Here’s where it gets interesting. You could parse the HTML with BeautifulSoup, find all the product cards, and extract text from each element. But there’s a much better way.

Walmart is built with Next.js, and Next.js apps embed all their page data in a <script id="__NEXT_DATA__"> tag. It’s a giant JSON blob containing everything the page needs to render — including all the product data in a clean, structured format.

So instead of wrestling with CSS selectors and fragile DOM traversal, you just grab that JSON blob directly:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", id="__NEXT_DATA__")
next_data = json.loads(script.string)

From there, the product data lives at a predictable path:

search_result = next_data["props"]["pageProps"]["initialData"]["searchResult"]
item_stacks = search_result.get("itemStacks", [])
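
Because that path is four keys deep and could change without notice, a defensive lookup avoids a bare KeyError when one level is missing. The `dig` helper below is a hypothetical convenience function, not something from Next.js or Walmart.

```python
from typing import Any


def dig(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts by key, returning `default` if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

With it, the lookup becomes `dig(next_data, "props", "pageProps", "initialData", "searchResult", default={})`, and an empty result is a clear signal that the page structure changed or the page did not load fully.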

Now loop through the stacks and pull out what you need. One thing to watch for — not every item in a stack is an actual product. Walmart mixes in ads, placeholders, and other junk, so you need to filter:

products = []
for stack in item_stacks:
    for item in stack.get("items", []):
        # Skip non-product entries
        if item.get("__typename") not in ("Product", "SearchProduct"):
            if not item.get("usItemId"):
                continue

        price_info = item.get("priceInfo") or {}
        image_info = item.get("imageInfo") or {}

        product = {
            "name": item.get("name"),
            "brand": item.get("brand"),
            "usItemId": item.get("usItemId"),
            "url": "https://www.walmart.com" + item["canonicalUrl"]
                   if item.get("canonicalUrl") else None,
            "image": image_info.get("thumbnailUrl"),
            "price_current": price_info.get("linePrice"),
            "price_was": price_info.get("wasPrice"),
            "rating": item.get("averageRating"),
            "review_count": item.get("numberOfReviews"),
            "seller": item.get("sellerName"),
            "in_stock": not item.get("isOutOfStock", True),
            "is_sponsored": item.get("isSponsoredFlag", False),
        }
        products.append(product)

The same product can show up in multiple stacks (once in organic results, once as a sponsored listing), so deduplicate before saving:

seen = set()
unique = []
for p in products:
    uid = p["usItemId"]
    if uid and uid not in seen:
        seen.add(uid)
        unique.append(p)

with open("walmart_products.json", "w", encoding="utf-8") as f:
    json.dump(unique, f, indent=2, ensure_ascii=False)

Names, prices, ratings, review counts, seller info, availability, images — all neatly organized. No regex. No “find the third div inside the second span” nonsense.

It’s resilient. Walmart can redesign their entire UI and change every CSS class name, but as long as they’re using Next.js, the __NEXT_DATA__ structure stays consistent. You’re reading the same data source that React itself uses to render the page.

It’s clean. You get structured JSON instead of messy HTML. Prices, ratings, and review counts arrive as separate fields rather than text buried in markup (though linePrice is still a formatted string such as "$129.99"). Ratings are numbers. URLs are relative paths you can easily make absolute.

It’s fast. The Cloudflare browser call takes 10-20 seconds (it’s rendering a full page), but the parsing step is nearly instant. And since the browser runs on Cloudflare’s infrastructure, you’re not burning your own CPU.

It’s simple to deploy. No browser binaries to install. No Playwright or Puppeteer dependencies. The fetch step is just an HTTP POST. The parse step only needs beautifulsoup4 and the standard library.

A few gotchas I ran into:

  • The response format can vary. Sometimes Cloudflare wraps the HTML in a JSON object with a result key, sometimes it returns raw HTML. Handle both:

try:
    with open("walmart_search_raw.html", encoding="utf-8") as f:
        wrapper = json.load(f)
    html = wrapper["result"]
except (json.JSONDecodeError, KeyError):
    with open("walmart_search_raw.html", encoding="utf-8") as f:
        html = f.read()

  • Cloudflare has usage limits. Browser Rendering is metered, and the free tier includes only a limited allowance. Check the current limits before planning bulk scraping.

  • Timeouts need to be generous. Walmart’s page is heavy. The 60-second gotoOptions timeout and 30-second waitForSelector timeout aren’t arbitrary — shorter values will fail intermittently.

After running both scripts, you end up with a clean JSON file. Each product looks something like this:

{
  "name": "Ray-Ban RB2132 New Wayfarer Sunglasses",
  "brand": "Ray-Ban",
  "usItemId": "123456789",
  "url": "https://www.walmart.com/ip/...",
  "image": "https://i5.walmartimages.com/...",
  "price_current": "$129.99",
  "rating": 4.6,
  "review_count": 342,
  "seller": "Walmart.com",
  "in_stock": true,
  "is_sponsored": false
}

From here you can do whatever you want — feed it into a price tracker, build a comparison tool, run analytics, or just browse products without the ad clutter.

The combination of Cloudflare Browser Rendering and Next.js’s __NEXT_DATA__ is genuinely one of the cleanest scraping patterns I’ve come across. You offload the hard part (running a browser) to Cloudflare, and you get structured data for free because of how Next.js works.

It’s not going to work for every site — only Next.js apps have that convenient data blob. But for the ones that do, it beats traditional DOM scraping by a mile.

Hello World

This is the first blog post on Browserflare. Stay tuned for more updates!