Scraping AliExpress Ecommerce Data With Cloudflare Browser Rendering
Scraping AliExpress Search Results With Cloudflare Browser Rendering
AliExpress is one of those sites that looks simple on the surface but fights you at every turn when you try to scrape it. It’s a fully client-rendered SPA: a plain HTTP request gives you a loading skeleton with zero product data. On top of that, AliExpress geo-localizes aggressively, so your results might come back in Japanese or Russian depending on which data center handles the request.
Here’s how to get clean, structured product data from AliExpress search results using Cloudflare Browser Rendering — no local browser needed.
The Approach
Same two-step pattern as any browser-rendering scraping workflow:
- Fetch the fully rendered HTML via Cloudflare’s `/content` API
- Parse the embedded JSON data locally
The interesting part with AliExpress is that it doesn’t use Next.js like Walmart does. There’s no convenient `__NEXT_DATA__` tag. Instead, AliExpress embeds its page data inside a `window._dida_config_._init_data_` JavaScript variable, a massive JSON blob buried in a script tag. Once you know where to look, it’s just as clean to parse.
Step 1: Fetching the HTML
Start with the usual Cloudflare Browser Rendering setup:

    import os
    import requests
    from dotenv import load_dotenv

    load_dotenv()

    ACCOUNT_ID = os.environ.get("CF_ACCOUNT_ID")
    API_TOKEN = os.environ.get("CF_API_TOKEN")

    endpoint = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/browser-rendering/content"

Build the payload. AliExpress search URLs follow the pattern `/w/wholesale-{query}.html`. The `waitForSelector` targets `.search-item-card-wrapper-gallery`, which is the container for the product grid:

    SEARCH_QUERY = "sunglasses"

    payload = {
        "url": f"https://www.aliexpress.com/w/wholesale-{SEARCH_QUERY}.html?g=y&SearchText={SEARCH_QUERY}",
        "gotoOptions": {
            "waitUntil": "domcontentloaded",
            "timeout": 60000
        },
        "waitForSelector": {
            "selector": ".search-item-card-wrapper-gallery",
            "timeout": 30000
        },
        "viewport": {
            "width": 1280,
            "height": 720
        }
    }

The Locale Problem
Here’s a gotcha that’ll waste your time if you’re not ready for it. Cloudflare’s browser instances run on their edge network, and AliExpress geo-localizes based on the requesting IP. If the edge node is in Tokyo, you’ll get Japanese product titles and yen prices. In São Paulo? Portuguese and reais.
The Accept-Language header alone doesn’t fix this — AliExpress mostly ignores it. What works is setting locale cookies before the page loads. Cloudflare’s API supports a cookies parameter for exactly this:

    payload = {
        # ... url, gotoOptions, etc.
        "setExtraHTTPHeaders": {
            "Accept-Language": "en-US,en;q=0.9"
        },
        "cookies": [
            {
                "name": "aep_usuc_f",
                "value": "site=glo&c_tp=USD&region=US&b_locale=en_US",
                "domain": ".aliexpress.com"
            },
            {
                "name": "intl_locale",
                "value": "en_US",
                "domain": ".aliexpress.com"
            }
        ]
    }

The `aep_usuc_f` cookie is the important one: it tells AliExpress which regional site to serve (`glo` for global), what currency to use (`USD`), and what locale to render (`en_US`). Without it, you’re at the mercy of IP geolocation.
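Since the `aep_usuc_f` value is just a handful of `key=value` pairs joined with `&`, it can be handy to build it from parts when you want to try another currency or region. Here’s a minimal sketch. The helper name `locale_cookies` is made up for illustration, and only the default combination (`glo` / `USD` / `US` / `en_US`) comes from this article; other combinations are untested assumptions.

```python
def locale_cookies(site="glo", currency="USD", region="US", locale="en_US"):
    """Build the two locale cookies in the shape the `cookies` parameter expects.

    Hypothetical helper: only the default values are taken from the article.
    """
    # aep_usuc_f packs several settings into one '&'-joined value.
    usuc = f"site={site}&c_tp={currency}&region={region}&b_locale={locale}"
    domain = ".aliexpress.com"
    return [
        {"name": "aep_usuc_f", "value": usuc, "domain": domain},
        {"name": "intl_locale", "value": locale, "domain": domain},
    ]


cookies = locale_cookies()
print(cookies[0]["value"])  # site=glo&c_tp=USD&region=US&b_locale=en_US
```

You’d then drop the result straight into the payload as `payload["cookies"] = locale_cookies()`.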
Fire it off and save:

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}"
    }

    resp = requests.post(endpoint, json=payload, headers=headers)

    if resp.status_code == 200:
        with open("aliexpress_search_raw.html", "w", encoding="utf-8") as f:
            f.write(resp.text)
        print(f"Saved {len(resp.text):,} chars of rendered HTML")
    else:
        print(f"Error {resp.status_code}: {resp.text[:500]}")

Step 2: Parsing the Data
AliExpress doesn’t use Next.js, so there’s no `__NEXT_DATA__` to grab. Instead, it uses a custom framework (internally called “dida”) that injects page data into a JavaScript variable: `window._dida_config_._init_data_`.
The tricky part is extracting it. The data is assigned as a JavaScript object literal inside a `<script>` tag, not a standalone JSON string. A regex is a poor fit here: regular expressions can’t match arbitrarily nested braces, and the blob is hundreds of thousands of characters long. Bracket counting works reliably:

    import json
    from bs4 import BeautifulSoup

    # Load the rendered HTML saved in Step 1
    with open("aliexpress_search_raw.html", encoding="utf-8") as f:
        html = f.read()

    soup = BeautifulSoup(html, "html.parser")

    init_data = None
    for script in soup.find_all("script"):
        text = script.string or ""
        marker = "_init_data_= { data: "
        idx = text.find(marker)
        if idx == -1:
            continue
        idx += len(marker)
        # Find matching closing brace
        depth = 0
        for j in range(idx, len(text)):
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
                if depth == 0:
                    init_data = json.loads(text[idx:j + 1])
                    break
        # Only one script carries the marker, so stop after it
        break

You scan every script tag looking for the `_init_data_` marker, then count curly braces to find where the JSON object ends. Once you’ve got the boundaries, `json.loads` handles the rest.
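One caveat with plain brace counting: it miscounts if a string value (say, a product title) happens to contain an unbalanced `{` or `}`. If that ever bites you, the standard library’s `json.JSONDecoder.raw_decode` is an alternative worth trying. It parses exactly one JSON value starting at a given index and ignores whatever comes after. The sketch below assumes the text after the marker is strict JSON (which the `json.loads` call already requires); the function name `extract_init_data` and the sample string are made up for illustration.

```python
import json

MARKER = "_init_data_= { data: "  # same marker string used in the article


def extract_init_data(script_text, marker=MARKER):
    """Return the JSON object that follows `marker`, or None if it is absent."""
    idx = script_text.find(marker)
    if idx == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and stops
    # where that value ends, so braces inside strings cannot confuse it.
    obj, _end = json.JSONDecoder().raw_decode(script_text, idx + len(marker))
    return obj


# Made-up sample with a stray brace inside a string value.
sample = 'window._dida_config_._init_data_= { data: {"data": {"title": "a } b"}} };'
print(extract_init_data(sample))
```

You’d call it once per script tag and keep the first non-`None` result.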
The Data Structure
The init data has a clear hierarchy. Product listings live at `data.root.fields.mods.itemList.content`, an array of 60 items per page:

    root_fields = init_data["data"]["root"]["fields"]
    page_info = root_fields.get("pageInfo", {})
    item_list = root_fields["mods"]["itemList"]["content"]

The `pageInfo` object gives you pagination metadata: total results, current page, page size. Useful if you want to paginate through all results.
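If you do want to walk every page, the first thing you need is the page count. The key names inside `pageInfo` aren’t shown in this article, so the sketch below deliberately takes plain integers; check the `pageInfo` from your own response to see which keys hold the total and the page size. The function name `total_pages` is hypothetical.

```python
import math


def total_pages(total_results, page_size=60):
    """Number of result pages needed to cover `total_results` items."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total_results / page_size)


print(total_pages(1234))  # 21 pages at 60 items each
```

AliExpress may well cap how deep you can paginate regardless of the reported total, so treat the result as an upper bound.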
Each item in the list has a consistent shape. Here’s how to extract the useful fields:

    products = []
    for item in item_list:
        # Nested objects may be missing or null, so fall back to {}
        prices = item.get("prices") or {}
        sale_price = prices.get("salePrice") or {}
        original_price = prices.get("originalPrice") or {}
        evaluation = item.get("evaluation") or {}
        trade = item.get("trade") or {}
        image = item.get("image") or {}
        title = item.get("title") or {}

        img_url = image.get("imgUrl", "")
        if img_url and img_url.startswith("//"):
            img_url = "https:" + img_url

        product = {
            "product_id": item.get("productId"),
            "title": title.get("displayTitle"),
            "url": f"https://www.aliexpress.com/item/{item.get('productId')}.html",
            "image": img_url or None,
            "price": sale_price.get("formattedPrice"),
            "price_original": original_price.get("formattedPrice"),
            "discount_pct": sale_price.get("discount"),
            "currency": sale_price.get("currencyCode"),
            "rating": evaluation.get("starRating"),
            "sold": trade.get("tradeDesc"),
            "is_ad": item.get("productType") == "ad",
        }
        products.append(product)

A few things worth noting:
- Image URLs are protocol-relative: they start with `//` instead of `https://`, so you need to prepend the scheme yourself.
- Prices come in two flavors: `salePrice` is the discounted price, `originalPrice` is the pre-discount price. Not every item has both.
- Ads are identified by `productType`: sponsored listings have `"productType": "ad"` while organic results are `"natural"`.
- The `tradeDesc` field is a human-readable sold count: strings like `"10,000+ sold"` rather than raw numbers.
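If you need numbers rather than display strings, a little post-processing goes a long way. Here’s a rough sketch that assumes English-locale formatting like the examples in this article (`"10,000+ sold"`, `"US $0.99"`), with `,` as the thousands separator and `.` as the decimal point. Other locales will need different rules, and the helper names `parse_sold` and `parse_price` are made up for illustration.

```python
import re


def parse_sold(trade_desc):
    """Turn a string like '10,000+ sold' into a lower-bound integer."""
    if not trade_desc:
        return None
    match = re.search(r"\d[\d,]*", trade_desc)
    if not match:
        return None
    return int(match.group().replace(",", ""))


def parse_price(formatted_price):
    """Turn a string like 'US $0.99' into a float."""
    if not formatted_price:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", formatted_price)
    if not match:
        return None
    return float(match.group().replace(",", ""))


print(parse_sold("10,000+ sold"), parse_price("US $6.32"))  # 10000 6.32
```

Keep in mind that the `+` means the sold count is a floor, not an exact figure.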
Deduplicate and save:

    seen = set()
    unique = []
    for p in products:
        pid = p["product_id"]
        if pid and pid not in seen:
            seen.add(pid)
            unique.append(p)

    with open("aliexpress_products.json", "w", encoding="utf-8") as f:
        json.dump(unique, f, indent=2, ensure_ascii=False)

The Output
After running both scripts, you get a clean JSON file with up to 60 products per page. Each entry looks like this:

    {
      "product_id": "1005007595844944",
      "title": "Men Classical Square Polarized Sports Sunglasses Lightweight PC Frame UV400",
      "url": "https://www.aliexpress.com/item/1005007595844944.html",
      "image": "https://ae-pic-a1.aliexpress-media.com/kf/S0af78966b3614be2b75b067b0c7c51e1z.jpg",
      "price": "US $0.99",
      "price_original": "US $6.32",
      "discount_pct": 84,
      "currency": "USD",
      "rating": 4.7,
      "sold": "5,000+ sold",
      "is_ad": false
    }

Things to Watch Out For
- The locale cookie is essential. Without `aep_usuc_f`, you’ll get results in whatever language matches the Cloudflare edge node’s geography. The `intl_locale` cookie reinforces it, but `aep_usuc_f` does the heavy lifting.
- The `cookies` API parameter, not `setCookies`. Cloudflare’s Browser Rendering API uses `cookies` as the parameter name. Using `setCookies` (modeled on Puppeteer’s `page.setCookie` method) will give you a 400 error with an “unrecognized keys” message.
- The response might be JSON-wrapped. Depending on the endpoint configuration, Cloudflare may return the HTML wrapped in a JSON object like `{"success": true, "result": "<html>..."}`. Handle both formats:

        # Read once: a failed json.load(f) would leave nothing for f.read()
        with open("aliexpress_search_raw.html", encoding="utf-8") as f:
            raw = f.read()

        try:
            html = json.loads(raw)["result"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Not JSON, or not the wrapper shape: treat it as raw HTML
            html = raw

- Bracket counting beats regex for extraction. The `_init_data_` blob can be 400KB+ of deeply nested JSON, and a regex can’t track nested braces. Counting `{` and `}` to find the matching brace is simple and holds up well, provided any braces inside string values are balanced.
- 60 items per page is the default. AliExpress returns 60 products per search page. To get more, you’d need to paginate by adding `&page=2`, `&page=3`, etc. to the URL and making additional requests.
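For pagination, a small URL builder keeps the query encoding in one place. This is a sketch under a couple of assumptions: the `page` parameter behaves as described above, and multi-word queries are hyphen-joined in the path. The article only shows a single-word query, so that second part is a guess. The function name `search_url` is hypothetical.

```python
from urllib.parse import quote, urlencode


def search_url(query, page=1):
    """Build an AliExpress search URL for `query`, optionally for a later page."""
    # Hyphen-joining words in the path is an assumption; SearchText carries
    # the actual query text either way.
    slug = quote("-".join(query.split()))
    params = {"g": "y", "SearchText": query}
    if page > 1:
        params["page"] = page
    return f"https://www.aliexpress.com/w/wholesale-{slug}.html?{urlencode(params)}"


for page in range(1, 4):
    print(search_url("polarized sunglasses", page))
```

Each URL goes into its own `/content` request, so three pages means three API calls.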
AliExpress vs Walmart: Different Frameworks, Same Idea
The core technique is identical: find the embedded data blob, parse it, skip DOM scraping entirely. The difference is just where the data lives:

| | Walmart | AliExpress |
|---|---|---|
| Framework | Next.js | Custom (“dida”) |
| Data location | `<script id="__NEXT_DATA__">` | `window._dida_config_._init_data_` |
| Extraction method | `soup.find("script", id=...)` | Bracket counting in script tags |
| Items per page | ~40 | 60 |
| Locale handling | Automatic (US site) | Requires cookies |

Both give you structured JSON with prices, ratings, images, and seller info. Both are far more reliable than parsing HTML elements. And both run entirely on Cloudflare’s infrastructure — no local browser needed.
Wrapping Up
AliExpress is a trickier target than Walmart: the data is harder to find, the locale problem adds a step, and the extraction requires bracket counting instead of a simple ID lookup. But once you know the pattern, it’s just as clean. You get structured product data from a single API call, with no browser dependencies on your machine.
The `_init_data_` blob contains everything the page renders: products, filters, pagination, related searches, SEO metadata. If AliExpress shows it on the page, it’s in that JSON.
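To see what else is in there, it helps to list the key paths before writing any extraction code. Here’s a small sketch that walks nested dicts down to a chosen depth. It doesn’t descend into lists, and the sample dict is a tiny stand-in: only the `itemList` and `pageInfo` paths come from this article. The function name `key_paths` is made up for illustration.

```python
def key_paths(node, prefix="", max_depth=3):
    """Yield dotted paths of dict keys, descending at most `max_depth` levels."""
    if max_depth == 0 or not isinstance(node, dict):
        return
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        yield from key_paths(value, path, max_depth - 1)


# Stand-in for the real blob, just to show the output shape.
sample = {"data": {"root": {"fields": {"pageInfo": {}, "mods": {"itemList": {"content": []}}}}}}
for path in key_paths(sample, max_depth=4):
    print(path)
```

Run it against the real `init_data` with a larger `max_depth` and you get a quick map of where the filters and related searches live.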