Scraping AliExpress Ecommerce Data With Cloudflare Browser Rendering

Scraping AliExpress Search Results With Cloudflare Browser Rendering

AliExpress is one of those sites that looks simple on the surface but fights you at every turn when you try to scrape it. It’s a fully client-rendered SPA — a plain HTTP request gives you a loading skeleton with zero product data. On top of that, AliExpress geo-localizes aggressively, so your results might come back in Japanese or Russian depending on which data center handles the request.

Here’s how to get clean, structured product data from AliExpress search results using Cloudflare Browser Rendering — no local browser needed.

Same two-step pattern as any browser-rendering scraping workflow:

  1. Fetch the fully rendered HTML via Cloudflare’s /content API
  2. Parse the embedded JSON data locally

The interesting part with AliExpress is that it doesn’t use Next.js like Walmart does. There’s no convenient __NEXT_DATA__ tag. Instead, AliExpress embeds its page data inside a window._dida_config_._init_data_ JavaScript variable — a massive JSON blob buried in a script tag. Once you know where to look, it’s just as clean to parse.

Start with the usual Cloudflare Browser Rendering setup:

import os
import requests
from dotenv import load_dotenv
load_dotenv()
ACCOUNT_ID = os.environ.get("CF_ACCOUNT_ID")
API_TOKEN = os.environ.get("CF_API_TOKEN")
endpoint = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/browser-rendering/content"

Build the payload. AliExpress search URLs follow the pattern /w/wholesale-{query}.html. The waitForSelector targets .search-item-card-wrapper-gallery, which is the container for the product grid:

SEARCH_QUERY = "sunglasses"
payload = {
    "url": f"https://www.aliexpress.com/w/wholesale-{SEARCH_QUERY}.html?g=y&SearchText={SEARCH_QUERY}",
    "gotoOptions": {
        "waitUntil": "domcontentloaded",
        "timeout": 60000
    },
    "waitForSelector": {
        "selector": ".search-item-card-wrapper-gallery",
        "timeout": 30000
    },
    "viewport": {
        "width": 1280,
        "height": 720
    }
}

Here’s a gotcha that’ll waste your time if you’re not ready for it. Cloudflare’s browser instances run on their edge network, and AliExpress geo-localizes based on the requesting IP. If the edge node is in Tokyo, you’ll get Japanese product titles and yen prices. In São Paulo? Portuguese and reais.

The Accept-Language header alone doesn’t fix this — AliExpress mostly ignores it. What works is setting locale cookies before the page loads. Cloudflare’s API supports a cookies parameter for exactly this:

payload = {
    # ... url, gotoOptions, etc.
    "setExtraHTTPHeaders": {
        "Accept-Language": "en-US,en;q=0.9"
    },
    "cookies": [
        {
            "name": "aep_usuc_f",
            "value": "site=glo&c_tp=USD&region=US&b_locale=en_US",
            "domain": ".aliexpress.com"
        },
        {
            "name": "intl_locale",
            "value": "en_US",
            "domain": ".aliexpress.com"
        }
    ]
}

The aep_usuc_f cookie is the important one — it tells AliExpress which regional site to serve (glo for global), what currency to use (USD), and what locale to render (en_US). Without it, you’re at the mercy of IP geolocation.
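
If you need a different region or currency, the cookie value can be assembled from its parts. The sketch below assumes the four fields shown above (site, c_tp, region, b_locale) are all that is needed. build_locale_cookies is a hypothetical helper name, and any combination other than US/USD/en_US is an untested assumption.

```python
def build_locale_cookies(region="US", currency="USD", locale="en_US", site="glo"):
    """Return a cookies list for the payload (hypothetical helper)."""
    # Field order mirrors the aep_usuc_f value used in this guide.
    usuc = f"site={site}&c_tp={currency}&region={region}&b_locale={locale}"
    return [
        {"name": "aep_usuc_f", "value": usuc, "domain": ".aliexpress.com"},
        {"name": "intl_locale", "value": locale, "domain": ".aliexpress.com"},
    ]


cookies = build_locale_cookies()
print(cookies[0]["value"])
```

The returned list can be dropped straight into the payload as payload["cookies"].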

Fire it off and save:

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_TOKEN}"
}
resp = requests.post(endpoint, json=payload, headers=headers)
if resp.status_code == 200:
    with open("aliexpress_search_raw.html", "w", encoding="utf-8") as f:
        f.write(resp.text)
    print(f"Saved {len(resp.text):,} chars of rendered HTML")
else:
    print(f"Error {resp.status_code}: {resp.text[:500]}")

As noted earlier, AliExpress runs on a custom framework (internally called “dida”) rather than Next.js, and it injects the page data into a JavaScript variable: window._dida_config_._init_data_.

The tricky part is extracting it. The data is assigned as a JavaScript object literal inside a <script> tag, not as a standalone JSON string. A regex is a poor fit here: regular expressions can't match arbitrarily nested braces, and the blob is hundreds of thousands of characters long. Bracket counting works reliably:

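import json
from bs4 import BeautifulSoup

# Load the HTML saved by the fetch step
with open("aliexpress_search_raw.html", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")
init_data = None
marker = "_init_data_= { data: "
for script in soup.find_all("script"):
    text = script.string or ""
    idx = text.find(marker)
    if idx == -1:
        continue
    idx += len(marker)
    # Walk forward, tracking brace depth, until the opening brace is closed
    depth = 0
    for j in range(idx, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                init_data = json.loads(text[idx:j + 1])
                break
    break  # stop after the first script that contains the marker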

You scan every script tag looking for the _init_data_ marker, then count curly braces to find where the JSON object ends. Once you’ve got the boundaries, json.loads handles the rest.
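
One caveat with plain brace counting: it doesn't know about string literals, so a { or } inside a product title can throw the depth off. As an alternative sketch, Python's json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it, which sidesteps that problem. The extract_init_data name and the sample script text are made up for illustration; the marker is the same one used above.

```python
import json

MARKER = "_init_data_= { data: "


def extract_init_data(script_text):
    """Parse the JSON value that follows MARKER, or return None."""
    idx = script_text.find(MARKER)
    if idx == -1:
        return None
    idx += len(MARKER)
    # raw_decode does not skip leading whitespace, so do it here.
    while idx < len(script_text) and script_text[idx].isspace():
        idx += 1
    # Parses exactly one JSON value starting at idx; braces inside
    # strings are handled by the JSON parser itself.
    obj, _end = json.JSONDecoder().raw_decode(script_text, idx)
    return obj


# Made-up sample: note the unbalanced "}" inside the title string.
sample = 'window._dida_config_._init_data_= { data: {"data": {"title": "odd } title"}} };'
print(extract_init_data(sample))
```

Inside the script loop, this would replace the inner brace-counting loop: call extract_init_data(text) and stop at the first non-None result.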

The init data has a clear hierarchy. Product listings live at data.root.fields.mods.itemList.content — an array of 60 items per page:

root_fields = init_data["data"]["root"]["fields"]
page_info = root_fields.get("pageInfo", {})
item_list = root_fields["mods"]["itemList"]["content"]

The pageInfo object gives you pagination metadata — total results, current page, page size. Useful if you want to paginate through all results.

Each item in the list has a consistent shape. Here’s how to extract the useful fields:

products = []
for item in item_list:
    prices = item.get("prices") or {}
    sale_price = prices.get("salePrice") or {}
    original_price = prices.get("originalPrice") or {}
    evaluation = item.get("evaluation") or {}
    trade = item.get("trade") or {}
    image = item.get("image") or {}
    title = item.get("title") or {}
    img_url = image.get("imgUrl", "")
    if img_url and img_url.startswith("//"):
        img_url = "https:" + img_url
    product = {
        "product_id": item.get("productId"),
        "title": title.get("displayTitle"),
        "url": f"https://www.aliexpress.com/item/{item.get('productId')}.html",
        "image": img_url or None,
        "price": sale_price.get("formattedPrice"),
        "price_original": original_price.get("formattedPrice"),
        "discount_pct": sale_price.get("discount"),
        "currency": sale_price.get("currencyCode"),
        "rating": evaluation.get("starRating"),
        "sold": trade.get("tradeDesc"),
        "is_ad": item.get("productType") == "ad",
    }
    products.append(product)

A few things worth noting:

  • Image URLs are protocol-relative — they start with // instead of https://, so you need to prepend the scheme yourself.
  • Prices come in two flavors: salePrice is the discounted price, originalPrice is the pre-discount price. Not every item has both.
  • Ads are identified by productType — sponsored listings have "productType": "ad" while organic results are "natural".
  • The tradeDesc field is a human-readable sold count — strings like "10,000+ sold" rather than raw numbers.
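
If you want numbers instead of display strings, a small parsing step helps. This sketch assumes strings shaped like the examples in this guide ("10,000+ sold", "US $0.99"); other locales or formats may need different handling, and both function names are hypothetical.

```python
import re


def parse_sold(trade_desc):
    """'10,000+ sold' -> 10000 (a lower bound); None if no digits."""
    if not trade_desc:
        return None
    match = re.search(r"\d[\d,]*", trade_desc)
    if not match:
        return None
    return int(match.group().replace(",", ""))


def parse_price(formatted_price):
    """'US $0.99' -> 0.99; assumes '.' is the decimal separator."""
    if not formatted_price:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", formatted_price)
    if not match:
        return None
    return float(match.group().replace(",", ""))


print(parse_sold("10,000+ sold"), parse_price("US $6.32"))
```

Keeping the original strings alongside the parsed values is a reasonable default, since the "+" in the sold count means the number is only a floor.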

Deduplicate and save:

seen = set()
unique = []
for p in products:
    pid = p["product_id"]
    if pid and pid not in seen:
        seen.add(pid)
        unique.append(p)
with open("aliexpress_products.json", "w", encoding="utf-8") as f:
    json.dump(unique, f, indent=2, ensure_ascii=False)

After running both scripts, you get a clean JSON file with up to 60 products per page, each shaped like this:

{
    "product_id": "1005007595844944",
    "title": "Men Classical Square Polarized Sports Sunglasses Lightweight PC Frame UV400",
    "url": "https://www.aliexpress.com/item/1005007595844944.html",
    "image": "https://ae-pic-a1.aliexpress-media.com/kf/S0af78966b3614be2b75b067b0c7c51e1z.jpg",
    "price": "US $0.99",
    "price_original": "US $6.32",
    "discount_pct": 84,
    "currency": "USD",
    "rating": 4.7,
    "sold": "5,000+ sold",
    "is_ad": false
}

  • The locale cookie is essential. Without aep_usuc_f, you’ll get results in whatever language matches the Cloudflare edge node’s geography. The intl_locale cookie reinforces it but aep_usuc_f does the heavy lifting.

  • The cookies API parameter — not setCookies. Cloudflare’s Browser Rendering API uses cookies as the parameter name. Using setCookies (which is the Puppeteer method name) will give you a 400 error with an “unrecognized keys” message.

  • The response might be JSON-wrapped. Depending on the endpoint configuration, Cloudflare may return the HTML wrapped in a JSON object like {"success": true, "result": "<html>..."}. Handle both formats:

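with open("aliexpress_search_raw.html", encoding="utf-8") as f:
    raw = f.read()
try:
    # JSON-wrapped response: the HTML is under "result"
    html = json.loads(raw)["result"]
except (json.JSONDecodeError, KeyError, TypeError):
    # Plain HTML response
    html = raw

  • Bracket counting beats regex for extraction. The _init_data_ blob can be 400KB+ of deeply nested JSON, and a regex can't reliably match nested braces. Counting { and } to find the matching brace is simple and works as long as no string value contains an unbalanced brace.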

  • 60 items per page is the default. AliExpress returns 60 products per search page. To get more, you’d need to paginate by adding &page=2, &page=3, etc. to the URL and making additional requests.
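
Pagination can be sketched as a URL builder that feeds the same payload. The &page=N parameter comes from the note above; replacing spaces with hyphens in the path for multi-word queries is an assumption, as is the search_url helper name.

```python
from urllib.parse import quote, urlencode


def search_url(query, page=1):
    """Build a search URL following the /w/wholesale-{query}.html pattern."""
    # Assumption: multi-word queries use hyphens in the path segment.
    slug = quote(query.strip().replace(" ", "-"))
    params = {"g": "y", "SearchText": query}
    if page > 1:
        params["page"] = page
    return f"https://www.aliexpress.com/w/wholesale-{slug}.html?{urlencode(params)}"


for page in range(1, 4):
    print(search_url("sunglasses", page))
```

Each URL would go into payload["url"] for a separate request, so three pages means three API calls.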

AliExpress vs Walmart: Different Frameworks, Same Idea

The core technique is identical — find the embedded data blob, parse it, skip DOM scraping entirely. The difference is just where the data lives:

| | Walmart | AliExpress |
|---|---|---|
| Framework | Next.js | Custom (“dida”) |
| Data location | <script id="__NEXT_DATA__"> | window._dida_config_._init_data_ |
| Extraction method | soup.find("script", id=...) | Bracket counting in script tags |
| Items per page | ~40 | 60 |
| Locale handling | Automatic (US site) | Requires cookies |

Both give you structured JSON with prices, ratings, images, and seller info. Both are far more reliable than parsing HTML elements. And both run entirely on Cloudflare’s infrastructure — no local browser needed.

AliExpress is a trickier target than Walmart — the data is harder to find, the locale problem adds a step, and the extraction requires bracket counting instead of a simple ID lookup. But once you know the pattern, it’s just as clean. You get structured product data from a single API call, with no browser dependencies on your machine.

The _init_data_ blob contains everything the page renders — products, filters, pagination, related searches, SEO metadata. If AliExpress shows it on the page, it’s in that JSON.
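
To see what else is in there, it can help to print the top-level shape before writing extraction code. This is a generic sketch: outline is a hypothetical helper, and the sample dict is made up and far smaller than the real blob.

```python
def outline(node, depth=2, prefix=""):
    """Return dotted key paths down to `depth` levels, with list lengths."""
    lines = []
    if isinstance(node, dict) and depth > 0:
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, list):
                lines.append(f"{path} [list, {len(value)} items]")
            else:
                lines.append(path)
            # Recurse into nested dicts; lists and scalars add nothing more.
            lines.extend(outline(value, depth - 1, path))
    return lines


# Made-up miniature of the structure described in this guide.
sample = {"mods": {"itemList": {"content": [{}, {}]}}, "pageInfo": {}}
print("\n".join(outline(sample, depth=3)))
```

Run against the real fields object (root_fields in the parsing step), this gives a quick map of which mods exist before you commit to key paths.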