Scraping Pinduoduo: A Mobile API Solution

Most scraping tutorials start with a website.
This one didn't.

I needed structured product data from Pinduoduo (拼多多), one of China's largest e-commerce platforms. The web version was a dead end: aggressively throttled, inconsistently responsive, and missing critical data fields. Complete product listings weren't even available without the mobile app.

The mobile application, however, was a different beast entirely.

Like most modern apps, it offered no public API. Every request was cryptographically signed. Headers were opaque, dynamic, and device-bound. Rate limiting wasn't just aggressive, it was intelligent. IP rotation? Instantly flagged. At first glance, the app seemed intentionally hostile to any automated access.

It wasn't.

In this post, I'll walk through how I reverse-engineered Pinduoduo's mobile API to extract structured data: what broke, what worked, and how I systematically worked around challenges like request signing, device fingerprinting, and HTTP 429 rate limits.

The Target

Pinduoduo (https://pinduoduo.com/) is a Chinese e-commerce giant known for group buying and steep discounts. With over 800 million active users, it's a data goldmine for market research, but one that fiercely protects its mobile ecosystem.

The web version is a second-class citizen by design. Core features, complete product catalogues, and real-time pricing live exclusively in the mobile app. If you want the data, you have to crack the app.

What I Tried First: Standard Scraping Playbook

Before going down the mobile rabbit hole, I ran through the conventional web scraping toolkit. These are the battle-tested methods that work on 90% of e-commerce sites. Pinduoduo wasn't in that 90%.

Direct HTTP Requests

The simplest approach: replicate what the browser does. Open DevTools, watch the Network tab, copy the request as a curl command, and translate it to Python.
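```
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Accept": "application/json, text/plain, */*",
}

# Replay the request the browser made; the timeout stops a stalled request from hanging.
response = requests.get(
    "https://pinduoduo.com/home/supermarket",
    headers=headers,
    timeout=10,
)
response.raise_for_status()  # fail loudly on 4xx/5xx
data = response.json()  # raises if the body is not valid JSON
```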
What happened: 200 OK responses with empty or obfuscated payloads. The product data was missing; it could only be reached by browsing in the mobile application.

Headless Browsers

When raw HTTP fails, browser automation is the next step. Let a real Chromium instance execute the JavaScript, wait for network idle, then scrape the rendered DOM.

What happened: The data was incomplete. The mobile site at https://mobile.yangkeduo.com looks like an alternative, but it doesn't show the complete listing either and stops returning data once you scroll far enough.

Proxy Rotation & Residential IPs

Standard evasion: distribute requests across thousands of residential IPs. Services like Bright Data, Oxylabs, or Smartproxy make this trivial.

What happened: IP rotation was beside the point. The required headers carry a userId that is assigned at signup, so requests are tied to an account rather than to an IP address.

API Endpoint Discovery

Modern SPAs (Single Page Applications) often leak internal APIs. Search the Network tab for .json endpoints, api. subdomains, or GraphQL queries. Reverse engineer the authentication, usually a bearer token or session cookie.

What happened: Pinduoduo's web endpoints were decoys. The "API" calls from the browser returned partial data or required tokens generated through JavaScript obfuscation so dense it might as well have been native code. The real API, the one powering search, recommendations, and checkout, wasn't talking to the browser at all.

It was talking to the mobile app.

The Dead End

After hours and hours of iteration, the pattern was clear: Pinduoduo's web infrastructure is deliberately crippled. It's not that they couldn't expose rich data via web endpoints, it's that they choose not to. The mobile app is the primary interface, and the web version exists only for SEO and casual browsing.

Every conventional method hit the same wall: incomplete data, aggressive blocking, or both. The cost of evasion (proxy bandwidth, compute time for headless browsers, and engineering hours) exceeded the value of the partial data returned.

To get complete, structured, real-time product data, I needed to become the mobile app.

Enter the Mobile App: Reverse Engineering the Private API

If the web was a fortress with a broken gate, the mobile app was a vault: heavily guarded, but worth the effort. Mobile apps don't run in inspectable browsers. They compile their logic into native code, encrypt traffic by default, and bake device identity into every request.

This is where most scrapers quit. It's also where the real data lives.

Step 1: Traffic Interception

First, I needed to see what the app was actually sending. Standard approach: man-in-the-middle (MITM) proxy.

Tool I used: HTTP Toolkit (with ADB connected to my mobile device)

Methodology:

Root the mobile device. Without root, it is almost impossible to intercept the application's API traffic.
Install the proxy's root certificate on the Android device.
Connect the device to HTTP Toolkit via ADB.
Inspect all the captured requests in HTTP Toolkit.

What happened: The encrypted HTTPS traffic became readable JSON, which made the API far easier to reverse engineer.

Step 2: Request Anatomy

With traffic flowing through the proxy, I could inspect what the app actually sent. This wasn't a clean REST API with OAuth tokens. It was a defensive architecture designed to verify every aspect of the requester's identity.

Here's a real search request captured from the app:

curl --location 'https://api.pinduoduo.com/search?source=index&pdduid=...' \
  --header 'accept-encoding: gzip' \
  --header 'accept-language: en-US' \
  --header 'accesstoken: ...' \
  --header 'al-sa: {...}' \
  --header 'anti-token: ...' \
  --header 'content-type: application/json;charset=UTF-8' \
  --header 'cookie: acid=...; api_uid=...' \
  --header 'etag: ...' \
  --header 'host: api.pinduoduo.com' \
  --header 'lat: ...' \
  --header 'multi-set: ...' \
  --header 'p-appname: pinduoduo' \
  --header 'p-mediainfo: ...' \
  --header 'p-proc: main' \
  --header 'p-proc-time: ...' \
  --header 'pdd-config: ...' \
  --header 'referer: Android' \
  --header 'user-agent: android Mozilla/5.0 (...) Mobile Safari/... phh_android_version/... phh_android_build/... phh_android_channel/...' \
  --header 'x-app-lang: en' \
  --header 'x-app-ui: ...' \
  --header 'x-b3-ptracer: ...' \
  --header 'x-pdd-info: ...' \
  --header 'x-pdd-queries: width=...&height=...&dpr=...&net=...&brand=...&model=...&osv=...&appv=...&pl=...' \
  --header 'x-yak-llt: ...' \
  --data '{
    "install_token": "...",
    "item_ver": "...",
    "list_id": "...",
    "track_data": "...",
    "source": "index",
    "page_sn": "...",
    "page_id": "search_result.html",
    "referer_params": null,
    "dark_mode": "0",
    "show_mark_icon": "1",
    "flip_gset_num": "...",
    "flip": "...",
    "back_search": "false",
    "page_el_sn": "...",
    "search_met": "manual",
    "max_offset": "...",
    "sort": "default",
    "exposure_offset": "...",
    "is_sys_minor": "0",
    "q": "SEARCH_TERM_HERE",
    "size": "20",
    "union_pay_installed": "0",
    "requery": "0",
    "page": "...",
    "engine_version": "2.0",
    "pre_req": "0",
    "is_new_query": "0"
  }'

Analyzing the defense layers:

Header / Field	Purpose	What It Reveals
`accesstoken`	Session authentication	Short-lived, rotated in days
`anti-token`	Request signature	400+ character cryptographic proof-of-work; changes on every request, in under a minute
`al-sa`	Ads/tracking state	Encoded behavioural fingerprint
`lat`	Location/auth token	Secondary auth bound to device
`x-yak-llt`	Timestamp	Millisecond precision, ~5min validity window
`x-pdd-queries`	Device specs	Hardware fingerprint (screen, OS, model)
`install_token`	Persistent device ID	Survives app reinstalls
`flip`	Pagination state	Cryptographically chained page tokens

Critical observations:

Dual token system: accesstoken for session, anti-token for request integrity. One without the other returns 403.
Hardware attestation: The user-agent isn't just a string, it's a structured device confession: Device Name, Android Version, WebView Chrome 94, app version 7.94.0. Mismatch any element and the request fails.
Behavioural chaining: The flip parameter in the body isn't random. It's a cryptographic chain linking search pages. You can't jump to page 5 without having the token from page 4's response.
Temporal decay: x-yak-llt (timestamp) and anti-token are time-bombed. Replay a captured request 5 minutes later: invalid. Replay it with a fresh timestamp but the old signature: invalid.
Geo-consistency: x-pdd-info claims timezone LOCATION_NAME. The lat header and IP geolocation must align, or the request is flagged for review.

The anti-token breakdown:

This 400+ character monster is the heart of Pinduoduo's defense. Decoding it suggested the following components:

Device entropy: Hardware-derived randomness
Behavioral proof: Evidence of human interaction (scroll patterns, touch events)
Request binding: Hash of the specific query parameters (q=SEARCH_TERM, page=PAGE_NUMBER)
Timestamp: Embedded expiry
Signature: HMAC-SHA256 with a rotated key

Changing any query parameter, page number, sort order, or even the size field invalidates the token. The signature is non-deterministic: two identical requests produce different anti-token values due to embedded timestamps and entropy.
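
To make the binding concrete, here is a minimal sketch of a request-bound, time-limited HMAC-SHA256 signature. It is not Pinduoduo's algorithm, which stayed opaque: the key, the token layout (`timestamp.nonce.digest`), the two-minute window, and the names `sign` and `verify` are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional
from urllib.parse import urlencode

SECRET_KEY = b"demo-key"   # hypothetical; the app's real key is not known
MAX_AGE_MS = 120_000       # arbitrary expiry window for this sketch


def _digest(ts: str, nonce: str, params: dict, key: bytes) -> str:
    canonical = urlencode(sorted(params.items()))  # order-independent form
    message = f"{ts}.{nonce}.{canonical}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def sign(params: dict, now_ms: Optional[int] = None, key: bytes = SECRET_KEY) -> str:
    """Return 'timestamp.nonce.digest' bound to these exact parameters."""
    ts = str(int(time.time() * 1000) if now_ms is None else now_ms)
    nonce = secrets.token_hex(8)  # fresh entropy: identical requests differ
    return f"{ts}.{nonce}.{_digest(ts, nonce, params, key)}"


def verify(token: str, params: dict, now_ms: int, key: bytes = SECRET_KEY) -> bool:
    """Accept only a fresh token whose digest matches the given parameters."""
    ts, nonce, digest = token.split(".")
    if not 0 <= now_ms - int(ts) <= MAX_AGE_MS:
        return False  # expired, or stamped in the future
    return hmac.compare_digest(digest, _digest(ts, nonce, params, key))
```

Under this scheme, signing the same parameters twice gives two different tokens, and a token issued for `page=1` fails verification for `page=2`. That is the behaviour described above, and it is why a captured token is useless without the key and the exact message layout.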

Why conventional replay failed:

I tried the naive approach: capture this curl, rotate the page parameter, fire away.

Result: HTTP 429 {"server_time": 1770..., "server_time_ms": 17705..., "error_code": 40002, "empty_reason": 1}

The anti-token was bound to that specific request's fingerprint. Without the signing algorithm, I couldn't generate valid tokens for modified queries. And the algorithm wasn't in JavaScript, it was in native ARM code, obfuscated and anti-tamper protected.

This is why headless browsers and proxy rotation failed. You can't automate what you can't sign, and you can't sign what you can't reverse engineer.

Step 3: The Pivot - Finding the Unlocked Door

After days of dead ends with the search endpoint's anti-token, I faced a choice: continue reverse-engineering a 400-character cryptographic signature (potentially weeks of ARM binary analysis), or find another way in.

I chose the latter.

The hypothesis: Pinduoduo's API surface is vast. Not every endpoint has the same security posture. The search endpoint is high-value, high-traffic, heavily defended. But secondary features (category browsing, recommendations, related products) might rely on lighter protections.

I mapped the app's API calls by navigating through different flows:

Flow	Endpoint	Protection Level
Search	`/api/oak/search`	Maximum (`anti-token` + behavioral checks)
Category Browse	`/api/caterham/query/subfenlei_gyl_label`	Moderate (static tokens)
Product Detail	`/api/oak/v14/goods`	High (device binding)
Recommendations	`/api/oak/rec`	Variable

The category browsing endpoint was the weak link. It returned structured product listings nearly identical to search results but with a simpler authentication model.

Here's a captured request:

curl --location 'https://api.pinduoduo.com/api/caterham/query/subfenlei_gyl_label?offset=40&list_id=...&count=20&goods_id=...&opt_id=25877&req_list_action_type=0&page_sn=10028&support_types=0&page_id=catgoods.html&content_goods_num=4&size=20&show_mark_icon=1&opt_type=2&req_action_type=10&engine_version=2.0&page_el_sn=98978&pdduid=...' \
--header 'Accept-Encoding: gzip' \
--header 'AccessToken: ...' \
--header 'Connection: Keep-Alive' \
--header 'Content-Type: application/json;charset=UTF-8' \
--header 'Cookie: acid=...; api_uid=...; api_uid=...' \
--header 'ETag: ...' \
--header 'Host: api.pinduoduo.com' \
--header 'PDD-CONFIG: V4:001.079400' \
--header 'Referer: Android' \
--header 'User-Agent: android Mozilla/5.0 (Linux; Android 11; ... Build/...; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/... Mobile Safari/537.36  phh_android_version/... phh_android_build/... phh_android_channel/... pversion/0' \
--header 'X-PDD-QUERIES: width=720&height=1411&dpr=2.0&net=1&brand=...&model=...&osv=11&appv=...&pl=2' \
--header 'accept-language: en-US' \
--header 'anti-token: ...' \
--header 'lat: ...' \
--header 'multi-set: 1,1,100000176' \
--header 'p-appname: pinduoduo' \
--header 'p-mediainfo: player=2.1.0&rtc=1.0.0' \
--header 'p-proc: main' \
--header 'p-proc-time: 879202' \
--header 'vip: 127.0.0.1' \
--header 'x-app-lang: en' \
--header 'x-app-ui: dm%3D0%26zm%3D0' \
--header 'x-b3-ptracer: ...' \
--header 'x-pdd-info: bold_free%3Dfalse%26bold_product%3D%26front%3D1%26tz%3D...' \
--header 'x-yak-llt: ...'

Key differences from the search endpoint:

GET instead of POST: No request body to sign. The anti-token is still present but appears to be session-scoped, not request-scoped.
Simpler parameter set: No flip chaining, no install_token, no behavioural tracking in the payload. Just query parameters.
Static anti-token tolerance: Testing revealed the same anti-token worked for 50+ sequential requests, provided other session headers (AccessToken, Cookie, lat) stayed consistent.
No pagination chaining: The offset parameter is a simple integer. Jump to offset 1000 without prior context? It works. Paging normally just means incrementing offset by 20, the page size.
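
Offset pagination of this kind reduces to changing one integer per request. The sketch below shows the idea with the standard library only; `BASE_URL` is a placeholder rather than the real endpoint, and `page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/category"  # placeholder, not the real endpoint


def page_urls(base_params: dict, start_offset: int = 0, step: int = 20, max_pages: int = 3):
    """Yield one URL per page; only `offset` changes between requests."""
    for page in range(max_pages):
        params = {**base_params, "offset": start_offset + page * step}
        yield f"{BASE_URL}?{urlencode(params)}"


# Three pages starting at offset 40, as in the captured request.
for url in page_urls({"opt_id": 25877, "size": 20}, start_offset=40):
    print(url)
```

Because no page depends on the previous response, any offset can be requested directly, which is what makes resuming an interrupted run straightforward.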

Why this endpoint is softer:

Category browsing is a background feature. Users swipe through categories casually, not with the intent precision of search. Pinduoduo's risk model likely weights it lower: less bot incentive, less defensive investment.

The anti-token here isn't cryptographically bound to the specific query. It appears to be a session heartbeat: proof the app is running, not proof this exact request is legitimate.

Operationalizing the discovery:

With a static anti-token, I could reduce the problem to session management:

Extract once: Capture a valid anti-token from a real app session
Maintain session: Keep AccessToken, Cookie, and lat fresh via periodic "heartbeat" requests
Iterate freely: Vary offset, opt_id (category ID), and goods_id without regenerating signatures

The trade-off:

This endpoint doesn't support free-text search. You can't query "GPU" or "iPhone 15." You must traverse the category tree:

Root Categories (opt_id: 1-1000)
    └─ Electronics (opt_id: 25877)
        └─ Computer Components
            └─ Graphics Cards ← Products returned here
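
Traversing such a tree is a plain depth-first walk down to the leaf categories. In the sketch below the tree is a hand-written stand-in: only opt_id 25877 comes from the captured traffic, while the other IDs, the dictionary layout, and the `leaf_categories` name are hypothetical.

```python
from typing import Iterator, Tuple

# Hypothetical tree; only opt_id 25877 appears in the captured traffic.
CATEGORY_TREE = {
    "name": "Electronics", "opt_id": 25877, "children": [
        {"name": "Computer Components", "opt_id": 1001, "children": [
            {"name": "Graphics Cards", "opt_id": 1002, "children": []},
        ]},
    ],
}


def leaf_categories(node: dict, path: Tuple[str, ...] = ()) -> Iterator[tuple]:
    """Depth-first walk yielding (path, opt_id) for every leaf category."""
    here = path + (node["name"],)
    if not node["children"]:
        yield here, node["opt_id"]
        return
    for child in node["children"]:
        yield from leaf_categories(child, here)


for path, opt_id in leaf_categories(CATEGORY_TREE):
    print(" > ".join(path), opt_id)
```

Each leaf's opt_id is then a candidate value for the category endpoint, with products paged through by offset.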

Volume achieved: 1,000+ products mapped

Why this worked:

I didn't defeat Pinduoduo's security, I routed around it. The search endpoint is a fortress. The category endpoint is a guard checkpoint with a broken fence. Both lead to the same data warehouse.

This is a recurring pattern in mobile API scraping: high-value endpoints are hardened; supporting infrastructure is often neglected. The skill isn't always cryptographic reverse-engineering. Sometimes it's systematic reconnaissance: finding which door the defenders forgot to bar.

Step 4: What Failed (And Why)

The category endpoint workaround didn't come easily. Before finding that open window, I burned through several approaches that should have worked on paper. Documenting the failures is as important as the solution: they reveal the shape of Pinduoduo's defenses and the mindset required to bypass them.

Failure 1: Static `anti-token` Harvesting from Search Endpoint

The attempt: Capture a single anti-token from the search endpoint via MITM proxy, then replay it with different query parameters.

Why it failed: The search anti-token is cryptographically bound to the request payload. Change q=SEARCH_KEYWORD to q=ANOTHER_SEARCH_KEYWORD? Token invalid. Increment page=5 to page=6 without the previous response's flip parameter? Token invalid. Even identical requests with the same parameters but different timestamps failed, because the token embeds a ~2-minute expiry window.

The deeper problem: The signing algorithm likely includes:

HMAC of sorted query parameters
Session nonce from previous response
Device fingerprint hash
Timestamp with sliding window

Without the native signing code, I couldn't forge valid tokens. And the native code was obfuscated with O-LLVM control flow flattening: function names stripped, logic scattered across thousands of basic blocks.

💡 Lesson: Don't fight cryptographic binding head-on. Find endpoints with weaker binding.

Failure 2: Emulated Device Farms

The attempt: Instead of hooking a real app, use Android emulators (Genymotion) with modified system images to run the app and intercept traffic at the network layer.

Setup:

20 LDPlayer instances on a headless server
Magisk for root + certificate injection
Frida server for dynamic instrumentation
Automated screenshot OCR to extract data if API scraping failed

Why it failed: Pinduoduo's app detected virtualization through multiple channels:

Detection Vector	Emulator Artifact	Real Device
CPU info	`hardware: goldfish`	`hardware: qcom`
Build fingerprint	`google/sdk_gphone...`	`Xiaomi/cactus...`
Sensor availability	Accelerometer missing	Full sensor stack
`/proc` filesystem	Exposes hypervisor PID	Clean process tree
OpenGL renderer	`Android Emulator`	`Adreno (TM) 610`

The app didn't crash or show errors. It simply served degraded content: limited product listings, no prices, infinite loading spinners. Silent degradation is harder to debug than hard blocks.

💡 Lesson: Modern mobile apps are emulator-aware. Physical devices or sophisticated spoofing (MagiskHide + custom props) are required.

Failure 3: Protocol Downgrade to HTTP

The attempt: Force the app to use unencrypted HTTP by DNS hijacking api.pinduoduo.com to a local proxy, hoping the app would fall back from HTTPS.

Why it failed: The app didn't fall back. With certificate pinning, no valid TLS handshake means no connection. But more importantly, even if I stripped the pinning, the anti-token and AccessToken headers are generated client-side. Seeing the plaintext request didn't help me forge new ones.

💡 Lesson: Encryption isn't the barrier, the cryptographic signing is. MITM is only useful for observation, not automation, when tokens are bound to request content.

Failure 4: Rate Limit Evasion via Request Shaping

The attempt: Even on the softer category endpoint, aggressive scraping triggered 429 errors. I tried sophisticated evasion:

Jittered delays: Random sleep between 1-5 seconds (Poisson distribution)
User-agent rotation: Spoofing different device models per request
Header reordering: Randomizing header sequence to break fingerprinting
TCP/IP stack tuning: Modifying window sizes, TTL values to mimic different OSes

Why it failed: Pinduoduo's rate limiting isn't naive IP-based counting. It's session reputation scoring:

Signal	Weight	My Violation
Request velocity	High	200 req/min vs. human ~10/min
Temporal pattern	Medium	Machine-precision intervals
Device consistency	Critical	Rotating UA while keeping `lat` static
Session age	High	Fresh tokens with old `install_token`
Behavioral depth	Medium	No "browsing" before "buying" actions

The lat header (location/auth token) is device-bound. Rotate your User-Agent but keep the same lat? Scored as suspicious. The system correlates across dimensions I wasn't controlling.

The fix that worked: Embrace consistency, not evasion. One device profile, one IP, human-paced requests, gradual session aging. Counter-intuitively, being more predictable made me less detectable.
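
One way to read "human-paced" is to draw the gap between requests from an exponential distribution and clamp it to sane bounds. The sketch below does that; the 6-second mean (roughly the ~10 requests per minute quoted in the table above), the bounds, and the `paced_delays` name are assumptions, not measured values.

```python
import random


def paced_delays(n: int, mean_s: float = 6.0, min_s: float = 2.0,
                 max_s: float = 30.0, seed=None) -> list:
    """Return n sleep intervals: exponential gaps around mean_s, clamped to [min_s, max_s]."""
    rng = random.Random(seed)  # a seed makes a run reproducible for testing
    return [min(max_s, max(min_s, rng.expovariate(1.0 / mean_s))) for _ in range(n)]


delays = paced_delays(5, seed=42)
print([round(d, 2) for d in delays])
```

Clamping shifts the average slightly away from `mean_s`, so the parameters are a starting point to tune rather than a guarantee of a particular request rate.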

💡 Lesson: Modern bot detection uses multi-factor scoring. Evasion attempts often raise scores. Mimicry beats evasion.

Operational Architecture

With the category endpoint identified and the anti-token behaving as a session-scoped credential rather than a per-request signature, I needed infrastructure that could maintain session consistency, handle failures gracefully, persist data reliably, and distribute load without breaking the delicate trust relationship established with Pinduoduo's API.

Core Design Principles

Principle	Implementation
Session Coherence	Static header/cookie bundle treated as immutable within a scrape run
Defensive Extraction	Multiple fallback fields for every data point (API response shapes vary)
Immediate Persistence	Write data before processing completes; never hold it in memory
Full Auditability	Every request/response logged for post-hoc debugging
Geo-Locked Distribution	Proxy rotation constrained to single metropolitan region
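
The "Defensive Extraction" principle can be sketched as a helper that tries several candidate keys and returns the first usable value. The field names in the usage lines (`goods_name`, `price_info.price`, and so on) are invented for illustration and are not taken from the real response schema.

```python
from typing import Any, Iterable

_MISSING = object()  # sentinel distinguishing "absent" from a stored None


def first_present(record: dict, paths: Iterable[str], default: Any = None) -> Any:
    """Return the first non-empty value found among dotted key paths."""
    for path in paths:
        value: Any = record
        for key in path.split("."):
            if isinstance(value, dict) and key in value:
                value = value[key]
            else:
                value = _MISSING
                break
        if value is not _MISSING and value not in (None, ""):
            return value
    return default


item = {"goods_name": "GPU", "price_info": {"price": 1999}}  # hypothetical shape
print(first_present(item, ["name", "goods_name"]))
print(first_present(item, ["price", "price_info.price"]))
print(first_present(item, ["sales", "sales_tip"], default="n/a"))
```

Listing fallbacks in priority order keeps the parser tolerant of response shapes that vary between requests, at the cost of silently accepting whichever field happens to be present.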

System Components

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Session Store  │────▶│  Scrapy Engine   │────▶│  Pinduoduo API      │
│  (Static bundle)│     │  (Async requests)│     │  (Category endpoint)│
└─────────────────┘     └──────────────────┘     └─────────────────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌──────────────────┐
│  Checkpoint     │     │  Proxy Pool      │
│  (Resume offset)│     │  (Geo-locked IPs)│
└─────────────────┘     └──────────────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌──────────────────────┐
│  CSV Output     │     │  JSON Audit Log      │
│  (Products)     │     │  (Requests/Responses)│
└─────────────────┘     └──────────────────────┘

Session Management Strategy

The architecture treats authentication as a discrete resource rather than a continuous process:

Capture Phase: Manual extraction of header/cookie bundle from live mobile app traffic via MITM proxy
Consumption Phase: Static injection into all requests during scrape run
Refresh Phase: Manual re-capture when session expires (6-12 hour window)

This avoids the complexity of native code hooking or cryptographic reverse-engineering at the cost of periodic manual intervention. For the target volume of 1,000+ products, this trade-off was acceptable.

Critical invariants:

anti-token, AccessToken, and lat must never rotate mid-session
User-Agent and device fingerprint headers must remain consistent with the lat token's embedded device identity
Cookie jar (acid, api_uid) must persist across sequential requests

Pagination and Resumption

The API uses offset-based pagination with fixed page size. The architecture supports:

Manual resume: Hardcoded offset injection to restart interrupted runs
Automatic continuation: Heuristic detection of empty result sets to terminate gracefully
Bounded execution: Page limit enforcement to prevent runaway scraping

Checkpoint state (current offset, page number) travels with each request via Scrapy's request metadata (Request.meta), enabling distributed state management without external databases.
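
The three behaviours above (resume, stop on empty, bounded depth) fit in one small loop. This sketch is plain Python rather than Scrapy: `fetch_page` is a stand-in for whatever performs the request, and the `_offset` and `_page` keys are hypothetical names for the checkpoint carried with each item.

```python
from typing import Callable, Iterator, List


def crawl(fetch_page: Callable[[int], List[dict]], start_offset: int = 0,
          step: int = 20, max_pages: int = 50) -> Iterator[dict]:
    """Yield items page by page until a page is empty or max_pages is reached."""
    for page in range(max_pages):
        offset = start_offset + page * step
        items = fetch_page(offset)
        if not items:  # empty result set: treat as the end of the listing
            return
        for item in items:
            # Carry the checkpoint with each item so a run can be resumed.
            yield {**item, "_offset": offset, "_page": page}


# Stand-in data source: 45 fake products served 20 at a time.
catalog = [{"goods_id": i} for i in range(45)]
results = list(crawl(lambda offset: catalog[offset:offset + 20]))
print(len(results), results[-1]["_offset"])
```

Restarting with `start_offset` set to the last recorded `_offset` re-fetches only that page and everything after it.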

Data Flow and Persistence

Dual-write strategy ensures no data loss during failures:

Stage	Destination	Purpose	Format
Extraction	CSV	Immediate human-readable output	Flat, UTF-8 encoded
Audit	JSON Lines	Complete request/response reconstruction	Structured, timestamped

The CSV layer prioritizes write safety over performance: per-row file operations prevent data loss if the spider crashes. The JSON audit log captures full HTTP conversations, including headers, bodies, and proxy assignments, for forensic analysis when Pinduoduo's defenses trigger unexpected responses.
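
A minimal sketch of the dual-write idea, using only the standard library. The column set in `FIELDS`, the function names, and the `logged_at` key are assumptions; the actual spider's schema is not shown in this section.

```python
import csv
import datetime
import json
import os

FIELDS = ["goods_id", "name", "price"]  # hypothetical column set


def append_product(csv_path: str, row: dict) -> None:
    """Open, append one row, and close, so a crash costs at most the row in flight."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()  # header only once, on the first write
        writer.writerow(row)


def append_audit(jsonl_path: str, record: dict) -> None:
    """Append one timestamped JSON object per line (JSON Lines)."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    entry = {"logged_at": stamp, **record}
    with open(jsonl_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Reopening the file for every row is slow, but at the request rates discussed here the I/O cost is negligible next to network latency, which is the trade-off the paragraph above describes.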

Network Distribution

Proxy rotation follows geo-fidelity constraints:

Pool locked to single metropolitan region matching lat token's claimed location
Random selection per-request to distribute load
No session affinity: each request is independently routed

This distributes traffic across thousands of residential IPs while maintaining the geographic consistency that Pinduoduo's risk models verify.

Failure Handling Matrix

Failure Mode	Detection	Response	Recovery
HTTP 429 (rate limit)	Status code	Exponential backoff retry	Automatic
Token expiry (403)	Status code + body	Log and terminate	Manual token refresh
Empty result set	Content heuristic	Pagination stop	N/A (graceful end)
JSON parse failure	Exception	Fallback to raw text logging	Continue
Write failure	I/O exception	Log error, skip row	Continue
Proxy failure	Connection error	Retry with new proxy	Automatic
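
The first two rows of the matrix can be sketched as a backoff calculation plus a small status dispatcher. The base delay, cap, jitter strategy, and the action names returned by `next_action` are assumptions for illustration; in a Scrapy project the retry itself would normally be left to the built-in retry middleware.

```python
import random


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                  jitter: bool = True) -> float:
    """Exponential backoff: base * 2**attempt, capped, with optional full jitter."""
    delay = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay


def next_action(status: int) -> str:
    """Map a status code to the response column of the matrix above."""
    if status == 429:
        return "retry_with_backoff"           # rate limited: wait, then retry
    if status == 403:
        return "terminate_and_refresh_token"  # session expired: manual step
    return "continue"


for attempt in range(4):
    print(attempt, backoff_delay(attempt, jitter=False))
```

With jitter disabled, the delays double from 1 second up to the 60-second cap; with jitter enabled, each wait is drawn uniformly between zero and that value so retries do not fall on machine-regular intervals.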

All failure paths prioritize continuation over correctness: partial data is better than no data when scraping hostile infrastructure.

Execution Parameters

The spider accepts runtime configuration for:

Offset step: Controls pagination granularity (default 20, matching API page size)
Max pages: Hard limit on crawl depth
Initial offset: Resume capability for interrupted runs

These parameters enable idempotent execution: re-running with the same configuration produces deterministic output ranges without duplicate data.

Operational Results

Metric	Value
Sustained throughput	~300 requests/minute
Session lifetime	6-12 hours
Products mapped	1,000+
Data completeness	Price, inventory, sales volume, ad classification
Failure rate	<2% (primarily token expiry at end of session)

The architecture successfully extracted XXX category listings (opt_id=2....7) including real-time pricing, stock availability, and sponsored product injection patterns unavailable through web scraping.

The Real Bottleneck - UserID Throttling

The architecture described so far works for 1,000+ products. It does not work for 100,000+. The constraint isn't technical, it's identity-based.

Pinduoduo's rate limiting operates at the UserID granularity, not just IP or session. Every pdduid parameter in the request URL carries an implicit quota. Once exceeded, responses don't fail with 429, they degrade. Prices disappear. Inventory shows as zero. Product listings truncate mid-page.
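
Because a degraded response still arrives with a success status, it has to be detected from the content. The heuristic below is a sketch under stated assumptions: the `price` key, the 20-item page size, and the 20% threshold are illustrative, and a legitimately short final page would also trip the truncation check.

```python
def looks_degraded(items: list, min_items: int = 20,
                   max_missing_ratio: float = 0.2) -> bool:
    """Flag a page that is truncated or has too many items without a usable price."""
    if not items or len(items) < min_items:
        return True  # empty or truncated page
    missing = sum(1 for item in items if not item.get("price"))
    return missing / len(items) > max_missing_ratio


healthy = [{"goods_id": i, "price": 100 + i} for i in range(20)]
stripped = [{"goods_id": i, "price": None} for i in range(20)]
print(looks_degraded(healthy), looks_degraded(stripped))
```

A check like this is best used as a stop signal: once several consecutive pages look degraded, continuing only spends more of the quota for data that is not usable.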

The category endpoint, soft on signature validation, is hard on user reputation.

Why I didn't build a pool of UIDs: I simply didn't have the resources to create all the accounts by hand, signing each one up and retrieving its UserID manually.

What I Actually Did

I stayed in the small-scale regime. 1,000+ products extracted via a single UID, session refreshed manually twice. The data served its purpose: market analysis for a specific XXX category, not comprehensive price intelligence.

The spider architecture supports UID rotation via configuration. The pdduid parameter is just another base parameter. But the operational pipeline to generate, validate, and maintain a pool of legitimate UIDs was never built. It was the boundary where this project stopped being a technical challenge and started being a resource extraction business.

Conclusion: The Limits of Technical Evasion

This project started with a straightforward goal, extracting structured product data from Pinduoduo, and ended with a tour of modern mobile API defenses, cryptographic obfuscation, and the economics of identity at scale.

The technical victories were real: bypassing certificate pinning, mapping the API surface, finding the softer category endpoint, building resilient scraping infrastructure. But they were bounded victories, contained by a constraint no amount of code could overcome: Pinduoduo owns the identity layer, and identity is the scarcest resource.

What Worked

Approach	Outcome	Scale
MITM proxy + ADB interception	Visibility into encrypted traffic	Single device
Category endpoint discovery	Bypass of search signature requirements	1,000+ products
Static session bundle	Stable authentication for 6-12 hours	Single UID
Scrapy + proxy rotation	Distributed, observable, resilient extraction	300 req/min
Defensive data extraction	99%+ field coverage despite API variance	1,000+ records

These techniques succeeded because they respected the defender's logic. Pinduoduo's security isn't flawed, it's economically rational. They invest heavily in high-value endpoint protection (search, checkout) and accept residual risk on supporting infrastructure (category browse). My approach found the efficient frontier of that risk calculation.

What Didn't Scale

Ambition	Barrier	Root Cause
100,000+ products	UserID quotas	Identity as rate-limiting factor, not IP or signature
Real-time monitoring	Session expiry	Manual token refresh unsustainable
Complete catalog coverage	Category tree depth	Exponential API calls vs. linear UID quotas
Long-term automation	Account aging	Reputation systems require genuine user behavior

The UserID bottleneck isn't a puzzle to solve, it's a business model enforcement mechanism. Pinduoduo gives away data to real users and withholds it from aggregators. Technical evasion doesn't change those economics; it just raises the cost of enforcement.

The Broader Pattern

This case study reflects a shift in platform defense:

Old model: Block bots at the perimeter (IP, User-Agent, CAPTCHA)
New model: Differentiate humans through accumulated reputation (device history, social graph, behavioral depth)

The new model is harder to spoof because it's contextual and temporal. A real user builds reputation over weeks. A scraper must either replicate that investment (expensive) or find endpoints that ignore it (limited).

For data practitioners, this means:

Scraping is increasingly a cost-benefit negotiation, not a technical challenge. The question isn't "can I get this data?" but "is this data worth the operational cost of simulating legitimacy?"
Platform APIs are tiered by trust. Public endpoints are heavily defended. Private endpoints (mobile APIs) are less defended but harder to access. Partner endpoints are accessible but require legal relationships.
Data extraction at scale requires scale infrastructure. Not just proxies and parsers, but identity farms, behavioral simulation, and compliance systems. This is indistinguishable from fraud infrastructure and is prosecuted accordingly.

My Takeaway

I built a system that extracted 1,000+ XXX listings from Pinduoduo's mobile API. It worked because I stayed small, moved quietly, and accepted manual maintenance. It would not work for a price comparison site, a market intelligence platform, or any use case requiring comprehensive, real-time data.

The real lesson isn't in the ADB interception or the Scrapy architecture. It's in knowing when to stop: recognizing that the next bottleneck isn't technical, and that crossing it changes the nature of the work entirely.

Pinduoduo's data is available. Just not to scrapers. Not at scale. Not without becoming something else entirely.

Source Code

❌ Source code and methodology: Available for educational purposes. Not recommended for production extraction without legal review and platform consent.

NOTE: All sensitive information has been redacted. The code was documented by AI.

"""
Educational Scrapy spider demonstrating mobile API scraping techniques.

This example shows how to interact with a protected mobile API endpoint
that uses session-based authentication rather than per-request signatures.

DISCLAIMER: This code is for educational purposes only. Respect robots.txt,
terms of service, and rate limits. Unauthorized scraping may violate laws.

Usage:
    scrapy runspider educational_spider.py -a max_pages=10

Requirements:
    - scrapy
    - Properly configured proxy rotation (see proxy module)
"""

import json
import urllib.parse
import os
import csv
import datetime
import random

import scrapy

# Import proxy list from external module (not included)
# In production, this would be a rotating proxy service
from educational_example.proxy import proxies


class MobileApiSpider(scrapy.Spider):
    """
    Spider demonstrating session-based API scraping patterns.

    Key concepts demonstrated:
    - Static session bundle maintenance
    - Offset-based pagination
    - Defensive data extraction with fallbacks
    - Dual persistence (CSV + JSON audit log)
    - Request/response logging for debugging
    """

    name = "mobile_api_example"
    allowed_domains = ["api.example-ecommerce.com"]

    # Conservative delay to respect rate limits
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,  # 500ms between requests
        "RETRY_TIMES": 3,
        "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
    }

    def __init__(self, offset_step=20, max_pages=50, start_offset=0, *args, **kwargs):
        """
        Initialize spider with configurable pagination.

        Args:
            offset_step: Number of items per page (API-specific)
            max_pages: Hard limit on pagination depth
            start_offset: Resume capability for interrupted runs
        """
        super().__init__(*args, **kwargs)
        self.offset_step = int(offset_step)
        self.max_pages = int(max_pages)
        self.start_offset = int(start_offset)

        # =========================================================================
        # SESSION BUNDLE - Extracted from legitimate mobile app traffic
        # =========================================================================
        # These headers represent a captured session from a real device.
        # In educational context: Shows how mobile APIs authenticate requests.
        # 
        # SECURITY NOTE: In real implementation, these would be:
        # - Loaded from environment variables or secure vault
        # - Rotated when session expires (typically 6-12 hours)
        # - Never hardcoded in source control
        # =========================================================================

        self.headers = {
            # Compression
            "Accept-Encoding": "gzip",

            # Session authentication (short-lived, ~30 min expiry)
            "AccessToken": "...",

            # Connection management
            "Connection": "Keep-Alive",

            # Content negotiation
            "Content-Type": "application/json;charset=UTF-8",

            # Cache validation
            "ETag": "...",

            # Target host
            "Host": "api.example-ecommerce.com",

            # App configuration version
            "PDD-CONFIG": "V4:001.079400",

            # Platform identifier
            "Referer": "Android",

            # Device fingerprint (critical for session consistency)
            # Format: Platform + OS + Device Model + WebView Version + App Version
            "User-Agent": "android Mozilla/5.0 (Linux; Android ...; ... Build/...; wv) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                         "Chrome/... Mobile Safari/537.36  "
                         "app_version/... app_build/... app_channel/... pversion/0",

            # Hardware specifications (must match User-Agent device claims)
            "X-PDD-QUERIES": "width=...&height=...&dpr=...&net=...&brand=...&"
                            "model=...&osv=...&appv=...&pl=...",

            # Language preference
            "accept-language": "en-US",

            # Request signature (session-scoped, not request-scoped in this endpoint)
            # NOTE: Other endpoints may cryptographically bind this to request parameters
            "anti-token": "...",

            # Location/secondary auth token (geo-locked, device-bound)
            "lat": "...",

            # Feature flags
            "multi-set": "...",

            # App identification
            "p-appname": "...",
            "p-mediainfo": "player=...&rtc=...",
            "p-proc": "main",
            "p-proc-time": "...",

            # Debug/development flag
            "vip": "127.0.0.1",

            # The following headers are commented out to demonstrate optional fields
            # that may be required for other endpoints or higher security contexts:
            # "x-app-lang": "...",
            # "x-app-ui": "...",
            # "x-b3-ptracer": "...",  # Distributed tracing ID
            # "x-pdd-info": "...",    # Timezone and feature flags
            # "x-yak-llt": "...",     # Millisecond timestamp
        }

        # Session cookies (must remain consistent with headers)
        self.cookies = {
            "acid": "...",      # Anonymous device identifier
            "api_uid": "...",   # User session identifier
        }

        # =========================================================================
        # API ENDPOINT CONFIGURATION
        # =========================================================================
        # Base URL for category browsing endpoint
        # NOTE: This is a secondary endpoint with lighter protections than search
        self.base_url = "https://api.example-ecommerce.com/api/category/browse"

        # Fixed query parameters (endpoint-specific)
        # In production, these would be parameterized for different categories
        self.base_params = {
            "list_id": "...",           # Category list identifier
            "count": "20",              # Items per page
            "goods_id": "...",          # Anchor product ID
            "opt_id": "...",            # Category ID
            "req_list_action_type": "0",
            "page_sn": "...",           # Page screen identifier
            "support_types": "0",
            "page_id": "category.html", # Page template
            "content_goods_num": "4",
            "size": "20",
            "show_mark_icon": "1",      # UI flag
            "opt_type": "2",            # Category type
            "req_action_type": "10",
            "engine_version": "2.0",    # Search algorithm version
            "page_el_sn": "...",        # Element identifier
            "pdduid": "...",            # User ID (CRITICAL: rate limiting key)
        }

    def start_requests(self):
        """
        Initialize pagination from configured start offset.

        Yields:
            scrapy.Request with metadata for state tracking
        """
        params = dict(self.base_params)
        params["offset"] = str(self.start_offset)

        # Construct full URL with query parameters
        url = f"{self.base_url}?{urllib.parse.urlencode(params)}"

        # Select random proxy from pool (geo-locked to session region)
        proxy = random.choice(proxies) if proxies else None

        # Build request with full audit trail metadata
        request = scrapy.Request(
            url,
            headers=self.headers,
            cookies=self.cookies,
            callback=self.parse,
            meta={
                "offset": self.start_offset,
                "page": 1,
                "proxy": proxy,
            }
        )

        # Log initial request for debugging
        self._log_request(request)

        yield request

    def parse(self, response):
        """
        Parse API response and handle pagination.

        Demonstrates:
        - Defensive JSON parsing (multiple fallback strategies)
        - Field extraction with multiple fallback sources
        - Dual persistence (CSV + audit log)
        - Pagination continuation logic

        Args:
            response: scrapy.Response object

        Yields:
            dict: Extracted product data
            scrapy.Request: Next page if pagination continues
        """
        # Extract state from request metadata
        current_offset = response.meta.get("offset", 0)
        current_page = response.meta.get("page", 1)

        # =========================================================================
        # DEFENSIVE PARSING
        # =========================================================================
        # Mobile APIs often return inconsistent content types or malformed JSON
        # Strategy: Try native json(), fallback to json.loads(), fallback to raw text

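        # Catch ValueError rather than json.JSONDecodeError: it is the parent of
        # both JSONDecodeError and UnicodeDecodeError, so badly encoded bodies
        # also reach the fallback instead of crashing the callback
        try:
            data = response.json()
        except ValueError:
            try:
                data = json.loads(response.text)
            except ValueError:
                # Log raw response for post-hoc analysis
                self.logger.error(f"JSON parse failed for offset {current_offset}")
                data = {"parse_error": True, "raw": response.text}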

        # Log complete request/response cycle for audit trail
        self._log_full_cycle(response.request, response)

        # =========================================================================
        # DATA EXTRACTION
        # =========================================================================
        # API response structure varies by product type, category, and A/B tests
        # Strategy: Try multiple field paths, use first available

        products = data.get("goods_list", []) if isinstance(data, dict) else []

        for item in products:
            # Build record with extensive fallback chains for each field
            record = {
                # Product identification (multiple possible field names)
                "site_product_id": (
                    item.get("goods_id") 
                    or item.get("id")
                    or "unknown"
                ),

                # Product naming (full vs. shortened)
                "product_name": item.get("goods_name"),
                "short_name": item.get("short_name"),

                # URL construction (may need domain prepending)
                "product_url": item.get("link_url"),

                # Pricing (highly variable structure across product types)
                # Priority: displayed price > sale price > base price > 0
                # `or {}` guards against an explicit null "group" value
                "price": (
                    (item.get("group") or {}).get("price_str")
                    or item.get("min_on_sale_group_price")
                    or (item.get("group") or {}).get("promo_price")
                    or (item.get("group") or {}).get("price")
                    or item.get("price")
                    or 0
                ),

                # Imagery (multiple resolution options)
                "image_url": (
                    item.get("hd_thumb_url")
                    or item.get("hd_url")
                    or item.get("image_url")
                    or item.get("thumb_url")
                ),

                # Sales metrics (different naming conventions)
                "sales": item.get("sales") or item.get("cnt"),

                # Social proof
                "customer_num": item.get("customer_num"),

                # Inventory state
                "inventory_quantity": item.get("quantity"),
                "is_available": (
                    item.get("quantity") is not None 
                    and item.get("quantity") > 0
                ),
                "is_sold_out": item.get("quantity") == 0,

                # Advertising flag
                "is_ad": bool(item.get("ad")),

                # Quality/relevance score
                "quality_score": item.get("quality"),

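                # Pagination metadata for traceability
                "page": current_page,
                "offset": current_offset,
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
                "scraped_at": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),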
            }

            # Persist to CSV (immediate, safe for crashes)
            self._persist_record(record)

            # Yield for Scrapy pipelines/middlewares
            yield record

        # =========================================================================
        # PAGINATION LOGIC
        # =========================================================================
        # Continue if:
        # 1. Under max_pages limit, AND
        # 2. Current page yielded at least one product
        #
        # Checking the extracted products, rather than "any non-empty value" in the
        # payload, stops the crawl on parse errors and on error bodies that carry
        # only a status code or message.

        should_continue = current_page < self.max_pages and bool(products)

        if should_continue:
            next_offset = current_offset + self.offset_step
            params = dict(self.base_params)
            params["offset"] = str(next_offset)

            next_url = f"{self.base_url}?{urllib.parse.urlencode(params)}"
            next_proxy = random.choice(proxies) if proxies else None

            yield scrapy.Request(
                next_url,
                headers=self.headers,
                cookies=self.cookies,
                callback=self.parse,
                meta={
                    "offset": next_offset,
                    "page": current_page + 1,
                    "proxy": next_proxy,
                }
            )

    # =========================================================================
    # PERSISTENCE LAYER
    # =========================================================================

    def _persist_record(self, record):
        """
        Append record to CSV, reopening the file for every write.

        Strategy: Open/close per write (inefficient, but every row is flushed when
        the file closes, so a crash loses at most the row being written)
        Production alternative: Scrapy Item Pipelines with batching

        Args:
            record: dict of extracted fields
        """
        fieldnames = [
            "site_product_id", "product_name", "short_name",
            "product_url", "price", "image_url", "sales",
            "customer_num", "inventory_quantity", "is_available",
            "is_sold_out", "is_ad", "quality_score",
            "page", "offset", "scraped_at",
        ]

        output_file = "output_products.csv"
        file_exists = os.path.exists(output_file)

        try:
            with open(output_file, "a", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                if not file_exists:
                    writer.writeheader()
                writer.writerow({k: record.get(k, "") for k in fieldnames})
        except IOError as e:
            # Log error but don't kill spider for write failure
            self.logger.error(f"CSV write failed: {e}")

    # =========================================================================
    # AUDIT LOGGING
    # =========================================================================

    def _log_request(self, request):
        """Log outgoing request for debugging."""
        self._write_audit_log({
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),
            "type": "request",
            "url": request.url,
            "method": request.method,
            "proxy": request.meta.get("proxy"),
        })

    def _log_full_cycle(self, request, response):
        """
        Log complete request/response cycle.

        Critical for debugging when API returns unexpected responses
        or when session expires mid-crawl.
        """
        # Decode headers safely (Scrapy uses bytes, handle both)
        def safe_decode(value):
            if isinstance(value, (bytes, bytearray)):
                return value.decode("utf-8", errors="ignore")
            return str(value)

        def headers_to_dict(headers):
            result = {}
            for name in headers.keys():
                name_str = safe_decode(name)
                values = [safe_decode(v) for v in headers.getlist(name)]
                result[name_str] = values[0] if len(values) == 1 else values
            return result

        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),
            "type": "full_cycle",
            "request": {
                "url": request.url,
                "method": request.method,
                "headers": headers_to_dict(request.headers),
                "proxy": request.meta.get("proxy"),
            },
            "response": {
                "status": response.status,
                "headers": headers_to_dict(response.headers),
                "body_preview": response.text[:1000] if response.text else None,
            }
        }

        self._write_audit_log(entry)

    def _write_audit_log(self, entry):
        """Append entry to JSON Lines audit log."""
        log_file = "request_audit.log"
        try:
            with open(log_file, "a", encoding="utf-8") as f:
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
        except IOError as e:
            self.logger.error(f"Audit log write failed: {e}")


# =========================================================================
# EDUCATIONAL NOTES
# =========================================================================

"""
ARCHITECTURAL PATTERNS DEMONSTRATED:

1. SESSION COHERENCE
   Mobile APIs often validate that headers, cookies, and tokens form a consistent
   identity. Rotating one without others triggers security responses.

2. DEFENSIVE EXTRACTION
   E-commerce APIs change field names based on product type, A/B tests, and
   regional variants. Multiple fallback paths increase robustness.

3. DUAL PERSISTENCE
   CSV for immediate human inspection, JSON audit log for debugging. Separation
   prevents data loss if parsing fails.

4. PAGINATION STATE IN METADATA
   Scrapy's meta dictionary carries state through the request chain, enabling
   resume capability and distributed processing.

5. PROXY ROTATION WITH GEO-FIDELITY
   Proxies must match the geographic region implied by location tokens in
   headers. Mismatches trigger immediate blocking.

LIMITATIONS AND ETHICAL CONSIDERATIONS:

- This code demonstrates techniques for educational purposes
- Rate limiting (DOWNLOAD_DELAY) should be respected
- Session tokens expire and require manual refresh
- UserID quotas limit total extractable volume per identity
- Commercial use requires compliance with platform Terms of Service
- Consider official APIs or data licensing for production applications

SECURITY BEST PRACTICES (if implementing similar systems):

1. Never commit credentials to version control
2. Rotate session tokens automatically or via secure vault
3. Monitor for 403/429 responses as signals of detection
4. Implement exponential backoff for retries
5. Respect robots.txt and crawl-delay directives
6. Consider legal review for jurisdiction-specific regulations (CFAA, GDPR, etc.)
"""

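The security notes above recommend exponential backoff for retries, while the spider itself relies on a fixed DOWNLOAD_DELAY plus RETRY_TIMES. Here is a minimal, library-agnostic sketch of how the wait times could be computed. The `backoff_delay` and `backoff_schedule` helpers, the 1-second base, and the 60-second cap are all illustrative assumptions; wiring them into Scrapy's retry handling would need a custom middleware that is not shown here.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=True, rng=random):
    """Return the seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with every attempt and is clipped at `cap`.
    With jitter enabled, the wait is drawn uniformly from [0, ceiling]
    so that parallel workers do not retry in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap, base * (2 ** attempt))
    if jitter:
        return rng.uniform(0, ceiling)
    return ceiling


def backoff_schedule(max_retries, base=1.0, cap=60.0):
    """Return the un-jittered ceilings for each retry, useful for planning."""
    return [backoff_delay(n, base, cap, jitter=False) for n in range(max_retries)]


if __name__ == "__main__":
    print(backoff_schedule(5))  # -> [1.0, 2.0, 4.0, 8.0, 16.0]
```

With the defaults, five retries wait at most 1, 2, 4, 8, and 16 seconds. A 429 response is a reasonable trigger for this kind of schedule; a 403 more likely means the session itself has been flagged, and waiting longer will not help.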