Reverse Engineering a Mobile API: Scraping Pinduoduo When the Web Failed

Most scraping tutorials start with a website.
This one didn't.
I needed structured product data from Pinduoduo (拼多多), one of China's largest e-commerce platforms. The web version was a dead end: aggressively throttled, inconsistently responsive, and missing critical data fields. Complete product listings weren't even available without the mobile app.
The mobile application, however, was a different beast entirely.
Like most modern apps, it offered no public API. Every request was cryptographically signed. Headers were opaque, dynamic, and device-bound. Rate limiting wasn't just aggressive, it was intelligent. IP rotation? Instantly flagged. At first glance, the app seemed intentionally hostile to any automated access.
It wasn't.
In this post, I'll walk through how I reverse-engineered Pinduoduo's mobile API to extract structured data: what broke, what worked, and how I systematically overcame challenges like request signing, device fingerprinting, and HTTP 429 rate limits.
The Target
Pinduoduo (https://pinduoduo.com/) is a Chinese e-commerce giant known for group buying and steep discounts. With over 800 million active users, it's a data goldmine for market research, but one that fiercely protects its mobile ecosystem.
The web version is a second-class citizen by design. Core features, complete product catalogues, and real-time pricing live exclusively in the mobile app. If you want the data, you have to crack the app.
What I Tried First: Standard Scraping Playbook
Before going down the mobile rabbit hole, I ran through the conventional web scraping toolkit. These are the battle-tested methods that work on 90% of e-commerce sites. Pinduoduo wasn't in that 90%.
Direct HTTP Requests
The simplest approach: replicate what the browser does. Open DevTools, watch the Network tab, copy the request as a curl command, and translate it to Python.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Accept": "application/json, text/plain, */*",
}
response = requests.get(
    "https://pinduoduo.com/home/supermarket",
    headers=headers
)
data = response.json()
What happened: 200 OK responses with empty or obfuscated payloads. The product listings were missing; seeing them required browsing in the mobile application.
Headless Browsers
When raw HTTP fails, browser automation is the next step. Let a real Chromium instance execute the JavaScript, wait for network idle, then scrape the rendered DOM.
What happened: The data was incomplete. One might suggest visiting the mobile site at https://mobile.yangkeduo.com, but it still doesn't show the complete listing and stops providing data once you scroll down far enough.
Proxy Rotation & Residential IPs
Standard evasion: distribute requests across thousands of residential IPs. Services like Bright Data, Oxylabs, or Smartproxy make this trivial.
What happened: Rotating IPs didn't help. The request headers required a userId, which is assigned at signup, so every request stayed tied to one account regardless of the IP it came from.
API Endpoint Discovery
Modern SPAs (Single Page Applications) often leak internal APIs. Search the Network tab for .json endpoints, api. subdomains, or GraphQL queries. Reverse engineer the authentication, usually a bearer token or session cookie.
What happened: Pinduoduo's web endpoints were decoys. The "API" calls from the browser returned partial data or required tokens generated through JavaScript obfuscation so dense it might as well have been native code. The real API, the one powering search, recommendations, and checkout, wasn't talking to the browser at all.
It was talking to the mobile app.
The Dead End
After hours and hours of iteration, the pattern was clear: Pinduoduo's web infrastructure is deliberately crippled. It's not that they couldn't expose rich data via web endpoints, it's that they choose not to. The mobile app is the primary interface, and the web version exists only for SEO and casual browsing.
Every conventional method hit the same wall: incomplete data, aggressive blocking, or both. The cost of evasion (proxy bandwidth, compute time for headless browsers, and engineering hours) exceeded the value of the partial data returned.
To get complete, structured, real-time product data, I needed to become the mobile app.
Enter the Mobile App: Reverse Engineering the Private API
If the web was a fortress with a broken gate, the mobile app was a vault: heavily guarded, but worth the effort. Mobile apps don't run in inspectable browsers. They compile their logic into native code, encrypt traffic by default, and bake device identity into every request.
This is where most scrapers quit. It's also where the real data lives.
Step 1: Traffic Interception
First, I needed to see what the app was actually sending. Standard approach: man-in-the-middle (MITM) proxy.
Tool I used: HTTP Toolkit (With ADB connected to my mobile device)
Methodology:
Root the mobile device. Without rooting the device, it is almost impossible to intercept the API responses of the application.
Install the proxy's root certificate on the Android device.
Connect the device to HTTP Toolkit via ADB.
Inspect the captured requests in HTTP Toolkit.
What happened: The encrypted HTTPS traffic was decrypted by the proxy and showed up as readable JSON, which made the API much easier to reverse engineer.
Step 2: Request Anatomy
With traffic flowing through the proxy, I could inspect what the app actually sent. This wasn't a clean REST API with OAuth tokens. It was a defensive architecture designed to verify every aspect of the requester's identity.
Here's a real search request captured from the app:
curl --location 'https://api.pinduoduo.com/search?source=index&pdduid=...' \
--header 'accept-encoding: gzip' \
--header 'accept-language: en-US' \
--header 'accesstoken: ...' \
--header 'al-sa: {...}' \
--header 'anti-token: ...' \
--header 'content-type: application/json;charset=UTF-8' \
--header 'cookie: acid=...; api_uid=...' \
--header 'etag: ...' \
--header 'host: api.pinduoduo.com' \
--header 'lat: ...' \
--header 'multi-set: ...' \
--header 'p-appname: pinduoduo' \
--header 'p-mediainfo: ...' \
--header 'p-proc: main' \
--header 'p-proc-time: ...' \
--header 'pdd-config: ...' \
--header 'referer: Android' \
--header 'user-agent: android Mozilla/5.0 (...) Mobile Safari/... phh_android_version/... phh_android_build/... phh_android_channel/...' \
--header 'x-app-lang: en' \
--header 'x-app-ui: ...' \
--header 'x-b3-ptracer: ...' \
--header 'x-pdd-info: ...' \
--header 'x-pdd-queries: width=...&height=...&dpr=...&net=...&brand=...&model=...&osv=...&appv=...&pl=...' \
--header 'x-yak-llt: ...' \
--data '{
"install_token": "...",
"item_ver": "...",
"list_id": "...",
"track_data": "...",
"source": "index",
"page_sn": "...",
"page_id": "search_result.html",
"referer_params": null,
"dark_mode": "0",
"show_mark_icon": "1",
"flip_gset_num": "...",
"flip": "...",
"back_search": "false",
"page_el_sn": "...",
"search_met": "manual",
"max_offset": "...",
"sort": "default",
"exposure_offset": "...",
"is_sys_minor": "0",
"q": "SEARCH_TERM_HERE",
"size": "20",
"union_pay_installed": "0",
"requery": "0",
"page": "...",
"engine_version": "2.0",
"pre_req": "0",
"is_new_query": "0"
}'
Analyzing the defense layers:
| Header / Field | Purpose | What It Reveals |
| accesstoken | Session authentication | Short-lived, rotated in days |
| anti-token | Request signature | 400+ character cryptographic token that changes on every request, in under a minute |
| al-sa | Ads/tracking state | Encoded behavioural fingerprint |
| lat | Location/auth token | Secondary auth bound to device |
| x-yak-llt | Timestamp | Millisecond precision, ~5min validity window |
| x-pdd-queries | Device specs | Hardware fingerprint (screen, OS, model) |
| install_token (body) | Persistent device ID | Survives app reinstalls |
| flip (body) | Pagination state | Cryptographically chained page tokens |
Critical observations:
Dual token system: accesstoken for session, anti-token for request integrity. One without the other returns 403.
Hardware attestation: The user-agent isn't just a string, it's a structured device confession: Device Name, Android Version, WebView Chrome 94, app version 7.94.0. Mismatch any element and the request fails.
Behavioural chaining: The flip parameter in the body isn't random. It's a cryptographic chain linking search pages. You can't jump to page 5 without having the token from page 4's response.
Temporal decay: x-yak-llt (timestamp) and anti-token are time-bombed. Replay a captured request 5 minutes later: invalid. Replay it with a fresh timestamp but old signature: invalid.
Geo-consistency: x-pdd-info claims timezone LOCATION_NAME. The lat header and IP geolocation must align, or the request is flagged for review.
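The consistency requirement between headers can be checked locally before a request is ever sent. The sketch below is a minimal illustration, not Pinduoduo's server-side logic: it assumes the `osv` field in `X-PDD-QUERIES` is meant to equal the Android version claimed in the `User-Agent`, and the function names and the `EXAMPLE_MODEL` placeholder are hypothetical.

```python
import re
from urllib.parse import parse_qs


def parse_device_queries(header_value: str) -> dict:
    """Parse an X-PDD-QUERIES style value ("k=v&k2=v2") into a flat dict."""
    return {key: values[0] for key, values in parse_qs(header_value).items()}


def android_version_from_ua(user_agent: str):
    """Return the Android version claimed in a User-Agent, or None if absent."""
    match = re.search(r"Android (\d+(?:\.\d+)*)", user_agent)
    return match.group(1) if match else None


def is_consistent(user_agent: str, device_queries: str) -> bool:
    """True when the OS version in the UA matches the `osv` query field."""
    claimed = parse_device_queries(device_queries).get("osv")
    return claimed is not None and claimed == android_version_from_ua(user_agent)


# Placeholder values shaped like the captured headers (device model is made up).
UA = "android Mozilla/5.0 (Linux; Android 11; EXAMPLE_MODEL Build/X; wv)"
QUERIES = "width=720&height=1411&dpr=2.0&net=1&osv=11&pl=2"
print(is_consistent(UA, QUERIES))
```

A check like this only catches self-inflicted mismatches in a captured bundle; it says nothing about what else the server correlates.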
The anti-token breakdown:
This 400+ character monster is the heart of Pinduoduo's defense. Decoding revealed:
Device entropy: Hardware-derived randomness
Behavioral proof: Evidence of human interaction (scroll patterns, touch events)
Request binding: Hash of the specific query parameters (q=SEARCH_TERM, page=PAGE_NUMBER)
Timestamp: Embedded expiry
Signature: HMAC-SHA256 with a rotated key
Changing any query parameter, page number, sort order, or even the size field invalidates the token. The signature is non-deterministic: two identical requests produce different anti-token values due to embedded timestamps and entropy.
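To make the behaviour concrete, here is a toy request-bound signature built from Python's standard `hmac` module. It is emphatically not Pinduoduo's algorithm (that lives in obfuscated native code and was never recovered); the token layout, the key, the nonce, and the 120-second `max_age` are all assumptions chosen only to reproduce the observed properties: changing a parameter invalidates the token, old tokens expire, and identical requests sign differently.

```python
import hashlib
import hmac
import json
import secrets
import time


def _message(timestamp: int, nonce: str, params: dict) -> bytes:
    """Canonicalise params (sorted keys) so signer and verifier agree."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return f"{timestamp}.{nonce}.{canonical}".encode()


def sign_request(params: dict, key: bytes, now=None) -> str:
    """Illustrative only: bind a token to params, a timestamp, and a nonce."""
    timestamp = int(time.time() if now is None else now)
    nonce = secrets.token_hex(8)  # per-request entropy
    digest = hmac.new(key, _message(timestamp, nonce, params), hashlib.sha256).hexdigest()
    return f"{timestamp}.{nonce}.{digest}"


def verify_request(token: str, params: dict, key: bytes, max_age: int = 120, now=None) -> bool:
    """Reject tokens that are expired or were issued for different params."""
    timestamp, nonce, digest = token.split(".")
    current = time.time() if now is None else now
    if current - int(timestamp) > max_age:
        return False
    expected = hmac.new(key, _message(int(timestamp), nonce, params), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, digest)


key = b"demo-key"  # hypothetical; the real key is not known
params = {"q": "SEARCH_TERM", "page": "1", "size": "20"}
token = sign_request(params, key)
print(verify_request(token, params, key))                    # same request
print(verify_request(token, {**params, "page": "2"}, key))   # modified page
```

Under this model, replaying a captured token with a different `page` fails for the same reason the real replay did: the verifier recomputes the digest over the new parameters and it no longer matches.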
Why conventional replay failed:
I tried the naive approach: capture this curl, rotate the page parameter, fire away.
Result: HTTP 429 {"server_time": 1770..., "server_time_ms": 17705..., "error_code": 40002, "empty_reason": 1}
The anti-token was bound to that specific request's fingerprint. Without the signing algorithm, I couldn't generate valid tokens for modified queries. And the algorithm wasn't in JavaScript, it was in native ARM code, obfuscated and anti-tamper protected.
This is why headless browsers and proxy rotation failed. You can't automate what you can't sign, and you can't sign what you can't reverse engineer.
Step 3: The Pivot - Finding the Unlocked Door
After days of dead ends with the search endpoint's anti-token, I faced a choice: continue reverse-engineering a 400-character cryptographic signature (potentially weeks of ARM binary analysis), or find another way in.
I chose the latter.
The hypothesis: Pinduoduo's API surface is vast. Not every endpoint has the same security posture. The search endpoint is high-value, high-traffic, heavily defended. But secondary features (category browsing, recommendations, related products) might rely on lighter protections.
I mapped the app's API calls by navigating through different flows:
| Flow | Endpoint | Protection Level |
| Search | /api/oak/search | Maximum (anti-token + behavioral checks) |
| Category Browse | /api/caterham/query/subfenlei_gyl_label | Moderate (static tokens) |
| Product Detail | /api/oak/v14/goods | High (device binding) |
| Recommendations | /api/oak/rec | Variable |
The category browsing endpoint was the weak link. It returned structured product listings nearly identical to search results but with a simpler authentication model.
Here's a captured request:
curl --location 'https://api.pinduoduo.com/api/caterham/query/subfenlei_gyl_label?offset=40&list_id=...&count=20&goods_id=...&opt_id=25877&req_list_action_type=0&page_sn=10028&support_types=0&page_id=catgoods.html&content_goods_num=4&size=20&show_mark_icon=1&opt_type=2&req_action_type=10&engine_version=2.0&page_el_sn=98978&pdduid=...' \
--header 'Accept-Encoding: gzip' \
--header 'AccessToken: ...' \
--header 'Connection: Keep-Alive' \
--header 'Content-Type: application/json;charset=UTF-8' \
--header 'Cookie: acid=...; api_uid=...; api_uid=...' \
--header 'ETag: ...' \
--header 'Host: api.pinduoduo.com' \
--header 'PDD-CONFIG: V4:001.079400' \
--header 'Referer: Android' \
--header 'User-Agent: android Mozilla/5.0 (Linux; Android 11; ... Build/...; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/... Mobile Safari/537.36 phh_android_version/... phh_android_build/... phh_android_channel/... pversion/0' \
--header 'X-PDD-QUERIES: width=720&height=1411&dpr=2.0&net=1&brand=...&model=...&osv=11&appv=...&pl=2' \
--header 'accept-language: en-US' \
--header 'anti-token: ...' \
--header 'lat: ...' \
--header 'multi-set: 1,1,100000176' \
--header 'p-appname: pinduoduo' \
--header 'p-mediainfo: player=2.1.0&rtc=1.0.0' \
--header 'p-proc: main' \
--header 'p-proc-time: 879202' \
--header 'vip: 127.0.0.1' \
--header 'x-app-lang: en' \
--header 'x-app-ui: dm%3D0%26zm%3D0' \
--header 'x-b3-ptracer: ...' \
--header 'x-pdd-info: bold_free%3Dfalse%26bold_product%3D%26front%3D1%26tz%3D...' \
--header 'x-yak-llt: ...'
Key differences from the search endpoint:
GET instead of POST: No request body to sign. The anti-token is still present but appears to be session-scoped, not request-scoped.
Simpler parameter set: No flip chaining, no install_token, no behavioural tracking in the payload. Just query parameters.
Static anti-token tolerance: Testing revealed the same anti-token worked for 50+ sequential requests, provided other session headers (AccessToken, Cookie, lat) stayed consistent.
No pagination chaining: The offset parameter is a simple integer. Jump to offset 1000 without prior context? It works. For sequential paging, each request increments offset by 20.
Why this endpoint is softer:
Category browsing is a background feature. Users swipe through categories casually, not with the intent precision of search. Pinduoduo's risk model likely weights it lower: less bot incentive, less defensive investment.
The anti-token here isn't cryptographically bound to the specific query. It appears to be a session heartbeat: proof the app is running, not proof this exact request is legitimate.
Operationalizing the discovery:
With a static anti-token, I could reduce the problem to session management:
Extract once: Capture a valid anti-token from a real app session
Maintain session: Keep AccessToken, Cookie, and lat fresh via periodic "heartbeat" requests
Iterate freely: Vary offset, opt_id (category ID), and goods_id without regenerating signatures
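Since only the query string changes between requests, building the URL is plain string work. The helper below is a sketch: the base URL and parameter names come from the captured request above, but the function name is hypothetical and the parameter subset is an assumption (the capture also carried `list_id`, `goods_id`, `page_sn`, and others whose necessity was not established here).

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pinduoduo.com/api/caterham/query/subfenlei_gyl_label"


def build_category_url(opt_id: int, offset: int, pdduid: str, size: int = 20) -> str:
    """Build one category-page URL; only offset and opt_id vary between calls."""
    params = {
        "offset": offset,
        "count": size,
        "size": size,
        "opt_id": opt_id,
        "opt_type": 2,
        "page_id": "catgoods.html",
        "engine_version": "2.0",
        "pdduid": pdduid,  # placeholder; the real value is account-specific
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Three consecutive pages of the same category, stepping offset by the page size.
for offset in (0, 20, 40):
    print(build_category_url(opt_id=25877, offset=offset, pdduid="..."))
```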
The trade-off:
This endpoint doesn't support free-text search. You can't query "GPU" or "iPhone 15." You must traverse the category tree:
Root Categories (opt_id: 1-1000)
└─ Electronics (opt_id: 25877)
   └─ Computer Components
      └─ Graphics Cards ← Products returned here
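Traversing a tree like this to find the leaf categories is a short recursive walk. The sketch below assumes the tree has already been collected into nested dicts with `name`, `opt_id`, and `children` keys; that shape is hypothetical, and `None` marks IDs the capture above does not show.

```python
def leaf_categories(node: dict) -> list:
    """Depth-first walk returning (name, opt_id) for every leaf category."""
    children = node.get("children", [])
    if not children:
        return [(node["name"], node.get("opt_id"))]
    leaves = []
    for child in children:
        leaves.extend(leaf_categories(child))
    return leaves


# Mirrors the example branch above; unknown IDs are left as None.
CATEGORY_TREE = {
    "name": "Electronics",
    "opt_id": 25877,
    "children": [
        {
            "name": "Computer Components",
            "opt_id": None,
            "children": [{"name": "Graphics Cards", "opt_id": None, "children": []}],
        }
    ],
}

print(leaf_categories(CATEGORY_TREE))
```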
Volume achieved: 1,000+ products mapped
Why this worked:
I didn't defeat Pinduoduo's security, I routed around it. The search endpoint is a fortress. The category endpoint is a guard checkpoint with a broken fence. Both lead to the same data warehouse.
This is a recurring pattern in mobile API scraping: high-value endpoints are hardened; supporting infrastructure is often neglected. The skill isn't always cryptographic reverse-engineering. Sometimes it's systematic reconnaissance: finding which door the defenders forgot to bar.
Step 4: What Failed (And Why)
The category endpoint workaround didn't come easily. Before finding that open window, I burned through several approaches that should have worked on paper. Documenting the failures is as important as the solution: they reveal the shape of Pinduoduo's defenses and the mindset required to bypass them.
Failure 1: Static anti-token Harvesting from Search Endpoint
The attempt: Capture a single anti-token from the search endpoint via MITM proxy, then replay it with different query parameters.
Why it failed: The search anti-token is cryptographically bound to the request payload. Change q=SEARCH_KEYWORD to q=ANOTHER_SEARCH_KEYWORD? Token invalid. Increment page=5 to page=6 without the previous response's flip parameter? Token invalid. Even identical requests with the same parameters but different timestamps failed, because the token embeds a ~2-minute expiry window.
The deeper problem: The signing algorithm likely includes:
HMAC of sorted query parameters
Session nonce from previous response
Device fingerprint hash
Timestamp with sliding window
Without the native signing code, I couldn't forge valid tokens. And the native code was obfuscated with O-LLVM control flow flattening: function names stripped, logic scattered across thousands of basic blocks.
Failure 2: Emulated Device Farms
The attempt: Instead of hooking a real app, use Android emulators (Genymotion) with modified system images to run the app and intercept traffic at the network layer.
Setup:
20 LDPlayer instances on a headless server
Magisk for root + certificate injection
Frida server for dynamic instrumentation
Automated screenshot OCR to extract data if API scraping failed
Why it failed: Pinduoduo's app detected virtualization through multiple channels:
| Detection Vector | Emulator Artifact | Real Device |
| CPU info | hardware: goldfish | hardware: qcom |
| Build fingerprint | google/sdk_gphone... | Xiaomi/cactus... |
| Sensor availability | Accelerometer missing | Full sensor stack |
| /proc filesystem | Exposes hypervisor PID | Clean process tree |
| OpenGL renderer | Android Emulator | Adreno (TM) 610 |
The app didn't crash or show errors. It simply served degraded content: limited product listings, no prices, infinite loading spinners. Silent degradation is harder to debug than hard blocks.
Failure 3: Protocol Downgrade to HTTP
The attempt: Force the app to use unencrypted HTTP by DNS hijacking api.pinduoduo.com to a local proxy, hoping the app would fall back from HTTPS.
Why it failed: The app didn't fall back. Certificate pinning meant no TLS handshake = no connection. But more importantly, even if I stripped the pinning, the anti-token and AccessToken headers are generated client-side. Seeing the plaintext request didn't help me forge new ones.
Failure 4: Rate Limit Evasion via Request Shaping
The attempt: Even on the softer category endpoint, aggressive scraping triggered 429 errors. I tried sophisticated evasion:
Jittered delays: Random sleep between 1-5 seconds (Poisson distribution)
User-agent rotation: Spoofing different device models per request
Header reordering: Randomizing header sequence to break fingerprinting
TCP/IP stack tuning: Modifying window sizes, TTL values to mimic different OSes
Why it failed: Pinduoduo's rate limiting isn't naive IP-based counting. It's session reputation scoring:
| Signal | Weight | My Violation |
| Request velocity | High | 200 req/min vs. human ~10/min |
| Temporal pattern | Medium | Machine-precision intervals |
| Device consistency | Critical | Rotating UA while keeping lat static |
| Session age | High | Fresh tokens with old install_token |
| Behavioral depth | Medium | No "browsing" before "buying" actions |
The lat header (location/auth token) is device-bound. Rotate your User-Agent but keep the same lat? Scored as suspicious. The system correlates across dimensions I wasn't controlling.
The fix that worked: Embrace consistency, not evasion. One device profile, one IP, human-paced requests, gradual session aging. Counter-intuitively, being more predictable made me less detectable.
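"Human-paced" can be expressed as a bounded, jittered delay schedule. The generator below is a sketch under assumed numbers: the 10 requests per minute default echoes the human baseline in the table above, while the 50% jitter, the uniform distribution, and the function name are illustrative choices, not measured values.

```python
import random


def paced_delays(count: int, per_minute: float = 10.0, jitter: float = 0.5, seed=None):
    """Yield `count` sleep durations averaging 60/per_minute seconds.

    Each delay is drawn uniformly from +/- `jitter` (as a fraction) around
    the mean, so intervals are irregular but the long-run rate stays bounded.
    """
    rng = random.Random(seed)
    mean = 60.0 / per_minute
    for _ in range(count):
        yield rng.uniform(mean * (1 - jitter), mean * (1 + jitter))


# Inspect a schedule without sleeping; a caller would time.sleep(delay) per request.
for delay in paced_delays(5, seed=42):
    print(f"sleep {delay:.2f}s")
```

Keeping the device profile and headers fixed while only the timing varies matches the "consistency, not evasion" approach described above.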
Operational Architecture
With the category endpoint identified and the anti-token behaving as a session-scoped credential rather than a per-request signature, I needed infrastructure that could maintain session consistency, handle failures gracefully, persist data reliably, and distribute load without breaking the delicate trust relationship established with Pinduoduo's API.
Core Design Principles
| Principle | Implementation |
| Session Coherence | Static header/cookie bundle treated as immutable within a scrape run |
| Defensive Extraction | Multiple fallback fields for every data point (API response shapes vary) |
| Immediate Persistence | Write data before processing completes; never hold in memory |
| Full Auditability | Every request/response logged for post-hoc debugging |
| Geo-Locked Distribution | Proxy rotation constrained to single metropolitan region |
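The "Defensive Extraction" principle amounts to trying several candidate fields and taking the first usable value. A minimal sketch follows; the helper name and the field names in the example (`goods_name`, `goods.name`) are hypothetical stand-ins, since the actual response schema is not reproduced in this post.

```python
def first_present(record: dict, *paths, default=None):
    """Return the first non-empty value found at any dotted path in record."""
    for path in paths:
        value = record
        for key in path.split("."):
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        # Treat empty containers and strings as missing, but keep 0 and False.
        if value not in (None, "", [], {}):
            return value
    return default


item = {"goods_name": "", "goods": {"name": "Example product"}, "price": 0}
print(first_present(item, "goods_name", "goods.name", default="unknown"))
```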
System Components
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────────┐
│ Session Store │────▶│ Scrapy Engine │────▶│ Pinduoduo API │
│ (Static bundle)│ │ (Async requests)│ │ (Category endpoint)│
└─────────────────┘ └──────────────────┘ └─────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ Checkpoint │ │ Proxy Pool │
│ (Resume offset)│ │ (Geo-locked IPs)│
└─────────────────┘ └──────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────────┐
│ CSV Output │ │ JSON Audit Log │
│ (Products) │ │ (Requests/Responses)│
└─────────────────┘ └──────────────────────┘
Session Management Strategy
The architecture treats authentication as a discrete resource rather than a continuous process:
Capture Phase: Manual extraction of header/cookie bundle from live mobile app traffic via MITM proxy
Consumption Phase: Static injection into all requests during scrape run
Refresh Phase: Manual re-capture when session expires (6-12 hour window)
This avoids the complexity of native code hooking or cryptographic reverse-engineering at the cost of periodic manual intervention. For the target volume of 1,000+ products, this trade-off was acceptable.
Critical invariants:
anti-token, AccessToken, and lat must never rotate mid-session
User-Agent and device fingerprint headers must remain consistent with the lat token's embedded device identity
Cookie jar (acid, api_uid) must persist across sequential requests
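One way to enforce these invariants in code is to make the captured bundle read-only, so nothing can rotate a header mid-run by accident. This is a sketch, not the spider's actual structure: the `SessionBundle` class and its `capture` method are hypothetical, and the list of required headers is an assumption drawn from the invariants above.

```python
from dataclasses import dataclass
from types import MappingProxyType

REQUIRED_HEADERS = {"anti-token", "AccessToken", "lat", "User-Agent"}


@dataclass(frozen=True)
class SessionBundle:
    """Read-only header/cookie bundle captured once and reused for a run."""

    headers: MappingProxyType
    cookies: MappingProxyType

    @classmethod
    def capture(cls, headers: dict, cookies: dict) -> "SessionBundle":
        """Validate and freeze a captured bundle; copies guard against aliasing."""
        missing = REQUIRED_HEADERS - set(headers)
        if missing:
            raise ValueError(f"missing session headers: {sorted(missing)}")
        return cls(MappingProxyType(dict(headers)), MappingProxyType(dict(cookies)))


bundle = SessionBundle.capture(
    {"anti-token": "...", "AccessToken": "...", "lat": "...", "User-Agent": "..."},
    {"acid": "...", "api_uid": "..."},
)
print(sorted(bundle.headers))
```

Attempting `bundle.headers["lat"] = "new"` raises `TypeError`, which turns an invariant violation into an immediate failure rather than a silently degraded session.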
Pagination and Resumption
The API uses offset-based pagination with fixed page size. The architecture supports:
Manual resume: Hardcoded offset injection to restart interrupted runs
Automatic continuation: Heuristic detection of empty result sets to terminate gracefully
Bounded execution: Page limit enforcement to prevent runaway scraping
Checkpoint state (current offset, page number) travels with each request via Scrapy's metadata system, enabling distributed state management without external databases.
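Stripped of Scrapy, the pagination logic is a bounded loop over offsets that stops at the first empty page. The generator below is a framework-free sketch of that control flow; `fetch_page` is a hypothetical callable standing in for the real request, and the defaults mirror the spider's `offset_step`, `max_pages`, and `start_offset` arguments.

```python
def crawl_offsets(fetch_page, start_offset: int = 0, offset_step: int = 20, max_pages: int = 50):
    """Yield (offset, items) until a page is empty or max_pages is reached."""
    offset = start_offset
    for _ in range(max_pages):
        items = fetch_page(offset)
        if not items:
            return  # empty result set: graceful end of pagination
        yield offset, items
        offset += offset_step


def fake_fetch(offset: int) -> list:
    """Stand-in for the API: pretend the category holds 45 products."""
    return list(range(offset, min(offset + 20, 45)))


for offset, items in crawl_offsets(fake_fetch):
    print(offset, len(items))
```

Resuming an interrupted run is then just a matter of passing the last successful offset plus one step as `start_offset`.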
Data Flow and Persistence
Dual-write strategy ensures no data loss during failures:
| Stage | Destination | Purpose | Format |
| Extraction | CSV | Immediate human-readable output | Flat, UTF-8 encoded |
| Audit | JSON Lines | Complete request/response reconstruction | Structured, timestamped |
The CSV layer prioritizes write safety over performance: each row is written to disk as soon as it is extracted, so a spider crash loses at most the row in flight. The JSON audit log captures full HTTP conversations, including headers, bodies, and proxy assignments, for forensic analysis when Pinduoduo's defenses trigger unexpected responses.
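A dual-write step along these lines can be sketched with the standard library alone. The function name, file paths, and field names below are hypothetical; the point is the pattern of opening, appending, and closing per row, paired with a timestamped JSON Lines record.

```python
import csv
import json
import os
from datetime import datetime, timezone


def persist(row: dict, audit: dict, csv_path: str, audit_path: str, fields: list) -> None:
    """Append one product row to CSV and one audit record to a JSON Lines file."""
    new_file = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    # Open/append/close per row: slower, but nothing is buffered across rows.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        if new_file:
            writer.writeheader()
        writer.writerow(row)
    record = {"logged_at": datetime.now(timezone.utc).isoformat(), **audit}
    with open(audit_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A call might look like `persist({"goods_id": "1", "price": "9.90"}, {"url": "...", "status": 200}, "products.csv", "audit.jsonl", ["goods_id", "price"])`.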
Network Distribution
Proxy rotation follows geo-fidelity constraints:
Pool locked to a single metropolitan region matching the lat token's claimed location
Random selection per request to distribute load
No session affinity: each request is routed independently
This distributes traffic across thousands of residential IPs while maintaining the geographic consistency that Pinduoduo's risk models verify.
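The selection rule itself is small: filter the pool by region, then choose at random per request. The sketch below assumes each proxy is a dict with `url` and `region` keys; that shape, the region labels, and the `example.invalid` hosts are placeholders, not the format of the proxy module the spider imports.

```python
import random


def pick_proxy(pool: list, region: str, rng=random) -> str:
    """Pick a random proxy URL among those tagged with the session's region."""
    candidates = [proxy["url"] for proxy in pool if proxy["region"] == region]
    if not candidates:
        raise LookupError(f"no proxies available for region {region!r}")
    return rng.choice(candidates)


POOL = [
    {"url": "http://proxy-a.example.invalid:8000", "region": "metro-1"},
    {"url": "http://proxy-b.example.invalid:8000", "region": "metro-1"},
    {"url": "http://proxy-c.example.invalid:8000", "region": "metro-2"},
]

print(pick_proxy(POOL, "metro-1"))
```

Failing loudly when the region has no proxies is deliberate: falling back to another region would break the geographic consistency the session depends on.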
Failure Handling Matrix
| Failure Mode | Detection | Response | Recovery |
| HTTP 429 (rate limit) | Status code | Exponential backoff retry | Automatic |
| Token expiry (403) | Status code + body | Log and terminate | Manual token refresh |
| Empty result set | Content heuristic | Pagination stop | N/A (graceful end) |
| JSON parse failure | Exception | Fallback to raw text logging | Continue |
| Write failure | I/O exception | Log error, skip row | Continue |
| Proxy failure | Connection error | Retry with new proxy | Automatic |
All failure paths prioritize continuation over correctness: partial data is better than no data when scraping hostile infrastructure.
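The status-code rows of the matrix reduce to a small decision function. This is a sketch, not the spider's retry middleware: the action names, the 2-second base delay, and the treatment of other non-2xx codes as "log and continue" are assumptions; the retry cap of 3 matches the `RETRY_TIMES` setting in the source code.

```python
def next_action(status: int, attempt: int, max_retries: int = 3, base_delay: float = 2.0) -> tuple:
    """Map a response status to (action, delay_seconds)."""
    if status == 403:
        return ("terminate", 0.0)  # token expiry: needs a manual refresh
    if status == 429:
        if attempt >= max_retries:
            return ("terminate", 0.0)
        # Exponential backoff: base_delay, then 2x, 4x, ...
        return ("retry", base_delay * 2 ** attempt)
    if 200 <= status < 300:
        return ("parse", 0.0)
    return ("log_and_continue", 0.0)  # assumed default for anything else


for attempt in range(4):
    print(attempt, next_action(429, attempt))
```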
Execution Parameters
The spider accepts runtime configuration for:
Offset step: Controls pagination granularity (default 20, matching API page size)
Max pages: Hard limit on crawl depth
Initial offset: Resume capability for interrupted runs
These parameters enable idempotent execution: re-running with the same configuration produces deterministic output ranges without duplicate data.
Operational Results
| Metric | Value |
| Sustained throughput | ~300 requests/minute |
| Session lifetime | 6-12 hours |
| Products mapped | 1,000+ |
| Data completeness | Price, inventory, sales volume, ad classification |
| Failure rate | <2% (primarily token expiry at end of session) |
The architecture successfully extracted XXX category listings (opt_id=2....7) including real-time pricing, stock availability, and sponsored product injection patterns unavailable through web scraping.
The Real Bottleneck - UserID Throttling
The architecture described so far works for 1,000+ products. It does not work for 100,000+. The constraint isn't technical, it's identity-based.
Pinduoduo's rate limiting operates at the UserID granularity, not just IP or session. Every pdduid parameter in the request URL carries an implicit quota. Once exceeded, responses don't fail with 429, they degrade. Prices disappear. Inventory shows as zero. Product listings truncate mid-page.
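Because degradation arrives with a normal status code, it has to be detected from the payload. One rough heuristic is the share of items on a page that still carry a usable price. The sketch below assumes a `price` key and an 80% threshold; both are hypothetical and would need tuning against real responses.

```python
def looks_degraded(items: list, price_key: str = "price", min_priced_ratio: float = 0.8) -> bool:
    """Flag a non-empty page where too few items carry a usable price."""
    if not items:
        return False  # an empty page is handled as end-of-pagination instead
    priced = sum(1 for item in items if item.get(price_key) not in (None, "", 0))
    return priced / len(items) < min_priced_ratio


healthy_page = [{"price": 1990}] * 20
degraded_page = [{"price": 1990}] * 4 + [{}] * 16
print(looks_degraded(healthy_page), looks_degraded(degraded_page))
```

A check like this belongs next to the 429 handling: once it fires, continuing to request with the same UID only spends more of an already exhausted quota.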
The category endpoint, soft on signature validation, is hard on user reputation.
Why I didn't build this: Scaling further would require a pool of UserIDs, and I simply didn't have the resources to create the accounts, complete each signup, and retrieve every UserID by hand.
What I Actually Did
I stayed in the small-scale regime. 1,000+ products extracted via a single UID, with the session refreshed manually twice. The data served its purpose: market analysis for a specific XXX category, not comprehensive price intelligence.
The spider architecture supports UID rotation via configuration. The pdduid parameter is just another base parameter. But the operational pipeline to generate, validate, and maintain a pool of legitimate UIDs was never built. It was the boundary where this project stopped being a technical challenge and started being a resource extraction business.
Conclusion: The Limits of Technical Evasion
This project started with a straightforward goal, extracting structured product data from Pinduoduo, and ended with a tour of modern mobile API defenses, cryptographic obfuscation, and the economics of identity at scale.
The technical victories were real: bypassing certificate pinning, mapping the API surface, finding the softer category endpoint, building resilient scraping infrastructure. But they were bounded victories, contained by a constraint no amount of code could overcome: Pinduoduo owns the identity layer, and identity is the scarcest resource.
What Worked
| Approach | Outcome | Scale |
| MITM proxy + ADB interception | Visibility into encrypted traffic | Single device |
| Category endpoint discovery | Bypass of search signature requirements | 1,000+ products |
| Static session bundle | Stable authentication for 6-12 hours | Single UID |
| Scrapy + proxy rotation | Distributed, observable, resilient extraction | 300 req/min |
| Defensive data extraction | 99%+ field coverage despite API variance | 1,000+ records |
These techniques succeeded because they respected the defender's logic. Pinduoduo's security isn't flawed, it's economically rational. They invest heavily in high-value endpoint protection (search, checkout) and accept residual risk on supporting infrastructure (category browse). My approach found the efficient frontier of that risk calculation.
What Didn't Scale
| Ambition | Barrier | Root Cause |
| 100,000+ products | UserID quotas | Identity as rate-limiting factor, not IP or signature |
| Real-time monitoring | Session expiry | Manual token refresh unsustainable |
| Complete catalog coverage | Category tree depth | Exponential API calls vs. linear UID quotas |
| Long-term automation | Account aging | Reputation systems require genuine user behavior |
The UserID bottleneck isn't a puzzle to solve, it's a business model enforcement mechanism. Pinduoduo gives away data to real users and withholds it from aggregators. Technical evasion doesn't change those economics; it just raises the cost of enforcement.
The Broader Pattern
This case study reflects a shift in platform defense:
Old model: Block bots at the perimeter (IP, User-Agent, CAPTCHA)
New model: Differentiate humans through accumulated reputation (device history, social graph, behavioral depth)
The new model is harder to spoof because it's contextual and temporal. A real user builds reputation over weeks. A scraper must either replicate that investment (expensive) or find endpoints that ignore it (limited).
For data practitioners, this means:
Scraping is increasingly a cost-benefit negotiation, not a technical challenge. The question isn't "can I get this data?" but "is this data worth the operational cost of simulating legitimacy?"
Platform APIs are tiered by trust. Public endpoints are heavily defended. Private endpoints (mobile APIs) are less defended but harder to access. Partner endpoints are accessible but require legal relationships.
Data extraction at scale requires scale infrastructure. Not just proxies and parsers, but identity farms, behavioral simulation, and compliance systems. This is indistinguishable from fraud infrastructure and prosecuted accordingly.
My Takeaway
I built a system that extracted 1,000+ XXX listings from Pinduoduo's mobile API. It worked because I stayed small, moved quietly, and accepted manual maintenance. It would not work for a price comparison site, a market intelligence platform, or any use case requiring comprehensive, real-time data.
The real lesson isn't in the ADB interception or the Scrapy architecture. It's in knowing when to stop: recognizing that the next bottleneck isn't technical, and that crossing it changes the nature of the work entirely.
Pinduoduo's data is available. Just not to scrapers. Not at scale. Not without becoming something else entirely.
Source Code
NOTE: All the sensetive informations have been hidden. The code is documented by AI.
"""
Educational Scrapy spider demonstrating mobile API scraping techniques.
This example shows how to interact with a protected mobile API endpoint
that uses session-based authentication rather than per-request signatures.
DISCLAIMER: This code is for educational purposes only. Respect robots.txt,
terms of service, and rate limits. Unauthorized scraping may violate laws.
Usage:
scrapy runspider educational_spider.py -a max_pages=10
Requirements:
- scrapy
- Properly configured proxy rotation (see proxy module)
"""
import json
import urllib.parse
import os
import csv
import datetime
import random
import scrapy
# Import proxy list from external module (not included)
# In production, this would be a rotating proxy service
from educational_example.proxy import proxies
class MobileApiSpider(scrapy.Spider):
"""
Spider demonstrating session-based API scraping patterns.
Key concepts demonstrated:
- Static session bundle maintenance
- Offset-based pagination
- Defensive data extraction with fallbacks
- Dual persistence (CSV + JSON audit log)
- Request/response logging for debugging
"""
name = "mobile_api_example"
allowed_domains = ["api.example-ecommerce.com"]
# Conservative delay to respect rate limits
custom_settings = {
"DOWNLOAD_DELAY": 0.5, # 500ms between requests
"RETRY_TIMES": 3,
"RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
}

    def __init__(self, offset_step=20, max_pages=50, start_offset=0, *args, **kwargs):
        """
        Initialize spider with configurable pagination.

        Args:
            offset_step: Number of items per page (API-specific)
            max_pages: Hard limit on pagination depth
            start_offset: Resume capability for interrupted runs
        """
        super().__init__(*args, **kwargs)
        self.offset_step = int(offset_step)
        self.max_pages = int(max_pages)
        self.start_offset = int(start_offset)

        # =========================================================================
        # SESSION BUNDLE - Extracted from legitimate mobile app traffic
        # =========================================================================
        # These headers represent a captured session from a real device.
        # In educational context: Shows how mobile APIs authenticate requests.
        #
        # SECURITY NOTE: In real implementation, these would be:
        # - Loaded from environment variables or secure vault
        # - Rotated when session expires (typically 6-12 hours)
        # - Never hardcoded in source control
        # =========================================================================
        self.headers = {
            # Compression
            "Accept-Encoding": "gzip",
            # Session authentication (short-lived, ~30 min expiry)
            "AccessToken": "...",
            # Connection management
            "Connection": "Keep-Alive",
            # Content negotiation
            "Content-Type": "application/json;charset=UTF-8",
            # Cache validation
            "ETag": "...",
            # Target host
            "Host": "api.example-ecommerce.com",
            # App configuration version
            "PDD-CONFIG": "V4:001.079400",
            # Platform identifier
            "Referer": "Android",
            # Device fingerprint (critical for session consistency)
            # Format: Platform + OS + Device Model + WebView Version + App Version
            "User-Agent": "android Mozilla/5.0 (Linux; Android ...; ... Build/...; wv) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                          "Chrome/... Mobile Safari/537.36 "
                          "app_version/... app_build/... app_channel/... pversion/0",
            # Hardware specifications (must match User-Agent device claims)
            "X-PDD-QUERIES": "width=...&height=...&dpr=...&net=...&brand=...&"
                             "model=...&osv=...&appv=...&pl=...",
            # Language preference
            "accept-language": "en-US",
            # Request signature (session-scoped, not request-scoped in this endpoint)
            # NOTE: Other endpoints may cryptographically bind this to request parameters
            "anti-token": "...",
            # Location/secondary auth token (geo-locked, device-bound)
            "lat": "...",
            # Feature flags
            "multi-set": "...",
            # App identification
            "p-appname": "...",
            "p-mediainfo": "player=...&rtc=...",
            "p-proc": "main",
            "p-proc-time": "...",
            # Debug/development flag
            "vip": "127.0.0.1",
            # The following headers are commented out to demonstrate optional fields
            # that may be required for other endpoints or higher security contexts:
            # "x-app-lang": "...",
            # "x-app-ui": "...",
            # "x-b3-ptracer": "...",  # Distributed tracing ID
            # "x-pdd-info": "...",  # Timezone and feature flags
            # "x-yak-llt": "...",  # Millisecond timestamp
        }

        # Session cookies (must remain consistent with headers)
        self.cookies = {
            "acid": "...",  # Anonymous device identifier
            "api_uid": "...",  # User session identifier
        }

        # =========================================================================
        # API ENDPOINT CONFIGURATION
        # =========================================================================
        # Base URL for category browsing endpoint
        # NOTE: This is a secondary endpoint with lighter protections than search
        self.base_url = "https://api.example-ecommerce.com/api/category/browse"

        # Fixed query parameters (endpoint-specific)
        # In production, these would be parameterized for different categories
        self.base_params = {
            "list_id": "...",  # Category list identifier
            "count": "20",  # Items per page
            "goods_id": "...",  # Anchor product ID
            "opt_id": "...",  # Category ID
            "req_list_action_type": "0",
            "page_sn": "...",  # Page screen identifier
            "support_types": "0",
            "page_id": "category.html",  # Page template
            "content_goods_num": "4",
            "size": "20",
            "show_mark_icon": "1",  # UI flag
            "opt_type": "2",  # Category type
            "req_action_type": "10",
            "engine_version": "2.0",  # Search algorithm version
            "page_el_sn": "...",  # Element identifier
            "pdduid": "...",  # User ID (CRITICAL: rate limiting key)
        }

    def start_requests(self):
        """
        Initialize pagination from configured start offset.

        Yields:
            scrapy.Request with metadata for state tracking
        """
        params = dict(self.base_params)
        params["offset"] = str(self.start_offset)

        # Construct full URL with query parameters
        url = f"{self.base_url}?{urllib.parse.urlencode(params)}"

        # Select random proxy from pool (geo-locked to session region)
        proxy = random.choice(proxies) if proxies else None

        # Build request with full audit trail metadata
        request = scrapy.Request(
            url,
            headers=self.headers,
            cookies=self.cookies,
            callback=self.parse,
            meta={
                "offset": self.start_offset,
                "page": 1,
                "proxy": proxy,
            },
        )

        # Log initial request for debugging
        self._log_request(request)
        yield request

    def parse(self, response):
        """
        Parse API response and handle pagination.

        Demonstrates:
        - Defensive JSON parsing (multiple fallback strategies)
        - Field extraction with multiple fallback sources
        - Dual persistence (CSV + audit log)
        - Pagination continuation logic

        Args:
            response: scrapy.http.Response object

        Yields:
            dict: Extracted product data
            scrapy.Request: Next page if pagination continues
        """
        # Extract state from request metadata
        current_offset = response.meta.get("offset", 0)
        current_page = response.meta.get("page", 1)

        # =========================================================================
        # DEFENSIVE PARSING
        # =========================================================================
        # Mobile APIs often return inconsistent content types or malformed JSON
        # Strategy: Try native json(), fallback to json.loads(), fallback to raw text
        try:
            data = response.json()
        except json.JSONDecodeError:
            try:
                data = json.loads(response.text)
            except json.JSONDecodeError:
                # Log raw response for post-hoc analysis
                self.logger.error(f"JSON parse failed for offset {current_offset}")
                data = {"parse_error": True, "raw": response.text}

        # Log complete request/response cycle for audit trail
        self._log_full_cycle(response.request, response)

        # =========================================================================
        # DATA EXTRACTION
        # =========================================================================
        # API response structure varies by product type, category, and A/B tests
        # Strategy: Try multiple field paths, use first available
        # "or []" also covers a goods_list key that is present but null
        products = (data.get("goods_list") or []) if isinstance(data, dict) else []
        for item in products:
            # "group" may be present but null, so normalize it to a dict first
            group = item.get("group") or {}
            quantity = item.get("quantity")

            # Build record with extensive fallback chains for each field
            record = {
                # Product identification (multiple possible field names)
                "site_product_id": (
                    item.get("goods_id")
                    or item.get("id")
                    or "unknown"
                ),
                # Product naming (full vs. shortened)
                "product_name": item.get("goods_name"),
                "short_name": item.get("short_name"),
                # URL construction (may need domain prepending)
                "product_url": item.get("link_url"),
                # Pricing (highly variable structure across product types)
                # Priority: displayed price > sale price > base price > 0
                "price": (
                    group.get("price_str")
                    or item.get("min_on_sale_group_price")
                    or group.get("promo_price")
                    or group.get("price")
                    or item.get("price")
                    or 0
                ),
                # Imagery (multiple resolution options)
                "image_url": (
                    item.get("hd_thumb_url")
                    or item.get("hd_url")
                    or item.get("image_url")
                    or item.get("thumb_url")
                ),
                # Sales metrics (different naming conventions)
                "sales": item.get("sales") or item.get("cnt"),
                # Social proof
                "customer_num": item.get("customer_num"),
                # Inventory state (type check avoids comparing a string to 0)
                "inventory_quantity": quantity,
                "is_available": isinstance(quantity, (int, float)) and quantity > 0,
                "is_sold_out": quantity == 0,
                # Advertising flag
                "is_ad": bool(item.get("ad")),
                # Quality/relevance score
                "quality_score": item.get("quality"),
                # Pagination metadata for traceability
                "page": current_page,
                "offset": current_offset,
                # Timezone-aware UTC timestamp (utcnow() is deprecated in Python 3.12)
                "scraped_at": datetime.datetime.now(datetime.timezone.utc)
                .isoformat()
                .replace("+00:00", "Z"),
            }

            # Persist to CSV (immediate, safe for crashes)
            self._persist_record(record)

            # Yield for Scrapy pipelines/middlewares
            yield record

        # =========================================================================
        # PAGINATION LOGIC
        # =========================================================================
        # Continue if:
        # 1. Under max_pages limit, AND
        # 2. Current page contained data (heuristic: non-empty response)
        should_continue = False
        if current_page < self.max_pages:
            if isinstance(data, dict):
                # Heuristic: Any non-empty value indicates valid response.
                # The parse-error placeholder built above is excluded so that
                # a failed parse does not trigger a request for the next page.
                should_continue = not data.get("parse_error") and any(
                    v for v in data.values() if v not in (None, [], {})
                )
            elif isinstance(data, list) and len(data) > 0:
                should_continue = True

        if should_continue:
            next_offset = current_offset + self.offset_step
            params = dict(self.base_params)
            params["offset"] = str(next_offset)
            next_url = f"{self.base_url}?{urllib.parse.urlencode(params)}"
            next_proxy = random.choice(proxies) if proxies else None
            yield scrapy.Request(
                next_url,
                headers=self.headers,
                cookies=self.cookies,
                callback=self.parse,
                meta={
                    "offset": next_offset,
                    "page": current_page + 1,
                    "proxy": next_proxy,
                },
            )

    # =========================================================================
    # PERSISTENCE LAYER
    # =========================================================================
    def _persist_record(self, record):
        """
        Append record to CSV with safe concurrency handling.

        Strategy: Open/close per write (inefficient but safe for sporadic concurrency)
        Production alternative: Scrapy Item Pipelines with batching

        Args:
            record: dict of extracted fields
        """
        fieldnames = [
            "site_product_id", "product_name", "short_name",
            "product_url", "price", "image_url", "sales",
            "customer_num", "inventory_quantity", "is_available",
            "is_sold_out", "is_ad", "quality_score",
            "page", "offset", "scraped_at",
        ]
        output_file = "output_products.csv"
        file_exists = os.path.exists(output_file)
        try:
            with open(output_file, "a", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                if not file_exists:
                    writer.writeheader()
                writer.writerow({k: record.get(k, "") for k in fieldnames})
        except IOError as e:
            # Log error but don't kill spider for write failure
            self.logger.error(f"CSV write failed: {e}")

    # =========================================================================
    # AUDIT LOGGING
    # =========================================================================
    def _log_request(self, request):
        """Log outgoing request for debugging."""
        self._write_audit_log({
            # Timezone-aware UTC timestamp (utcnow() is deprecated in Python 3.12)
            "timestamp": datetime.datetime.now(datetime.timezone.utc)
            .isoformat()
            .replace("+00:00", "Z"),
            "type": "request",
            "url": request.url,
            "method": request.method,
            "proxy": request.meta.get("proxy"),
        })

    def _log_full_cycle(self, request, response):
        """
        Log complete request/response cycle.

        Critical for debugging when API returns unexpected responses
        or when session expires mid-crawl.
        """
        # Decode headers safely (Scrapy uses bytes, handle both)
        def safe_decode(value):
            if isinstance(value, (bytes, bytearray)):
                return value.decode("utf-8", errors="ignore")
            return str(value)

        def headers_to_dict(headers):
            result = {}
            for name in headers.keys():
                name_str = safe_decode(name)
                values = [safe_decode(v) for v in headers.getlist(name)]
                result[name_str] = values[0] if len(values) == 1 else values
            return result

        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc)
            .isoformat()
            .replace("+00:00", "Z"),
            "type": "full_cycle",
            "request": {
                "url": request.url,
                "method": request.method,
                "headers": headers_to_dict(request.headers),
                "proxy": request.meta.get("proxy"),
            },
            "response": {
                "status": response.status,
                "headers": headers_to_dict(response.headers),
                "body_preview": response.text[:1000] if response.text else None,
            },
        }
        self._write_audit_log(entry)

    def _write_audit_log(self, entry):
        """Append entry to JSON Lines audit log."""
        log_file = "request_audit.log"
        try:
            with open(log_file, "a", encoding="utf-8") as f:
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
        except IOError as e:
            self.logger.error(f"Audit log write failed: {e}")


# =========================================================================
# EDUCATIONAL NOTES
# =========================================================================
"""
ARCHITECTURAL PATTERNS DEMONSTRATED:

1. SESSION COHERENCE
   Mobile APIs often validate that headers, cookies, and tokens form a consistent
   identity. Rotating one without others triggers security responses.

2. DEFENSIVE EXTRACTION
   E-commerce APIs change field names based on product type, A/B tests, and
   regional variants. Multiple fallback paths increase robustness.

3. DUAL PERSISTENCE
   CSV for immediate human inspection, JSON audit log for debugging. Separation
   prevents data loss if parsing fails.

4. PAGINATION STATE IN METADATA
   Scrapy's meta dictionary carries state through the request chain, enabling
   resume capability and distributed processing.

5. PROXY ROTATION WITH GEO-FIDELITY
   Proxies must match the geographic region implied by location tokens in
   headers. Mismatches trigger immediate blocking.

LIMITATIONS AND ETHICAL CONSIDERATIONS:

- This code demonstrates techniques for educational purposes
- Rate limiting (DOWNLOAD_DELAY) should be respected
- Session tokens expire and require manual refresh
- UserID quotas limit total extractable volume per identity
- Commercial use requires compliance with platform Terms of Service
- Consider official APIs or data licensing for production applications

SECURITY BEST PRACTICES (if implementing similar systems):

1. Never commit credentials to version control
2. Rotate session tokens automatically or via secure vault
3. Monitor for 403/429 responses as signals of detection
4. Implement exponential backoff for retries
5. Respect robots.txt and crawl-delay directives
6. Consider legal review for jurisdiction-specific regulations (CFAA, GDPR, etc.)
"""

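The notes above recommend exponential backoff for retries, while the spider itself only sets Scrapy's RETRY_TIMES and RETRY_HTTP_CODES. As a rough sketch of what the delay schedule could look like (this is not part of the original spider), the helper below computes a "full jitter" backoff delay. The names `backoff_delay` and `backoff_ceilings`, and the `base` and `cap` defaults, are illustrative assumptions, not values taken from the target API.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized delay in seconds for a given retry attempt.

    Hypothetical helper: the upper bound doubles with each attempt
    (base * 2**attempt), is clamped to `cap`, and the actual delay is
    drawn uniformly between 0 and that bound ("full jitter").
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def backoff_ceilings(max_attempts, base=1.0, cap=60.0):
    """List the maximum possible delay for each attempt (no randomness)."""
    return [min(cap, base * (2 ** n)) for n in range(max_attempts)]
```

With the defaults, the maximum wait grows from 1 second to 32 seconds over the first six attempts and is then capped at 60 seconds. The random draw spreads retries out so that several workers hitting a 429 at the same moment do not all retry in lockstep. Wiring this into Scrapy would need custom retry handling, which is left out of this sketch.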


