Taobao/Tmall Product Review Scraper: Teaching Design Document

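📚 Document Goals

This document is for scraping beginners and for developers with some Python background. It explains in detail how to design and implement a working Taobao/Tmall product review scraper. By taking the script apart module by module, you will learn:

How to analyze the target endpoint with the browser's developer tools (packet capture)
How to replicate the request headers, Cookie, and signature (sign) so that requests get past the anti-scraping checks
How to handle the JSONP response format
How to parse nested JSON and extract the useful fields
How to save the data to a CSV file
How to control the scraper from the command line (platform, page count, review type)

The complete script is listed in Section 6; the sections before it walk through the script module by module.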

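1. Overall Design

1.1 Requirements

We want a command-line scraper that can:

Accept the ID of a Taobao or Tmall product
Select the platform (Taobao / Tmall) through a command-line option
Scrape the product's reviews, with pagination
Filter by review type (all, positive, negative; filtering for reviews with images is left as an exercise)
Extract these fields: user nickname, review text, score, purchased SKU, review date, image links, seller reply
Write the output to a CSV file that opens cleanly in Excel

1.2 Design Principles

Modularity: Cookie loading, signature generation, requesting, parsing, and storage are independent of one another.
Fault tolerance: missing fields fall back to default values; a failed request prints a message and stops the crawl.
User-friendliness: clear command-line options and progress logging.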

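1.3 Module Breakdown


┌───────────────────────────┐
│     Argument parsing      │  (argparse)
└─────────────┬─────────────┘
              ▼
┌───────────────────────────┐
│      Cookie loading       │  (reads a header string from a file)
└─────────────┬─────────────┘
              ▼
┌───────────────────────────┐
│   Signature generation    │  (MD5)
└─────────────┬─────────────┘
              ▼
┌───────────────────────────┐
│ Request and JSONP parsing │  (requests + regex-based JSONP parsing)
└─────────────┬─────────────┘
              ▼
┌───────────────────────────┐
│     Field extraction      │  (extracts fields from rateList)
└─────────────┬─────────────┘
              ▼
┌───────────────────────────┐
│        CSV storage        │
└───────────────────────────┘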

2. Core Modules in Detail

2.1 Argument Parsing and User Interface

argparse provides flexible command-line options:

python
import argparse

parser = argparse.ArgumentParser(description="Taobao/Tmall product review scraper")
parser.add_argument("--item_id", "-i", required=True, help="product ID")
parser.add_argument("--pages", "-p", type=int, default=3, help="number of pages to scrape")
parser.add_argument("--rate_type", "-t", default="-8", help="-8=all, 1=positive, -1=negative")
parser.add_argument("--platform", "-P", default="tmall", choices=["taobao", "tmall"], help="platform")
parser.add_argument("--cookie", "-c", default="cookies.txt", help="path to the Cookie file")
parser.add_argument("--output", "-o", default="comments.csv", help="output CSV file")
parser.add_argument("--interval", "-it", type=int, default=2, help="delay between requests (seconds)")

Teaching points:

required=True makes the option mandatory.
choices restricts the platform to taobao or tmall.
Default values let the common options be omitted.
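
To see how these options behave without touching the network, a parser can be given an explicit argument list instead of sys.argv. The sketch below rebuilds a reduced copy of the parser with only three of its options; build_parser and the product ID 123456 are made up for illustration.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Reduced copy of the tutorial's parser: only three of its options.
    parser = argparse.ArgumentParser(description="demo parser")
    parser.add_argument("--item_id", "-i", required=True)
    parser.add_argument("--pages", "-p", type=int, default=3)
    parser.add_argument("--platform", "-P", default="tmall", choices=["taobao", "tmall"])
    return parser

# Passing a list to parse_args() bypasses sys.argv, which is handy for experiments.
args = build_parser().parse_args(["-i", "123456", "-P", "taobao"])
print(args.item_id, args.pages, args.platform)

# An invalid choice makes argparse print a usage error and raise SystemExit.
try:
    build_parser().parse_args(["-i", "123456", "-P", "jd"])
    rejected = False
except SystemExit:
    rejected = True
print("invalid platform rejected:", rejected)
```

Note that item_id stays a string because no type is given, while pages falls back to its integer default.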

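2.2 Cookie Loading (Header String Format)

The Taobao/Tmall review endpoint requires the Cookie of a logged-in session. We read a single-line string from cookies.txt in the form name1=value1; name2=value2 (copied directly from the browser).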

python
import requests

def parse_cookie_string(cookie_str: str) -> dict:
    """Turn 'name1=value1; name2=value2' into a dict."""
    cookie_dict = {}
    for item in cookie_str.split(';'):
        item = item.strip()
        if '=' not in item:
            continue
        # Split on the first '=' only, because values may contain '=' themselves.
        k, v = item.split('=', 1)
        cookie_dict[k] = v
    return cookie_dict

def load_cookies(file_path: str) -> requests.Session:
    """Read the Cookie string from a file and attach it to a new Session."""
    session = requests.Session()
    with open(file_path, 'r', encoding='utf-8') as f:
        cookie_str = f.read().strip()
    if not cookie_str:
        raise ValueError("Cookie file is empty")
    cookie_dict = parse_cookie_string(cookie_str)
    session.cookies.update(cookie_dict)
    print(f"Loaded {len(cookie_dict)} cookies")
    return session

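Teaching points:

Instead of MozillaCookieJar, we parse the semicolon-separated string directly, which is simpler.
The cookies are stored on a requests.Session, so every later request sends them automatically.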
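
The parsing step can be tried on its own, without requests or a real Cookie. The sketch below repeats the parsing function so it runs standalone and feeds it a made-up header string that contains a value with '=' in it, a malformed fragment, and a trailing semicolon.

```python
def parse_cookie_string(cookie_str: str) -> dict:
    """Same parsing logic as above, repeated so this example runs on its own."""
    cookie_dict = {}
    for item in cookie_str.split(';'):
        item = item.strip()
        if '=' not in item:
            continue  # skip empty or malformed fragments
        k, v = item.split('=', 1)  # split once: the value may contain '='
        cookie_dict[k] = v
    return cookie_dict

# Made-up header string, not a real session.
sample = "sid=abc123; payload=eyJhIjoxfQ==; broken; lang=zh-CN;"
cookies = parse_cookie_string(sample)
print(cookies)
```

The fragment without '=' is dropped, and the base64-style value keeps its trailing '==' because the split happens only once.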

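2.3 Extracting the Token Needed for Signing (`_m_h5_tk`)

The endpoint's signature depends on the _m_h5_tk Cookie, whose value has the form token_timestamp. Only the part before the underscore is needed.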

python
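import re
import requests

def get_token(session: requests.Session) -> str:
    """Return the token part of the _m_h5_tk cookie (the text before '_')."""
    for cookie in session.cookies:
        if cookie.name == '_m_h5_tk':
            m = re.match(r'([a-f0-9]+)_', cookie.value)
            if m:
                return m.group(1)
    raise ValueError("_m_h5_tk not found; the Cookie may be invalid")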

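Teaching points:

Iterating over the Cookie objects in the Session is more reliable than parsing the file by hand a second time.
The regex captures the hexadecimal token before the underscore. The token is normally 32 characters long, but the pattern [a-f0-9]+ does not enforce that length.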
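
A stricter variant of the extraction can be sketched as follows. It assumes the token is exactly 32 hexadecimal characters and that the part after the underscore is a numeric timestamp; the second point is inferred from the token_timestamp description and is not confirmed here. split_h5_tk is a hypothetical helper, not part of the script.

```python
import re

def split_h5_tk(value: str) -> tuple:
    """Split a 'token_timestamp' cookie value into (token, timestamp)."""
    # Assumption: 32 hex characters, an underscore, then digits only.
    m = re.fullmatch(r'([a-f0-9]{32})_(\d+)', value)
    if not m:
        raise ValueError(f"unexpected _m_h5_tk format: {value!r}")
    return m.group(1), int(m.group(2))

# Made-up value for illustration; a real one comes from the browser's cookies.
sample_value = "0123456789abcdef0123456789abcdef_1700000000000"
token, stamp = split_h5_tk(sample_value)
print(token, stamp)
```

Failing loudly on an unexpected format makes a stale or truncated Cookie easier to diagnose than a signature that silently never matches.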

2.4 Signature Generation (MD5)

The signing algorithm of the Taobao H5 endpoint (recovered by reverse-engineering the page's JavaScript):


sign = md5( token + "&" + timestamp + "&" + appKey + "&" + JSON.stringify(data) )

Here data is the request body object. It must be serialized to JSON without whitespace (separators=(',', ':')), matching the output of JSON.stringify.

python
def build_sign(token: str, timestamp: str, app_key: str, data: dict) -> tuple:
    data_str = json.dumps(data, separators=(',', ':'))
    sign_str = f"{token}&{timestamp}&{app_key}&{data_str}"
    sign = hashlib.md5(sign_str.encode()).hexdigest()
    return sign, data_str

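Teaching points:

The separators argument of json.dumps removes the spaces Python inserts by default, so the signed string matches what the server expects.
The timestamp is in milliseconds (int(time.time() * 1000)).
appKey is fixed at "12574478".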
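
The effect of the separators argument is easy to check with placeholder inputs. The sketch below reuses the build_sign logic from above; the token, timestamp, and request body are made-up values, not real credentials.

```python
import hashlib
import json

def build_sign(token: str, timestamp: str, app_key: str, data: dict) -> tuple:
    # Compact separators mimic JSON.stringify, which emits no spaces.
    data_str = json.dumps(data, separators=(',', ':'))
    sign_str = f"{token}&{timestamp}&{app_key}&{data_str}"
    return hashlib.md5(sign_str.encode()).hexdigest(), data_str

# Placeholder inputs for illustration only.
token = "0123456789abcdef0123456789abcdef"
timestamp = "1700000000000"
data = {"auctionNumId": "123456", "pageNo": 1}

sign, data_str = build_sign(token, timestamp, "12574478", data)
print(data_str)       # compact form, no spaces
print(len(sign))      # an MD5 hex digest is always 32 characters

# The default separators add spaces, which changes the digest.
loose_str = json.dumps(data)
loose_sign = hashlib.md5(f"{token}&{timestamp}&12574478&{loose_str}".encode()).hexdigest()
print(sign != loose_sign)
```

Because the same data_str is both signed and sent as the data parameter, build_sign returns it alongside the digest.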

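2.5 Building and Sending the Request

The URL and parameters of the review endpoint depend on the platform (Taobao or Tmall), so we configure them up front: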

python
PLATFORM_CONFIG = {
    "taobao": {
        "url": "https://h5api.m.taobao.com/h5/mtop.taobao.rate.detaillist.get/6.0/",
        "referer": "https://item.taobao.com/",
        "api": "mtop.taobao.rate.detaillist.get",
        "v": "6.0"
    },
    "tmall": {
        "url": "https://h5api.m.tmall.com/h5/mtop.taobao.rate.detaillist.get/6.0/",
        "referer": "https://detail.tmall.com/",
        "api": "mtop.taobao.rate.detaillist.get",
        "v": "6.0"
    }
}

Build the request body data (the endpoint's data parameter):

python
data = {
    "showTrueCount": False,
    "auctionNumId": item_id,
    "pageNo": page,
    "pageSize": page_size,
    "orderType": "",
    "searchImpr": "-8",
    "expression": "",
    "skuVids": "",
    "rateSrc": "pc_rate_list",
    "rateType": rate_type,      # "-8"=all, "1"=positive, "-1"=negative
    "foldFlag": "0"
}

Then generate the signature and put it in the URL query parameters together with the other fields:

python
sign, data_str = build_sign(token, timestamp, app_key, data)
params = {
    'jsv': '2.7.5', 'appKey': app_key, 't': timestamp, 'sign': sign,
    '_bx-login': 'new', 'api': config['api'], 'v': config['v'],
    'isSec': '0', 'ecode': '1', 'timeout': '20000', 'dataType': 'jsonp',
    'valueType': 'string', 'type': 'jsonp', 'callback': 'mtopjsonp', 'data': data_str
}

Send the GET request, making sure Referer and Origin are set:

python
headers = HEADERS.copy()
headers['Referer'] = f"{config['referer']}item.htm?id={item_id}"
headers['Origin'] = config['referer'].rstrip('/')
resp = session.get(url, params=params, headers=headers, timeout=10)

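Teaching points:

The rateType parameter is the key filter: -8 means all reviews, 1 positive, -1 negative.
Send the full set of browser request headers (Sec-Ch-Ua, Sec-Fetch-*, and so on); otherwise the request may trigger risk control.
Besides the fixed --interval pause between pages, each request is preceded by a random delay of random.uniform(0.5, 1.5) seconds so the timing is less regular.

2.6 Parsing the JSONP Response

The endpoint returns JSONP, for example mtopjsonp13({...}). A regex extracts the JSON text between the outer braces.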

python
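# \d* also accepts a callback name without a numeric suffix, such as "mtopjsonp".
match = re.search(r'mtopjsonp\d*\s*\(\s*(\{.*\})\s*\)', text, re.DOTALL)
if not match:
    # Fallback: the body is plain JSON without a callback wrapper.
    match = re.search(r'(\{.*\})', text, re.DOTALL)
if not match:
    return []
resp_data = json.loads(match.group(1))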

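Teaching points:

re.DOTALL makes . match newline characters as well.
The fallback pattern handles responses that are plain JSON with no callback name, which happens in some cases.
After parsing, check the status: ret[0] == 'SUCCESS::调用成功'.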
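
The unwrapping and status check can be exercised on hand-written strings. The sketch below is a slightly more general variant than the script's regex: it accepts any callback name, and it treats a ret entry as successful when it starts with SUCCESS rather than comparing the full string. Both helpers and both sample responses are illustrative, not captured traffic.

```python
import json
import re

def unwrap_jsonp(text: str) -> dict:
    """Strip an optional callback wrapper and parse the JSON payload."""
    match = re.search(r'^\s*[\w$.]+\s*\(\s*(\{.*\})\s*\)\s*;?\s*$', text, re.DOTALL)
    payload = match.group(1) if match else text  # no wrapper: assume plain JSON
    return json.loads(payload)

def is_success(resp_data: dict) -> bool:
    """True when the first entry of ret starts with 'SUCCESS'."""
    ret = resp_data.get('ret') or []
    return bool(ret) and str(ret[0]).startswith('SUCCESS')

# Hand-written sample responses with placeholder payloads.
ok_text = 'mtopjsonp13({"ret":["SUCCESS::调用成功"],"data":{"rateList":[]}})'
err_text = ' mtopjsonp2({"ret":["FAIL::example error"],"data":{}})'

print(is_success(unwrap_jsonp(ok_text)))
print(is_success(unwrap_jsonp(err_text)))
```

Anchoring the pattern with ^ and $ means a response with unexpected leading or trailing content falls through to json.loads and raises, instead of being silently truncated.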

2.7 Extracting the Review List (`rateList`)

In a successful response the reviews are under resp_data['data']['rateList'] (note: the key is not rates).

python
rate_list = resp_data.get('data', {}).get('rateList', [])

Teaching points:

Confirm field names by capturing real traffic; do not guess them.
Use .get() to avoid KeyError.
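
One limit of chained .get() calls is worth knowing: the default is used only when the key is absent, not when its value is null. Whether this endpoint ever returns "data": null is not confirmed here, so the sketch below is a general JSON precaution; extract_rate_list is a hypothetical helper.

```python
def extract_rate_list(resp_data: dict) -> list:
    """Return rateList, tolerating missing keys and explicit nulls."""
    data = resp_data.get('data') or {}      # covers both "absent" and "null"
    return data.get('rateList') or []

# With "data": null, the chained form calls .get() on None and fails.
null_resp = {"ret": ["..."], "data": None}
try:
    null_resp.get('data', {}).get('rateList', [])
    chained_ok = True
except AttributeError:
    chained_ok = False

print(chained_ok)
print(extract_rate_list(null_resp))
print(extract_rate_list({"data": {"rateList": [{"feedback": "ok"}]}}))
```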

2.8 Parsing a Single Review

Based on the real JSON structure, extract the key fields:

python
def parse_comment(rate: dict) -> dict:
    user_nick = rate.get('userNick', 'anonymous')
    content = rate.get('feedback', '')
    rate_type = rate.get('rateType', '')
    # Score mapping: positive = 5, negative = 1, anything else = 0
    if rate_type == '1':
        score = 5
    elif rate_type == '-1':
        score = 1
    else:
        score = 0
    sku_info = rate.get('skuValueStr', '')
    rate_date = rate.get('feedbackDate', '') or rate.get('createTime', '')
    pic_list = rate.get('feedPicPathList', [])
    has_pic = len(pic_list) > 0
    pic_urls = json.dumps(pic_list, ensure_ascii=False)
    seller_reply = rate.get('reply', '')
    return {
        'nick': user_nick,
        'content': content,
        'score': score,
        'sku_info': sku_info,
        'rate_date': rate_date,
        'has_pic': has_pic,
        'pic_urls': pic_urls,
        'seller_reply': seller_reply,
        'rate_type': rate_type
    }

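Teaching points:

Nicknames may be masked (for example “小**憨”); this is normal.
Image URLs start with // (protocol-relative). A browser fills in the scheme automatically, but outside a browser you need to prepend https: before downloading.
Follow-up reviews are stored under the append field in the JSON. This example does not extract them; doing so is a good extension exercise.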
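
If the image links are going to be downloaded by a script rather than opened in a browser, the protocol-relative form can be normalized first. A minimal sketch, where normalize_pic_url is a hypothetical helper and the image paths are placeholders:

```python
import json

def normalize_pic_url(url: str, scheme: str = "https") -> str:
    """Prefix protocol-relative URLs ('//host/path') with a scheme."""
    if url.startswith("//"):
        return f"{scheme}:{url}"
    return url  # already absolute (or something else): leave untouched

# Placeholder paths on the host seen in the sample output; not real images.
pic_list = ["//img.alicdn.com/a.jpg", "https://img.alicdn.com/b.jpg"]
full_urls = [normalize_pic_url(u) for u in pic_list]

# Same serialization the script uses for the pic_urls CSV column.
print(json.dumps(full_urls, ensure_ascii=False))
```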

2.9 Saving to CSV

csv.DictWriter writes the rows with UTF-8 with BOM encoding so that Excel displays the text correctly.

python
import csv

def save_csv(comments: list, filename: str):
    """Write the parsed reviews to a CSV file that Excel can open directly."""
    if not comments:
        print("No data; nothing saved")
        return
    fields = ['nick', 'content', 'score', 'sku_info', 'rate_date',
              'has_pic', 'pic_urls', 'seller_reply', 'rate_type']
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(comments)
    print(f"Saved {len(comments)} reviews to {filename}")

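Teaching points:

newline='' prevents extra blank lines between rows on Windows.
The utf-8-sig encoding writes a BOM, which lets Excel detect UTF-8 and show Chinese text correctly.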
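
Both points can be verified with a small round trip. The sketch below writes one made-up row containing a newline and a comma to a temporary file, checks for the BOM bytes, and reads the row back.

```python
import csv
import os
import tempfile

rows = [{"nick": "小**憨", "content": "line one\nline two, with a comma"}]

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["nick", "content"])
    writer.writeheader()
    writer.writerows(rows)

# The file starts with the three-byte UTF-8 BOM.
with open(path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with utf-8-sig strips the BOM; the csv module restores the
# embedded newline and comma because the field was quoted on write.
with open(path, "r", newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))

print(has_bom, loaded == rows)
```

Review text often contains commas and line breaks, which is why the csv module is a safer choice than joining fields by hand.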

2.10 Main Flow

python
def main():
    args = parser.parse_args()
    session = load_cookies(args.cookie)
    token = get_token(session)
    all_comments = []
    for page in range(1, args.pages + 1):
        print(f"Scraping page {page}, platform: {args.platform}")
        rate_list = fetch_comments_page(
            session, args.item_id, page, token, args.platform,
            rate_type=args.rate_type
        )
        if not rate_list:
            # Empty page or error: stop paging.
            break
        for rate in rate_list:
            all_comments.append(parse_comment(rate))
        print(f"Page {page}: got {len(rate_list)} reviews")
        time.sleep(args.interval)
    save_csv(all_comments, args.output)

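Teaching points:

Pages are fetched one at a time until a page returns no data or the requested page count is reached.
After each request the loop sleeps for the configured number of seconds, so requests are not sent fast enough to get the account blocked.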
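
The two stop conditions can be tested in isolation by replacing the network call with a stub. In the sketch below, crawl and fake_fetch are hypothetical names, the stub stands in for fetch_comments_page, and the sleep is left out to keep the example fast.

```python
def crawl(fetch_page, max_pages: int) -> list:
    """Collect items page by page; stop at max_pages or at the first empty page."""
    collected = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break
        collected.extend(items)
    return collected

# Stub data: two pages of results, then nothing.
FAKE_PAGES = {1: ["a", "b"], 2: ["c"]}
calls = []

def fake_fetch(page: int) -> list:
    calls.append(page)  # record which pages were requested
    return FAKE_PAGES.get(page, [])

result = crawl(fake_fetch, max_pages=5)
print(result, calls)
```

Passing the fetch function in as a parameter is a design choice for testability; the script itself calls fetch_comments_page directly.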

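3. Error Handling and Robustness

| Possible problem | Handling |
| --- | --- |
| Cookie file missing or malformed | Raise an exception and ask the user to check the file |
| `_m_h5_tk` missing | Report that the Cookie is invalid and exit |
| Network timeout or bad status code | Catch the exception, print the error, return an empty list |
| JSONP parsing fails | Print the first 200 characters of the response, return an empty list |
| Endpoint returns an error (such as risk control) | Print the `ret` information and stop paging |
| A field is missing | `.get()` supplies a default value, so the script does not crash |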

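4. Suggested Extensions

The script can be extended further:

Multithreaded detail fetching: after obtaining rateList, request a detail endpoint from multiple threads (keep the request rate under control).
Incremental scraping: record the IDs of reviews already scraped to avoid duplicates.
Database storage: replace the CSV with SQLite or MySQL to support more complex queries.
Web visualization: use Flask + ECharts to display the results of sentiment analysis on the reviews.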
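
The incremental and database ideas can be combined in a few lines with the standard sqlite3 module. The sketch below is one possible approach, not part of the script. Since the name of the review ID field is not shown in this document, it uses a hash of nickname, date, and text as a stand-in key; review_key, save_reviews, the table layout, and the sample rows are all hypothetical.

```python
import hashlib
import sqlite3

def review_key(review: dict) -> str:
    """Stand-in identifier: a hash of nickname, date and text.
    A real review ID from the response would be preferable if one exists."""
    raw = "|".join(str(review.get(k, "")) for k in ("nick", "rate_date", "content"))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def save_reviews(conn: sqlite3.Connection, reviews: list) -> int:
    """Insert reviews, skipping rows whose key already exists. Returns rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "review_key TEXT PRIMARY KEY, nick TEXT, rate_date TEXT, content TEXT)"
    )
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO reviews VALUES (?, ?, ?, ?)",
        [(review_key(r), r.get("nick", ""), r.get("rate_date", ""), r.get("content", ""))
         for r in reviews],
    )
    conn.commit()
    return conn.total_changes - before

conn = sqlite3.connect(":memory:")
batch = [{"nick": "u1", "rate_date": "2025-01-01", "content": "good"},
         {"nick": "u2", "rate_date": "2025-01-02", "content": "bad"}]
first = save_reviews(conn, batch)
second = save_reviews(conn, batch)   # same batch again: nothing new
print(first, second)
```

The primary key together with INSERT OR IGNORE leaves deduplication to the database, so the scraper does not need to keep a set of seen keys in memory.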

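5. Summary

The script demonstrates the complete call flow of the Taobao/Tmall H5 review endpoint, covering:

Command-line option design
Cookie loading and token extraction
Signature generation (MD5)
Request construction and JSONP parsing
Data extraction and CSV storage

After working through this case you should be able to build similar e-commerce scrapers on your own and understand how to deal with the common anti-scraping mechanisms (signatures, Cookies, request-header checks).

Next exercises:

Modify the script to add an "images only" mode (rate_type=7).
Extract the follow-up review (append) field.
Store the data in an SQLite database and implement deduplication.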

6. Complete Source Code

python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Taobao/Tmall product review scraper
Usage: python comment_crawler.py --item_id ITEM_ID --platform taobao --pages 3
"""

import csv
import hashlib
import json
import re
import time
import random
import argparse
import requests

# ======================== Configuration ========================
DEFAULT_COOKIE_FILE = "cookies.txt"
DEFAULT_MAX_PAGE = 3
DEFAULT_PAGE_SIZE = 20
DEFAULT_REQUEST_INTERVAL = 2
DEFAULT_OUTPUT_CSV = "comments.csv"

PLATFORM_CONFIG = {
    "taobao": {
        "url": "https://h5api.m.taobao.com/h5/mtop.taobao.rate.detaillist.get/6.0/",
        "referer": "https://item.taobao.com/",
        "api": "mtop.taobao.rate.detaillist.get",
        "v": "6.0"
    },
    "tmall": {
        "url": "https://h5api.m.tmall.com/h5/mtop.taobao.rate.detaillist.get/6.0/",
        "referer": "https://detail.tmall.com/",
        "api": "mtop.taobao.rate.detaillist.get",
        "v": "6.0"
    }
}

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Sec-Ch-Ua": '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "script",
    "Sec-Fetch-Mode": "no-cors",
    "Sec-Fetch-Site": "same-site",
}

def parse_cookie_string(cookie_str: str) -> dict:
    cookie_dict = {}
    for item in cookie_str.split(';'):
        item = item.strip()
        if '=' not in item:
            continue
        k, v = item.split('=', 1)
        cookie_dict[k] = v
    return cookie_dict

def load_cookies(file_path: str) -> requests.Session:
    session = requests.Session()
    with open(file_path, 'r', encoding='utf-8') as f:
        cookie_str = f.read().strip()
    if not cookie_str:
        raise ValueError("Cookie file is empty")
    cookie_dict = parse_cookie_string(cookie_str)
    session.cookies.update(cookie_dict)
    print(f"Loaded {len(cookie_dict)} cookies")
    return session

def get_token(session: requests.Session) -> str:
    for cookie in session.cookies:
        if cookie.name == '_m_h5_tk':
            m = re.match(r'([a-f0-9]+)_', cookie.value)
            if m:
                return m.group(1)
    raise ValueError("_m_h5_tk not found; the Cookie may be invalid")

def build_sign(token: str, timestamp: str, app_key: str, data: dict) -> tuple:
    data_str = json.dumps(data, separators=(',', ':'))
    sign_str = f"{token}&{timestamp}&{app_key}&{data_str}"
    sign = hashlib.md5(sign_str.encode()).hexdigest()
    return sign, data_str

def fetch_comments_page(session, item_id, page, token, platform, page_size=20, rate_type="-8"):
    config = PLATFORM_CONFIG[platform]
    timestamp = str(int(time.time() * 1000))
    app_key = "12574478"
    data = {
        "showTrueCount": False,
        "auctionNumId": item_id,
        "pageNo": page,
        "pageSize": page_size,
        "orderType": "",
        "searchImpr": "-8",
        "expression": "",
        "skuVids": "",
        "rateSrc": "pc_rate_list",
        "rateType": rate_type,
        "foldFlag": "0"
    }
    sign, data_str = build_sign(token, timestamp, app_key, data)
    params = {
        'jsv': '2.7.5', 'appKey': app_key, 't': timestamp, 'sign': sign,
        '_bx-login': 'new', 'api': config['api'], 'v': config['v'],
        'isSec': '0', 'ecode': '1', 'timeout': '20000', 'dataType': 'jsonp',
        'valueType': 'string', 'type': 'jsonp', 'callback': 'mtopjsonp', 'data': data_str
    }
    url = config['url']
    headers = HEADERS.copy()
    headers['Referer'] = f"{config['referer']}item.htm?id={item_id}"
    headers['Origin'] = config['referer'].rstrip('/')
    time.sleep(random.uniform(0.5, 1.5))
    try:
        resp = session.get(url, params=params, headers=headers, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as e:
        # Timeouts, connection errors and 4xx/5xx status codes all end up here.
        print(f"Request for page {page} failed: {e}")
        return []
    text = resp.text
    match = re.search(r'mtopjsonp\d*\s*\(\s*(\{.*\})\s*\)', text, re.DOTALL)
    if not match:
        match = re.search(r'(\{.*\})', text, re.DOTALL)
    if not match:
        print(f"Page {page}: unexpected response format")
        return []
    try:
        resp_data = json.loads(match.group(1))
    except json.JSONDecodeError:
        print(f"JSON parsing failed: {text[:200]}")
        return []
    ret = resp_data.get('ret', [])
    if not ret or ret[0] != 'SUCCESS::调用成功':
        print(f"Endpoint returned an error: {ret}")
        return []
    rate_list = resp_data.get('data', {}).get('rateList', [])
    if not rate_list:
        print(f"Page {page} has no rateList")
    return rate_list

def parse_comment(rate: dict) -> dict:
    user_nick = rate.get('userNick', 'anonymous')
    content = rate.get('feedback', '')
    rate_type = rate.get('rateType', '')
    if rate_type == '1':
        score = 5
    elif rate_type == '-1':
        score = 1
    else:
        score = 0
    sku_info = rate.get('skuValueStr', '')
    rate_date = rate.get('feedbackDate', '') or rate.get('createTime', '')
    pic_list = rate.get('feedPicPathList', [])
    has_pic = len(pic_list) > 0
    pic_urls = json.dumps(pic_list, ensure_ascii=False)
    seller_reply = rate.get('reply', '')
    return {
        'nick': user_nick,
        'content': content,
        'score': score,
        'sku_info': sku_info,
        'rate_date': rate_date,
        'has_pic': has_pic,
        'pic_urls': pic_urls,
        'seller_reply': seller_reply,
        'rate_type': rate_type
    }

def save_csv(comments: list, filename: str):
    if not comments:
        print("No data; nothing saved")
        return
    fields = ['nick', 'content', 'score', 'sku_info', 'rate_date',
              'has_pic', 'pic_urls', 'seller_reply', 'rate_type']
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(comments)
    print(f"Saved {len(comments)} reviews to {filename}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--item_id", "-i", required=True)
    parser.add_argument("--pages", "-p", type=int, default=DEFAULT_MAX_PAGE)
    parser.add_argument("--rate_type", "-t", default="-8", help="-8=all, 1=positive, -1=negative")
    parser.add_argument("--platform", "-P", default="tmall", choices=["taobao","tmall"])
    parser.add_argument("--cookie", "-c", default=DEFAULT_COOKIE_FILE)
    parser.add_argument("--output", "-o", default=DEFAULT_OUTPUT_CSV)
    parser.add_argument("--interval", "-it", type=int, default=DEFAULT_REQUEST_INTERVAL)
    args = parser.parse_args()

    session = load_cookies(args.cookie)
    token = get_token(session)
    print(f"Token: {token}")

    all_comments = []
    for page in range(1, args.pages + 1):
        print(f"Scraping page {page}, platform: {args.platform}")
        rate_list = fetch_comments_page(
            session, args.item_id, page, token, args.platform,
            page_size=DEFAULT_PAGE_SIZE, rate_type=args.rate_type
        )
        if not rate_list:
            print("No more data; stopping")
            break
        for rate in rate_list:
            all_comments.append(parse_comment(rate))
        print(f"Page {page}: got {len(rate_list)} reviews")
        time.sleep(args.interval)

    save_csv(all_comments, args.output)

if __name__ == "__main__":
    main()

7. Running the Script

7.1 Preparation

Save the Cookie string copied from the browser as cookies.txt (a single line).
Install the dependency: pip install requests

7.2 Commands

bash
# Scrape the first 3 pages of all reviews for Tmall product 823823340733
python comment_crawler.py -i 823823340733 -P tmall -p 3

# Scrape the negative reviews of Taobao product 981568322282 (2 pages)
python comment_crawler.py -i 981568322282 -P taobao -t -1 -p 2

# Write the output to a specific file
python comment_crawler.py -i 823823340733 -o my_comments.csv

7.3 Sample Output (CSV Contents)

nick	content	score	sku_info	rate_date	has_pic	pic_urls	seller_reply	rate_type
小**憨	发货很快，收到很精致...	5	足金小蛮腰和田玉手串【品牌礼盒】	2026年3月27日	True	["//img.alicdn.com/..."]		1
无**阳	和田玉珠子不透...	1	足金小蛮腰和田玉手串【赠中国黄金礼盒】	2025年10月21日	True	["//img.alicdn.com/..."]		-1

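In this tutorial you have learned not only how to scrape Taobao/Tmall reviews but, more importantly, a general method for analyzing endpoints, replicating requests, and handling anti-scraping checks. Good luck on your scraping journey!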
