This guide is written for programming learners and web-scraping beginners. It explains in detail how to design and write a general-purpose Taobao/Tmall product monitoring script. By taking the script apart module by module, you will learn:
The full script source is at the bottom of this article; the body walks through it module by module.
We are going to build a command-line tool that can:
```
┌────────────────────────┐
│   Argument parsing     │  (argparse)
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│   Cookie loading       │
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│   Network request      │  (requests)
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│   HTML parsing         │  (regex + json)
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│   Data extraction      │  (safe_get, price computation)
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│   CSV storage          │
└────────────────────────┘
```
We use the Python standard library argparse to provide the command-line interface:
```python
parser = argparse.ArgumentParser(description="Unified product monitoring script")
parser.add_argument("--url", "-u", required=True, help="Product URL")
parser.add_argument("--cookie", "-c", default="cookies.txt", help="Cookie file")
parser.add_argument("--output", "-o", default="item_full_data.csv", help="Output CSV")
parser.add_argument("--interval", "-t", type=int, default=0, help="Monitoring interval (seconds)")
```
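You can exercise this parser without touching the real command line, because parse_args accepts an explicit argv list. A minimal sketch (the URL here is a made-up example):

```python
import argparse

# A trimmed-down copy of the script's parser (help texts shortened for the demo)
parser = argparse.ArgumentParser(description="Unified product monitoring script")
parser.add_argument("--url", "-u", required=True, help="Product URL")
parser.add_argument("--interval", "-t", type=int, default=0, help="Interval in seconds")

# Passing an argv list lets you test the CLI in code
args = parser.parse_args(["-u", "https://item.taobao.com/item.htm?id=123", "-t", "60"])
```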
Teaching points:
- required=True makes the argument mandatory.
- type=int converts the value to the right type automatically.

Much of Taobao/Tmall's data (member prices, some SKU information) is only available after login, so the script reads a Cookie string from a cookies.txt file.
```python
def load_cookie_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        cookie_str = f.read().strip()
    cookie_dict = {}
    for item in cookie_str.split(';'):
        if '=' not in item:
            continue
        name, value = item.strip().split('=', 1)
        cookie_dict[name] = value
    return cookie_dict
```
Design highlights:
- The Cookie file holds the raw string copied from the browser (format: name1=value1; name2=value2).
- The resulting dict is passed straight to requests.get(cookies=cookie_dict).
- Splitting on ';' first and then on the first '=' keeps values that themselves contain '=' intact.
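The parsing loop can be tried standalone; note how split('=', 1) keeps '=' characters inside a value intact:

```python
cookie_str = "name1=value1; name2=value2; token=abc=xyz"

cookie_dict = {}
for item in cookie_str.split(';'):
    item = item.strip()
    if '=' not in item:
        continue
    # Split on the FIRST '=' only, so values containing '=' survive
    name, value = item.split('=', 1)
    cookie_dict[name] = value
```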
Request headers differ slightly between the two platforms (Tmall, for example, needs a referer). The script picks them automatically by parsing the URL's domain:
```python
def parse_url_and_params(url):
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    query_params = parse_qs(parsed.query)
    params = {k: v[0] for k, v in query_params.items()}
    if "taobao.com" in parsed.netloc:
        platform = "taobao"
    elif "tmall.com" in parsed.netloc:
        platform = "tmall"
    else:
        platform = "unknown"
    return platform, base_url, params
```
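Running urlparse and parse_qs by hand on a sample link (the item id here is made up) shows what each piece contains:

```python
from urllib.parse import urlparse, parse_qs

url = "https://detail.tmall.com/item.htm?id=1234567&spm=a21n57.1"
parsed = urlparse(url)

# Rebuild the URL without the query string, same as the script does
base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
# parse_qs returns lists; take the first value of each parameter
params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
```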
Teaching points:
- The urllib.parse module decomposes a URL into its parts.
- The original query parameters (id, spm, and so on) are preserved, because they can affect how the page renders.
- Headers are selected from the per-platform PLATFORM_HEADERS dictionary.

The request itself goes through requests.get, with a timeout and exception handling:
```python
def fetch_page(url, params, headers, cookies):
    resp = requests.get(url, headers=headers, cookies=cookies, params=params, timeout=15)
    resp.raise_for_status()  # raise an exception for non-200 status codes
    return resp
```
Teaching points:
- timeout prevents the request from blocking indefinitely.
- raise_for_status() simplifies error handling.

Taobao/Tmall product data is not written directly into HTML tags; it is embedded in a <script> tag as a global variable, window.__ICE_APP_CONTEXT__. We extract it with a regular expression and parse it as JSON.
```python
def extract_ice_context(html):
    patterns = [
        r'window\.__ICE_APP_CONTEXT__\s*=\s*(\{[\s\S]*?\});',
        r'var\s+b\s*=\s*(\{[\s\S]*?\});'  # fallback pattern
    ]
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            json_str = match.group(1).rstrip(';')
            try:
                return json.loads(json_str)
            except json.JSONDecodeError:
                continue
    return None
```
Design notes:
- [\s\S]*? matches any character (including newlines) in non-greedy mode, so the match stops at the first closing brace followed by a semicolon.
- A fallback pattern guards against changes in the page structure.
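The non-greedy quantifier is what keeps the pattern from overshooting when several objects appear on the page; a small demonstration:

```python
import re

html = 'window.__ICE_APP_CONTEXT__ = {"a": 1}; var other = {"b": 2};'

# Greedy: runs to the LAST '};' and swallows both objects
greedy = re.search(r'__ICE_APP_CONTEXT__\s*=\s*(\{[\s\S]*\});', html).group(1)
# Non-greedy: stops at the FIRST '};', capturing only the target object
lazy = re.search(r'__ICE_APP_CONTEXT__\s*=\s*(\{[\s\S]*?\});', html).group(1)
```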
safe_get: because the JSON is deeply nested and fields may be missing, we write a safe accessor function:
```python
def safe_get(data, *keys, default=''):
    temp = data
    for key in keys:
        if isinstance(temp, dict):
            temp = temp.get(key)
            if temp is None:
                return default
        else:
            return default
    return temp if temp is not None else default
```
Usage example:

```python
shop_name = safe_get(res, 'seller', 'shopName')
title = safe_get(res, 'item', 'title')
```
Teaching points:
- *keys accepts any number of nested key names.
- It replaces chains of try...except or repeated get() calls.

SKU information lives in res['skuCore']['sku2info'], an object keyed by SKU ID. Each SKU may carry several price fields (post-coupon price, original price, promotion price, and so on). We iterate over every SKU, extract a valid price, and take the minimum.
```python
def parse_sku_min_price(sku2info):
    real_skus = {k: v for k, v in sku2info.items() if k != '0'}  # drop the default SKU
    min_price = None
    for sku_data in real_skus.values():
        price_value = None
        # Priority 1: post-coupon price
        sub_price = sku_data.get('subPrice', {})
        if sub_price:
            price_text = sub_price.get('priceText', '')
            if price_text:
                price_value = _extract_price_from_text(price_text)
        # Priority 2: original price
        if price_value is None:
            price_info = sku_data.get('price', {})
            if price_info:
                price_text = price_info.get('priceText', '')
                if price_text:
                    price_value = _extract_price_from_text(price_text)
        # Priority 3: the raw price field
        if price_value is None:
            direct = sku_data.get('price')
            if direct is not None:
                price_value = _extract_price_from_text(str(direct))
        # Keep the smallest valid price seen so far
        if price_value is not None and price_value > 0:
            if min_price is None or price_value < min_price:
                min_price = price_value
    return min_price if min_price is not None else 0
```
The helper _extract_price_from_text cleans the price string with a regex (e.g. ¥99.00 → 99.00):
```python
def _extract_price_from_text(price_str):
    if not price_str:
        return None
    cleaned = re.sub(r'[^0-9.]', '', str(price_str))
    return float(cleaned) if cleaned else None
```
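A few sample inputs for the cleaning logic (re-defined here as extract_price so the snippet is self-contained; the comma-formatted price is an assumed input the regex happens to handle):

```python
import re

def extract_price(price_str):
    # Same cleaning logic as _extract_price_from_text above
    if not price_str:
        return None
    cleaned = re.sub(r'[^0-9.]', '', str(price_str))
    return float(cleaned) if cleaned else None
```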
Design highlights:
- The default SKU with sku_id = '0' is excluded, because it usually does not represent a real selectable variant.

When SKU information is missing or fails to parse (for example, the product has no SKUs, or the page structure changed), we fall back to priceVO.price.priceText, the price shown at the top right of the page.
```python
rightBarPriceText = safe_get(res, 'componentsVO', 'priceVO', 'price', 'priceText', default='')
if min_price == 0 and rightBarPriceText:
    price_match = re.search(r'(\d+(?:\.\d+)?)', rightBarPriceText)
    if price_match:
        min_price = float(price_match.group(1))
```
Teaching points:
- A fallback price source keeps the script producing useful output even when the primary SKU parsing breaks.

Product parameters live in componentsVO.extensionInfoVO.infos; entries with type 'BASE_PROPS' hold the parameter list. Guarantee information likewise comes in two flavors, 'GUARANTEE' and 'GUARANTEE_NEW'.
```python
def extract_extension_info(infos):
    result = {'guarantee': [], 'guarantee_new': [], 'params': {}}
    for item in infos:
        if item.get('type') == 'GUARANTEE':
            for sub in item.get('items', []):
                result['guarantee'].extend(sub.get('text', []))
        elif item.get('type') == 'GUARANTEE_NEW':
            for sub in item.get('items', []):
                result['guarantee_new'].append({
                    'title': sub.get('title'),
                    'description': sub.get('text', [''])[0]
                })
        elif item.get('type') == 'BASE_PROPS':
            for sub in item.get('items', []):
                name = sub.get('title')
                values = sub.get('text', [])
                if name:
                    result['params'][name] = values[0] if len(values) == 1 else values
    return result
```
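Feeding a hand-built infos list (the field names mirror the structure described above; the sample values are invented) shows the single-value vs multi-value behavior for parameters:

```python
# Hypothetical sample mimicking componentsVO.extensionInfoVO.infos
infos = [
    {'type': 'BASE_PROPS', 'items': [
        {'title': 'Brand', 'text': ['DemoBrand']},
        {'title': 'Color', 'text': ['Red', 'Blue']},
    ]}
]

params = {}
for item in infos:
    if item.get('type') == 'BASE_PROPS':
        for sub in item.get('items', []):
            name = sub.get('title')
            values = sub.get('text', [])
            if name:
                # A single value becomes a scalar, several values stay a list
                params[name] = values[0] if len(values) == 1 else values
```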
Teaching points:
- A single dispatch on the type field handles all three categories, so a new info type is easy to add.

We use csv.DictWriter to write records to the file, serializing complex fields (image lists, parameter objects) to JSON strings.
```python
def append_full_record_to_csv(record, csv_file):
    fieldnames = ['timestamp', 'item_id', 'platform', ...]  # 18 fields in total
    file_exists = os.path.isfile(csv_file)
    with open(csv_file, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        # Serialize lists and dicts to JSON strings
        record['images'] = json.dumps(record.get('images', []), ensure_ascii=False)
        record['guarantee'] = json.dumps(record.get('guarantee', []), ensure_ascii=False)
        record['guarantee_new'] = json.dumps(record.get('guarantee_new', []), ensure_ascii=False)
        record['params'] = json.dumps(record.get('params', {}), ensure_ascii=False)
        writer.writerow(record)
```
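The serialization step can be tried with an in-memory buffer standing in for the CSV file:

```python
import csv
import io
import json

record = {'title': 'Sample item', 'images': ['a.jpg', 'b.jpg']}
# Lists must become JSON strings before DictWriter sees them,
# otherwise the CSV cell would hold Python's repr of the list
record['images'] = json.dumps(record['images'], ensure_ascii=False)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'images'])
writer.writeheader()
writer.writerow(record)
```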
Teaching points:
- The utf-8-sig encoding lets Excel display Chinese text correctly.
- os.path.isfile detects the first write, so the header row is added automatically.

In main, the interval argument decides between a single run and a monitoring loop.
```python
def main():
    args = parser.parse_args()
    cookies = load_cookie_from_file(args.cookie)
    while True:
        success = monitor_once(args.url, cookies, args.output)
        if not success:
            print("[ERROR] Monitoring run failed")
        if args.interval <= 0:
            break
        print(f"[INFO] Waiting {args.interval} seconds before the next run...")
        time.sleep(args.interval)
```
Teaching points:
- Network errors are caught as requests.RequestException; the function prints the error and returns False, leaving the retry decision to the caller.
- If __ICE_APP_CONTEXT__ fails to parse, the second pattern is tried; if that also fails, the function returns None and the run aborts.
- Every nested lookup goes through safe_get with a default value, so a missing key never crashes the script with a KeyError.

Building on this script, features you could add next:
This script is a complete, usable teaching example of a scraper, covering:
By working through it, you will learn how to design a scraper for dynamic pages from scratch, and how to write robust, maintainable code.
Next exercise: modify the script to compare each price with the previous record and report the change, or turn it into a Flask API service.
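A possible starting point for the first exercise — a sketch only, not the author's implementation; the helper names last_recorded_price and describe_change are made up here, and the demo CSV uses only a subset of the script's columns:

```python
import csv
import os
import tempfile

def last_recorded_price(csv_file, item_id):
    """Return the most recent min_price recorded for item_id, or None (hypothetical helper)."""
    if not os.path.isfile(csv_file):
        return None
    last = None
    with open(csv_file, 'r', encoding='utf-8-sig') as f:
        for row in csv.DictReader(f):
            if row.get('item_id') == str(item_id):
                last = row  # later rows overwrite earlier ones
    return float(last['min_price']) if last else None

def describe_change(previous, current):
    """Format the difference between two prices (hypothetical helper)."""
    if previous is None:
        return "first record"
    diff = current - previous
    if diff == 0:
        return "no change"
    return f"{'up' if diff > 0 else 'down'} {abs(diff):.2f}"

# Build a tiny demo CSV with the same column names the script writes
tmp = tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                  newline='', encoding='utf-8-sig')
writer = csv.DictWriter(tmp, fieldnames=['timestamp', 'item_id', 'min_price'])
writer.writeheader()
writer.writerow({'timestamp': '2024-01-01 00:00:00', 'item_id': '123', 'min_price': '99.0'})
writer.writerow({'timestamp': '2024-01-02 00:00:00', 'item_id': '123', 'min_price': '89.5'})
tmp.close()

previous = last_recorded_price(tmp.name, '123')
os.unlink(tmp.name)
```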
Appendix: the full script source. Run it alongside debugging tools (such as the browser developer tools) and inspect each step's output to deepen your understanding.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Unified product monitoring script (supports Taobao / Tmall)
Parses request parameters from the product URL automatically and scrapes the
full data set (basic info, minimum SKU price, parameters, guarantees).
Supports a single run or a monitoring loop.
"""
import re
import json
import os
import time
import csv
import argparse
import requests
from urllib.parse import urlparse, parse_qs
from datetime import datetime

# ======================== Configuration ========================
COOKIE_FILE = "cookies.txt"              # Cookie file path
MONITOR_INTERVAL = 3600                  # Default monitoring interval (seconds)
DEFAULT_CSV_FILE = "item_full_data.csv"  # Default output CSV file name

# Per-platform request headers (only the necessary headers, no product parameters)
PLATFORM_HEADERS = {
    "taobao": {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "accept-language": "zh-CN,zh;q=0.9",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "priority": "u=0, i",
        "sec-ch-ua": "\"Google Chrome\";v=\"147\", \"Not.A/Brand\";v=\"8\", \"Chromium\";v=\"147\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"Windows\"",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36"
    },
    "tmall": {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "accept-language": "zh-CN,zh;q=0.9",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "priority": "u=0, i",
        "sec-ch-ua": "\"Google Chrome\";v=\"147\", \"Not.A/Brand\";v=\"8\", \"Chromium\";v=\"147\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"Windows\"",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
        "referer": "https://www.tmall.com/",
        "origin": "https://detail.tmall.com"
    }
}

# ======================== Utility functions ========================
def load_cookie_from_file(file_path):
    """Read a Cookie string from a text file and return it as a dict."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            cookie_str = f.read().strip()
        if not cookie_str:
            raise ValueError("Cookie file is empty")
        cookie_dict = {}
        for item in cookie_str.split(';'):
            item = item.strip()
            if not item or '=' not in item:
                continue
            name, value = item.split('=', 1)
            cookie_dict[name] = value
        print("[INFO] Cookie loaded successfully")
        return cookie_dict
    except FileNotFoundError:
        print(f"[ERROR] Cookie file not found: {file_path}")
        raise
    except Exception as e:
        print(f"[ERROR] Failed to load Cookie: {e}")
        raise


def parse_url_and_params(url):
    """
    Parse the platform, base URL, and request parameters from a product link.
    Returns (platform, base_url, params_dict).
    """
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    query_params = parse_qs(parsed.query)
    params = {k: v[0] for k, v in query_params.items()}
    host = parsed.netloc.lower()
    if "taobao.com" in host:
        platform = "taobao"
    elif "tmall.com" in host:
        platform = "tmall"
    else:
        platform = "unknown"
    return platform, base_url, params


def fetch_page(url, params, headers, cookies):
    """Request the product page and return the response object."""
    try:
        resp = requests.get(url, headers=headers, cookies=cookies, params=params, timeout=15)
        resp.raise_for_status()
        return resp
    except requests.RequestException as e:
        print(f"[ERROR] Network request failed: {e}")
        raise


def extract_ice_context(html):
    """Extract the __ICE_APP_CONTEXT__ JSON object from the HTML."""
    patterns = [
        r'window\.__ICE_APP_CONTEXT__\s*=\s*(\{[\s\S]*?\});',
        r'var\s+b\s*=\s*(\{[\s\S]*?\});'
    ]
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            json_str = match.group(1).rstrip(';')
            try:
                return json.loads(json_str)
            except json.JSONDecodeError:
                continue
    print("[ERROR] Neither __ICE_APP_CONTEXT__ nor var b was found")
    return None


def safe_get(data, *keys, default=''):
    """Safely fetch a value from nested dicts."""
    temp = data
    for key in keys:
        if isinstance(temp, dict):
            temp = temp.get(key)
            if temp is None:
                return default
        else:
            return default
    return temp if temp is not None else default


def parse_sku_min_price(sku2info):
    """
    Extract prices from every SKU in sku2info and return the minimum
    (and only the minimum).
    """
    real_skus = {k: v for k, v in sku2info.items() if k != '0'}
    min_price = None
    for sku_id, sku_data in real_skus.items():
        price_value = None
        # 1. Post-coupon price: subPrice.priceText
        sub_price = sku_data.get('subPrice', {})
        if sub_price:
            price_text = sub_price.get('priceText', '')
            if price_text:
                price_value = _extract_price_from_text(price_text)
        # 2. Original price: price.priceText
        if price_value is None:
            price_info = sku_data.get('price', {})
            if price_info:
                price_text = price_info.get('priceText', '')
                if price_text:
                    price_value = _extract_price_from_text(price_text)
        # 3. Raw price field
        if price_value is None:
            direct_price = sku_data.get('price')
            if direct_price is not None:
                price_value = _extract_price_from_text(str(direct_price))
        # 4. amount field
        if price_value is None:
            amount = sku_data.get('amount')
            if amount is not None:
                price_value = _extract_price_from_text(str(amount))
        # 5. promotionPrice field
        if price_value is None:
            promo = sku_data.get('promotionPrice')
            if promo is not None:
                price_value = _extract_price_from_text(str(promo))
        if price_value is not None and price_value > 0:
            if min_price is None or price_value < min_price:
                min_price = price_value
    if min_price is None:
        print("[WARN] No valid price could be extracted from any SKU")
        min_price = 0
    return min_price


def _extract_price_from_text(price_str):
    """Extract a float from a price string; None means failure."""
    if not price_str:
        return None
    cleaned = re.sub(r'[^0-9.]', '', str(price_str))
    if cleaned:
        try:
            return float(cleaned)
        except ValueError:
            return None
    return None


def extract_extension_info(infos):
    """Extract guarantees and parameters from componentsVO.extensionInfoVO.infos."""
    result = {
        'guarantee': [],
        'guarantee_new': [],
        'params': {}
    }
    for item in infos:
        item_type = item.get('type')
        if item_type == 'GUARANTEE':
            for sub in item.get('items', []):
                texts = sub.get('text', [])
                result['guarantee'].extend(texts)
        elif item_type == 'GUARANTEE_NEW':
            for sub in item.get('items', []):
                result['guarantee_new'].append({
                    'title': sub.get('title'),
                    'icon': sub.get('icon'),
                    'description': sub.get('text', [''])[0] if sub.get('text') else ''
                })
        elif item_type == 'BASE_PROPS':
            for sub in item.get('items', []):
                param_name = sub.get('title')
                param_values = sub.get('text', [])
                if param_name:
                    if len(param_values) == 1:
                        result['params'][param_name] = param_values[0]
                    else:
                        result['params'][param_name] = param_values
    return result


def append_full_record_to_csv(record, csv_file):
    """Append one complete record to the CSV file."""
    fieldnames = [
        'timestamp', 'item_id', 'platform', 'shop_name', 'title', 'spu_id', 'qr_code', 'images',
        'min_price', 'max_price', 'avg_price', 'total_quantity', 'in_stock_sku', 'out_of_stock_sku', 'total_sku',
        'guarantee', 'guarantee_new', 'params'
    ]
    file_exists = os.path.isfile(csv_file)
    with open(csv_file, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        record['images'] = json.dumps(record.get('images', []), ensure_ascii=False)
        record['guarantee'] = json.dumps(record.get('guarantee', []), ensure_ascii=False)
        record['guarantee_new'] = json.dumps(record.get('guarantee_new', []), ensure_ascii=False)
        record['params'] = json.dumps(record.get('params', {}), ensure_ascii=False)
        writer.writerow(record)
    print("[INFO] Run succeeded, data saved")


def monitor_once(url, cookies, csv_file):
    """One monitoring pass; parameters are parsed from the URL."""
    platform, base_url, params = parse_url_and_params(url)
    if platform == "unknown":
        print("[ERROR] Unrecognized platform; make sure the link is from taobao.com or tmall.com")
        return False
    headers = PLATFORM_HEADERS.get(platform, PLATFORM_HEADERS["taobao"])
    try:
        resp = fetch_page(base_url, params, headers, cookies)
    except Exception:
        return False
    data = extract_ice_context(resp.text)
    if not data:
        return False
    res = safe_get(data, 'loaderData', 'home', 'data', 'res', default={})
    if not res:
        print("[ERROR] Product data object 'res' not found")
        return False
    # Basic fields
    shopName = safe_get(res, 'seller', 'shopName')
    title = safe_get(res, 'item', 'title')
    itemId = safe_get(res, 'item', 'itemId')
    qrCode = safe_get(res, 'item', 'qrCode')
    spuId = safe_get(res, 'item', 'spuId')
    images = safe_get(res, 'item', 'images', default=[])
    # Fallback price (the price shown at the top right of the page)
    rightBarPriceText = safe_get(res, 'componentsVO', 'priceVO', 'price', 'priceText', default='')
    # Minimum SKU price
    sku2info = safe_get(res, 'skuCore', 'sku2info', default={})
    if not sku2info:
        print("[WARN] No SKU info found; the fallback price will be tried")
        min_price = 0
    else:
        min_price = parse_sku_min_price(sku2info)
    # Price fallback: if min_price == 0, try the fallback price
    use_fallback = False
    if min_price == 0 and rightBarPriceText:
        price_match = re.search(r'(\d+(?:\.\d+)?)', rightBarPriceText)
        if price_match:
            min_price = float(price_match.group(1))
            use_fallback = True
            print(f"[INFO] SKU price invalid, using fallback price: {min_price} yuan")
        else:
            print(f"[WARN] Could not parse fallback price text: {rightBarPriceText}")
    # Extension info
    extension_infos = safe_get(res, 'componentsVO', 'extensionInfoVO', 'infos', default=[])
    if extension_infos:
        extension = extract_extension_info(extension_infos)
    else:
        extension = {'guarantee': [], 'guarantee_new': [], 'params': {}}
    # Build the record (max_price and avg_price are set to 0; the quantity fields
    # are kept in the schema but no longer computed, so they are filled with 0)
    record = {
        'timestamp': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        'item_id': itemId,
        'platform': platform,
        'shop_name': shopName,
        'title': title,
        'spu_id': spuId,
        'qr_code': qrCode,
        'images': images,
        'min_price': min_price,
        'max_price': 0,
        'avg_price': 0,
        'total_quantity': 0,
        'in_stock_sku': 0,
        'out_of_stock_sku': 0,
        'total_sku': 0,
        'guarantee': extension.get('guarantee', []),
        'guarantee_new': extension.get('guarantee_new', []),
        'params': extension.get('params', {})
    }
    append_full_record_to_csv(record, csv_file)
    return True


def main():
    parser = argparse.ArgumentParser(description="Unified product monitoring script (parameters extracted from the product URL)")
    parser.add_argument("--url", "-u", required=True, help="Full product URL (Taobao or Tmall)")
    parser.add_argument("--cookie", "-c", default=COOKIE_FILE, help=f"Cookie file path, default {COOKIE_FILE}")
    parser.add_argument("--output", "-o", default=DEFAULT_CSV_FILE, help=f"Output CSV file path, default {DEFAULT_CSV_FILE}")
    parser.add_argument("--interval", "-t", type=int, default=0, help="Monitoring interval in seconds; 0 means run once (default 0)")
    args = parser.parse_args()
    # Load the Cookie
    try:
        cookies = load_cookie_from_file(args.cookie)
    except Exception:
        return
    # Single run or loop
    while True:
        success = monitor_once(args.url, cookies, args.output)
        if not success:
            print("[ERROR] Monitoring run failed")
        if args.interval <= 0:
            break
        print(f"[INFO] Waiting {args.interval} seconds before the next run...")
        time.sleep(args.interval)


if __name__ == "__main__":
    main()
```
Author: 苏皓明
Link:
Copyright: unless otherwise stated, all articles on this blog are released under the BY-NC-SA license. Please credit the source when reposting!