Python爬虫：抓取知乎所有话题

分析页面请求

LZ准备从知乎的话题页抓取所有的话题及其结构，在页面上获取话题共有两类ajax request，分别是「显示子话题」和「加载更多」，两者都是一次输出10个话题。request url如下：

https://www.zhihu.com/topic/19776749/organize/entire?child=&parent=19552706

其中child和parent是话题的ID，「显示子话题」只需要parent参数，「加载更多」则两个参数都需要。如果两个参数都没有，则表示「根话题」。

两类request的response数据都是json格式，结构比较相似，只是msg中最后一个元素不同。

抓取逻辑

知乎话题结构是有层级的，为描述方便，下面称「根话题」的子话题为一级话题，一级话题的子话题为二级话题，依此类推……

为抓取所有的话题，流程应该是这样的：从根话题出发，抓取所有的一级话题并存储，然后依次分析每个一级话题，若有子话题，则抓取它的所有子话题（二级）并存储，然后……一直循环下去。

根据流程，需要实现下面三个主要功能：

抓取一次request的josn数据
抽取json数据中的话题，并记录其是否拥有子话题
给出父话题，获取其所有的子话题

关于话题存储，多说一点。因为LZ想保存话题结构，并且在「加载更多」的时候需要用到父话题ID，所以LZ把每个话题记为列表，包含四个元素：话题名称、话题ID、父话题ID、有无子话题的状态（bool类型），比如['实体'，'19778287'，'19776749'，1]，在去重的时候，四个元素都一致，才算重复。

代码实现

代码中，前面一大段是模拟登录知乎。运行时，输入验证码后就会开始抓取数据，完成后会将数据存入到本地文件。因为运行时间比较长（1小时左右），所以添加了一些状态提示。

# all_topics.py

import requests
from bs4 import BeautifulSoup

# session变量
s = requests.Session()
headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
    "Connection": "keep-alive",
    "Content-Type":" application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
    "Referer": "http://www.zhihu.com/"
}
_xsrf = ''

def set_xsrf():
    global _xsrf
    html = s.get('http://www.zhihu.com/', headers=headers)  # GET请求，获取一个响应对象
    soup = BeautifulSoup(html.text,'lxml-xml')  # 使用BeautifulSoup解析HTML
    _xsrf = soup.findAll(type='hidden')[0]['value']
 
def get_captcha():    
    # 获取验证码图片并保存在本地
    captcha_url = 'http://www.zhihu.com/captcha.gif'
    captcha = s.get(captcha_url, stream=True, headers=headers)
    with open('captcha.gif', 'wb') as f:
        for line in captcha.iter_content(10):
            f.write(line)
        f.close()

    # 输入获取到的验证码，并入变量captch_str
    print('输入验证码：', end='')
    captcha_str = input()
    return captcha_str.strip()

# 登录请求
def login():
    global s
    global _xsrf
    set_xsrf()
    captcha = get_captcha()
    data = {
        '_xsrf': _xsrf,
        'email': '654408619@qq.com',
        'password': 'mftx1029',
        'remember_me': True,
        'captcha': captcha
    }
    r = s.post(url='http://www.zhihu.com/login/email', data=data, headers=headers)
    return r

    
all_topics = [['根话题', '19776749', '', 1]]  # 保存所有的话题，初始状态保存根话题
has_children = [['根话题', '19776749', '', 1]]  # 有子话题，但还没有抓取其子话题。初始状态保存根话题
cnt_a = 0  # all_topics元素计数
cnt_h = 0  # has_children元素计数

# 获取一次request的json数据
def get_json(child_id='', parent_id=''):
    global s
    global headers
    url = 'https://www.zhihu.com/topic/19776749/organize/entire'
    data = {'child': child_id, 'parent': parent_id, '_xsrf': _xsrf}
    res = s.post(url,data=data, headers=headers)
    json_dict = res.json()
    return json_dict

# 解析json数据，获取子话题及「加载更多」的信息   
def json_process(json_dict):
    single_dict = {}  # 保存下面的children和more
    children = []  # 保存子话题
    more = []  # 保存加载更多的信息
    parent_id = json_dict['msg'][0][2]
    if len(json_dict['msg'][1]) == 11:
        more_child_id = json_dict['msg'][1][9][0][2]
        more = [more_child_id, parent_id]
    for i in json_dict['msg'][1][:10]:
        child_name, child_id = i[0][1], i[0][2]
        if i[1]:
            children.append([child_name, child_id, parent_id, 1])
        else:
            children.append([child_name, child_id, parent_id, 0])
    single_dict['children'] = children
    single_dict['more'] = more
    return single_dict

# 获取一个话题的所有子话题
def get_children(parent_id='', parent_name=''):
    global cnt_a
    global cnt_h
    children = []
    more = ['', parent_id]
    while more:
        json_dict = get_json(child_id=more[0], parent_id=more[1])
        single_dict = json_process(json_dict)
        children += single_dict['children']
        more = single_dict['more']
        # 用140个空格清除上次的提示
        print(' '*140, end='\r')
        # 终端提示抓取状态
        print('已获取{}个话题。还有{}个话题排队当中，正在抓取「{}」的第{} 个子话题...'.format(\
        cnt_a, cnt_h, str(parent_name).encode('gbk', 'ignore').decode('gbk'), len(children)), end='\r')
    return children

# 工作函数，抓取全部话题并保存为文件    
def work():
    global cnt_h
    global cnt_a
    login()
    while has_children:
        first_child = has_children.pop(0)
        children = get_children(first_child[1],first_child[0])
        for c in children:
            # 去重并添加到all_topics和has_topics，后者需要再判断下是否有子话题
            if c not in all_topics:
                all_topics.append(c)
                if c[-1] == 1:
                    has_children.append(c)
        cnt_a = len(all_topics)
        cnt_h = len(has_children)
    
    # 存入文件
    for item in all_topics:
        with open('all_topics.txt', 'a', encoding='utf-8') as f:
            string = str(item[0]) + '\t' + str(item[1]) + '\t' + str(item[2]) + '\t' + str(item[3]) + '\n'
            f.write(string)
            

if __name__ == "__main__":
    work()

运行状态如下：

E:\@coding\python>python all_topics.py
输入验证码：hbfu
已获取53个话题。还有46个话题排队当中，正在抓取「「未归类」话题」的第1600 个子话题...

LZ是昨天（2016-02-04）跑的程序，共抓取了46272个话题。

Python爬虫：抓取知乎所有话题

分析页面请求

抓取逻辑

代码实现

添加新评论

最近回复

分类