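A Method for Backing Up Weibo

This blog has a series called 微言小义, which is really a backup of my personal Weibo, but I have always maintained it by hand, which is tedious. Over the past two days I put together a Python script that automatically fetches my Weibo posts and generates articles in the 微言小义 format. 微言小义（2020.12）, published yesterday, has been regenerated with this script.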

What you need:

Log in to Weibo to get your cookie and user-agent
Weibo paging endpoint: https://weibo.com/ajax/statuses/mymblog?uid=5384740764&feature=0&page=1
Long-post expansion endpoint: https://weibo.com/ajax/statuses/longtext?id=JyRAoxq9w
The third-party Python library requests
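
As a rough illustration of how the two endpoints above are addressed, the sketch below builds both URLs with the standard library. The helper names `page_url` and `longtext_url` are made up for this example. The query parameters are copied from the URLs listed above; whether Weibo accepts other values is not verified here.

```python
from urllib.parse import urlencode

BASE = 'https://weibo.com/ajax/statuses'


def page_url(uid, page):
    """Return the paging URL for one page of a user's posts."""
    # Parameter names and feature=0 are taken from the URL listed above.
    query = urlencode({'uid': uid, 'feature': 0, 'page': page})
    return f'{BASE}/mymblog?{query}'


def longtext_url(mblogid):
    """Return the URL that expands a long post to its full text."""
    query = urlencode({'id': mblogid})
    return f'{BASE}/longtext?{query}'


print(page_url('5384740764', 1))
print(longtext_url('JyRAoxq9w'))
```

Both requests need the cookie and user-agent from a logged-in browser session, passed as headers, for example `requests.get(page_url(uid, 1), headers=HEADERS)`.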

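To make this a true archive, the script:

Fetches the full text wherever possible
Captures reposted posts as well
Downloads images to local storage
Replaces short links with the original links wherever possible, because Weibo's short links get blocked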
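
To make the last item concrete, here is a small sketch of the short-link replacement run on made-up data. The field names `short_url`, `long_url`, and `url_title` are the ones the script reads from `url_struct`. The sample post, the example URLs, and the function name `expand_short_links` are invented for illustration, and skipping incomplete entries is my own assumption about what a safe default looks like.

```python
def expand_short_links(text, url_struct):
    """Replace each short link in text with a Markdown link to its target."""
    for entry in url_struct:
        short_url = entry.get('short_url')
        long_url = entry.get('long_url')
        if not short_url or not long_url:
            # Without both ends of the mapping, leave the text untouched.
            continue
        # Fall back to the target URL when no title is provided.
        title = entry.get('url_title') or long_url
        text = text.replace(short_url, f'[{title}]({long_url})')
    return text


# Hypothetical sample shaped like the url_struct field of a post.
sample_text = 'Worth reading http://t.cn/abc123'
sample_struct = [{
    'short_url': 'http://t.cn/abc123',
    'long_url': 'https://example.com/article',
    'url_title': 'Web link',
}]

print(expand_short_links(sample_text, sample_struct))
```

Because the replacement is a plain string substitution, running it twice on the same text is harmless as long as the long URL does not itself contain the short one.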

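I have no formal CS training and my coding skills are poor; the script has only been run successfully on my own machine (macOS + Python 3.8). The complete code is shared below: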

import requests
from datetime import datetime
import random
import time

# Cutoff month: only posts from YYYYMM onward are fetched
YEAR_MONTH = datetime.strptime('202012','%Y%m')

# Fill in your own cookie and user-agent
HEADERS={
    'cookie': '',
    'user-agent': ''
}

# Build the text of a single post, without any reposted content.
def get_one_weibo(wb_json_obj):
    mblogid = wb_json_obj['mblogid']
    wb_url = 'https://weibo.com/5384740764/' + mblogid
    # Drop the '+0800' offset so the result is a naive datetime comparable with YEAR_MONTH
    created_at = datetime.strptime(wb_json_obj['created_at'].replace('+0800',''),'%a %b %d %H:%M:%S %Y')
    user_name = wb_json_obj['user']['screen_name']

    if wb_json_obj['isLongText']:
        # Long posts are truncated in the list response; fetch the full text separately
        long_data = requests.get('https://weibo.com/ajax/statuses/longtext?id='+mblogid,headers=HEADERS).json()['data']
        text_raw = long_data['longTextContent'].replace('\u200b','').strip()
        if 'url_struct' in long_data:
            text_raw = replace_url(text_raw, long_data['url_struct'])
    else:
        text_raw = wb_json_obj['text_raw'].replace('\u200b','').strip()

    if 'url_struct' in wb_json_obj:
        text_raw = replace_url(text_raw, wb_json_obj['url_struct'])

    # Download every attached image and reference it with ![[name]] embed syntax
    pic_text = ''
    for pic_id in wb_json_obj.get('pic_ids', []):
        pic_url = wb_json_obj['pic_infos'][pic_id]['original']['url']
        pic_name = download_pic(pic_url)
        pic_text = pic_text + '![[' + pic_name + ']]' + '\n'

    wb_text = '@{user_name}：{text_raw}\n\n{pic_text}'.format(
        user_name=user_name,
        text_raw=text_raw,
        pic_text=pic_text
    )

    return {
        'wb_time': created_at,
        'wb_url': wb_url,
        'wb_user': user_name,
        'wb_text': wb_text
    }

# Build a single post; if it is a repost, include the reposted content as a quote
def get_single_weibo(wb_json_obj):
    org_weibo = get_one_weibo(wb_json_obj)
    # Show the timestamp to the minute in the link text
    wb_time_str = org_weibo['wb_time'].strftime('%Y-%m-%d %H:%M')
    if 'retweeted_status' in wb_json_obj:
        re_weibo = get_one_weibo(wb_json_obj['retweeted_status'])
        full_text = '* [{time}]({url})\n\n{org_text}\n> {re_text}\n\n'.format(
            time=wb_time_str,
            url=org_weibo['wb_url'],
            org_text=org_weibo['wb_text'],
            # Keep the quoted repost on one Markdown line
            re_text=re_weibo['wb_text'].strip().replace('\n','<br/>')
        )
    else:
        full_text = '* [{time}]({url})\n\n{org_text}\n\n'.format(
            time=wb_time_str,
            url=org_weibo['wb_url'],
            org_text=org_weibo['wb_text'],
        )

    if 'url_struct' in wb_json_obj:
        full_text = replace_url(full_text, wb_json_obj['url_struct'])

    return {
        'wb_time': org_weibo['wb_time'],
        'full_text': full_text
    }

 

# Fetch one page of posts, keeping only those dated at or after YEAR_MONTH
def get_page_weibo(page_no,YEAR_MONTH):
    page_url = 'https://weibo.com/ajax/statuses/mymblog?uid=5384740764&feature=0&page=' + str(page_no)
    # Parse the response body once and reuse it
    wb_json = requests.get(page_url, headers=HEADERS).json()
    if wb_json['ok'] != 1:
        return {
            'page_no': page_no,
            'ok': wb_json['ok'],
            'date_stop': False,
            'my_wb_list': []
        }
    my_wb_list = []
    date_stop = False
    for wb in wb_json['data']['list']:
        single_wb = get_single_weibo(wb)
        # Named wb_time so the time module is not shadowed
        wb_time = single_wb['wb_time']
        if wb_time >= YEAR_MONTH:
            my_wb_list.append(single_wb)
        else:
            # A post older than the cutoff means later pages can be skipped
            date_stop = True
    return {
        'page_no': page_no,
        'ok': 1,
        'date_stop': date_stop,
        'my_wb_list': my_wb_list
    }


# Fetch posts page by page until a post older than YEAR_MONTH shows up
def get_month_weibo(YEAR_MONTH):
    month_weibo = []
    for page_no in range(1,100):
        # Random pause of up to 5 seconds between requests
        time.sleep(random.random()*5)
        page_weibo = get_page_weibo(page_no, YEAR_MONTH)
        month_weibo.append(page_weibo)
        if page_weibo['date_stop']:
            break
    return month_weibo


# Download an image into the pic folder, prefix its name with weibo-yyyymm, and return the file name
# Note: the pic folder must already exist
def download_pic(url):
    pic_name = 'weibo-' + YEAR_MONTH.strftime('%Y%m') + '-' + url.split('?')[0].split('/')[-1]
    pic_req = requests.get(url)
    with open('pic/'+pic_name, 'wb') as f:
        f.write(pic_req.content)
    return pic_name

# Replace the short links in a post's text with Markdown links to the original long URLs
def replace_url(text, url_struct):
    for url in url_struct:
        md_url='[{url_title}]({long_url})'.format(url_title=url['url_title'], long_url=url['long_url'])
        text=text.replace(url['short_url'], md_url)
    return text


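# Write the posts to an md file and return the page numbers that failed to fetch
def work():
    month_weibo = get_month_weibo(YEAR_MONTH)
    md_content = ''
    page404 = []

    for page in month_weibo:
        print(page['page_no'],': ',page['ok'], '\n')
        if page['ok'] != 1:
            page404.append(page['page_no'])
        # Prepend each post so the file ends up in the reverse of the fetch order
        for weibo in page['my_wb_list']:
            md_content = weibo['full_text'] + md_content

    file_name = '微言小义（{time}）.md'.format(time=YEAR_MONTH.strftime('%Y%m'))
    with open(file_name,'w',encoding='utf-8') as f:
        f.write(md_content)
    return page404

failed_pages = work()

print('done, failed pages:', failed_pages)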
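
One detail of the script worth a second look is the timestamp handling: it strips the literal `+0800` before parsing, which only works while the API reports that offset. A possible alternative, sketched here on the assumption that `created_at` looks like `Thu Dec 31 23:59:00 +0800 2020` (a format inferred from the parsing code above, not from any Weibo documentation), is to parse the offset with `%z` and then discard it so the result still compares against the naive `YEAR_MONTH`. The function name `parse_created_at` is hypothetical.

```python
from datetime import datetime


def parse_created_at(created_at):
    """Parse a timestamp carrying a UTC offset and return a naive datetime."""
    aware = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    # Keep the wall-clock time as reported and drop the offset, so the value
    # can be compared with a naive cutoff such as YEAR_MONTH.
    return aware.replace(tzinfo=None)


cutoff = datetime.strptime('202012', '%Y%m')
posted = parse_created_at('Thu Dec 31 23:59:00 +0800 2020')

print(posted.strftime('%Y-%m-%d %H:%M'), posted >= cutoff)
```

The weekday and month names are matched using the current locale, so this sketch assumes an English or C locale, as does the original parsing code.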

11 comments

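沉舟侧畔

2021-01-22 13:16

I hardly use Weibo. Later on I'll try to find a way to sync WordPress and Twitter.

1. SKYue
  
  2021-01-23 13:20
  
  The WordPress + Twitter ecosystem should already have mature solutions.
  
  1. 沉舟侧畔
    
    2021-01-25 08:33
    
    Only now do I realize how hard it is: you first have to create an app on Twitter, and that requires manual review...
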
箭上有毒

2021-01-05 09:39

You're being modest. Your code is much better than what I write as a tester. I'm embarrassed.

1. SKYue
  
  2021-01-05 13:13
  
  Thanks, you flatter me.

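Dr. Drunker

2021-01-04 15:17

Impressive. Can it also scrape other people's Weibo posts?

1. SKYue
  
  2021-01-04 16:24
  
  I took a look: the endpoint is the same, so it should work for fetching any given user's posts.
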
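Unee Wang

2021-01-03 20:05

Weibo posts used to be impossible to scrape; you could only pull content from elsewhere and publish it to Weibo. My blog, for example, was synced to Weibo via IFTTT, until my main account posted something that crossed the line and the account was gone.

1. SKYue
  
  2021-01-04 09:54
  
  There's no telling how long this endpoint will last. I really wish Weibo would offer an official data backup feature; Twitter's backup is very handy.
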
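三棵树人

2021-01-03 18:44

Another approach: if you could fetch an RSS feed of your Weibo, you could also publish posts by syncing from that feed.

1. SKYue
  
  2021-01-03 19:27
  
  As far as I know, Weibo has no RSS; some third-party services scrape the data and then convert it to RSS. I prefer converting it into articles and archiving them locally.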
