Saving My Weibo Posts Locally and Deleting Them in Bulk

For various reasons I wanted to delete my old Weibo posts, but I also wanted to keep a record of what I was like back then. So the plan was to save everything I had posted to a local file first, and then delete the originals.

Logging in to Weibo with Python

A fairly simple approach is to log in with cookies. Log in to Weibo in the browser, open the developer tools (inspect element), go to the Network tab and filter by XHR (XMLHttpRequest), then open a request's headers, where the cookie is visible. Also note the browser's User-Agent string.

The following code performs the login:

import urllib.request

headers = {
	'User-Agent': "replaceme",  # your browser's User-Agent string
	'cookie': "replaceme"       # the cookie copied from the XHR request headers
}

url = "will talk later"  # the mbloglist URLs are covered in the next section
request = urllib.request.Request(url, headers=headers)
data = urllib.request.urlopen(request).read().decode("utf-8", "ignore")
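
A quick way to check that the cookie actually works is to fetch your own profile page with the same headers and look for your screen name in the returned HTML; if the cookie has expired, Weibo typically sends back a login or redirect page instead. A minimal sketch, reusing the headers dict above, with the profile URL and screen name as placeholders:

profile_url = "https://weibo.com/p/1005052842501660/home"  # your own profile URL
request = urllib.request.Request(profile_url, headers=headers)
html = urllib.request.urlopen(request).read().decode("utf-8", "ignore")
# "your-screen-name" is a placeholder; use the nickname shown on your profile
print("cookie works" if "your-screen-name" in html else "cookie may be invalid or expired")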

Another approach is to log in via a POST request, which is slightly more involved. I did not try it, but there is plenty of information about it online.

Finding the URLs for each page of posts

After logging in and opening the first page of my own posts, the URL is https://weibo.com/p/1005052842501660/home?from=page_100505_profile&wvr=6&mod=data&is_all=1. The problem with this URL is that it only returns the first 15 posts of the page; you have to scroll to the bottom for the rest to appear, and the page loads additional content twice in total. Finding those extra requests is easy: just keep the Network panel's XHR view open while scrolling down.

Looking at the pattern in those request URLs, each page is fetched with three groups of parameters:

page = X
pre_page = X-1
pagebar = 1

page = X
pre_page = X
pagebar = 0

page = X
pre_page = X
pagebar = 1

In other words, for one page of posts we have to send three requests with the parameters above.

Code:

def get_data(page):
	headers = {
		'User-Agent': "replaceme",
		'cookie': "replaceme"
	}

	# weibo uses ajax to do lazy loading,
	# we need to scroll down the page twice to get the whole content for one page,
	# hence three requests per page
	url1 = f"https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page={page}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=1005052842501660&script_uri=/p/1005052842501660/home&feed_type=0&pre_page={page-1}&domain_op=100505&__rnd=1619765016482"
	url2 = f"https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page={page}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=1005052842501660&script_uri=/p/1005052842501660/home&feed_type=0&pre_page={page}&domain_op=100505&__rnd=1619765016482"
	url3 = f"https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page={page}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=1005052842501660&script_uri=/p/1005052842501660/home&feed_type=0&pre_page={page}&domain_op=100505&__rnd=1619765016482"

	urls = [url1, url2, url3]
	data_list = []
	for url in urls:
		request = urllib.request.Request(url, headers=headers)
		data_list.append(urllib.request.urlopen(request).read().decode("utf-8", "ignore"))

	return data_list
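
Hard-coding the three long URLs works, but the same requests can also be built from a parameter dict with urllib.parse.urlencode, which makes the page/pre_page/pagebar pattern easier to see. A sketch under the same assumptions as above: the id 1005052842501660, script_uri, and pl_name values are specific to my profile and need to be replaced with your own, and __rnd is kept as the literal value captured from my request.

from urllib.parse import urlencode

def build_urls(page):
	base = "https://weibo.com/p/aj/v6/mblog/mbloglist"
	# the three (pre_page, pagebar) combinations that together return one full page
	combos = [(page - 1, 1), (page, 0), (page, 1)]
	urls = []
	for pre_page, pagebar in combos:
		params = {
			"ajwvr": 6, "domain": 100505, "is_search": 0, "visible": 0,
			"is_all": 1, "is_tag": 0, "profile_ftype": 1,
			"page": page, "pagebar": pagebar, "pre_page": pre_page,
			"pl_name": "Pl_Official_MyProfileFeed__20",
			"id": "1005052842501660",
			"script_uri": "/p/1005052842501660/home",
			"feed_type": 0, "domain_op": 100505, "__rnd": 1619765016482,
		}
		# safe='/' keeps the slashes in script_uri unescaped, matching the captured URLs
		urls.append(base + "?" + urlencode(params, safe='/'))
	return urls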

Parsing the data

If you print the returned data, you can see it is already a JSON string; the HTML fragment inside can then be processed with Beautiful Soup. Inspecting the page structure shows that each post's text sits in a div of the form <div class="WB_text W_f14" nick-name="XXX" node-type="feed_list_content">. The following code does the parsing:

def parse_data(data_list):
	for data in data_list:
		# the response is a JSON string; the HTML fragment lives under "data"
		json_data = json.loads(data)["data"]
		soup = BeautifulSoup(json_data, 'html.parser')
		feeds = soup.findAll('div', attrs={'node-type' : 'feed_list_content', 'class' : 'WB_text W_f14'})
		for feed in feeds:
			text = "> " + feed.text.strip() + '\n'
			result.append(text)  # result is a module-level list, see the full code below

Each parsed post then comes out as a line of the form > post content.

Full code

import urllib.request
import json
import time
from bs4 import BeautifulSoup

result = []

def get_data(page):
	headers = {
		'User-Agent': "replaceme",
		'cookie': "replaceme"
	}

	# weibo uses ajax to do lazy loading,
	# we need to scroll down the page twice to get the whole content for one page,
	# hence three requests per page
	url1 = f"https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page={page}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=1005052842501660&script_uri=/p/1005052842501660/home&feed_type=0&pre_page={page-1}&domain_op=100505&__rnd=1619765016482"
	url2 = f"https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page={page}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=1005052842501660&script_uri=/p/1005052842501660/home&feed_type=0&pre_page={page}&domain_op=100505&__rnd=1619765016482"
	url3 = f"https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page={page}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=1005052842501660&script_uri=/p/1005052842501660/home&feed_type=0&pre_page={page}&domain_op=100505&__rnd=1619765016482"

	urls = [url1, url2, url3]
	data_list = []
	for url in urls:
		request = urllib.request.Request(url, headers=headers)
		data_list.append(urllib.request.urlopen(request).read().decode("utf-8", "ignore"))

	return data_list


def parse_data(data_list):
	for data in data_list:
		# the response is a JSON string; the HTML fragment lives under "data"
		json_data = json.loads(data)["data"]
		soup = BeautifulSoup(json_data, 'html.parser')
		feeds = soup.findAll('div', attrs={'node-type' : 'feed_list_content', 'class' : 'WB_text W_f14'})
		for feed in feeds:
			text = "> " + feed.text.strip() + '\n'
			result.append(text)


def main():
	for i in range(1, X):  # X is your number of pages + 1
		print(f"Getting and parsing page {i}......")
		parse_data(get_data(i))
		print(f"Sleep 5 secs......")
		time.sleep(5)  # sleep so we don't get banned by weibo for crawling too fast

	with open('out', 'w', encoding='utf-8') as file:
		for res in result:
			file.write(res)


if __name__ == '__main__':
	main()
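
To use the script, fill in the User-Agent and cookie placeholders, replace the id 1005052842501660 and script_uri parts of the URLs with the values from your own profile's XHR requests, and set X in main() to one more than the number of pages on your profile. The posts end up in the file named out, one > line per post.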

Deleting the original posts

Once the posts are saved locally, you can delete the originals without any second thoughts. There are already plenty of methods for batch deletion on Zhihu, so I will not repeat them here.
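
That said, if you want to script the deletion with the same cookie-based setup, one approach often described in those write-ups is to POST each post's mid (its numeric message id) to the old weibo.com v6 delete endpoint. The endpoint URL, the mid parameter, and the need for a Referer header are all assumptions taken from public write-ups rather than anything verified here, and may have changed, so treat this as a starting sketch only:

import urllib.parse
import urllib.request

headers = {
	'User-Agent': "replaceme",
	'cookie': "replaceme",
	'Referer': "https://weibo.com/"  # some write-ups say a Referer header is required
}

def delete_post(mid):
	# Assumed v6 delete endpoint; mid is the numeric id of the post to delete.
	url = "https://weibo.com/aj/mblog/del?ajwvr=6"
	body = urllib.parse.urlencode({"mid": mid}).encode("utf-8")
	request = urllib.request.Request(url, data=body, headers=headers)
	return urllib.request.urlopen(request).read().decode("utf-8", "ignore")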