A First Look at Python Web Scraping

The first semester of junior year has been a grind, with a pile of courses of the you-know-what kind. One of them, Software Engineering Economics, required a book report on 《人人都是产品经理》 (Everyone Is a Product Manager) and a course project implementing a crawler that meets certain requirements, so I spent an afternoon studying Python web scraping. The result is too crude to be worth publishing, but I worry that the next time I need to reinvent this wheel the code will be rotting in some forgotten folder... so I am recording it here.

What Is a Python Crawler

A crawler is a program that automatically fetches information from the internet, pulling out the data that is valuable to us.

Use cases (the ones I can think of): mining data for visualization and analysis, collecting reviews and book information, batch-downloading files, and so on.

Python Crawler Workflow

A Python crawler architecture typically consists of five parts: a scheduler, a URL manager, a page downloader, a page parser, and the application (which consumes the valuable data that was crawled).

1. Scheduler: roughly the CPU of the whole crawler; it coordinates the work of the URL manager, the downloader, and the parser.

2. URL manager: keeps the URLs still to be crawled and the URLs already crawled, preventing duplicate and circular crawling. It is usually implemented in one of three ways: in memory, in a database, or in a cache database.

3. Page downloader: downloads the page behind a URL and turns it into a string. Downloaders include urllib2 (the Python 2 standard-library module, which also covers login, proxies, and cookies; urllib in Python 3) and requests (a third-party package).

4. Page parser: parses the page string and extracts the information we care about, either by pattern matching or by walking the DOM tree. Parsers include regular expressions (intuitive: treat the page as one big string and pull values out by fuzzy matching, which becomes very painful once the document gets complex), html.parser (bundled with Python), BeautifulSoup (a third-party package that can parse with either the built-in html.parser or lxml and is more powerful than the others), and lxml (a third-party package that parses both XML and HTML). html.parser, BeautifulSoup, and lxml all parse the page as a DOM tree.

5. Application: the application built out of the useful data extracted from the pages.

In short:

Choose which pages to request -> the scheduler coordinates the crawl -> the URL manager avoids repeat visits -> the requests library performs the download -> BeautifulSoup parses the DOM tree and exposes the tags -> a filter strips out the dirty data -> the data is persisted
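
As a minimal sketch of that pipeline (not the course project itself; the URL, the selector, and the output path are placeholders), the whole loop can fit in a few lines:

import json
import requests
from bs4 import BeautifulSoup

seen = set()                                          # URL manager: already-visited URLs

def crawl(url, out_path):
    if url in seen:                                   # avoid repeat visits
        return
    seen.add(url)
    resp = requests.get(url, timeout=5)               # downloader
    resp.encoding = 'UTF-8'
    soup = BeautifulSoup(resp.text, 'html.parser')    # parser (DOM tree)
    titles = [a.get_text().strip() for a in soup.select('a')]   # extract tag contents
    titles = [t for t in titles if t]                 # filter: drop empty / dirty entries
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(titles, f, ensure_ascii=False, indent=4)      # persistence

if __name__ == '__main__':
    crawl('http://www.bendibao.com/index.htm', 'titles.json')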

Crawler Requirements

The course asks for a talent-promotion system that meets the following goals:

According to our survey, the internet products currently on the market whose theme is publicizing talent-introduction policies and helping university students find jobs are mostly official government websites, news/promotion sites, and WeChat products. Official websites are the channels local governments use to publish policies to the general public; precisely because they are so authoritative, their content rarely becomes the kind of hot topic that university students pay attention to, so they do a poor job of promotion. The related news/promotion sites publicize talent-introduction policies by publishing editors' interpretations and derivative articles about the policy documents, an approach that cannot deliver targeted outreach to specific groups of students. On WeChat, the related products fall into official accounts (e.g. 武汉本地宝) and mini-programs (e.g. 上海落户小助手). Official accounts focus on pushing information: they give users an entry point for browsing official policy documents and related articles and guide retrieval only through simple rule matching, so they cannot make intelligent recommendations. Mini-programs focus on auxiliary functions, such as calculating settlement points or tracking the settlement process; they feel too much like tools and do not inspire users to learn more about talent-introduction policies …

In plain language:

Crawl everything related to talent settlement (人才落户) from a site called 本地宝 (bendibao.com), organize and consolidate the information, and persist it so that it can be pushed to users later. (Honestly, 本地宝 had it rough: out of nowhere came a pile of high-concurrency requests from WUT IP addresses, and beginners generally go for the most brute-force approach they can. Fortunately the site has no anti-crawling measures, so IP pools, proxies, and the like were never needed.)

The concrete requirements are:

1. Crawl the talent-settlement links
2. Get the links of the target first-tier cities
3. Build the search URLs
4. Get the detail-page links
5. Crawl the actual policy content
6. Clean the dirty data
7. Persist the data

Note: this crawler was written for the course project only; its one and only purpose is learning, and it is published on the internet purely for future reference. If it infringes on anyone's rights, please contact me and I will remove it!

Implementation

Program Overview

The program contains the following classes:

province

class province:     # province objects are collected in download.provinces (a list)
    def __init__(self, province_name, city_list = None):
        self.province_name = province_name  # name of the province
        if not city_list:
            self.city_list = []
        else:
            self.city_list = city_list      # list of city objects

    def parseJSON(dct):
        if isinstance(dct, dict):
            p = province(str(dct["province_name"]), dct["city_list"])
            return p
        return dct

The province class holds a province's information: the province name and the list of cities under it.

The class has a constructor and a method for parsing a JSON dict back into an object.
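
As a quick sketch of how parseJSON is used during deserialization (the record below is made up for the example; note that the entries of city_list are still plain dicts at this point and are converted separately by city.parseJSON, exactly as read_from_json does further down):

import json

raw = '{"province_name": "湖北", "city_list": []}'   # made-up example record
p = province.parseJSON(json.loads(raw))
print(p.province_name, p.city_list)                  # -> 湖北 []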

city

class city:
    def __init__(self, city, href, searchUrl = '', detailUrl = None):
        self.city = city
        self.href = href
        self.searchUrl = searchUrl
        if not detailUrl:
            self.detailUrl = []
        else:
            self.detailUrl = detailUrl  # list holding at most 10 detail-page entries, i.e. urlContext objects

    def parseJSON(dct):
        if isinstance(dct, dict):
            p = city(str(dct["city"]), str(dct["href"]), str(dct["searchUrl"]), dct["detailUrl"])
            return p
        return dct

The city class holds a city's information: the city name, the link to its bendibao sub-site, and the search URL of that sub-site. Later, the query text is appended to this search URL, the first ten result pages are fetched, and their URLs are returned.

The class has a constructor and a method for parsing a JSON dict back into an object.
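
For reference, the searchUrl is assembled the way get_search_url does it further down: the form's action URL and the hidden s value are scraped from the city page, then the query string is appended (the two values below are placeholders):

import urllib.parse

url_value = 'http://sou.bj.bendibao.com/cse/search'   # placeholder: scraped form action
s_value = '123456789012345678'                        # placeholder: scraped hidden "s" value
params = {"s": s_value, "q": "人才落户"}
final_url = url_value + "?" + urllib.parse.urlencode(params)
print(final_url)   # .../cse/search?s=...&q=%E4%BA%BA%E6%89%8D%E8%90%BD%E6%88%B7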

urlContext

class urlContext:       # details of one article URL
    def __init__(self, urlName, href, text = '', content = '') -> None:
        self.urlName = urlName
        self.href = href
        self.text = text
        self.content = content  # keeps the tag contents for later use

    def parseJSON(dct):
        if isinstance(dct, dict):
            p = urlContext(str(dct["urlName"]), str(dct["href"]), str(dct["text"]), str(dct["content"]))
            return p
        return dct

The urlContext class holds the detail information for one search hit of a city: the data of the page that the search result points to.

The class has a constructor and a method for parsing a JSON dict back into an object.
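
And this is roughly how get_detail_content fills an urlContext later on (a sketch using an inline HTML snippet instead of a live page; the title and URL are placeholders, and the class definition above is assumed to be in scope):

from bs4 import BeautifulSoup

html = '<div id="bo" class="content"><p>落户条件……</p></div>'     # stand-in for a real detail page
tag = BeautifulSoup(html, 'html.parser').select('div#bo.content')[0]

ctx = urlContext('placeholder title', 'http://bj.bendibao.com/news/xxx.shtm')
ctx.text = str(tag.get_text().strip())   # plain article text
ctx.content = str(tag.contents)          # raw tag contents, kept for later use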

download

class download(object):
    def __init__(self):
        self.server = 'http://www.bendibao.com/'
        self.target = 'http://www.bendibao.com/index.htm'
        self.search = 'http://sou.%s.bendibao.com/cse/search'
        self.search_question = "人才落户"
        self.proxies = {"http": None, "https": None}
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,zh-TW;q=0.5',
            'Connection': 'close'
        }
        self.provinces = []  # holds every province

    # Build each city's search URL for the settlement query
    def get_search_url(self):
        for p in self.provinces:
            if isinstance(p, province):
                for c in p.city_list:
                    if isinstance(c, city):
                        # print(c.href)
                        try:
                            tmp_html = requests.get(url=c.href, headers=self.headers, proxies=self.proxies, timeout=5)
                        except requests.exceptions.ConnectionError as connError:
                            message = "An exception of type {0} occurred.".format(type(connError).__name__)
                            print(message)
                            continue
                        except requests.exceptions.RequestException as reqExcep:
                            message = "An exception of type {0} occurred. ".format(type(reqExcep).__name__)
                            print(message)
                            continue
                        tmp_html.encoding = 'UTF-8'
                        body_bf = BeautifulSoup(tmp_html.text, "html.parser")
                        try:
                            # find the hidden "s" value
                            hidden_input = body_bf.find('input', attrs={'name': "s", 'type': "hidden"})
                            # print(hidden_input)
                            s_value = hidden_input.get('value')
                            if s_value == '':
                                continue
                            # print(s_value)
                            # find the search form's action URL
                            # header > div.search_nav > form
                            tmpurl = body_bf.select('div.search_nav > form')
                            url_value = tmpurl[0].get('action')
                            # print(url_value)
                            # the result is stored into city.searchUrl below
                        except Exception as excep:
                            message = "An exception of type {0} occurred. Arguments:\n{1!r}".format(type(excep).__name__, excep.args)
                            print(message)
                            continue
                        params = {
                            "s": s_value,
                            "q": self.search_question
                        }
                        str_params = urllib.parse.urlencode(params)
                        final_url = url_value + "?" + str_params
                        c.searchUrl = final_url
                        print(c.searchUrl)
                        time.sleep(random.uniform(0.1, 0.2))  # random delay to avoid being blocked

    # Collect the detail-page links returned by the settlement search
    def get_detail_url(self):
        for p in self.provinces:
            if isinstance(p, province):
                for c in p.city_list:
                    if isinstance(c, city) and c.searchUrl != "":
                        # print(c.href)
                        try:
                            tmp_html = requests.get(url=c.searchUrl, headers=self.headers, proxies=self.proxies, timeout=5)  # fetch the search result page
                        except requests.exceptions.ConnectionError as connError:
                            message = "An exception of type {0} occurred.".format(type(connError).__name__)
                            print(message)
                            continue
                        except requests.exceptions.RequestException as reqExcep:
                            message = "An exception of type {0} occurred. ".format(type(reqExcep).__name__)
                            print(message)
                            continue
                        tmp_html.encoding = 'UTF-8'
                        body_bf = BeautifulSoup(tmp_html.text, "html.parser")
                        try:
                            # pick out the result links
                            # results > div:nth-child(3) > h3 > a
                            tmpurl = body_bf.select('div.result.f.s0 > h3 > a')
                            for u in tmpurl:
                                print(u)
                                # sample result: <a rpos="" cpos="title" href="http://bj.bendibao.com/news/2018322/249126.shtm" target="_blank">北京<em>人才</em>引进<em>落户政策</em> 哪些<em>人才</em>符合落户条件?- 北京本地宝</a>
                                du = u.get('href')  # the link itself
                                print(str(u.get_text()))  # get_text() is needed because the title contains nested tags such as <em>
                                ds = str(u.get_text()).replace('<em>', '').replace('</em>', '')  # title text, with any leftover <em></em> pairs removed
                                ds = re.sub('- [\u4e00-\u9fa5]*本地宝$', '', ds)  # clean the trailing "- xx本地宝" suffix
                                tmp_detailUrl = urlContext(ds, du)  # wrap into an urlContext
                                c.detailUrl.append(tmp_detailUrl)   # collect into the city's detailUrl list
                        except Exception as excep:
                            message = "An exception of type {0} occurred. Arguments:\n{1!r}".format(type(excep).__name__, excep.args)
                            print(message)
                            continue
                        time.sleep(random.uniform(0.1, 0.2))  # random delay to avoid being blocked

    # Crawl the homepage and collect every province and its city links
    def get_download_url(self):
        req = requests.get(url=self.target, headers=self.headers, proxies=self.proxies, timeout=5)
        req.encoding = 'UTF-8'
        html = req.text
        # print(html)
        div_bf = BeautifulSoup(html, "html.parser")
        div = div_bf.findAll('div', class_='city-list')
        dl_bf = BeautifulSoup(str(div[0]), "html.parser")
        dl = dl_bf.find_all('dl')
        for each_dl in dl:
            dt_bf = BeautifulSoup(str(each_dl), "html.parser")
            dt = dt_bf.find('dt')                 # province name
            aProvince = province(str(dt.string))  # create the province
            a = dt_bf.findAll('a')                # the <a> tags, one per city
            for each_a in a:
                aCity = city(each_a.string, each_a.get('href'))
                # print(aCity.city + aCity.href)
                aProvince.city_list.append(aCity)  # attach the city to its province for later lookups
            # print(len(aProvince.city_list))
            self.provinces.append(aProvince)

    # Match a city by name and fetch its detail pages; pass "ALL" to process every city
    def get_detail_content(self, city_name):
        if isinstance(city_name, str) and city_name.upper() == 'ALL':
            city_name = True  # True means: visit every city
        for tmp_province in self.provinces:
            if isinstance(tmp_province, province):
                for tmp_city in tmp_province.city_list:
                    if isinstance(tmp_city, city) and (city_name is True or tmp_city.city == city_name):
                        for tmp_detail in tmp_city.detailUrl:
                            if isinstance(tmp_detail, urlContext):
                                try:
                                    # print(tmp_detail.href)
                                    tmp_html = requests.get(url=tmp_detail.href, headers=self.headers, proxies=self.proxies, timeout=5)  # fetch the detail page
                                except requests.exceptions.ConnectionError as connError:
                                    message = "An exception of type {0} occurred.".format(type(connError).__name__)
                                    print(message)
                                    continue
                                except requests.exceptions.RequestException as reqExcep:
                                    message = "An exception of type {0} occurred. ".format(type(reqExcep).__name__)
                                    print(message)
                                    continue
                                tmp_html.encoding = 'UTF-8'
                                body_bf = BeautifulSoup(tmp_html.text, "html.parser")
                                # print(body_bf)
                                try:
                                    # div#bo.content
                                    tmp_content = body_bf.select('div#bo.content')
                                    # clean out the dirty data
                                    for tmp_element in tmp_content[0](text=lambda text: isinstance(text, Comment)):
                                        tmp_element.extract()  # drop HTML comments
                                    for tmp_div in tmp_content[0].find_all("div"):
                                        tmp_div.decompose()    # drop ad / recommendation blocks
                                    for tmp_script in tmp_content[0].find_all('script'):
                                        tmp_script.decompose() # drop scripts
                                    for tmp_a in tmp_content[0].find_all('a'):
                                        tmp_a.decompose()      # drop links
                                    for tmp_p_vci in tmp_content[0].find_all('p', class_='view_city_index'):
                                        tmp_p_vci.decompose()  # drop site-promotion paragraphs
                                    for tmp_span in tmp_content[0].find_all('span'):
                                        tmp_span.decompose()   # drop WeChat-promotion spans
                                    # store the cleaned data
                                    tmp_detail.text = str(tmp_content[0].get_text().strip())

                                    content_str = str(tmp_content[0].contents)
                                    print(content_str, "\n\n")
                                    tmp_detail.content = content_str
                                except Exception as excep:
                                    message = "An exception of type {0} occurred. Arguments:\n{1!r}".format(type(excep).__name__, excep.args)
                                    print(message)
                                    continue
                                time.sleep(random.uniform(0.1, 0.2))  # random delay to avoid being blocked

    # Serialize the data to a JSON file
    def write_to_json(self, path):
        with open(path, 'w', encoding='utf-8') as f:
            # the objects are wrapped in a dict (key/value pairs) before being written as JSON
            json.dump(dict({'provinces': self.provinces}),
                      f,                                  # file object
                      indent=4,                           # pretty-print over multiple lines
                      # sort_keys=True,                   # sort the keys
                      default=lambda obj: obj.__dict__,   # serialize custom objects via their __dict__
                      ensure_ascii=False                  # keep Chinese characters readable
                      )

    # Read the data back from JSON and rebuild the objects
    def read_from_json(self, path):
        with open(path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            if isinstance(data, dict):
                self.provinces = data['provinces']  # still raw dicts at this point
                for i_province in range(len(self.provinces)):
                    self.provinces[i_province] = province.parseJSON(self.provinces[i_province])
                    if isinstance(self.provinces[i_province], province):
                        for i_city in range(len(self.provinces[i_province].city_list)):
                            self.provinces[i_province].city_list[i_city] = city.parseJSON(self.provinces[i_province].city_list[i_city])
                            if isinstance(self.provinces[i_province].city_list[i_city], city):
                                for i_detail in range(len(self.provinces[i_province].city_list[i_city].detailUrl)):
                                    self.provinces[i_province].city_list[i_city].detailUrl[i_detail] = urlContext.parseJSON(self.provinces[i_province].city_list[i_city].detailUrl[i_detail])

The download class provides the following methods:

get_search_url: builds, for every city, the search URL for the settlement-related query

get_detail_url: collects the detail-page links returned by the talent-settlement search

get_download_url: collects the province and city links that everything else starts from

get_detail_content: matches a city by name and fetches the content of its detail pages; if city_name is "ALL", every city is processed

write_to_json: serializes the object data to a JSON file

read_from_json: reads the data back from the JSON file and deserializes it
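
Put together, a typical run wires these methods up in the following order (this mirrors the __main__ block in the complete listing below; the output path is the one used by the project):

dl = download()
dl.get_download_url()                 # 1. collect province and city links from the homepage
dl.get_search_url()                   # 2. build each city's search URL
dl.get_detail_url()                   # 3. collect the "人才落户" search result links
dl.get_detail_content("ALL")          # 4. fetch and clean every detail page
dl.write_to_json('./bendibao/data/provinces.json')   # 5. persist everything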

Program Execution Flow

The site being crawled: 本地宝 (bendibao.com).

1. get_download_url fetches all the <a> tags on the homepage and stores the information as province objects in the provinces list of the download object.


2. For each region, open its page and use get_search_url to extract the URL behind its search box (Beijing is the example here).


3. Build the search URL by appending the query string, then use get_detail_url to fetch the result page; the project requires searching with the keyword "人才落户".


4. Use get_detail_content to fetch and store the full content behind every detail URL.


5. Finally, write_to_json saves the data to ./data/provinces.json; read_from_json can load it back from the file whenever it is needed.

Full Directory Structure

It contains the crawler script and the JSON data under the data folder.

E:\PROGRAMDEMO\LEARN_PYTHON
├─.vscode
│      settings.json
│
└─bendibao
    │  getLuoHu.py
    │
    └─data
            provinces.json

Complete Code

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests, json, random, time, re
import urllib.parse

class province:     # province objects are collected in download.provinces (a list)
    def __init__(self, province_name, city_list = None):
        self.province_name = province_name  # name of the province
        if not city_list:
            self.city_list = []
        else:
            self.city_list = city_list      # list of city objects

    def parseJSON(dct):
        if isinstance(dct, dict):
            p = province(str(dct["province_name"]), dct["city_list"])
            return p
        return dct

class city:
    def __init__(self, city, href, searchUrl = '', detailUrl = None):
        self.city = city
        self.href = href
        self.searchUrl = searchUrl
        if not detailUrl:
            self.detailUrl = []
        else:
            self.detailUrl = detailUrl  # list holding at most 10 detail-page entries, i.e. urlContext objects

    def parseJSON(dct):
        if isinstance(dct, dict):
            p = city(str(dct["city"]), str(dct["href"]), str(dct["searchUrl"]), dct["detailUrl"])
            return p
        return dct

class urlContext:   # details of one article URL
    def __init__(self, urlName, href, text = '', content = '') -> None:
        self.urlName = urlName
        self.href = href
        self.text = text
        self.content = content  # keeps the tag contents for later use

    def parseJSON(dct):
        if isinstance(dct, dict):
            p = urlContext(str(dct["urlName"]), str(dct["href"]), str(dct["text"]), str(dct["content"]))
            return p
        return dct

class download(object):
    def __init__(self):
        self.server = 'http://www.bendibao.com/'
        self.target = 'http://www.bendibao.com/index.htm'
        self.search = 'http://sou.%s.bendibao.com/cse/search'
        self.search_question = "人才落户"
        self.proxies = {"http": None, "https": None}
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,zh-TW;q=0.5',
            'Connection': 'close'
        }
        self.provinces = []  # holds every province

    # Build each city's search URL for the settlement query
    def get_search_url(self):
        for p in self.provinces:
            if isinstance(p, province):
                for c in p.city_list:
                    if isinstance(c, city):
                        # print(c.href)
                        try:
                            tmp_html = requests.get(url=c.href, headers=self.headers, proxies=self.proxies, timeout=5)
                        except requests.exceptions.ConnectionError as connError:
                            message = "An exception of type {0} occurred.".format(type(connError).__name__)
                            print(message)
                            continue
                        except requests.exceptions.RequestException as reqExcep:
                            message = "An exception of type {0} occurred. ".format(type(reqExcep).__name__)
                            print(message)
                            continue
                        tmp_html.encoding = 'UTF-8'
                        body_bf = BeautifulSoup(tmp_html.text, "html.parser")
                        try:
                            # find the hidden "s" value
                            hidden_input = body_bf.find('input', attrs={'name': "s", 'type': "hidden"})
                            # print(hidden_input)
                            s_value = hidden_input.get('value')
                            if s_value == '':
                                continue
                            # print(s_value)
                            # find the search form's action URL
                            # header > div.search_nav > form
                            tmpurl = body_bf.select('div.search_nav > form')
                            url_value = tmpurl[0].get('action')
                            # print(url_value)
                            # the result is stored into city.searchUrl below
                        except Exception as excep:
                            message = "An exception of type {0} occurred. Arguments:\n{1!r}".format(type(excep).__name__, excep.args)
                            print(message)
                            continue
                        params = {
                            "s": s_value,
                            "q": self.search_question
                        }
                        str_params = urllib.parse.urlencode(params)
                        final_url = url_value + "?" + str_params
                        c.searchUrl = final_url
                        print(c.searchUrl)
                        time.sleep(random.uniform(0.1, 0.2))  # random delay to avoid being blocked

    # Collect the detail-page links returned by the settlement search
    def get_detail_url(self):
        for p in self.provinces:
            if isinstance(p, province):
                for c in p.city_list:
                    if isinstance(c, city) and c.searchUrl != "":
                        # print(c.href)
                        try:
                            tmp_html = requests.get(url=c.searchUrl, headers=self.headers, proxies=self.proxies, timeout=5)  # fetch the search result page
                        except requests.exceptions.ConnectionError as connError:
                            message = "An exception of type {0} occurred.".format(type(connError).__name__)
                            print(message)
                            continue
                        except requests.exceptions.RequestException as reqExcep:
                            message = "An exception of type {0} occurred. ".format(type(reqExcep).__name__)
                            print(message)
                            continue
                        tmp_html.encoding = 'UTF-8'
                        body_bf = BeautifulSoup(tmp_html.text, "html.parser")
                        try:
                            # pick out the result links
                            # results > div:nth-child(3) > h3 > a
                            tmpurl = body_bf.select('div.result.f.s0 > h3 > a')
                            for u in tmpurl:
                                print(u)
                                # sample result: <a rpos="" cpos="title" href="http://bj.bendibao.com/news/2018322/249126.shtm" target="_blank">北京<em>人才</em>引进<em>落户政策</em> 哪些<em>人才</em>符合落户条件?- 北京本地宝</a>
                                du = u.get('href')  # the link itself
                                print(str(u.get_text()))  # get_text() is needed because the title contains nested tags such as <em>
                                ds = str(u.get_text()).replace('<em>', '').replace('</em>', '')  # title text, with any leftover <em></em> pairs removed
                                ds = re.sub('- [\u4e00-\u9fa5]*本地宝$', '', ds)  # clean the trailing "- xx本地宝" suffix
                                tmp_detailUrl = urlContext(ds, du)  # wrap into an urlContext
                                c.detailUrl.append(tmp_detailUrl)   # collect into the city's detailUrl list
                        except Exception as excep:
                            message = "An exception of type {0} occurred. Arguments:\n{1!r}".format(type(excep).__name__, excep.args)
                            print(message)
                            continue
                        time.sleep(random.uniform(0.1, 0.2))  # random delay to avoid being blocked

    # Crawl the homepage and collect every province and its city links
    def get_download_url(self):
        req = requests.get(url=self.target, headers=self.headers, proxies=self.proxies, timeout=5)
        req.encoding = 'UTF-8'
        html = req.text
        # print(html)
        div_bf = BeautifulSoup(html, "html.parser")
        div = div_bf.findAll('div', class_='city-list')
        dl_bf = BeautifulSoup(str(div[0]), "html.parser")
        dl = dl_bf.find_all('dl')
        for each_dl in dl:
            dt_bf = BeautifulSoup(str(each_dl), "html.parser")
            dt = dt_bf.find('dt')                 # province name
            aProvince = province(str(dt.string))  # create the province
            a = dt_bf.findAll('a')                # the <a> tags, one per city
            for each_a in a:
                aCity = city(each_a.string, each_a.get('href'))
                # print(aCity.city + aCity.href)
                aProvince.city_list.append(aCity)  # attach the city to its province for later lookups
            # print(len(aProvince.city_list))
            self.provinces.append(aProvince)
        # quick test
        # print('start:')
        # for p in self.provinces:
        #     if isinstance(p, province):
        #         print(p.province_name, ':')
        #         for c in p.city_list:
        #             if isinstance(c, city):
        #                 print(c.city, ':', c.href)

    # Match a city by name and fetch its detail pages; pass "ALL" to process every city
    def get_detail_content(self, city_name):
        if isinstance(city_name, str) and city_name.upper() == 'ALL':
            city_name = True  # True means: visit every city
        for tmp_province in self.provinces:
            if isinstance(tmp_province, province):
                for tmp_city in tmp_province.city_list:
                    if isinstance(tmp_city, city) and (city_name is True or tmp_city.city == city_name):
                        for tmp_detail in tmp_city.detailUrl:
                            if isinstance(tmp_detail, urlContext):
                                try:
                                    # print(tmp_detail.href)
                                    tmp_html = requests.get(url=tmp_detail.href, headers=self.headers, proxies=self.proxies, timeout=5)  # fetch the detail page
                                except requests.exceptions.ConnectionError as connError:
                                    message = "An exception of type {0} occurred.".format(type(connError).__name__)
                                    print(message)
                                    continue
                                except requests.exceptions.RequestException as reqExcep:
                                    message = "An exception of type {0} occurred. ".format(type(reqExcep).__name__)
                                    print(message)
                                    continue
                                tmp_html.encoding = 'UTF-8'
                                body_bf = BeautifulSoup(tmp_html.text, "html.parser")
                                # print(body_bf)
                                try:
                                    # div#bo.content
                                    tmp_content = body_bf.select('div#bo.content')
                                    # clean out the dirty data
                                    for tmp_element in tmp_content[0](text=lambda text: isinstance(text, Comment)):
                                        tmp_element.extract()  # drop HTML comments
                                    for tmp_div in tmp_content[0].find_all("div"):
                                        tmp_div.decompose()    # drop ad / recommendation blocks
                                    for tmp_script in tmp_content[0].find_all('script'):
                                        tmp_script.decompose() # drop scripts
                                    for tmp_a in tmp_content[0].find_all('a'):
                                        tmp_a.decompose()      # drop links
                                    for tmp_p_vci in tmp_content[0].find_all('p', class_='view_city_index'):
                                        tmp_p_vci.decompose()  # drop site-promotion paragraphs
                                    for tmp_span in tmp_content[0].find_all('span'):
                                        tmp_span.decompose()   # drop WeChat-promotion spans
                                    # store the cleaned data
                                    tmp_detail.text = str(tmp_content[0].get_text().strip())

                                    content_str = str(tmp_content[0].contents)
                                    print(content_str, "\n\n")
                                    tmp_detail.content = content_str
                                except Exception as excep:
                                    message = "An exception of type {0} occurred. Arguments:\n{1!r}".format(type(excep).__name__, excep.args)
                                    print(message)
                                    continue
                                time.sleep(random.uniform(0.1, 0.2))  # random delay to avoid being blocked

    # Serialize the data to a JSON file
    def write_to_json(self, path):
        # json_object = json.dumps(dict({'provinces': self.provinces}),
        #                          default=lambda obj: obj.__dict__)
        # print(json_object)
        with open(path, 'w', encoding='utf-8') as f:
            # the objects are wrapped in a dict (key/value pairs) before being written as JSON
            json.dump(dict({'provinces': self.provinces}),
                      f,                                  # file object
                      indent=4,                           # pretty-print over multiple lines
                      # sort_keys=True,                   # sort the keys
                      default=lambda obj: obj.__dict__,   # serialize custom objects via their __dict__
                      ensure_ascii=False                  # keep Chinese characters readable
                      )

    # Read the data back from JSON and rebuild the objects
    def read_from_json(self, path):
        with open(path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            if isinstance(data, dict):
                self.provinces = data['provinces']  # still raw dicts at this point
                for i_province in range(len(self.provinces)):
                    # print(self.provinces[i_province])
                    # print(province.parseJSON(self.provinces[i_province]))
                    self.provinces[i_province] = province.parseJSON(self.provinces[i_province])
                    # print(self.provinces[i_province].province_name)
                    # print(self.provinces[i_province].city_list)
                    if isinstance(self.provinces[i_province], province):
                        for i_city in range(len(self.provinces[i_province].city_list)):
                            self.provinces[i_province].city_list[i_city] = city.parseJSON(self.provinces[i_province].city_list[i_city])
                            # print(self.provinces[i_province].city_list[i_city].city)
                            # print(self.provinces[i_province].city_list[i_city].href)
                            # print(self.provinces[i_province].city_list[i_city].searchUrl)
                            # print(self.provinces[i_province].city_list[i_city].detailUrl)
                            if isinstance(self.provinces[i_province].city_list[i_city], city):
                                for i_detail in range(len(self.provinces[i_province].city_list[i_city].detailUrl)):
                                    self.provinces[i_province].city_list[i_city].detailUrl[i_detail] = urlContext.parseJSON(self.provinces[i_province].city_list[i_city].detailUrl[i_detail])
                                    # print(self.provinces[i_province].city_list[i_city].detailUrl[i_detail].urlName)
                                    # print(self.provinces[i_province].city_list[i_city].detailUrl[i_detail].href)
                                    # print(self.provinces[i_province].city_list[i_city].detailUrl[i_detail].text)
                                    # print(self.provinces[i_province].city_list[i_city].detailUrl[i_detail].content)
                                    # print(type(self.provinces[i_province].city_list[i_city].detailUrl[i_detail]))
        # print(self.provinces)


if __name__ == "__main__":
    dl = download()
    # dl.read_from_json('./bendibao/data/provinces.json')
    # dl.write_to_json('./bendibao/data/provinces.json')

    choice = input("1.get from web\n2.get from local json\nyour choice:")
    if choice == '1':
        dl.get_download_url()
        dl.get_search_url()
        dl.get_detail_url()
        dl.get_detail_content("ALL")
        dl.write_to_json('./bendibao/data/provinces.json')
    elif choice == '2':
        dl.read_from_json('./bendibao/data/provinces.json')
        dl.get_detail_content("ALL")
        dl.write_to_json('./bendibao/data/provinces.json')
    else:
        pass

Persisted Data Format

provinces.json is far too large to show in full, so only partial screenshots are included:

Province and city data:

(screenshot: province and city entries in provinces.json)

City detail data:

(screenshot: one city's detail entries)
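
For reference, each record follows the shape defined by the three classes above. A hand-written, heavily abridged illustration (not real crawled output; the single detail entry reuses the sample link from the code comments, and …… marks elided content):

{
    "provinces": [
        {
            "province_name": "……",
            "city_list": [
                {
                    "city": "北京",
                    "href": "http://bj.bendibao.com/",
                    "searchUrl": "http://sou.bj.bendibao.com/cse/search?s=……&q=%E4%BA%BA%E6%89%8D%E8%90%BD%E6%88%B7",
                    "detailUrl": [
                        {
                            "urlName": "北京人才引进落户政策 哪些人才符合落户条件?",
                            "href": "http://bj.bendibao.com/news/2018322/249126.shtm",
                            "text": "……",
                            "content": "……"
                        }
                    ]
                }
            ]
        }
    ]
}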


