Web Scraping Study 2: Scraping Short-Term Rental Listings in Beijing

Approach

Target site: http://bj.xiaozhu.com/

Checking the pagination URL pattern

The results pages follow an obvious pattern:

First page: http://bj.xiaozhu.com/search-duanzufang-p1-0/
Second page: http://bj.xiaozhu.com/search-duanzufang-p2-0/
Third page: http://bj.xiaozhu.com/search-duanzufang-p3-0/
Page n: http://bj.xiaozhu.com/search-duanzufang-pn-0/

Opening a few of these URLs in the browser confirms that the pattern works, so we can move on; the page URLs can be generated programmatically, as in the sketch below.
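A minimal sketch of that URL construction (the same list comprehension reappears in the full script at the end of this post):

# Build the URLs for results pages 1 through 13
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(n) for n in range(1, 14)]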

Defining the scrape targets

Fields to scrape: title, address, price, host name, host gender, and host avatar URL.
For example:

{'title': '初见黎戈.温馨整租地铁景泰站天坛南门直达南站', 'address': '北京市丰台区景泰西里西区', 'price': '558', 'img': 'https://image.xiaozhustatic1.com/21/14,0,53,20357,376,375,56c14dc3.jpg', 'name': '爱笑的婷哥', 'sex': '女'}

To locate these fields:
press F12 in Chrome to open the developer tools and inspect the relevant elements, or view the page source directly.

Extracting the title and address

Page source:
<div class="pho_info">
<h4>
<em>初见黎戈.温馨整租地铁景泰站天坛南门直达南站</em>
</h4>

<p title="北京市丰台区景泰西里西区">
...
<span class="pr5">北京市丰台区景泰西里西区
...

Extracting the title and address:

Title: titles = soup.select('div.pho_info > h4')
Address: addresses = soup.select('span.pr5')
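A minimal sketch of these two selectors in action, assuming the example detail page whose link appears in the list-page source further down (in practice you may also need the request headers described below):

from bs4 import BeautifulSoup
import requests

url = 'http://bj.xiaozhu.com/fangzi/32113089701.html'  # example detail page
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
titles = soup.select('div.pho_info > h4')
addresses = soup.select('span.pr5')
print(titles[0].get_text().strip())     # 初见黎戈.温馨整租地铁景泰站天坛南门直达南站
print(addresses[0].get_text().strip())  # 北京市丰台区景泰西里西区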

Extracting the price, host name, host gender, and host avatar URL

Page source:
<div class="bg_box" id="floatRightBox" style="position: static;">
<div class="day_top clearfix" id="pricePart">
<div class="day_l">¥<span>558</span><em></em></div>
<div class="day_r">每晚</div>
</div>


<div class="js_box clearfix">
<div class="member_pic">
<div class="member_ico1"></div>
<a href="http://www.xiaozhu.com/fangdong/17313731801/" target="_blank">
<img src="https://image.xiaozhustatic1.com/21/14,0,53,20357,376,375,56c14dc3.jpg" alt="">
</a>
</div>
<div class="w_240" style="width: 215px;">
<h6>
<a class="lorder_name" href="http://www.xiaozhu.com/fangdong/17313731801/" title="爱笑的婷哥" target="_blank">爱笑的婷哥</a>
<span class="member_girl_ico"></span>
</h6>

Extracting the price, host name, host gender, and host avatar URL:

Price: prices = soup.select('#pricePart > div.day_l > span')
Host gender: sexes = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')
Host avatar URL: imgs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
Host name: names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
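Note that the gender selector returns the badge <div> inside member_pic; the gender is encoded in its class attribute rather than in any text. A minimal sketch of the mapping, continuing from the sexes list above (the same helper appears in the full script below):

# <div class="member_ico1"> marks a female host; <div class="member_ico"> a male host
def judgment_sex(class_name):
    if class_name == ['member_ico1']:
        return '女'
    return '男'

sex = judgment_sex(sexes[0].get("class"))  # '女' for the example page above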

Adding request headers to masquerade as a browser

Press F12 in Chrome to open the developer tools, then find Network -> Headers -> User-Agent and copy its value.

Request headers

Request Headers
Content-Type: text/plain;charset=UTF-8
Origin: http://bj.xiaozhu.com
Referer: http://bj.xiaozhu.com/search-duanzufang-p2-0/
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
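Only the User-Agent value is needed for the disguise; a minimal sketch of passing it to requests (the status check is just a sanity test, not part of the final script):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
wb_data = requests.get('http://bj.xiaozhu.com/search-duanzufang-p2-0/', headers=headers)
print(wb_data.status_code)  # expect 200 if the request was accepted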

Getting the listing URLs from each results page

Page source:
<div id="page_list">

<ul class="pic_list clearfix">
<li lodgeunitid="lodgeunit_32332047201" latlng="40.056149,116.444755">
<a target="_blank" href="http://bj.xiaozhu.com/fangzi/32332047201.html" class="resule_img_a">
<img class="lodgeunitpic" title="紧邻华贸城福熙大道乐想汇公寓美式一居品茶" data-growing-title="32332047201" src="https://image.xiaozhustatic3.com/12/14,0,15,28119,1800,1200,eb500b46.jpg" lazy_src="finish" alt="紧邻华贸城福熙大道乐想汇公寓美式一居品茶" style="height: 290px;">
</a>
...

<a target="_blank" href="http://bj.xiaozhu.com/fangzi/32113089701.html" class="resule_img_a">
<img class="lodgeunitpic" title="初见黎戈.温馨整租地铁景泰站天坛南门直达南站" data-growing-title="32113089701" src="https://image.xiaozhustatic3.com/12/14,0,49,24365,1800,1200,2554932d.jpg" lazy_src="finish" alt="初见黎戈.温馨整租地铁景泰站天坛南门直达南站" style="height: 290px;">
</a>

...

Extracting the listing URLs:

links = soup.select('#page_list > ul > li > a')  # locate each listing's link with a CSS selector
for link in links:
    href = link.get("href")
    get_info(href)  # call get_info() on each detail-page URL

Worked example

Source code

# Import the required libraries
from bs4 import BeautifulSoup
import requests
import time

# Add request headers to masquerade as a browser
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}

# Determine the host's gender:
# in Chrome DevTools (F12) you can see that <div class="member_ico1"></div> marks a female host,
# while <div class="member_ico"></div> marks a male host.
def judgment_sex(class_name):
    if class_name == ['member_ico1']:
        return '女'
    else:
        return '男'

# Collect the listing links from one results page
def get_links(url):
    wb_data = requests.get(url, headers=headers)  # request the page
    soup = BeautifulSoup(wb_data.text, 'lxml')  # parse the response
    links = soup.select('#page_list > ul > li > a')  # locate each listing's link with a CSS selector
    for link in links:
        href = link.get("href")
        get_info(href)  # scrape each detail page in turn

# Scrape the fields from one detail page
def get_info(url):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.pho_info > h4')  # title
    addresses = soup.select('span.pr5')  # address
    prices = soup.select('#pricePart > div.day_l > span')  # price
    imgs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')  # host avatar URL
    names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')  # host name
    sexes = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')  # host gender
    for title, address, price, img, name, sex in zip(titles, addresses, prices, imgs, names, sexes):
        data = {  # assemble the fields into a dict
            'title': title.get_text().strip(),
            'address': address.get_text().strip(),
            'price': price.get_text(),
            'img': img.get("src"),
            'name': name.get_text(),
            'sex': judgment_sex(sex.get("class"))
        }
        print(data)  # print one record per listing

# Entry point
if __name__ == '__main__':
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(number) for number in range(1, 14)]  # build the URLs for the first 13 results pages
    for single_url in urls:
        get_links(single_url)  # scrape each results page
        time.sleep(2)  # pause 2 seconds between pages
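The two-second pause between results pages is deliberate: it keeps the request rate polite, and hammering 13 pages in a tight loop is exactly the kind of traffic a site is likely to throttle or block.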

Sample output (partial)

{'title': '近国贸九龙山双地铁-南站西站地铁直达-旧时光', 'address': '北京市朝阳区风度柏林', 'price': '428', 'img': 'https://image.xiaozhustatic1.com/21/8,0,96,14129,356,356,8308d7cf.jpg', 'name': '小马佳佳', 'sex': '女'}
{'title': '初见黎戈.温馨整租地铁景泰站天坛南门直达南站', 'address': '北京市丰台区景泰西里西区', 'price': '558', 'img': 'https://image.xiaozhustatic1.com/21/14,0,53,20357,376,375,56c14dc3.jpg', 'name': '爱笑的婷哥', 'sex': '女'}