Approach
The site to scrape: http://bj.xiaozhu.com/
Inspect the page URLs: page 2 is http://bj.xiaozhu.com/search-duanzufang-p2-0/ and page 3 is http://bj.xiaozhu.com/search-duanzufang-p3-0/, which suggests that page n is http://bj.xiaozhu.com/search-duanzufang-pn-0/. Clicking through confirms it: page 1 is http://bj.xiaozhu.com/search-duanzufang-p1-0/, so the pattern holds and we can move on to the next step.
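The URL pattern above can be turned into a short list comprehension (the page count of 13 matches the full script later in this post):

```python
# Build the listing-page URLs from the pattern search-duanzufang-pn-0/
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(n)
        for n in range(1, 14)]

print(urls[0])   # http://bj.xiaozhu.com/search-duanzufang-p1-0/
print(urls[-1])  # http://bj.xiaozhu.com/search-duanzufang-p13-0/
```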
Scraping targets
Information to scrape: title, address, price, host name, host gender, and host avatar URL. For example:
{'tittle': '初见黎戈.温馨整租地铁景泰站天坛南门直达南站', 'address': '北京市丰台区景泰西里西区', 'price': '558', 'img': 'https://image.xiaozhustatic1.com/21/14,0,53,20357,376,375,56c14dc3.jpg', 'name': '爱笑的婷哥', 'sex': '女'}
Inspecting elements: press F12 in Chrome to open DevTools and locate the relevant elements, or view the page source directly.
Getting the title and address
Page source:

```html
<div class="pho_info">
  <h4><em>初见黎戈.温馨整租地铁景泰站天坛南门直达南站</em></h4>
  <p title="北京市丰台区景泰西里西区">
    ...
    <span class="pr5">北京市丰台区景泰西里西区</span>
    ...
  </p>
</div>
```
Extracting the title and address:
Title: tittles = soup.select('div.pho_info > h4')
Address: addresses = soup.select('span.pr5')
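A minimal sketch showing that these two selectors pick out the right text, run against the HTML fragment above (using the built-in html.parser so the lxml dependency is not needed for this check):

```python
from bs4 import BeautifulSoup

# The detail-page fragment shown above
html = '''
<div class="pho_info">
  <h4><em>初见黎戈.温馨整租地铁景泰站天坛南门直达南站</em></h4>
  <p title="北京市丰台区景泰西里西区">
    <span class="pr5">北京市丰台区景泰西里西区</span>
  </p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
tittles = soup.select('div.pho_info > h4')
addresses = soup.select('span.pr5')

print(tittles[0].get_text().strip())    # 初见黎戈.温馨整租地铁景泰站天坛南门直达南站
print(addresses[0].get_text().strip())  # 北京市丰台区景泰西里西区
```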
Getting the price, host name, host gender, and host avatar URL
Page source:

```html
<div class="bg_box" id="floatRightBox" style="position: static;">
  <div class="day_top clearfix" id="pricePart">
    <div class="day_l">¥<span>558</span><em>起</em></div>
    <div class="day_r">每晚</div>
  </div>
  <div class="js_box clearfix">
    <div class="member_pic">
      <div class="member_ico1"></div>
      <a href="http://www.xiaozhu.com/fangdong/17313731801/" target="_blank">
        <img src="https://image.xiaozhustatic1.com/21/14,0,53,20357,376,375,56c14dc3.jpg" alt="">
      </a>
    </div>
    <div class="w_240" style="width: 215px;">
      <h6>
        <a class="lorder_name" href="http://www.xiaozhu.com/fangdong/17313731801/" title="爱笑的婷哥" target="_blank">爱笑的婷哥</a>
        <span class="member_girl_ico"></span>
      </h6>
    </div>
  </div>
</div>
```
Extracting the price, host name, host gender, and host avatar URL:
Price: prices = soup.select('#pricePart > div.day_l > span')
Host gender: sexs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')
Host avatar URL: imgs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
Host name: names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
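Note that the gender is not stored as text anywhere: it is encoded in the class of the div inside member_pic, which the full script later treats as female for member_ico1 and male otherwise. A sketch against a trimmed version of the fragment above:

```python
from bs4 import BeautifulSoup

# Trimmed detail-page fragment from above, keeping only what the selectors need
html = '''
<div class="bg_box" id="floatRightBox">
  <div class="day_top clearfix" id="pricePart">
    <div class="day_l">¥<span>558</span><em>起</em></div>
  </div>
  <div class="js_box clearfix">
    <div class="member_pic">
      <div class="member_ico1"></div>
      <a href="http://www.xiaozhu.com/fangdong/17313731801/">
        <img src="https://image.xiaozhustatic1.com/21/14,0,53,20357,376,375,56c14dc3.jpg" alt="">
      </a>
    </div>
    <div class="w_240">
      <h6><a class="lorder_name" title="爱笑的婷哥">爱笑的婷哥</a></h6>
    </div>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
price = soup.select('#pricePart > div.day_l > span')[0].get_text()
img = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')[0].get("src")
name = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')[0].get_text()
sex_div = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')[0]

# The class attribute comes back as a list, e.g. ['member_ico1'] for female hosts
sex = '女' if sex_div.get("class") == ['member_ico1'] else '男'
print(price, name, sex)  # 558 爱笑的婷哥 女
```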
Adding request headers to masquerade as a browser
In Chrome, press F12 to open DevTools, go to Network, select a request, and under Headers copy the User-Agent value.
Request headers:

```
Request Headers
  Provisional headers are shown
Content-Type: text/plain;charset=UTF-8
Origin: http://bj.xiaozhu.com
Referer: http://bj.xiaozhu.com/search-duanzufang-p2-0/
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
```
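A quick way to confirm the header is actually attached, without hitting the site, is to build a prepared request (requests.Request and prepare() are part of the requests API):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

# Build and prepare the request locally; no network traffic is sent
req = requests.Request('GET', 'http://bj.xiaozhu.com/search-duanzufang-p1-0/',
                       headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])
```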
Getting the listing detail-page URLs from the index page
Page source:

```html
<div id="page_list">
  <ul class="pic_list clearfix">
    <li lodgeunitid="lodgeunit_32332047201" latlng="40.056149,116.444755">
      <a target="_blank" href="http://bj.xiaozhu.com/fangzi/32332047201.html" class="resule_img_a">
        <img class="lodgeunitpic" title="紧邻华贸城福熙大道乐想汇公寓美式一居品茶" data-growing-title="32332047201" src="https://image.xiaozhustatic3.com/12/14,0,15,28119,1800,1200,eb500b46.jpg" lazy_src="finish" alt="紧邻华贸城福熙大道乐想汇公寓美式一居品茶" style="height: 290px;">
      </a>
    </li>
    ...
      <a target="_blank" href="http://bj.xiaozhu.com/fangzi/32113089701.html" class="resule_img_a">
        <img class="lodgeunitpic" title="初见黎戈.温馨整租地铁景泰站天坛南门直达南站" data-growing-title="32113089701" src="https://image.xiaozhustatic3.com/12/14,0,49,24365,1800,1200,2554932d.jpg" lazy_src="finish" alt="初见黎戈.温馨整租地铁景泰站天坛南门直达南站" style="height: 290px;">
      </a>
    ...
```
Extracting the listing URLs:

```python
links = soup.select('#page_list > ul > li > a')  # locate the listing links with a CSS selector
for link in links:
    href = link.get("href")
    get_info(href)  # fetch and parse each listing's detail page
```
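A sketch against the page_list fragment above, collecting the hrefs that get_info() would be called with:

```python
from bs4 import BeautifulSoup

# Two listing entries from the index-page fragment above, trimmed to the links
html = '''
<div id="page_list">
  <ul class="pic_list clearfix">
    <li><a target="_blank" href="http://bj.xiaozhu.com/fangzi/32332047201.html" class="resule_img_a"></a></li>
    <li><a target="_blank" href="http://bj.xiaozhu.com/fangzi/32113089701.html" class="resule_img_a"></a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
hrefs = [link.get("href") for link in soup.select('#page_list > ul > li > a')]
print(hrefs)
# ['http://bj.xiaozhu.com/fangzi/32332047201.html', 'http://bj.xiaozhu.com/fangzi/32113089701.html']
```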
Putting it all together
Full source:

```python
from bs4 import BeautifulSoup
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}

def judgment_sex(class_name):
    if class_name == ['member_ico1']:
        return '女'
    else:
        return '男'

def get_links(url):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('#page_list > ul > li > a')
    for link in links:
        href = link.get("href")
        get_info(href)

def get_info(url):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    tittles = soup.select('div.pho_info > h4')
    addresses = soup.select('span.pr5')
    prices = soup.select('#pricePart > div.day_l > span')
    imgs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    sexs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')
    for tittle, address, price, img, name, sex in zip(tittles, addresses, prices, imgs, names, sexs):
        data = {
            'tittle': tittle.get_text().strip(),
            'address': address.get_text().strip(),
            'price': price.get_text(),
            'img': img.get("src"),
            'name': name.get_text(),
            'sex': judgment_sex(sex.get("class"))
        }
        print(data)

if __name__ == '__main__':
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(number) for number in range(1, 14)]
    for single_url in urls:
        get_links(single_url)
        time.sleep(2)
```
Output (partial):

```
{'tittle': '近国贸九龙山双地铁-南站西站地铁直达-旧时光', 'address': '北京市朝阳区风度柏林', 'price': '428', 'img': 'https://image.xiaozhustatic1.com/21/8,0,96,14129,356,356,8308d7cf.jpg', 'name': '小马佳佳', 'sex': '女'}
{'tittle': '初见黎戈.温馨整租地铁景泰站天坛南门直达南站', 'address': '北京市丰台区景泰西里西区', 'price': '558', 'img': 'https://image.xiaozhustatic1.com/21/14,0,53,20357,376,375,56c14dc3.jpg', 'name': '爱笑的婷哥', 'sex': '女'}
```