Web Scraping Notes 4: Using a MySQL Database

Working with a MySQL database requires the following software:

Installing MySQL

Download the MySQL Windows version

MySQL official site: https://www.mysql.com/
Open the MySQL Windows download page and choose the second file to download.

(Figure) MySQL Windows download

Installing PyMySQL

Method 1: install PyMySQL from the command line

$ pip install PyMySQL

Method 2: install PyMySQL from PyCharm

(1) From the toolbar: File → Settings → Project Interpreter.
(2) Click the "+" button → search for pymysql → click the Install Package button in the lower left. A successful install looks like the figure below:

(Figure) Installing PyMySQL in PyCharm

Installing SQLyog

(1) Download SQLyog

Baidu cloud drive, extraction code: e9c6

(2) After installation, connect to MySQL.

(Figure) Connecting to MySQL

Example: Scraping the Qidian VIP Collection Ranking

Qidian VIP collection ranking page: https://www.qidian.com/rank/vipcollect?style=1

(Figure) VIP collection ranking

Analysis

(1) URL construction. The Qidian VIP collection ranking spans five pages, with 25 entries per page.

Page 1: https://www.qidian.com/rank/vipcollect?style=1
Page 2: https://www.qidian.com/rank/vipcollect?style=1&page=2
Page 3: https://www.qidian.com/rank/vipcollect?style=1&page=3
……

Testing shows that page 1 can also be reached as https://www.qidian.com/rank/vipcollect?style=1&page=1.

The page URLs can therefore be constructed as:

urls = ['https://www.qidian.com/rank/vipcollect?style=1&page={}'.format(str(i)) for i in range(1, 6)]
for url in urls:
    ......
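The list comprehension above can be sanity-checked on its own; a quick sketch that prints the five generated page URLs:

```python
# Build the five ranking-page URLs exactly as in the snippet above.
urls = ['https://www.qidian.com/rank/vipcollect?style=1&page={}'.format(i)
        for i in range(1, 6)]
for u in urls:
    print(u)
```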

(2) Scraping targets

Book rank: book_rank
Book name: book_name
Book author: book_author
Book category: book_category
Book status: book_status
Book cover URL: book_img
Book link URL: book_url

(3) Extracting the target data

(Figure) Scraping targets

The extraction uses BeautifulSoup. In Chrome, right-click the page and choose Inspect to locate the target data.

# Import the required libraries
from bs4 import BeautifulSoup
import requests

# Request headers (User-Agent)
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Mobile Safari/537.36',
}

# Fetch and extract the target data
def get_books_url(url):
    qd_html = requests.get(url, headers=headers)
    soup = BeautifulSoup(qd_html.text, 'lxml')  # parse the page
    ranks = soup.select('div.book-img-box > span')
    links = soup.select('div.book-img-box > a')
    names = soup.select('div.book-mid-info > h4 > a')
    authors = soup.select('div.book-mid-info > p.author')
    imgs = soup.select('div.book-img-box > a > img')

    for rank, name, author, link, img in zip(ranks, names, authors, links, imgs):
        book_rank = rank.get_text()
        book_name = name.get_text()
        book_author = author.get_text().split('|')[0]
        book_category = author.get_text().split('|')[1]
        book_status = author.get_text().split('|')[2]
        book_url = link.get("href")
        book_img = img.get("src")
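The three split calls on the p.author text can be checked in isolation. A minimal sketch with a made-up sample string (the real page text is in Chinese), assuming the page separates author, category, and status with '|':

```python
# Hypothetical sample of the text inside <p class="author">;
# the fields are assumed to be '|'-separated as on the ranking page.
author_text = 'Author Name|Fantasy|Ongoing'
book_author = author_text.split('|')[0]
book_category = author_text.split('|')[1]
book_status = author_text.split('|')[2]
print(book_author, book_category, book_status)
```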

Creating the MySQL Database and Table

  • Right-click root@localhost → Create Database.
  • Enter the database name: mydb.
(Figure) Creating the database
  • Right-click mydb → Create Table.
  • Enter the table name qidian_book, engine InnoDB, character set utf8; leave the rest at defaults.
  • Enter the column names and data types as below; leave the rest at defaults.
book_rank  text
book_name  text
book_author  text
book_category  text
book_status  text
book_img  text
book_url  text
(Figure) Creating the table
(Figure) The database
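If you prefer SQL to the SQLyog dialogs, a roughly equivalent statement (a sketch, assuming the mydb database already exists and matches the settings above) is:

```sql
-- Equivalent of the SQLyog table-creation dialog above
CREATE TABLE qidian_book (
    book_rank     TEXT,
    book_name     TEXT,
    book_author   TEXT,
    book_category TEXT,
    book_status   TEXT,
    book_img      TEXT,
    book_url      TEXT
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```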

Submitting the Data

# Import the required library
import pymysql

# Connect to the MySQL database mydb
conn = pymysql.connect(host='localhost', user='root', passwd='1213', db='mydb', port=3306, charset='utf8')
cursor = conn.cursor()

# Insert a row into the qidian_book table
cursor.execute(
    "insert into qidian_book (book_rank, book_name, book_author, book_category, book_status, book_url, book_img) values(%s, %s, %s, %s, %s, %s, %s)",
    (str(book_rank), str(book_name), str(book_author), str(book_category), str(book_status), str(book_url), str(book_img)))
conn.commit()
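The parameterized insert-and-commit pattern can be tried without a running MySQL server by swapping in Python's built-in sqlite3 module. This is a stand-in for pymysql, not the author's setup; note that sqlite3 uses ? placeholders where pymysql uses %s:

```python
import sqlite3

# In-memory stand-in for the pymysql connection to mydb.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table qidian_book (book_rank text, book_name text, '
               'book_author text, book_category text, book_status text, '
               'book_url text, book_img text)')

# Hypothetical sample row standing in for one scraped book.
row = ('1', 'Sample Book', 'Sample Author', 'Fantasy', 'Ongoing',
       'https://example.com/book', 'https://example.com/cover.jpg')
cursor.execute('insert into qidian_book values (?, ?, ?, ?, ?, ?, ?)', row)
conn.commit()

print(cursor.execute('select count(*) from qidian_book').fetchone()[0])
```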

Running the Test

Source Code

qidian_vipbook.py
# -*- coding: UTF-8 -*-
# Implemented with requests and BeautifulSoup

# Import the required libraries
from bs4 import BeautifulSoup
import requests
import pymysql
import time

# Connect to the MySQL database mydb
conn = pymysql.connect(host='localhost', user='root', passwd='1213', db='mydb', port=3306, charset='utf8')
cursor = conn.cursor()

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Mobile Safari/537.36',
}

# Fetch and extract the target data
def get_books_url(url):
    qd_html = requests.get(url, headers=headers)
    soup = BeautifulSoup(qd_html.text, 'lxml')  # parse the page
    ranks = soup.select('div.book-img-box > span')
    links = soup.select('div.book-img-box > a')
    names = soup.select('div.book-mid-info > h4 > a')
    authors = soup.select('div.book-mid-info > p.author')
    imgs = soup.select('div.book-img-box > a > img')

    for rank, name, author, link, img in zip(ranks, names, authors, links, imgs):
        book_rank = rank.get_text()
        book_name = name.get_text()
        book_author = author.get_text().split('|')[0]
        book_category = author.get_text().split('|')[1]
        book_status = author.get_text().split('|')[2]
        book_url = link.get("href")
        book_img = img.get("src")
        # Insert the row into the qidian_book table
        cursor.execute(
            "insert into qidian_book (book_rank, book_name, book_author, book_category, book_status, book_url, book_img) values(%s, %s, %s, %s, %s, %s, %s)",
            (str(book_rank), str(book_name), str(book_author), str(book_category), str(book_status), str(book_url), str(book_img)))
        conn.commit()


if __name__ == '__main__':
    urls = ['https://www.qidian.com/rank/vipcollect?style=1&page={}'.format(str(i)) for i in range(1, 6)]
    for url in urls:
        get_books_url(url)
        time.sleep(1)

Test Results

(Figure) Test results

If SQLyog reports error 2058 when connecting to MySQL

See also:
Fixing SQLyog error 2058 when connecting to MySQL

Solution:
(1) On Windows, open cmd and log in to your MySQL database with mysql -u root -p.
(2) Then run this SQL statement:

ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';

where password is the root password you set; don't forget the semicolon.

(3) Reconfigure the SQLyog connection, and it should connect successfully.
