
[Personal notes] Python Web Scraping Study (1): Scraping Basics and Four Simple Examples

news  Source: original  2024/9/20 12:32:47

Python Web Scraping Study (1)

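Basics
Four simple scraping examples
- 1. Fetch the Baidu home page with urlopen and save it
- 2. Get candidate translations for a word from a translation site
- 3. Get book titles and prices from a web page
- 4. Get the titles of the top 250 movies on Douban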

Basics

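For any web page, you can right-click in the browser and choose View Page Source, but what you see there is not necessarily the same as what the Inspect panel of the developer tools shows.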

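Server-side rendering: if the two views match, the page is probably rendered on the server. The data you see on the page is already in the source, because the server sends everything to the client in one response. You only need to request the page and then parse the parts you care about out of the returned HTML.
Client-side rendering: if they differ, the page is probably rendered on the client. The page source is only a bare HTML skeleton; the data is sent by the server in separate responses and injected by the client, which then re-renders the page.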
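The comparison described above can be approximated in code: if text that is visible on the rendered page also appears in the raw HTML, the page is probably server-rendered; otherwise the data probably arrives in a separate request. This is only a minimal sketch. `looks_server_rendered` is a hypothetical helper name, and both HTML snippets are made up for illustration.

```python
def looks_server_rendered(page_source: str, visible_text: str) -> bool:
    """Return True if text seen on the rendered page is already in the raw HTML."""
    return visible_text in page_source


# Made-up snippets standing in for "View Page Source" output.
server_html = "<html><body><ul><li>美丽人生</li></ul></body></html>"
client_html = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(looks_server_rendered(server_html, "美丽人生"))
print(looks_server_rendered(client_html, "美丽人生"))
```

In practice `page_source` would be the text of a real response. A miss only hints at client-side rendering, since the text could also be HTML-escaped or split across tags.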

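To get the data from the second kind of page, you need the browser's network capture tool.
In the example here, the page shows the text “美丽人生” (Life Is Beautiful), but searching the page source with Ctrl+F finds no such text, so this page should be of the second kind, that is, client-side rendered.
So where is the text containing “美丽人生”? Right-click the page and choose Inspect at the bottom of the menu, or press F12 to open the developer tools.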

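Click through the request entries listed on the left (marked with a red box in the original screenshot) one by one and check the Preview pane on the right. The second entry turns out to be the one we need: it contains the text “美丽人生”.
Once you have found it, click Headers on the right. A few parts of the information there are worth paying attention to for now.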
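Judging from the code that follows, the request URL and the User-Agent request header are among those parts. As an illustration, the query string of that URL can be split into named parameters with the standard library. This is only a sketch: reading `start` as a paging offset is an assumption based on the parameter name, not something the original states.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = ('https://movie.douban.com/j/chart/top_list'
       '?type=24&interval_id=100%3A90&action=&start=0&limit=20')

parts = urlsplit(url)
# keep_blank_values=True keeps the empty "action" parameter
params = {k: v[0] for k, v in parse_qs(parts.query, keep_blank_values=True).items()}
print(parts.netloc)   # host
print(parts.path)     # endpoint path
print(params)         # query parameters, percent-decoded

# Assumption: "start" is an offset, so another value would request other rows.
params['start'] = '20'
next_url = f'{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(params)}'
print(next_url)
```

Keeping the parameters in a dict makes it easier to change one value than editing the URL string by hand.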

Write code to try to fetch the data shown in the preview:

import requests

url = 'https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=0&limit=20'
# User-Agent copied from the browser so the request looks like a normal visit
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
}
resp = requests.get(url=url, headers=headers)
print(resp.text)

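The output shows that all the data seen in the Preview pane has been fetched, though it is rather messy. From here you only need to pull out the fields you are interested in, which is plain Python rather than scraping, since the data is already in hand.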

For example, to get only the movie titles and ratings:

import requests

url = 'https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=0&limit=20'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
}
resp = requests.get(url=url, headers=headers)
# The response body is JSON: a list with one dict per movie
content_list = resp.json()
for content in content_list:
    movie_name = content['title']
    movie_score = content['score']
    print(f'《{movie_name}》, rating: {movie_score}')

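Because the remaining work is ordinary Python, it can be tried offline. The sketch below applies the same field extraction to a made-up list shaped like the parsed response and skips entries that lack either field. Only the `title` and `score` keys come from the code above; the sample values and the `summarize` helper are hypothetical.

```python
def summarize(content_list):
    """Return one formatted line per item, skipping items that lack either key."""
    lines = []
    for content in content_list:
        title = content.get('title')
        score = content.get('score')
        if title is None or score is None:
            continue
        lines.append(f'《{title}》, rating: {score}')
    return lines


# Hypothetical sample data, not real API output.
sample = [
    {'title': 'Movie A', 'score': '9.5'},
    {'title': 'Movie B'},                  # no score: skipped
    {'title': 'Movie C', 'score': '9.1'},
]
for line in summarize(sample):
    print(line)
```

Using `dict.get` instead of direct indexing means one incomplete entry does not stop the whole loop with a `KeyError`.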

Four simple scraping examples

1. Fetch the Baidu home page with urlopen and save it

from urllib.request import urlopen

# Request the page, then write the decoded HTML to a local file
resp = urlopen('http://www.baidu.com')
with open('baidu.html', mode='w', encoding='utf-8') as f:
    f.write(resp.read().decode('utf-8'))
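The example above assumes the response body is UTF-8. A slightly more defensive sketch decodes with a declared charset when one is available and otherwise falls back to UTF-8 with replacement characters. `decode_body` is a hypothetical helper; with `urlopen`, the declared charset could be read from `resp.headers.get_content_charset()`.

```python
def decode_body(raw: bytes, charset=None) -> str:
    """Decode bytes with the declared charset, falling back to UTF-8."""
    if charset:
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset: fall through to UTF-8
    return raw.decode('utf-8', errors='replace')


print(decode_body('百度'.encode('gbk'), 'gbk'))
print(decode_body('百度'.encode('utf-8')))
```

The `errors='replace'` fallback trades exactness for robustness: undecodable bytes become replacement characters instead of raising an exception.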

2. Get candidate translations for a word from a translation site


Reference code:

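import requests

url = 'https://fanyi.baidu.com/sug'
name = input('Enter the word to look up: ')
# The User-Agent is a request header, so it goes in headers rather than in the form data
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
}
data = {'kw': name}
resp = requests.post(url, data=data, headers=headers)
# Take the 'v' field of the first candidate in the 'data' list
fanyi_result = resp.json()['data'][0]['v']
print(fanyi_result)
resp.close()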
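Indexing `['data'][0]` raises an `IndexError` when the list is empty, which presumably can happen when the site has no suggestion for the input. A minimal sketch of a more forgiving lookup follows. The payloads are made up and mirror only the shape the code above relies on (a `data` list of dicts with a `v` key); `first_suggestion` is a hypothetical helper.

```python
def first_suggestion(payload: dict):
    """Return the 'v' field of the first entry in payload['data'], or None."""
    entries = payload.get('data') or []
    if not entries:
        return None
    return entries[0].get('v')


# Hypothetical payloads, not captured from the real service.
print(first_suggestion({'data': [{'v': 'n. example meaning'}]}))
print(first_suggestion({'data': []}))
```

With this helper, the caller can print a "no result" message when `None` comes back instead of crashing.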

3. Get book titles and prices from a web page

Reference code:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0"
}
url = "http://books.toscrape.com/"
response = requests.get(url=url, headers=headers)
if response.ok:
    print(response.status_code)  # status code; 200 means the request succeeded
    content = response.text
    # "html.parser" selects Python's built-in HTML parser
    soup = BeautifulSoup(content, "html.parser")
    # Book prices: find the p tags by class attribute; the result is iterable
    all_prices = soup.find_all("p", attrs={"class": "price_color"})
    # print(list(all_prices))
    print("===== Book prices: =====")
    for price in all_prices:
        # price.string keeps only the text inside the tag; the slice drops the currency prefix
        print(price.string[2:])
    print("===== Book titles: =====")
    # Book titles: each h3 wraps a link whose text is the title
    all_titles = soup.find_all("h3")
    for title in all_titles:
        all_links = title.find_all("a")
        for link in all_links:
            print(link.string)
    response.close()
else:
    print("Request failed")
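The slice `price.string[2:]` assumes the price text always starts with exactly two non-numeric characters. A sketch that pulls the number out with a regular expression avoids depending on the prefix length. `parse_price` is a hypothetical helper, and the sample strings are made up.

```python
import re


def parse_price(text: str):
    """Return the first decimal number in text as a float, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None


print(parse_price('£51.77'))
print(parse_price('Â£51.77'))  # same number even with a mis-decoded prefix
print(parse_price('n/a'))
```

Returning a float also makes the prices usable for sorting or summing, which a sliced string is not.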

4. Get the titles of the top 250 movies on Douban

Reference code:

import requests
from bs4 import BeautifulSoup

# Get the titles of the top 250 movies on Douban
# Browser identification
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0"
}
i = 1
for start_num in range(0, 250, 25):
    # print(start_num)
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    # print("Server response status code:", response.status_code)
    response.encoding = "UTF-8"  # set the character set
    if response.ok:  # run the code below only if the server responded normally
        douban_top250_html = response.text
        soup = BeautifulSoup(douban_top250_html, "html.parser")
        # all_titles = soup.find_all("span", attrs={"class": "title"})
        all_titles = soup.find_all("span", class_="title")  # both forms give the same result
        for title in all_titles:
            title_string = title.string
            if "/" not in title_string:
                print(f"{i}:\t《{title.string}》")
                i = i + 1
    else:
        print("Request failed!")
    response.close()
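Two small pieces of the loop above can be checked without any network access: the page offsets produced by `range(0, 250, 25)`, and the `"/"` filter, which appears to be there to drop alternate titles that share the same `title` class. The sketch below uses made-up strings. `keep_primary` is a hypothetical helper that also skips `None`, which `title.string` returns when a tag has more than one child.

```python
def keep_primary(strings):
    """Keep titles that are not None and contain no '/' separator."""
    return [s for s in strings if s is not None and '/' not in s]


starts = list(range(0, 250, 25))
print(len(starts), starts[0], starts[-1])   # 10 pages, offsets 0 to 225

# Made-up stand-ins for the span texts on one page.
sample = ['Title One', '\xa0/\xa0Alternate One', None, 'Title Two']
print(keep_primary(sample))
```

Checking for `None` first matters because `"/" not in None` would raise a `TypeError`.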