
Life Is Short, I Use Python, Part 11: Basic Web Scraping with Python

news  Source: original  2024/9/19 9:39:26

Python has many libraries for web scraping; two of the most commonly used are requests and BeautifulSoup. The steps and examples below show how to use them to scrape data.

1. Install the dependencies
First, make sure requests and BeautifulSoup are installed. If they are not, install them with the following commands:

pip install requests
pip install beautifulsoup4

2. Fetch page content with requests
The requests library sends HTTP requests and receives the responses. The following example fetches a page:

import requests

url = 'https://example.com'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    print(html_content)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

3. Parse HTML with BeautifulSoup
BeautifulSoup is a library for parsing HTML and XML documents. The following example parses the html_content fetched in the previous step:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

# Find all h1 heading tags
titles = soup.find_all('h1')
for title in titles:
    print(title.get_text())

# Find one specific tag (find returns None when nothing matches)
specific_div = soup.find('div', {'class': 'specific-class'})
if specific_div:
    print(specific_div.get_text())
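
To see what these calls return without touching the network, the following sketch parses a small inline HTML string. The markup and its contents are made up for illustration; `class_=` is the keyword form of the `{'class': ...}` filter used above, and `select_one` does the same lookup with a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html_content = """
<html><body>
  <h1>First heading</h1>
  <h1>Second heading</h1>
  <div class="specific-class">Hello <b>world</b></div>
</body></html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# find_all returns a list of every matching tag.
headings = [h1.get_text() for h1 in soup.find_all('h1')]

# find returns the first match or None, so guard before using it.
specific_div = soup.find('div', class_='specific-class')
div_text = specific_div.get_text() if specific_div else None

# The same lookup written as a CSS selector.
same_div = soup.select_one('div.specific-class')

print(headings)
print(div_text)
```

Note that get_text() joins the text of nested tags, so the text inside the `<b>` element is included in the div's text.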

4. A complete example
The following example combines both steps to scrape titles and links from a news site:

import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all story entries. Title links sit inside <span class="titleline">;
    # the site's markup may change, so adjust the selector if nothing matches.
    stories = soup.select('span.titleline > a')
    for story in stories:
        title = story.get_text()
        link = story['href']
        print(f"Title: {title}")
        print(f"Link: {link}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
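
The href values extracted this way are not always absolute URLs, since a page may link to its own pages with relative paths. A minimal sketch, assuming you want every link in absolute form, uses urljoin from the standard library. The sample href values below are invented for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://news.ycombinator.com/'

# Hypothetical href values of the kind a page might contain.
hrefs = ['https://example.com/article', 'item?id=1', '/newest']

# urljoin leaves absolute URLs untouched and resolves relative ones
# against base_url.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```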

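5. Handling pagination
Some sites spread their data across multiple pages, so the scraper has to walk through them. The following example does this: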

import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/page/'
page_number = 1

while True:
    url = f"{base_url}{page_number}"
    response = requests.get(url)
    if response.status_code != 200:
        break

    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    items = soup.find_all('div', {'class': 'item'})
    if not items:
        break

    for item in items:
        title = item.find('h2').get_text()
        print(title)

    page_number += 1
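
The loop above runs until a request fails or a page comes back empty. In practice it can help to also cap the number of pages and pause between requests. The sketch below shows one way to structure that; `crawl_pages` and its `fetch_items` callback are hypothetical names, and the callback is passed in so the loop can be tried without a network connection.

```python
import time
from typing import Callable, List


def crawl_pages(fetch_items: Callable[[int], List[str]],
                max_pages: int = 50,
                delay_seconds: float = 1.0) -> List[str]:
    """Collect items page by page until a page is empty or max_pages is reached.

    fetch_items is a hypothetical callback: given a page number, it returns
    the items found on that page (an empty list when there are none).
    """
    collected: List[str] = []
    for page_number in range(1, max_pages + 1):
        items = fetch_items(page_number)
        if not items:
            break  # same stop condition as the loop above
        collected.extend(items)
        time.sleep(delay_seconds)  # pause before requesting the next page
    return collected


# Stand-in for a real site: three pages of fake titles, then nothing.
fake_site = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}
titles = crawl_pages(lambda page: fake_site.get(page, []), delay_seconds=0)
print(titles)
```

In a real scraper, fetch_items would wrap the requests.get and soup.find_all calls from the example above and return an empty list on a non-200 response.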

6. Handling dynamic content
For dynamically generated content, such as content loaded by JavaScript, you can use the Selenium library. To install it:

pip install selenium

The following example uses Selenium to fetch dynamic content:

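from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://example.com'
driver = webdriver.Chrome()  # or use the driver for another browser
driver.get(url)

# page_source holds the HTML after the browser has rendered the page
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')

# Parse the content
items = soup.find_all('div', {'class': 'item'})
for item in items:
    title = item.find('h2').get_text()
    print(title)

driver.quit()  # close the browser when done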

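7. Scraping etiquette
Respect the site's robots.txt file: it states which pages crawlers are allowed to fetch.
Set a reasonable interval between requests: avoid sending requests so frequently that they burden the server.
Set a User-Agent: add a User-Agent header to the request to indicate that it comes from a browser.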

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)
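
The first two points can be sketched with the standard library alone. The example below uses urllib.robotparser to check which URLs may be fetched and to read a Crawl-delay value. The robots.txt content, the crawler name `my-crawler`, and the URLs are all assumptions made for illustration; the rules are parsed from an inline string so the sketch runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would instead call
# parser.set_url('https://example.com/robots.txt') followed by parser.read().
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = 'my-crawler'  # hypothetical crawler name
urls = [
    'https://example.com/public/page',
    'https://example.com/private/data',
]

# Keep only the URLs that the rules allow this user agent to fetch.
allowed = [u for u in urls if parser.can_fetch(user_agent, u)]

# Use the site's Crawl-delay when it gives one, otherwise wait one second.
delay = parser.crawl_delay(user_agent) or 1

for index, page_url in enumerate(allowed):
    if index > 0:
        time.sleep(delay)  # wait between consecutive requests
    print(f"OK to fetch: {page_url}")
    # requests.get(page_url, headers=headers) would go here
```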