Python has many libraries for web scraping; two of the most commonly used are requests and BeautifulSoup. The steps and examples below show how to use them to scrape data.

1. Install the dependencies
First, make sure requests and BeautifulSoup are installed. If they are not, install them with the following commands:

pip install requests
pip install beautifulsoup4

2. Fetch page content with requests
The requests library sends HTTP requests and receives the responses. The following example fetches the content of a web page:

import requests

url = 'https://example.com'
response = requests.get(url)
if response.status_code == 200:
    html_content = response.text
    print(html_content)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
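The request above can wait indefinitely if the server never responds, because requests applies no timeout unless one is given. A minimal sketch of a slightly more defensive fetch follows. The helper name fetch_html and the 10-second default are illustrative assumptions, not part of the original example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html and the 10-second default are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException covers timeouts, connection errors, and HTTPError.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling fetch_html('https://example.com') then gives either the HTML text or None, so the caller only needs a single check before parsing.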

3. Parse HTML with BeautifulSoup
BeautifulSoup is a library for parsing HTML and XML documents. The following example parses HTML content:

from bs4 import BeautifulSoup

# html_content is the page text fetched in step 2
soup = BeautifulSoup(html_content, 'html.parser')

# Find all top-level heading (h1) tags
titles = soup.find_all('h1')
for title in titles:
    print(title.get_text())

# Find a specific tag by its class
specific_div = soup.find('div', {'class': 'specific-class'})
if specific_div:
    print(specific_div.get_text())
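To see what these calls return without making a network request, the same methods can be tried on a small inline HTML string. The markup and link path below are made up for illustration; select, which takes a CSS selector, is shown as an alternative to find and find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
sample_html = """
<html><body>
  <h1>First heading</h1>
  <h1>Second heading</h1>
  <div class="specific-class">Hello <a href="/about">about us</a></div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list of every matching tag.
headings = [h1.get_text() for h1 in soup.find_all("h1")]
print(headings)

# find returns the first match, or None if nothing matches.
div = soup.find("div", class_="specific-class")
# A separator keeps the text of nested tags from running together.
div_text = div.get_text(" ", strip=True) if div else None
print(div_text)

# select takes a CSS selector; attributes are read like dictionary keys.
links = [a["href"] for a in soup.select("div.specific-class a")]
print(links)
```

Because find returns None when nothing matches, checking the result before calling get_text avoids an AttributeError on pages that lack the expected tag.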

4. Complete example

import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/'
response = requests.get(url)
if response.status_code == 200:
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    # Find all story links. Hacker News no longer uses the 'storylink' class;
    # at the time of writing, titles sit in <span class="titleline"><a>.
    # The selector may need updating if the markup changes again.
    stories = soup.select('span.titleline > a')
    for story in stories:
        title = story.get_text()
        link = story['href']
        print(f"Title: {title}")
        print(f"Link: {link}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

5. Handle pagination

import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/page/'
page_number = 1
while True:
    url = f"{base_url}{page_number}"
    response = requests.get(url)
    if response.status_code != 200:
        break
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    items = soup.find_all('div', {'class': 'item'})
    if not items:
        break
    for item in items:
        title = item.find('h2').get_text()
        print(title)
    page_number += 1
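The loop above stops only when a page fails or comes back empty, so a site that returns content for every page number would keep it running, and the requests are sent back to back. One possible way to bound the loop and pause between pages is sketched below. The names crawl_pages, fetch_page, max_pages, and delay are hypothetical; fetch_page stands in for the requests and BeautifulSoup code above and is assumed to return a list of items for a page number (empty when nothing is left).

```python
import time


def crawl_pages(fetch_page, max_pages=50, delay=1.0):
    """Collect items page by page until a page is empty or max_pages is reached.

    fetch_page(page_number) is a caller-supplied function (hypothetical here)
    that returns a list of items, or an empty list when no more pages exist.
    """
    results = []
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:
            break  # an empty page signals the end of the listing
        results.extend(items)
        time.sleep(delay)  # pause so the server is not hit in a tight loop
    return results


# Offline demonstration with fake data instead of real HTTP requests.
fake_site = {1: ["a", "b"], 2: ["c"]}
titles = crawl_pages(lambda n: fake_site.get(n, []), delay=0)
print(titles)
```

Passing the fetch step in as a function keeps the loop logic separate from the network code, which also makes it easy to try out with fake data as shown.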

6. Handle dynamic content
For content that is generated dynamically, such as content loaded by JavaScript, you can use the Selenium library. Install it with:

pip install selenium

An example of fetching dynamic content with Selenium:

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://example.com'
driver = webdriver.Chrome()  # or use the driver for another browser
try:
    driver.get(url)
    html_content = driver.page_source
finally:
    driver.quit()  # always close the browser, even if loading fails

# Parse the rendered HTML
soup = BeautifulSoup(html_content, 'html.parser')
items = soup.find_all('div', {'class': 'item'})
for item in items:
    title = item.find('h2').get_text()
    print(title)

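7. Scraping etiquette
Respect the site's robots.txt file: it specifies which pages crawlers are allowed to fetch.
Set a User-Agent: add a User-Agent header to your requests to indicate that the request comes from a browser.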

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)
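Checking robots.txt can be automated with the standard library's urllib.robotparser module. The sketch below parses a made-up rule set inline so that it runs offline; the user-agent name my-crawler and the paths are assumptions for illustration. Against a real site, one would call set_url(...) followed by read() instead of parse(...).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed_public = parser.can_fetch("my-crawler", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

A crawler could run this check before each request and skip any URL for which can_fetch returns False.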


