
Python Crawler: Scraping the Douban TOP250 Movies

news  Source: original  2024/5/5 20:37:54

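Many people like to check Douban for other viewers' opinions before watching a movie. Douban is widely regarded as the most authoritative movie rating site in China. Its ratings are not free of paid reviewers and heavily biased users, but the movies on the TOP250 list are still good and worth watching.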

Scraping goal

This article scrapes the title, release date, cast, and rating of each movie on the Douban movie TOP250 list, and saves the results as an Excel file.

Page analysis

Opening the Douban movie TOP250 page shows that the list mainly displays each movie's title, cast, release date, and rating.

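Inspecting the page source shows that the movie title sits in a <div class="hd">...</div> tag, the cast and release date in <div class="bd">...</div>, and the rating in <div class="star">...</div> (inside a <span class="rating_num"> element, which is what the code below queries). Calling the find_all method on these tags therefore yields all the information we need.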
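As a minimal sketch of that lookup, the snippet below runs find_all on a hand-written HTML fragment. The fragment is an assumption that only imitates the structure described above, and the title, cast line, and rating in it are made-up sample values; the live page contains more markup.

```python
import bs4

# Hand-written fragment imitating one list entry; not copied from the live page.
SAMPLE_HTML = """
<div class="item">
  <div class="hd"><a href="#"><span class="title">Sample Movie</span></a></div>
  <div class="bd">
    <p>
      Director: Jane Doe   Starring: John Roe
      1999 / Country / Drama
    </p>
    <div class="star"><span class="rating_num">9.0</span></div>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching tag; class_ avoids the reserved word "class".
titles = [div.a.span.text for div in soup.find_all("div", class_="hd")]
ratings = [span.text for span in soup.find_all("span", class_="rating_num")]

# The first <p> inside "bd" holds the cast line and the release line.
details = []
for div in soup.find_all("div", class_="bd"):
    lines = [line.strip() for line in div.p.text.split("\n") if line.strip()]
    details.append(" | ".join(lines))

print(titles, ratings, details)
```

Each list holds one entry per matching tag, so on a real page the three lists line up only if every movie entry contains all three elements.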

Extracting data from the first page

import bs4


def find_movies(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Movie titles
    movies = []
    targets = soup.find_all("div", class_="hd")
    for each in targets:
        movies.append(each.a.span.text)

    # Ratings
    ranks = []
    targets = soup.find_all("span", class_="rating_num")
    for each in targets:
        ranks.append(each.text)

    # Details: cast line plus release line from the first <p> of each "bd" div
    messages = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            lines = each.p.text.split('\n')
            messages.append(lines[1].strip() + lines[2].strip())
        except (AttributeError, IndexError):
            # Skip "bd" divs that have no <p> or too few lines
            continue

    # Combine the three lists into one row per movie
    result = []
    length = len(movies)
    for i in range(length):
        result.append([movies[i], ranks[i], messages[i]])

    return result
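One caveat with the index-based merge above: the except branch skips entries, so messages can end up shorter than movies, and messages[i] would then raise IndexError or pair a title with the wrong details. A minimal sketch of a guard follows; combine_columns is a hypothetical helper name, not part of the original script, and the values passed to it are made-up samples.

```python
def combine_columns(movies, ranks, messages):
    """Merge three parallel lists into rows, refusing to guess on a mismatch."""
    if not (len(movies) == len(ranks) == len(messages)):
        raise ValueError(
            "column lengths differ: "
            f"{len(movies)} titles, {len(ranks)} ratings, {len(messages)} details"
        )
    # zip pairs the i-th title, rating, and details entry into one row.
    return [list(row) for row in zip(movies, ranks, messages)]


rows = combine_columns(["A", "B"], ["9.0", "8.5"], ["info A", "info B"])
print(rows)
```

Failing loudly on a length mismatch makes a change in the page layout visible immediately instead of silently writing shifted rows to the spreadsheet.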

Paginated scraping

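The data we want covers the whole TOP250 list, so we need to fetch every page of it. The function below reads the page count from the element that precedes the "next" span in the pagination bar.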

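import bs4


def find_depth(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # Step back twice from the "next" span: the first sibling is a whitespace
    # text node, the second is the link holding the last page number
    depth = soup.find('span', class_='next').previous_sibling.previous_sibling.text

    return int(depth)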
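With the page count known, each page's URL follows from the start offset, which advances by 25 entries per page. The sketch below builds the list of URLs without making any request; build_page_urls is a hypothetical helper, the page size of 25 is taken from the offset used in this article's complete script, and a depth of 10 is assumed for illustration.

```python
def build_page_urls(host, depth, page_size=25):
    """Return one URL per page; page i starts at entry i * page_size."""
    return [f"{host}/?start={page_size * i}" for i in range(depth)]


urls = build_page_urls("https://movie.douban.com/top250", depth=10)
print(len(urls))   # 10 pages of 25 entries cover 250 movies
print(urls[0])
print(urls[-1])
```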

Writing to a file

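import openpyxl


def save_to_excel(result):
    wb = openpyxl.Workbook()
    ws = wb.active

    # Header row: movie title, rating, details
    ws['A1'] = "电影名称"
    ws['B1'] = "评分"
    ws['C1'] = "资料"

    # Each entry is a [title, rating, details] list and becomes one row
    for each in result:
        ws.append(each)

    wb.save("豆瓣TOP250电影.xlsx")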
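To see how the header cells and ws.append interact, the sketch below fills a workbook in memory and reads the cells back without saving a file. The English header names and the two data rows are made-up sample values used only for this illustration.

```python
import openpyxl

wb = openpyxl.Workbook()
ws = wb.active

# Header row, written cell by cell as in save_to_excel.
ws["A1"] = "title"
ws["B1"] = "rating"
ws["C1"] = "details"

# append() writes each list to the first empty row below the existing data.
sample_rows = [["Sample A", "9.0", "info A"], ["Sample B", "8.5", "info B"]]
for row in sample_rows:
    ws.append(row)

# iter_rows(values_only=True) yields plain tuples of cell values.
table = [list(cells) for cells in ws.iter_rows(values_only=True)]
print(table)
```

Because the header occupies row 1, the first appended list lands in row 2, so the data rows never overwrite the header.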

Putting the code together

import requests
import bs4
import openpyxl


def open_url(url):
    # Optional proxy
    # proxies = {"http": "127.0.0.1:1080", "https": "127.0.0.1:1080"}
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}

    # res = requests.get(url, headers=headers, proxies=proxies)
    res = requests.get(url, headers=headers)

    return res


def find_movies(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Movie titles
    movies = []
    targets = soup.find_all("div", class_="hd")
    for each in targets:
        movies.append(each.a.span.text)

    # Ratings
    ranks = []
    targets = soup.find_all("span", class_="rating_num")
    for each in targets:
        ranks.append(each.text)

    # Details: cast line plus release line from the first <p> of each "bd" div
    messages = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            lines = each.p.text.split('\n')
            messages.append(lines[1].strip() + lines[2].strip())
        except (AttributeError, IndexError):
            # Skip "bd" divs that have no <p> or too few lines
            continue

    # Combine the three lists into one row per movie
    result = []
    length = len(movies)
    for i in range(length):
        result.append([movies[i], ranks[i], messages[i]])

    return result


# Find out how many pages there are in total
def find_depth(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    depth = soup.find('span', class_='next').previous_sibling.previous_sibling.text

    return int(depth)


def save_to_excel(result):
    wb = openpyxl.Workbook()
    ws = wb.active

    # Header row: movie title, rating, details
    ws['A1'] = "电影名称"
    ws['B1'] = "评分"
    ws['C1'] = "资料"

    for each in result:
        ws.append(each)

    wb.save("豆瓣TOP250电影.xlsx")


def main():
    host = "https://movie.douban.com/top250"
    res = open_url(host)
    depth = find_depth(res)

    result = []
    for i in range(depth):
        url = host + '/?start=' + str(25 * i)
        res = open_url(url)
        result.extend(find_movies(res))