
Scraping Game Apps from Huawei AppGallery with Python

Contents

  • 1. Introduction
  • 2. API Analysis
  • 3. Building the Scraper
  • 4. Getting the Download Link

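【🏠 Author's homepage】: 吴秋霖
【💼 About the author】: Specializes in web scraping and reverse engineering of JS encryption. Recognized Python content creator, CSDN blog expert, Alibaba Cloud blog expert, and Huawei Cloud expert, with a long-standing focus on research and development in Python and web scraping.
【🌟 Recommended by the author】: Readers interested in scraping and JS reverse engineering can follow the columns 《爬虫JS逆向实战》 and 《深耕爬虫领域》.
The author will keep publishing about the techniques used, learned, and encountered along the way, including but not limited to: CAPTCHA bypassing, app scraping and JS reverse engineering, RPA automation, distributed crawlers, and Python in general.

Author's statement: this article is for learning, discussion, and reference only. Any commercial or illegal use is strictly prohibited, and the author bears no responsibility for consequences arising from such use. If anything here infringes your rights, please contact the author to have it removed.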

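1. Introduction

  This is an app-marketplace website the author came across recently. Honestly, it makes a good hands-on exercise for beginners, since getting better at scraping takes plenty of practice and analysis. Some of the site's small details exercise your analytical skills, and it has anti-scraping measures as well. On top of that, the web version does not expose a direct APK download link; that has to be obtained from the mobile app's API data.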

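2. API Analysis

The target is the entire Games section (the same approach would also work for every section, or for keyword search results). The Games section contains the app information for all categories and subcategories, as shown below:

[Image]

First, open the browser's developer tools, trigger a request, and inspect what is sent, as shown below:

[Image]

The request can be copied as cURL and converted to Python code, as follows: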

import requests

headers = {
    "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://appgallery.huawei.com/",
    "Interface-Code": "6bc2bec970e616747dc0a99b57aa6730_1716973000358",
    "sec-ch-ua-mobile": "?0",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "sec-ch-ua-platform": "\"macOS\"",
}
url = "https://web-drcn.hispace.dbankcloud.com/edge/uowap/index"
params = {
    "method": "internal.getTabDetail",
    "serviceType": "20",
    "reqPageNum": "1",
    "uri": "b2b4752f0a524fe5ad900870f88c11ed",
    "maxResults": "25",
    "zone": "",
    "locale": "en",
}
response = requests.get(url, headers=headers, params=params)
print(response.json())

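Sending this request as-is fails! There is a subtle anti-scraping check, and the response reports that interface verification failed:

{'rtnCode': 1002, 'rtnDesc': 'InterfaceCode Verification failed.'}

Why? The message points at interface verification, and the request headers carry an Interface-Code parameter, so that is most likely the culprit: the replayed value was rejected. The parameter is dynamic, but fortunately it is not produced by an encryption algorithm.

Interface-Code is effectively a dynamically issued token. Before calling the API, request a fixed endpoint to obtain its value, then send that value with every subsequent request.

Note: the value is also time-limited. It stays valid for roughly one minute.

[Image]

The screenshot above shows the endpoint that issues Interface-Code. Fetch a fresh value from it before each page request.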
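
Since the token lasts about a minute, it can also be cached and refreshed shortly before it expires instead of being fetched for every request. The following is a minimal sketch of that idea, not the author's implementation. The token endpoint only appears in the screenshot, so it is represented here by a hypothetical `fetch_token` callable that you would supply; the 50-second lifetime is an assumption derived from the "roughly one minute" observation, and the class and function names are made up for illustration.

```python
import time
from typing import Callable, Optional


class InterfaceCodeCache:
    """Caches a short-lived token and refetches it once it is considered stale."""

    def __init__(
        self,
        fetch_token: Callable[[], str],  # hypothetical: calls the token endpoint
        ttl_seconds: float = 50.0,       # assumption: a bit under one minute
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if missing or stale."""
        now = self._clock()
        if self._token is None or now - self._fetched_at >= self._ttl:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token

    def invalidate(self) -> None:
        """Drop the cached token, e.g. after the server rejects it."""
        self._token = None


def is_token_rejected(payload: dict) -> bool:
    """True when the response carries the verification-failure code seen above."""
    return payload.get("rtnCode") == 1002
```

In use, each request would set `headers["Interface-Code"] = cache.get()`, and a response for which `is_token_rejected` returns true would trigger `cache.invalidate()` followed by one retry.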

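3. Building the Scraper

Analysis turns up a unique ID that represents the full set of categories in the Games section. With it, all subcategory IDs can be retrieved, as shown below:

[Image]

The full URL is: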


https://web-drcn.hispace.dbankcloud.com/edge/uowap/index?method=internal.getTabDetail&serviceType=20&reqPageNum=1&uri=33ef450cbac34770a477cfa78db4cf8c&maxResults=25&zone=&locale=en

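After calling this endpoint, parse the tabId field under tabInfo. That is the subcategory ID needed later to request all game apps under each subsection; substitute it for uri in the URL above, as shown below:

[Image]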
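
The two steps above (build the URL for a given uri, then pull the tabId values out of the response) could be sketched as follows. This is an illustration under assumptions: the exact shape of the response is only visible in the screenshots, so the code assumes `tabInfo` is a list of objects that may themselves contain a nested `tabInfo` list. The helper names `build_tab_url` and `extract_tab_ids` are made up here.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

BASE_URL = "https://web-drcn.hispace.dbankcloud.com/edge/uowap/index"


def build_tab_url(uri: str, page: int = 1, max_results: int = 25) -> str:
    """Build the getTabDetail URL for a category or subcategory ID."""
    params = {
        "method": "internal.getTabDetail",
        "serviceType": "20",
        "reqPageNum": str(page),
        "uri": uri,
        "maxResults": str(max_results),
        "zone": "",
        "locale": "en",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_tab_ids(payload: Dict[str, Any]) -> List[str]:
    """Collect every tabId found under tabInfo (assumed to be possibly nested)."""
    ids: List[str] = []
    queue = list(payload.get("tabInfo") or [])
    while queue:
        tab = queue.pop(0)
        tab_id = tab.get("tabId")
        if tab_id:
            ids.append(tab_id)
        # Assumption: child tabs, if any, live under the same "tabInfo" key.
        queue.extend(tab.get("tabInfo") or [])
    return ids
```

Each ID returned by `extract_tab_ids` can then be passed back into `build_tab_url`, incrementing `page` to walk through a subcategory's listing.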

Next comes the detail-page data. Locate the detail-page endpoint, as shown below:

[Image]

The request to the detail endpoint can be built as follows:

from urllib.parse import quote

import requests

# appDetail is the structured record returned by the subcategory endpoint
appid = appDetail.get('appid', '')
if appid:
    detail_link = f"https://appgallery.huawei.com/#/app/{appid}"
    encoded_appid = quote(f"app|{appid}")
    # currentUrl is percent-encoded twice
    encoded_detail_link = quote(quote(detail_link))
    detail_json_link = (
        'https://web-drcn.hispace.dbankcloud.cn/uowap/index?'
        'method=internal.getTabDetail&serviceType=20&reqPageNum=1&maxResults=25&'
        f'uri={encoded_appid}&shareTo=&currentUrl={encoded_detail_link}&accessId=&'
        f'appid={appid}&zone=&locale=zh'
    )
    response = requests.get(detail_json_link, headers=headers)
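
To make the encoding concrete, here is the same URL construction wrapped in a function so the result can be inspected without sending a request. It is only a sketch: the function name `build_detail_url` is made up, and the appid used in the usage note is the one from the legacy download link quoted later in this article.

```python
from urllib.parse import quote


def build_detail_url(appid: str) -> str:
    """Build the detail-endpoint URL for one appid (no network access)."""
    detail_link = f"https://appgallery.huawei.com/#/app/{appid}"
    encoded_appid = quote(f"app|{appid}")            # "|" becomes %7C
    encoded_detail_link = quote(quote(detail_link))  # second pass turns "%" into %25
    return (
        "https://web-drcn.hispace.dbankcloud.cn/uowap/index?"
        "method=internal.getTabDetail&serviceType=20&reqPageNum=1&maxResults=25&"
        f"uri={encoded_appid}&shareTo=&currentUrl={encoded_detail_link}&accessId=&"
        f"appid={appid}&zone=&locale=zh"
    )
```

For `build_detail_url("C100210735")`, the `uri` parameter becomes `app%7CC100210735` and `currentUrl` becomes `https%253A//appgallery.huawei.com/%2523/app/C100210735`. Note that `quote` leaves `/` unescaped by default.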

The detail endpoint also returns JSON, which then needs to be parsed. The code is as follows:

import json

import dateparser

item = {}
app_keys = [
    "icoUintrori", "intro", "name", "sizeDesc", "versionName", "downurl",
    "package", "commentCount", "appid", "md5", "kindName", "images",
    "releaseDate", "developer", "appIntro", "authority", "portalUrl",
]
appdata = {}
# response is the requests.Response from the detail request, so parse its body text
response_data = json.loads(response.text)
layoutDatas = response_data.get('layoutData', [])
for layoutData in layoutDatas:
    # The detail payload is split across differently shaped blocks; the useful ones have type 3 or 6
    if layoutData.get('dataList-type') in (3, 6):
        data_list = layoutData.get('dataList') or []
        if not data_list:
            continue
        appinfo = data_list[0]
        # Keep the first value seen for each wanted key
        appdata.update({key: value for key, value in appinfo.items() if key in app_keys and key not in appdata})

item['apk_name'] = appdata.get('name')
item['developer'] = appdata.get("developer")
item["apk_size"] = appdata.get("sizeDesc")
item['apk_version_name'] = appdata.get("versionName")
pub_time = appdata.get("releaseDate")
# dateparser.parse returns None when it cannot parse the string
parsed_time = dateparser.parse(pub_time) if pub_time else None
item['apk_update_time'] = parsed_time.strftime('%Y-%m-%d %H:%M:%S') if parsed_time else None
item['apk_screenshots'] = appdata.get("images")
item['apk_introduction'] = appdata.get("appIntro")
item['apk_download_times'] = process_downloads_times(appdata.get('intro', ''))
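
The merging step in the loop above (walk the blocks, keep only types 3 and 6, and let the first value seen for each key win) can be isolated into a small function and exercised on a hand-made payload. This is a sketch: `collect_app_fields` is a made-up name, and the sample data in the usage note is invented purely to show the behavior, not captured from the real endpoint.

```python
from typing import Any, Collection, Dict


def collect_app_fields(
    response_data: Dict[str, Any],
    wanted_keys: Collection[str],
    block_types: Collection[int] = (3, 6),
) -> Dict[str, Any]:
    """Merge wanted fields from the first entry of each matching layout block."""
    merged: Dict[str, Any] = {}
    for block in response_data.get("layoutData") or []:
        if block.get("dataList-type") not in block_types:
            continue
        data_list = block.get("dataList") or []
        if not data_list:
            continue  # nothing to read from an empty block
        for key, value in data_list[0].items():
            # First occurrence wins, matching the "key not in appdata" check above.
            if key in wanted_keys and key not in merged:
                merged[key] = value
    return merged
```

With an invented payload such as `{"layoutData": [{"dataList-type": 3, "dataList": [{"name": "Demo"}]}, {"dataList-type": 6, "dataList": [{"name": "Later", "developer": "Dev"}]}]}` and wanted keys `{"name", "developer"}`, the result keeps `"Demo"` as the name and picks up `"Dev"` as the developer.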

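This yields the complete app information (name, version, release date, size, download count, developer, and so on). process_downloads_times is a custom helper that cleans up the download count, which appears on the page like this:

[Image]

So a function is needed to clean that download count. The implementation is as follows: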

import re


def process_downloads_times(data):
    if not data:
        return None
    # Plain integers pass straight through
    try:
        return int(data)
    except (ValueError, TypeError):
        pass
    # Strip punctuation and the "次" (times) suffix
    data = re.sub(r"[\+\+,|次|,|<|>]", "", data).strip()
    pa = re.compile(r"[\d+\.]+")
    # Multi-character units are listed first so they match before their single-character parts
    units = {
        "百亿": 10000000000,
        "千万": 10000000,
        "百万": 1000000,
        "十万": 100000,
        "亿": 100000000,
        "十": 10,
        "百": 100,
        "千": 1000,
        "万": 10000,
    }
    for unit, number in units.items():
        if unit in data:
            num = "".join(pa.findall(data))
            try:
                return str(int(float(num) * number))
            except ValueError:
                return data
    return data
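
The function above returns an int on one path and a string on the others, which can be awkward when the value is stored in a numeric column. As a hedged alternative, here is a variant that always returns an int or None. It is a sketch rather than a drop-in replacement: the name `parse_install_count` is made up, and the sample strings in the usage note are assumed to resemble the page text rather than copied from it.

```python
import re
from typing import Optional

# Multi-character units come first so "千万" is tried before "千" or "万".
_UNITS = [
    ("百亿", 10_000_000_000),
    ("千万", 10_000_000),
    ("百万", 1_000_000),
    ("十万", 100_000),
    ("亿", 100_000_000),
    ("万", 10_000),
    ("千", 1_000),
    ("百", 100),
    ("十", 10),
]
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_install_count(text: Optional[str]) -> Optional[int]:
    """Convert strings such as '1 亿次安装' to an integer, or None if no number is found."""
    if not text:
        return None
    # Remove ASCII and full-width thousands separators before matching.
    cleaned = text.replace(",", "").replace(",", "")
    match = _NUMBER.search(cleaned)
    if match is None:
        return None
    value = float(match.group())
    rest = cleaned[match.end():].lstrip()
    for unit, factor in _UNITS:
        if rest.startswith(unit):
            # round() guards against float artifacts such as 11000.000000000002
            return int(round(value * factor))
    return int(value)
```

For example, `parse_install_count("< 1 万次安装")` gives `10000` and `parse_install_count("12,345")` gives `12345`. Unlike the original, unparseable input yields `None` instead of the raw string.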

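4. Getting the Download Link

Next comes the most important part: downloading the APK. The web version exposes a getAppDownloadUrl endpoint, but it does not return the APK's download link. It returns the download link for the Huawei AppGallery app itself, so don't be misled, as shown below:

[Image]

To get the APK download link you have to go through the mobile app; the website no longer provides a direct APK link. It used to: early download links were simply a fixed URL with the appid appended, like this:

https://appgallery.cloud.huawei.com/appdl/C100210735

So the traffic of the Huawei AppGallery app on a phone needs to be captured and analyzed to find the download endpoint. The mobile endpoint that ultimately provides the APK is:

https://store-drcn.hispace.dbankcloud.com/hwmarket/api/clientApi

The request body for this endpoint is extremely long (a single form-encoded string containing the placeholders {ts} and {appid}), but fortunately none of its fields are encrypted, as shown below: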

data = 'apsid=1705374325915&brand=google&channelId=background&clientPackage=com.huawei.appmarket&cno=4010001&code=0200&contentPkg=&dataFilterSwitch=&deviceId=a7a6dcc273d12e3f63865060533dec04f24e5042e48a621c1b5e601953e1bb69&deviceIdRealType=4&deviceIdType=9&deviceSpecParams=%7B%22abis%22%3A%22arm64-v8a%2Carmeabi-v7a%2Carmeabi%22%2C%22deviceFeatures%22%3A%22U%2Ccom.verizon.hardware.telephony.lte%2Ccom.verizon.hardware.telephony.ehrpd%2CP%2CB%2C0c%2Ce%2C0J%2Cp%2Ca%2Cb%2C04%2Cm%2Candroid.hardware.wifi.rtt%2Ccom.google.android.feature.PIXEL_2017_EXPERIENCE%2C08%2C03%2CC%2CS%2C0G%2Cq%2Ccom.google.android.feature.PIXEL_2018_EXPERIENCE%2CL%2C2%2C6%2Ccom.google.android.feature.GOOGLE_BUILD%2CY%2C0M%2Candroid.hardware.vr.high_performance%2Cf%2C1%2C07%2C8%2C9%2Candroid.hardware.sensor.hifi_sensors%2Candroid.hardware.strongbox_keystore%2CO%2CH%2Ccom.google.android.feature.TURBO_PRELOAD%2Candroid.hardware.vr.headtracking%2CW%2Cx%2CG%2Co%2C06%2C0N%2Ccom.google.android.feature.PIXEL_EXPERIENCE%2C3%2CR%2Cd%2CQ%2Cn%2Candroid.hardware.telephony.carrierlock%2Cy%2CT%2Ci%2Cr%2Cu%2Ccom.google.android.feature.WELLBEING%2Cl%2C4%2C0Q%2CN%2Candroid.software.device_id_attestation%2CM%2C01%2C09%2CV%2C7%2C5%2C0H%2Cg%2Cs%2Cc%2CF%2Ct%2C0L%2C0W%2Ccom.google.hardware.camera.easel_2018%2Ccom.google.android.apps.dialer.SUPPORTED%2C0X%2Ck%2C00%2Ccom.google.android.feature.GOOGLE_EXPERIENCE%2Ccom.google.android.feature.EXCHANGE_6_2%2Ccom.google.android.apps.photos.PIXEL_2018_PRELOAD%2Candroid.hardware.sensor.assist%2Ccom.google.android.feature.DREAMLINER%2Candroid.hardware.audio.pro%2CK%2CE%2C02%2CI%2CJ%2Cj%2CD%2Ch%2Candroid.hardware.wifi.aware%2CX%2Cv%22%2C%22dpi%22%3A560%2C%22openglExts%22%3A%221%2C0y%2C1d%2C1e%2C1f%2C0c%2CM%2CO%2C0q%2C0r%2C0s%2CB%2CA%2C5%2C4%2C0p%2CD%2CE%2C0m%2C0n%2C7%2C0i%2C0j%2C0k%2C8%2C0t%2C0w%2C1g%2C1o%2C1m%2C1n%2CH%2C0u%2C1i%2C1l%22%2C%22preferLan%22%3A%22zh%22%2C%22usesLibrary%22%3A%225%2C6%2Ccom.vzw.apnlib%2C09%2C0D%2Ccom.google.android.camera.experimental2018%2CA%2C0E%2C0L%2
Ccom.google.android.poweranomalydatamodeminterface%2C9%2C2%2Cb%2C0C%2Ccom.android.ims.rcsmanager%2CE%2C1%2Ccom.verizon.embms%2Ccom.qualcomm.qti.uim.uimservicelibrary%2Ccom.google.android.lowpowermonitordevicefactory%2Ccom.google.android.lowpowermonitordeviceinterface%2Ccom.google.android.poweranomalydatafactory%2C0H%2C08%2C7%2Ccom.google.android.dialer.support%2Ccom.google.vr.platform%2CF%2Ccom.verizon.provider%2Cd%2C0A%2Ccom.google.android.hardwareinfo%2CB%2C0K%2CC%2C07%22%7D&fid=0&globalTrace=null&gradeLevel=0&gradeType=&hardwareType=0&isSupportPage=1&manufacturer=Google&maxResults=25&method=client.getTabDetail&net=1&outside=0&recommendSwitch=1&reqPageNum=1&roamingTime=0&runMode=2&serviceType=0&shellApkVer=0&sid=1705386105840&sign=h9001090fp00010920000000000001000a0000000600100000011190000010000040240b0100001001000%40000F6C248B294FAD99F665A13B23795D&thirdPartyPkg=com.huawei.appmarket&translateFlag=1&ts={ts}&uri=app%7C{appid}__HiAd__fed345ef91e24079b46e93c92d7e7d4e__cds_111122701__21__null____null%3Faglocation%3D%257B%2522cres%2522%253A%257B%2522lPos%2522%253A0%252C%2522lid%2522%253A%2522905310%2522%252C%2522pos%2522%253A21%252C%2522relResId%2522%253A%2522{appid}%2522%252C%2522relType%2522%253A%2522app%2522%252C%2522resid%2522%253A%2522{appid}%2522%252C%2522rest%2522%253A%2522app%2522%252C%2522tid%2522%253A%2522dist_83ae03cda9994714a499347e25bbccd1%2522%257D%252C%2522ftid%2522%253A%2522dist_25f5040f852a48c1b3bb52c3a99f4cf3%2522%252C%2522pres%2522%253A%257B%2522lPos%2522%253A0%252C%2522pos%2522%253A1%252C%2522resid%2522%253A%252283ae03cda9994714a499347e25bbccd1%2522%252C%2522rest%2522%253A%2522tab%2522%252C%2522tid%2522%253A%2522dist_25f5040f852a48c1b3bb52c3a99f4cf3%2522%257D%257D%26templateId%3D1428744958c44b80aad5091114d7fce4%26maple%3D0%26trackId%3D0%26attribution%3D%257B%2522taskid%2522%253A%2522910297447%2522%252C%2522subTaskId%2522%253A%25229102974470000%2522%252C%2522RTAID%2522%253A%2522%2522%252C%2522callback%2522%253A%2522security%253A44D8410ED7A32A50E8AABF
D6%253A03EC36808D57EC58A89C207497BB1CED2B6192ED09D44BAC9D825C50BF49E14A95A9E0E6281F83A1DA4C1F2E6D7A8A%2522%252C%2522channel%2522%253A%25220%2522%257D%26appoid%3Ddde9f2d763154f37acc76b8620d195a9%26listId%3Dcds_111122701%26listIdType%3Dcds%26adFlag%3DHiAd%26cdrInfo%3D20240116142810el21s692259%255E%257BopType%257D%255E910297447%255E{appid}%255Ecds_111122701%255E21%255E608743720197185117_p21%255E608743720197185117%255E14effd3e76d34d76d6aa4185d8d08e5bb340dfdcb8ce0227285d09b8c9001ce2f315fce6f842b5449b4eac5ebebbd78bf7f6f470baaadc081ae820e43b6653825f66a6e7b6f2892cd760d063df840dd9954254fb48fa6a9f19259d68c7170c54%255Ectr%255Efn5-Y29tLmh1YXdlaS5hcHBtYXJrZXR-LTF-MTcwNTMxMDUxODQ0OQ%255E2024-01-16%2B14%253A28%253A10%255E%255EU8800%255E0.000126%255E0%255E2.0%255E2.0%255E9102974470000%255E900011964%255E%255E%255E%255E%255E11.4.1.303%255E1705374325914%255E0%255E1%255E%255E%255E%255E9%255E00000000-0000-0000-0000-000000000000%255E1%255E%255E%255ECN%255ECNY%255E243728192482956951%255Eappgallery%255E%255E%255E%255E2ea67221765846d69b880f43a78a7ab3%255ECPD%255E69%255E%255E%255E243728192570833732%255EWIFI%255EMA%255E%255EZXhwU3RyYXRlZ3k9I2V4cFBhcmFtcz0xMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnkjYnVja2V0SWQ9I2V4cERlcz0xMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnkjZXhwSWQ9MTMxN19WMi5hZ18xMzE3X1YyX2FsbF90YXJnZXRfb3B0aW1pemF0aW9uX29uZWZvcmFsbHwxMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnl8MTMwN19WMi5hZ18xMzA3X1YyX2NvbnRyb2xsZXI%255E%255E%255E%255E1%255E1%255E4010001%255E%255E3G%255E%255E0%255E%255E%255E%255E%255E%255E0%255E1%255E1027162%255E%255E1%255EP8%255E%255E%255Ecom.huawei.appmarket%255E%255E%255E%255E%255E%255E%255E%255E%255Ecom.huawei.appmarket%255E0.000252%255E%255E%255E20240116142810el21s692259%255E20240116142810el21s692182%255E1%255E0%255E%255E%255Esign%253A987064c954dab704e2b30dfe17f5ca63dae4245392d4fff32e75670e44da8998%26version%3D2%26phase%3D0%26requestId%3D2ea67221765846d69b880f43a78a7
ab3%26algExp%3D%257B%2522expStrategy%2522%253A%2522246353%2522%252C%2522expParams%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%252C%2522bucketId%2522%253A%252236%2522%252C%2522expDes%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%252C%2522expId%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%257D&ver=1.1'
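
The captured body contains `{ts}` and `{appid}` placeholders, and its other braces appear only in percent-encoded form, so `str.format` looks like a workable way to fill it in. The sketch below rests on several assumptions that the article does not confirm: that `ts` is a millisecond Unix timestamp (it resembles the 13-digit `apsid` and `sid` values), that the request is a form-encoded POST, and that `session` is a `requests.Session` supplied by the caller. The function names are made up.

```python
import time
from typing import Any, Optional

CLIENT_API = "https://store-drcn.hispace.dbankcloud.com/hwmarket/api/clientApi"


def build_client_body(template: str, appid: str, ts_ms: Optional[int] = None) -> str:
    """Fill the {ts} and {appid} placeholders of the captured body."""
    if ts_ms is None:
        # Assumption: ts is a millisecond Unix timestamp.
        ts_ms = int(time.time() * 1000)
    return template.format(ts=ts_ms, appid=appid)


def fetch_mobile_detail(session: Any, template: str, appid: str) -> Any:
    """POST the filled body; `session` is expected to be a requests.Session."""
    body = build_client_body(template, appid)
    # Assumption: the endpoint expects an ordinary form-encoded body.
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    response = session.post(CLIENT_API, data=body, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()
```

Here `template` would be the full `data` string shown above, joined back into a single line. Keeping the body construction separate from the network call makes it easy to check the substitution before sending anything.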

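Requesting the mobile detail data with an app's appid shows that the response contains the APK download link, as shown below:

[Image]

Finally, here is the complete app scraper running, along with the data it collects:

[Image]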
