
Quickly and Conveniently Download Hugging Face Models and Datasets

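  • Method 1: A CLI tool for downloading Hugging Face models and datasets with aria2/wget + git
    • Features
    • Usage
  • Method 2: Model download [personal usage notes]
    • Preserving the directory structure
    • Downloading datasets
    • Limitations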

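Method 1: A CLI tool for downloading Hugging Face models and datasets with aria2/wget + git

Source: https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f.

How to use it: copy hfd.sh to your machine, then follow the reference commands below to download a dataset or a model.

🤗 Hugging Face Model Downloader

The official huggingface-cli lacks multi-threaded download support, and hf_transfer has inadequate error handling. This command-line tool works around both by using wget or aria2 to fetch the LFS files and git clone to fetch everything else.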

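Features

  • ⏯️ Resume from breakpoint: you can press Ctrl+C or re-run the command at any time.
  • 🚀 Multi-threaded download: uses multiple threads to speed up downloads.
  • 🚫 File exclusion: use --exclude or --include to skip or select files, which saves time on models that ship duplicate formats (for example, *.bin or *.safetensors).
  • 🔐 Authentication support: for gated models that require a Hugging Face login, authenticate with --hf_username and --hf_token.
  • 🪞 Mirror site support: set it through the HF_ENDPOINT environment variable.
  • 🌍 Proxy support: set it through the HTTPS_PROXY environment variable.
  • 📦 Simple: depends only on git and aria2c/wget.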
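The include/exclude behaviour can be illustrated with a small Python sketch. It is not part of hfd.sh; the function name should_download is hypothetical, and fnmatchcase is used here as a stand-in for the shell-style wildcard match that the script performs on each file path.

```python
from fnmatch import fnmatchcase
from typing import Optional


def should_download(path: str,
                    include: Optional[str] = None,
                    exclude: Optional[str] = None) -> bool:
    """Return True if `path` passes the include/exclude patterns.

    Hypothetical helper: mirrors the idea of --include/--exclude,
    using fnmatch-style wildcards on the whole relative path.
    """
    # With an include pattern, anything that does not match is skipped.
    if include and not fnmatchcase(path, include):
        return False
    # With an exclude pattern, anything that matches is skipped.
    if exclude and fnmatchcase(path, exclude):
        return False
    return True


if __name__ == "__main__":
    # Illustrative file names only, not the listing of a real repo.
    files = ["pytorch_model.bin", "model.safetensors", "vae/config.json"]
    print([f for f in files if should_download(f, exclude="*.safetensors")])
    print([f for f in files if should_download(f, include="vae/*")])
```

Matching on the full relative path is what lets a pattern such as vae/* select a whole subfolder.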

Usage

First, download hfd.sh or clone the repository, then make the script executable.

chmod a+x hfd.sh

For convenience, you can create an alias:

alias hfd="$PWD/hfd.sh"

Help output:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model that requires login

Get a Hugging Face token from https://huggingface.co/settings/tokens, then run:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
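For readers who want to see what token authentication amounts to on the file-download side, here is a minimal Python sketch. It assumes the token is sent as a Bearer header, which is how hfd.sh passes it to wget and aria2c; the helper name build_authorized_request and the file name in the URL are hypothetical, and no network request is made.

```python
import urllib.request


def build_authorized_request(url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying a Bearer token header."""
    request = urllib.request.Request(url)
    # Same header shape the script passes via --header to wget/aria2c.
    request.add_header("Authorization", f"Bearer {token}")
    return request


if __name__ == "__main__":
    # FILE_NAME is a placeholder, not a real file in the repository.
    url = "https://huggingface.co/meta-llama/Llama-2-7b/resolve/main/FILE_NAME"
    req = build_authorized_request(url, "YOUR_HF_TOKEN")
    print(req.get_header("Authorization"))
```

Sending the request would additionally require that the account behind the token has been granted access to the gated repository.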

Download a model and exclude certain files (for example, .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors

Download with aria2c and multiple threads (aria2c is the default tool and uses 4 threads unless -x is given):

hfd bigscience/bloom-560m

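Output
During the download, the file URLs are displayed. Commands for files skipped by --include or --exclude are printed commented out with a leading #: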

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
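The URLs in this output follow a fixed pattern, which the sketch below reproduces in Python. It is an illustration only: resolve_url is a hypothetical helper, the main branch is hard-coded as in hfd.sh, and the default endpoint of https://huggingface.co is an assumption (the copy of hfd.sh in this post falls back to https://hf-mirror.com when HF_ENDPOINT is unset).

```python
import os
from typing import Optional

# Assumed default; override with the HF_ENDPOINT environment variable.
DEFAULT_ENDPOINT = "https://huggingface.co"


def resolve_url(repo_id: str,
                filename: str,
                dataset: bool = False,
                endpoint: Optional[str] = None) -> str:
    """Build the download URL for one file on the main branch."""
    base = endpoint or os.environ.get("HF_ENDPOINT") or DEFAULT_ENDPOINT
    # Dataset repositories live under a "datasets/" path prefix.
    prefix = "datasets/" if dataset else ""
    return f"{base.rstrip('/')}/{prefix}{repo_id}/resolve/main/{filename}"


if __name__ == "__main__":
    print(resolve_url("bigscience/bloom-560m", "flax_model.msgpack"))
    print(resolve_url("lavita/medical-qa-shared-task-v1-toy",
                      "FILE_NAME",  # placeholder file name
                      dataset=True))
```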
# Install packages
apt update
apt-get install aria2
apt-get install iftop
apt-get install git-lfs
# Reference command
bash /xxx/xxx/hfd.sh mmaaz60/ActivityNet-QA-Test-Videos --tool aria2c -x 16 --dataset --local-dir /xxx/xxx/ActivityNet
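If the script is driven from Python rather than typed by hand, the same command can be assembled as an argument list. The sketch below is one possible wrapper, not something shipped with hfd.sh: build_hfd_command is a hypothetical name, and only flags listed in the help output above are used. Passing a list to subprocess.run also keeps patterns such as *.safetensors from being expanded by a shell.

```python
import shlex
from typing import List, Optional


def build_hfd_command(script: str,
                      repo_id: str,
                      tool: str = "aria2c",
                      threads: int = 4,
                      dataset: bool = False,
                      local_dir: Optional[str] = None,
                      exclude: Optional[str] = None) -> List[str]:
    """Assemble the argv list for one hfd.sh invocation."""
    cmd = ["bash", script, repo_id, "--tool", tool, "-x", str(threads)]
    if dataset:
        cmd.append("--dataset")
    if local_dir:
        cmd += ["--local-dir", local_dir]
    if exclude:
        cmd += ["--exclude", exclude]
    return cmd


if __name__ == "__main__":
    # Paths are placeholders, as in the reference command above.
    cmd = build_hfd_command("/xxx/xxx/hfd.sh",
                            "mmaaz60/ActivityNet-QA-Test-Videos",
                            threads=16, dataset=True,
                            local_dir="/xxx/xxx/ActivityNet")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```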

hfd.sh

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"

    GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"
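The script obtains its list of large files with git lfs ls-files | cut -d ' ' -f 3-, that is, it drops the first two space-separated fields of each line and keeps the rest as the path. A rough Python equivalent of that one step is sketched below; parse_lfs_ls_files is a hypothetical name, and the sample object IDs are invented for illustration.

```python
from typing import List


def parse_lfs_ls_files(output: str) -> List[str]:
    """Extract file paths from `git lfs ls-files` output.

    Each line looks like "<oid> <marker> <path>", where the marker
    is "*" or "-". Everything after the second space is the path.
    """
    paths = []
    for line in output.splitlines():
        # maxsplit=2 keeps spaces inside the path intact, like `cut -f 3-`.
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[2]:
            paths.append(parts[2])
    return paths


if __name__ == "__main__":
    # Invented object IDs; real ones are hex prefixes of the LFS OID.
    sample = (
        "0123456789 - flax_model.msgpack\n"
        "abcdef0123 - onnx/decoder_model.onnx\n"
    )
    print(parse_lfs_ls_files(sample))
```

Splitting at most twice matters for paths that contain spaces, which would otherwise be cut into several pieces.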

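Method 2: Model download [personal usage notes]

This script does not preserve the repository's directory structure; see the improved version below.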

import datetime
import os
import threading

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log its start and end times
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Fetch the repo information
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the list of files in the repo
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Get the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Resumable download command
        cmds.append(f'wget -c {url} -P {save_path}')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())

    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()

    print("Program finished at %s" % datetime.datetime.now())
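Why the script above loses the directory structure can be seen without downloading anything: every file is handed to wget with the same -P directory, so only the last path component is expected to survive. The sketch below assumes that naming behaviour and reports which repo files would land on the same local path; the helper names and the sample file list are hypothetical.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, List


def flat_targets(save_path: str,
                 repo_files: Iterable[str]) -> Dict[str, List[str]]:
    """Map each flattened local path to the repo files that land on it."""
    targets = defaultdict(list)
    for repo_file in repo_files:
        # Only the base name is kept when all files share one directory.
        local = posixpath.join(save_path, posixpath.basename(repo_file))
        targets[local].append(repo_file)
    return dict(targets)


def collisions(save_path: str,
               repo_files: Iterable[str]) -> Dict[str, List[str]]:
    """Return only the local paths claimed by more than one repo file."""
    return {local: sources
            for local, sources in flat_targets(save_path, repo_files).items()
            if len(sources) > 1}


if __name__ == "__main__":
    # Hypothetical listing with the same file name in two folders.
    files = ["config.json", "onnx/config.json", "model.safetensors"]
    print(collisions("./out", files))
```

Creating one local subdirectory per repo folder, as the next script does, removes such collisions.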

Preserving the directory structure

import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log its start and end times
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the repo information
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the list of files in the repo
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Get the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)

        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)

        # Resumable download command, saved into the file's own subdirectory
        cmds.append(f'wget -c {url} -P {local_dir}')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())

    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()

    print("Program finished at %s" % datetime.datetime.now())

Downloading datasets

import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import HfApi, hf_hub_url
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log its start and end times
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # ID of the dataset to download
    dataset_id = "openai/webtext"
    # Local storage path
    save_path = './webtext'

    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the dataset information (the parameter is named repo_id)
    _api = HfApi()
    dataset_info = _api.dataset_info(
        repo_id=dataset_id,
        revision='main',
        token=None,
    )

    # Collect the list of files in the dataset repo
    filtered_dataset_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in dataset_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_dataset_files:
        # Get the download URL; repo_type="dataset" selects the datasets/ path
        url = hf_hub_url(repo_id=dataset_id, filename=file, repo_type="dataset")

        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)

        # Resumable download command, saved into the file's own subdirectory
        cmds.append(f'wget -c {url} -P {local_dir}')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())

    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()

    print("Program finished at %s" % datetime.datetime.now())

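Limitations

Repositories that require authorization are not supported.

The scripts start one thread per file, so a repository with many files may spawn a large number of threads.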
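One way to cap the thread count, sketched here as a suggestion rather than as part of the scripts above, is to hand the command list to a fixed-size pool from the standard library. run_bounded is a hypothetical helper; the worker argument could be the execCmd function defined earlier, and the demo uses a harmless stand-in so that nothing is downloaded.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_bounded(tasks: Iterable[T],
                worker: Callable[[T], R],
                max_workers: int = 4) -> List[R]:
    """Run `worker` on every task using at most `max_workers` threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the results in the same order as the inputs.
        return list(pool.map(worker, tasks))


if __name__ == "__main__":
    # Stand-in commands; in the scripts above these would be wget commands.
    cmds = [f"echo file-{i}" for i in range(10)]
    # Stand-in worker; replace with execCmd to actually run the commands.
    results = run_bounded(cmds, lambda cmd: len(cmd), max_workers=4)
    print(results)
```

With ten commands and max_workers=4, at most four run at the same time and the rest wait in the pool's queue.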


