
Deploying the Baichuan2 Large Language Model

Source: original, 2024/5/20 23:50:10

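Baichuan2 is the new generation of open-source large language models from Baichuan Inc. (百川智能), trained on a high-quality corpus of 2.6 trillion tokens. It achieves the best results among models of the same size on several authoritative Chinese, English, and multilingual benchmarks, both general and domain-specific. The release includes Base and Chat versions at 7B and 13B, and 4-bit quantized variants of the Chat versions.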

Model Download

Base Models

Baichuan2-7B-Base

https://huggingface.co/baichuan-inc/Baichuan2-7B-Base

Baichuan2-13B-Base

https://huggingface.co/baichuan-inc/Baichuan2-13B-Base

Aligned (Chat) Models

Baichuan2-7B-Chat

https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat

Baichuan2-13B-Chat

https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat

Aligned (Chat) Models, 4-bit Quantized

Baichuan2-7B-Chat-4bits

https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat-4bits

Baichuan2-13B-Chat-4bits

https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits

Clone the Code

git clone https://github.com/baichuan-inc/Baichuan2

Install Dependencies

cd Baichuan2
pip install -r requirements.txt

Usage

Calling from Python

Chat model inference example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

# trust_remote_code=True is required because the model ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-13B-Chat", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-13B-Chat", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
# Load the generation settings published with the checkpoint.
model.generation_config = GenerationConfig.from_pretrained("baichuan-inc/Baichuan2-13B-Chat")

# The conversation is a list of {"role", "content"} dictionaries.
messages = []
# Prompt: "Explain the saying 'review the old to learn the new'."
messages.append({"role": "user", "content": "解释一下“温故而知新”"})
response = model.chat(tokenizer, messages)
print(response)
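A multi-turn conversation can be kept by appending each reply to the same messages list before the next call. The sketch below illustrates that bookkeeping only. chat_turn and fake_chat are hypothetical names introduced here, fake_chat stands in for model.chat so the snippet runs without downloading weights, and the "assistant" role name is an assumption about what the model's chat code expects.

```python
def chat_turn(chat_fn, messages, user_text):
    """Append the user message, get a reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = chat_fn(messages)
    # Assumption: replies are stored under the "assistant" role.
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_chat(messages):
    """Hypothetical stand-in for: lambda msgs: model.chat(tokenizer, msgs)."""
    return f"(reply to: {messages[-1]['content']})"


history = []
chat_turn(fake_chat, history, "first question")
chat_turn(fake_chat, history, "follow-up question")
for message in history:
    print(message["role"], "->", message["content"])
```

With a real model loaded, passing lambda msgs: model.chat(tokenizer, msgs) as chat_fn should give the same flow.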

Base model inference example:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-13B-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-13B-Base", device_map="auto", trust_remote_code=True)
# Few-shot prompt in "poem title->poet" form; the model is expected to complete the second poet.
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
# Move the input tensors to the first GPU.
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
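The Base model has no chat template, so the prompt above works by pattern completion: one solved "title->author" pair followed by an unfinished one. A minimal sketch of building such a prompt from a list of examples is shown here. build_few_shot_prompt is a hypothetical helper, not part of the Baichuan2 repository.

```python
def build_few_shot_prompt(examples, query, sep="->"):
    """Join solved (source, target) pairs and leave the last target open."""
    lines = [f"{source}{sep}{target}" for source, target in examples]
    lines.append(f"{query}{sep}")
    return "\n".join(lines)


prompt = build_few_shot_prompt([("登鹳雀楼", "王之涣")], "夜雨寄北")
print(repr(prompt))
```

The resulting string can be passed to the tokenizer in place of the hard-coded literal. Adding more solved pairs generally makes the intended pattern clearer to a base model.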

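Loading the model with device_map='auto' uses all available GPUs.

To restrict which devices are used, set an environment variable before launching, for example export CUDA_VISIBLE_DEVICES=0,1 (which uses GPUs 0 and 1).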
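The same restriction can be applied from inside a Python script, as long as the variable is set before CUDA is first initialized. In practice that means before importing torch or loading the model. This is a sketch, and restrict_visible_gpus is a hypothetical helper name.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Set CUDA_VISIBLE_DEVICES; call this before any CUDA initialization."""
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


visible = restrict_visible_gpus([0, 1])
print("CUDA_VISIBLE_DEVICES =", visible)
# Import torch / transformers and load the model only after this point.
```

Inside the process the selected GPUs are renumbered from 0, so 'cuda:0' refers to the first entry in the list.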

Command Line

python cli_demo.py

This command-line tool is designed for chat scenarios and cannot be used to call the Base models.

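Web Demo

Run the following command with streamlit to start a local web service, then open the address printed in the console in a browser.

streamlit run web_demo.py

This web demo is designed for chat scenarios and cannot be used to call the Base models.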

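Quantization

Baichuan2 supports two quantization modes: online and offline.

Online Quantization

For online quantization, Baichuan2 supports 8-bit and 4-bit quantization. Usage is similar to the Baichuan-13B project: first load the model into CPU memory, then call the quantize() interface, and finally call cuda() to copy the quantized weights to GPU memory. The code for the whole loading process is short. Taking Baichuan2-7B-Chat as an example: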

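8-bit online quantization:

import torch
from transformers import AutoModelForCausalLM

# Load to CPU memory first (no device_map), then quantize and move to the GPU.
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(8).cuda()

4-bit online quantization (same imports as above):

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(4).cuda()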

Note that users usually pass device_map="auto" to from_pretrained. For online quantization this argument must be removed, otherwise an error is raised.
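To get a feel for what quantization saves, the weight storage can be estimated as parameter count times bits per weight. The sketch below uses a nominal 7 billion parameters as an assumption (the exact count of the 7B checkpoints differs somewhat) and ignores activations, the KV cache, and any per-group scale overhead of the quantization scheme, so real GPU memory use will be higher.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


NOMINAL_7B = 7_000_000_000  # assumption: round figure, not the exact parameter count

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NOMINAL_7B, bits):.1f} GiB")
```

Halving the bit width halves the estimate, which is why a 4-bit model can fit on GPUs that cannot hold the float16 weights.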

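Offline Quantization

For convenience, Baichuan2 provides a pre-quantized 4-bit version, Baichuan2-7B-Chat-4bits, for download. Loading it takes a single call:

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat-4bits", device_map="auto", trust_remote_code=True)

Baichuan2 does not provide a pre-quantized 8-bit version, because the Hugging Face transformers library already offers an API that makes it easy to save and load 8-bit quantized models. Users can save and load an 8-bit model themselves as follows, where model_id and quant8_saved_dir are placeholders for the source model and the output directory:

# Quantize to 8 bits while loading, then save the quantized weights.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto", trust_remote_code=True)
model.save_pretrained(quant8_saved_dir)
# Later, load the saved 8-bit model directly.
model = AutoModelForCausalLM.from_pretrained(quant8_saved_dir, device_map="auto", trust_remote_code=True)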

CPU Deployment

Baichuan2 models support CPU inference, but CPU inference is comparatively slow. Change the model loading call as follows:

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float32, trust_remote_code=True)

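Fine-Tuning

Installing Dependencies

git clone https://github.com/baichuan-inc/Baichuan2.git
cd Baichuan2/fine-tune
pip install -r requirements.txt

To use lightweight fine-tuning methods such as LoRA, also install peft.

To use xFormers for training acceleration, also install xFormers.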

Single-Node Training

hostfile=""
deepspeed --hostfile=$hostfile fine-tune.py \
    --report_to "none" \
    --data_path "data/belle_chat_ramdon_10k.json" \
    --model_name_or_path "baichuan-inc/Baichuan2-7B-Base" \
    --output_dir "output" \
    --model_max_length 512 \
    --num_train_epochs 4 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --save_strategy epoch \
    --learning_rate 2e-5 \
    --lr_scheduler_type constant \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed ds_config.json \
    --bf16 True \
    --tf32 True
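The batch-related flags in this command combine into an effective (global) batch size of per-device batch size times gradient accumulation steps times number of GPUs. The sketch below works through that arithmetic. The GPU count of 8 and the dataset size of 10,000 samples are assumptions for illustration, not values stated in the command.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer step across all GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus


def approx_steps_per_epoch(num_samples, per_device_batch, grad_accum_steps, num_gpus):
    """Rough optimizer steps per epoch, counting a final partial batch."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)
    return math.ceil(num_samples / batch)


NUM_GPUS = 8          # assumption: one node with 8 GPUs
NUM_SAMPLES = 10_000  # assumption: illustrative dataset size

batch = effective_batch_size(16, 1, NUM_GPUS)
steps = approx_steps_per_epoch(NUM_SAMPLES, 16, 1, NUM_GPUS)
print(f"effective batch size: {batch}, about {steps} steps per epoch")
```

If the per-device batch size of 16 does not fit in GPU memory, lowering it and raising --gradient_accumulation_steps by the same factor keeps the effective batch size unchanged.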

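Lightweight Fine-Tuning

The code already supports lightweight fine-tuning methods such as LoRA. To enable it, add the following argument to the script above:

--use_lora True

See the fine-tune.py script for the detailed LoRA configuration.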
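LoRA is lightweight because it freezes each adapted weight matrix of shape d_out x d_in and trains two low-rank factors with rank r, which adds r * (d_in + d_out) parameters per matrix. The sketch below compares that with the full matrix. The 4096 x 4096 shape and rank 8 are illustrative assumptions, not values read from fine-tune.py.

```python
def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds for one adapted weight matrix."""
    return rank * (d_in + d_out)


D_IN, D_OUT, RANK = 4096, 4096, 8  # assumptions for illustration only

full = D_IN * D_OUT
extra = lora_extra_params(D_IN, D_OUT, RANK)
print(f"full matrix: {full} params, LoRA adds: {extra} ({extra / full:.2%})")
```

The trainable fraction grows linearly with the rank, so a higher rank trades more memory and compute for more adapter capacity.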

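After LoRA fine-tuning, the model can be loaded with the following code:

from peft import AutoPeftModelForCausalLM

# "output" is the --output_dir used in the training command.
model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)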
