
TensorRT-LLM Advanced Usage

--multi_block_mode

Decoding phase: one new token is generated per step.

Normally: the computation is divided evenly across all SMs by batch sample and by attention head.

When batch_size * num_heads is small relative to the number of SMs, some SMs sit idle. With --multi_block_mode, the input context appears to be partitioned further, so the work that one SM would have done alone is split across several SMs, keeping all SMs busy in parallel.

Further evidence:

"we only use multi-block in generation phase (generating new token). In context phase, we have enough blocks to run in parallel and we don't need to use multi-block."
"take H100-SXM as an example, you have 132 SMs, and let us say the batch size is 1, num heads is 16, then normally we can split the sequence into (132/16 = 8) blocks to fully utilize all SMs, but if the sequence length is quite small like 1K, it might not worth 8 blocks per sequence (maybe fewer)."

Both the Meta (LLaMA) checkpoint format and the Hugging Face format are supported.

For the Meta LLaMA format, use --meta_ckpt_dir:

# Build LLaMA v3 70B TP=8 using Meta checkpoints directly.
python convert_checkpoint.py --meta_ckpt_dir ./tmp/llama/70B/ \
    --output_dir ./tllm_checkpoint_8gpu_tp8 \
    --dtype float16 \
    --tp_size 8

For the Hugging Face format, use --model_dir:

# Build LLaMA v3 70B using 4-way tensor parallelism and 2-way pipeline parallelism.
python convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \
    --output_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --dtype float16 \
    --tp_size 4 \
    --pp_size 2

Analysis of GPU memory usage during inference

Total memory = (Model size + KV cache size + Activation memory) / Parallelism

where

  • The model size is the number of parameters * the size of data type.
  • The KV cache size is the total number of tokens * the size of KV cache data type * the number of layers * the KV hidden dimension
  • The activation memory is determined by TRT engine, which can be a few GBs regardless of the degree of parallelism used

For LLaMA v2 70B FP16 weights + FP8 KV cache, the model size is 70B parameters * 2 bytes = 140GB. The KV cache size is 32K tokens * 1 byte * 80 layers * 2048 KV hidden dimension = 5GB per 32K tokens. We have 145GB spread across 8 GPUs. The end result is ~18GB per GPU, plus a few GB of flat scratch/activation memory allocated by the TRT engine and the TRT-LLM runtime.

Note that the KV hidden dimension is derived from the number of KV heads times the hidden dimension of each head. LLaMA v2 70B has a hidden dimension of 8192, and uses grouped-query attention where 8 key heads and 8 value heads are associated with 64 query heads. Each head has a hidden dimension of 8192/64 = 128. So the hidden dimension for KV in total is 128 * 8 * 2 = 2048 (the factor of 2 accounts for K and V).

The total number of tokens is determined by beam width, batch size, and maximum sequence length.
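A small sketch reproducing the estimate above with the LLaMA v2 70B numbers (illustrative arithmetic following the formula, not a measurement):

# Back-of-the-envelope memory estimate for LLaMA v2 70B, FP16 weights + FP8 KV cache.
params       = 70e9
weight_bytes = 2                  # FP16
kv_bytes     = 1                  # FP8
num_layers   = 80
num_kv_heads = 8
head_dim     = 8192 // 64         # 128
kv_hidden    = head_dim * num_kv_heads * 2      # *2 for K and V -> 2048
tokens       = 32 * 1024
tp_size      = 8

model_size = params * weight_bytes                       # ~140 GB
kv_size    = tokens * kv_bytes * num_layers * kv_hidden  # ~5 GB per 32K tokens
per_gpu_gb = (model_size + kv_size) / tp_size / 1e9      # ~18 GB per GPU, before activations
print(per_gpu_gb)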

--use_paged_context_fmha: appears to enable paging of the KV cache.

--enable_kv_cache_reuse: when several inference requests share a long identical prompt prefix, the KV cache computed for that prefix can be reused by the other requests.
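A conceptual sketch of the idea (not TensorRT-LLM's actual implementation; the block size, the dictionary cache, and kv_for_prompt are all hypothetical):

# Conceptual sketch of prefix-based KV cache reuse; kv_block_cache and
# kv_for_prompt are illustrative placeholders, not TensorRT-LLM internals.
kv_block_cache = {}

def kv_for_prompt(prompt_tokens, block_size=64):
    """Return per-block KV entries, reusing cached blocks for shared prefixes."""
    blocks = []
    for i in range(0, len(prompt_tokens), block_size):
        prefix_key = tuple(prompt_tokens[: i + block_size])
        if prefix_key not in kv_block_cache:
            # Placeholder for the real K/V computation of this block.
            kv_block_cache[prefix_key] = f"kv[{i}:{i + block_size}]"
        blocks.append(kv_block_cache[prefix_key])
    return blocks

# Two requests sharing the same long system prompt reuse its cached KV blocks.
shared_prefix = list(range(512))
kv_for_prompt(shared_prefix + [1, 2, 3])
kv_for_prompt(shared_prefix + [4, 5, 6])   # the prefix blocks hit the cache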

LLaMA 70B does not fit on a single GPU, so use tensor parallelism across 8 GPUs:

git-lfs clone https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k/

python examples/llama/convert_checkpoint.py --model_dir ./Llama-3-70B-Instruct-Gradient-1048k/ \
    --output_dir /tmp/llama-3-70B-1048k/trt_ckpts \
    --dtype float16 \
    --tp_size 8

python -m tensorrt_llm.commands.build --checkpoint_dir /tmp/llama-3-70B-1048k/trt_ckpts \
    --output_dir /tmp/llama-3-70B-1048k/trt_engines \
    --gemm_plugin float16 \
    --max_num_tokens 4096 \
    --max_batch_size 1 \
    --max_seq_len 1048576 \
    --use_paged_context_fmha enable \
    --workers 8

mpirun -n 8 --allow-run-as-root python examples/eval_long_context.py --task passkey \
    --engine_dir /tmp/llama-3-70B-1048k/trt_engines \
    --tokenizer_dir ./Llama-3-70B-Instruct-Gradient-1048k/ \
    --stop_idx 1 \
    --max_input_length 1048566 \
    --enable_chunked_context \
    --max_tokens_in_paged_kv_cache 1100000

In the convert step, tp_size is set to 8.

In the build step, workers is set to 8, so each of the 8 GPUs builds one model partition in parallel, which speeds up the build.

The run uses mpirun -n 8, with each process running one model partition.

INT8 KV cache and INT8 weight-only quantization can be used together:

# Build model with both INT8 weight-only and INT8 KV cache enabled
python convert_checkpoint.py --model_dir ./llama-models/llama-7b-hf \
    --output_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
    --dtype float16 \
    --int8_kv_cache \
    --use_weight_only \
    --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
    --output_dir ./tmp/llama/7B/trt_engines/int8_kv_cache_weight_only/1-gpu \
    --gemm_plugin auto

(At which step is the INT8 KV cache calibration done?)
