
Parsing the KLD Algorithm Implementation in the PPQ Quantization Library

  • Preface
  • PPQ Algorithm Implementation
    • KLD algorithm flow in NVIDIA's slides
    • PPQ's implementation of the KLD algorithm
    • Differences between PPQ and NVIDIA

Preface

This post parses the KLD (KL-divergence) calibration algorithm implemented in the PPQ library. For details on installing and using PPQ, see the previous post in this column.

PPQ Algorithm Implementation

NVIDIA's slide deck, 8-bit Inference with TensorRT, can be found online. The two figures below show the pseudocode for the KLD algorithm:
[Figure: KLD calibration pseudocode from 8-bit Inference with TensorRT (1/2)]
[Figure: KLD calibration pseudocode from 8-bit Inference with TensorRT (2/2)]
PPQ's implementation is shown below; see https://github.com/openppl-public/ppq/blob/master/ppq/quantization/observer/range.py

def hist_to_scale_offset(
        self, histogram: torch.Tensor, hist_bins: int, hist_scale: float,
        config: TensorQuantizationConfig, computing_device: str = OBSERVER_KL_COMPUTING_DEVICE,
        scale_threshold: float=OBSERVER_MIN_SCALE
    ) -> Tuple[float, int]:
        """
        PPQ core quant parameter computing method - Histogram to scale & offset
        With a pre-defined histogram,
        this function will automatically search best clip value
        to minimize KL divergence between quantized result and fp32 input.
        only work for per-tensor symmetrical quantization policy for now.
        see also https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
        Args:
            histogram (torch.Tensor): histogram records activation's statistics.
            hist_bins (int): how many bins are included in histogram(also known as histogram length)
            hist_scale (float): histogram step size. it can be solved by histogram.max_val / histogram.bins
            config (TensorQuantizationConfig): quantization config.
            computing_device (str, optional): computing device. Defaults to 'cpu'.
        Raises:
            ValueError: given quantization config is invalid.
        Returns:
            Tuple[float, int]: scale(fp32) and offset(int).
        """
        if config.policy.has_property(QuantizationProperty.ASYMMETRICAL):
            raise PermissionError('KL observer is not designed for ASYMMETRICAL quantization')
        
        if OBSERVER_MIN_SCALE_MANUL_OVERRIDE in config.detail:
            scale_threshold = config.detail[OBSERVER_MIN_SCALE_MANUL_OVERRIDE]

        # move the histogram to the computing device to speed up computation.
        histogram = histogram.to(computing_device).float()

        # compute symmetrical kl-divergence.
        # Here is a simple example: reference distribution P consisting of 8 bins, we want to quantize into 2 bins:
        # P = [ 1, 0, 2, 3, 5, 3, 1, 7]
        # we merge into 2 bins (8 / 2 = 4 consecutive bins are merged into one bin)
        # [1 + 0 + 2 + 3 , 5 + 3 + 1 + 7] = [6, 16]
        # then proportionally expand back to 8 bins, we preserve empty bins from the original distribution P:
        # Q = [ 6/3, 0, 6/3, 6/3, 16/4, 16/4, 16/4, 16/4] = [ 2, 0, 2, 2, 4, 4, 4, 4]
        # now we should normalize both distributions, after that we can compute KL_divergence
        # P /= sum(P) Q /= sum(Q)
        # result = KL_divergence(P, Q)
        # see also
        # https://github.com/NVIDIA/TensorRT/blob/3835424af081db4dc8cfa3ff3c9f4a8b89844421/tools/pytorch-quantization/pytorch_quantization/calib/histogram.py#L147

        losses, quant_bins = [], 2 ** (config.num_of_bits - 1)

        # following code is crucial, do not move
        histogram[: int(hist_bins * .002)] = 0
        histogram[int(hist_bins * .002)] = 1

        hist_sum = torch.sum(histogram)
        for bin_range in range(quant_bins, hist_bins + quant_bins - 1, quant_bins):
            p_hist = torch.zeros(size=(bin_range, ), dtype=torch.float, device=computing_device)
            p_hist[: bin_range].copy_(histogram[: bin_range])
            p_hist[bin_range - 1] += torch.sum(histogram[bin_range: ])
            p_hist = p_hist / hist_sum

            expand_ratio = int(bin_range / quant_bins)
            q_hist = histogram[: bin_range].clone()
            q_hist = q_hist.reshape((quant_bins, expand_ratio))
            positive_map = q_hist > 0
            positive_cnt = positive_map.sum(axis=1, keepdim=True)
            positive_cnt[positive_cnt == 0] = 1
            q_hist = torch.div(q_hist.sum(axis=1, keepdim=True), positive_cnt)
            q_hist = q_hist.repeat([1, expand_ratio])
            q_hist = q_hist * positive_map
            q_hist = q_hist / torch.sum(q_hist)
            q_hist = q_hist.flatten()

            losses.append({
                'kl': torch_KL_divergence(p_hist, q_hist),
                'bin_range': bin_range
            })

        best_bin_range = sorted(losses, key=lambda x: x['kl'])[0]['bin_range']
        scale, offset = (best_bin_range / self._hist_bins) * hist_scale * (self._hist_bins / quant_bins), 0
        
        if scale < scale_threshold and OBSERVER_WARNING: 
            ppq_warning('Numeric instability detected: '
                        'ppq find there is a scale value < 1e-7, '
                        'which probably cause numeric underflow in further computation.')
        scale = max(scale, scale_threshold)

        if config.policy.has_property(QuantizationProperty.POWER_OF_2):
            scale = ppq_round_to_power_of_2(scale, policy=RoundingPolicy.ROUND_HALF_UP)
        return scale, offset
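The merge-and-expand construction of Q described in the code comment can be reproduced as a standalone sketch (the function name here is mine, not part of PPQ):

```python
import torch

def expand_quantized_bins(hist: torch.Tensor, quant_bins: int) -> torch.Tensor:
    # merge consecutive bins into quant_bins groups, then spread each group's
    # average back over its positive positions, preserving the empty bins
    expand_ratio = hist.numel() // quant_bins
    q = hist.reshape(quant_bins, expand_ratio)
    positive = q > 0
    cnt = positive.sum(dim=1, keepdim=True).clamp(min=1)
    q = (q.sum(dim=1, keepdim=True) / cnt).repeat(1, expand_ratio) * positive
    return q.flatten()

# the 8-bin example from the comment, quantized into 2 bins
P = torch.tensor([1., 0., 2., 3., 5., 3., 1., 7.])
Q = expand_quantized_bins(P, quant_bins=2)
print(Q.tolist())  # [2.0, 0.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0]
```

Note how bin 1, which was empty in P, stays empty in Q, so the first group's mass of 6 is spread over its 3 positive bins (6/3 = 2) rather than all 4.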

KLD algorithm flow in NVIDIA's slides

Overall flow: loop i from 128 to 2048, using i as the clipping index to truncate the histogram (the i-th bar is discarded as well); build P and Q, compute the KL divergence for each (P, Q) pair, and take the threshold whose divergence is minimal.
Input: a histogram with 2048 bins
Output: the clipping threshold, a floating-point number
[Figure: NVIDIA's KLD calibration flow (1/2)]
[Figure: NVIDIA's KLD calibration flow (2/2)]
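The flow above can be sketched in a few lines of PyTorch. This is a simplified illustration, not NVIDIA's code: to keep the merge step an exact reshape, only clip points that are multiples of the target bin count are tried, whereas the slides' pseudocode tries every i from 128 to 2048; the small eps guards against empty Q bins.

```python
import torch

def kl_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-10) -> float:
    # KL(P || Q) over the bins where P > 0
    mask = p > 0
    return float((p[mask] * (p[mask] / (q[mask] + eps)).log()).sum())

def kld_search(hist: torch.Tensor, target_bins: int = 128) -> int:
    """Return the best clip index m; the float threshold is (m + 0.5) * bin_width."""
    best_i, best_kl = target_bins, float('inf')
    for i in range(target_bins, hist.numel() + 1, target_bins):
        p = hist[:i].clone()
        p[-1] += hist[i:].sum()          # fold the clipped tail into the last bin
        p = p / p.sum()

        ratio = i // target_bins
        q = hist[:i].reshape(target_bins, ratio)
        positive = q > 0
        cnt = positive.sum(dim=1, keepdim=True).clamp(min=1)
        q = (q.sum(dim=1, keepdim=True) / cnt).repeat(1, ratio) * positive
        q = (q / q.sum()).flatten()

        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_i = kl, i
    return best_i
```

Calling `kld_search` on a 2048-bin histogram walks the candidate clip points 128, 256, …, 2048 and keeps the one with the smallest divergence.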

PPQ's implementation of the KLD algorithm

Algorithm flow:
[Figure: flowchart of PPQ's KLD implementation]
Detailed code analysis:
[Figure: annotated analysis of the code above]

Differences between PPQ and NVIDIA:

1. Preprocessing of the raw histogram
NVIDIA: no preprocessing.
PPQ: the first 0.2% of the bins are set to zero, and the bin at the 0.2% position is set to 1.
2. The for loop that searches for the clipping threshold
NVIDIA: for i in range(128, 2048)
PPQ: for bin_range in range(quant_bins, hist_bins + quant_bins - 1, quant_bins)
3. Converting the threshold m into an actual floating-point value
NVIDIA: threshold = (m + 0.5) * (width of a bin)
PPQ: scale = (best_bin_range / hist_bins) * hist_scale * (hist_bins / quant_bins), which simplifies to best_bin_range * hist_scale / quant_bins
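Plugging hypothetical numbers into both conversions makes the difference in point 3 concrete. All values below are made up for illustration, and dividing NVIDIA's threshold by 127 to get a scale is the usual symmetric-int8 convention, not something stated in the slides:

```python
# hypothetical setup: a 2048-bin histogram covering activations up to 5.0
hist_bins, quant_bins = 2048, 128       # 128 = 2 ** (8 - 1), symmetric int8
hist_scale = 5.0 / hist_bins            # width of one histogram bin
best_bin_range = 512                    # clip point found by the KL search

# NVIDIA: float threshold first, then scale = threshold / 127 by convention
threshold = (best_bin_range + 0.5) * hist_scale
nvidia_scale = threshold / (quant_bins - 1)

# PPQ (the line computing `scale` in the code above); the hist_bins terms cancel
ppq_scale = (best_bin_range / hist_bins) * hist_scale * (hist_bins / quant_bins)
assert ppq_scale == best_bin_range * hist_scale / quant_bins
print(nvidia_scale, ppq_scale)
```

The two scales differ slightly: NVIDIA's adds half a bin to the clip point and divides by 127, while PPQ divides the clip range evenly over 128 quantization bins.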
