当前位置: 首页 > news >正文

深入理解计算机系统——第五章 Optimizing Program Performance

深入理解计算机系统——第五章 Optimizing Program Performance

  • 5.2 Expressing Program Performance
  • 5.3 Program Example
  • 5.4 Eliminating Loop Inefficiencies
  • 5.5 Reducing Procedure Calls
  • 5.6 Eliminating Unneeded Memory References
  • 5.7 Understanding Modern Processors
    • 5.7.1 Overall Operation
    • 5.7.2 Functional Unit Performance
  • 5.8 Loop Unrolling
  • 5.10 Summary of Results for Optimizing Combining Code
  • 5.11 Some Limiting Factors
    • 5.11.1 Register Spilling
    • 5.11.2 Branch Prediction and Misprediction Penalties
  • 5.12 Understanding Memory Performance
  • 5.14 Identifying and Eliminating Performance Bottlenecks

资源:

视频课程
深入理解计算机系统_第5章 优化程序性能

5.2 Expressing Program Performance

We introduce the metric cycles per element, abbreviated CPE, to express program performance in a way that can guide us in improving the code.

5.3 Program Example

后面的程序示例都是基于下面的优化,程序的一些定义如下:

fig 5.3

typedef long data_t;

5.3

fig 5.4


fig 5.5


5.4 Eliminating Loop Inefficiencies

示例:

fig 5.7

可以看到 lower2lower1 效率更高,因为 lower1 的循环中每次都要计算 s 的长度。

5.5 Reducing Procedure Calls

示例:

fig 5.6


fig 5.9

combine3 不用每次循环时都调用 get_vec_start 函数。

5.6 Eliminating Unneeded Memory References

前面 combine3 的循环内的汇编代码如下:
loop code

可以看见每次循环都有读写内存的步骤,但实际没有必要的,只用在循环结束后将数据写到内存中即可,因此做如下修改:

fig 5.10

其循环内的汇编代码如下:
Inner loop of combine4


combine3 编译器在每次循环都要读写数据数据因为考虑到可能会有 memory aliasing 的情况。

Aliasing: two different memory references specify single location.

Easy to have happen in C:
1. since allowed to do address arithmetic
2. Direct access to storage structures

解决方案:
1. Get in habit of introducing local variables;
2. Your way of telling compiler not to check for aliasing


5.7 Understanding Modern Processors

The latency bound : a series of operations must be performed in strict sequence, because the result of one operation is required before the next one can begin.
延迟界限:一系列的指令有依赖关系,必须按照严格的顺序执行。

The throughput bound: characterizes the raw computing capacity of the processor’s functional units. This bound becomes the ultimate limit on program performance.
吞吐量限制:基于硬件的数量和性能,功能单元的原始计算能力。该限制是程序能优化的极限,

5.7.1 Overall Operation

Figure 5.11 shows a very simplified view of a modern microprocessor.

Block diagram of an out-of-order processor

These processors are described in the industry as being superscalar, which means they can perform multiple operations on every clock cycle and out of order, meaning that the order in which instructions execute need not correspond to their ordering in the machine-level program. (需要指令之间独立)

图 5.11 处理器两个部分:

  1. Instruction control unit (ICU): Reading a sequence of instructions from memory and generating from these a set of primitive operations to perform on program data.
  2. Execution unit (EU): Executes these operations.

优点:They are better at achieving higher degrees of instruction-level parallelism.

The ICU reads the instructions from an instruction cache—a special high-speed memory containing the most recently accessed instructions.

In general, the ICU fetches well ahead of the currently executing instructions, so that it has enough time to decode these and send operations down to the EU.

存在的问题:
遇到分支时进行分支预测(前面讲过),在执行前已经对预测的分支进行取指译码的操作,如果预测错误,则需要重新取指令和译码。
如果有分支预测,那最后的在确定分支是否正确前不会存到程序寄存器内存中。

load 单元:从内存中读数据到处理器。
store 单元:将处理器中的数据写入到内存,并且能计算地址。
从图 5.11 可以看出,loadstore 两个单元通过 data catch 来读写内存,该单元是包含最近使用的数据的高速内存。

Figure 5.11 indicates that the different functional units are designed to perform different operations.

arithmetic operations: 执行整型和浮点数不同组合的操作。

The arithmetic units are intentionally designed to be able to perform a variety of different operations, since the required operations vary widely across different programs.

For example, our Intel Core i7 Haswell reference machine has eight functional units, numbered 0–7. Here is a partial list of each one’s capabilities:

eight functional units

In the above list, “integer arithmetic” refers to basic operations, such as addition, bitwise operations, and shifting.
Multiplication and division require more specialized resources.

store 操作需要两个function units,一个计算存储地址,另一个存储数据。

We can see that this combination of functional units has the potential to perform multiple operations of the same type simultaneously.

Within the ICU, the retirement unit keeps track of the ongoing processing and makes sure that it obeys the sequential semantics of the machine-level program.

Our figure shows a register file containing the integer, floating-point, and, more recently, SSE and AVX registers as part of the retirement unit, because this unit controls the updating of these registers.

指令被译码时,信息存在队列中(先入先出)。
这些在分支预测结果出现前一直存在队列中。
如果分支预测成功,那么已经执行的预测的指令就可以退役(retired),并更新程序寄存器。
如果分支预测错误,执行的错误的指令将被清空(flushed),并丢弃已经计算的结果,因此不会修改程序状态。

程序寄存器(program registers)只会在指令退役后更新。

执行单元的不同部分能直接传输结果(Operation results)。

The most common mechanism for controlling the communication of operands among the execution units is called register renaming.

对于每个寄存器,通常有几百个寄存器的副本,用于存储需要更新到实际寄存器的值。

当一个更新寄存器 r 的指令被译码后,用标签 t 表示该操作的结果。

An entry (r, t) is added to a table maintaining the association between program register r and tag t for an operation that will update this register.

When a subsequent instruction using register r as an operand is decoded, the operation sent to the execution unit will contain t as the source for the operand value.

When some execution unit completes the first operation, it generates a result (v, t), indicating that the operation with tag t produced value v.

Any operation waiting for t as a source will then use v as the source value, a form of data forwarding.

By this mechanism, values can be forwarded directly from one operation to another, rather than being written to and read from the register file, enabling the second operation to begin as soon as the first has completed.

The renaming table only contains entries for registers having pending write operations.

With register renaming, an entire sequence of operations can be performed speculatively, even though the registers are updated only after the processor is certain of the branch outcomes.


5.7.2 Functional Unit Performance

latency : The total time required to perform the operation.
延迟:一个指令执行需要的时间。

issue time : The minimum number of clock cycles between two independent operations of the same type.

(Cycle/Issue:由于流水线(pipelining)操作,两条指令之间的时间。)

capacity : The number of functional units capable of performing that operation.

5.8 Loop Unrolling

Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.

在循环中一次计算多个值而非一个值。

示例:

combine4


采用 Loop Unrolling 后改进的 combine5

combine5


改进后的代码每次循环中计算两个数,这里还需要第二的循环处理剩下的数,例如元素的个数为5个时,第一个循环只处理4个数,因此第二个循环处理第5个数。

改进后效果如下:

performance

当循环内做2次展开时,对于整数的加法操作有改进(OP+IDENT0),其他无改进,因为该代码还是需要顺序执行,即一条指令执行完成后才执行下一条,指令之间不是独立的(先计算 acc OP data[i],得到结果后再将结果 OP data[i+1])。


继续优化:
acc = (acc OP data[i]) OP data[i+1]; 改为 acc = acc OP (data[i] OP data[i+1]); ,即改变括号的位置,其性能如下(优化后代码为 combine7):

performance

combine5 对比可以看见换了括号的位置后整数的乘法浮点数的操作时间减少几乎一半。
原因:指令之间变得独立了,当一条指令执行 data[0] OP data[1] 时,下一条指令可以提前计算 data[2] OP data[3]

存在的问题:对于整数的加法乘法满足结合率和交换率,这样改变括号的位置对计算结果无影响;但对于浮点数,并不满足结合率,改变括号的位置后可能会出现舍入,甚至溢出的情况,其结果可能和改变括号前不同

浮点数乘法操作吞吐量为 0.5,因为有两个浮点乘法器


进一步优化:分别计算两组元素的加法或者积:

combine6

combine6 将计算分成两组,索引为偶数的组合在一起计算,索引为奇数的组合为一组计算,其性能如下:

performance

combine 7 相比,整数的加法时间减少了。

5.10 Summary of Results for Optimizing Combining Code

继续优化,增大展开的数目(combine6 为 2*2,即一个循环中计算两个数,分成两组计算),让延迟界限尽量接近吞吐量界限:

Summary

可以看见 10 * 10时,比 2*2 进一步优化了。

5.11 Some Limiting Factors

5.11.1 Register Spilling

可以通过增加循环内计算的数量来优化性能,且级数越大执行时间越短,但这也存在上限,受寄存器的数目的限制,如果寄存器的数目不够用,则编译器会将部分数据存在内存中,通常在中,因而导致时间反而更长(内存访问速度低于寄存器),见下图:

result

可以看见 20 * 20 展开时时间反而比 10 * 10 长。

Fortunately, x86-64 has enough registers that most loops will become throughput limited before this occurs.

5.11.2 Branch Prediction and Misprediction Penalties

对于需要进行分支预测的情况,指令控制单元必须在指令执行单元完成,执行指令执行单元时,必须保证该指令是正确的分支(前面讲过)以保证不会影响程序,如果预测错误,则需要回到分支预测的地方,执行正确的分支,重新取指译码,最后才能进入执行阶段。

分支预测错误也不会对程序有影响,因为分支预测时对无效指令的操作只修改寄存器的值,而每个寄存器都有很多副本,每一次计算的结果都依次保留在寄存器副本中,因此预测错误时,只需要取消寄存器中那些错误的更新并将正确的值返回寄存器中(前面说过的寄存器重命名,即 register renaming)。

5.12 Understanding Memory Performance

All modern processors contain one or more cache memories to provide fast access to such small amounts of memory.

第六章会详细介绍

5.14 Identifying and Eliminating Performance Bottlenecks

本章主要介绍通过工具进行程序剖析(code profilers) 来分析程序的性能,可以查看不同部分花费的时间等来分析代码从而优化程序。

相关文章:

  • 【JDBC实战】水果库存系统 [设计阶段]
  • 【前端之旅】JavaScript进阶笔记
  • JVM运行时参数
  • 【Android】-- 数据存储(二)(存储卡上读写文件、Application保存数据)
  • 消息队列|RabbitMQ入门概述
  • 并发(5)线程之间的协作
  • Vue 新手期练手出现问题记录与解决方案——Vue练手项目“小问题“,强迫症患者福音——持续更新ing
  • 公钥密码学中的简单数学基础
  • docker简洁版
  • 牛客网面试必刷TOP101之——判断某种二叉树以及寻找最近公共祖先节点
  • leetcode——最长回文子串——百日算法成就第五天5%
  • Java_Stream流式计算
  • 微雪树莓派PICO笔记——4. ADC(模拟数字转换器)
  • 网课查题公众号接口使用
  • 使用并查集处理树的路径
  • 08.Android之View事件问题
  • CSS魔法堂:Absolute Positioning就这个样
  • django开发-定时任务的使用
  • ES6之路之模块详解
  • hadoop入门学习教程--DKHadoop完整安装步骤
  • HTML5新特性总结
  • iBatis和MyBatis在使用ResultMap对应关系时的区别
  • Leetcode 27 Remove Element
  • passportjs 源码分析
  • Redis字符串类型内部编码剖析
  • Service Worker
  • Spring Cloud Alibaba迁移指南(一):一行代码从 Hystrix 迁移到 Sentinel
  • ubuntu 下nginx安装 并支持https协议
  • webpack+react项目初体验——记录我的webpack环境配置
  • 腾讯视频格式如何转换成mp4 将下载的qlv文件转换成mp4的方法
  • 微信小程序--------语音识别(前端自己也能玩)
  • 想晋级高级工程师只知道表面是不够的!Git内部原理介绍
  • 自动记录MySQL慢查询快照脚本
  • 翻译 | The Principles of OOD 面向对象设计原则
  • ​LeetCode解法汇总2182. 构造限制重复的字符串
  • (Matlab)遗传算法优化的BP神经网络实现回归预测
  • (分类)KNN算法- 参数调优
  • (附源码)php投票系统 毕业设计 121500
  • (规划)24届春招和25届暑假实习路线准备规划
  • (理论篇)httpmoudle和httphandler一览
  • (免费领源码)Python#MySQL图书馆管理系统071718-计算机毕业设计项目选题推荐
  • (三) prometheus + grafana + alertmanager 配置Redis监控
  • (十一)JAVA springboot ssm b2b2c多用户商城系统源码:服务网关Zuul高级篇
  • (一)pytest自动化测试框架之生成测试报告(mac系统)
  • .bat批处理(十一):替换字符串中包含百分号%的子串
  • .NET Core Web APi类库如何内嵌运行?
  • .NET Standard、.NET Framework 、.NET Core三者的关系与区别?
  • .NET8.0 AOT 经验分享 FreeSql/FreeRedis/FreeScheduler 均已通过测试
  • .NETCORE 开发登录接口MFA谷歌多因子身份验证
  • .net操作Excel出错解决
  • .NET设计模式(2):单件模式(Singleton Pattern)
  • /usr/bin/env: node: No such file or directory
  • [2021]Zookeeper getAcl命令未授权访问漏洞概述与解决
  • [Android]RecyclerView添加HeaderView出现宽度问题
  • [AutoSAR 存储] 汽车智能座舱的存储需求