当前位置：首页 > news >正文

深入理解计算机系统——第五章 Optimizing Program Performance

news 来源：原创 2024/5/4 1:56:45

深入理解计算机系统——第五章 Optimizing Program Performance

5.2 Expressing Program Performance
5.3 Program Example
5.4 Eliminating Loop Inefficiencies
5.5 Reducing Procedure Calls
5.6 Eliminating Unneeded Memory References
5.7 Understanding Modern Processors
- 5.7.1 Overall Operation
- 5.7.2 Functional Unit Performance
5.8 Loop Unrolling
5.10 Summary of Results for Optimizing Combining Code
5.11 Some Limiting Factors
- 5.11.1 Register Spilling
- 5.11.2 Branch Prediction and Misprediction Penalties
5.12 Understanding Memory Performance
5.14 Identifying and Eliminating Performance Bottlenecks

资源：

视频课程
深入理解计算机系统_第5章优化程序性能

5.2 Expressing Program Performance

We introduce the metric cycles per element, abbreviated CPE, to express program performance in a way that can guide us in improving the code.

5.3 Program Example

后面的程序示例都是基于下面的优化，程序的一些定义如下：

fig 5.3

typedef long data_t;

5.3

fig 5.4

fig 5.5

5.4 Eliminating Loop Inefficiencies

示例：

fig 5.7

可以看到 lower2 比 lower1 效率更高，因为 lower1 的循环中每次都要计算 s 的长度。

5.5 Reducing Procedure Calls

示例：

fig 5.6

fig 5.9

combine3 不用每次循环时都调用 get_vec_start 函数。

5.6 Eliminating Unneeded Memory References

前面 combine3 的循环内的汇编代码如下：
loop code

可以看见每次循环都有读写内存的步骤，但实际没有必要的，只用在循环结束后将数据写到内存中即可，因此做如下修改：

fig 5.10

其循环内的汇编代码如下：
Inner loop of combine4

combine3 编译器在每次循环都要读写数据数据因为考虑到可能会有 memory aliasing 的情况。

Aliasing: two different memory references specify single location.

Easy to have happen in C:
1. since allowed to do address arithmetic
2. Direct access to storage structures

解决方案：
1. Get in habit of introducing local variables;
2. Your way of telling compiler not to check for aliasing

5.7 Understanding Modern Processors

The latency bound : a series of operations must be performed in strict sequence, because the result of one operation is required before the next one can begin.
延迟界限：一系列的指令有依赖关系，必须按照严格的顺序执行。

The throughput bound: characterizes the raw computing capacity of the processor’s functional units. This bound becomes the ultimate limit on program performance.
吞吐量限制：基于硬件的数量和性能，功能单元的原始计算能力。该限制是程序能优化的极限，

5.7.1 Overall Operation

Figure 5.11 shows a very simplified view of a modern microprocessor.

Block diagram of an out-of-order processor

These processors are described in the industry as being superscalar, which means they can perform multiple operations on every clock cycle and out of order, meaning that the order in which instructions execute need not correspond to their ordering in the machine-level program. (需要指令之间独立)

图 5.11 处理器两个部分：

Instruction control unit (ICU): Reading a sequence of instructions from memory and generating from these a set of primitive operations to perform on program data.
Execution unit (EU): Executes these operations.

优点：They are better at achieving higher degrees of instruction-level parallelism.

The ICU reads the instructions from an instruction cache—a special high-speed memory containing the most recently accessed instructions.

In general, the ICU fetches well ahead of the currently executing instructions, so that it has enough time to decode these and send operations down to the EU.

存在的问题：
遇到分支时进行分支预测（前面讲过），在执行前已经对预测的分支进行取指和译码的操作，如果预测错误，则需要重新取指令和译码。
如果有分支预测，那最后的在确定分支是否正确前不会存到程序寄存器或内存中。

load 单元：从内存中读数据到处理器。
store 单元：将处理器中的数据写入到内存，并且能计算地址。
从图 5.11 可以看出，load 和 store 两个单元通过 data catch 来读写内存，该单元是包含最近使用的数据的高速内存。

Figure 5.11 indicates that the different functional units are designed to perform different operations.

arithmetic operations: 执行整型和浮点数不同组合的操作。

The arithmetic units are intentionally designed to be able to perform a variety of different operations, since the required operations vary widely across different programs.

For example, our Intel Core i7 Haswell reference machine has eight functional units, numbered 0–7. Here is a partial list of each one’s capabilities:

eight functional units

In the above list, “integer arithmetic” refers to basic operations, such as addition, bitwise operations, and shifting.
Multiplication and division require more specialized resources.

store 操作需要两个function units，一个计算存储地址，另一个存储数据。

We can see that this combination of functional units has the potential to perform multiple operations of the same type simultaneously.

Within the ICU, the retirement unit keeps track of the ongoing processing and makes sure that it obeys the sequential semantics of the machine-level program.

Our figure shows a register file containing the integer, floating-point, and, more recently, SSE and AVX registers as part of the retirement unit, because this unit controls the updating of these registers.

指令被译码时，信息存在队列中（先入先出）。
这些在分支预测结果出现前一直存在队列中。
如果分支预测成功，那么已经执行的预测的指令就可以退役（retired），并更新程序寄存器。
如果分支预测错误，执行的错误的指令将被清空（flushed），并丢弃已经计算的结果，因此不会修改程序状态。

程序寄存器（program registers）只会在指令退役后更新。

执行单元的不同部分能直接传输结果（Operation results)。

The most common mechanism for controlling the communication of operands among the execution units is called register renaming.

对于每个寄存器，通常有几百个寄存器的副本，用于存储需要更新到实际寄存器的值。

当一个更新寄存器 r 的指令被译码后，用标签 t 表示该操作的结果。

An entry (r, t) is added to a table maintaining the association between program register r and tag t for an operation that will update this register.

When a subsequent instruction using register r as an operand is decoded, the operation sent to the execution unit will contain t as the source for the operand value.

When some execution unit completes the first operation, it generates a result (v, t), indicating that the operation with tag t produced value v.

Any operation waiting for t as a source will then use v as the source value, a form of data forwarding.

By this mechanism, values can be forwarded directly from one operation to another, rather than being written to and read from the register file, enabling the second operation to begin as soon as the first has completed.

The renaming table only contains entries for registers having pending write operations.

With register renaming, an entire sequence of operations can be performed speculatively, even though the registers are updated only after the processor is certain of the branch outcomes.

5.7.2 Functional Unit Performance

latency : The total time required to perform the operation.
延迟：一个指令执行需要的时间。

issue time : The minimum number of clock cycles between two independent operations of the same type.

(Cycle/Issue：由于流水线（pipelining)操作，两条指令之间的时间。)

capacity : The number of functional units capable of performing that operation.

5.8 Loop Unrolling

Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.

在循环中一次计算多个值而非一个值。

示例：

combine4