当前位置：首页 > news >正文

Rocprofiler测试

news 来源：原创 2024/9/27 18:21:19

Rocprofiler测试

一.参考链接
二.测试过程
- 1.登录服务器
- 2.使用smi获取列表
- 3.使用rocminfo获取Agent信息
- 4.准备测试用例
- 5.The hardware counters are called the basic counters
- 6.The derived metrics are defined on top of the basic counters using mathematical expression
- 7.Profing

Rocprofiler测试

一.参考链接

Compatibility matrix
AMD Radeon Pro VII
Radeon™ PRO VII Specifications
6.2.0 Supported GPUs
Performance model&相关名词解释

二.测试过程

1.登录服务器

.TODO

2.使用smi获取列表

rocm-smi

输出

=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%(DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
========================================================================================================================
0       1     0x66a1,   3820   35.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
1       2     0x66a1,   22570  38.0°C  17.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================

3.使用rocminfo获取Agent信息

在 ROCm（Radeon Open Compute）平台中，Agent 通常指的是计算设备或处理单元，这些可以是 CPU 或 GPU。每个 Agent 可以执行计算任务并具有自己的计算资源，如计算核心、内存等。在 ROCm 的程序模型中，Agent 是负责执行特定任务的实体，当你使用 ROCm 进行并行计算时，任务通常会分配给不同的 Agent 来处理。Agent 是 ROCm 的异构计算环境中进行任务调度和管理的基本单元之一

rocminfo

输出

*******
Agent 2
*******Name:                    gfx906Uuid:                    GPU-021860c17348c2f7Marketing Name:          AMD Radeon (TM) Pro VIIVendor Name:             AMDFeature:                 KERNEL_DISPATCHProfile:                 BASE_PROFILEFloat Round Mode:        NEARMax Queue Number:        128(0x80)Queue Min Size:          64(0x40)Queue Max Size:          131072(0x20000)Queue Type:              MULTINode:                    1Device Type:             GPUCache Info:L1:                      16(0x10) KBL2:                      8192(0x2000) KBChip ID:                 26273(0x66a1)ASIC Revision:           1(0x1)Cacheline Size:          64(0x40)Max Clock Freq. (MHz):   1700BDFID:                   1792Internal Node ID:        1Compute Unit:            60SIMDs per CU:            4Shader Engines:          4Shader Arrs. per Eng.:   1WatchPts on Addr. Ranges:4Coherent Host Access:    FALSEMemory Properties:Features:                KERNEL_DISPATCHFast F16 Operation:      TRUEWavefront Size:          64(0x40)Workgroup Max Size:      1024(0x400)Workgroup Max Size per Dimension:x                        1024(0x400)y                        1024(0x400)z                        1024(0x400)Max Waves Per CU:        40(0x28)Max Work-item Per CU:    2560(0xa00)Grid Max Size:           4294967295(0xffffffff)Grid Max Size per Dimension:x                        4294967295(0xffffffff)y                        4294967295(0xffffffff)z                        4294967295(0xffffffff)Max fbarriers/Workgrp:   32Packet Processor uCode:: 472SDMA engine uCode::      145IOMMU Support::          NonePool Info:Pool 1Segment:                 GLOBAL; FLAGS: COARSE GRAINEDSize:                    16760832(0xffc000) KBAllocatable:             TRUEAlloc Granule:           4KBAlloc Recommended Granule:2048KBAlloc Alignment:         4KBAccessible by all:       FALSEPool 2Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINEDSize:                    16760832(0xffc000) KBAllocatable:             TRUEAlloc Granule:           4KBAlloc Recommended Granule:2048KBAlloc Alignment:         4KBAccessible by all:       FALSEPool 3Segment:                 GROUPSize:                    64(0x40) KBAllocatable:             FALSEAlloc Granule:           0KBAlloc Recommended Granule:0KBAlloc Alignment:         0KBAccessible by all:       FALSEISA Info:ISA 1Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Machine Models:          HSA_MACHINE_MODEL_LARGEProfiles:                HSA_PROFILE_BASEDefault Rounding Mode:   NEARDefault Rounding Mode:   NEARFast f16:                TRUEWorkgroup Max Size:      1024(0x400)Workgroup Max Size per Dimension:x                        1024(0x400)y                        1024(0x400)z                        1024(0x400)Grid Max Size:           4294967295(0xffffffff)Grid Max Size per Dimension:x                        4294967295(0xffffffff)y                        4294967295(0xffffffff)z                        4294967295(0xffffffff)FBarrier Max Size:       32
*******

4.准备测试用例

tee ROCmMatrixTranspose.cpp<<-'EOF'
#include <iostream>
// hip header file
#include <hip/hip_runtime.h>
// roctx header file
#include <roctracer/roctx.h>#define WIDTH 1024
#define NUM (WIDTH * WIDTH)
#define THREADS_PER_BLOCK_X 4
#define THREADS_PER_BLOCK_Y 4
#define THREADS_PER_BLOCK_Z 1// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;out[y * width + x] = in[x * width + y];
}// CPU implementation of matrix transpose
void matrixTransposeCPUReference(float* output, float* input, const unsigned int width) {for (unsigned int j = 0; j < width; j++) {for (unsigned int i = 0; i < width; i++) {output[i * width + j] = input[j * width + i];}}
}int main() {float* Matrix;float* TransposeMatrix;float* cpuTransposeMatrix;float* gpuMatrix;float* gpuTransposeMatrix;hipDeviceProp_t devProp;hipGetDeviceProperties(&devProp, 0);std::cout << "Device name " << devProp.name << std::endl;int i;int errors;Matrix = (float*)malloc(NUM * sizeof(float));TransposeMatrix = (float*)malloc(NUM * sizeof(float));cpuTransposeMatrix = (float*)malloc(NUM * sizeof(float));// initialize the input datafor (i = 0; i < NUM; i++) {Matrix[i] = (float)i * 10.0f;}// allocate the memory on the device sidehipMalloc((void**)&gpuMatrix, NUM * sizeof(float));hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));uint32_t iterations = 1;while (iterations-- > 0) {std::cout << "## Iteration (" << iterations << ") #################" << std::endl;// Memory transfer from host to devicehipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);roctxMark("ROCTX-MARK: before hipLaunchKernel");roctxRangePush("ROCTX-RANGE: hipLaunchKernel");roctx_range_id_t roctx_id = roctxRangeStartA("roctx_range with id");// Lauching kernel from hosthipLaunchKernelGGL(matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);roctxRangeStop(roctx_id);roctxMark("ROCTX-MARK: after hipLaunchKernel");// Memory transfer from device to hostroctxRangePush("ROCTX-RANGE: hipMemcpy");hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);roctxRangePop();  // for "hipMemcpy"roctxRangePop();  // for "hipLaunchKernel"// CPU MatrixTranspose computationmatrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);// verify the resultserrors = 0;double eps = 1.0E-6;for (i = 0; i < NUM; i++) {if (std::abs(TransposeMatrix[i] - cpuTransposeMatrix[i]) > eps) {errors++;}}if (errors != 0) {printf("FAILED: %d errors\n", errors);} else {printf("PASSED!\n");}}// free the resources on device sidehipFree(gpuMatrix);hipFree(gpuTransposeMatrix);// free the resources on host sidefree(Matrix);free(TransposeMatrix);free(cpuTransposeMatrix);return errors;
}EOF/opt/rocm/bin/hipcc -c ROCmMatrixTranspose.cpp -o ROCmMatrixTranspose.cpp.o
/opt/rocm/bin/hipcc ROCmMatrixTranspose.cpp.o -o ROCmMatrixTranspose \/opt/rocm/lib/libamd_comgr.so.2.8.60200 /usr/lib/x86_64-linux-gnu/libnuma.so /opt/rocm/lib/libroctx64.so	
./ROCmMatrixTranspose

5.The hardware counters are called the basic counters

rocprof --list-basic | grep -A 2  "gpu-agent2"

输出

  gpu-agent2 : TCC_EA1_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.block TCC has 4 countersgpu-agent2 : TCC_EA1_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.block TCC has 4 countersgpu-agent2 : TCC_EA1_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.block TCC has 4 countersgpu-agent2 : TCC_EA1_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)block TCC has 4 countersgpu-agent2 : TCC_EA1_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requestsblock TCC has 4 countersgpu-agent2 : GRBM_COUNT : Tie High - Count Number of Clocksblock GRBM has 2 countersgpu-agent2 : GRBM_GUI_ACTIVE : The GUI is Activeblock GRBM has 2 countersgpu-agent2 : SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)block SQ has 8 countersgpu-agent2 : SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_FLAT_LDS_ONLY : Number of FLAT instructions issued that read/wrote only from/to LDS (only works if EARLY_TA_DONE is enabled). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_LDS : Number of LDS instructions issued (including FLAT). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_GDS : Number of GDS instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_WAIT_INST_LDS : Number of wave-cycles spent waiting for LDS instruction issue. In units of 4 cycles. (per-simd, nondeterministic)block SQ has 8 countersgpu-agent2 : SQ_ACTIVE_INST_VALU : regspec 71? Number of cycles the SQ instruction arbiter is working on a VALU instruction. (per-simd, nondeterministic). Units in quad-cycles(4 cycles)block SQ has 8 countersgpu-agent2 : SQ_INST_CYCLES_SALU : Number of cycles needed to execute non-memory read scalar operations. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_THREAD_CYCLES_VALU : Number of thread-cycles used to execute VALU operations (similar to INST_CYCLES_VALU but multiplied by # of active threads). (per-simd)block SQ has 8 countersgpu-agent2 : SQ_LDS_BANK_CONFLICT : Number of cycles LDS is stalled by bank conflicts. (emulated)block SQ has 8 countersgpu-agent2 : TA_TA_BUSY[0-15] : TA block is busy. Perf_Windowing not supported for this counter.block TA has 2 countersgpu-agent2 : TA_FLAT_READ_WAVEFRONTS[0-15] : Number of flat opcode reads processed by the TA.block TA has 2 countersgpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS[0-15] : Number of flat opcode writes processed by the TA.block TA has 2 countersgpu-agent2 : TCC_HIT[0-15] : Number of cache hits.block TCC has 4 countersgpu-agent2 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.block TCC has 4 countersgpu-agent2 : TCC_EA_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.block TCC has 4 countersgpu-agent2 : TCC_EA_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.block TCC has 4 countersgpu-agent2 : TCC_EA_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.block TCC has 4 countersgpu-agent2 : TCC_EA_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)block TCC has 4 countersgpu-agent2 : TCC_EA_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requestsblock TCC has 4 countersgpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES[0-15] : TCP stalls TA data interface. Now Windowed.block TCP has 4 counters

6.The derived metrics are defined on top of the basic counters using mathematical expression

rocprof --list-derived | grep -A 2  "gpu-agent2"

输出

  gpu-agent2 : TCC_EA1_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC EA1s.TCC_EA1_RDREQ_32B_sum = sum(TCC_EA1_RDREQ_32B,16)gpu-agent2 : TCC_EA1_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC EA1s.TCC_EA1_RDREQ_sum = sum(TCC_EA1_RDREQ,16)gpu-agent2 : TCC_EA1_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC EA1s.TCC_EA1_WRREQ_sum = sum(TCC_EA1_WRREQ,16)gpu-agent2 : TCC_EA1_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC EA1s.TCC_EA1_WRREQ_64B_sum = sum(TCC_EA1_WRREQ_64B,16)gpu-agent2 : TCC_WRREQ1_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.TCC_WRREQ1_STALL_max = max(TCC_EA1_WRREQ_STALL,16)gpu-agent2 : RDATA1_SIZE : The total kilobytes fetched from the video memory. This is measured on EA1s.RDATA1_SIZE = (TCC_EA1_RDREQ_32B_sum*32+(TCC_EA1_RDREQ_sum-TCC_EA1_RDREQ_32B_sum)*64)gpu-agent2 : WDATA1_SIZE : The total kilobytes written to the video memory. This is measured on EA1s.WDATA1_SIZE = ((TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)*32+TCC_EA1_WRREQ_64B_sum*64)gpu-agent2 : FETCH_SIZE : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.FETCH_SIZE = (TCC_EA_RDREQ_32B_sum*32+(TCC_EA_RDREQ_sum-TCC_EA_RDREQ_32B_sum)*64+RDATA1_SIZE)/1024gpu-agent2 : WRITE_SIZE : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.WRITE_SIZE = ((TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)*32+TCC_EA_WRREQ_64B_sum*64+WDATA1_SIZE)/1024gpu-agent2 : WRITE_REQ_32B : The total number of 32-byte effective memory writes.WRITE_REQ_32B = (TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)+(TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)+(TCC_EA_WRREQ_64B_sum+TCC_EA1_WRREQ_64B_sum)*2gpu-agent2 : TA_BUSY_avr : TA block is busy. Average over TA instances.TA_BUSY_avr = avr(TA_TA_BUSY,16)gpu-agent2 : TA_BUSY_max : TA block is busy. Max over TA instances.TA_BUSY_max = max(TA_TA_BUSY,16)gpu-agent2 : TA_BUSY_min : TA block is busy. Min over TA instances.TA_BUSY_min = min(TA_TA_BUSY,16)gpu-agent2 : TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.TA_FLAT_READ_WAVEFRONTS_sum = sum(TA_FLAT_READ_WAVEFRONTS,16)gpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.TA_FLAT_WRITE_WAVEFRONTS_sum = sum(TA_FLAT_WRITE_WAVEFRONTS,16)gpu-agent2 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.TCC_HIT_sum = sum(TCC_HIT,16)gpu-agent2 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.TCC_MISS_sum = sum(TCC_MISS,16)gpu-agent2 : TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.TCC_EA_RDREQ_32B_sum = sum(TCC_EA_RDREQ_32B,16)gpu-agent2 : TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.TCC_EA_RDREQ_sum = sum(TCC_EA_RDREQ,16)gpu-agent2 : TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.TCC_EA_WRREQ_sum = sum(TCC_EA_WRREQ,16)gpu-agent2 : TCC_EA_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC instances.TCC_EA_WRREQ_64B_sum = sum(TCC_EA_WRREQ_64B,16)gpu-agent2 : TCC_WRREQ_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.TCC_WRREQ_STALL_max = max(TCC_EA_WRREQ_STALL,16)gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_sum : Total number of TCP stalls TA data interface.TCP_TCP_TA_DATA_STALL_CYCLES_sum = sum(TCP_TCP_TA_DATA_STALL_CYCLES,16)gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_max : Maximum number of TCP stalls TA data interface.TCP_TCP_TA_DATA_STALL_CYCLES_max = max(TCP_TCP_TA_DATA_STALL_CYCLES,16)gpu-agent2 : VFetchInsts : The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.VFetchInsts = (SQ_INSTS_VMEM_RD-TA_FLAT_READ_WAVEFRONTS_sum)/SQ_WAVESgpu-agent2 : VWriteInsts : The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.VWriteInsts = (SQ_INSTS_VMEM_WR-TA_FLAT_WRITE_WAVEFRONTS_sum)/SQ_WAVESgpu-agent2 : FlatVMemInsts : The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.FlatVMemInsts = (SQ_INSTS_FLAT-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVESgpu-agent2 : LDSInsts : The average number of LDS read or LDS write instructions executed per work item (affected by flow control).  Excludes FLAT instructions that read from or write to LDS.LDSInsts = (SQ_INSTS_LDS-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVESgpu-agent2 : FlatLDSInsts : The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control).FlatLDSInsts = SQ_INSTS_FLAT_LDS_ONLY/SQ_WAVESgpu-agent2 : VALUUtilization : The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).VALUUtilization = 100*SQ_THREAD_CYCLES_VALU/(SQ_ACTIVE_INST_VALU*MAX_WAVE_SIZE)gpu-agent2 : VALUBusy : The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).VALUBusy = 100*SQ_ACTIVE_INST_VALU*4/SIMD_NUM/GRBM_GUI_ACTIVEgpu-agent2 : SALUBusy : The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).SALUBusy = 100*SQ_INST_CYCLES_SALU*4/SIMD_NUM/GRBM_GUI_ACTIVEgpu-agent2 : FetchSize : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.FetchSize = FETCH_SIZEgpu-agent2 : WriteSize : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.WriteSize = WRITE_SIZEgpu-agent2 : MemWrites32B : The total number of effective 32B write transactions to the memoryMemWrites32B = WRITE_REQ_32Bgpu-agent2 : L2CacheHit : The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).L2CacheHit = 100*sum(TCC_HIT,16)/(sum(TCC_HIT,16)+sum(TCC_MISS,16))gpu-agent2 : MemUnitStalled : The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).MemUnitStalled = 100*max(TCP_TCP_TA_DATA_STALL_CYCLES,16)/GRBM_GUI_ACTIVE/SE_NUMgpu-agent2 : WriteUnitStalled : The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).WriteUnitStalled = 100*TCC_WRREQ_STALL_max/GRBM_GUI_ACTIVEgpu-agent2 : LDSBankConflict : The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).LDSBankConflict = 100*SQ_LDS_BANK_CONFLICT/GRBM_GUI_ACTIVE/CU_NUMgpu-agent2 : GPUBusy : The percentage of time GPU was busy.GPUBusy = 100*GRBM_GUI_ACTIVE/GRBM_COUNTgpu-agent2 : Wavefronts : Total wavefronts.Wavefronts = SQ_WAVESgpu-agent2 : VALUInsts : The average number of vector ALU instructions executed per work-item (affected by flow control).VALUInsts = SQ_INSTS_VALU/SQ_WAVESgpu-agent2 : SALUInsts : The average number of scalar ALU instructions executed per work-item (affected by flow control).SALUInsts = SQ_INSTS_SALU/SQ_WAVESgpu-agent2 : SFetchInsts : The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).SFetchInsts = SQ_INSTS_SMEM/SQ_WAVESgpu-agent2 : GDSInsts : The average number of GDS read or GDS write instructions executed per work item (affected by flow control).GDSInsts = SQ_INSTS_GDS/SQ_WAVESgpu-agent2 : MemUnitBusy : The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).MemUnitBusy = 100*max(TA_TA_BUSY,16)/GRBM_GUI_ACTIVE/SE_NUMgpu-agent2 : ALUStalledByLDS : The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).ALUStalledByLDS = 100*SQ_WAIT_INST_LDS*4/SQ_WAVES/GRBM_GUI_ACTIVE

7.Profing

tee input.txt<<-'EOF'
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0 : 1
gpu: 0
kernel:matrixTranspose
EOFrocprof -i input.txt ./ROCmMatrixTranspose
cat /root/input.csv
rocprofv2 -i input.txt ./ROCmMatrixTranspose
rocprofv2 --hsa-trace ./ROCmMatrixTranspose

输出

RPL: on '240920_102257' from '/opt/rocm-6.2.0' in '/root'
RPL: profiling '"./ROCmMatrixTranspose"'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_240920_102257_47892'RPL: result dir '/tmp/rpl_data_240920_102257_47892/input0_results_240920_102257'
ROCProfiler: input from "/tmp/rpl_data_240920_102257_47892/input0.xml"gpu_index = 0kernel = matrixTransposerange = 0:14 metricsWavefronts, VALUInsts, SALUInsts, SFetchInsts
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_240920_102257_47892/input0_results_240920_102257
File '/root/input.csv' is generating
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,SALUInsts,SFetchInsts
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,48178,48178,1048576,16,0,0,8,0,16,64,0x0,0x742031870880,65536.0000000000,14.0000000000,4.0000000000,3.0000000000ROCProfilerV2: Collecting the following counters:
- Wavefronts
- VALUInsts
- SALUInsts
- SFetchInsts
Enabling Counter Collection
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!
Dispatch_ID(0), GPU_ID(1), Queue_ID(1), Process_ID(48209), Thread_ID(48209), Grid_Size(1048576), Workgroup_Size(16), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("matrixTranspose(float*, float*, int) (.kd)"), Begin_Timestamp(951172884265490), End_Timestamp(951172884454463), Correlation_ID(0), SALUInsts(4.000000), SFetchInsts(3.000000), VALUInsts(14.000000), Wavefronts(65536.000000)