Hello,
I wanted to share some information on how I do performance
profiling for AMD GPU BLAS operations on Debian. There are a few
key environment variables [1]:
ROCBLAS_LAYER=7
ROCBLAS_LOG_TRACE_PATH=trace.log
ROCBLAS_LOG_BENCH_PATH=bench.log
ROCBLAS_LOG_PROFILE_PATH=profile.log
TENSILE_DB=0xFFFF
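For context, the rocBLAS logging documentation [1] describes
ROCBLAS_LAYER as a bitmask: 1 enables the trace log, 2 the bench
log, and 4 the profile log, so 7 turns on all three. A small
Python sketch of how you might set this up before launching a
workload:

```python
import os

# Per the rocBLAS logging docs, ROCBLAS_LAYER is a bitmask:
# 1 = trace log, 2 = bench log, 4 = profile log.
LOG_TRACE, LOG_BENCH, LOG_PROFILE = 0x1, 0x2, 0x4

env = dict(os.environ)
env.update({
    "ROCBLAS_LAYER": str(LOG_TRACE | LOG_BENCH | LOG_PROFILE),
    "ROCBLAS_LOG_TRACE_PATH": "trace.log",
    "ROCBLAS_LOG_BENCH_PATH": "bench.log",
    "ROCBLAS_LOG_PROFILE_PATH": "profile.log",
    "TENSILE_DB": "0xFFFF",  # Tensile kernel-selection debug output
})

# Pass env=env to subprocess.run() when launching the workload.
print(env["ROCBLAS_LAYER"])  # 7
```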
The trace log is filled with output that shows the calls made to
rocblas and the arguments passed. For example, here is the trace
of the first few rocblas calls for llama.cpp with
dolphin-2.2.1-mistral-7b.Q5_K_M.gguf as the model:
rocblas_create_handle,atomics_allowed
rocblas_set_stream,0x56388d1e0a40,atomics_allowed
rocblas_query_int8_layout_flag,pack_int,atomics_allowed
rocblas_gemm_batched_ex,T,N,32,2,128,1,0x7fa04c608000,f16_r,1024,0x7fa04c608100,f16_r,4096,0,0x7fa04c609000,f16_r,32,0x7fa04c609000,f16_r,32,32,f16_r,0,0,pack_int,atomics_allowed
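Since each trace line is just the function name followed by
comma-separated arguments, it's easy to tally which operations a
workload actually makes. A rough sketch (the sample lines are
abbreviated from the trace above):

```python
from collections import Counter

def parse_trace(lines):
    """Tally rocBLAS calls from a ROCBLAS_LOG_TRACE log.

    Each line is comma-separated: the function name comes first,
    then its arguments (the exact fields vary by function).
    """
    calls = Counter()
    for line in lines:
        line = line.strip()
        if line:
            calls[line.split(",")[0]] += 1
    return calls

sample = [
    "rocblas_create_handle,atomics_allowed",
    "rocblas_set_stream,0x56388d1e0a40,atomics_allowed",
    "rocblas_gemm_batched_ex,T,N,32,2,128,1,...",
]
print(parse_trace(sample))
```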
The profile log gives you the rocblas operation calls as a YAML
list of JSON-like objects, each with a call_count field that
aggregates repeated calls with the same arguments. I'm not sure
how useful this is, but it looks like so:
- { rocblas_function: "rocblas_gemm_batched_ex", atomics_mode:
atomics_allowed, a_type: "f16_r", b_type: "f16_r", c_type:
"f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T',
transB: 'N', M: 128, N: 1, K: 320, alpha: 1.0, lda: 2048, ldb:
320, beta: 0.0, ldc: 128, ldd: 128, batch_count: 32, algo: 0,
solution_index: 0, flags: pack_int, call_count: 352 }
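One thing the call_count field does enable is ranking the hot
operations. A sketch that pulls out the function name and call
count with a regex, to avoid needing a YAML parser (the
single-entry sample is abbreviated from the log above):

```python
import re

def profile_call_counts(text):
    """Extract (rocblas_function, call_count) pairs from a
    ROCBLAS_LOG_PROFILE log, one flow-style entry per line."""
    pattern = re.compile(
        r'rocblas_function:\s*"([^"]+)".*?call_count:\s*(\d+)')
    return [(m.group(1), int(m.group(2)))
            for m in pattern.finditer(text)]

log = ('- { rocblas_function: "rocblas_gemm_batched_ex", '
       "transA: 'T', transB: 'N', M: 128, N: 1, K: 320, "
       'call_count: 352 }')
print(profile_call_counts(log))
```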
The bench log, however, is quite useful. It gives you each call to
rocblas, but transformed into rocblas-bench commands for isolated
performance testing of the rocblas operation of interest:
./rocblas-bench -f gemm_batched_ex --transposeA T --transposeB N
-m 32 -n 2 -k 128 --alpha 1 --a_type f16_r --lda 1024 --b_type
f16_r --ldb 4096 --beta 0 --c_type f16_r --ldc --d_type f16_r
--ldd 32 --batch_count 32 --compute_type f16_r --algo 0
--solution_index 0 --flags 1
To use the above in Debian, you would install librocblas0-bench,
remove the --ldc flag (which appears to be missing its size
argument [3]), change ./rocblas-bench to
/usr/libexec/rocm/librocblas0-bench/rocblas-bench, and increase
the iteration count by adding something like --iters 1000 (or
more) to ensure you get enough samples to provide reasonable
timing results.
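Those rewrites are mechanical, so you could script them. A sketch
of one way to do it; the regex for the argument-less --ldc is my
own guess at a safe match, not something from the rocBLAS tooling:

```python
import re

DEBIAN_BENCH = "/usr/libexec/rocm/librocblas0-bench/rocblas-bench"

def debianize(line, iters=1000):
    """Rewrite a bench.log line for Debian's rocblas-bench.

    Drops the argument-less --ldc flag (Debian bug #1095201),
    substitutes the Debian binary path, and raises the iteration
    count for more stable timing.
    """
    # Remove a bare --ldc that is not followed by a numeric argument.
    line = re.sub(r'--ldc(?!\s+\d)\s*', '', line)
    line = line.replace("./rocblas-bench", DEBIAN_BENCH)
    return f"{line.strip()} --iters {iters}"

cmd = ("./rocblas-bench -f gemm_batched_ex --transposeA T "
       "--beta 0 --c_type f16_r --ldc --d_type f16_r --ldd 32")
print(debianize(cmd))
```

Note that a well-formed "--ldc 128" would be left alone; only the
flag with no size argument is stripped.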
The output you'll see should look something like this:
$ ./rocblas-bench -f gemm_batched_ex --transposeA T --transposeB N
-m 32 -n 2 -k 128 --alpha 1 --a_type f16_r --lda 1024 --b_type
f16_r --ldb 4096 --beta 0 --c_type f16_r --d_type f16_r --ldd 32
--batch_count 32 --compute_type f16_r --algo 0 --solution_index 0
--flags 1 --iters 10000
Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : AMD Radeon RX 6800 XT gfx1030
with 17.2 GB memory, max. SCLK 2575 MHz, max. MCLK 1000 MHz,
compute capability 10.3
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB,
maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------
rocBLAS info: Time taken to complete rocBLAS library
initialization is 1603 milliseconds.
transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,ldd,batch_count,rocblas-Gflops,us
T,N,32,2,128,1,1024,0,4096,128,32,32, 85.8785, 6.105
My system has an RX 6800 XT installed, so 0.0858785 TFLOPS is
quite a bit less than the theoretical half-precision peak of
41.47 TFLOPS [4]. That said, I'm not sure you could get anywhere
near the theoretical peak with these parameters anyway. With N=2,
it's barely a GEMM and I imagine it's memory limited. NVIDIA has
a good background page on matrix multiplication for deep learning
performance, which may be helpful for making sense of these
parameters and results [5].
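As a sanity check, the reported figure can be reproduced from the
standard 2*M*N*K flops-per-GEMM formula, and a rough
arithmetic-intensity estimate supports the memory-bound guess
(assuming f16 operands at 2 bytes each, and that beta=0 means C
is written but never read):

```python
# Batched GEMM parameters and timing from the bench run above.
M, N, K, batch = 32, 2, 128, 32
time_us = 6.105

flops = 2 * M * N * K * batch            # multiply-add = 2 flops
gflops = flops / (time_us * 1e-6) / 1e9
print(f"{gflops:.4f} Gflops")            # matches the 85.8785 in the log

# Rough arithmetic intensity: bytes for A, B, and the D write.
bytes_moved = 2 * batch * (M * K + K * N + M * N)
print(f"{flops / bytes_moved:.2f} flops/byte")
```

At under 2 flops/byte, the kernel would need enormous memory
bandwidth relative to compute to keep the ALUs busy, so a result
far below peak flops is unsurprising.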
If we examine the kernel selection logs printed due to
TENSILE_DB=0xFFFF [2], the output seems to suggest that the best
available assembly kernel is a poor match, though it's not clear
to me whether a better match would achieve better performance:
loaded code object
/lib/x86_64-linux-gnu/rocblas/2.47.0/library/Kernels.so-000-gfx1030.hsaco
loaded code object
/lib/x86_64-linux-gnu/rocblas/2.47.0/library/TensileLibrary_gfx1030.co
TensorDescriptor:calculate 3-tensor<Half>( sizes(128, 32,
32), strides(1, 1024, 32768),
offset(0))totalLogicalElements=131072 totalAllocatedElem=1047680
TensorDescriptor:calculate 3-tensor<Half>( sizes(128, 2,
32), strides(1, 4096, 8192), offset(0))totalLogicalElements=8192
totalAllocatedElem=258176
TensorDescriptor:calculate 3-tensor<Half>( sizes(32, 2,
32), strides(1, 32, 64), offset(0))totalLogicalElements=2048
totalAllocatedElem=2048
TensorDescriptor:calculate 3-tensor<Half>( sizes(32, 2,
32), strides(1, 32, 64), offset(0))totalLogicalElements=2048
totalAllocatedElem=2048
AMDGPU(matches: And(Processor(gfx90a): 0, CUCount(104): 0): 0
): 0
AMDGPU(matches: Processor(gfx1030): 1): 1
ProblemMap Searching for Contraction_l_Alik_Bljk_Cijk_Dijk found
Problem library (1 rows)
TruePred: 1
TruePred: 1
And(TypesEqual(a:Half == Float&& b:Half == Float&&
c:Half == Float&& d:Half == Float): 0): 0
And(): 1
TruePred: 1
Object key: 32, 2, 128
Key: 32, 2, 128
Starting point: 64, 64, 256
Rightward search...
TruePred: 1And(StridedBatched(1): 0): 0
speed: 0.625 | 64, 64, 256: 21252 < 1.79769e+308 <-- Best
distance, but no matching solution
TruePred: 1And(): 1
speed: 0.625 | 64, 64, 256: 21252 == 21252
speed: 0.981 | 64, 64, 1280: 1.33197e+06 > 21252
speed: 0.981 | 64, 64, 1280: 1.33197e+06 > 21252
<...>
speed: 53.044 | 128, 6784, 3328: 5.62447e+07 > 21252
speed: 53.044 | 128, 6784, 3328: 5.62447e+07 > 21252
256, 64, 256: Stopping rightward search early.
Leftward search...
Considered 11.6468% of entries.
Solution index selected: 22645
Running kernel:
Cijk_Alik_Bljk_HB_GB_MT16x16x16_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASAE01_ASCE01_ASEM1_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS1_FL0_GRVW2_GSU1_GSUAMB_GLS0_ISA1030_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_SIA1_SS0_SU32_SUM3_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW2_VSn1_VW2_WSGRA0_WSGRB0_WS32_WG8_8_1_WGM1
Kernel
Cijk_Alik_Bljk_HB_GB_MT16x16x16_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASAE01_ASCE01_ASEM1_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS1_FL0_GRVW2_GSU1_GSUAMB_GLS0_ISA1030_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_SIA1_SS0_SU32_SUM3_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW2_VSn1_VW2_WSGRA0_WSGRB0_WS32_WG8_8_1_WGM1
l(64, 1, 1) x g(2, 1, 32) = (128, 1, 32)
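If it's helpful, I read the final launch line as l(...) being the
workgroup (local) size and g(...) the grid of workgroups, with
their element-wise product giving the global size on the right;
that interpretation is my own, not from the Tensile docs. A small
parser sketch:

```python
import re

def launch_totals(line):
    """Parse a Tensile 'l(...) x g(...)' launch line into
    (threads per workgroup, number of workgroups)."""
    local, grid = [
        [int(x) for x in dims.split(",")]
        for dims in re.findall(r'\(([^)]*)\)', line)[:2]
    ]
    threads_per_wg = local[0] * local[1] * local[2]
    workgroups = grid[0] * grid[1] * grid[2]
    return threads_per_wg, workgroups

line = "l(64, 1, 1) x g(2, 1, 32) = (128, 1, 32)"
print(launch_totals(line))  # (64, 64)
```

For this problem, that's 64 workgroups of 64 threads each, i.e.
4096 work-items in total.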
In any case, I collected rocblas performance data running the
following models with llama.cpp:
- dolphin-2.2.1-mistral-7b.Q5_K_M
- dolphin-2.2.1-mistral-7b.Q6_K
- dolphin-2.5-mixtral-8x7b.Q4_K_M
- dolphin-2.7-mixtral-8x7b.Q4_K_M
- Meta-Llama-3.1-8B-Instruct-Q6_K
The logs are available on GitHub [6]. This is obviously an
incomplete analysis, but I hope it helps folks to understand some
of the performance analysis tools that are available on Debian
right now.
Sincerely,
Cory Bloor
[1]: https://rocm.docs.amd.com/projects/rocBLAS/en/docs-6.3.2/reference/logging.html
[2]: https://rocm.docs.amd.com/projects/Tensile/en/docs-6.3.2/src/reference/environment-variables.html
[3]: https://bugs.debian.org/1095201
[4]: https://www.amd.com/en/products/graphics/desktops/radeon/6000-series/amd-radeon-rx-6800-xt.html
[5]: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
[6]: https://gist.github.com/cgmb/5bc00ad3f04afbdd04e2ef12d4aabe2d