Hello,
I wanted to share some information on how I do performance
profiling for AMD GPU BLAS operations on Debian. There are a few
key environment variables [1]:
ROCBLAS_LAYER=7
ROCBLAS_LOG_TRACE_PATH=trace.log
ROCBLAS_LOG_BENCH_PATH=bench.log
ROCBLAS_LOG_PROFILE_PATH=profile.log
TENSILE_DB=0xFFFF
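For context, the rocBLAS logging documentation [1] describes
ROCBLAS_LAYER as a bitmask: 1 enables the trace log, 2 the bench
log, and 4 the profile log, so 7 turns on all three. A small
Python sketch of how you might set this up before launching a
workload:

```python
import os

# Per the rocBLAS logging docs, ROCBLAS_LAYER is a bitmask:
# 1 = trace log, 2 = bench log, 4 = profile log.
LOG_TRACE, LOG_BENCH, LOG_PROFILE = 0x1, 0x2, 0x4

env = dict(os.environ)
env.update({
    "ROCBLAS_LAYER": str(LOG_TRACE | LOG_BENCH | LOG_PROFILE),
    "ROCBLAS_LOG_TRACE_PATH": "trace.log",
    "ROCBLAS_LOG_BENCH_PATH": "bench.log",
    "ROCBLAS_LOG_PROFILE_PATH": "profile.log",
    "TENSILE_DB": "0xFFFF",  # Tensile kernel-selection debug output
})

# Pass env=env to subprocess.run() when launching the workload.
print(env["ROCBLAS_LAYER"])  # 7
```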
The trace log is filled with output that shows the calls made to
rocblas and the arguments passed. For example, here is the trace
of the first few rocblas calls for llama.cpp with
dolphin-2.2.1-mistral-7b.Q5_K_M.gguf as the model:
rocblas_create_handle,atomics_allowed
rocblas_set_stream,0x56388d1e0a40,atomics_allowed
rocblas_query_int8_layout_flag,pack_int,atomics_allowed
rocblas_gemm_batched_ex,T,N,32,2,128,1,0x7fa04c608000,f16_r,1024,0x7fa04c608100,f16_r,4096,0,0x7fa04c609000,f16_r,32,0x7fa04c609000,f16_r,32,32,f16_r,0,0,pack_int,atomics_allowed
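Since each trace line is just the function name followed by
comma-separated arguments, it's easy to tally which operations a
workload actually makes. A rough sketch (the sample lines are
abbreviated from the trace above):

```python
from collections import Counter

def parse_trace(lines):
    """Tally rocBLAS calls from a ROCBLAS_LOG_TRACE log.

    Each line is comma-separated: the function name comes first,
    then its arguments (the exact fields vary by function).
    """
    calls = Counter()
    for line in lines:
        line = line.strip()
        if line:
            calls[line.split(",")[0]] += 1
    return calls

sample = [
    "rocblas_create_handle,atomics_allowed",
    "rocblas_set_stream,0x56388d1e0a40,atomics_allowed",
    "rocblas_gemm_batched_ex,T,N,32,2,128,1,...",
]
print(parse_trace(sample))
```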
The profile log gives you the rocblas operation calls as a YAML
list of JSON-like objects, each with a call_count field that
aggregates repeated calls with the same arguments. I'm not sure
how useful this is, but it looks like so:
- { rocblas_function: "rocblas_gemm_batched_ex", atomics_mode:
atomics_allowed, a_type: "f16_r", b_type: "f16_r", c_type:
"f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T',
transB: 'N', M: 128, N: 1, K: 320, alpha: 1.0, lda: 2048, ldb:
320, beta: 0.0, ldc: 128, ldd: 128, batch_count: 32, algo: 0,
solution_index: 0, flags: pack_int, call_count: 352 }
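One thing the call_count field does enable is ranking the hot
operations. A sketch that pulls out the function name and call
count with a regex, to avoid needing a YAML parser (the
single-entry sample is abbreviated from the log above):

```python
import re

def profile_call_counts(text):
    """Extract (rocblas_function, call_count) pairs from a
    ROCBLAS_LOG_PROFILE log, one flow-style entry per line."""
    pattern = re.compile(
        r'rocblas_function:\s*"([^"]+)".*?call_count:\s*(\d+)')
    return [(m.group(1), int(m.group(2)))
            for m in pattern.finditer(text)]

log = ('- { rocblas_function: "rocblas_gemm_batched_ex", '
       "transA: 'T', transB: 'N', M: 128, N: 1, K: 320, "
       'call_count: 352 }')
print(profile_call_counts(log))
```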
The bench log, however, is quite useful. It gives you each call to
rocblas, but transformed into rocblas-bench commands for isolated
performance testing of the rocblas operation of interest:
./rocblas-bench -f gemm_batched_ex --transposeA T --transposeB N
-m 32 -n 2 -k 128 --alpha 1 --a_type f16_r --lda 1024 --b_type
f16_r --ldb 4096 --beta 0 --c_type f16_r --ldc --d_type f16_r
--ldd 32 --batch_count 32 --compute_type f16_r --algo 0
--solution_index 0 --flags 1
To use the above in Debian, you would install librocblas0-bench,
remove the --ldc flag (which appears to be missing its size
argument [3]), change ./rocblas-bench to
/usr/libexec/rocm/librocblas0-bench/rocblas-bench, and increase
the iteration count by adding something like --iters 1000 (or
more) to ensure you get enough samples to provide reasonable
timing results.
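Those rewrites are mechanical, so you could script them. A sketch
of one way to do it; the regex for the argument-less --ldc is my
own guess at a safe match, not something from the rocBLAS tooling:

```python
import re

DEBIAN_BENCH = "/usr/libexec/rocm/librocblas0-bench/rocblas-bench"

def debianize(line, iters=1000):
    """Rewrite a bench.log line for Debian's rocblas-bench.

    Drops the argument-less --ldc flag (Debian bug #1095201),
    substitutes the Debian binary path, and raises the iteration
    count for more stable timing.
    """
    # Remove a bare --ldc that is not followed by a numeric argument.
    line = re.sub(r'--ldc(?!\s+\d)\s*', '', line)
    line = line.replace("./rocblas-bench", DEBIAN_BENCH)
    return f"{line.strip()} --iters {iters}"

cmd = ("./rocblas-bench -f gemm_batched_ex --transposeA T "
       "--beta 0 --c_type f16_r --ldc --d_type f16_r --ldd 32")
print(debianize(cmd))
```

Note that a well-formed "--ldc 128" would be left alone; only the
flag with no size argument is stripped.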
The output you'll see should look something like this:
$ ./rocblas-bench -f gemm_batched_ex --transposeA T --transposeB N
-m 32 -n 2 -k 128 --alpha 1 --a_type f16_r --lda 1024 --b_type
f16_r --ldb 4096 --beta 0 --c_type f16_r --d_type f16_r --ldd 32
--batch_count 32 --compute_type f16_r --algo 0 --solution_index 0
--flags 1 --iters 10000
Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : AMD Radeon RX 6800 XT gfx1030
with 17.2 GB memory, max. SCLK 2575 MHz, max. MCLK 1000 MHz,
compute capability 10.3
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB,
maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------
rocBLAS info: Time taken to complete rocBLAS library
initialization is 1603 milliseconds.
transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,ldd,batch_count,rocblas-Gflops,us
T,N,32,2,128,1,1024,0,4096,128,32,32, 85.8785, 6.105
My system has an RX 6800 XT installed, so 0.0858785 TFLOPS is
quite a bit less than the theoretical half-precision peak of
41.47 TFLOPS [4]. That said, I'm not sure you could get anywhere
near the theoretical peak with these parameters anyway. With N=2,
it's barely a GEMM and I imagine it's memory limited. NVIDIA has
a good background page on matrix multiplication for deep learning
performance, which may be helpful for making sense of these
parameters and results [5].
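As a sanity check, the reported figure can be reproduced from the
standard 2*M*N*K flops-per-GEMM formula, and a rough
arithmetic-intensity estimate supports the memory-bound guess
(assuming f16 operands at 2 bytes each, and that beta=0 means C
is written but never read):

```python
# Batched GEMM parameters and timing from the bench run above.
M, N, K, batch = 32, 2, 128, 32
time_us = 6.105

flops = 2 * M * N * K * batch            # multiply-add = 2 flops
gflops = flops / (time_us * 1e-6) / 1e9
print(f"{gflops:.4f} Gflops")            # matches the 85.8785 in the log

# Rough arithmetic intensity: bytes for A, B, and the D write.
bytes_moved = 2 * batch * (M * K + K * N + M * N)
print(f"{flops / bytes_moved:.2f} flops/byte")
```

At under 2 flops/byte, the kernel would need enormous memory
bandwidth relative to compute to keep the ALUs busy, so a result
far below peak flops is unsurprising.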
If we examine the kernel selection logs printed due to
TENSILE_DB=0xFFFF [2], the output seems to suggest that the best
available assembly kernel is a poor match, though it's not clear
to me whether a better match would achieve better performance:
loaded code object
/lib/x86_64-linux-gnu/rocblas/2.47.0/library/Kernels.so-000-gfx1030.hsaco
loaded code object
/lib/x86_64-linux-gnu/rocblas/2.47.0/library/TensileLibrary_gfx1030.co
TensorDescriptor:calculate 3-tensor<Half>( sizes(128, 32,
32), strides(1, 1024, 32768),
offset(0))totalLogicalElements=131072 totalAllocatedElem=1047680
TensorDescriptor:calculate 3-tensor<Half>( sizes(128, 2,
32), strides(1, 4096, 8192), offset(0))totalLogicalElements=8192
totalAllocatedElem=258176
TensorDescriptor:calculate 3-tensor<Half>( sizes(32, 2,
32), strides(1, 32, 64), offset(0))totalLogicalElements=2048
totalAllocatedElem=2048
TensorDescriptor:calculate 3-tensor<Half>( sizes(32, 2,
32), strides(1, 32, 64), offset(0))totalLogicalElements=2048
totalAllocatedElem=2048
AMDGPU(matches: And(Processor(gfx90a): 0, CUCount(104): 0): 0
): 0
AMDGPU(matches: Processor(gfx1030): 1): 1
ProblemMap Searching for Contraction_l_Alik_Bljk_Cijk_Dijk found
Problem library (1 rows)
TruePred: 1
TruePred: 1
And(TypesEqual(a:Half == Float&& b:Half == Float&&
c:Half == Float&& d:Half == Float): 0): 0
And(): 1
TruePred: 1
Object key: 32, 2, 128
Key: 32, 2, 128
Starting point: 64, 64, 256
Rightward search...
TruePred: 1And(StridedBatched(1): 0): 0
speed: 0.625 | 64, 64, 256: 21252 < 1.79769e+308 <-- Best
distance, but no matching solution
TruePred: 1And(): 1
speed: 0.625 | 64, 64, 256: 21252 == 21252
speed: 0.981 | 64, 64, 1280: 1.33197e+06 > 21252
speed: 0.981 | 64, 64, 1280: 1.33197e+06 > 21252
<...>
speed: 53.044 | 128, 6784, 3328: 5.62447e+07 > 21252
speed: 53.044 | 128, 6784, 3328: 5.62447e+07 > 21252
256, 64, 256: Stopping rightward search early.
Leftward search...
Considered 11.6468% of entries.
Solution index selected: 22645
Running kernel:
Cijk_Alik_Bljk_HB_GB_MT16x16x16_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASAE01_ASCE01_ASEM1_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS1_FL0_GRVW2_GSU1_GSUAMB_GLS0_ISA1030_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_SIA1_SS0_SU32_SUM3_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW2_VSn1_VW2_WSGRA0_WSGRB0_WS32_WG8_8_1_WGM1
Kernel
Cijk_Alik_Bljk_HB_GB_MT16x16x16_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASAE01_ASCE01_ASEM1_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS1_FL0_GRVW2_GSU1_GSUAMB_GLS0_ISA1030_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_SIA1_SS0_SU32_SUM3_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW2_VSn1_VW2_WSGRA0_WSGRB0_WS32_WG8_8_1_WGM1
l(64, 1, 1) x g(2, 1, 32) = (128, 1, 32)
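If it's helpful, I read the final launch line as l(...) being the
workgroup (local) size and g(...) the grid of workgroups, with
their element-wise product giving the global size on the right;
that interpretation is my own, not from the Tensile docs. A small
parser sketch:

```python
import re

def launch_totals(line):
    """Parse a Tensile 'l(...) x g(...)' launch line into
    (threads per workgroup, number of workgroups)."""
    local, grid = [
        [int(x) for x in dims.split(",")]
        for dims in re.findall(r'\(([^)]*)\)', line)[:2]
    ]
    threads_per_wg = local[0] * local[1] * local[2]
    workgroups = grid[0] * grid[1] * grid[2]
    return threads_per_wg, workgroups

line = "l(64, 1, 1) x g(2, 1, 32) = (128, 1, 32)"
print(launch_totals(line))  # (64, 64)
```

For this problem, that's 64 workgroups of 64 threads each, i.e.
4096 work-items in total.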
In any case, I collected rocblas performance data running the
following models with llama.cpp:
- dolphin-2.2.1-mistral-7b.Q5_K_M
- dolphin-2.2.1-mistral-7b.Q6_K
- dolphin-2.5-mixtral-8x7b.Q4_K_M
- dolphin-2.7-mixtral-8x7b.Q4_K_M
- Meta-Llama-3.1-8B-Instruct-Q6_K
The logs are available on GitHub [6]. This is obviously an
incomplete analysis, but I hope it helps folks to understand some
of the performance analysis tools that are available on Debian
right now.
Sincerely,
Cory Bloor
[1]: https://rocm.docs.amd.com/projects/rocBLAS/en/docs-6.3.2/reference/logging.html
[2]: https://rocm.docs.amd.com/projects/Tensile/en/docs-6.3.2/src/reference/environment-variables.html
[3]: https://bugs.debian.org/1095201
[4]: https://www.amd.com/en/products/graphics/desktops/radeon/6000-series/amd-radeon-rx-6800-xt.html
[5]: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
[6]: https://gist.github.com/cgmb/5bc00ad3f04afbdd04e2ef12d4aabe2d