
Re: rocblas and llama.cpp performance profiling



Further analysis...

On 2025-02-06 18:06, Cordell Bloor wrote:

The profile log gives you the rocblas operation calls as a YAML list of JSON objects. I'm not sure how useful this is on its own, but it looks like this:

- { rocblas_function: "rocblas_gemm_batched_ex", atomics_mode: atomics_allowed, a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 128, N: 1, K: 320, alpha: 1.0, lda: 2048, ldb: 320, beta: 0.0, ldc: 128, ldd: 128, batch_count: 32, algo: 0, solution_index: 0, flags: pack_int, call_count: 352 }

I suppose the primary additional benefit is that this style of log includes the call_count, so you can judge the relative importance of optimizing each function.
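Since the profile log is plain YAML, it's easy to post-process. Here's a minimal Python sketch that sorts the entries by call_count so the hottest shapes come first; the filename and the use of PyYAML are my own assumptions, not something rocblas produces for you:

import yaml  # PyYAML

# Assumed filename for a saved rocblas profile log (a YAML list of mappings).
with open("rocblas_profile.yaml") as f:
    calls = yaml.safe_load(f)

# List the most frequently called operations first.
for call in sorted(calls, key=lambda c: c.get("call_count", 0), reverse=True):
    print(call.get("call_count", 0), call.get("rocblas_function"),
          "MNK =", call.get("M"), call.get("N"), call.get("K"),
          "batch =", call.get("batch_count"))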

The bench log, however, is quite useful. It gives you each call to rocblas, but transformed into rocblas-bench commands for isolated performance testing of the rocblas operation of interest:

./rocblas-bench -f gemm_batched_ex --transposeA T --transposeB N -m 32 -n 2 -k 128 --alpha 1 --a_type f16_r --lda 1024 --b_type f16_r --ldb 4096 --beta 0 --c_type f16_r --ldc --d_type f16_r --ldd 32 --batch_count 32 --compute_type f16_r --algo 0 --solution_index 0 --flags 1

To use the above in Debian, you would install librocblas0-bench, remove the --ldc flag (which appears to be missing its size argument [3]), change ./rocblas-bench to /usr/libexec/rocm/librocblas0-bench/rocblas-bench, and increase the iteration count by adding something like --iters 1000 (or more) to ensure you get enough samples to provide reasonable timing results.

Note that you can find the missing ldc value from the other logs. In this case, it was --ldc 32.

The output you'll see should look something like this:

$ ./rocblas-bench -f gemm_batched_ex --transposeA T --transposeB N -m 32 -n 2 -k 128 --alpha 1 --a_type f16_r --lda 1024 --b_type f16_r --ldb 4096 --beta 0 --c_type f16_r --d_type f16_r --ldd 32 --batch_count 32 --compute_type f16_r --algo 0 --solution_index 0 --flags 1 --iters 10000
Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : AMD Radeon RX 6800 XT gfx1030
with 17.2 GB memory, max. SCLK 2575 MHz, max. MCLK 1000 MHz, compute capability 10.3
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------


rocBLAS info: Time taken to complete rocBLAS library initialization is 1603 milliseconds.
transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,ldd,batch_count,rocblas-Gflops,us
T,N,32,2,128,1,1024,0,4096,128,32,32, 85.8785, 6.105

My system has a RX 6800 XT installed, so 0.858785 TFLOPS is quite a bit less than the theoretical peak of 41.47 TFLOPS [4].

Missed a zero. 85.8785 GFLOPS is 0.0858785 TFLOPS.

Although, I'm not sure you could actually get anywhere near the theoretical peak performance with these parameters anyway. With N=2, it's barely a GEMM and I imagine it's probably memory limited. NVIDIA has a good background page on matrix multiplication for deep learning performance, which may be helpful for making sense of these parameters and results [5].

Following the analysis from [5]:

Floating Point Operations (FLOP) = 2 * M * N * K = 2 * 32 * 2 * 128 = 16384
Bytes accessed = 2 * (M*K + N*K + M*N) = 2 * (32*128 + 2*128 + 32*2) = 8832

(The factor of 2 in the FLOP count is the multiply plus the add; the factor of 2 in the bytes accessed is the size of an f16 element.)

The arithmetic intensity is therefore about 1.9 FLOP / byte. Given that the RX 6800 XT can do 41.47 TFLOP/s in fp16 and has a memory bandwidth of up to 512 GB/s [4], it would be memory limited at anything less than ~81 FLOP / byte. So, this is a heavily bandwidth-limited operation.
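In case it helps to reproduce the arithmetic, here is the same roofline-style calculation as a small Python sketch. The peak FLOP rate and bandwidth are the spec-sheet numbers from [4]; everything else comes from the bench command above:

# Roofline-style estimate for the gemm_batched_ex call above (f16, 2 bytes/element).
M, N, K = 32, 2, 128
bytes_per_elem = 2  # f16_r

flop = 2 * M * N * K                                  # 16384 FLOP per GEMM
bytes_accessed = bytes_per_elem * (M*K + N*K + M*N)   # 8832 bytes per GEMM
intensity = flop / bytes_accessed                     # ~1.86 FLOP/byte

peak_flops = 41.47e12   # RX 6800 XT fp16 peak, FLOP/s [4]
peak_bw = 512e9         # memory bandwidth, bytes/s [4]
crossover = peak_flops / peak_bw                      # ~81 FLOP/byte

print(f"arithmetic intensity: {intensity:.2f} FLOP/byte")
print(f"compute-bound only above: {crossover:.0f} FLOP/byte")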

Before continuing, perhaps we should do a quick sanity check on the 85.8785 GFLOP/s result that rocblas-bench is reporting. We've determined that this matrix multiplication is 16384 operations, but we're actually calling the batched version of the function to do a batch of 32 multiplications at a time. So, we're really doing 16384 FLOP * 32 = 524288 FLOP. This operation completed in 6.105 us on average, so 5.243*10^5 FLOP / 6.105*10^-6 s = 8.588*10^10 FLOP/s = 85.88 GFLOP/s. That matches the reported figure.

If we consider batching for the bytes accessed, we have 8832 bytes * 32 = 282624 bytes. Next, we can take the number of bytes accessed and divide by the number of bytes we could have accessed in the time the operation took: 282624 B / (512*10^9 B/s * 6.105*10^-6 s) = 0.0904 = 9.04% of theoretical bandwidth utilization.
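The same check as a self-contained Python sketch; the only inputs are the bench command's parameters, the 6.105 us average reported above, and the bandwidth figure from [4]:

# Achieved FLOP rate and bandwidth utilization for the batched call
# (batch_count = 32, average time 6.105 us from the rocblas-bench output above).
M, N, K, batch = 32, 2, 128, 32
bytes_per_elem = 2      # f16_r
time_s = 6.105e-6       # average time per call
peak_bw = 512e9         # RX 6800 XT memory bandwidth, bytes/s [4]

total_flop = 2 * M * N * K * batch                          # 524288 FLOP
total_bytes = bytes_per_elem * (M*K + N*K + M*N) * batch    # 282624 bytes

achieved = total_flop / time_s                   # ~8.59e10 FLOP/s, ~85.9 GFLOP/s
utilization = total_bytes / (peak_bw * time_s)   # ~0.09, i.e. ~9% of peak bandwidth

print(f"achieved: {achieved/1e9:.1f} GFLOP/s")
print(f"bandwidth utilization: {utilization:.1%}")
print(f"upper bound on speedup if perfectly memory-bound: ~{1/utilization:.0f}x")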

So, in theory, with perfect optimization, this hardware might be able to run llama.cpp ~11x faster than it currently does. In practice, I don't think it's likely that we'll get anywhere near 100% utilization, but we'd get a 2x-3x improvement even just from pushing the utilization up to ~25% of the theoretical maximum. We'll have to see if Tensile tuning might help, although I think rocBLAS in general struggles with skinny matrices. I see there's an upstream bug relating to this for MI100 [7].

Please keep in mind that these are my rough notes. They have not been checked by anyone and may contain errors. I know a thing or two about performance analysis, but I am not by any means an expert on the subject. This is, nevertheless, the justification for why I believe there is plenty of room to improve the performance of llama.cpp on AMD GPU hardware.

Sincerely,
Cory Bloor

[1]: https://rocm.docs.amd.com/projects/rocBLAS/en/docs-6.3.2/reference/logging.html
[2]: https://rocm.docs.amd.com/projects/Tensile/en/docs-6.3.2/src/reference/environment-variables.html
[3]: https://bugs.debian.org/1095201
[4]: https://www.amd.com/en/products/graphics/desktops/radeon/6000-series/amd-radeon-rx-6800-xt.html
[5]: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
[6]: https://gist.github.com/cgmb/5bc00ad3f04afbdd04e2ef12d4aabe2d
[7]: https://github.com/ROCm/rocBLAS/issues/1425
