Hi Xuanteng,
I don't have all the answers yet, but I'll take a shot at
answering regardless. It's possible I'm mistaken about some of the
details.
> What I’m curious about is the moment the decompression happens, and the overhead to the end-to-end latency. Does it mean that the compressed GPU kernels should be decompressed first before their launch to GPU?
It's my understanding that it's the offload bundle that is compressed. The offload bundle is unbundled by libcomgr (in ROCm 5.7 and earlier) or libamdhip64 (in ROCm 6.0 and later). I would therefore reason that the code is decompressed before it is sent to the GPU.
I'm actually not sure when the upload of the kernels to the GPU is done. I suppose it must be either at shared library load time or handled lazily when a kernel is launched. I suspect the latter, and after a bit of research, I think I've more or less confirmed it: it seems there was a move to lazily uploading kernels back in 2019 [1].
Each translation unit gets its own offload bundle (by default),
so my guess is that the first time that a translation unit
launches a kernel, the bundle for that translation unit is
decompressed and all the compatible code from the bundle is
uploaded to the GPU.
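If you want to check this on your own system, a quick (and entirely unofficial) way is to time the first launch of a kernel against later launches of the same kernel. This is just my own sketch, not anything from the HIP docs; the kernel, the error-check macro, and the file name are made up, and the absolute numbers will depend on your hardware and how large the bundle is. If the lazy-upload theory is right, the first launch should absorb the decompress-and-upload cost and the later ones should not.

// first_launch_timing.hip -- build with: hipcc first_launch_timing.hip -o first_launch_timing
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(expr)                                              \
  do {                                                               \
    hipError_t err_ = (expr);                                        \
    if (err_ != hipSuccess) {                                        \
      std::fprintf(stderr, "HIP error: %s (line %d)\n",              \
                   hipGetErrorString(err_), __LINE__);               \
      std::exit(EXIT_FAILURE);                                       \
    }                                                                \
  } while (0)

// Trivial kernel; its only job is to force this translation unit's
// code object to be needed on the device.
__global__ void touch(int* out) { *out = 42; }

static double time_launch(int* d_out) {
  auto t0 = std::chrono::steady_clock::now();
  touch<<<1, 1>>>(d_out);
  HIP_CHECK(hipGetLastError());
  HIP_CHECK(hipDeviceSynchronize());
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  int* d_out = nullptr;
  HIP_CHECK(hipMalloc(&d_out, sizeof(int)));  // initializes the runtime before we start timing

  // If code objects are uploaded lazily, launch 1 should be noticeably
  // slower than launches 2 and 3.
  for (int i = 1; i <= 3; ++i)
    std::printf("launch %d: %.3f ms\n", i, time_launch(d_out));

  HIP_CHECK(hipFree(d_out));
  return 0;
}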
> Does it happen every time or for the first time?
I believe it would need to perform the decompression once per
program execution.
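That also gives you a way to tell the cases apart with the sketch above: if the cost were paid on every launch, all three timings would look about the same, whereas if it is paid once per program execution, only the first launch in each run of the binary should show the extra latency.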
> Or the decompression happens at the time when the package gets installed?
No.
Sincerely,
Cory Bloor
[1]: https://github.com/ROCm/HIP/issues/1304#issuecomment-519691962