
Re: Building llama.cpp for AMD GPU using only Debian packages?



Hi Jamie,

My expertise is in computer architecture, numerical simulations, computer graphics, stupid C++/CMake tricks, and the implementation details of the ROCm technology stack. Assume that the only thing I know about AI models is that they involve lots of matrix multiplications and convolutions.

On 2025-01-31 16:48, Jamie Bainbridge wrote:
> The tuning would be for an architecture, not a specific model, right?

The tuning would be for a particular BLAS operation. For example, a half-precision GEMM between two transposed matrices, with alpha and beta equal to one, using single-precision to accumulate intermediate results, and producing a 512x512 output matrix in a new buffer. If you change any one of those parameters, it's a different operation that requires different tuning [1].
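For illustration, here is a rough sketch of what that specific operation looks like at the rocBLAS API level. This is just my own minimal example to make the parameters concrete, not code taken from llama.cpp, and error handling is omitted:

// Sketch only: FP16 inputs and output, FP32 accumulation, alpha = beta = 1,
// both matrices transposed, 512x512 result written to a separate buffer.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

int main() {
    const rocblas_int m = 512, n = 512, k = 512;
    const float alpha = 1.0f, beta = 1.0f;   // accumulation happens in FP32

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // With transA = transB = transpose (column-major), A is stored k x m,
    // B is stored n x k, and the C/D matrices are m x n.
    rocblas_half *dA, *dB, *dC, *dD;
    hipMalloc(&dA, sizeof(rocblas_half) * k * m);
    hipMalloc(&dB, sizeof(rocblas_half) * n * k);
    hipMalloc(&dC, sizeof(rocblas_half) * m * n);
    hipMalloc(&dD, sizeof(rocblas_half) * m * n);   // "new buffer" for the result

    // Every one of these arguments (the transposes, the data types, the
    // compute type, the sizes, alpha and beta) feeds into which pre-tuned
    // kernel gets selected.
    rocblas_gemm_ex(handle,
                    rocblas_operation_transpose, rocblas_operation_transpose,
                    m, n, k,
                    &alpha,
                    dA, rocblas_datatype_f16_r, k,
                    dB, rocblas_datatype_f16_r, n,
                    &beta,
                    dC, rocblas_datatype_f16_r, m,   // C (input)
                    dD, rocblas_datatype_f16_r, m,   // D (output)
                    rocblas_datatype_f32_r,          // compute type
                    rocblas_gemm_algo_standard, 0, 0);

    hipFree(dA); hipFree(dB); hipFree(dC); hipFree(dD);
    rocblas_destroy_handle(handle);
    return 0;
}

The tuned kernel is chosen inside rocblas_gemm_ex from that exact combination of arguments, so a different output size, or FP16 accumulation, would be a different entry in the solution catalog [1].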

> afaics the most used model architectures are:
>
> - Qwen2ForCausalLM (Qwen 2.5, Qwen 2.5 Coder)
> - LlamaForCausalLM (Llama 3.0, 3.1, 3.2)
> - MistralForCausalLM (Mistral, Ministral, etc)
>
> These models are popular for local inference due to their realistic
> RAM requirements (1 to 4 gaming GPUs or old cheap datacentre GPUs).
> Also most organisations and individuals doing finetunes are using
> these models as a base for their improvements.
>
> Of course DeepSeek R1 has become popular in the last month but is not
> often run locally due to the high memory requirement (671B
> parameters).

I'm afraid I don't know what a model architecture is. What aspects of the model are defined by the architecture, and which aspects may vary between different models of the same architecture?

Sincerely,
Cory Bloor

[1]: https://rocm.docs.amd.com/projects/Tensile/en/docs-6.3.2/src/conceptual/solution-selection-catalogs.html

