
Re: Building llama.cpp for AMD GPU using only Debian packages?



On Sun, 2 Feb 2025 at 19:10, Cordell Bloor <cgmb@slerp.xyz> wrote:
>
> Hi Jamie,
>
> My expertise is in computer architecture, numerical simulations, computer graphics, stupid C++/CMake tricks, and the implementation details of the ROCm technology stack. Assume that the only thing I know about AI models is that they involve lots of matrix multiplications and convolutions.
>
> On 2025-01-31 16:48, Jamie Bainbridge wrote:
>
>> The tuning would be for an architecture, not a specific model, right?
>
> The tuning would be for a particular BLAS operation. For example, a half-precision GEMM between two transposed matrices, with alpha and beta equal to one, using single-precision to accumulate intermediate results, and producing a 512x512 output matrix in a new buffer. If you change any one of those mentioned parameters, it's a different operation that requires different tuning [1].
>
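
(For concreteness, I think the operation you describe maps to a
rocblas_gemm_ex call roughly like the sketch below. I'm going from
the rocBLAS docs as I understand them, so treat the parameter choices
as my assumptions rather than a statement of what gets tuned; setup
and error handling are elided.)

  // A minimal sketch of the operation described above: FP16 inputs,
  // both transposed, alpha = beta = 1, FP32 accumulation, and a
  // 512x512 result written to a fresh buffer D. Handle creation,
  // device allocation, and error checking are elided.
  #include <rocblas/rocblas.h>

  void tuned_hgemm(rocblas_handle handle,
                   const rocblas_half *dA, const rocblas_half *dB,
                   const rocblas_half *dC, rocblas_half *dD)
  {
      const rocblas_int m = 512, n = 512, k = 512;
      const float alpha = 1.0f, beta = 1.0f;  // f32 compute scalars

      rocblas_gemm_ex(handle,
                      rocblas_operation_transpose,  // op(A) = A^T
                      rocblas_operation_transpose,  // op(B) = B^T
                      m, n, k,
                      &alpha,
                      dA, rocblas_datatype_f16_r, k,  // lda >= k for A^T
                      dB, rocblas_datatype_f16_r, n,  // ldb >= n for B^T
                      &beta,
                      dC, rocblas_datatype_f16_r, m,  // ldc >= m
                      dD, rocblas_datatype_f16_r, m,  // D != C: new buffer
                      rocblas_datatype_f32_r,         // accumulate in FP32
                      rocblas_gemm_algo_standard, 0, 0);
  }
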
>> afaics the most used model architectures are:
>>
>> - Qwen2ForCausalLM (Qwen 2.5, Qwen 2.5 Coder)
>> - LlamaForCausalLM (Llama 3.0, 3.1, 3.2)
>> - MistralForCausalLM (Mistral, Ministral, etc)
>>
>> These models are popular for local inference due to their realistic
>> RAM requirements (1 to 4 gaming GPUs or old cheap datacentre GPUs).
>> Also most organisations and individuals doing finetunes are using
>> these models as a base for their improvements.
>>
>> Of course DeepSeek R1 has become popular in the last month but is not
>> often run locally due to the high memory requirement (671B
>> parameters).
>
> I'm afraid I don't know what a model architecture is. What aspects of the model are defined by the architecture, and which aspects may vary between different models of the same architecture?

This is hitting the limits of my knowledge too. As I understand it,
the model architecture defines how to run the weights: it covers
things like the tokeniser, the number and type of attention heads,
the layers, the feed-forward networks, and the positional encodings.
In other words, it specifies how the calculations are performed
between tokenisation and logit sampling.

As implied above, larger organisations seem to have settled on a model
architecture, and a new model release just involves updating weights
to run on the same architecture.

The difference between, say, Llama 3.1 and 3.2 is in the number of
weights and the training method by which those weights were arrived
at, not in the way the weights are executed at runtime.

Perhaps someone could confirm or correct me.

Jamie

