Hi Jamie,
My expertise is in computer architecture, numerical simulations, computer graphics, stupid C++/CMake tricks, and the implementation details of the ROCm technology stack. Assume that the only thing I know about AI models is that they involve lots of matrix multiplications and convolutions.
> The tuning would be for an architecture, not a specific model, right?
The tuning would be for a particular BLAS operation. For example, a half-precision GEMM between two transposed matrices, with alpha and beta equal to one, using single-precision to accumulate intermediate results, and producing a 512x512 output matrix in a new buffer. If you change any one of those parameters, it's a different operation that requires different tuning [1].
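To make that concrete, here's roughly what that exact operation looks like at the rocBLAS API level, as a minimal sketch: the handle and device buffers are assumed to be set up elsewhere, and the function name, buffer names, and leading dimensions are illustrative, not from any real caller.

    #include <rocblas/rocblas.h>

    // D = 1.0 * A^T * B^T + 1.0 * C, with half-precision inputs and
    // outputs, single-precision accumulation, and the 512x512 result
    // written to a new buffer d (distinct from c).
    void hgemm_example(rocblas_handle handle, const rocblas_half* a,
                       const rocblas_half* b, const rocblas_half* c,
                       rocblas_half* d)
    {
        const rocblas_int m = 512, n = 512, k = 512;
        const float alpha = 1.0f, beta = 1.0f;  // match the f32 compute type

        rocblas_gemm_ex(handle,
                        rocblas_operation_transpose,   // op(A) = A^T
                        rocblas_operation_transpose,   // op(B) = B^T
                        m, n, k, &alpha,
                        a, rocblas_datatype_f16_r, k,  // lda: A stored as k x m
                        b, rocblas_datatype_f16_r, n,  // ldb: B stored as n x k
                        &beta,
                        c, rocblas_datatype_f16_r, m,  // ldc
                        d, rocblas_datatype_f16_r, m,  // ldd: separate output
                        rocblas_datatype_f32_r,        // accumulate in fp32
                        rocblas_gemm_algo_standard, 0, 0);
    }

Each distinct combination of those arguments (transpose flags, datatypes, compute type, sizes) can dispatch to a different tuned kernel underneath.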
> afaics the most used model architectures are:
>  - Qwen2ForCausalLM (Qwen 2.5, Qwen 2.5 Coder)
>  - LlamaForCausalLM (Llama 3.0, 3.1, 3.2)
>  - MistralForCausalLM (Mistral, Ministral, etc)
>
> These models are popular for local inference due to their realistic RAM requirements (1 to 4 gaming GPUs or old cheap datacentre GPUs). Also most organisations and individuals doing finetunes are using these models as a base for their improvements. Of course DeepSeek R1 has become popular in the last month but is not often run locally due to the high memory requirement (671B parameters).
I'm afraid I don't know what a model architecture is. What
aspects of the model are defined by the architecture, and which
aspects may vary between different models of the same
architecture?
Sincerely,
Cory Bloor