
Re: Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++



Hi Mo,

thanks for the feedback!

On 2024-12-22 17:38, Mo Zhou wrote:
> Did you have a chance to test int8 and int4? They are heavily relying on
> newer SIMD instructions especially things like AVX512, and maybe they
> face a larger performance impact without -march=native. BTW, for recent
> large language models, in fact int4 does not lose much performance[1],
> and should be the default precision to run locally since it ought to be
> anyway faster than CPU floating point.

It's not as bad as with f16: int8 is about 5x worse, int4 about 3x worse.

Though I personally still consider that too bad for this particular
software. But see below.

> If llama.cpp really loses a lot of int4 performance without SIMD, that
> could be more demotivating to be honest.
> 
> I'm also a llama.cpp user through Ollama[2]'s convenient wrapping work.
> It is too complicated to consider for packaging -- I mention it here in
> order to give you a better idea on how the ecosystem uses llama.cpp, in
> case you did not see it before.

Not yet, but I intend to look at it as it does seem to be the most
popular interface. Though yeah, probably too complicated for packaging.

I haven't packaged any Golang stuff yet, but the dependencies can get
daunting, and that makes backports especially difficult.

> From my point of view, llama.cpp is more suitable for source-based
> distributions like Gentoo. In the past I proposed something similar for
> Debian but the community was not interested in that.

Well, we could still keep this in mind and implement it experimentally
for some packages. After all, something of the sort already happens with DKMS.

Given how data is processed today, our amd64 baseline really does impede
machine-learning software, so some solution will need to be found.

> In terms of the BLAS/MKL-like approach for SIMD capability
> dispatching ... I bet focusing on something else is more worthwhile.

I probably misunderstood something about the MKL packages. I saw
  libmkl-avx
  libmkl-avx2
  libmkl-avx512
  ...
and I assumed that these were all the same library, just built with
different optimizations. But they don't seem to conflict with each
other, so I guess it's more subtle than that.
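
Presumably the subtlety is runtime dispatch. Just to illustrate what I
mean (a minimal sketch with made-up function names, not MKL's or
llama.cpp's actual code): the library probes the CPU once and routes
calls to the widest kernel it supports, so the differently optimized
kernels can all be shipped side by side without conflicting packages:

  /* cpu_dispatch.c -- minimal sketch of runtime SIMD dispatch.
   * All function names here are made up for illustration. */
  #include <stdio.h>

  static void matmul_generic(void) { puts("generic kernel"); }
  static void matmul_avx2(void)    { puts("AVX2 kernel"); }
  static void matmul_avx512(void)  { puts("AVX-512 kernel"); }

  typedef void (*matmul_fn)(void);

  static matmul_fn select_matmul(void)
  {
      /* GCC/Clang builtins for x86 CPU feature detection. */
      __builtin_cpu_init();
      if (__builtin_cpu_supports("avx512f"))
          return matmul_avx512;
      if (__builtin_cpu_supports("avx2"))
          return matmul_avx2;
      return matmul_generic;
  }

  int main(void)
  {
      matmul_fn matmul = select_matmul();
      matmul();
      return 0;
  }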

My idea was to just build libllama X times with various optimizations
enabled. Users could then select the version best suited to their CPU
and/or GPU, while we would also keep a non-optimized build that
satisfies our amd64 ISA requirements.
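
Just to make the idea concrete (a rough sketch only; the paths and the
whole layout are hypothetical, not an actual proposal): a thin loader
could dlopen() the most capable build the CPU can run and fall back to
the baseline one:

  /* llama-loader.c -- hypothetical variant selection for libllama.
   * Build with: gcc llama-loader.c -ldl */
  #include <dlfcn.h>
  #include <stdio.h>

  int main(void)
  {
      const char *candidates[] = {
          "/usr/lib/llama.cpp/avx512/libllama.so",  /* hypothetical path */
          "/usr/lib/llama.cpp/avx2/libllama.so",    /* hypothetical path */
          "/usr/lib/llama.cpp/libllama.so",         /* baseline amd64 build */
      };

      /* Skip variants the CPU cannot run (AVX-512 implies AVX2). */
      __builtin_cpu_init();
      int start = 0;
      if (!__builtin_cpu_supports("avx512f")) start = 1;
      if (!__builtin_cpu_supports("avx2"))    start = 2;

      for (int i = start; i < 3; i++) {
          void *handle = dlopen(candidates[i], RTLD_NOW | RTLD_GLOBAL);
          if (handle) {
              printf("loaded %s\n", candidates[i]);
              /* resolve the entry points with dlsym() and carry on ... */
              return 0;
          }
      }
      fprintf(stderr, "no usable libllama build found\n");
      return 1;
  }

In practice one would probably rather use glibc-hwcaps library paths or
update-alternatives than a custom loader, but the selection logic would
be the same.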

Best,
Christian

