
Re: Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++



Hi,

back from the holidays. I hope everyone else got a chance to rest well, too.

On 2024-12-22 20:42, Mo Zhou wrote:
> On 12/22/24 13:03, Christian Kastner wrote:
>> Given how data is processed today, our amd64 baseline really does impede
>> machine-learning software, so some solution will need to be found.

Agreed. A few years ago, this state was probably just "bad", but given
recent developments, a proper solution is no longer avoidable if Debian
wants to compete in this space -- which it should, of course.

Sure, it would be best if upstreams solved this via runtime dispatch, and
many do. But not all do, and we can neither force them to change nor
spare the resources to contribute fixes to every upstream.

> Apart from source-based alternative distribution for Debian, "bumping amd64
> baseline for selected packages" is another project I proposed long time
> ago:
> 
>   https://github.com/SIMDebian/SIMDebian
> 
> Software like Eigen3, TensorFlow can heavily benefit from the baseline
> bump.

Ah yes, I remember now. Indeed, that would be one solution, and probably
the easiest one to implement.

It can be slightly tricky, though. What would happen if -march is also
set by the maintainer in d/rules? I guess you could catch those, and
possibly other corner cases, with a trivial linter?

Another issue would be the differentiation-by-version approach: it's
fine if only one external source (with one -march) is activated, but as
soon as a second one is added (users being users), things can get messy.

The more complicated route would be to bootstrap new architectures, e.g.
amd64-v3, which users could activate in addition to amd64. While my gut
says that this would be preferable long-term, I'm sure there are pretty
bad corner cases here, too.

The glibc "hardware capabilities" route you refer to below would be the
jackpot, as things would "just work" as long as the maintainer goes
through the trouble of implementing it (which, I'm sure, AI/ML
maintainers would be happy to do).

Proposal: how about we restart the discussion project-wide sometime
during 2025, aiming for a general solution as a release goal for trixie+1?

And in the meantime, we (debian-ai) can experiment with various
approaches, to provide real-world data for such a discussion. llama.cpp
really seems like a good test balloon for this, as it touches all of the
corner cases (I think) while still being "simple".

> BTW, my personal conclusion on the SIMDebian project is that, while bumping
> baseline can indeed benefit a lot, it is not necessary to do so by ourselves,
> because if SIMD performance is really important to this software,
> 
> (1) the users will figure out how to compile using -march=native. Typically
> consider this is also done in highly-customized environments like HPC. Those
> power users will anyway recompile on their own to fit their need whatever we
> package.

I agree that HPC environments will probably do this. I'm not so sure
about individuals.

In any case, globally it would save a lot of time and trouble if Debian
could provide a solution out of the box.

E.g.: recompiling a package is obviously not an obstacle for me (nor
even hosting my own APT repo), but it would nevertheless be much nicer
if I could just `apt-get install` or `upgrade`. Especially when more
machines are involved.

> (2) the upstream will implement them soon after being requested, if the
> software remains popular while not interested in implementing so, that
> means SIMD was not necessary for them. 

Hm, I'm not so sure about that, either. Some (or even most?) upstreams
definitely will. But llama.cpp is a good counterexample: even a popular
upstream can decide to go the compile-time route. Even PyTorch didn't
originally have this, as you mention.

>> My idea was to just build libllama X times with various optimizations
>> enabled. So users can select the version best applicable to their CPU
>> and/or GPU, but we would also have a non-optimized version that would
>> satisfy our amd64 ISA requirements.
> No need to do that manually. The Glibc already provided that kind of
> dispatching functionality when you build multiple solibs with different
> baselines. Please check
> 
>   https://lists.debian.org/debian-devel/2019/04/msg00057.html
> 
> Or concretely,
> 
>   section "Hardware capabilities" from ld.so(8)

Oh, this would be fantastic! In the ideal case, that would not just
solve my problem, but present a general solution.

For llama.cpp, I'll focus on this approach. Thanks!

Two possible downsides here: (1) does this work with every linker (but
do we even support alternatives?), and (2) ARM is not mentioned. Though
I know too little about ARM ISAs to say whether this is a problem. Might
be a non-issue there.

> I believe this is what you were expecting. But it seems that the
> avx2/ avx512 dispatch is missing from the man page. Don't know
> what's happening for their support.

Looking at the man page in unstable [1], this seems to have changed with
glibc 2.37, which apparently dropped support for "legacy hardware
capabilities" (= anything 32-bit), and improved support for the 64-bit
ones, including avx512 support via x86-64-v4.

> I'm surprised those discussions were already something happened 5 years
> ago.

Indeed. But now, more than ever, is the time to re-raise the issue, I
think. Compute is changing, and Debian needs to adapt.

Best,
Christian

[1]: https://manpages.debian.org/unstable/manpages/ld.so.8.en.html
