
llama.cpp, whisper.cpp, ggml: Next steps



Hi,

here is what I gathered looking at llama.cpp, whisper.cpp, and ggml in
detail.

TL;DR: Having a performant llama.cpp and whisper.cpp means having a
performant libggml. Factoring libggml out, and addressing performance
there, should substantially facilitate the build and maintenance of
{llama,whisper}.cpp. I also think that in the end, this will be less work
than adapting the individual build processes, especially that of
whisper.cpp.

In fact, their build processes should become much simpler. Hence, I have
filed an ITP for ggml, will fix all the performance stuff there, and then
llama.cpp and whisper.cpp can be adjusted accordingly.


Overview
========

llama.cpp and whisper.cpp both share a dependency on ggml, a tensor
library, which they embed in the source distribution. Perhaps due to
ggml being under active development, the integration into llama.cpp and
whisper.cpp is very tight, rather than your typical vendoring.

Nevertheless, at any given time, ggml upstream and the embedded versions
are pretty much in sync. Consequently, whatever patch or feature addition
works for llama.cpp's ggml should also apply without change to
whisper.cpp's ggml, and vice versa.

llama.cpp and whisper.cpp both build a number of binaries that depend on
libllama or libwhisper, respectively, which in turn depend on libggml.
Some binaries also depend on libggml directly.

Practically all of the performance-relevant code is in ggml. libllama and
libwhisper by themselves don't benefit from hwcaps builds, for example.


libggml
=======

libggml is about performant tensor operations. To this end, its
performance characteristics can be controlled by a number of build
options.

The CPU backend, available on all architectures, supports the following
approaches (a configure sketch follows below):
  (A) -march=native (the default)
  (B) Turning individual instruction set extensions (eg: AVX2) on/off
      through flags
  (C) On amd64 only, a form of runtime dynamic dispatch
  (D) GLIBC Hardware Capabilities (hwcaps), i.e. shipping multiple
      variations of (B)
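
For reference, these roughly map to the following upstream CMake switches
(option names as of current upstream, and likely to change as ggml
evolves; (D) has no upstream switch of its own, it is simply several
(B)-style builds shipped in glibc-hwcaps directories):

  # (A) native build, tuned to the build machine
  cmake -B build -DGGML_NATIVE=ON
  # (B) fixed instruction selection, eg: baseline x86-64 plus AVX2/FMA
  cmake -B build -DGGML_NATIVE=OFF -DGGML_AVX2=ON -DGGML_FMA=ON
  # (C) dynamic dispatch: CPU backend variants built as loadable modules
  cmake -B build -DGGML_NATIVE=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON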

In addition to the above, there are other backends available (BLAS, HIP,
CUDA, ...).

Backends support being dlopen()ed, but there are some issues with this
(see below). Consequently, each of the current llama.cpp builds is a
"full" build: llama.cpp-cpu, llama.cpp-blas, and llama.cpp-hip all
include their own copies of libllama, rather than just the backends they
add -- hence the Conflicts relationship between these packages.


Interface stability
===================

whisper.cpp has a SOVER, so I assume its interface is considered stable.

llama.cpp and libggml are still under very active development and do not
yet have stable interfaces.


Package Layout
==============

In accordance with Policy §8.1, I think the libraries of llama.cpp and
ggml should go into private directories. We otherwise risk breaking
users' locally built applications on updates, which is what Policy §8.1
aims to prevent. Well, the same breakage happens in private directories,
but we make no guarantees for those.

Under these assumptions, packages would ideally be laid out like this:

# llama.cpp: all libs in a private directory
/usr/bin/llama-*
/usr/lib/${DEB_HOST_MULTIARCH}/llama.cpp/{libllama,ggml}.so*
/usr/include/llama.cpp/{llama,ggml*}.h

# whisper.cpp: public and private libs
/usr/bin/whisper-*
/usr/lib/${DEB_HOST_MULTIARCH}/libwhisper.so*
/usr/lib/${DEB_HOST_MULTIARCH}/whisper.cpp/libggml.so*
/usr/include/whisper.h
/usr/include/whisper.cpp/{whisper,ggml*}.h

With CPU backend option (D), supporting hwcaps means adding
/usr/lib/${DEB_HOST_MULTIARCH}/glibc-hwcaps/<profile> subdirectories for
the libraries, where each profile is built with a different set of flags.
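
On amd64, for example, that would look roughly like this (the profile
names are glibc's x86-64 microarchitecture levels; the baseline build
stays in the parent directory):

/usr/lib/${DEB_HOST_MULTIARCH}/glibc-hwcaps/x86-64-v2/libggml.so*
/usr/lib/${DEB_HOST_MULTIARCH}/glibc-hwcaps/x86-64-v3/libggml.so*
/usr/lib/${DEB_HOST_MULTIARCH}/glibc-hwcaps/x86-64-v4/libggml.so*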


Challenges
==========

(1) Upstream sees the interfaces of libllama and libggml as public;
privatizing them requires some work. It's not just the layout of the
packages, it's also adjusting headers, pkgconfig files, CMake config
files, etc. This privatization is particularly tricky with whisper.cpp,
where libwhisper itself should remain public while its libggml goes
private.

(2) hwcaps builds, CPU build option (D), are easy to implement in the
general case (llama.cpp), but get messy when build results need to be
split between public and private interfaces, as with whisper.cpp, because
simple debian/<package>.install patterns won't work anymore.

First hint: Both (1) and (2) would be easier if libggml were packaged
standalone.

(3) Dynamic dispatch, CPU build option (C), currently only works on
amd64, and only for CPUs with AVX or above, which is beyond our amd64
baseline. For non-amd64 architectures, we'd have to limit builds to the
baseline, which on PowerPC would be POWER8, for example.

(4) Dynamic loading of backends requires (3), and can be a bit buggy (eg:
the lack of an AMD GPU causes an error-out when the HIP backend is
present, rather than the backend simply being ignored).


Approach
========

Solving (1): For llama.cpp, where everything is private, this is a
pretty simple tweak of common CMake variables. For whisper.cpp, I have
patched the build process to produce the desired layout, but I still
need to adjust the -dev files.
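
Concretely, for llama.cpp this boils down to something like the following
configure arguments (a sketch, assuming upstream keeps honouring the
stock GNUInstallDirs variables; in practice these are passed through
dh_auto_configure):

  cmake -B build \
    -DCMAKE_INSTALL_LIBDIR=lib/${DEB_HOST_MULTIARCH}/llama.cpp \
    -DCMAKE_INSTALL_INCLUDEDIR=include/llama.cpp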

Solving (3): I think I should be able to extend this to our other CPU
cases without much work. This is mostly just compiler flag driven.
Upstream's CMake configuration for this is pretty neat.
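
Roughly, the extra CPU variants would be driven by flags like these
(illustrative only, not a final list):

  # amd64: baseline plus the levels glibc hwcaps already distinguishes
  -march=x86-64, -march=x86-64-v2, -march=x86-64-v3, -march=x86-64-v4
  # ppc64el: baseline POWER8 plus newer generations
  -mcpu=power8, -mcpu=power9, -mcpu=power10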

Solving (4): After solving (3), I'm sure that there are easy fixes for
the bugs I encountered. I've already seen one recent PR fixing one of these.

For whisper.cpp, this means that (1) is only partially solved as of now,
and (2) would remain unsolved. However, I believe solving (3) and (4)
would remove the need for hwcaps. Solving (3) and (4) would also be less
work, especially considering the work that remains to be done for (1).


Summary
=======

For now, I'd leave llama.cpp as-is, and focus on packaging ggml. Within
ggml, I can start with a hwcaps implementation, and work my way up to
solving (3) and (4).

With libggml packaged, the llama.cpp and whisper.cpp builds can then be
adapted to use that package, rather than building their own copies. Their
build processes should then be pretty simple.


Best,
Christian

