
Re: Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++



On Thu, 2025-02-06 at 09:13 +0100, Christian Kastner wrote:
> 
> I meant to ask anyway: performance-wise, is it comparable to your local
> build? I mean, I wouldn't know what in the code would alter this, but I
> built and tested this on platti.d.o and performance was poor, so another
> data point would be useful.

For ppc64el, the llama.cpp-blas backend is way slower than the -cpu backend.
I did not test on amd64, but on ppc64el the package does not feel any
different from my local build.

The CPU backend is slow anyway. How does HIP perform?

model         | hardware, backend          | generation speed
phi-4-q4.gguf | power9, cpu (8 threads)    | 0.62 tokens/s
phi-4-q4.gguf | amd64 13900H, cpu          | 6.7 tokens/s

GPU is way faster than this, but the phi-4 model does not fit in my NVIDIA
GPU, so no GPU numbers this time.
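
If more data points are useful: the numbers above are token-generation rates,
and a rough sketch of how to collect comparable ones (assuming the Debian
package ships upstream's llama-bench tool; the model filename is just my local
file) would be something like:

    # CPU backend: 8 threads, llama-bench's default prompt/generation runs
    llama-bench -m phi-4-q4.gguf -t 8

    # GPU build (HIP/CUDA): offload all layers to the GPU with -ngl
    llama-bench -m phi-4-q4.gguf -ngl 99

llama-bench reports prompt-processing and token-generation throughput in
tokens/s, which is what the table above shows for generation.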

