
Re: Full open source datasets for testing and benchmarking?



On Fri, 2025-01-31 at 11:21 +0100, Christian Kastner wrote:
> Hi,
> 
> (this is more of a long-term question, but we have to start somewhere)
> 
> Working on llama.cpp, I of course ran into the problem that testing and
> benchmarking tasks require some model, but AFAIK we don't have any
> models in the Archive.
> 
> How are we going to address this long-term?

My answer to this is blocked by the planned general resolution.

Generally speaking, I think the whole ecosystem will rely on downloading
models from the internet at run-time for a long time.

In the archive there is an MIT-licensed dataset:
https://tracker.debian.org/pkg/dataset-fashion-mnist
I put it in the archive because it can be used in the autopkgtest of
pytorch to actually train a convolutional neural network as a
functionality test. A small convolutional neural network can be trained
on CPU very quickly on this dataset. That covers vision, though, not
natural language.
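
For illustration, a minimal sketch of that kind of functionality test
(this is not the actual autopkgtest script; it uses torchvision's
downloader and made-up hyperparameters, whereas the real test would read
the files shipped by dataset-fashion-mnist):

# Rough sketch only: a tiny CNN trained for one epoch on Fashion-MNIST.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train = datasets.FashionMNIST("data", train=True, download=True,
                              transform=transforms.ToTensor())
loader = DataLoader(train, batch_size=256, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 7 * 7, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:              # one epoch is enough for a smoke test
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print("final batch loss:", loss.item())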

If you really want to test llama.cpp using nothing outside the archive,
you may need a similar small-scale dataset like the above Fashion-MNIST,
plus a small model definition, so that the model trains very fast on
CPU. Then we can figure out how to convert the pytorch/onnx model into
the gguf format so llama.cpp can run it.

Or even simpler, just define a very small LLM and train it on llama.cpp's
source code with pytorch. Then convert the model into gguf and run the test.

The resulting model will very likely produce nonsense, but that is fine
for a unit test.
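
To make that concrete, here is a rough sketch of the training half of
that idea; the file glob, hyperparameters and model size are made up,
and the gguf conversion step is not shown (a toy model would also need
to be expressed in an architecture llama.cpp knows before conversion):

# Rough sketch: train a throwaway character-level model on llama.cpp
# sources with pytorch.
import glob
import torch
import torch.nn as nn

text = "".join(open(f, errors="ignore").read()
               for f in glob.glob("llama.cpp/src/*.cpp"))  # hypothetical path
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class TinyLM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = TinyLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):                  # a few minutes on CPU at most
    i = torch.randint(0, len(data) - 129, (32,))
    x = torch.stack([data[j:j + 128] for j in i])
    y = torch.stack([data[j + 1:j + 129] for j in i])
    opt.zero_grad()
    loss = loss_fn(model(x).reshape(-1, len(chars)), y.reshape(-1))
    loss.backward()
    opt.step()
print("loss:", loss.item())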

> From a technical POV, we'd probably need individual packages for each
> model, so we would need to come up with some kind of naming scheme.
> Also, we would need to clarify hosting: these models consume significant
> disk space and bandwidth.

The best solution, in my opinion, is to package huggingface-cli and
huggingface-hub, and simply download models from the internet.
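
With huggingface-hub packaged, the download boils down to something like
this (repository and file names below are placeholders, not
recommendations):

# Sketch: fetch a single gguf file from huggingface at test/run time.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="some-org/some-model-GGUF",   # hypothetical repository
    filename="some-model.Q4_K_M.gguf",    # hypothetical file
)
print("model cached at", path)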

Mirroring huggingface models through our archive would put very high
pressure on the Debian archive infrastructure as well as on downstream
mirrors.

Plus, whether you can really put an LLM into the archive depends on
the ftp-masters and the planned general resolution.

> From a policy POV, there's the question of how we would ensure DFSG
> compliance. Even if a model were to be distributed with DFSG-free
> training data, there is currently no reasonable way that the model could
> be "rebuilt from source" because even for relatively trivial stuff, that
> would require dozens of GPU hours [1]. If data, process, code, model are
> all DFSG-free, we could theoretically make a sort-of exception to the
> rebuilding, but I'm skeptical that this would be allowed, much less
> correct.

I would not worry about DFSG compliance now; no model looks
DFSG-compliant to me anyway. That said, we need to wait for the general
resolution.

> For autopkgtests, one workaround I could use now would be to use the
> "needs-internet" to download a model from huggingface at test time, but
> I'm again skeptical that using a non-free model would be permissible.
> I'll have to check.
> 

My personal opinion is to stick to the most common practice, "download
from the internet at runtime". In that sense everybody, including
upstream, is still in the same boat. Moving to a new, independent Debian
boat sounds good, but the ratio between investment and gain is simply
too scary.

A small (1.1 GB) and popular model to test with is the 1.5B version of
DeepSeek-R1:
https://ollama.com/library/deepseek-r1:1.5b
This should not put much network or disk burden on the test machine.
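
On the autopkgtest side, a stanza along these lines in
debian/tests/control would declare the network dependency (the test name
here is made up):

Tests: smoke-inference
Depends: @
Restrictions: needs-internet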

> [1]: Though TBH I would be strongly inclined to ask cloud providers for
> credits on machines to do exactly that.

The LLM world moves very fast. We might need to update more than 1 TB of
LLM models every month even if we only select the top-performing ones.
How do you convince potential supporters to fund what would be a lagging
duplicate of the free huggingface service?

