
Re: Full open source datasets for testing and benchmarking?



On 2025-01-31 17:59, M. Zhou wrote:
> Generally speaking, I think the whole ecosystem will rely on downloading
> models from the internet at run-time for a long time.

> The best solution, in my opinion, is to package huggingface-cli and
> huggingface-hub, and simply download models from the internet.

> I would not worry about DFSG compliance now, and no model seems to me
> to be compliant with the DFSG. That said, we need to see the general
> resolution.

> My personal opinion is to stick to the most common practice, "download
> from the internet at runtime". In that sense, everybody, including the
> upstream, is still in the same boat. Moving to a new, independent Debian
> boat sounds good, but the ratio between investment and gain is simply
> too scary.

Honestly, I came to the same conclusion -- this seems to be the only
way. But I wanted to solicit opinions from people with more experience
(especially you) before accepting this.
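
For what it's worth, the runtime-download approach is also very little
code. A minimal sketch with huggingface_hub (the repo id below is only
an example, not a settled choice):

    # Sketch: fetch a model at run-time with huggingface_hub.
    # The repo id is just an example of a small distilled model.
    from huggingface_hub import snapshot_download

    model_dir = snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    print("model files cached under:", model_dir)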

> A small (1.1GB) and popular model to test is the 1.5B version of DeepSeek-R1.
> https://ollama.com/library/deepseek-r1:1.5b
> This should not incur much network and disk burden for the test machine.

This was the one I had in mind, though I believe even this would be
unacceptable for the official debci (a non-free download, and 1.1GB might
still be quite a bit for the official workers).
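
If it were ever allowed, the test itself could stay tiny. A rough sketch
with the ollama Python client (it assumes a running ollama daemon and
network access, which is exactly the part at issue):

    # Sketch of a smoke test against deepseek-r1:1.5b via ollama.
    # Assumes the ollama daemon is running and the worker may pull ~1.1GB.
    import ollama

    ollama.pull("deepseek-r1:1.5b")
    reply = ollama.chat(
        model="deepseek-r1:1.5b",
        messages=[{"role": "user", "content": "Reply with the single word: pong"}],
    )
    print(reply["message"]["content"])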

>> [1]: Though TBH I would be strongly inclined to ask cloud providers for
>> credits on machines to do exactly that.
> 
> The LLM world moves very fast. You may need to update more than 1TB of
> LLM models every month, even if we only select the top-performing LLMs.
> How do you convince potential supporters to fund a lagging duplicate of
> the free huggingface service?

Because they could benefit from it. Meta, for example, is betting on
"open" (except for data); their revenue is not tied to AI/ML as OpenAI's
is. Thanks to DeepSeek, "open" has been all over the media again.

Also, I don't think we would need to have a top-performing LLM. For
testing and verification purposes, it might be sufficient to train
something on just the Wikipedia corpus, for example. I mean, even MNIST
has its uses.

Donating, e.g., 1000 GPU-hours a month to Debian is not just a rounding
error to someone with hundreds of thousands of GPUs; it is also a tax
write-off. I believe the PR could well be worth it.

Best,
Christian

