
Re: Full open source datasets for testing and benchmarking?



On 2025-01-31 17:59, M. Zhou wrote:
> Generally speaking, I think the whole ecosystem will rely on downloading
> models from the internet at run-time for a long time.

> The best solution, in my opinion, is to package huggingface-cli and
> huggingface-hub, and simply download models from the internet.

> I would not worry about DFSG compliance now, and no model seems to me
> to be compliant with the DFSG. That said, we need to see the general
> resolution.

> My personal opinion is to stick to the most common practice, "download
> from the internet at runtime". In that sense, everybody, including the
> upstream, is still in the same boat. Moving to a new, independent Debian
> boat sounds good, but the ratio between investment and gain is simply
> too scary.

Honestly, I came to the same conclusion -- this seems to be the only
way. But I wanted to solicit opinions from people with more experience
(especially you) before accepting this.
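
For what it's worth, the runtime-download approach is also very little
code. A minimal sketch with huggingface_hub (the repo id below is only
an example, not a settled choice):

    # Sketch: fetch a model at run-time with huggingface_hub.
    # The repo id is just an example of a small distilled model.
    from huggingface_hub import snapshot_download

    model_dir = snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    print("model files cached under:", model_dir)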

> A small (1.1GB) and popular model to test is the 1.5B version of DeepSeek-R1.
> https://ollama.com/library/deepseek-r1:1.5b
> This should not incur much network and disk burden for the test machine.

This was the one I had in mind, though I believe even this would be
unacceptable for the official debci (a non-free download, and 1.1GB might
still be quite a bit for the official workers).
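
If it were ever allowed, the test itself could stay tiny. A rough sketch
with the ollama Python client (it assumes a running ollama daemon and
network access, which is exactly the part at issue):

    # Sketch of a smoke test against deepseek-r1:1.5b via ollama.
    # Assumes the ollama daemon is running and the worker may pull ~1.1GB.
    import ollama

    ollama.pull("deepseek-r1:1.5b")
    reply = ollama.chat(
        model="deepseek-r1:1.5b",
        messages=[{"role": "user", "content": "Reply with the single word: pong"}],
    )
    print(reply["message"]["content"])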

>> [1]: Though TBH I would be strongly inclined to ask cloud providers for
>> credits on machines to do exactly that.
> 
> The LLM world moves very fast. You may need to update more than 1TB of
> LLM models every month, even if we only select the top-performing LLMs.
> How do you convince potential supporters to fund a lagging duplicate of
> the free huggingface service?

Because they could benefit from it. Meta, for example, is betting on
"open" (except for data); their revenue is not tied to AI/ML as OpenAI's
is. Thanks to DeepSeek, "open" has been all over the media again.

Also, I don't think we would need to have a top-performing LLM. For
testing and verification purposes, it might be sufficient to train
something on just the Wikipedia corpus, for example. I mean, even MNIST
has its uses.

Donating, e.g., 1000 GPU-hours a month to Debian is not just a rounding
error to someone with hundreds of thousands of GPUs; it is also a tax
write-off. I believe the PR could well be worth it.

Best,
Christian

