
Full open source datasets for testing and benchmarking?



Hi,

(this is more of a long-term question, but we have to start somewhere)

Working on llama.cpp, I of course ran into the problem that testing and
benchmarking tasks require some model, but AFAIK we don't have any
models in the Archive.

How are we going to address this long-term?

From a technical POV, we'd probably need individual packages for each
model, so we would need to come up with some kind of naming scheme.
Also, we would need to clarify hosting: these models consume significant
disk space and bandwidth.
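To make the naming question concrete, one hypothetical scheme (all names
invented for illustration, not a proposal) could tie each model package
to its consumer:

```
llama.cpp-model-tinystories     # small model, mainly for autopkgtests
whisper.cpp-model-base-en       # one binary package per model variant
```

Whether packages should instead be consumer-independent (with the file
format, e.g. GGUF, encoded in the name) is exactly the kind of thing a
naming scheme would have to settle.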

From a policy POV, there's the question of how we would ensure DFSG
compliance. Even if a model were distributed together with DFSG-free
training data, there is currently no reasonable way to "rebuild it from
source": even relatively trivial models take dozens of GPU hours to
train [1]. If data, process, code, and model are all DFSG-free, we
could theoretically make a sort-of exception to the rebuilding
requirement, but I'm skeptical that this would be allowed, much less
correct.

For autopkgtests, one workaround I could use now would be the
"needs-internet" restriction to download a model from Hugging Face at
test time, but I'm again skeptical that using a non-free model would be
permissible. I'll have to check.
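For reference, that workaround would look roughly like this in
debian/tests/control (test name and dependencies are placeholders, and
the downloaded model's licensing would still have to be cleared first):

```
Tests: inference-smoke
Depends: @, ca-certificates, wget
Restrictions: needs-internet, skippable
```

With "skippable", the test can exit 77 to be recorded as skipped when
the download fails, instead of showing up as a regression.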

Best,
Christian

[1]: Though TBH I would be strongly inclined to ask cloud providers for
credits on machines to do exactly that.
