Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"
Hi Andreas,
On 2019-05-21 09:07, Andreas Tille wrote:
> Not sure whether this is sensible to be added to the issue
> tracker.
I always abuse the issue tracker in my personal repository.
> Quoting from your section "Questions Not Easy to Answer"
>
>
> 1. Must the dataset for training a Free Model be present in our archive?
> The Wikipedia dump is a frequently used free dataset in the computational
> linguistics field; is uploading a Wikipedia dump to our archive sane?
>
> I have no idea about the size of this kind of dump. Recently I've read
> that data sets for other programs tend toward the 1GB range. In
> Debian Med I'm maintaining metaphlan2-data at 204MB, which would be
> even larger if it did not use a "data reduction" method that other
> DDs consider a bug (#839925).
As pointed out by Mattias Wadenstein (thanks for the data point), the
Wikipedia dump is large enough to challenge the .deb format (see the
recent threads).
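For a sense of scale, here is a rough back-of-the-envelope sketch. The ~16 GiB figure for the compressed English Wikipedia pages-articles dump is an assumed, approximate value (dumps grow over time); the 204 MB metaphlan2-data figure is taken from above:

```python
def human(nbytes: float) -> str:
    """Format a byte count using binary (1024-based) units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if nbytes < 1024:
            return f"{nbytes:.1f} {unit}"
        nbytes /= 1024
    return f"{nbytes:.1f} PiB"

# Approximate sizes -- assumptions for illustration, not measured here:
enwiki_bz2 = 16 * 1024**3        # compressed enwiki pages-articles dump, ~16 GiB
metaphlan2_data = 204 * 1024**2  # metaphlan2-data package, ~204 MiB

print(human(enwiki_bz2))                                  # 16.0 GiB
print(f"~{enwiki_bz2 / metaphlan2_data:.0f}x metaphlan2-data")  # ~80x
```

So even a single compressed dump is roughly two orders of magnitude beyond the largest data package mentioned above, before any decompression.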
> 2. Should we re-train the Free Models on buildd? This is crazy; let's
> not do that right now.
>
> If you ask me bothering buildd with this task is insane. However I'm
> positively convinced that we should ship the training data and be able
> to train the models from these.
It's always good if we can do these things purely within our archive.
However, sometimes that's just not easy to enforce: datasets used by
deep learning are generally large (several hundred MB to several TB,
or even more).