
Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"



Hi Andreas,

On 2019-05-21 09:07, Andreas Tille wrote:
> Not sure whether this is sensible to be added to the issue
> tracker.

I always abuse the issue tracker in my personal repository.

> Quoting from your section "Questions Not Easy to Answer"
> 
> 
>   1. Must the dataset for training a Free Model be present in our archive?
>      A Wikipedia dump is a frequently used free dataset in the computational
>      linguistics field; is uploading a Wikipedia dump to our archive sane?
> 
> I have no idea about the size of this kind of dump.  Recently I've read
> that data sets for other programs tend towards 1GB in size.  In
> Debian Med I'm maintaining metaphlan2-data with 204MB, which would be
> even larger if it did not use a "data reduction" method that other DDs
> consider a bug (#839925).

As Mattias Wadenstein pointed out (thanks for the data point), the
Wikipedia dump is large enough to challenge the .deb format (see the
recent threads).

>   2. Should we re-train the Free Models on buildd? This is crazy. Let's
>      not do that right now.
> 
> If you ask me, bothering buildd with this task is insane.  However, I'm
> positively convinced that we should ship the training data and be able
> to train the models from it.

It's always good if we can do these things purely within our archive.
However, sometimes that is just not easy to enforce: the datasets used by
deep learning are generally large (several hundred MB to several TB, or
even more).
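To make concrete what "training the model from the shipped data" would
mean in the simplest case, here is a toy sketch in Python. Everything in
it (the data, the model, the output path) is a made-up placeholder rather
than anything from a real package; an actual DL model would need far more
data and compute, which is exactly the problem.

    # Toy sketch: a build-time "training" step driven entirely by free data.
    # A real package would load a dataset shipped in the archive, e.g. from
    # /usr/share/<dataset-package>/, instead of generating synthetic data.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for a small free training set.
    X = rng.normal(size=(1000, 16))
    true_w = rng.normal(size=16)
    y = (X @ true_w + rng.normal(scale=0.1, size=1000) > 0).astype(float)

    # Plain logistic regression trained by gradient descent.  Deliberately
    # tiny: the point of this thread is that real DL training data and
    # training runs are far too large to handle this way on buildd.
    w = np.zeros(16)
    learning_rate = 0.1
    for _ in range(200):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # predicted probabilities
        w -= learning_rate * (X.T @ (p - y)) / len(y)

    # The binary package would then ship this pre-trained model.
    np.save("model.npy", w)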

