
Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"



On Tue, 21 May 2019, Andreas Tille wrote:

Quoting from your section "Questions Not Easy to Answer"


 1. Must the dataset used to train a Free Model be present in our archive?
    A Wikipedia dump is a frequently used free dataset in the computational
    linguistics field; is uploading a Wikipedia dump to our archive sane?

I have no idea about the size of this kind of dump.

The current size of the Wikimedia dumps is 18 TB, but that includes several versions of the data (five dated versions are shipped for most dumps), etc. As a sample, I think this[1] is the English pages' main text (not history or metadata), which is 15 GB compressed.

1) https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20190501/enwiki-20190501-pages-articles.xml.bz2
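For reference, a quick way to check the compressed size without downloading the whole file is to issue a HEAD request and read the Content-Length header. A minimal Python sketch, assuming the mirror returns Content-Length for plain files (the URL is the one from [1]):

    # Minimal sketch: query the compressed size of the enwiki dump via a
    # HEAD request. Assumes the mirror sends a Content-Length header.
    import urllib.request

    URL = ("https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/"
           "20190501/enwiki-20190501-pages-articles.xml.bz2")

    req = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size = int(resp.headers["Content-Length"])

    print(f"{size / 1024**3:.1f} GiB compressed")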

/Mattias Wadenstein, mirror admin who also mirrors the Wikimedia dumps

