
Re: Training and large data sets?




On 28/09/2021 09:37, Wouter Verhelst wrote:

> I think if we are going to require maintainers to upload pretrained
> models, especially if those are models that can only be trained on
> nvidia GPUs, that we've essentially given up. That can't be the intent.
>
> We are finally at a point where *all* software in Debian stable was
> built on Debian hardware, most of it reproducibly. Yes, training data on
> Debian hardware is a hard problem to tackle; but saying "so let's not do
> it" is throwing in the towel, and we shouldn't do that, at least IMO.
>
> Even if not, I personally just do not have the infrastructure to train
> the model myself (I'd have to keep my laptop running for days on end,
> and, well, it's a laptop, so...), so if that's going to be required,
> it's going to be a no from me.

In $dayjob I'm working on a project that would allow users to train ML models on large EO datasets; think "ImageNet for Earth Observation".

(A standard labelled dataset, easily loaded into most toolkits, on which you can develop and test your own ML models.)
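
To give a flavour of "easily loaded": in PyTorch/torchvision terms, a directory-per-label layout is a couple of lines. The path and layout below are placeholders of mine, not the project's real structure:

    from torchvision import datasets, transforms

    # hypothetical on-disk layout: eo-benchmark/<class-label>/<tile>.png
    benchmark = datasets.ImageFolder("eo-benchmark",
                                     transform=transforms.ToTensor())
    image, label = benchmark[0]   # one labelled sample, ready for any model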

From there, the project progresses to enabling a user to train a model on "all" satellite data (PBs of imagery, at least) with associated ground-truth information.

This would be done on open datasets and open European infrastructure, allowing users to upload their own workflows: imagine uploading a training job in the form of a Singularity container holding a Debian instance with the packaged model, trained against PBs of "local storage" on a cluster, and producing a reasonably sized (MB) model.
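
As a sketch of what such an uploaded job could look like (the package and script names here are invented for illustration, not taken from the actual project):

    Bootstrap: docker
    From: debian:stable

    %post
        # pull the framework (and the packaged model) from the Debian archive;
        # python3-torch stands in for whatever the model actually needs
        apt-get update && apt-get install -y python3-torch

    %runscript
        # /data is the cluster's PB-scale storage, bind-mounted at run time;
        # /opt/train.py is a hypothetical training entry point in the image;
        # the output is the small (MB-scale) trained model
        exec python3 /opt/train.py --data /data --out /output/model.pt

which the cluster would run against the bind-mounted archive, roughly as: singularity run --bind /archive:/data,./out:/output train.sif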

Is this model "open" and available to be used in Debian? It's reproducible in all ways barring, possibly, cost and the "Desert Island" criterion.

The data archive itself couldn't be a Debian archive, as it's an extremely expensive, taxpayer-funded, open-ended archive (run by ESA).

It can, sort of, make the nVidia problem moot, depending on how the hardware is interfaced under tensorflow/keras/pytorch: the code works on smaller versions of the dataset "locally" (for development), while the hardware at cloud scale is something you're not going to own at home (shades of GNU's original issue of running on hardware that needed "3-phase power").
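
In other words, the packaged code itself can stay hardware-agnostic; a minimal sketch (the EO_DATA knob is my invention, not a real interface):

    import os
    import torch

    # The same packaged code runs locally on a small sample for development
    # and, unchanged, on the cluster against the full archive: only the data
    # path and the available device differ.
    data_root = os.environ.get("EO_DATA", "./sample")
    device = "cuda" if torch.cuda.is_available() else "cpu"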

regards

Alastair

--
Alastair McKinstry, <alastair@sceal.ie>, <mckinstry@debian.org>, https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.

