
Re: Training and large data sets?




On 28/09/2021 09:37, Wouter Verhelst wrote:

> I think if we are going to require maintainers to upload pretrained
> models, especially if those are models that can only be trained on
> nvidia GPUs, that we've essentially given up. That can't be the intent.
>
> We are finally at a point where *all* software in Debian stable was
> built on Debian hardware, most of it reproducibly. Yes, training data on
> Debian hardware is a hard problem to tackle; but saying "so let's not do
> it" is throwing in the towel, and we shouldn't do that, at least IMO.
>
> Even if not, I personally just do not have the infrastructure to train
> the model myself (I'd have to keep my laptop running for days on end,
> and, well, it's a laptop, so...), so if that's going to be required,
> it's going to be a no from me.

In $dayjob I'm working on a project that would allow users to train ML models on large EO datasets; think "ImageNet for Earth Observation".

(A standard labelled dataset, easily loaded into most toolkits, on which you can develop and test your own ML models.)
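
To give a flavour of "easily loaded": in PyTorch/torchvision terms, a directory-per-label layout is a couple of lines. The path and layout below are placeholders of mine, not the project's real structure:

    from torchvision import datasets, transforms

    # hypothetical on-disk layout: eo-benchmark/<class-label>/<tile>.png
    benchmark = datasets.ImageFolder("eo-benchmark",
                                     transform=transforms.ToTensor())
    image, label = benchmark[0]   # one labelled sample, ready for any model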

From there, the project progresses to enabling a user to train a model on "all" satellite data (PBs of imagery, at least) with associated ground-truth information.

This would be done on open datasets and open European infrastructure, allowing users to upload their own workflows: imagine uploading a training job in the form of a Singularity container holding a Debian instance with the packaged model, trained against PBs of "local storage" on a cluster, and producing a reasonably sized (MB) model.
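
As a sketch of what such an uploaded job could look like (the package and script names here are invented for illustration, not taken from the actual project):

    Bootstrap: docker
    From: debian:stable

    %post
        # pull the framework (and the packaged model) from the Debian archive;
        # python3-torch stands in for whatever the model actually needs
        apt-get update && apt-get install -y python3-torch

    %runscript
        # /data is the cluster's PB-scale storage, bind-mounted at run time;
        # /opt/train.py is a hypothetical training entry point in the image;
        # the output is the small (MB-scale) trained model
        exec python3 /opt/train.py --data /data --out /output/model.pt

which the cluster would run against the bind-mounted archive, roughly as: singularity run --bind /archive:/data,./out:/output train.sif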

Is this model "open" and available to be used in Debian? It's reproducible in all ways barring, possibly, cost and the "Desert Island" criterion.

The data archive itself couldn't be a Debian archive, as it's an extremely expensive, taxpayer-funded, open-ended archive (run by ESA).

It can, sort of, make the nVidia problem moot, depending on how the hardware is interfaced under tensorflow/keras/pytorch: the code works on smaller versions of the dataset "locally" (for development), while the hardware at cloud scale is something you're not going to own at home (shades of GNU's original issue of running on hardware that needed "3-phase power").
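
In other words, the packaged code itself can stay hardware-agnostic; a minimal sketch (the EO_DATA knob is my invention, not a real interface):

    import os
    import torch

    # The same packaged code runs locally on a small sample for development
    # and, unchanged, on the cluster against the full archive: only the data
    # path and the available device differ.
    data_root = os.environ.get("EO_DATA", "./sample")
    device = "cuda" if torch.cuda.is_available() else "cpu"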

regards

Alastair

--
Alastair McKinstry, <alastair@sceal.ie>, <mckinstry@debian.org>, https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.

