
Re: Concerns to software freedom when packaging deep-learning based applications.



Russ Allbery writes ("Re: Concerns to software freedom when packaging deep-learning based applications."):
> I'm not sure I disagree, but I think it's worth poking at this a bit and
> seeing if it holds up in extension by analogy to other precomputed data.

Perhaps.

> Suppose that NASA releases a database of features of astronomical objects,
> described as a bunch of numeric parameters.  Suppose that database is
> released in the public domain or some other obviously free license
> allowing use, derivative works, and everything else we expect.  However,
> suppose that data is derived (using significant computational resources
> and analysis code) from huge databases of observational data.  Suppose
> that observational data is *largely* released, but it's not at all clear
> what pieces were used to build the database of features, and the code to
> do so wasn't released alongside the database.
> 
> Is that database DFSG-compatible and something we could package?

Yes, because it's an estimate of facts which are directly modifiable.

If someone were to decide that the magnitude (brightness) of some
star, or the orbital elements of some rock, were wrong in the
database, no-one would suggest fixing this by figuring out which
original telescope observations were wrong, or which scientific
papers about the analysis of that data were wrong (if any), fixing
those, and then rerunning the whole analysis from start to finish.

Sure, scientists might do the analysis to discover why the database
had a wrong answer (presumably after having measured the same thing
again and got a different answer), but the outcome would be published
in a scientific paper.

The way one would be expected to make use of it would be to copy the
value from that new paper into one's own data tables.  Presumably
NASA would update their database, but if they didn't, anyone (Debian
or a user) could do so in their own copy.
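
To make that concrete, here is a minimal sketch in Python, assuming a
hypothetical CSV catalogue ("star_catalogue.csv") and a hypothetical
star designation.  "Modifying the source" is nothing more than
editing a value in one's local copy of the table:

    import csv

    # Load a local copy of the (hypothetical) catalogue.
    with open("star_catalogue.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Correct one magnitude, using the value from the newer paper.
    for row in rows:
        if row["designation"] == "HD 12345":  # hypothetical entry
            row["magnitude"] = "6.02"         # value from the new paper

    # Write the corrected table back out.
    with open("star_catalogue.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

No privileged tooling from NASA is needed for any of that.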

> Similar analogies may come up with databases about genome sequences.  For
> a lot of scientific data, reproducing a result data set is not trivial and
> the concept of "source" is pretty murky.

Pre-trained neural networks are not conclusions of (proper) scientific
research and are not amenable to being updated the same way.


Taking a step back: the point of this exercise is to preserve user
freedom.  That is, a user should be able to make their computer serve
their interests, and should not be subordinated to upstreams (nor to
Debian, nor to one of our derivatives).

A user who uses NASA's tables for astronomical data is not
subordinated to NASA.  If for any reason they don't agree with NASA's
views on the orbits of planets or whatever, then they can put in their
own figures.

(And as for the processing: suppose we have someone who can book time
on a massive institutional telescope, or launch their own spacecraft,
so that they can make observations to replace those which fed into the
analysis done by the scientific community to make NASA's tables.  Such
a person (institution!) is not likely to be overly troubled by the
fact that in Debian the tables themselves are treated as the source
code.)


Compare neural networks: a user who uses a pre-trained neural network
is subordinated to the people who prepared its training data and set
up the training runs.

If the user does not like the results given by the neural network, it
is not sensibly possible to diagnose and remedy the problem by
modifying the weighting tables directly.  The user is rendered
helpless.
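
For contrast, here is a minimal sketch (assuming a hypothetical
weights file, "pretrained.npz") of what the user actually holds when
they "have" a pre-trained network:

    import numpy as np

    # Load the distributed weights (hypothetical file name).
    weights = np.load("pretrained.npz")

    # All the user can inspect: tensors of floats, named by layer.
    for name in weights.files:
        print(name, weights[name].shape)
    # e.g.  dense_0/kernel (4096, 4096)
    # Millions of numbers, none of which individually corresponds to
    # any behaviour the user might want to change.

There is no value in there that can be looked up in a newer paper and
copied in.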

If the training data and training software are not provided, they
cannot retrain the network even if they choose to buy or rent the
hardware.

There is also the issue that a table of astronomical measurements is
much less likely to embed problematic assumptions.  I have yet to
encounter reports of significant racial bias in asteroid catalogues.
(Barring naming, I guess.)

Neural networks and other kinds of data-mining-style machine learning
have absolutely terrible problems with all kinds of awful biases.


Ian.

-- 
Ian Jackson <ijackson@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.

