Re: Concerns regarding the "Open Source AI Definition" 1.0-RC2

To: "M. Zhou" <lumin@debian.org>
Cc: Sean Whitton <spwhitton@spwhitton.name>, debian-project@lists.debian.org
Subject: Re: Concerns regarding the "Open Source AI Definition" 1.0-RC2
From: Sam Johnston <samj@samj.net>
Date: Sat, 25 Jan 2025 17:08:59 +0100
Message-id: <CAKTR03_hbHpsD-wf+yUU8cEJoF=FmREt0wyz5yrfsyozzWVOHw@mail.gmail.com>
In-reply-to: <f2fdc7407975c76a566d2502bdb85560760419c9.camel@debian.org>
References: <CA+f80t5Wh72we9A+t72BWSXq9WH7O5vn03Z5FpoxtCp_QKD2-w@mail.gmail.com> <cd87d2cb-4a4d-4ea3-8e5d-187e9e1ee0de@debian.org> <87tt9nw0wm.fsf@zephyr.silentflame.com> <f2fdc7407975c76a566d2502bdb85560760419c9.camel@debian.org>

On Sat, 25 Jan 2025 at 16:24, M. Zhou <lumin@debian.org> wrote:
>
> On Sat, 2025-01-25 at 12:09 +0000, Sean Whitton wrote:
> >
> > Wondered if you'd had another chance to look at this.
>
> Ummm... You know what may happen when there is no deadline.

The best time to do this was last year around the OSAID 1.0 release.
The next best time is now. Do you need our help?

I'm working on an article about how the chickens have come home to
roost with the VLC demo at CES 2025. With VLC advertising, and users
now expecting, real-time AI subtitling that "appears to be built
directly into the VLC app"[1], we have a situation where VLC is
considered Open Source under the OSD, but NOT under the OSAID or
according to OSI leadership[2], because Whisper is embedded. With more
and more software being written by and incorporating AI, this situation
is untenable. Distros like Debian would have to lobotomise popular apps
like VLC, or accept more binary blobs.

The OSI also just released a whitepaper[3] that further deliberately
obfuscates the issue, prompting me to post this:

The Open Source Initiative (OSI) goes to the effort of defining four
classes of data *source* (hence the term!) in their Open Source AI
Definition (OSAID) FAQ, and again under the Open Future Foundation’s
name in this new paper, only to then accept ANY of them… or NONE at all:

- OPEN data under open licenses, which is the ONLY class that has any
role in Open Source AI
- PUBLIC data like Common Crawl Foundation dumps of the Internet,
which are routinely ab/used without creators’ consent
- OBTAINABLE data “including for a fee” like The New York Times
articles and Adobe/Getty Images stock photos, which are guaranteed to
get end users sued (though not necessarily vendors, given limited
liability clauses)
- UNSHAREABLE NONPUBLIC data that obviously has no place in Open
Source, like Facebook & Instagram feeds

With the meaning of Open Source AI being defined solely by the LOWEST
bar — no data delivered at all (which is allowed under the OSAID) —
why bother with the smokescreen if not to deliberately deceive us
users? An honest FAQ entry would have read like this:

What kind of data should be required in the Open Source AI Definition?
None.

1. https://www.theverge.com/2025/1/9/24339817/vlc-player-automatic-ai-subtitling-translation
2. https://hackaday.com/2025/01/15/floss-weekly-episode-816-open-source-ai/
3. https://openfuture.eu/publication/data-governance-in-open-source-ai/
