
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



On Wed, 2023-06-21 at 17:30 +0200, Petter Reinholdtsen wrote:
> [Thorsten Alteholz]
> > Hi Petter,
> > 
> > can you please explain how I can recreate the files *.tiktoken?
> > There seem to be some sources missing ...
> 
> Thank you for the feedback.  I had not noticed those files.
> 
> I have no idea how they are created, and will have to ask for help to
> figure out.  I agree that the 50k line text files look like they are
> generated from something.
> 
> If I can not work out the process used to build these and where they are
> derived from, is non-free a better fit for the package?
> 

I'm 100% sure the .tiktoken files are vocabulary files summarized
from some corpus. They are just a "word" -> "id" mapping in plain text.

Each word is encoded in base64. For example, the last line,
IGdhemVk, is the base64 encoding of " gazed" (with a leading space),
which maps to ID 50255. Mapping a user's natural-language sentence to
a sequence of token IDs is exactly what a tokenizer does for a
language model; this has been a common preprocessing step for decades.
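
For illustration only (this is not part of the package sources), here
is roughly how the tiktoken library uses such a vocabulary at
inference time; the encoding name "gpt2" and the sample sentence are
my own assumptions:

  import tiktoken

  # Load the GPT-2 byte-pair encoding, which is backed by exactly this
  # kind of base64 "token -> id" vocabulary file.
  enc = tiktoken.get_encoding("gpt2")

  ids = enc.encode("The cat gazed out of the window.")
  print(ids)              # a list of integer token IDs
  print(enc.decode(ids))  # round-trips back to the original sentence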

Base64 encoding is used here to keep the file easy to parse,
especially when tokens contain unusual symbols, whitespace, etc.
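
As a quick check, one can inspect such a file with a few lines of
Python. The file name below is a placeholder for whichever *.tiktoken
file ships in the source tree:

  import base64

  # Placeholder path; each line is "<base64 token> <integer id>".
  vocab = {}
  with open("gpt2.tiktoken") as f:
      for line in f:
          token_b64, rank = line.split()
          vocab[base64.b64decode(token_b64)] = int(rank)

  # The example from above: "IGdhemVk" is " gazed", leading space included.
  print(base64.b64decode("IGdhemVk"))  # b' gazed'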

To reproduce a similar vocabulary list, I believe you could do it
with a Wikipedia dump. But I believe GPT-2 was not trained on a
Wikipedia dump, but on a much larger corpus.
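
Just to illustrate what "reproducing" would involve, here is a rough
sketch of training a fresh byte-level BPE vocabulary from a text dump
with the Hugging Face tokenizers library. This is my assumption of a
comparable process, not the tool OpenAI actually used, and
wikipedia.txt is a placeholder corpus:

  from tokenizers import Tokenizer, models, pre_tokenizers, trainers

  # Byte-level BPE, the same family of tokenizer GPT-2 uses.
  tokenizer = Tokenizer(models.BPE())
  tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

  trainer = trainers.BpeTrainer(vocab_size=50257,
                                special_tokens=["<|endoftext|>"])
  tokenizer.train(["wikipedia.txt"], trainer)  # placeholder corpus file
  # The result is saved as JSON and would still need converting to the
  # plain-text base64 format used by the *.tiktoken files.
  tokenizer.save("reproduced-vocab.json")

Even with such a script, the resulting vocabulary would differ from
the shipped one unless the original training corpus is available.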

That said, if openai-whisper is an inference-only package which does
not provide training scripts and enough detail about the training
dataset, it should go to non-free even if the tokenizer files are
crystal clear.

I think this case is mentioned in the ML-Policy, for reference.

