
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



[M. Zhou]
> I'm 100% sure the .tiktoken files are vocabulary files summarized
> from some corpus. It's just a "word" -> "id" mapping in plain text.

Thanks.
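For reference, here is a minimal sketch of reading such a mapping, assuming the usual .tiktoken layout of one base64-encoded token followed by its integer rank per line (the function name is mine, not from the whisper source):

```python
import base64

def load_tiktoken_vocab(path):
    """Parse a .tiktoken file: each non-empty line holds a
    base64-encoded token and its integer rank (the "id"),
    separated by a space."""
    vocab = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank = line.split()
            vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab
```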

> In order to reproduce a similar vocabulary list, I believe you can do
> it with a wikipedia dump. But I believe GPT2 was not trained on a
> wikipedia dump, but on a much larger corpus.

Do you have a recipe for creating such a vocabulary list?
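As I understand it, the standard recipe is byte-pair encoding: start from single characters and repeatedly merge the most frequent adjacent pair until the vocabulary reaches the desired size. A toy sketch (not the actual GPT2 procedure, which works on bytes over a far larger corpus):

```python
from collections import Counter

def learn_bpe_vocab(corpus, num_merges):
    """Toy byte-pair-encoding vocabulary learner: repeatedly
    merge the most frequent adjacent symbol pair."""
    # Start from words split into single characters.
    words = Counter(tuple(w) for w in corpus.split())
    vocab = set(ch for w in words for ch in w)
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    # Assign ids in sorted order, giving a "word" -> "id" mapping.
    return {tok: i for i, tok in enumerate(sorted(vocab))}
```

Run over a wikipedia dump with tens of thousands of merges, this yields a vocabulary list of the same shape as the .tiktoken files, though not the same contents unless the corpus matches.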

> That said, if openai-whisper is an inference-only package which does
> not provide training scripts and enough details about the training
> dataset, it should go non-free even if the tokenizers are crystal
> clear.

I do not know if such training scripts are present, as I do not know
how to recognize training scripts.  If I knew how training was done, I
could grep the source for the relevant keywords, but I do not, and thus
am a bit lost.

-- 
Happy hacking
Petter Reinholdtsen

