Re: openai-whisper_0~20230314-1_amd64.changes REJECTED

To: Debian Deep Learning Team <debian-ai@lists.debian.org>
Subject: Re: openai-whisper_0~20230314-1_amd64.changes REJECTED
From: Petter Reinholdtsen <pere@hungry.com>
Date: Tue, 06 Jun 2023 20:29:33 +0200
Message-id: <sa6a5xcla0i.fsf@hjemme.reinholdtsen.name>
In-reply-to: <E1q6ayY-00AIcy-4h@fasolo.debian.org>
References: <E1q6ayY-00AIcy-4h@fasolo.debian.org>

Only replying to the list for now, in the hope that someone can help
answer this question from the ftpmasters.

[Thorsten Alteholz]
> can you please explain how I can recreate the files *.tiktoken?
> There seem to be some sources missing ...

I do not know much about the inner workings of whisper, so I do not
really know the answer to this question.  As far as I can tell, the
files in question are whisper/assets/gpt2.tiktoken and
whisper/assets/multilingual.tiktoken in the source.  I believe these are
loaded by get_tokenizer() in whisper/tokenizer.py, and that they are
tiktoken tokenizer rule files.  They are ASCII files starting like this:

IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
Jw== 6
KA== 7
KQ== 8
Kg== 9

I have no idea what tiktoken tokenizer rule files really are, nor how
they are created.  Their size makes me suspect they are generated, as
they consist of over 50k lines following the structure shown.
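
The structure can at least be inspected.  Here is a minimal sketch,
assuming each line is a base64-encoded byte string followed by an
integer rank.  That reading is an assumption based on the sample above,
and parse_tiktoken_lines() is a made-up helper name, not something from
whisper or tiktoken:

```python
import base64

# First ten lines of the file, copied from the sample above.
SAMPLE = """\
IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
Jw== 6
KA== 7
KQ== 8
Kg== 9
"""


def parse_tiktoken_lines(text):
    """Parse '<base64> <int>' lines into a {bytes: int} mapping.

    Assumption: the first field is base64-encoded token bytes and the
    second field is that token's integer rank.
    """
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


ranks = parse_tiktoken_lines(SAMPLE)
for token, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(rank, token)
```

If that reading is right, the first ten entries are just the single
ASCII characters "!" through "*".  This only describes the encoding of
the lines; it does not say how the full list was produced.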

Does anyone understand more about tiktoken and can shed some light on
the topic?

-- 
Happy hacking
Petter Reinholdtsen

Follow-Ups:
- Re: openai-whisper_0~20230314-1_amd64.changes REJECTED
  - From: "M. Zhou" <lumin@debian.org>

References:
- openai-whisper_0~20230314-1_amd64.changes REJECTED
  - From: Thorsten Alteholz <ftpmaster@ftp-master.debian.org>
