
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



I am replying only to the list for now, in the hope that someone can
help answer this question from the ftpmasters.

[Thorsten Alteholz]
> can you please explain how I can recreate the files *.tiktoken?
> There seem to be some sources missing ...

I do not know much about the inner workings of whisper, so I do not
really know the answer to this question.  As far as I can tell, the
files in question are whisper/assets/gpt2.tiktoken and
whisper/assets/multilingual.tiktoken in the source.  I believe these are
loaded by get_tokenizer() in whisper/tokenizer.py, and that they are
tiktoken tokenizer rule files.  They are ASCII files starting like this:

IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
Jw== 6
KA== 7
KQ== 8
Kg== 9

I have no idea what tiktoken tokenizer rule files really are, nor how
they are created.  Their size makes me suspect they are generated, as
they consist of over 50k lines following the structure shown.
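
For what it is worth, the first column looks like base64.  Below is a
minimal sketch for inspecting the lines, assuming each one is a
base64-encoded token followed by an integer.  The helper name
parse_tiktoken_lines is made up for illustration and is not part of
whisper or tiktoken, and the sample is just the ten lines quoted above.

```python
import base64

# The ten lines quoted above, copied verbatim.
SAMPLE = """\
IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
Jw== 6
KA== 7
KQ== 8
Kg== 9
"""


def parse_tiktoken_lines(text):
    """Map decoded token bytes to their integer value.

    Hypothetical helper: assumes every non-empty line has the form
    '<base64 token> <integer>'.
    """
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


ranks = parse_tiktoken_lines(SAMPLE)
for token, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(rank, token)
```

The ten sample lines decode to the ASCII characters ! " # $ % & ' ( ) *
in that order, which shows how to read the files but not how the full
list was produced.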

Does anyone understand more about tiktoken and can shed some light on
the topic?

-- 
Happy hacking
Petter Reinholdtsen

