Re: openai-whisper_0~20230314-1_amd64.changes REJECTED
On Wed, 2023-06-21 at 17:30 +0200, Petter Reinholdtsen wrote:
> [Thorsten Alteholz]
> > Hi Petter,
> >
> > can you please explain how I can recreate the files *.tiktoken?
> > There seem to be some sources missing ...
>
> Thank you for the feedback. I had not noticed those files.
>
> I have no idea how they are created, and will have to ask for help to
> figure out. I agree that the 50k line text files look like they are
> generated from something.
>
> If I can not work out the process used to build these and where they are
> derived from, is non-free a better fit for the package?
>
I'm sure the .tiktoken files are vocabulary files derived from
some corpus. Each one is just a "word" -> "id" mapping in plain text.
Each word is encoded in base64. For example, the last line,
IGdhemVk, is the base64 encoding of " gazed" (note the leading
space, which marks a word boundary), and it maps to ID 50255.
Mapping a user's natural-language sentence to a sequence of token
IDs is exactly what a tokenizer does for a language model; this
has been a common preprocessing step for decades.
Base64 encoding is used to keep file parsing simple, especially for
tokens that contain whitespace or unusual symbols.
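To make the format concrete, here is a minimal sketch of how one line of such a file can be decoded with the Python standard library. The example line is the one quoted above; the surrounding format (base64 token, a space, then the integer ID) is my reading of the file, not something defined by the package.

```python
import base64

# Hypothetical example line from a .tiktoken vocabulary file:
# a base64-encoded token, a space, then its integer token ID.
line = "IGdhemVk 50255"

b64_token, token_id = line.split()
token = base64.b64decode(b64_token)

print(repr(token), int(token_id))  # b' gazed' 50255
```

Note that decoding recovers the leading space, which would be awkward to store unescaped in a whitespace-separated text file; that is presumably why base64 is used.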
To reproduce a similar vocabulary list, I believe you could use a
Wikipedia dump, though GPT-2 itself was trained not on a Wikipedia
dump but on a much larger corpus.
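For reference, GPT-2-style vocabularies are built by byte-pair encoding (BPE): repeatedly count the most frequent adjacent symbol pair in the corpus and merge it into a new vocabulary entry. A toy sketch of a single merge step, over an invented three-word corpus, might look like this:

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = ["low", "lower", "lowest"]

# Count every adjacent symbol pair across all words.
pairs = Counter()
for word in corpus:
    symbols = list(word)
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += 1

# The most frequent pair becomes the next merged vocabulary entry.
best_pair, count = pairs.most_common(1)[0]
print(best_pair, count)
```

Repeating this until a target vocabulary size is reached yields the token list; the IDs are then just each token's rank. This is a sketch of the general technique, not the exact procedure OpenAI used.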
That said, if openai-whisper is an inference-only package that
provides neither training scripts nor enough detail about the
training dataset, it should go to non-free even if the tokenizers
are crystal clear. I think the ML-Policy mentions this case.