
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



On Wed, 2023-06-21 at 17:30 +0200, Petter Reinholdtsen wrote:
> [Thorsten Alteholz]
> > Hi Petter,
> > 
> > can you please explain how I can recreate the files *.tiktoken?
> > There seem to be some sources missing ...
> 
> Thank you for the feedback.  I had not noticed those files.
> 
> I have no idea how they are created, and will have to ask for help to
> figure out.  I agree that the 50k line text files look like they are
> generated from something.
> 
> If I can not work out the process used to build these and where they are
> derived from, is non-free a better fit for the package?
> 

I'm 100% sure the .tiktoken files are vocabulary files summarized
from some corpus. They are just a "word" -> "id" mapping in plain text.

Each word is encoded in base64. For example, the last line,
IGdhemVk, is the base64 encoding of " gazed" (with a leading space),
which maps to ID 50255. Mapping a user's natural-language sentence to
a sequence of token IDs is exactly what a tokenizer does for a
language model; this has been a common preprocessing step for decades.
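
For illustration only (this is not part of the package sources), here
is roughly how the tiktoken library uses such a vocabulary at
inference time; the encoding name "gpt2" and the sample sentence are
my own assumptions:

  import tiktoken

  # Load the GPT-2 byte-pair encoding, which is backed by exactly this
  # kind of base64 "token -> id" vocabulary file.
  enc = tiktoken.get_encoding("gpt2")

  ids = enc.encode("The cat gazed out of the window.")
  print(ids)              # a list of integer token IDs
  print(enc.decode(ids))  # round-trips back to the original sentence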

Base64 encoding is used here to keep the file easy to parse,
especially when tokens contain unusual symbols, whitespace, etc.
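
As a quick check, one can inspect such a file with a few lines of
Python. The file name below is a placeholder for whichever *.tiktoken
file ships in the source tree:

  import base64

  # Placeholder path; each line is "<base64 token> <integer id>".
  vocab = {}
  with open("gpt2.tiktoken") as f:
      for line in f:
          token_b64, rank = line.split()
          vocab[base64.b64decode(token_b64)] = int(rank)

  # The example from above: "IGdhemVk" is " gazed", leading space included.
  print(base64.b64decode("IGdhemVk"))  # b' gazed'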

To reproduce a similar vocabulary list, I believe you could do it
with a Wikipedia dump. But I believe GPT-2 was not trained on a
Wikipedia dump, but on a much larger corpus.
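
Just to illustrate what "reproducing" would involve, here is a rough
sketch of training a fresh byte-level BPE vocabulary from a text dump
with the Hugging Face tokenizers library. This is my assumption of a
comparable process, not the tool OpenAI actually used, and
wikipedia.txt is a placeholder corpus:

  from tokenizers import Tokenizer, models, pre_tokenizers, trainers

  # Byte-level BPE, the same family of tokenizer GPT-2 uses.
  tokenizer = Tokenizer(models.BPE())
  tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

  trainer = trainers.BpeTrainer(vocab_size=50257,
                                special_tokens=["<|endoftext|>"])
  tokenizer.train(["wikipedia.txt"], trainer)  # placeholder corpus file
  # The result is saved as JSON and would still need converting to the
  # plain-text base64 format used by the *.tiktoken files.
  tokenizer.save("reproduced-vocab.json")

Even with such a script, the resulting vocabulary would differ from
the shipped one unless the original training corpus is available.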

That said, if openai-whisper is an inference-only package which does
not provide training scripts and enough detail about the training
dataset, it should go to non-free even if the tokenizer files are
crystal clear.

I think this case is mentioned in the ML-Policy, for reference.

