
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



[M. Zhou]
> I'm 100% sure the .tiktoken files are vocabulary files summarized
> from some corpus. It's just a "word" -> "id" mapping in plain text.

Thanks.
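For reference, here is a minimal sketch of reading such a mapping, assuming the usual .tiktoken layout of one base64-encoded token followed by its integer rank per line (the function name is mine, not from the whisper source):

```python
import base64

def load_tiktoken_vocab(path):
    """Parse a .tiktoken file: each non-empty line holds a
    base64-encoded token and its integer rank (the "id"),
    separated by a space."""
    vocab = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank = line.split()
            vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab
```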

> In order to reproduce a similar vocabulary list, I believe you can do
> it with a wikipedia dump. But I believe GPT2 was not trained on a
> wikipedia dump, but on a much larger corpus.

Do you have a recipe for creating such a vocabulary list?
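As I understand it, the standard recipe is byte-pair encoding: start from single characters and repeatedly merge the most frequent adjacent pair until the vocabulary reaches the desired size. A toy sketch (not the actual GPT2 procedure, which works on bytes over a far larger corpus):

```python
from collections import Counter

def learn_bpe_vocab(corpus, num_merges):
    """Toy byte-pair-encoding vocabulary learner: repeatedly
    merge the most frequent adjacent symbol pair."""
    # Start from words split into single characters.
    words = Counter(tuple(w) for w in corpus.split())
    vocab = set(ch for w in words for ch in w)
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    # Assign ids in sorted order, giving a "word" -> "id" mapping.
    return {tok: i for i, tok in enumerate(sorted(vocab))}
```

Run over a wikipedia dump with tens of thousands of merges, this yields a vocabulary list of the same shape as the .tiktoken files, though not the same contents unless the corpus matches.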

> That said, if openai-whisper is an inference-only package which does
> not provide training scripts and enough details about the training
> dataset, it should go non-free even if the tokenizers are crystal
> clear.

I do not know if such training scripts are present, as I do not know
how to recognize training scripts.  If I knew how training was done, I
could grep the source for the relevant keywords, but I do not, and thus
am a bit lost.

-- 
Happy hacking
Petter Reinholdtsen

