Re: openai-whisper_0~20230314-1_amd64.changes REJECTED
[M. Zhou]
> I'm 100% sure the .tiktoken files are vocabulary files summarized
> from some corpus. It's just a "word" -> "id" mapping in plain text.
Thanks.
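For what it's worth, the .tiktoken files commonly ship one base64-encoded
token and its integer rank per line, which matches the "word" -> "id"
description above. A minimal sketch of a parser, assuming that on-disk
layout (the function name is mine, not part of openai-whisper):

```python
import base64

def load_tiktoken_vocab(path):
    """Parse a .tiktoken file where each non-empty line is
    "<base64-encoded token bytes> <integer rank>"."""
    vocab = {}
    with open(path, "rb") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            token_b64, rank = line.split()
            vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab
```

Inspecting a file this way makes it easy to confirm that it really is
just a token-to-rank table and carries no executable code.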
> In order to reproduce a similar vocabulary list, I believe you can do
> it with a wikipedia dump. But I believe GPT2 was not trained on
> wikipedia dump, but a much larger corpus.
Do you have a recipe for creating such a vocabulary list?
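To make the question concrete: the simplest recipe I can imagine is to
rank tokens by frequency over a corpus and assign ids in that order. A
rough sketch of that idea (note this is only an illustration; the real
GPT-2/Whisper vocabularies are built with byte-pair encoding over a much
larger corpus, not plain word counts):

```python
from collections import Counter

def build_vocab(corpus_lines, max_size=50000):
    """Assign ids to whitespace-separated tokens, most frequent first.

    Illustrative only: real BPE training merges byte pairs iteratively
    rather than counting whole words like this.
    """
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split())
    # Most frequent token gets id 0, next gets id 1, and so on.
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(max_size))}
```

Whether a list rebuilt this way from a Wikipedia dump would be close
enough to the shipped one is exactly the open question.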
> That said, if openai-whisper is an inference-only package which does
> not provide training scripts and enough details about the training
> dataset, it should go non-free even if the tokenizers are crystal
> clear.
I do not know if such training scripts are present, as I do not know how
to recognize training scripts. If I knew how training was done, I could
grep the source for the relevant keywords, but I do not know how it is
done, and thus am a bit lost.
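In case it helps, PyTorch training code tends to use a few telltale
constructs (an optimizer, a backward pass, a data loader). A small
heuristic scan along those lines (the keyword list is my guess at what
to look for, not an authoritative test):

```python
import re
from pathlib import Path

# Constructs that typically only appear in PyTorch training code,
# not in inference-only code. This list is a heuristic, not exhaustive.
TRAIN_HINTS = re.compile(
    r"loss\.backward|optimizer\.step|torch\.optim|DataLoader"
)

def find_training_code(source_dir):
    """Return the .py files under source_dir that mention typical
    training constructs."""
    hits = []
    for path in Path(source_dir).rglob("*.py"):
        if TRAIN_HINTS.search(path.read_text(errors="ignore")):
            hits.append(path)
    return hits
```

Finding no hits would not prove the package is inference-only, but a
clean scan of the whisper source would at least support that reading.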
--
Happy hacking
Petter Reinholdtsen