Re: openai-whisper_0~20230314-1_amd64.changes REJECTED
[M. Zhou]
> Generally speaking, a tokenizer is used to translate the user sentence
> (natural language) into a specific form of token sequence that the
> machine learning model could understand. Different models have
> different vocabularies and tokenization methods. There is nothing
> standard.
>
> More details are here:
> https://github.com/openai/tiktoken
I got that far. Also checked out the wikipedia page linked from there.
> The lengthy .tiktoken files look like sort of token vocabulary to me,
> but I don't have time to verify. GPT-2 is a large language
> model. gpt2.tiktoken cannot mean anything else than GPT-2's tokenizer.
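For what it is worth, the .tiktoken files can be inspected directly. As far as I can tell (a sketch, assuming the format is one token per line as base64-encoded bytes followed by an integer rank, which is how tiktoken's loader appears to read them):

```python
import base64

# A .tiktoken file appears to be plain text: one token per line, given as
# base64-encoded raw bytes followed by that token's integer rank (its merge
# priority). These sample lines are hand-made for illustration; the real
# gpt2.tiktoken would be parsed the same way, line by line.
sample = b"IQ== 0\nIg== 1\nAA== 2\n"

def parse_tiktoken(data: bytes) -> dict[bytes, int]:
    vocab = {}
    for line in data.splitlines():
        if not line:
            continue
        token_b64, rank = line.split()
        vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab

print(parse_tiktoken(sample))  # {b'!': 0, b'"': 1, b'\x00': 2}
```

So each file looks like a machine-generated mapping from byte sequences to ranks, not something a human would write by hand.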
The question is really: how was this token vocabulary created? Was it
manually edited by someone, or generated from some source? Who created
the GPT-2 tokenizer, and how?
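My understanding is that byte-pair-encoding vocabularies of this kind are
normally learned from a training corpus by an iterative merge procedure
rather than edited by hand, though that does not by itself identify the
corpus or the author. A toy sketch of the usual merge-learning loop (the
word list and merge count here are made up for illustration):

```python
from collections import Counter

# Toy byte-pair-encoding training: repeatedly find the most frequent
# adjacent symbol pair across the corpus and merge it into a new symbol.
# The ordered list of merges is what a BPE vocabulary file records.
def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start with each word as a sequence of single-character symbols.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere in the corpus.
        new_corpus = []
        for word in corpus:
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges

print(learn_bpe(["lower", "lowest", "low"], 2))  # [('l', 'o'), ('lo', 'w')]
```

If gpt2.tiktoken was produced this way, the remaining licensing question
would be what corpus it was trained on and under what terms.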
--
Happy hacking
Petter Reinholdtsen