
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



[M. Zhou]
> Generally speaking, a tokenizer is used to translate the user sentence
> (natural language) into a specific form of token sequence that the
> machine learning model could understand.  Different models have
> different vocabularies and tokenization methods. There is nothing
> standard.
>
> More details are here:
> https://github.com/openai/tiktoken

I got that far.  I also checked out the Wikipedia page linked from there.
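
If I read the README right, using it boils down to something like this
(a sketch I have not actually run; the sample text is mine):

  import tiktoken

  enc = tiktoken.get_encoding("gpt2")    # load GPT-2's BPE vocabulary
  tokens = enc.encode("Happy hacking")   # text -> list of integer token ids
  text = enc.decode(tokens)              # token ids -> back to the text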

> The lengthy .tiktoken files look like some sort of token vocabulary to
> me, but I don't have time to verify.  GPT-2 is a large language
> model; gpt2.tiktoken cannot mean anything other than GPT-2's tokenizer.

The question is really: how is this token vocabulary created?  Manually
edited by someone, or generated from some source?  Who created the GPT-2
tokenizer, and how?
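
For what it is worth, my understanding of what such a file contains,
based on the tiktoken source and not verified against the file in the
rejected upload, is that each line is a base64-encoded token followed by
its integer rank.  A sketch of a parser (the helper name and file path
are my own inventions):

  import base64

  def read_tiktoken_vocab(path):
      # Parse a .tiktoken file into a {token_bytes: rank} mapping,
      # assuming one "<base64 token> <rank>" pair per line.
      vocab = {}
      with open(path, "rb") as fp:
          for line in fp:
              if not line.strip():
                  continue
              token_b64, rank = line.split()
              vocab[base64.b64decode(token_b64)] = int(rank)
      return vocab

  # Hypothetical usage:
  #   vocab = read_tiktoken_vocab("gpt2.tiktoken")
  #   len(vocab)   # on the order of 50000 entries for GPT-2

That still leaves open whether those ranks were produced by running BPE
training on some corpus or written by hand, which is really what I am
asking about.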

-- 
Happy hacking
Petter Reinholdtsen

