
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



[M. Zhou]
> Generally speaking, a tokenizer is used to translate the user sentence
> (natural language) into a specific form of token sequence that the
> machine learning model could understand.  Different models have
> different vocabularies and tokenization methods. There is nothing
> standard.
>
> More details are here:
> https://github.com/openai/tiktoken

I got that far.  I also checked out the Wikipedia page linked from there.
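
If I read the README right, using it boils down to something like this
(a sketch I have not actually run; the sample text is mine):

  import tiktoken

  enc = tiktoken.get_encoding("gpt2")    # load GPT-2's BPE vocabulary
  tokens = enc.encode("Happy hacking")   # text -> list of integer token ids
  text = enc.decode(tokens)              # token ids -> back to the text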

> The lengthy .tiktoken files look like some sort of token vocabulary to
> me, but I don't have time to verify.  GPT-2 is a large language
> model; gpt2.tiktoken cannot mean anything other than GPT-2's tokenizer.

The question is really: how is this token vocabulary created?  Manually
edited by someone, or generated from some source?  Who created the GPT-2
tokenizer, and how?
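
For what it is worth, my understanding of what such a file contains,
based on the tiktoken source and not verified against the file in the
rejected upload, is that each line is a base64-encoded token followed by
its integer rank.  A sketch of a parser (the helper name and file path
are my own inventions):

  import base64

  def read_tiktoken_vocab(path):
      # Parse a .tiktoken file into a {token_bytes: rank} mapping,
      # assuming one "<base64 token> <rank>" pair per line.
      vocab = {}
      with open(path, "rb") as fp:
          for line in fp:
              if not line.strip():
                  continue
              token_b64, rank = line.split()
              vocab[base64.b64decode(token_b64)] = int(rank)
      return vocab

  # Hypothetical usage:
  #   vocab = read_tiktoken_vocab("gpt2.tiktoken")
  #   len(vocab)   # on the order of 50000 entries for GPT-2

That still leaves open whether those ranks were produced by running BPE
training on some corpus or written by hand, which is really what I am
asking about.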

-- 
Happy hacking
Petter Reinholdtsen

