On 29/04/2021 14:03, Albretch Mueller wrote:
>> What is "alpha-offset format"?
> we, corpora research kinds of folks, need to process thousands of
> files as other people process bytes. UTF8 was basically an
> Americanizierung of alle alphabets. UTF is great to describe an
> alphabet but not for text files.
>
> UTF8 turned all files into streams not good for questions such as
> what is the character/string sequence starting on the nth addressable
> unit of a file ...

Depends on what you mean by "addressable unit", surely? UTF8 is a
variable-length record format, but it's still addressable: you can
seek to any byte, you just can't assume a character starts there.

Basically, it's like taking a CSV file and asking "what are the
contents of the cell starting at byte 123?" CSV cells are variable
length. Perhaps no cell starts there. If you want to know the contents
of the cell which includes byte 123, then you need some context,
don't you?

>
> Doing that with UTF8 is from way too complicated to impossible. Also
> alpha offset nicely splits the files segments into its different
> parts: ALPHABETICAL text, js, css, ...

So, do you use something more like UTF-32?

>
> lbrtchx
>
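To make the "you need some context" point concrete, here is a rough Python sketch of what that context looks like for UTF-8: step back from the requested byte over any continuation bytes (those of the form 0b10xxxxxx) until you reach a lead byte, then read the length from the lead byte. The function names and the sample string are mine, purely for illustration, and the sketch assumes the input is already valid UTF-8. The UTF-32 helper is there only for contrast, since with a fixed four bytes per code point the nth character is a simple multiplication away.

```python
from typing import Tuple


def char_at_byte(data: bytes, offset: int) -> Tuple[int, str]:
    """Return (start, char) for the code point whose bytes cover `offset`.

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    if not 0 <= offset < len(data):
        raise IndexError("offset outside data")
    start = offset
    # Continuation bytes look like 0b10xxxxxx; walk back to the lead byte.
    while start > 0 and (data[start] & 0xC0) == 0x80:
        start -= 1
    lead = data[start]
    # The lead byte's high bits give the length of the sequence.
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("not a valid UTF-8 lead byte")
    return start, data[start:start + length].decode("utf-8")


def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the nth code point of UTF-32-LE data (no BOM): fixed 4-byte units."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "naïve café €5"
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")

print(char_at_byte(utf8, 3))      # byte 3 is the second byte of "ï"
print(nth_char_utf32(utf32, 11))  # 12th code point, found by arithmetic alone
```

Note that both helpers address code points, not what a reader would call a character: a combining accent or an emoji sequence still spans several code points, even in UTF-32.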