On Tuesday 07 December 2004 10:40 am, Richard Atterer wrote:
> No, you do not have to do this. You can keep working with "char", the
> changes when switching to UTF-8 will mostly have to deal with the fact that
> one Unicode character is represented by more than one char. This means that
> you need to use a different strlen function, take care only to chop strings
> of char at character boundaries, ensure that input strings are actually
> valid UTF-8, etc.

  This might work for programs that relatively blindly manipulate character
strings and can pass them off to the terminal for processing. In fact,
aptitude does a *lot* of processing and formatting of strings internally.
That means, for instance: splitting strings into words and paragraphs,
truncating strings, finding out how wide strings are. More importantly, it
also makes significant (and increasing) use of strings annotated with the
terminal attributes of each character (think colors, bold/reverse video,
etc). Needless to say, it performs all of the above operations on those
strings as well.

  All of these are impacted by extended charsets: for instance, you need to
use a different function to find whitespace, and combining characters with
their attributes requires the use of a structure where an integer previously
sufficed. That's not to mention finding the length of a string, which is
necessary to perform most types of layout.

  The changes that are necessary are at least:

  - At a minimum, the class used for formatted strings will have to be
    re-targeted to support either formatted wide strings or formatted utf8
    strings.

  - If wide characters are not used internally, it is also necessary to
    audit every occurrence of s.size() and check whether the
    length-in-memory or the length-in-characters of the string is being
    queried.

  - If neither wide characters nor a utf8-specialized basic_string are
    used, it is necessary to audit every string constructor (which might
    cut a substring) and make sure that it doesn't play havoc with utf8
    codings.

  - Every use of isspace() and friends will have to be replaced with
    Unicode-aware equivalents.

  And that's just the problems I can think of off the top of my head. It's
also necessary to use a completely different set of terminal i/o routines,
but this is pretty much expected.

  None of these problems are insurmountable, of course, and I know pretty
much how to solve most of them. However, it's also true that none of them
exist *at all* when using iso-8859-1, which is why I object to the comment
that it's no harder to handle utf8 than iso-8859-1. (In fact, if your
terminal speaks iso-8859-1, aptitude will handle it just fine without any
changes.)

  Daniel

--
/------------------- Daniel Burrows <dburrows@debian.org> ------------------\
|                        Hi, I'm a .signature virus!                        |
|              Copy me into your .signature to help me spread!              |
\---------------- The Turtle Moves! -- http://www.lspace.org ---------------/
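
To make the s.size() and substring points above concrete, here is a minimal
sketch. It is not taken from aptitude: the helper names utf8_length and
utf8_truncate are hypothetical, and the code assumes its input is already
valid UTF-8 held in a plain std::string. It shows why length-in-memory and
length-in-characters differ, and how a cut can be kept on a character
boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_utf8_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Number of code points in s, assuming s is valid UTF-8.
// Each code point has exactly one non-continuation (lead) byte.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if (!is_utf8_continuation(b)) ++n;
    return n;
}

// Keep at most max_chars code points of s, cutting only on a
// code point boundary (never in the middle of a multi-byte sequence).
std::string utf8_truncate(const std::string& s, std::size_t max_chars) {
    std::size_t pos = 0, seen = 0;
    while (pos < s.size()) {
        if (!is_utf8_continuation(static_cast<unsigned char>(s[pos]))) {
            if (seen == max_chars) break;  // next code point would exceed the limit
            ++seen;
        }
        ++pos;
    }
    assert(pos <= s.size());  // sanity check: the cut lies inside the string
    return s.substr(0, pos);
}
```

For the UTF-8 string "na\xC3\xAFve" ("naive" with a diaeresis on the i),
s.size() is 6 while utf8_length(s) is 5, and s.substr(0, 3) would split the
two-byte character where utf8_truncate(s, 3) keeps it whole. Note that a
code point count is still not a display width: combining and double-width
characters need something like wcwidth() on top of this.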