
Re: Make Unicode bugs release critical?



On Sat, 12 Feb 2011, Adam Borowski wrote:
> On Fri, Feb 11, 2011 at 08:16:54PM -0200, Henrique de Moraes Holschuh wrote:
> > 2. Anything that cannot deal with Supplementary planes.
> > 
> >    This includes the use of UCS-2 instead of UTF-16, as it cannot represent
> >    the Supplementary planes.  python 3 when not compiled to use UCS-4 memory
> >    hog mode is an example, I am told.
> 
> Using UCS-2 is hardly better than using ISO-8859-1 or any other ancient
> charset.  Using either UTF-16 or UCS-4 can be a memory hog, that's why to
> pick UTF-8 for regular use.  Except for some rare cases (CJK with no

Python 3 uses UCS-2 or UCS-4 for its internal string representation,
depending on how it was built.  Most likely they wanted something that
makes it easy to address each character of a Unicode string in O(1).

That might actually give better performance, given how much people like
to do string slicing and splicing in Python.  The O(N) scan that UTF-8
and UTF-16 often require to locate a character might well be more
painful than the much larger data cache footprint of UCS-4... but that
is a damn big *maybe*, and very unlikely to be consistent across very
different architectures.
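
To make the trade-off concrete, here is a rough Python sketch (not how
the interpreter does it internally; the helper names utf8_index and
utf32_index are made up for illustration).  It finds the n-th code point
in a UTF-8 buffer by scanning, and in a UTF-32 buffer by computing an
offset, and compares the size of the two buffers:

```python
def utf8_index(data, n):
    """Return the n-th code point of UTF-8 bytes by scanning: O(N)."""
    count = -1
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a sequence; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(n)


def utf32_index(data, n):
    """Return the n-th code point of UTF-32-LE bytes by offset: O(1)."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "na\u00efve \U0001D11E clef"   # includes one supplementary-plane character
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")

print(len(as_utf8), len(as_utf32))    # 16 48
print(utf8_index(as_utf8, 6) == utf32_index(as_utf32, 6))   # True
```

The fixed-width buffer is three times larger here, which is the cache
footprint side of the argument; which side wins in practice would have
to be measured on the workload and hardware in question.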

Well, not like I care.  I don't even have Python 3 installed, and I will
only install it the day something I need pulls it in as a dependency.

> Picking a random subset of Unicode is like putting day-of-the-year in one

UCS-2 is deprecated as all heck.  As far as I could find through Google,
it has not been a valid Unicode encoding form since Unicode 2.0 (i.e.
1996), which added the surrogate mechanism that UTF-16 uses to reach the
Supplementary planes.  So it wouldn't even count as a "random subset of
Unicode".
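
For illustration, a small Python sketch of why UCS-2 cannot reach the
Supplementary planes while UTF-16 can.  The function names are
hypothetical, and the surrogate arithmetic is the standard UTF-16 rule,
not anything specific to Python's internals:

```python
import struct


def utf16_units(ch):
    """Return the 16-bit code units of `ch` when encoded as UTF-16."""
    data = ch.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def surrogate_pair(cp):
    """Split a supplementary code point (above U+FFFF) into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def fits_in_ucs2(ch):
    """True if `ch` is representable as a single 16-bit unit, as UCS-2 requires."""
    return len(utf16_units(ch)) == 1


clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, in the Supplementary planes
print([hex(u) for u in utf16_units(clef)])   # ['0xd834', '0xdd1e']
print(fits_in_ucs2("\u00e9"), fits_in_ucs2(clef))   # True False
```

Anything that treats those two units as two separate characters is
effectively doing UCS-2, and will split or miscount such text.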

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

