
Re: #215067 mozilla FTBFS



Okay ARM hackers, as a user/neophyte, I need your help.

As discussed below (the good stuff is at the end), I've traced the
mozilla segfault to their PR_dtoa function, which converts doubles to
strings.  Because it directly manipulates the bits of doubles, it makes
some gross errors on ARM: when asked to convert 1.0, it apparently tries
to compute 5^1242306295!  This overflows a table.  I've worked around
that to fix the segfault (patch attached to previous post), but with the
workaround it uses insane amounts of time and memory analyzing these
huge numbers.

Does ARM store doubles in a non-IEEE way?  Is it secretly big-endian for
its float/double emulation, and little-endian for ints?  What else could
be wrong with mozilla's assumptions about double format?
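
The word-order question can be checked directly.  Below is a minimal
sketch, assuming 8-byte IEEE 754 doubles and 32-bit words; double_words()
and high_word_index() are made-up helper names, not anything from NSPR.
It copies the bytes of 1.0 into two 32-bit words and reports which one
holds the sign/exponent pattern 0x3ff00000.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bytes of a double into two 32-bit words, in memory order. */
static void double_words(double d, uint32_t out[2])
{
    assert(sizeof d == 2 * sizeof out[0]);  /* assumes 8-byte doubles */
    memcpy(out, &d, sizeof d);
}

/*
 * Report which 32-bit word of 1.0 holds the sign/exponent bits.
 * Returns 0 if it is the first word in memory, 1 if it is the second,
 * and -1 if neither word matches (bytes inside a word are laid out
 * differently from integers, or the format is not IEEE 754).
 */
static int high_word_index(void)
{
    uint32_t w[2];

    double_words(1.0, w);
    if (w[0] == 0x3ff00000u && w[1] == 0)
        return 0;
    if (w[1] == 0x3ff00000u && w[0] == 0)
        return 1;
    return -1;
}
```

Printing high_word_index() on the ARM box and on an i386 box and
comparing the two results would show whether ARM stores the two words of
a double in a different order than the code assumes.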

Please help out here.  I think this may be the cause of a lot of
related problems, and once it's fixed I think we'll have a decently
working mozilla/galeon/epiphany on ARM -- or at least, one that builds
and installs!

On Tue, 2003-10-21 at 11:15, Adam C Powell IV wrote:
> On Mon, 2003-10-20 at 22:24, Adam C Powell IV wrote:
> > Okay, just a bit more "manual backtrace" info:
> > 
> > On Mon, 2003-10-20 at 21:06, Adam C Powell IV wrote:
> > > During the call to NSS_Init, nss_makeFlags(1,0,0,0,0,1) returns 0x219a8,
> > > and the resulting moduleSpec is:
> > > 
> > > name="NSS Internal Module" parameters="configdir='/home/hazelsct/.netscape' certPrefix='' keyPrefix='' secmod='secmod.db' flags=readOnly,optimizeSpace " NSS="flags=internal,moduleDB,moduleDBOnly,critical"
> > > 
> > > Then SECMOD_LoadModule() returns something non-null, but apparently
> > > ->loaded is zero because nss_Init returns -1.
> > > 
> > > During the call to NSS_NoDB_Init, nss_makeFlags(1,1,1,1,0,1) returns
> > > 0x25268 (okay, maybe this is an address whose value is meaningless, not
> > > what I thought), and the resulting moduleSpec is:
> > > 
> > > name="NSS Internal Module" parameters="configdir='' certPrefix='' keyPrefix='' secmod='' flags=readOnly,noCertDB,noModDB,forceOpen,optimizeSpace " NSS="flags=internal,moduleDB,moduleDBOnly,critical"
> > > 
> > > Then ->loaded seems to work, because it calls secoid_Init(), then
> > > segfaults in the call to STAN_LoadDefaultNSS3TrustDomain().  Which in
> > > turn segfaults in NSSTrustDomain_Create(), which segfaults in
> > > NSSArena_Create().  (God, how I wish I could just "backtrace"!!)
> > 
> > This calls nss_ClearErrorStack() in nss/lib/base/arena.c, which calls
> > error_get_my_stack(), and since error_stack_index=0, it calls
> > PR_CallOnce() in nsprpub/pr/src/misc/prinit.c; that's where the segfault
> > is.
> 
> Okay, now I can't tear myself away...
> 
> PR_CallOnce() calls the function passed to it by error_get_my_stack(),
> which is error_once_function(); that calls nss_NewThreadPrivateIndex(),
> which calls set_whatnspr(), which calls PR_dtoa() in
> nsprpub/pr/src/misc/prdtoa.c, which is supposed to print a double value
> to an ASCII string.
> 
> Its pow5mult() is EXTREMELY slow on ARM, taking about 30 seconds to
> segfault, with most of the time spent in the very slow mult() function
> it calls, which is where the segfault is.  (Is there really no better
> way to print double values?)  It segfaults in mult() on "c = Balloc(k)"
> with k=16 (after succeeding in about 20 previous mult() calls with k=1
> to 15).
> 
> Okay.  So Balloc() uses a freelist to find a small chunk of memory of
> size k.  Most of the time (in fact, all but one other time when k=1),
> freelist[k] is NULL, so it allocates (2^k-1)*sizeof(Long)+sizeof(Bigint)
> bytes.  For some reason, when k=16, freelist[k] is non-null, and
> PR_Unlock(freelist_lock) segfaults, perhaps because it sets freelist[16]
> to rv->next.  [Doesn't glibc already use something like this freelist to
> handle small malloc/free entries efficiently?]

Scratch the glibc comment, the freelist seems an efficient way to handle
repeated allocs of a small set of fixed sizes.
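
For illustration, here is a minimal sketch of that kind of size-class
freelist, with the bounds check on k that appears to be missing.  It is
not the prdtoa.c code: Block, block_alloc(), block_free() and KMAX are
hypothetical names, there is no locking, and the block layout is only an
assumption modelled on the sizes quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define KMAX 15                     /* assumed largest size class */

typedef struct Block {
    struct Block *next;             /* link used while on a freelist */
    int k;                          /* size class: holds 1 << k words */
    unsigned long words[1];         /* first word; the rest follow */
} Block;

static Block *freelist[KMAX + 1];   /* one list per size class 0..KMAX */

/* Get a block of class k, reusing a freed one if available. */
static Block *block_alloc(int k)
{
    Block *b;

    if (k < 0 || k > KMAX)
        return NULL;                /* refuse instead of indexing past the table */
    b = freelist[k];
    if (b != NULL) {
        freelist[k] = b->next;      /* pop the head of the list */
    } else {
        size_t nwords = (size_t)1 << k;

        b = malloc(sizeof(Block) + (nwords - 1) * sizeof(unsigned long));
        if (b == NULL)
            return NULL;
        b->k = k;
    }
    b->next = NULL;
    return b;
}

/* Return a block to the freelist of its size class. */
static void block_free(Block *b)
{
    if (b == NULL)
        return;
    assert(b->k >= 0 && b->k <= KMAX);
    b->next = freelist[b->k];       /* push onto the head of the list */
    freelist[b->k] = b;
}
```

With this shape, a request for k=16 comes back as NULL, and the caller
has to treat that as an error.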

> Aha!  The freelist is just Kmax=15 entries long, explaining the bogus
> non-null freelist[16], and the segfault.  The attached patch thus cures
> the segfault, and should be sent upstream; note however that it leaks
> memory, as I couldn't figure out how to use PR_Free (compiler error
> 'missing binary operator before token "PR_Free"' even though it's void).

This does indeed fix the segfault.  But it takes about 30 seconds to
reach the point where k>Kmax, so a better fix would be to raise some
kind of overflow error at that point, or better yet, as soon as pow5mult
gets an argument large enough to push k above Kmax.
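
An early check along those lines might look like the sketch below.  It
rests on assumptions that are not taken from prdtoa.c: bignum words are
32 bits, a class-k block holds 2^k words, and pow5_size_class() is a
hypothetical helper.  It uses log2(5) < 2.322 to bound the size of 5^n
without computing it, so it may overestimate k slightly.

```c
#include <assert.h>

#define KMAX 15     /* assumed largest size class */

/*
 * Estimate the smallest k such that (1 << k) 32-bit words can hold 5^n.
 * Returns -1 if that k would exceed KMAX, so the caller can report an
 * overflow before doing any multiplication.
 */
static int pow5_size_class(long n)
{
    long long bits, words;
    int k;

    assert(n >= 0);
    /* log2(5) < 2.322, so 5^n needs at most n*2322/1000 + 1 bits. */
    bits = (long long)n * 2322 / 1000 + 1;
    words = (bits + 31) / 32;       /* round up to whole 32-bit words */
    for (k = 0; k <= KMAX; k++) {
        if ((1LL << k) >= words)
            return k;
    }
    return -1;
}
```

Under these assumptions, the exponent 1242306295 seen here is rejected
immediately, while exponents up to a few hundred thousand still fit.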

> So why does it take so long (25 minutes and counting) and so much memory
> (1400K and counting) to represent double numbers -- like 1.0 (which
> PR_dtoa is called with here) -- as strings?

Make that 45 hours and 19 MB and counting...  (Yes, it's rather
pointless at this point, but just for fun. :-)

> There may be a deeper ARM
> issue involved, which is why it calls pow5mult(i2b(1),1242306295). 
> Perhaps it's trying to print 5^1242306295?  So d=1.0, but d2=4.29078e+9,
> and d2 -> k -> s5 is the argument to pow5mult.  d2 is set using the
> macros:
> 
> dval(d2) = x = word1(d) << (32-i), where i=30 and word1(d)=0x3ff00000,
> word0(d2) -= 31*Exp_msk1, where Exp_msk1=0x80,
> 
> where word0() and word1() are unsigned longs representing the first and
> second sub-words of d (the dtoa argument, double 1.0).  I'd hate to see
> what this does on a 64-bit arch, where a double doesn't have two long
> sub-words!  I'd speculate it's that word1(d) which is causing the
> problem here.
> 
> I'm out of time to work on this today, but needless to say, PR_dtoa is
> quite broken on ARM, and almost certainly needlessly duplicates
> something which is done very well by glibc.
-- 
-Adam P.

GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Welcome to the best software in the world today cafe!
http://lyre.mit.edu/~powell/The_Best_Stuff_In_The_World_Today_Cafe.ogg


