Re: Good news on Debian Sparc port stability

On Jun 4, 2015, at 11:07 AM, James Y Knight <jyknight@google.com> wrote:

GLibc

=====

After that, everything seemed to be going fine, except that programs like GCC would randomly segfault and give parse errors. This has been reported before, e.g. http://thread.gmane.org/gmane.linux.ports.sparc/16835, from 2 years ago. Things were stable enough to use interactively, if you're willing to keep retrying a build until it works, but not stable enough to use for any autobuild system.

After a getting a hint from Aurelien that disabling optimized memcpy routines in glibc (eglibc 2.19-1, on Wed, 04 Jun 2014 20:32:06 +0200) had improved, but did not fix, the problem, I started looking into that....

...And found that recompiling glibc, disabling the sparcv9 optimizations (that is: eliminating debian/patches/sparc/local-sparcv9-target.diff), *appears* to have completely fixed the stability issue!

To try to verify that, I ran a loop building and rebuilding 'clang' (with full "ninja" parallelism) overnight, and it's had zero crashes in all 14 builds of clang that it got through. Prior to fixing glibc, at least one of the ~2300 build steps (gcc/as/ld) was sure to crash unreproducibly.

It'd be great if someone wants to try to figure out exactly /which/ of the asm routines in the various sysdeps/**/sparc32/sparcv9 are broken, to narrow down the problem better, too. I highly suspect there's just something wrong in one or more of the hand-written asm files, but it's certainly possible there's some wider problem that the sparcv9 optimizations of glibc (but nothing else I've seen so far), just happens to expose.

So, bad news and good news:

Bad News: the above solution of simply disabling sparcv9 breaks some things (other than gcc). It breaks something about atomics or semaphores, likely due to a mismatch of expectations between libc and other things (the sparc32 routines, when *NOT* compiled in a shared library, dynamically choose between the v8 and v9 ways of doing things, so it's entirely reasonable to assume that doing it the v8 way cannot work right).

Good News:

My next attempt at a fix, is to just disable the optimized string ops:

rm sysdeps/sparc/sparc32/sparcv9/*mem* sysdeps/sparc/sparc32/sparcv9/*st*

That seems to still have fixed the random gcc crashes, AND doesn't break other things. :)

Looking into what the deleted routines are doing that's "interesting":

* memcpy and memset:

They're using LDBLOCKF STBLOCKF "block copy" instructions, which are:

1) Not actually part of the Sparcv9 standard instruction set, but rather are processor-specific (Although, these processor-specific instructions have been implemented since the UltraSPARC I).

"The LDBLOCKF instruction is intended to be a processor-specific instruction, which may or may not be implemented in future Oracle SPARC Architecture implementations. Therefore, it should only be used in platform-specific dynamically-linked libraries or in software created by a runtime code generator that is aware of the specific virtual processor implementation on which it is executing."

2) Marked deprecated.

"The LDBLOCKF instructions are deprecated and should not be used in new software. A sequence of LDDF instructions should be used instead."

3) Don't follow the normal TSO memory model ordering that everything else does; they require explicit MEMBARs in the right places to ensure even *single-thread/cpu* memory ordering correctness.

"Block operations do not generally conform to dependence order on the issuing virtual processor; that is, no read-after-write or write-after-read checking occurs between block loads and stores. Explicit MEMBARs are required to enforce dependence ordering between block operations that reference the same address."

It certainly looks like the author of those routines *tried* to do the right thing w.r.t. inserting membar instructions in the right place, but I can easily imagine it's wrong somehow. And it is entirely plausible that the behavior would be hardware-generation specific, since it has, by design, weird hardware-specific memory semantics. I'm placing my bets on this one being the problem.

* memchr, memcmp, strcmp, strcpy, etc.

These are using a nonfaulting load instruction. The nonfaulting load doesn't actually mean the hardware doesn't fault on loading from an unmapped page. Actually, unmapped pages still cause a fault, but the fault is supposed to be handled by the OS. It's also possible to map pages as "for use by nonfaulting loads only" (linux doesn't appear to do this).

That's a rare instruction -- not generated by GCC I think, so I could imagine there being a bug in the fault handler for it. I think that's less likely though, since it doesn't seem like it'd be CPU-architecture specific.

James