
Re: [uml-devel] 2.4.22-[67] problems



On Sun, Dec 21, 2003 at 07:25:47PM -0500, Jeff Dike wrote:

> mdz@debian.org said:
> > I have just verified this myself.  Building user-mode-linux
> > 2.4.22-7um-1 on woody works fine (even when running on unstable), but
> > building it on unstable does not.
> 
> Conversely, does a unstable-built UML run on woody?

The unstable-built UML is broken on woody as well.  My most reproducible
test case so far (not 100%, but close) is to start a netcat listener and
connect to it with input from /dev/zero, i.e. just push a bunch of data
over a TCP connection.  What happens is this:

rootstrap:~# nc -v -l -p 1234 >/dev/null </dev/null &
[2] 138
rootstrap:~# listening on [any] 1234 ...

rootstrap:~# nc -v -v localhost 1234 </dev/zero
connect to [127.0.0.1] from localhost [127.0.0.1] 1028
localhost [127.0.0.1] 1234 (?) open
select fuxored : Function not implemented
too many output retries : Broken pipe
 sent 27820032, rcvd 0
[2]+  Exit 1                  nc -v -l -p 1234 >/dev/null </dev/null

The relevant netcat source code isn't doing anything unusual:

    rr = select (16, ding2, 0, 0, timer2);      /* here it is, kiddies */
    if (rr < 0) {
        if (errno != EINTR) {           /* might have gotten ^Zed, etc ?*/
          holler ("select fuxored");
          close (fd);
          return (1);
        }
    } /* select fuckup */

So select is failing with ENOSYS ("Function not implemented"), but, as the
transfer statistics show, it succeeds many times (27820032 bytes were sent)
before it fails.
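
For anyone wanting to try the same test without netcat, here is a rough
sketch in Python 3.  It is not part of the original report: the function
name push_zeros, the chunk size and the use of an ephemeral port instead of
1234 are all illustrative assumptions.  It pushes zeros over a loopback TCP
connection, waits in select() before each send, and reports the symbolic
errno of the first failure, if any.

```python
import errno
import select
import socket
import threading


def push_zeros(total_bytes, chunk=65536):
    """Send total_bytes of zeros over loopback TCP.

    Returns (sent, received, failure), where failure is None or the
    symbolic errno name of the first OSError seen by the sender.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # assumption: ephemeral port, not 1234
    server.listen(1)
    port = server.getsockname()[1]
    received = [0]

    def sink():
        # Plays the role of "nc -l ... >/dev/null": read and discard.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk)
                if not data:
                    break
                received[0] += len(data)

    reader = threading.Thread(target=sink, daemon=True)
    reader.start()

    sent = 0
    failure = None
    payload = bytes(chunk)  # a buffer of zeros, like reading /dev/zero
    client = socket.create_connection(("127.0.0.1", port))
    try:
        while sent < total_bytes:
            # Wait until the socket is writable, as netcat does via select.
            select.select([], [client], [])
            sent += client.send(payload[: total_bytes - sent])
    except OSError as exc:
        failure = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        client.close()
        reader.join(timeout=10)
        server.close()
    return sent, received[0], failure
```

On a healthy system, push_zeros(1 << 20) should return equal sent and
received counts and a failure of None.  Note that Python 3.5 and later
retry select() on EINTR internally, so only other errors (such as ENOSYS)
would show up as a failure here.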

Some other times, a program will simply hang (sometimes even stalling the
boot process), or segfault.

> > The one built on unstable randomly sees ENOSYS from certain system
> > calls, such as select, read and mmap.
> 
> Only those, or are there others that you can tell are failing?  Offhand, I
> don't see any commonality between those three, in terms of their interactions
> with the host.

Those are the ones that I have been able to easily identify.

select came from the netcat test you see above.

mmap was evident from the APT HTTP method:

/usr/lib/apt/methods/http: error while loading shared libraries: libc.so.6: cannot map zero-fill pages: Error 38

(that error is from dl-load.c in glibc, and as far as I can tell indicates
that mmap gave ENOSYS).
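
As a quick way to check which symbolic name a bare error number like this
corresponds to on a given system, a small Python 3 helper can be used.  The
name describe_errno is made up for illustration; errno.errorcode and
os.strerror are standard library.

```python
import errno
import os


def describe_errno(number):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(number, "unknown")
    return name, os.strerror(number)
```

On i386 Linux with glibc, describe_errno(38) gives ENOSYS and "Function not
implemented", which matches the messages above.  Errno numbering differs
between platforms, so the number should be checked on the system that
produced the message.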

basename from coreutils seemed to see write(2) failing:

basename: write error: Function not implemented

I also saw unlink do it, in dpkg:

dpkg: error processing /var/cache/apt/archives/debhelper_4.0.2_all.deb (--unpack):
failed to rmdir/unlink `/usr/share/man/man1/dh_compress.1.gz.dpkg-tmp': Function not implemented

apt occasionally blows up read()ing from a socket as well:

(none):~# apt-get update
Get:1 http://debian woody/main Packages [1774kB]
Err http://debian woody/main Packages
  Error reading from server - read (38 Function not implemented)
Get:2 http://debian woody/main Release [95B]
Fetched 95B in 0s (259B/s)
Failed to fetch http://debian/dists/woody/main/binary-i386/Packages  Error reading from server - read (38 Function not implemented)
Reading Package Lists... Done
Building Dependency Tree... Done
E: Some index files failed to download, they have been ignored, or old ones used instead.

> > I would appreciate any suggestions for how to track this problem down
> > further.
> 
> The randomness is strange.  It suggests that somehow interrupts are getting
> in the way.  One possibility would be host system calls returning ENOSYS
> instead of EINTR.  I don't see much possibility that that's what's actually
> happening, but that's the sort of thing I'd think about.

Can you think of any way that userland changes could produce that kind of
effect?  I don't think I would know where to look.  My kernel didn't change,
and the problem seems to occur on different host kernels.

I tried running UML under strace; this produces an impressive amount of
output, and makes it much more difficult to reproduce the bug.  I finally got
it to happen under strace, and I have a 226M logfile (7M gzipped) from the
session, if you're interested in taking a look.  I've put it up at
http://people.debian.org/~mdz/temp/uml.strace.gz.  I don't see any host
system calls returning ENOSYS; the only failures are some very
innocuous-looking EINTRs and a few EAGAINs that look like they're
associated with a terminal device.
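
For a log of this size, one way to summarize the failed host system calls
is to tally them by call name and errno.  The following Python 3 sketch is
an illustration, not the method used for the observations above: the
function names are made up, and the regular expression assumes strace's
usual "name(args) = -1 ERRNO (message)" line format, with an optional
"[pid N]" or bare pid prefix and "<... name resumed>" continuation lines.

```python
import gzip
import re
from collections import Counter

# Optional "[pid N] " or "N " prefix, then either "name(" or
# "<... name resumed>", then "... = -1 ERRNO" somewhere later on the line.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?:<\.\.\.\s+(?P<resumed>\w+)\s+resumed>|(?P<name>\w+)\()"
    r".*\s=\s-1\s+(?P<err>E[A-Z0-9]+)\b"
)


def count_failures(lines):
    """Tally (syscall, errno name) pairs for calls that returned -1."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line)
        if match:
            syscall = match.group("name") or match.group("resumed")
            counts[(syscall, match.group("err"))] += 1
    return counts


def count_failures_in_file(path):
    """Same as count_failures, reading a plain or gzipped strace log."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        return count_failures(handle)
```

Calling count_failures_in_file("uml.strace.gz") on a local copy of the log
and printing the most_common() entries of the result would show at a
glance whether any host call ever returned ENOSYS.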

-- 
 - mdz


