
Re: Why is troubleshooting Linux so hard?



In <[🔎] 201008150200.52677.jrvp@bordenrhodes.com>, Borden Rhodes wrote:
>1) Is there a way to apply debugging symbols retroactively to a dump? A few
>times I've had Linux crash on me and spit out a debugging dump.  I do my
>best to install debugging symbols for all 1400 packages I have on my system
>(when I can find them) but this requires a huge amount of hard disk space
>and, invariably, the odd dump is missing symbols.  Recreating the crash
>isn't always possible.  Is there (or could someone invent) a way to save a
>dump without the symbols, download the symbol tables and then regenerate
>the dump with the symbols so it's useful to developers?

Yes, there is, sometimes.  Ubuntu has a process that does it automatically and 
mostly gets it right.

Modern versions of "strip" et. al. allow you to save the debugging information 
to a separate .so that just contains debugging information.  gdb (et. al.) can 
then use the debugging-info only .so to decorate an existing backtrace.

This is actually how a lot of distributions produce separate -dbg or -DEBUG 
packages.
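
To make that concrete, the usual workflow looks something like the sketch 
below (names like libfoo.so are made up, and your binutils/gdb may want 
slightly different options):

  # split the debug info out of a freshly built library
  objcopy --only-keep-debug libfoo.so libfoo.so.debug
  objcopy --strip-debug libfoo.so
  # record where the debug file lives, so gdb can find it later
  objcopy --add-gnu-debuglink=libfoo.so.debug libfoo.so

  # later, when looking at a core, gdb follows the debuglink (or the
  # build-id) and picks up libfoo.so.debug
  gdb ./myprog core

The -dbg packages basically just ship those *.debug files under 
/usr/lib/debug.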

However, these debugging-info-only files only match the *exact same build* of 
the real .so.  Taking a random backtrace, determining which build it came 
from, and finding the appropriate -dbg packages is a bit difficult.

Also, things like prelink, which modify existing .so files, result in the 
debugging-info-only .so not matching.  This might also happen with some types 
of hardening that reduce the impact of heap/stack overflow/underflow attacks.

Compounding this problem is the large number of programs that are being 
written with parts in "scripting" languages, or otherwise non-C/C++ languages, 
where the path from a symbol in an ELF file to the problematic code is not as 
simple.

In short, it can be done in some cases, and there are programmers working on 
making backtraces from Joe Sixpack or Jane Boxwine more useful.  It does seem 
like there need to be more people working on this, but it is not very "sexy" 
work.  Most programmers would rather spend their time improving the user 
experience when things are working; IME, that is where the user spends most of 
their time.

>2) I find that the logs contain lots of facts but not a whole lot of useful
>information (if any) when something goes wrong.  I've had KDE go
>black-screen on me, for example, and force a hard reboot but there's no
>mention whatsoever (that I can find) in xorg.log, kdm.log, messages, syslog
>or dmesg.  Windows seems to be fairly good at making its last breath a stop
>error before it dies which means when I get back into the system (or when
>I'm looking at a client's computer days after) I can find that stop error,
>look it up and figure out what went wrong.  Are Linux's logs designed for
>troubleshooting or only for monitoring?  Are proper troubleshooting logs
>kept somewhere else or in a special file? Is there a guide on how to read
>Linux's logs so I can make sense out of them like I can Windows' logs?

In the case of a kernel crash, the last breath of the system is unfortunately 
not writing to dmesg/syslog and sync()ing disks.  Depending on the nature of 
the crash, there are some good reasons not to do this, though.  (E.g., in the 
case of a PANIC(), the kernel developer is basically indicating that the 
kernel image has been compromised -- doing FS operations with a compromised 
kernel might cause [more] data loss.)

I think that logs in general are... dropping in quality.  They seem to be less 
focused around failed "sanity" checks, mis-configuration warnings, and I-was-
here-before-I-called-exit() messages.  They seem to be more filled with I-
didn't-comment-this-out-before-our-release build debugging messages for random 
developers.  This is not true of kernel logs for the most part; I find them 
informative, but it is rarely my kernel that causes me problems.

I speak as someone that has been working as a developer in some capacity for 8 
years.  Take that for what you will.

>3) Linux needs better troubleshooting and recovery systems.  The answer I
>usually get when I get an unexplained error is to run the program inside a
>dbg or with valgrind.  I'm not convinced that this is a practical way to
>troubleshoot serious problems (like kernel panics) and it requires a
>certain amount of foresight that a problem will occur.  According to this
>logic, the only way that someone can produce useful reports and feedback
>(or even get a clue as to what happened) on the day-to-day crashes and bugs
>is to start Linux and all of its sub process inside valgrind and/or gdb. 
>This is obviously not an intended use of these programs.

If we don't know how to reproduce the problem, we can't fix it.  If we do know 
how to reproduce the problem, the foresight needed to use gdb/valgrind is not 
too much more.  They shouldn't be your first tools, but they are necessary.
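
For a userspace program you can reproduce, the amount of ceremony really is 
small; a sketch (the program name and arguments are placeholders):

  valgrind --leak-check=full ./myprog --its-usual-args

  gdb --args ./myprog --its-usual-args
  (gdb) run
  (gdb) bt full

Once it crashes under gdb, "bt full" prints a backtrace with local variables; 
either that or the valgrind output pasted into a bug report is a big step up 
from "it crashed".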

I've also had gdb/valgrind mask errors, which is truly unfortunate.  Still, if 
you know a way to make it crash every time EXCEPT when in gdb/valgrind, that 
tells me something as a developer.

NB: I've never had gdb/valgrind help with kernel errors, since those tools 
generally live in user space.

Being able to reproduce the error is the *most important* step.  IME, there 
are very few problems that can't be fixed/worked-around in 8 man-hours once 
you can reproduce the problem in under 15 minutes.

Also, if you have an unreproducible problem, I'm gonna blame the hardware or 
cosmic radiation, not the code.

>1) Logs need to have useful information.

Agreed.

>When I look at a client's Windows
>box days after they report something going wrong, the logs tell me at what
>time the problem happened, which process failed and what error it threw just
>before it blew.  I can look those error codes up and (usually) fix the
>problem within an hour.

Since Linux is a less homogeneous environment, there's no ultimate table of 
error codes to look at.

>When something dies on Linux, the log entry
>(assuming it even makes one) only tells me how many seconds into that
>particular boot the problem occurred. I've never been able to go back a few
>days later and find the log entries related to a particular crash - maybe
>because they've been purged. 

I've still got logs from 2009 on my currently running desktop.  They *have* 
been archived, but they are still available.  You should check your logrotate 
settings to make sure your logs are being handled the way you'd like.
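
For example, a minimal /etc/logrotate.d/ entry (a sketch with a made-up log 
path; adjust the schedule and retention to taste):

  /var/log/myapp.log {
      weekly
      # keep a year of compressed archives around
      rotate 52
      compress
      missingok
      notifempty
  }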

>I know that the Linux tradition is to identify
>processes only by ID but surely there must be a way that it can print a
>file or package name or anything more useful than memory addresses and
>registers so at least I know where to start pointing fingers.

The kernel doesn't know about packages.  It does know about files, but once 
the process is running, it doesn't identify the file using a pathname.  As it 
is dying, it is difficult to extract accurate information, particularly if it 
has already "eaten" its own memory image.

>Several
>people have told me that it's pointless trying to debug a dump in the logs.
> What's the point of dumping it in the first place if nobody can read it?

It is a place to start, but it's not a very good one.  A kdump or corefile is 
usually much better.  A backtrace tells you a set of functions to look at for 
obvious errors; a kdump or corefile allows you to inspect local variables and 
determine exactly which of your assumptions was violated.
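
Digging into a corefile with gdb goes roughly like this (a sketch; "myprog", 
the frame number, and the variable name are all invented):

  gdb ./myprog core
  (gdb) bt
  (gdb) frame 3
  (gdb) info locals
  (gdb) print *request

"bt" gives you the same list of functions a logged backtrace would; "frame", 
"info locals", and "print" let you look at the actual values and see which 
assumption was violated.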

>2) I wish error logs had simple codes or messages (which have documentation)
>like Windows Stop errors so I can look them up and figure out why something
>died.  Often times I try to Google the whole error message and either get
>directed to source code or totally irrelevant postings (since it seems that
>many messages are reused for all kinds of problems).  For example,
>'segfault' gets thrown so much that it only tells you that the program
>crashed - something I already know.

A segfault is a very specific type of crash: a process attempted to access a 
memory address that was either not mapped or was mapped without the required 
permissions.  (Trying to move the IP to a place that is mapped NOEXEC, trying 
to write to a read-only mmap(), or even a simple dereference of a NULL 
pointer.)
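
In miniature, two of the classic ways to earn one (illustrative only; both are 
undefined behavior in C that typically shows up as SIGSEGV on Linux):

  int main(void)
  {
      char *s = "constant";  /* string literals sit in a read-only mapping */
      int  *p = 0;           /* NULL is not mapped at all */

      s[0] = 'C';            /* write to a read-only mapping */
      *p   = 42;             /* dereference of a NULL pointer */
      return 0;
  }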

Unfortunately, a segfault is the most common type of hard crash.  It can be 
caused by a multitude of programming errors.  If your program is not 
segfaulting, it can likely recover in some meaningful way, or at least write a 
log message and cleanly exit.  If it is segfaulting, there is relatively 
little you can do; a signal handler in C isn't allowed to call all of the 
library functions, and returning from the SIGSEGV handler causes the program 
to terminate or immediately get the signal again, so you can't set a flag.
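
About the most a SIGSEGV handler can usefully do is emit a last word with 
async-signal-safe calls and then die; a sketch (not something I would build a 
recovery scheme on):

  #include <signal.h>
  #include <unistd.h>

  /* Only async-signal-safe functions (write, _exit, ...) may be called
     here -- no printf, no malloc, no syslog. */
  static void on_segv(int sig)
  {
      static const char msg[] = "fatal: segmentation fault, giving up\n";
      (void)sig;
      (void)write(STDERR_FILENO, msg, sizeof msg - 1);
      _exit(1);   /* returning would just re-raise the fault */
  }

  int main(void)
  {
      signal(SIGSEGV, on_segv);
      int *p = 0;
      *p = 42;    /* trigger the crash */
      return 0;
  }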

Error codes and fixed error messages are established after the main body of 
code is written, so they can be standardized throughout the body of the code 
and documented.  However, with release early, release often being the mantra 
of many projects, that level of "freeze" never happens.  New error messages 
and conditions are added all the time, and (more often than not) old error 
messages and conditions go away when recovery code is added.

>xorg.conf files (which are depreciated
>anyway)

It's not deprecated.  xorg.conf is *the* correct place to configure your 
Xorg.  However, one of the goals of Xorg is to have enough auto-detection and 
dynamic re-configuration that an empty (or missing) xorg.conf is enough for 
everyone.

>Why isn't
>there one log that only deals with hardware status and changes, another one
>that only deals with network status and firewall logging, another one which
>only deals with dumps and crashes and so on?

There are a fixed number of "syslog" facilities, but they were designed in the 
days of AT&T UNIX, so not all of them are entirely relevant.  It seems like 
Linux could probably add some more, but portable programs wouldn't use them.  
Plus, a lot of programs don't log via syslog() anymore anyway.
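
For reference, the syslog() side of it looks roughly like this (a sketch; the 
daemon name and message are invented):

  #include <syslog.h>

  int main(void)
  {
      /* you pick one of the fixed facilities (LOG_DAEMON, LOG_MAIL,
         LOG_AUTH, ...); there is no LOG_FIREWALL or LOG_HARDWARE to
         route messages by topic */
      openlog("mydaemon", LOG_PID, LOG_DAEMON);
      syslog(LOG_ERR, "could not open %s: giving up", "/etc/mydaemon.conf");
      closelog();
      return 0;
  }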

Anyway, it could be a lot better, I agree.  I seem to remember that Debian and 
most upstream projects do accept volunteers.  ;)
-- 
Boyd Stephen Smith Jr.                   ,= ,-_-. =.
bss@iguanasuicide.net                   ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy         `-'(. .)`-'
http://iguanasuicide.net/                    \_/
