
Why is troubleshooting Linux so hard?



Good morning,

I'm going to list some of the frustrations I've been having with 
troubleshooting Linux's quirks, crashes and problems in hopes that someone may 
be able to help me (and the community) become better bug reporters and 
troubleshooters.  I'll make comparisons to Windows only because I am used to 
fixing the same problems in Windows a certain way - maybe there are analogies 
in Linux or maybe I'm approaching these problems the wrong way.  I'm not 
trying to troll or flame-bait.  I'm using Debian Squeeze, by the way.

1) Is there a way to apply debugging symbols retroactively to a dump? A few 
times I've had Linux crash on me and spit out a debugging dump.  I do my best 
to install debugging symbols for all 1400 packages I have on my system (when I 
can find them) but this requires a huge amount of hard disk space and, 
invariably, the odd dump is missing symbols.  Recreating the crash isn't 
always possible.  Is there (or could someone invent) a way to save a dump 
without the symbols, download the symbol tables and then regenerate the dump 
with the symbols so it's useful to developers?

2) I find that the logs contain lots of facts but not a whole lot of useful 
information (if any) when something goes wrong.  I've had KDE go black-screen 
on me, for example, and force a hard reboot but there's no mention whatsoever 
(that I can find) in xorg.log, kdm.log, messages, syslog or dmesg.  Windows 
seems to be fairly good at making its last breath a stop error before it dies 
which means when I get back into the system (or when I'm looking at a client's 
computer days after) I can find that stop error, look it up and figure out what 
went wrong.  Are Linux's logs designed for troubleshooting or only for 
monitoring?  Are proper troubleshooting logs kept somewhere else or in a 
special file? Is there a guide on how to read Linux's logs so I can make sense 
out of them like I can Windows' logs?

3) Linux needs better troubleshooting and recovery systems.  The answer I 
usually get when I get an unexplained error is to run the program inside gdb 
or with valgrind.  I'm not convinced that this is a practical way to 
troubleshoot serious problems (like kernel panics) and it requires a certain 
amount of foresight that a problem will occur.  According to this logic, the 
only way that someone can produce useful reports and feedback (or even get a 
clue as to what happened) on the day-to-day crashes and bugs is to start Linux 
and all of its subprocesses inside valgrind and/or gdb.  This is obviously not 
an intended use of these programs.
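
On point 1 above (applying symbols after the fact), here is a minimal sketch 
of one partial approach, not a definitive tool.  It assumes the crash left an 
x86-style kernel "segfault" line in the logs; the regular expression, the 
function name decode_segfault and the sample line are illustrative 
assumptions, and other kernels or architectures may print a different format.  
The idea is to record the faulting offset inside the named file so that it 
can be looked up later, once the matching debug symbols are installed.

```python
import re

# Assumed shape of an x86 kernel segfault line, e.g.
#   prog[123]: segfault at 0 ip ... sp ... error 4 in libfoo.so[base+size]
# All numbers in the line, including the error code, are hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a segfault log line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    info = {
        "program": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        # Low three bits of the x86 page-fault error code.
        "cause": "protection violation" if err & 1 else "page not mapped",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "object": m.group("obj"),   # file the instruction pointer was in
        "offset": None,             # instruction pointer relative to mapping
    }
    if m.group("base") is not None:
        info["offset"] = ip - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    # Hypothetical log line, made up for illustration.
    sample = ("Mar  3 10:15:02 box kernel: [ 1234.567890] myprog[4321]: "
              "segfault at 0 ip 00007f3a1c2b4f10 sp 00007fff5a3b2c10 "
              "error 4 in libfoo.so.1[7f3a1c2a0000+30000]")
    info = decode_segfault(sample)
    print(info["program"], info["object"], hex(info["offset"]))
```

The saved offset can then be tried with something like 
"addr2line -e <file with symbols> <offset>" at a later date.  This assumes 
the reported mapping starts at the beginning of the file, which is not 
guaranteed, so treat the result as a hint rather than proof.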

This is what would make it easier (at least for me) to troubleshoot Linux 
problems.  If these features exist, please let me know so I can start using 
them (they should probably be documented in the man pages too).

1) Logs need to have useful information.  When I look at a client's Windows 
box days after they report something going wrong, the logs tell me at what 
time the problem happened, which process failed and what error it threw just 
before it blew.  I can look those error codes up and (usually) fix the problem 
within an hour.  When something dies on Linux, the log entry (assuming it even 
makes one) only tells me how many seconds into that particular boot the 
problem occurred. I've never been able to go back a few days later and find the 
log entries related to a particular crash - maybe because they've been purged.  
I know that the Linux tradition is to identify processes only by ID but surely 
there must be a way that it can print a file or package name or anything more 
useful than memory addresses and registers so at least I know where to start 
pointing fingers.  Several people have told me that it's pointless trying to 
debug a dump in the logs.  What's the point of dumping it in the first place if 
nobody can read it?

2) I wish error logs had simple codes or messages (which have documentation) 
like Windows Stop errors so I can look them up and figure out why something 
died.  Oftentimes I try to Google the whole error message and either get 
directed to source code or totally irrelevant postings (since it seems that 
many messages are reused for all kinds of problems).  For example, 'segfault' 
gets thrown so much that it only tells you that the program crashed - 
something I already know.

3) Logs need better organisation.  I'm looking at the most recent dump and 
each message is printed on its own line.  The problem is that interspersed in 
those individual lines may be other entries from other events not related to 
the problem in question.  When I look at a Windows log, each event is entirely 
contained in one entry.  It doesn't make one entry for "Stop", another entry 
for the Stop number, another 4 entries for the parameters and more entries for 
whatever other information usually is in them - whilst having other entries 
amid the list with what other things were doing at the time.  I find Linux logs 
very frustrating to read for that reason since I don't know when an event is 
finished reporting or which items are relevant.

4) Logs need to focus on reporting on one thing and making sure it reports 
that one thing well.  Other than formatting, I can't see many differences 
between syslog, dmesg and messages.  Xorg.log is some help for troubleshooting 
misconfigured xorg.conf files (which are deprecated anyway) but not very useful 
if your X session burns down.  kdm.log seems identical to Xorg.log except for 
a few KDE-specific entries.  I had to uninstall my firewall because it kept 
writing firewall entries to messages (and stdout) and I couldn't figure out how 
to get it to stop.  Why isn't there one log that only deals with hardware 
status and changes, another one that only deals with network status and 
firewall logging, another one which only deals with dumps and crashes and so 
on?
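
As a rough illustration of points 3 and 4, the sketch below regroups 
interleaved syslog-style lines by the program that wrote them, so that each 
source can be read on its own.  It assumes the traditional 
"Mon DD HH:MM:SS host tag[pid]: message" layout; the pattern, the function 
name group_by_source and the sample lines are made up for the example and a 
real syslog configuration may use a different format.

```python
import re
from collections import defaultdict

# Assumed traditional syslog layout: timestamp, host, tag, optional [pid].
SYSLOG_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)


def group_by_source(lines):
    """Map each program tag to its (timestamp, message) pairs, in log order."""
    groups = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        m = SYSLOG_RE.match(line)
        if m is None:
            # Keep lines we cannot parse instead of silently dropping them.
            groups["(unparsed)"].append(("", line))
            continue
        groups[m.group("tag")].append((m.group("stamp"), m.group("msg")))
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical interleaved entries, invented for illustration.
    sample = [
        "Mar  3 10:15:01 box kernel: [ 1234.567890] usb 1-1: new device",
        "Mar  3 10:15:01 box dhclient: DHCPREQUEST on eth0",
        "Mar  3 10:15:02 box kernel: [ 1234.600000] usb 1-1: configured",
    ]
    for tag, entries in group_by_source(sample).items():
        print(tag)
        for stamp, msg in entries:
            print("   ", stamp, msg)
```

Grouping by tag is only one possible choice.  It does not tell you which 
lines belong to the same event, but it does separate, say, kernel messages 
from firewall or DHCP chatter when reading a log after the fact.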

Maybe I just haven't found the right manual yet that has all of these answers 
so I'd appreciate any direction.

With regards,

Borden Rhodes

