Re: Debian Testing Jess: Server Randomly Reboots - No Logs / Indication - Rsyslog Issue?



Adam Brenner wrote:
> The issue I am facing is that at random, unrepeatable times, the server
> locks up and requires a reboot. However, none of the system generated
> logs in /var/logs/ report any kernel panic or memory dump. I have ran a
> number of grep commands and even manually spent time tracing the logs at
> the time of day and nothing shows up. I ended up performing a chassis
> swap (replaced motherboard, CPU, memory, PSU, etc). Yet, this still occurs.

What type of hardware is your "server"?  I often use inexpensive
consumer grade systems for servers.  I am using a Raspberry Pi as a
server for one application.  (Hard to be greener at 2.5 watts for that
particular need.)  But the hardware in those is not as robust as
hardware designed for enterprise environments with ECC throughout and
other high quality hardware.  I still use them for that function but
the word server could mean many different things to different people.
What type of hardware do you have?

> This leads me to believe that the Rsyslog is not accurately logging
> messages.

I have been in the same unfortunate situation many times.  Machines
crash.  Nothing in the logs.  In the fortunate cases where I had a
physical console, about half the time there would be a kernel panic
message on the console.  The other half of the time there was nothing
useful logged to the console.

That illustrates the problem with relying upon syslog.  Syslog is good
for reporting userland events.  But syslog is not very good for
reporting kernel panics.  When the kernel has fallen down, userland
stops running, and syslog is just another userland program.  Syslog
will stop running too.
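
Because syslog goes silent at the same moment the kernel does, about
the most the log files can tell you is when the machine stopped
writing.  A minimal Python sketch of finding those silent periods
follows.  The function name find_gaps, the ten minute threshold, and
the traditional "Mon DD HH:MM:SS" timestamp at the start of each line
are all assumptions for illustration, not anything taken from the logs
in question.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where the log went quiet.

    Assumes each line starts with a traditional syslog timestamp such
    as "Jun 12 03:14:07" (15 characters, no year) and that month names
    are in English.  The year is passed in because the traditional
    format does not record it.  Lines that do not parse are skipped.
    """
    gaps = []
    prev = None
    for line in lines:
        try:
            stamp = datetime.strptime(
                f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped line; ignore it
        if prev is not None and stamp - prev >= min_gap:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python3 find_gaps.py 2014 /var/log/syslog
    with open(sys.argv[2], errors="replace") as f:
        for last_seen, next_seen in find_gaps(f, int(sys.argv[1])):
            print(f"quiet from {last_seen} until {next_seen}")
```

The first timestamp of each gap is roughly when the machine stopped
logging and the second is when it came back.  On a quiet system an
idle period will look the same as a lockup, so treat the output as a
list of times worth a closer look and not as proof of a crash.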

> Is it the "delayed" logging "dashes" a cause of the no logs?

Not in my experience.  Nothing you change there will have any effect.

Instead I would monitor the hardware console if that is possible or
practical.  Does your server have remote console capability such as
LOM or iLO?  (https://en.wikipedia.org/wiki/Lights_out_management)

> Anyone have ideas about this?

There are many ways that things can fail.  Without knowing how your
system has failed it is impossible to say anything intelligent about
it.  It could be anything.  And unfortunately I have been in the same
place many times myself over the years and I don't have any great
advice for debugging it either.  But problems like this are one of the
reasons that enterprise customers feel justified in paying so much
money for high quality hardware with a support contract.  That way
they have someone to call and complain and to swap hardware until the
problem stops.

Since you have swapped the hardware and still have the problem I would
assume it is a software issue and not a hardware issue.  I would try
an older kernel.  Instead of the newest 3.14.5-1 I would try the 3.2.0
kernel from Wheezy.  Or I would try the still supported 2.6.32 kernel
from Squeeze.  Recent upstream Linux development has brought large
changes in what hardware is well supported.  The older kernels might
work better.  Unfortunately if they do then you are still faced with
the problem of how to upgrade once support for those older kernels
expires.  But that may be better than the alternative.

You said you migrated from RHEL/CentOS.  Was that on this same
hardware or different hardware?  If the same hardware then I would
suspect that the newer kernels are the problem.

Another thing to think about: I know some people run very slim dom0
systems on the bare hardware and then devote the rest of the system to
the domU guest user system.  Basically they use virtualization to
create an insulating layer around everything.  I am just throwing that
out there as a brainstorm idea.  I would consider it if the problem
were a software one on otherwise known good hardware.  However it
would be a large paradigm shift and not a trivially easy thing to
switch on underneath your existing system.

Bob
