
Random reboots, maybe related to BMC watchdog?



Hello,

We operate a server running wheezy with a Supermicro BMC¹ and are
experiencing random reboots.

¹) http://www.supermicro.com/products/motherboard/Xeon/C600/X9DRW-iF.cfm

I think I traced them back to the BMC watchdog, which we have
enabled:

  ipmitool> bmc watchdog get
  Watchdog Timer Use:     SMS/OS (0x44)
  Watchdog Timer Is:      Started/Running
  Watchdog Timer Actions: Hard Reset (0x01)
  Pre-timeout interval:   0 seconds
  Timer Expiration Flags: 0x00
  Initial Countdown:      900 sec
  Present Countdown:      899 sec

On the system, freeipmi-bmc-watchdog 1.1.5-3 is running.

In debug mode, a successful run looks like this:

  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 Get Watchdog Timer Request
  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 [              25h] = cmd[ 8b]
  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 Get Watchdog Timer Request
  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 [              25h] = cmd[ 8b]
  Oct 09 05:28:04 [               0h] = comp_code[ 8b]
  Oct 09 05:28:04 [               4h] = timer_use[ 3b]
  Oct 09 05:28:04 [               0h] = reserved1[ 3b]
  Oct 09 05:28:04 [               1h] = timer_state[ 1b]
  Oct 09 05:28:04 [               0h] = log[ 1b]
  Oct 09 05:28:04 [               1h] = timeout_action[ 3b]
  Oct 09 05:28:04 [               0h] = reserved2[ 1b]
  Oct 09 05:28:04 [               0h] = pre_timeout_interrupt[ 3b]
  Oct 09 05:28:04 [               0h] = reserved3[ 1b]
  Oct 09 05:28:04 [               0h] = pre_timeout_interval[ 8b]
  Oct 09 05:28:04 [               0h] = reserved4[ 1b]
  Oct 09 05:28:04 [               0h] = timer_use_expiration_flag.bios_frb2[ 1b]
  Oct 09 05:28:04 [               0h] = timer_use_expiration_flag.bios_post[ 1b]
  Oct 09 05:28:04 [               0h] = timer_use_expiration_flag.os_load[ 1b]
  Oct 09 05:28:04 [               0h] = timer_use_expiration_flag.sms_os[ 1b]
  Oct 09 05:28:04 [               0h] = timer_use_expiration_flag.oem[ 1b]
  Oct 09 05:28:04 [               0h] = reserved5[ 1b]
  Oct 09 05:28:04 [               0h] = reserved6[ 1b]
  Oct 09 05:28:04 [            2328h] = initial_countdown_value[16b]
  Oct 09 05:28:04 [            20D2h] = present_countdown_value[16b]
  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 Reset Watchdog Timer Request
  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 [              22h] = cmd[ 8b]
  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 Reset Watchdog Timer Request
  Oct 09 05:28:04 =====================================================
  Oct 09 05:28:04 [              22h] = cmd[ 8b]
  Oct 09 05:28:04 [               0h] = comp_code[ 8b]
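
For reference, the countdown fields in this dump are in units of
100 ms (per the IPMI specification), so 2328h matches the 900-second
initial countdown that ipmitool reports. A minimal Python sketch of
that arithmetic follows; the helper name is made up for illustration
and is not part of freeipmi or ipmitool:

```python
# Hypothetical helper, not part of freeipmi or ipmitool.
# IPMI watchdog countdown values are expressed in 100 ms ticks.
TICKS_PER_SECOND = 10


def countdown_seconds(raw_hex: str) -> float:
    """Convert a countdown value as printed in the debug dump (e.g. '2328h') to seconds."""
    return int(raw_hex.rstrip("h"), 16) / TICKS_PER_SECOND


initial = countdown_seconds("2328h")  # initial_countdown_value
present = countdown_seconds("20D2h")  # present_countdown_value

print(f"initial: {initial:.1f} s")
print(f"present: {present:.1f} s")
print(f"elapsed since last reset: {initial - present:.1f} s")
```

The elapsed time of roughly one minute is consistent with the
one-minute spacing of the log timestamps, so the daemon appears to
reset the timer well within the countdown.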

Every now and then, the following will happen:

  Oct 09 05:29:04 =====================================================
  Oct 09 05:29:04 Get Watchdog Timer Request
  Oct 09 05:29:04 =====================================================
  Oct 09 05:29:04 [              25h] = cmd[ 8b]
  Oct 09 05:29:06 =====================================================
  Oct 09 05:29:06 Get Watchdog Timer Request
  Oct 09 05:29:06 =====================================================
  Oct 09 05:29:06 [              25h] = cmd[ 8b]
  Oct 09 05:29:06 [               0h] = comp_code[ 8b]
  Oct 09 05:29:06 [               3h] = timer_use[ 3b]
  Oct 09 05:29:06 [               7h] = reserved1[ 3b]
  Oct 09 05:29:06 [               0h] = timer_state[ 1b]
  [Oct 09 05:29:06]: _get_watchdog_timer_cmd: fiid_obj_get: 'present_countdown_value': data not available
  [Oct 09 05:29:06]: timer stopped by another process
  [Oct 09 05:29:06]: stopping bmc-watchdog daemon
  Oct 09 05:29:06 [               1h] = log[ 1b]

And then the machine reboots after the timer expires.
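
Packing the bit fields from the two dumps above back into the single
"timer use" byte makes the difference easier to see. The sketch below
assumes the fields are listed least-significant bit first, which is
consistent with the successful dump yielding the 0x44 that ipmitool
shows; the function name is hypothetical:

```python
# Hypothetical helper for illustration only.
# Assumption: fields appear in the dump least-significant bit first
# (timer_use: bits 0-2, reserved1: bits 3-5, timer_state: bit 6, log: bit 7).
def pack_timer_use_byte(timer_use: int, reserved1: int, timer_state: int, log: int) -> int:
    return (
        (timer_use & 0x7)
        | ((reserved1 & 0x7) << 3)
        | ((timer_state & 0x1) << 6)
        | ((log & 0x1) << 7)
    )


# Field values copied from the successful and the failing debug dumps.
good = pack_timer_use_byte(timer_use=0x4, reserved1=0x0, timer_state=0x1, log=0x0)
bad = pack_timer_use_byte(timer_use=0x3, reserved1=0x7, timer_state=0x0, log=0x1)

print(f"good: 0x{good:02X}")
print(f"bad:  0x{bad:02X}")
print("bad is the bitwise complement of good:", bad == (~good & 0xFF))
```

Under that assumption, the failing response has all three reserved
bits set, and the byte is the exact bitwise complement of the good
one. That might hint at a corrupted response rather than a timer that
was genuinely stopped, but this is only a guess.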

We've worked with Supermicro and the vendor, replaced the mainboard,
and tried all the different firmware and BIOS versions, but the
problem persists. However, this is the only such case among 533
identical systems the vendor has sold in the last 3 years. Apparently,
I am the only one using Debian.

Do you have any idea what this could be and, more importantly, how
I could address it? I'd like to keep the watchdog functionality,
but as it stands I have to turn it off, of course, unless I find
a cure.

If asking here yields no result, I will take this to the freeipmi
people…

Any input appreciated!

Thanks,

-- 
 .''`.   martin f. krafft <madduck@d.o> @martinkrafft
: :'  :  proud Debian developer
`. `'`   http://people.debian.org/~madduck
  `-  Debian - when you have better things to do than fixing systems
 
in africa some of the native tribes have a custom of beating the
ground with clubs and uttering spine chilling cries. anthropologists
call this a form of primitive self-expression. in america they call
it golf.


