
Re: Netra T1 200 watchdog timeouts



Richard Mortimer wrote:
On 18/09/2012 18:49, Mark Morgan Lloyd wrote:
Richard Mortimer wrote:
Hi Mark,

On 18/09/2012 14:36, Mark Morgan Lloyd wrote:
If I install either the current Wheezy/testing or Lenny on a Netra T1
200 with LOM (lomlite2) at 3.10, the first time OBP issues a boot
command I get

[string of hex]
Watchdog Reset
Externally Initiated Reset

I have a feeling that this is not a LOM watchdog reset but more a
SPARC processor watchdog reset (the processor running out of trap
levels in memory fault/interrupt processing).

You should be able to verify if it is a LOM watchdog reset by running
the "loghistory" command at the lom prompt.

No watchdog events shown, only power on/off and reset events (plus a
'LOM booted' near the start).
Good. I think that means that the LOM is definitely not involved in this problem.
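
The check described above (look through the "loghistory" output for watchdog events) can be sketched as a small script run over a saved copy of the console session. This is only an illustration: the function name and the sample lines are hypothetical, the sample is made up rather than real lomlite2 output, and the real log format may differ.

```python
def watchdog_events(log_text):
    """Return the lines of a captured LOM event log that mention a watchdog."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if "watchdog" in line.lower()
    ]


# Made-up sample text, NOT real lomlite2 output; the real format may differ.
sample = """\
LOM booted
host power on
host reset
host power off
"""

events = watchdog_events(sample)
if events:
    print("LOM log mentions a watchdog:")
    for event in events:
        print("  " + event)
else:
    print("No watchdog events in the LOM log")
```

An empty result only says that the LOM did not log a watchdog event; it says nothing about a processor-level watchdog reset, which the LOM would not record as such.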


If I subsequently issue a second boot the system runs as expected.
If I'm correct then this is probably due to something like retained
memory (not cleared during a soft reset/reboot, only during a
power cycle). That would explain why the second boot after the
Watchdog/XIR works fine.

But this also happens after a (soft) power-on, irrespective of whether
power has been physically removed (i.e. IEC connector pulled out of back
and left for a few minutes).

The IEC connector isn't really relevant to this: the LOM controls the power to the main CPU/circuit board. Actually, thinking about it, I think a hard reset (typing reset at the LOM prompt), a CPU watchdog reset and a power off/on will all cause full (power-on) reset processing to occur.

But given that you said it happens after a (soft) power-on, maybe it isn't relevant anyway.

This affects both Lenny and Wheezy but does not affect Squeeze, i.e. it
appears to be a regression. Since this happens in between the OBP boot
command and SILO's boot prompt, I presume that it is a SILO problem or
that the installer is doing something odd to the disklabel.

Lenny:   1.4.13
Squeeze: 1.4.14
Wheezy:  1.4.14

I don't see how the LOM firmware would affect this. OBP maybe, but if
it is a processor watchdog then I doubt it's the LOM. SILO would be my
first suspect.

SILO is also my suspect (after a lot of fiddling trying to disable the LOM
watchdog from OBP etc.) and those are SILO version numbers :-/

Brain wasn't turned on enough to realise that!

From memory I don't think the LOM watchdog is ever enabled in OBP on the T1 200. It only ever gets enabled by the device drivers once Solaris is running (if the packages you mention below are installed of course).

OK but at the same time the README from Solaris patch 110208-21 explicitly says

5043823 Patch 110208-18 changes watchdog behavior and causes watchdog resets when probed

and

4412177  lomlite2 watchdog is not always disabled on "reboot" - 110208-07

both of which read as though there could be spurious watchdog events even without Solaris's intervention. However, I note your point about the LOM log not showing anything.

Should I be raising this as a bug, or can I assume that the people who need to know about it are already aware of the issue?

The correct way of fixing this is probably to upgrade the LOM firmware
to 3.14. However this requires Solaris, and before the patch can be
installed it requires that the appropriate packages be installed:

"To use LOM commands you must install the Lights Out Management 2.0
packages (SUNWlomu, SUNWlomr and SUNWlomm) from the Solaris
Supplementary CD."
http://docs.oracle.com/cd/E19102-01/n1280.srvr/819-1269-11/poweron.html
The problem is that I don't believe that the supplementary CD is freely
available, which in practice means that this course is not available to
most Linux users.

I'm hoping there's enough detail in here that it shows up on Google; it
might save people work in the future.

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

