
Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?



Stefan Monnier wrote: 
> > manufacturers in different memory banks, but since it's always
> > possible to power down, replace or just remove memory, and power
> > up again,
> 
> Hmm... "always"?  What about long running computations like that
> simulation (or LLM training) launched a month ago and that's expected to
> finish in another month or so?

If the job is that big, it's being run on multiple machines. This
machine's current chunk is corrupt, so you can't use it anyway.
The orchestrator stops using this machine, and someone comes in to
replace the RAM. Later the machine is re-added to the pool.
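
To make the drain/re-add cycle concrete, here is a minimal sketch of
the bookkeeping. It is not any real orchestrator's API: the Pool
class, its method names and the least-loaded assignment rule are all
made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical orchestrator state: which machine holds which chunk."""
    active: set = field(default_factory=set)
    drained: set = field(default_factory=set)
    assignment: dict = field(default_factory=dict)  # chunk id -> machine

    def assign(self, chunk):
        """Give a chunk to the least-loaded active machine (ties: name order)."""
        if not self.active:
            raise RuntimeError("no active machines")
        load = {m: 0 for m in self.active}
        for m in self.assignment.values():
            if m in load:
                load[m] += 1
        target = min(sorted(load), key=load.get)
        self.assignment[chunk] = target
        return target

    def drain(self, machine):
        """Stop using a machine; return the chunks that have to be redone."""
        self.active.discard(machine)
        self.drained.add(machine)
        lost = sorted(c for c, m in self.assignment.items() if m == machine)
        for c in lost:
            del self.assignment[c]
        return lost

    def readd(self, machine):
        """Put a repaired machine back into the pool."""
        self.drained.discard(machine)
        self.active.add(machine)
```

Usage would be roughly: call drain("node7") when the errors show up,
pass each returned chunk to assign() so it is redone elsewhere, and
call readd("node7") once the RAM has been swapped.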


> Some mainframes have supported hot (un)plugging RAM modules as well and
> I wouldn't be surprised if some x86 servers also support it nowadays.

https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html

That said, you won't get hardware that supports this unless you
specify it when you buy the machine, and very few people have a use
case for it.
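
If you want to see what the kernel exposes on a given box, the page
above documents a sysfs interface under /sys/devices/system/memory:
one memoryN directory per memory block, each with a "state" file,
plus block_size_bytes (in hex). A rough sketch that only reads those
files; the function name is my own, and the directory will not exist
on kernels built without memory hotplug support:

```python
from pathlib import Path


def memory_blocks(sysfs_root="/sys/devices/system/memory"):
    """Return (block size in bytes, {block number: state string})."""
    root = Path(sysfs_root)
    # block_size_bytes is written in hexadecimal, without a 0x prefix.
    size = int((root / "block_size_bytes").read_text().strip(), 16)
    states = {}
    for d in root.glob("memory[0-9]*"):
        state_file = d / "state"
        if state_file.is_file():
            block = int(d.name[len("memory"):])
            states[block] = state_file.read_text().strip()
    return size, dict(sorted(states.items()))


if __name__ == "__main__":
    size, states = memory_blocks()
    for block, state in states.items():
        print(f"memory{block}: {state} ({size // 2**20} MiB)")
```

Taking a block offline means writing "offline" to its state file as
root, and the kernel may refuse if it cannot migrate the pages away.
Physically pulling a DIMM still needs the platform support mentioned
above.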

-dsr-

