On Thu, Jan 11, 2024 at 03:25:51PM -0500, Stefan Monnier wrote:
manufacturers in different memory banks, but since it's always possible to power down, replace or just remove memory, and power up again,

Hmm... "always"? What about long-running computations like that simulation (or LLM training) launched a month ago and that's expected to finish in another month or so?
I'd expect something like that to have a checkpoint/restart capability if not starting over actually matters.
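To make "checkpoint/restart" concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function names, the JSON state file, and the toy "simulation" (summing integers) are all hypothetical, and a real simulation or training job would more likely rely on its framework's own checkpointing. The one detail worth noting is writing to a temporary file and renaming it, so that a power-down mid-write never leaves a truncated checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before the rename
    os.replace(tmp, path)  # readers see either the old or the new checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, every=100, stop_at=None):
    """Toy long-running job: sum the integers below n_steps.

    State is saved every `every` steps. `stop_at` (hypothetical, for
    demonstration only) simulates a power-down when that step is reached.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] == stop_at:
            return None  # simulated interruption; work since last save is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

With something like this in place, an interrupted job loses at most `every` steps of work: calling `run()` again with the same path picks up from the last saved step rather than from zero.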
Some mainframes have supported hot (un)plugging RAM modules as well.
Yes, mainframes have been engineered that way for a long time. It makes them very expensive, and their market share has been declining for decades because most problems can be solved more cheaply in software (even while maintaining high availability). Hot *spare* memory is relatively common, as it solves most problems without the complexity of hot *swapping*, at the (generally low) cost of having to schedule downtime at some point in the future to actually replace the failed module.