
checkpoint-restart (was Re: suspending programs)



On Tue, 2005-04-19 at 03:32 -0700, Alvin Oga wrote:
> hi ya
> 
> On Tue, 19 Apr 2005, roberto wrote:
> > Hello,
> >  I usually run large simulations, and sometimes the power goes down
> > suddenly; I don't know why.
> 
> >  I'd like to restart my programs at exactly the point where they were
> > suspended.
> 
> are you saving the state of the simulation ?? if not ... it's impossible to 
> continue from where you crashed
> 

I've done a quick apt-cache search (sarge apt sources) for something
like IRIX's `cpr`:

     IRIX Checkpoint and Restart (CPR) offers a set of user-transparent
     software management tools, allowing system administrators,
     operators, and users with suitable privileges to suspend a job or a
     set of jobs in mid-execution, and restart them later on.  The jobs
     may be running on a single machine or on an array of networking
     connected machines.  CPR may be used to enhance system availability,
     provide load and resource control or balancing, and to facilitate
     simulation or modeling.

which he could use, e.g. by checkpointing every N hours and then
restarting once the power is back on. The checkpoint in this case writes
a memory image to disk; the other alternative is for Roberto to do the
checkpointing within the simulation (e.g. write the state variables to a
file) and knock up a quick restart routine. For any decent/large
simulation code, I'd guess these already exist...
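
A minimal sketch of that second option (checkpointing inside the
simulation), in Python purely for illustration. The file name, the state
layout (`step`, `x`) and the one-line "physics" are all made up; a real
code would dump its own state variables in whatever format suits it. The
one detail worth copying is writing to a temporary file and renaming it,
so a power cut in the middle of a write does not clobber the previous
good checkpoint.

```python
import json
import os
import tempfile

CHECKPOINT = "sim_checkpoint.json"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk
    # Rename over the old file: the previous checkpoint stays in place
    # until the new one has been written completely.
    os.replace(tmp, path)


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or None if there is no checkpoint yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def run(total_steps, checkpoint_every, path=CHECKPOINT):
    # Restart routine: pick up from the last checkpoint if one exists,
    # otherwise start from the (made-up) initial state.
    state = load_checkpoint(path) or {"step": 0, "x": 0.0}
    while state["step"] < total_steps:
        state["x"] += 0.5  # stand-in for one real simulation step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final state
    return state
```

Calling `run(total_steps, checkpoint_every)` again after a crash resumes
from the last saved step rather than from zero; at most
`checkpoint_every` steps of work are lost.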

It's hugely unlikely you'd be able to restart at the exact same place
(you'd resume from the last checkpoint, not from the moment the power
went), but cpr would allow you to not lose everything. Don't forget
there's a large cost associated with writing lots of data to disk.
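
To get a rough feel for that cost, here is a back-of-envelope estimate,
again in Python. The numbers (2 GB image, 50 MB/s disk, hourly
checkpoints) are purely hypothetical, and it assumes the simulation
stalls while the checkpoint is being written.

```python
def checkpoint_overhead(state_bytes, disk_bytes_per_s, interval_s):
    """Fraction of wall-clock time spent writing checkpoints.

    Assumes the job computes for interval_s seconds, then stops and
    writes state_bytes to disk at a sustained disk_bytes_per_s.
    """
    write_s = state_bytes / disk_bytes_per_s
    return write_s / (interval_s + write_s)


# Hypothetical numbers: 2 GB memory image, 50 MB/s disk, one checkpoint per hour.
frac = checkpoint_overhead(2e9, 50e6, 3600)
print(f"{frac:.1%} of run time spent checkpointing")
```

Shortening the interval loses less work per crash but pushes this
fraction up, so N is a trade-off between the two.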
-- 
Michael Bane
Atmospheric Physics Group
University of Manchester


