Re: Toc Toc... Someone's here...
On Wed, Oct 27, 1999 at 09:21:56PM -0400, Greg Johnson wrote:
> Does anyone know anything about GNQS (Generic NQS)? I would like to use
> it as a replacement for DQS (since DQS is non-free and GNQS is GPL'd),
> but I found it confusing to set up. I'm also not sure if it can cope
> with MPI/PVM jobs.
> In my opinion, we are really lacking a good free (DFSG) queuing system.
> Perhaps I should look at GNU queue again. Last time I looked at it, it
> wasn't really usable.
I'm not sure there is a good free queueing system. When last I looked,
GNQS seemed to have strong support for parallel jobs on single-system-image
machines (SMPs, NUMAs, etc.) but didn't handle parallel clusters well.
From the mailing lists, the GNQS code sounded fairly clean, so it might be easy
to add the necessary features. If you or someone else packages GNQS, I'll modify
DQS to use alternatives for the POSIX q* commands. Until I make the mods,
just conflict with dqs and put the q* commands in the new package. It
doesn't make much sense to have multiple batch queueing systems on one box
anyway. DQS under Debian uses ports 610, 611, and 612. I have no objection
to sharing them with other batch systems that don't have IANA-assigned ports
either (since both shouldn't be running at once anyway).
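Since two batch systems sharing these ports must never run at the same
time, a package's startup script could check the ports before launching
its daemon. What follows is only a rough sketch of such a check, not
anything DQS or GNQS ships. It assumes the daemons listen on TCP (I only
said which port numbers are used), and the names port_in_use and
busy_ports are made up for illustration.

```python
import socket

# Port numbers DQS uses under Debian, as stated above.
# Assumption: the daemons listen on TCP; adjust if they use UDP.
DQS_PORTS = (610, 611, 612)


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0


def busy_ports(ports=DQS_PORTS, host="127.0.0.1"):
    """Return the subset of ports that already have a listener."""
    return [p for p in ports if port_in_use(p, host)]


if __name__ == "__main__":
    taken = busy_ports()
    if taken:
        print("another batch system seems to be running on ports:", taken)
    else:
        print("ports are free")
```

A startup script could refuse to launch its own daemon whenever
busy_ports() returns a non-empty list.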
PBS has finally been released under a BSD-with-advertising-clause
license; see pbs.mrj.com. I've never used it, but since NASA paid a lot of money to
replace NQS with PBS presumably it's an improvement. Definitely worth a
GNU queue isn't remotely POSIX, but that may not be a bad thing. It
supports interactive jobs, unlike the POSIX batch queueing systems (DQS,
*NQS). I've never been fond of qsub and its relatives. OTOH queue seemed
to have no scheduling or accounting systems to speak of. I'm unsure how it
would cope with a parallel job that spawned lots of children; they'd
probably escape the queue and run unfettered by the non-existent scheduling
PVM is going to be a problem for any clustering system due to overlap
between PVM's "virtual machine" and the queueing system's view of a cluster.
DQS tries (not very hard) to set up and take down a virtual machine to run
each job in, but if you allow multiple jobs on any node (SMPs, for
instance) there will certainly be problems. If you run interactive PVM jobs
as well, DQS may find itself unable to set up the virtual machine at all, and
it may try to take down your interactive VM when the job ends. To fix this
properly, PVM needs to be redesigned. I have the impression that ORNL is doing
just this with their next-generation projects.
On stability, DQS-3.1.8 had big memory leaks in the master daemon. 3.2.7
has few leaks (none that I've noticed), though I still restart the master
daemon daily just to be safe. In a big cluster you might need to restart
qmaster more often. Look at /etc/cron.daily/dqs.