Elisabetta-
Start by focusing on slurmctld. Slurmd not happy without it.
Start it manually in the foreground as in
/usr/sbin/slurmctld -d -vvv
This assumes slurmd,conf is in default location.
Pardon brevity; on my phone
Jenny Williams
Sent from Nine
From: Elisabetta Falivene <e.falivene@ilabroma.com>
Sent: Monday, January 15, 2018 7:14 AM
To: Slurm User Community List
Subject: [slurm-users] Slurm not starting
I did an upgrade from wheezy to jessie (automatically with a normal dist-upgrade) on a cluster with 8 nodes (up, running and reachable) and from slurm 2.3.4 to 14.03.9. Overcame some problems booting kernel (thank you vey much to Gennaro Oliva, btw), now the system is running correctly with kernel 3.16.0.4, but slurm isn't starting. I tried restarting services, but it seems it isn't able to do it.
Error messages are not much helping me in guessing what is going on. What should I check to get what is failing?
Thank youElisabetta
PS: Here it is some tests I did
Runningsinfo
returns
PARTITION AVAIL TIMELIMIT NODES STATE NODELISTbatch* up infinite 8 unk* node[01-08]
Runningsystemctl status slurmctld.service
returns
slurmctld.service - Slurm controller daemonLoaded: loaded (/lib/systemd/system/slurmctld.service; enabled) Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s agoProcess: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
slurmctld[2100]: cons_res: select_p_reconfigureslurmctld[2100]: cons_res: select_p_node_initslurmctld[2100]: cons_res: preparing for 1 partitionsslurmctld[2100]: Running as primary controllerslurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,ma x_sched_time=4,partition_job_ depth=0 slurmctld.service start operation timed out. Terminating.Terminate signal (SIGINT or SIGTERM) receivedslurmctld[2100]: Saving all slurm stateFailed to start Slurm controller daemon.Unit slurmctld.service entered failed state.
and running
/etc/init.d/slurmd status
returns
slurmd.service - Slurm node daemonLoaded: loaded (/lib/systemd/system/slurmd.service; enabled) Active: failed (Result: exit-code) since Mon 2018-01-15 12:44:52 CET; 21min agoProcess: 729 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
slurmd.service: control process exited, code=exited status=1systemd[1]: Failed to start Slurm node daemon.Unit slurmd.service entered failed state.