
Re: [slurm-users] Slurm not starting



> check slurmctld.log and slurmd.log, you can find them under
> /var/log/slurm-llnl

Unfortunately I don't have these files. Even 'locate' can't find them.
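
When the log files are not where expected, one thing worth checking is which log paths slurm.conf actually sets. The sketch below assumes the standard SlurmctldLogFile and SlurmdLogFile keys; the function name slurm_log_paths is only illustrative. If neither key is set, the daemons normally log to syslog instead of a file.

```shell
# Print the uncommented log-file settings from a slurm.conf-style file.
# slurm_log_paths is a hypothetical helper name, not part of Slurm.
slurm_log_paths() {
    conf="$1"
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    # Match SlurmctldLogFile= / SlurmdLogFile= lines, ignoring case and comments.
    grep -iE '^[[:space:]]*(SlurmctldLogFile|SlurmdLogFile)=' "$conf" \
        || echo "no log file set in $conf (daemons probably log to syslog)"
}
```

For example: slurm_log_paths /etc/slurm-llnl/slurm.conf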


> Do you have a backup controller?
> Check your slurm.conf under:
> /etc/slurm-llnl

It seems not:
#BackupController=
#BackupAddr=


> /usr/sbin/slurmctld -d -vvv
file not found :(
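
A "file not found" here may just mean the binary lives somewhere else, so a quick lookup can help before trying a foreground run. This is a minimal sketch; find_daemon is a hypothetical helper and the directory list is an assumption. Note that in the Slurm versions I know, the foreground option of slurmctld is the uppercase -D, so the lowercase -d in the quoted command may be a typo for it.

```shell
# Locate a daemon binary by name: first via PATH, then in common sbin dirs.
# find_daemon is a hypothetical helper; the directory list is an assumption.
find_daemon() {
    name="$1"
    if command -v "$name" 2>/dev/null; then
        return 0
    fi
    for dir in /usr/sbin /usr/local/sbin /sbin; do
        if [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    echo "$name not found" >&2
    return 1
}

# Possible use (as root): run the controller in the foreground, verbosely.
#   bin=$(find_daemon slurmctld) && "$bin" -D -vvv
```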

I can try to start it with
 /etc/init.d/slurmctld restart

but I get

[....] Restarting slurmctld (via systemctl): slurmctld.service
Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.
 failed!

journalctl -xn

-- Logs begin at Mon 2018-01-15 12:44:48 CET, end at Mon 2018-01-15 15:23:31 CET. --
Jan 15 15:23:00 anyone.phys.uniroma1.it slurmctld[5133]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 dhcpd[1550]: DHCPREQUEST for 192.168.1.106 from 10:bf:48:1a:02:5e via eth1
 dhcpd[1550]: DHCPACK on 192.168.1.106 to 10:bf:48:1a:02:5e via eth1
 dhcpd[1550]: DHCPREQUEST for 192.168.1.104 from bc:ae:c5:12:97:3b via eth1
 dhcpd[1550]: DHCPACK on 192.168.1.104 to bc:ae:c5:12:97:3b via eth1
 systemd[1]: slurmctld.service start operation timed out. Terminating.
 slurmctld[5133]: Terminate signal (SIGINT or SIGTERM) received
 slurmctld[5133]: Saving all slurm state
 systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit slurmctld.service has failed.
-- 
-- The result is failed.
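
The log above shows slurmctld coming up and then systemd timing out the start operation. One common cause of that pattern (an assumption here, not confirmed for this cluster) is a forking service whose PIDFile= in the unit file differs from SlurmctldPidFile in slurm.conf, so systemd never sees the PID file it is waiting for. A rough check, with pidfiles_match as a hypothetical helper name:

```shell
# Compare PIDFile= in a systemd unit with SlurmctldPidFile= in slurm.conf.
# Returns 0 only if both are set and identical. Matching is case-sensitive,
# which is a simplification (slurm.conf keys are case-insensitive).
pidfiles_match() {
    unit="$1"
    conf="$2"
    unit_pid=$(sed -n 's/^[[:space:]]*PIDFile=//p' "$unit" | tail -n 1)
    conf_pid=$(sed -n 's/^[[:space:]]*SlurmctldPidFile=//p' "$conf" | tail -n 1)
    echo "unit:       ${unit_pid:-<unset>}"
    echo "slurm.conf: ${conf_pid:-<unset>}"
    [ -n "$unit_pid" ] && [ "$unit_pid" = "$conf_pid" ]
}
```

For example: pidfiles_match /lib/systemd/system/slurmctld.service /etc/slurm-llnl/slurm.conf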



2018-01-15 14:16 GMT+01:00 Williams, Jenny Avis <jennyw@email.unc.edu>:
Elisabetta-

Start by focusing on slurmctld. Slurmd not happy without it.
Start it manually in the foreground as in
/usr/sbin/slurmctld -d -vvv

This assumes slurm.conf is in the default location.
Pardon brevity; on my phone
Jenny Williams


Sent from Nine

From: Elisabetta Falivene <e.falivene@ilabroma.com>
Sent: Monday, January 15, 2018 7:14 AM
To: Slurm User Community List
Subject: [slurm-users] Slurm not starting

I upgraded a cluster with 8 nodes (up, running and reachable) from wheezy to jessie (automatically, with a normal dist-upgrade), which also took slurm from 2.3.4 to 14.03.9. After overcoming some problems booting the kernel (thank you very much to Gennaro Oliva, btw), the system now runs correctly with kernel 3.16.0.4, but slurm isn't starting. I tried restarting the services, but they fail to start.

The error messages are not helping me much in working out what is going on. What should I check to find out what is failing?

Thank you 
Elisabetta

PS: Here are some tests I ran

Running  
sinfo

returns

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up   infinite      8   unk* node[01-08]
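
In sinfo output, a trailing * on the node state marks nodes that are not responding, so "unk*" reads as "unknown, not responding". A small filter for that, assuming the default column layout shown above (not_responding is a hypothetical helper name):

```shell
# Read default-format sinfo output on stdin and print "NODELIST STATE"
# for every row whose state carries the trailing '*' (not responding).
# Assumes the default column order: STATE is field 5, NODELIST field 6.
not_responding() {
    awk 'NR > 1 && $5 ~ /\*$/ { print $6, $5 }'
}
```

For example: sinfo | not_responding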


Running 
systemctl status slurmctld.service

returns 

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
  Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

 slurmctld[2100]: cons_res: select_p_reconfigure
 slurmctld[2100]: cons_res: select_p_node_init
 slurmctld[2100]: cons_res: preparing for 1 partitions
 slurmctld[2100]: Running as primary controller
 slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 slurmctld.service start operation timed out. Terminating.
Terminate signal (SIGINT or SIGTERM) received
 slurmctld[2100]: Saving all slurm state
 Failed to start Slurm controller daemon.
 Unit slurmctld.service entered failed state.

and running

/etc/init.d/slurmd status

returns

slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled)
   Active: failed (Result: exit-code) since Mon 2018-01-15 12:44:52 CET; 21min ago
  Process: 729 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

slurmd.service: control process exited, code=exited status=1
systemd[1]: Failed to start Slurm node daemon.
Unit slurmd.service entered failed state.
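
The two status outputs above fail in different ways: slurmctld with Result: timeout, slurmd with Result: exit-code. A tiny filter can pull that field out of saved status text when comparing several nodes. It is only a sketch; unit_result is a hypothetical helper and it assumes the "Active: failed (Result: ...)" wording shown above.

```shell
# Read 'systemctl status' text on stdin and print the failure result
# (e.g. "timeout" or "exit-code") from the "Active: failed (...)" line.
unit_result() {
    sed -n 's/^[[:space:]]*Active: failed (Result: \([^)]*\)).*/\1/p'
}
```

For example: systemctl status slurmd.service | unit_result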



