Last modified: 1st November 2017
To run many jobs automatically (e.g. running ADMIXTURE for different values of K without launching each one manually), you will want to use a job scheduling system, and for that you need to have Slurm installed.
I recommend that you install and configure this software first, although if it does not work you can keep installing the other packages (you will only need it for automated jobs, once you have a grasp of the processes involved).
sudo apt-get install munge
sudo apt-get install slurm-llnl
(Usually, all required dependencies are installed automatically.)
Then it needs to be configured. Configurator files are found in
/usr/share/doc/slurm-ctld/
or possibly, depending on your installation, in
/usr/share/doc/slurm-llnl/
They are called
slurm-wlm-configurator.easy.html
slurm-wlm-configurator.html
Or
slurm-llnl-configurator.easy.html
slurm-llnl-configurator.html
The easy configurator is usually enough for basic needs.
Open one in your web browser and fill in the fields as required. In my case, I needed to enter my hostname (you can find yours with the command hostname -s) for ControlMachine, NodeName, and Nodes, and set the number of processors to 2.
The final lines of the slurm.conf file should look something like this (replace hostname with your own hostname):
# COMPUTE NODES
NodeName=hostname CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=hostname Default=YES MaxTime=INFINITE State=UP
NOTE. In my experience, installing Slurm without much idea of how it works can be very tricky, so you are better off sticking to what is known to work. Some advice, from my (limited) experience and the instructions at https://slurm.schedmd.com/quickstart_admin.html:
– Do not add the full domain name. If you have a computer named ubuntu.linux.net, just put ubuntu.
– Because virtual machines sometimes mix up the localhost IP and your hostname IP (127.0.0.1 vs. 127.0.1.1)*, I chose to name ControlMachine and NodeName with my hostname instead of localhost. You might need to change your hostname to 127.0.0.1, but I wasn’t able to tweak these parameters without errors.
* The reason is documented in the Debian manual here: http://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution. Ultimately, it is a bug workaround; the original report is here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=316099
– I was tempted to leave out the number of CPUs, so that it would depend dynamically on the number of cores I selected for the virtual machine, but it didn’t work as expected. So I recommend setting a specific number of CPUs or Procs (CPUs=2 or Procs=2 in my case).
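Following the advice above (short hostname, explicit CPU count), the compute node lines could also be generated rather than typed by hand. This is only a minimal sketch: short_name and node_lines are hypothetical helper names, not part of Slurm, and the partition settings are simply copied from the example lines earlier in this post.

```shell
#!/bin/sh
# Sketch: print the COMPUTE NODES section of slurm.conf for a single machine.
# short_name and node_lines are hypothetical helpers, not Slurm commands.

# Strip any domain part, so ubuntu.linux.net becomes ubuntu.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Print the section for the given node name and CPU count.
node_lines() {
    node=$(short_name "$1")
    cpus=$2
    printf '# COMPUTE NODES\n'
    printf 'NodeName=%s CPUs=%s State=UNKNOWN\n' "$node" "$cpus"
    printf 'PartitionName=debug Nodes=%s Default=YES MaxTime=INFINITE State=UP\n' "$node"
}

# Use this machine's short hostname and its current number of processors.
# The fallbacks (uname -n, getconf) are only there in case hostname or nproc
# are not installed.
node_lines "$(hostname -s 2>/dev/null || uname -n)" \
           "$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
```

The CPU count is written into the file as a fixed number at the moment the script runs, so if you later change the number of cores of the virtual machine you would need to regenerate or edit these lines.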
Press submit, and copy the output text into a new text file called slurm.conf, which needs to be saved (as root) in the default Slurm directory:
sudo cp slurm.conf /etc/slurm-llnl/
You should probably start (or restart) Slurm now. You might want to simply restart your system. Otherwise, try:
scontrol ping
which will probably show that the Slurm services are down. Start them:
sudo slurmctld start
or
sudo /etc/init.d/slurmd start
or
sudo /etc/init.d/slurm start
Now you should be able to use sbatch commands. Try
sinfo
to see if everything is running OK.
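As an illustration of the use case mentioned at the start (one ADMIXTURE run per value of K), a batch script using a Slurm job array might look roughly like what the sketch below prints. The helper name write_admixture_job, the input file data.bed, and the range K = 2 to 5 are all placeholders of mine, and the --cv (cross-validation) flag is just one common way to call ADMIXTURE.

```shell
#!/bin/sh
# Sketch: print a Slurm batch script that runs ADMIXTURE once per value of K.
# write_admixture_job, data.bed and the K range are hypothetical placeholders.

write_admixture_job() {
    bed_file=$1
    k_min=$2
    k_max=$3
    # The escaped \$ keeps SLURM_ARRAY_TASK_ID unexpanded until the job runs.
    cat <<EOF
#!/bin/sh
#SBATCH --job-name=admixture
#SBATCH --array=${k_min}-${k_max}
#SBATCH --output=admixture_K%a.log
# Each array task receives one value of K in SLURM_ARRAY_TASK_ID.
admixture --cv ${bed_file} \$SLURM_ARRAY_TASK_ID
EOF
}

# Print the batch script for K = 2..5.
write_admixture_job data.bed 2 5
```

To use it, you would redirect the output to a file (for example admixture_array.sh) and submit that file with sbatch; Slurm then queues one task per value of K.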
Every time you submit a job, you can check whether it is still running with the command
squeue
You can kill that job if you want to stop it:
scancel X
(where X is the job number)
You can also hold and release that job:
scontrol hold X
scontrol release X
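If you do not want to look up the job number X by hand, it can be taken from the message that sbatch prints on submission. The sketch below assumes the usual single-cluster message "Submitted batch job" followed by the number; job_id_from_message is a hypothetical helper and my_job.sh is a placeholder for your own batch script.

```shell
#!/bin/sh
# Sketch: recover the job number X from sbatch's confirmation message.
# job_id_from_message is a hypothetical helper, not a Slurm command.

# sbatch normally prints "Submitted batch job <number>"; keep the last word.
job_id_from_message() {
    printf '%s\n' "${1##* }"
}

# Only talk to Slurm if it is installed; my_job.sh is a placeholder script.
if command -v sbatch >/dev/null 2>&1 && [ -f my_job.sh ]; then
    job=$(job_id_from_message "$(sbatch my_job.sh)")
    squeue -j "$job"         # show only this job
    scontrol hold "$job"     # keep it from starting while it is pending
    scontrol release "$job"  # allow it to be scheduled again
    scancel "$job"           # stop it for good
fi
```

sbatch also has a --parsable option that prints the job number without the surrounding text, which may be simpler if your version supports it.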
If you encounter any problems, use the following command to get a live report while you work with slurm:
sudo slurmctld -Dvvv
Or you can look into logs to see what has happened:
sudo tail -n 100 /var/log/slurm-llnl/slurmctld.log
and/or
sudo tail -n 100 /var/log/slurm-llnl/slurmd.log
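When the logs are long, it can help to look only at the lines that mention an error. The following is a rough sketch for the same two log files; recent_errors is a hypothetical helper, and matching on the word "error" is my assumption about what is worth looking at first.

```shell
#!/bin/sh
# Sketch: show only the most recent error lines from the Slurm logs.
# recent_errors is a hypothetical helper, not a Slurm command.

recent_errors() {
    log_file=$1
    count=${2:-20}
    # Case-insensitive match on "error", keeping the last $count hits.
    grep -i 'error' "$log_file" | tail -n "$count"
}

for log in /var/log/slurm-llnl/slurmctld.log /var/log/slurm-llnl/slurmd.log; do
    # Skip logs that do not exist or that need sudo to read.
    if [ -r "$log" ]; then
        echo "== $log =="
        recent_errors "$log"
    fi
done
```

If the logs are only readable by root, run the script with sudo.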