HPC@Mines Using Slurm
General Slurm Commands
- sbatch - Submit a batch script to Slurm.
- squeue - View information about jobs located in the Slurm scheduling queue.
- sinfo - View information about Slurm nodes and partitions.
- scancel - Used to signal jobs or job steps that are under the control of Slurm.
- scontrol - Used to view Slurm configuration and state. (Example: @mio001[~]->scontrol show node phi001)
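As a rough illustration of how these commands fit together, the sketch below submits a script and then looks the job up. The function name submit_and_inspect is hypothetical and not part of Slurm or HPC@Mines; it assumes sbatch, squeue, and scontrol are on your PATH, as they would be on a login node.

```shell
#!/bin/bash
# submit_and_inspect is a hypothetical wrapper name, not a Slurm command.
submit_and_inspect() {
    script="$1"
    # --parsable makes sbatch print only the job ID (optionally followed by ";cluster").
    jobid="$(sbatch --parsable "$script")" || return 1
    jobid="${jobid%%;*}"
    squeue -j "$jobid" >&2            # where the job sits in the queue
    scontrol show job "$jobid" >&2    # full scheduler record for the job
    echo "$jobid"                     # hand the ID back, e.g. for scancel
}
```

A call such as "submit_and_inspect myscript" prints the job ID on standard output, which can then be passed to scancel if the job needs to be removed.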
Text only version
- Shows mapping between common PBS/Slurm/Load Leveler commands
- Slurm Documentation
HPC@Mines Specific Commands
There are Slurm-related commands that are unique to HPC@Mines.
- Information about nodes available
- Information about queue and running jobs
- A utility for getting a full list of nodes used for a job
"-help" or "-h" options are available for each of the commands
Example:
    [joeuser@mio001 utility]$ printenv SLURM_NODELIST
    compute[004-005]
    [joeuser@mio001 utility]$ ./expands $SLURM_NODELIST
    compute004 compute004 compute004 compute004 compute005 compute005 compute005 compute005
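To make the compressed node list format concrete, here is a minimal sketch of expanding it by hand. It is not the implementation of the expands utility: the function name expand_nodelist is hypothetical, it handles only a single prefix[...] group, and it prints each node once rather than repeating it as in the example above. Stock Slurm can do a similar expansion with "scontrol show hostnames".

```shell
#!/bin/bash
# expand_nodelist: hypothetical helper, NOT the HPC@Mines "expands" utility.
# Handles one bracket group such as compute[004-005,061]; prints one node per line.
expand_nodelist() {
    printf '%s\n' "$1" | awk '
    {
        lb = index($0, "[")
        if (lb == 0) { print; next }          # plain hostname, nothing to expand
        prefix = substr($0, 1, lb - 1)        # text before the bracket
        body = substr($0, lb + 1)
        sub(/\]$/, "", body)                  # drop the closing bracket
        n = split(body, parts, ",")
        for (k = 1; k <= n; k++) {
            if (split(parts[k], range, "-") == 2) {
                # Keep the zero padding of the original numbers.
                fmt = "%s%0" length(range[1]) "d\n"
                for (i = range[1] + 0; i <= range[2] + 0; i++)
                    printf fmt, prefix, i
            } else {
                print prefix parts[k]         # single node inside the brackets
            }
        }
    }'
}
```

For instance, "expand_nodelist 'compute[004-005]'" prints compute004 and compute005 on separate lines.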
Mio Specific Slurm Commands
The scheduler on Mio has partitions. If you don't care which nodes you run on, you do not need to specify a partition. If you would like to run on your group's nodes, or on the PHI or GPU nodes, you need to specify a partition. As discussed below, the command for submitting a batch job is
    sbatch script
where script is your batch script. To run in the phi partition, and thus on the phi nodes, the syntax would be
    sbatch -p phi script
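The partition can also be requested inside the batch script instead of on the command line. The following is a minimal sketch, not an official HPC@Mines template: the job name, node count, and time limit are placeholder assumptions, and report_job is a hypothetical helper that only echoes what Slurm assigned.

```shell
#!/bin/bash
#SBATCH --job-name=phi_example   # placeholder name, assumption
#SBATCH --partition=phi          # same effect as "sbatch -p phi script"
#SBATCH --nodes=1                # placeholder node count, assumption
#SBATCH --time=01:00:00          # placeholder limit, well under the 6-day maximum

# Hypothetical helper: report where the job landed.
# The fallbacks keep the script harmless if it is run outside Slurm.
report_job() {
    echo "partition: ${SLURM_JOB_PARTITION:-not set}"
    echo "nodes: ${SLURM_JOB_NODELIST:-not set}"
}

report_job
```

With the directive in the script, a plain "sbatch script" is enough; an explicit -p on the command line should still take precedence over the directive.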
As of July 14 15:51:39 MDT 2014 the following partitions are defined.
    [joeuser@mio001 ~]$ sinfo -a
    PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
    compute*   up     6-00:00:00  52     alloc  compute[032-047,061,068-083,102-111,119-122,125-129]
    compute*   up     6-00:00:00  74     idle   compute[000-005,008-031,049-052,054-060,062-067,084-101,112-118,123-124]
    phi        up     6-00:00:00  1      mix    phi001
    phi        up     6-00:00:00  1      idle   phi002
    gpu        up     6-00:00:00  3      idle   gpu[001-003]
    hkazemi    up     6-00:00:00  1      idle   compute031
    anewman    up     6-00:00:00  1      idle   compute055
    asum       up     6-00:00:00  8      idle   compute[051-052,094-099]
    cciobanu   up     6-00:00:00  3      idle   compute[054,090-091]
    cmmaupin   up     6-00:00:00  10     idle   compute[016-025]
    geco       up     6-00:00:00  6      idle   compute[084-089]
    hpc        up     6-00:00:00  2      idle   compute[004-005]
    ireimani   up     6-00:00:00  1      alloc  compute102
    jbrune     up     6-00:00:00  6      idle   compute[000-003,100-101]
    lcarr      up     6-00:00:00  2      alloc  compute[128-129]
    lcarr      up     6-00:00:00  11     idle   compute[026-030,062-067]
    mganesh    up     6-00:00:00  1      alloc  compute061
    mganesh    up     6-00:00:00  5      idle   compute[056-060]
    mooney     up     6-00:00:00  2      idle   compute[049-050]
    nsulliva   up     6-00:00:00  1      alloc  compute122
    nsulliva   up     6-00:00:00  1      idle   compute123
    pconstan   up     6-00:00:00  1      alloc  compute125
    pconstan   up     6-00:00:00  1      idle   compute124
    psava      up     6-00:00:00  44     alloc  compute[032-047,068-083,103-111,119-121]
    psava      up     6-00:00:00  7      idle   compute[112-118]
    zhiwu      up     6-00:00:00  6      idle   compute[010-015]
    mlusk      up     6-00:00:00  2      alloc  compute[126-127]
    mlusk      up     6-00:00:00  4      idle   compute[008-009,092-093]
    mgpu3      up     6-00:00:00  1      idle   gpu003
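A listing like this can be summarized before choosing a partition. The sketch below totals the idle nodes per partition from sinfo-style output on standard input; idle_nodes is a hypothetical helper name, and it assumes the six-column layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST).

```shell
#!/bin/bash
# idle_nodes: hypothetical helper that reads "sinfo -a" style output on stdin
# and prints "<partition> <idle node count>" for each partition with idle nodes.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)      # strip the "*" that marks the default partition
             idle[$1] += $4          # column 4 is the node count for this row
         }
         END { for (p in idle) print p, idle[p] }' | sort
}
```

On a login node this would be used as "sinfo -a | idle_nodes". sinfo can also filter by state itself through its --states option, which may be simpler when only one state is of interest.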
HPC@Mines Runtime Policies
The standard maximum walltime is 6 days.
If you find that you need to request an increased walltime, the official policy is:
Each request will be handled on a case-by-case basis.
HPC@Mines strongly encourages tackling larger problems by other means, rather than just extending the maximum walltime. There are two primary approaches to do this.
- Increase the amount of parallelism
By increasing the number of cores/nodes used in your job, you can often decrease the total wall time needed.
- Use checkpointing
Checkpointing is the process of saving the state of the execution, periodically or on certain events, so that it can be picked up at a later time. This is extremely helpful if you are concerned that a crash or error could cause your entire run to be lost; this way you have save points every few hours, days, etc.
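The idea can be sketched at the shell level as a loop that records its progress and resumes from the last record. This is only an illustration under simple assumptions: the file name state.ckpt, the step count, and the function run_with_checkpoints are all hypothetical, and a real application would save its own data rather than just a counter.

```shell
#!/bin/bash
# Hypothetical checkpoint file and workload size; both can be overridden.
CKPT="${CKPT:-state.ckpt}"
TOTAL_STEPS="${TOTAL_STEPS:-10}"

run_with_checkpoints() {
    local step=0
    # Resume from the last saved step if a checkpoint exists.
    if [ -f "$CKPT" ]; then
        step="$(cat "$CKPT")"
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        # ... one unit of real work would go here ...
        step=$((step + 1))
        # Write to a temporary file and rename it, so a crash in the middle
        # of a write does not leave a truncated checkpoint behind.
        printf '%s\n' "$step" > "$CKPT.tmp"
        mv "$CKPT.tmp" "$CKPT"
    done
    echo "finished at step $step"
}
```

If a job running run_with_checkpoints is stopped at the walltime limit, resubmitting the same script continues from the last completed step instead of starting over.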
If you would like help in any of these areas, the HPC@Mines team is, as always, available and willing to help you with the computing aspects of your research; you may email us at email@example.com. You may also find that first consulting members of your group, or other peers who are currently using the same code you are running, provides faster answers to your questions, since they are already familiar with your specific context.