Light cluster information
Specifications
- Head Node
- donllcluster (donllcluster.upc.edu)
- Storage Node
- Hardware
- 2 compute nodes (light: 1, brain: 1)
- head: Intel(R) Xeon(R) Gold 5218N CPU @ 2.30GHz × 2
- light: AMD EPYC 7713P 64-Core Processor @ 2.98 GHz × 64 (+ 64 hyperthreads)
- brain: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz × 20 (+ 20 hyperthreads)
- RAM: 128 GB (light and brain)
- Software
- SLURM
- Administration
- Sergi Ruiz-Barragan & Andrés Arco León (andres.arco@upc.edu)
Status
All OK.
Submit jobs to the queue system
The queue system used in our cluster is SLURM.
The queue system allows users to submit jobs (scripts) to the different nodes of the cluster.
SLURM provides user commands to submit, monitor, and manage jobs: srun, sbatch, squeue, scancel, and scontrol (among others).
Submitting Jobs
Jobs are typically submitted via batch scripts. Example:
#!/bin/bash
#SBATCH -p <name partition>
#SBATCH -N <number of nodes> (usually 1)
#SBATCH -c <number of CPU>
#SBATCH --mem=10G
#SBATCH --time=1-00:00:00
#SBATCH -J <JobName>

srun ./my_program
The #SBATCH entries provide information to the queue system. Importantly, our cluster has three partitions: BrainFull, LightFull, and Total. With these, users can choose the node they want: BrainFull to use only the Brain node, LightFull for the Light node, or Total for any available node. The job script must be submitted with:
sbatch myscript.sh
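As a concrete illustration of the template above, a minimal job script for the LightFull partition might look like this (the program name, job name, and resource values are hypothetical and should be adapted to your job):

```shell
#!/bin/bash
#SBATCH -p LightFull          # partition: run only on the Light node
#SBATCH -N 1                  # number of nodes (usually 1)
#SBATCH -c 16                 # number of CPU cores
#SBATCH --mem=10G             # requested memory
#SBATCH --time=1-00:00:00     # wall-time limit: 1 day
#SBATCH -J my_simulation      # job name shown in squeue

srun ./my_program             # hypothetical executable in the submit directory
```

Submitting it with `sbatch myscript.sh` prints `Submitted batch job <job_id>`; the `<job_id>` is what you pass to squeue, scontrol, and scancel below.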
Monitoring Jobs
Check the job queue:
squeue
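With no arguments, squeue lists all jobs in the queue. A few standard squeue flags that are often more useful in practice:

```shell
# Show only your own jobs
squeue -u $USER

# Show jobs in a specific partition (e.g. LightFull)
squeue -p LightFull

# Show estimated start times for pending jobs
squeue --start
```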
See detailed info:
scontrol show job <job_id>
Canceling Jobs
To cancel a job:
scancel <job_id>
To cancel all your jobs:
scancel -u $USER
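scancel also accepts standard filters, so you can cancel only a subset of your jobs instead of all of them:

```shell
# Cancel only your pending (not yet running) jobs
scancel -u $USER --state=PENDING

# Cancel all your jobs in a given partition
scancel -u $USER -p LightFull

# Cancel your jobs by job name (e.g. the -J name from the script)
scancel -u $USER --name=my_simulation
```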