light cluster information

Specifications

Head Node
donllcluster (donllcluster.upc.edu)
Storage Node
Hardware
2 nodes (light: 1, brain: 1)
head: Intel(R) Xeon(R) Gold 5218N CPU @ 2.30GHz × 2

light: AMD EPYC 7713P 64-Core Processor @ 2.98 GHz × 64 (+ 64 hyperthreads)

brain: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz × 20 (+ 20 hyperthreads)

RAM: 128 GB (light and brain)
Software
SLURM
Administration
Sergi Ruiz-Barragan & Andrés Arco León (andres.arco@upc.edu)

Status

All OK.

Submit jobs to the queue system

The queue system used in our cluster is SLURM.

The queue system lets users submit jobs (scripts) to the different nodes of the cluster.

SLURM provides user commands to submit, monitor, and manage jobs: srun, sbatch, squeue, scancel, and scontrol (among others).

Submitting Jobs

Jobs are typically submitted via batch scripts. Example:

#!/bin/bash
#SBATCH -p <partition name>
#SBATCH -N <number of nodes>   # usually 1
#SBATCH -c <number of CPUs>
#SBATCH --mem=10G
#SBATCH --time=1-00:00:00
#SBATCH -J <JobName>

srun ./my_program

The #SBATCH entries provide information to the queue system. Importantly, our cluster has three partitions: BrainFull, LightFull, and Total. With these, users can choose which node to run on: BrainFull uses only the Brain node, LightFull only the Light node, and Total any available node. The job script must be submitted with:
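As an illustration, a filled-in version of the template above targeting the LightFull partition might look like this (the resource values and the program name my_program are placeholders, not a real workload):

```shell
#!/bin/bash
#SBATCH -p LightFull        # run only on the Light node
#SBATCH -N 1                # a single node
#SBATCH -c 8                # 8 CPU cores
#SBATCH --mem=10G           # 10 GB of RAM
#SBATCH --time=1-00:00:00   # wall-time limit: 1 day
#SBATCH -J test_job         # job name shown in squeue

srun ./my_program
```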

sbatch myscript.sh

Monitoring Jobs

Check the job queue:

squeue
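On a shared cluster the full queue can be long; standard squeue filter flags narrow the listing (partition name as in our cluster, the rest are stock SLURM options):

```shell
squeue -u $USER       # show only your own jobs
squeue -p LightFull   # show only jobs in the LightFull partition
```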

See detailed info:

scontrol show job <job_id>

Canceling Jobs

To cancel a job:

scancel <job_id>

To cancel all your jobs:

scancel -u $USER
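scancel also accepts filters, which can be combined; for example (standard SLURM flags):

```shell
scancel -n <JobName>               # cancel your jobs with a given name (-J in the script)
scancel -u $USER --state=PENDING   # cancel only your jobs still waiting in the queue
```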

Software