Frontend
The Frontend is the node you connect to remotely. Its primary function is to give all users remote access to the calculation clusters and, in limited circumstances, to let them edit and compile source code. It must never be used to run resource-intensive codes: these slow down the work of other users, degrade cluster functionality, and can eventually block the entire infrastructure.
If an executable must necessarily be tested on the Frontend, the responsible user must actively monitor the process and make sure that it does not run for more than a few seconds.
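One way to enforce the "few seconds" rule is to wrap the test run in the coreutils timeout command, which stops the process when the limit expires. The sketch below is only an illustration: the helper name run_briefly and the 5s limit are assumptions, not part of the cluster setup.

```shell
#!/bin/sh
# Hypothetical helper: run a command on the Frontend with a hard time limit.
# Relies on GNU coreutils `timeout`, which exits with status 124 on expiry.
run_briefly() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "Stopped after $limit: submit this run as a Slurm job instead." >&2
    fi
    return "$status"
}

# Usage (./executable is a placeholder for your own program):
#   run_briefly 5s ./executable
```

Anything that needs more than the limit should be submitted through the WorkLoad Manager rather than retried with a longer timeout.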
Run a Job
To execute serial or parallel code, it is necessary to use the Slurm WorkLoad Manager, which allocates the necessary resources and manages the priority of requests. Below are some basic functions and operating instructions for submitting serial and parallel jobs via Slurm; please refer to the official documentation for further information.
For each job, the batch script must specify the required resources (e.g. number of nodes, number of processors, memory, execution time) and, optionally, any other constraints (e.g. group of nodes) or additional parameters.
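As a rough illustration of such a resource request, here is a minimal sketch of a batch script for a serial job. The job name, memory, and time limit are placeholder values to be tuned, ./executable is a hypothetical program name, and m1 is assumed to be a partition you are allowed to use.

```shell
#!/bin/bash
#SBATCH --job-name="serialTest"   ## illustrative job name
#SBATCH --partition=m1            ## assumed partition; adjust to your case
#SBATCH --ntasks=1                ## a serial code needs a single task
#SBATCH --mem=1G                  ## memory for the job (placeholder, to be tuned)
#SBATCH --time=00:10:00           ## wall-clock limit, hh:mm:ss (placeholder)
#SBATCH --output=%j.out           ## log file named after the job ID

# SLURM_JOB_ID is set by Slurm inside a job; the fallback covers a manual dry run.
job_label="${SLURM_JOB_ID:-dry-run}"
echo "Job ${job_label} running on $(uname -n)"

# Hypothetical program name: replace with your own executable.
exe="./executable"
if [ -x "$exe" ]; then
    "$exe"
else
    echo "Executable not found: $exe" >&2
fi
```

Stating a realistic time limit generally helps the scheduler place short jobs sooner than an overly generous request would.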
Submission via script
Although it is possible to provide job submission information to the WorkLoad Manager via command line parameters, it is normally preferred to create a bash script (job script) that contains the information permanently.
The job script is ideally divided into three sections:
- The header, consisting of commented text in which information and notes useful to the user but ignored by the system are given (the syntax of the comments is #text-for-user…);
- The Slurm settings, in which instructions for launching the actual job are specified (the syntax of the instructions is #SBATCH --option);
- The module loading and code execution, the structure of which varies according to the particular software each user is using.
Below is an example job script (runParallel.sh) for parallel computing:
#!/bin/bash
#---------------------------------------------------------------------------- #
#   University    |   DIFA - Dept of Physics and Astrophysics
#       of        |   Open Physics Hub
#     Bologna     |   (https://site.unibo.it/openphysicshub/en)
#---------------------------------------------------------------------------- #
#
# License
#    This is free software: you can redistribute it and/or modify it
#    under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
# Author
#    Carlo Cintolesi
#
# Application
#    slurm workload manager
#
# Usage
#    run a job:         sbatch run.sh
#    check processes:   slurmtop
#    delete a job:      scancel <jobID>
#
# Description
#    Run job on the new cluster of OPH with SLURM
#
# --------------------------------------------------------------------------- #
# SLURM setup
# --------------------------------------------------------------------------- #

#- (1) Choose the partition where to launch the job,
#      and the account of your research group
#-
##SBATCH --partition=g1                 ## GPU node
#SBATCH --partition=m1                  ## Matrix nodes 00-15
##SBATCH --account="oph"                ## Choose the name of the account to charge the job on

#- (2) Select the nodes to work on (discouraged in Matrix),
#      the number of tasks to be used (or specify the number of node and tasks),
#      the Infiniband constraint (encouraged in Matrix)
#      the RAM memory available for each node
#-
#SBATCH --constraint=ib                 ## infiniband, keep for all matrix node
#SBATCH --nodes=2                       ## number of nodes to be allocated
#SBATCH --tasks-per-node=28             ## number of tasks per node
#SBATCH --ntasks=56                     ## total number of tasks (should be compatible with nodes x tasks-per-node)
#SBATCH --mem-per-cpu=2G                ## ram per cpu (to be tuned)

#- (3) Set the name of the job, the log and error files,
#      define the email address for communications (just UniBo)
#-
#SBATCH --job-name="jobName"            ## job name in the scheduler
#SBATCH --output=%N_%j.out              ## log file
#SBATCH --error=%N_%j.err               ## err file
#SBATCH --mail-type=ALL                 ## send email at beginning and end of job
#SBATCH --mail-user=nome.cognome@unibo.it   ## email to send job information to

# --------------------------------------------------------------------------- #
# Modules setup and applications run
# --------------------------------------------------------------------------- #

#- (4) Modules to be loaded
#- ADD MODULES YOU NEED
module load mpi/openmpi/4.1.4

#- (5) Run the job: just an example
#-
mpirun -np 56 ./executable <params>

# ------------------------------------------------------------------------end #
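The example states the task count in two places (--ntasks=56 and mpirun -np 56), and the total has to match nodes x tasks-per-node (2 x 28 = 56). A possible way to avoid the mismatch, sketched below under the assumption that --ntasks is set in the script, is to read the count from the SLURM_NTASKS variable that Slurm exports inside the job. The variable names nodes, tasks_per_node and expected_tasks are illustrative only.

```shell
#!/bin/sh
# Values copied from the #SBATCH lines of the job script.
nodes=2
tasks_per_node=28
expected_tasks=$((nodes * tasks_per_node))   # nodes x tasks-per-node

# Inside a job, Slurm exports SLURM_NTASKS when --ntasks is given;
# the fallback only lets this sketch run outside a job.
ntasks="${SLURM_NTASKS:-$expected_tasks}"

if [ "$ntasks" -ne "$expected_tasks" ]; then
    echo "Warning: allocated tasks ($ntasks) differ from nodes x tasks-per-node ($expected_tasks)" >&2
fi

# In a real job script this command would be executed rather than printed.
echo "mpirun -np $ntasks ./executable"
```

With this pattern, changing the #SBATCH values is enough; the mpirun line follows the allocation automatically.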
It is possible to use several job steps (several lines that launch executables, such as mpirun) in a single job script if each step requires the same resource allocation as the previous one and must start only when the previous one has finished. If, on the other hand, the steps are independent, or are sequentially dependent but need different resources, it is better to use separate job scripts: job steps run sequentially within a single resource allocation (e.g. on a single subset of nodes), while different jobs can have different allocations (thus reducing resource wastage) and can also start in parallel.
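For the case of separate job scripts that must still run one after the other, Slurm offers job dependencies. The sketch below uses the sbatch options --parsable and --dependency=afterok; the helper name submit_chain and the script names in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: submit two job scripts so that the second starts
# only after the first has completed successfully.
submit_chain() {
    first_script=$1
    second_script=$2

    # --parsable makes sbatch print only the job ID
    # (optionally followed by ";cluster-name").
    first_id=$(sbatch --parsable "$first_script") || return 1
    first_id=${first_id%%;*}

    # afterok: the second job becomes eligible only if the first one
    # finished with exit code zero.
    sbatch --dependency="afterok:${first_id}" "$second_script"
}

# Usage (on the cluster, with your own scripts):
#   submit_chain preprocess.sh solve.sh
```

Each script keeps its own #SBATCH resource request, so a small preprocessing job does not hold the nodes needed by the larger one.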
To submit the job script to the WorkLoad Manager, which then allocates the requested resources, run the command:
sbatch runParallel.sh [other parameters]
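When passing parameters on the command line, note that sbatch reads its own options only when they appear before the script name, where they take precedence over the matching #SBATCH lines; anything written after the script name is handed to the job script as its arguments. A small sketch, in which the helper name submit_short_test and the override values are assumptions:

```shell
#!/bin/sh
# Hypothetical helper: submit a job script with a different job name and a
# short time limit, without editing the #SBATCH lines in the file.
submit_short_test() {
    # $1 = job script. Options placed before the script name are
    # interpreted by sbatch and override the script's directives.
    sbatch --job-name="shortTest" --time=00:10:00 "$1"
}

# Usage (on the cluster):
#   submit_short_test runParallel.sh
```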
For the management of running jobs, please refer to section “Job Management”.
Job Management
Once a job has been submitted to the WorkLoad Manager with the sbatch command, its priority and progress can be monitored with a series of management functions:
- slurmtop, displays the status of the cluster in a 'semigraphic' fashion. Among other features, it displays the status of jobs and the allocation of jobs to nodes.
- squeue, displays the queue status.
- scancel <job-ID>, cancels the execution of the job with the given identification number (ID).
- scontrol show job <job-ID>, displays the details of a job, including its priority in the queue.
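The output of scontrol show job is a series of Key=Value pairs, so a single detail such as the priority can be pulled out with standard text tools. The helper below is a minimal sketch: the name job_field is hypothetical, and it assumes the value contains no spaces (true for fields like Priority or JobState, not necessarily for paths).

```shell
#!/bin/sh
# Hypothetical helper: print the value of one Key=Value field read from
# standard input, e.g. the output of `scontrol show job <job-ID>`.
job_field() {
    # $1 = field name, such as Priority or JobState.
    # Put every Key=Value pair on its own line, then keep the first match.
    tr -s ' \t' '\n\n' | sed -n "s/^$1=//p" | head -n 1
}

# Usage (on the cluster):
#   scontrol show job <job-ID> | job_field Priority
#   squeue -u "$USER"          # list only your own jobs
```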
Other functions for job management and accounting include the following:
- /home/software/utils/seff <job-ID>, reports how efficiently the requested resources were used. The job must already be finished.
- sshare and /home/software/utils/showfullusage.sh, report how many resources have already been used and by which user.