====== The Frontend ======

The Frontend is the node you connect to remotely. Its primary function is to allow remote access to the cluster, file management and job submission: it is not meant to run computations or other heavy tasks.

<WRAP center round important 60%>
If an executable must necessarily be tested on the Frontend, the responsible user must actively monitor the job and be sure that it is not active for more than a few seconds.
</WRAP>

That includes heavy IDEs((Integrated Development Environments)) (VSCode, just to cite a name). If you're used to an IDE, use it on your client and just transfer the resulting files to the frontend. If it's worth using, it supports this workflow.
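
For example, instead of editing on the frontend you can copy or synchronise your files from your client (the hostname and username below are placeholders, use your own credentials and the address of the frontend you normally connect to):

<code bash>
# copy a single file into your home directory on the frontend
scp runParallel.sh name.surname@frontend.example.unibo.it:~/

# synchronise a whole project directory; only changed files are transferred
rsync -av --progress myProject/ name.surname@frontend.example.unibo.it:~/myProject/
</code>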
| + | |||
| + | To better enforce the fair use of the frontend, the memory (RAM) usage is limited to 1GB per user. | ||
| + | |||
| + | |||
| + | ====== | ||
| + | |||
| + | To execute serial or parallel code, it is necessary to use the [[https:// | ||
| + | |||
| + | For each job, it is necessary to specify via a batch script the required resources (e.g. number of nodes, number of processors, memory, execution time) and, optionally, any other constraints (e.g. a group of nodes). Optionally, other parameters may also be indicated | ||
| + | |||
| + | |||
===== Submission via script =====

Although it is possible to provide job submission information to the WorkLoad Manager via command line parameters, it is usually preferable to create a bash script (job script) that contains the information permanently.

The job script is ideally divided into three sections:
  * the header, consisting of commented text in which information and notes useful to the user but ignored by the system are given (the syntax of the comments is ''# comment'');
  * the Slurm settings, in which instructions for launching the actual job are specified (the syntax of the instructions is ''#SBATCH --option=value'');
  * the module loading and code execution, the structure of which varies according to the particular software each user is using.

Below is an example job script for parallel computing:
<code bash runParallel.sh>
#!/bin/bash
#
# --------------------------------------------------------------------------- #
#
# OPH cluster - Bologna
#
# License
# This is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Author
# Carlo Cintolesi
#
# Application
# slurm workload manager
#
# Usage
# run a job:       sbatch runParallel.sh
# check processes: squeue
#
# Description
# Run job on the new cluster of OPH with SLURM
#
# --------------------------------------------------------------------------- #
# SLURM setup
# --------------------------------------------------------------------------- #
| + | |||
| + | #- (1) [optional] Choose the account of your research group | ||
| + | ##SBATCH --account=oph | ||
| + | ##SBATCH --reservation=prj-can ## Use the node reserved for CAN project | ||
| + | ##SBATCH --qos=normal | ||
| + | ## and ' | ||
| + | |||
| + | #- (2) Select the subcluster partition to work on (optional), | ||
| + | # the number of tasks to be used (or specify the number of nodes and tasks), | ||
| + | # and the RAM memory available for each node | ||
| + | #- | ||
| + | #SBATCH --constraint=matrix | ||
| + | ##SBATCH --constraint=blade | ||
| + | #SBATCH --ntasks=56 | ||
| + | ##SBATCH --nodes=2 | ||
| + | ##SBATCH --tasks-per-node=28 ## number of tasks per node (multiple of 28) | ||
| + | #SBATCH --mem-per-cpu=2G | ||
| + | |||
| + | #- (3) Set the name of the job, the log and error files, | ||
| + | # define the email address for communications (just UniBo) | ||
| + | #- | ||
| + | #SBATCH --job-name=" | ||
| + | #SBATCH --output=%N_%j.out | ||
| + | #SBATCH --error=%N_%j.err | ||
| + | #SBATCH --mail-type=ALL | ||
| + | #SBATCH --mail-user=nome.cognome@unibo.it | ||
| + | |||
| + | # --------------------------------------------------------------------------- # | ||
| + | # Modules setup and applications run | ||
| + | # --------------------------------------------------------------------------- # | ||
| + | |||
| + | #- (4) Modules to be load | ||
| + | #- | ||
| + | module load mpi/ | ||
| + | |||
| + | #- (5) Run the job: just an example. | ||
| + | # Note that the number of processors "-np 56" must be equal to --ntasks=56 | ||
| + | #- | ||
| + | mpirun -np 56 ./ | ||
| + | |||
| + | # ------------------------------------------------------------------------end # | ||
| + | </ | ||
| + | |||
| + | It is possible to use several job steps (several lines that launch executables such as '' | ||
| + | |||
To submit the job script to the WorkLoad Manager, which will allocate the requested resources, the following command must be executed:

  sbatch --time hh:mm:ss runParallel.sh [other parameters]

<WRAP center round info>
Estimating the value to use for ''--time'' is important: if the limit is too low, the job is killed before it completes; if it is much larger than necessary, the start of the job in the queue may be delayed.
</WRAP>
<WRAP center round tip>
''--time'' can also be set inside the job script with an ''#SBATCH --time=hh:mm:ss'' directive.

While ''#SBATCH'' directives define the defaults for the job, the corresponding options given on the ''sbatch'' command line override them.
</WRAP>

For the management of running jobs, please refer to the section "Job Management" below.


| + | ===== ' | ||
| + | |||
Sometimes you have to run some heavy tasks (unsuitable for the frontend) that require interactivity: for example, to compile a complex program that requires you to answer some questions, or to create a container.

You first have to request a node allocation, either via ''sbatch'' (as above) or via ''salloc'':
  salloc -N 1 --cpus-per-task=... --time=... --mem=... --constraint=blade
''salloc'' will pause while waiting for the requested resources, so be prepared. It also tells you the value of $JOBID to be used in the following steps.
| + | |||
| + | Then you can connect your terminal to the running job via: | ||
| + | srun --pty --overlap --jobid $JOBID bash | ||
| + | that gives you a new shell on the first allocated node for $JOBID (just like SSH-ing a node with the resources you asked for). | ||
| + | |||
| + | Once you're done, remember to call: | ||
| + | scancel $JOBID | ||
| + | to release resources for other users. | ||
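
Putting the pieces together, an interactive session might look like the following sketch (the resource values and the job ID ''12345'' are only examples):

<code bash>
# 1. request one node with 4 cores and 8GB of RAM for two hours
salloc -N 1 --cpus-per-task=4 --time=02:00:00 --mem=8G --constraint=blade
#    salloc prints something like: "salloc: Granted job allocation 12345"

# 2. open a shell on the allocated node, using the job ID printed by salloc
srun --pty --overlap --jobid 12345 bash

# 3. ...compile, test, build the container...

# 4. when finished, release the allocation
scancel 12345
</code>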
| + | |||
| + | ===== Job Management ===== | ||
| + | |||
| + | Once a job has been sent to the WorkLoad Manager via the command '' | ||
| + | |||
| + | * ''/ | ||
| + | * '' | ||
| + | * '' | ||
| + | * '' | ||
| + | |||
| + | Other management functions for the job and the accounting issue include the following: | ||
| + | |||
| + | * ''/ | ||
| + | * '' | ||