If an executable must necessarily be tested on the Frontend, the responsible user must actively monitor the job and be sure that it is not active for more than a few seconds.
</WRAP>

That includes heavy IDEs((Integrated Development Environments)) (VS Code, just to cite one). If you're used to an IDE, use it on your client and just transfer the resulting files to the frontend, as sketched below. If an IDE is worth using, it supports this workflow.
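For example, a minimal sketch of the transfer step, where the project directory and the frontend host name are placeholders, not actual names:
<code bash>
# Hypothetical example: edit locally, then copy the results to the frontend.
# "myProject/" and "<frontend-host>" are placeholders.
rsync -av myProject/ your.username@<frontend-host>:~/myProject/
# or, for a single file:
scp myProject/run.sh your.username@<frontend-host>:~/myProject/
</code>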
To better enforce the fair use of the frontend, memory (RAM) usage is limited to 1 GB per user.
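If you want to check how much memory your own processes are currently using on the frontend, a quick check with standard tools (just a sketch, nothing cluster-specific) is:
<code bash>
# List your own processes sorted by resident memory usage (RSS, in kB).
ps -u $USER -o pid,rss,comm --sort=-rss | head
</code>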
====== Run a Job ======
To execute serial or parallel code, it is necessary to use the [[https://slurm.schedmd.com/|SLURM]] WorkLoad Manager.
For each job, it is necessary to specify via a batch script the required resources (e.g. number of nodes, number of processors, memory, execution time) and, optionally, any other constraints (e.g. a group of nodes) or parameters.
===== Submission via script =====
Although it is possible to provide job submission information to the WorkLoad Manager via command line parameters, it is usually more convenient to collect them in a job script.
The job script is ideally divided into three sections:
  * The header, consisting of commented text in which information and notes useful to the user but ignored by the system are given (the syntax of the comments is ''# comment'');
  * The Slurm settings, in which instructions for launching the actual job are specified (the syntax of the instructions is ''#SBATCH --option'');
  * The module loading and code execution, the structure of which varies according to the particular software each user is using.
Below is an example job script:
<code bash runParallel.sh>
#!/bin/bash
#------------------------------------------------------------------------------
# University  |
#   of        |   Open Physics Hub
# Bologna     |
#------------------------------------------------------------------------------
# License
#   This is free software: you can redistribute it and/or modify it
#   under the terms of the GNU General Public License as published by
#   the Free Software Foundation, either version 3 of the License, or
#   (at your option) any later version.
#
# Author
#   Carlo Cintolesi
#
# Application
#   slurm workload manager
#
# Usage
#   run a job:        sbatch runParallel.sh
#   check processes:  squeue
#   delete a job:     scancel <jobID>
#
# Description
#   Run job on the new cluster of OPH with SLURM
#
# --------------------------------------------------------------------------- #
# SLURM setup
# --------------------------------------------------------------------------- #

#- (1) [optional] Choose the account of your research group
#-
##SBATCH --account=...

#- (2) Select the subcluster partition to work on (optional),
#      the number of tasks to be used (or the number of nodes and tasks),
#      and the RAM memory available for each node
#-
#SBATCH --constraint=matrix
##SBATCH --partition=...
#SBATCH --ntasks=2
##SBATCH --nodes=...
##SBATCH --ntasks-per-node=...
#SBATCH --mem=...

#- (3) Set the name of the job, the log and error files,
#      define the email address for communications (just UniBo)
#-
#SBATCH --job-name="jobName"
#SBATCH --output=%N_%j.out
#SBATCH --error=%N_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nome.cognome@unibo.it

# --------------------------------------------------------------------------- #
# Modules setup and applications run
# --------------------------------------------------------------------------- #

#- (4) Modules to be loaded
#-
module load mpi/...

#- (5) Run the job: just an example
#-
mpirun -n 2 ./myApplication

# ------------------------------------------------------------------------end #
</code>
It is possible to use several job steps (several lines that launch executables such as ''mpirun'' or ''srun'') within the same job script, as sketched below.
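For instance, a minimal sketch of two consecutive job steps, assuming hypothetical executables ''preprocess'' and ''solver'' (placeholders, not software installed on the cluster):
<code bash>
#- Each srun (or mpirun) line is a separate job step;
#- the steps run one after the other on the resources allocated to the job.
srun --ntasks=2 ./preprocess
srun --ntasks=2 ./solver
</code>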
To allocate the requested resources and put the job in the queue, the script must be submitted with:
  sbatch runParallel.sh [other parameters]
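The ''[other parameters]'' can be any ''sbatch'' option, which then takes precedence over the corresponding ''#SBATCH'' directive in the script; a hypothetical example (all values are placeholders):
<code bash>
# Command-line options override the #SBATCH lines inside runParallel.sh.
sbatch --job-name="testRun" --ntasks=4 --time=01:00:00 runParallel.sh
</code>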
<WRAP center round info>
Estimating the value to use for ''...'' ...
</WRAP>

<WRAP center round tip>
''...'' ...

While ''...'' ...
</WRAP>
For the management of running jobs, please refer to section "Job Management" below.
===== 'Interactive' jobs =====
Sometimes you have to run heavy tasks (unsuitable for the frontend) that require interactivity: for example, compiling a complex program that requires you to answer some questions, or creating a container.
You first have to request a node allocation, either by ''sbatch'' (as above) or by ''salloc'':
  salloc -N 1 --cpus-per-task=... --time=... --mem=... --constraint=blade
''salloc'' will pause while waiting for the requested resources, so be prepared to wait. It also tells you the value of $JOBID to be used in the following steps.
Then you can connect your terminal to the running job via:
  srun --pty --overlap --jobid $JOBID bash
which gives you a new shell on the first node allocated to $JOBID (just like SSH-ing into a node with the resources you asked for).
Once you're done, remember to call:
  scancel $JOBID
to release the resources for other users.
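Putting the three steps together, a hypothetical session could look like this (the job ID ''12345'' and all resource values are placeholders):
<code bash>
# 1. Request the allocation: salloc blocks until resources are granted
#    and prints the job ID (e.g. "salloc: Granted job allocation 12345").
salloc -N 1 --cpus-per-task=4 --time=02:00:00 --mem=8G --constraint=blade

# 2. Open an interactive shell on the first allocated node.
srun --pty --overlap --jobid 12345 bash

# 3. When finished, exit the shell and release the allocation.
scancel 12345
</code>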
===== Job Management =====
Once a job has been sent to the WorkLoad Manager via the command ''sbatch'', it can be monitored and managed with the following commands:
  * ''squeue'' shows the list of jobs queued or running, together with their state and job ID;
  * ''scancel <jobID>'' removes a job from the queue or stops it if it is already running;
  * ''scontrol show job <jobID>'' prints detailed information about a specific job.
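As a short, hypothetical monitoring session (the job ID is a placeholder):
<code bash>
squeue -u $USER              # list only your own jobs
scontrol show job 12345      # detailed information on one job
scancel 12345                # cancel it, if needed
</code>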
Other management functions for the job and for accounting include the following:
  * ''/...''
  * ''...''
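For accounting-type queries, one general-purpose option (a sketch; the start date and output fields are arbitrary examples) is SLURM's ''sacct'' command:
<code bash>
# Summarise your jobs started since a given date, with elapsed time,
# final state and peak memory usage per step.
sacct -u $USER --starttime=2024-11-01 \
      --format=JobID,JobName,Partition,Elapsed,State,MaxRSS
</code>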