oph:cluster:startupguide

Revision of 2025/02/10 11:29 (current version) – updated frontend IP address – mario.petroli@unibo.it. Previous revision: 2023/04/12 18:16 – carlo.cintolesi@unibo.it.
To access the cluster, two things are required: an Unibo account (of the type //name.surname00@unibo.it//) and authorisation for access by the OPH responsible person of your research sector. See [[oph:cluster:access|accessing the cluster]] for the contact details of the OPH responsible persons.
  
Once authorised, you can access the cluster via the ssh protocol; the required password is the one of your university e-mail account. From a Linux terminal, type:

  ssh name.surname00@137.204.165.41

(you can use any other ophfeX address instead of 137.204.165.41)
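If you connect frequently, you can define a shortcut in your local ''~/.ssh/config'' file. This is only a sketch: the alias ''oph'' is a hypothetical name chosen here for illustration, and the address is the frontend IP given above.

  Host oph
      HostName 137.204.165.41
      User name.surname00

After this, ''ssh oph'' is enough to log in, and the same alias works with ''scp'' and ''rsync'' for file transfers.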
  
Once logged in, you will land on the [[oph:cluster:jobs|Frontend]], the workspace shared by all users, which is used to submit jobs and to access the stored datasets.
  
<WRAP center round alert 60%>
Do not use the Frontend to execute long and demanding jobs, as this can lead to the shutdown of the entire system!
</WRAP>
  
===== Setup the Environment =====
  
The first time you access the cluster, you must set up your working environment. Here are a few tips to help you manage your account and data correctly. In particular, attention must be paid to the correct use of data storage areas, see [[oph:cluster:storage|Storage types]]. The OPH cluster currently has two main memory spaces with two different functions:
==== Working storage area: /home ====
  
At the first connection to the cluster, the system automatically generates a folder for you in the ''/home'' partition. This is a limited memory area and should not be used to store massive amounts of data. Typically, software codes and frequently used documents or scripts that do not take up much space are saved under this area.
  
You can manage the data in this folder directly from the Frontend and to your liking, as long as you limit the storage space used.
  
==== Data storage area: /scratch ====
Create your own folder in ''/scratch'' within the general folder of your research sector (see the research sector names [[oph:cluster:access|here]]). For example, if you work in the //astro// sector, type:
  
  mkdir -p /scratch/astro/name.surname00
  
Then create the symbolic link in the folder accessible from the Frontend:
  
  ln -s /scratch/astro/name.surname00 run
  
In this way, when you access the Frontend, you will immediately find a folder named ''run'' from which you can access the data saved in ''/scratch'' (note that ''run'' is not a folder but a symbolic link). In any case, if you need to work with some big files (e.g. a large dataset), use symlinks from ''/home'' to the files in ''/scratch'' to access them.
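If you want to see how the two commands above behave before touching ''/scratch'', you can rehearse them in any local directory (the paths below are stand-ins for the cluster's real ones):

```shell
# Stand-in for /scratch/astro/name.surname00 (illustrative local path)
mkdir -p scratch/astro/name.surname00

# Create the symbolic link, as done on the Frontend
ln -s "$PWD/scratch/astro/name.surname00" run

# 'run' is a symlink pointing at the data folder, not a real folder
ls -l run
```

''ls -l'' shows the arrow notation ''run -> …/scratch/astro/name.surname00'', confirming that ''run'' is only a pointer: deleting it does not delete the data it points to.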
  
<WRAP center round alert 60%>
The /scratch area cannot handle folders with a large number of files. Data folders in this area must be compacted into archives (e.g., .tgz or .zip).
</WRAP>
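The compacting required by the box above can be done with standard tools; a minimal sketch, with illustrative file and folder names:

```shell
# Suppose a run produced a folder full of small output files
mkdir -p results
touch results/file1.dat results/file2.dat

# Compact the whole folder into a single .tgz archive...
tar czf results.tgz results/

# ...and remove the original folder, so /scratch sees one file, not many
rm -r results

# The archive contents can be listed (and later extracted with 'tar xzf')
tar tzf results.tgz
```

A single ''.tgz'' counts as one file towards the per-user limit, regardless of how many files it contains.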
  
**Please pay attention to this policy**: the stable number of files saved in /scratch shall not exceed a few thousand per user; otherwise, the system becomes incredibly slow and unstable. Periodic checks on the number of files of each user are carried out automatically.

**Note to student supervisors**: once you have created the data write folder for your students, you can obtain read and write rights on its files through ''setfacl -m u:name.surname0:rw /home/pathToFolder'', where ''name.surname0'' is your account name and ''/home/pathToFolder'' is the full path to the folder you want to access.
===== Run a Job =====

The jobs executed (in parallel or serial) on the cluster are managed by the [[https://slurm.schedmd.com/documentation.html|Slurm Workload Manager]]. A job is submitted via a bash-type script consisting of: a header with metadata for users, the execution settings (e.g. number of processors, memory, execution time), the modules to be loaded, and the executable to be run. See the section [[oph:cluster:jobs|Run a Job]] for more details.

<WRAP center round tip 60%>
An example job script with comments can be downloaded here and adapted to personal needs: [[https://apps.difa.unibo.it/wiki/_export/code/oph:cluster:jobs?codeblock=0|runParallel.sh]]
</WRAP>
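As a sketch of the structure described above, a submission script looks roughly like this. All directive values, the module name, and the executable are illustrative placeholders, not the cluster's actual settings; adapt them from the downloadable ''runParallel.sh'' example.

  #!/bin/bash
  # --- header: metadata and execution settings read by Slurm ---
  #SBATCH --job-name=myRun          # name shown in the queue
  #SBATCH --ntasks=4                # number of processors (MPI tasks)
  #SBATCH --mem=8G                  # total memory requested
  #SBATCH --time=01:00:00           # maximum execution time (hh:mm:ss)
  #SBATCH --output=infoRun%j        # standard output file (%j = job number)
  #SBATCH --error=errRun%j          # standard error file
  
  # --- modules: load the required software environment (name is hypothetical) ---
  module load openmpi
  
  # --- executable to be run ---
  mpirun ./myProgram

The ''#SBATCH'' lines are comments to bash but are read by Slurm as resource requests; ''%j'' is expanded to the job number, which is how the output files get the job ID in their names.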

To run the job, the script has to be submitted with:

  sbatch runParallel.sh

The output of the job execution is redirected to two files: ''infoRun000'', which contains the standard output (with the job number in its name), and ''errRun000'', which contains the error messages.

===== Job Monitoring and Management =====

Once a job has been submitted, it is possible to monitor its priority and progress status using the following commands:

  * ''slurmtop'', displays the status of the cluster in a 'semigraphic' fashion
  * ''squeue'', displays the queue status
  * ''scancel <job-ID>'', cancels the execution of the job with the given identification number (ID)

Additional information and functions can be found in the official [[https://slurm.schedmd.com/documentation.html|Slurm documentation]].

===== Problems and Troubleshooting =====

The cluster management requires quite a lot of time and energy at this stage. The management team kindly asks you not to contact the technical administrators except for urgent matters or serious problems (those that prevent work from continuing). Reports of malfunctions may be sent, but without guarantee of an immediate response.

  * For information on accounting and cluster access problems, please contact the [[oph:cluster:access|reference person for your research area]].
  * For problems accessing the storage or executing jobs on the cluster, contact the system administrators at ''difa.csi@unibo.it''.

The technical administrators do not offer assistance with problems related to the use of Slurm (see the [[https://slurm.schedmd.com/documentation.html|on-line documentation]]) or to your personal code or software.

Thank you for your cooperation and understanding.
oph/cluster/startupguide.1681323367.txt.gz · Last modified: 2023/04/12 18:16 by carlo.cintolesi@unibo.it
