Newer messages at the top.
  
<WRAP center round info>
To report issues, please write **only** to difa.csi@unibo.it with a clear description of the problem ("my job doesn't work" is __not__ a clear description, but it is what we usually receive...), the jobID, the misbehaving node(s), steps to reproduce, etc.
</WRAP>

===== 2025 =====

==== 2025-03-31 ====
  * <del>We're experiencing issues with logins on bastion-nav from domain STUDENTI.</del> Fixed

==== 2025-03-27 ====

  * <del>/scratch and many nodes are currently down due to technical issues. We're working on it.</del> [9:00] Everything appears OK

==== 2025-01-13 ====

  * /archive is now **read-only** to avoid potential data loss during the cluster move

===== 2024 =====

==== 2024-12-18 ====

  * Power sources are redundant again. There shouldn't be unexpected shutdowns until the "big one" on 2025-01-15

==== 2024-12-17 ====

  * Possibly unstable power source: the rotary unit failed yesterday and is being worked on by CNAF. Some nodes powered off and have been restored. Uptime is currently not guaranteed in case of a blackout.

==== 2024-11-04 ====

  * Started cleanup of /home/work: everything outside the sector folders is being moved into the (hopefully) correct sector folder. This might push a sector folder over quota -- just clean up (delete what you don't need, move the rest to /scratch or /archive)! A quick way to spot the biggest directories is sketched below.
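A generic sketch for finding what takes up the most space (''SECTOR'' and ''name.surname'' are placeholders for your actual folders):

<code bash>
# Show the size of each first-level directory in your folder, largest last.
# SECTOR and name.surname are placeholders for your actual paths.
du -sh /home/work/SECTOR/name.surname/* 2>/dev/null | sort -h
</code>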

==== 2024-10-30 ====

  * <del>Slurm is experiencing an unexpected misbehaviour and does not accept job submissions: we're already working on it</del> Reverted a problematic change, more tests required
  * mtx18 detected a problem with a DIMM module: powered off for HW checkup
  * /home/temp is being decommissioned: data have already been moved to /scratch

==== 2024-10-22 ====

  * :!: <wrap hi>Tomorrow</wrap> nodes mtx[30-40] will be temporarily shut down for maintenance; jobs running on those nodes will be aborted without further notice

==== 2024-10-17 ====

  * Started transfer from /home/temp to /scratch of all the remaining files: /home/temp/sector/name.surname -> /scratch/sector/name.surname/name.surname (note the doubling of name.surname to avoid clashes with existing files)
  * <wrap hi>automatic deletion</wrap> from /scratch is being worked on: all files older than **40 days** will be deleted automatically very soon, so archive what you need to keep! There will be <wrap hi>**__no way to recover deleted files__**</wrap> (see the sketch below for a way to find files at risk)
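A minimal sketch to spot the files at risk, assuming the cleanup will be based on modification time (the exact criterion is not stated above); ''SECTOR'' and ''name.surname'' are placeholders:

<code bash>
# List files not modified in the last 40 days (candidates for deletion,
# assuming modification time is the criterion used by the cleanup).
find /scratch/SECTOR/name.surname -type f -mtime +40 -printf '%TY-%Tm-%Td %p\n' | sort

# Possible follow-up: pack them into a single archive on /archive
# (one tar file also helps with the /archive inode limit), e.g.:
#   find /scratch/SECTOR/name.surname -type f -mtime +40 -print0 \
#     | tar czf /archive/SECTOR/name.surname/old_scratch.tgz --null -T -
</code>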

==== 2024-10-03 ====

  * test e-mail

==== 2024-09-16 ====

  * Completed data copy from /scratch/archive to /archive
  * /archive is now available readwrite from frontends and <wrap hi>readonly from nodes</wrap>
  * remember that the quota on /archive covers both data size (20TB per sector) and inodes (max 10k files/dirs per sector)<WRAP center round important 60%>
A "disk full" error means that one of the two limits has been reached.
</WRAP>
  * the data size can be checked with ''ls -lh'' on the parent directory; counting the files requires a ''find path/to/dir | wc -l'' (:!: slow :!:) -- see the example below
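A short sketch of both checks, run from a frontend (''SECTOR'' and ''name.surname'' are placeholders):

<code bash>
# Data size: as noted above, the size column shown for your sector
# directory in the parent listing reflects the space used.
ls -lh /archive

# Inodes: count files and directories -- slow on large trees!
find /archive/SECTOR/name.surname | wc -l
</code>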

==== 2024-09-10 ====

  * /scratch should now be stable (<del>hopefully</del>) UPDATE: NOPE, only partially accessible
  * /scratch/archive is now offline to allow migration to the new storage

==== 2024-09-09 ====

  * /scratch is temporarily unavailable: we are working to restore access. //UPDATE//: the issue has been fixed, so /scratch is usable again.

==== 2024-08-19 ====

  * /home/temp reactivated: it crashes when overfilled! => <wrap hi>stop using /home/temp and migrate your data to /scratch</wrap> ASAP
  * still testing /archive, it should become available soon (aiming to have it ready this week if no new issues arise)

==== 2024-06-10 ====

  * new nodes for the ECOGAL, RED-CARDINAL and ELSA projects are now available (mtx[33-40]), but they currently have no InfiniBand connection
  * ophfe3 is now available for regular use

==== 2024-06-10 ====

  * mtx03 is now OK
  * all data from the old /scratch should now be available under /scratch**/archive**, including many files that were unavailable or even deleted: <wrap hi>check for duplicates/unneeded files and clean up</wrap>! Currently there's no quota, but **quota will be enforced when moving data to /archive**!

==== 2024-06-03 ====

  * switched /scratch from GlusterFS to BeeGFS (SSD-backed):
    * it can now be used for jobs as a (faster) replacement for /home/temp
    * data from the old /scratch is now under /scratch/archive (the transfer is still in progress)
    * **do not write** under /scratch/archive
    * auto-deletion is not active (yet)
  * mtx03 is having hardware issues and will be down for some time

==== 2024-05-28 ====
  * ophfe3 is currently reserved for file transfer and login is not allowed: you can use ophfe1 and ophfe2
  * <WRAP round important>Planned maintenance starting on 2024-06-03T08:00:00 : details in the following items</WRAP>
  * data is currently being copied from /scratch to the new temporary storage that will be mounted under /scratch/archive
  * the planned maintenance should be completed within 2h from the start, so the cluster will be usable on 2024-06-03 at 10am with the new mounts
  * this maintenance only affects the /scratch area: /home is not touched (but the sooner you migrate your data from /home/temp/DDD to /scratch/DDD, the sooner your jobs get sped up -- see the sketch after this list)
  * if you don't find something from the current /scratch/DDD in the new /scratch/archive/DDD you can write to difa.csi@unibo.it to ask for an extra transfer (a sync would happen anyway, if you can wait a few days)
  * /archive (the current /scratch) will be **empty** for some time, then it will undergo its own migration to a different architecture and /scratch/archive will be moved there
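A possible way to do the migration from a frontend, assuming ''rsync'' is available there; ''DDD'' is the same placeholder used in the list above:

<code bash>
# Copy data preserving permissions and timestamps; safe to re-run,
# it only transfers what is missing or changed.
rsync -avP /home/temp/DDD/ /scratch/DDD/

# Only after verifying the copy, free the space on /home/temp:
# rm -rf /home/temp/DDD
</code>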

==== 2024-05-24 ====

  * New filesystem layout is being **__tested__ on OPHFE3**. What was /scratch is now mounted on /archive and the new (fast) scratch area is mounted on /scratch
<WRAP center round important 60%>
Please do not use /scratch on ophfe3 or consider your data **gone**!
</WRAP>

==== 2024-05-06 ====
  * /home/temp temporarily unavailable: overfilled (again) and crashed. <del>Now under maintenance.</del> Added another 4TB but **it will crash again when overfilled** :!:
  
==== 2024-04-16 ====
  * ophfe2 is now reinstalled on new HW; please test it and report issues (if any)
  
==== 2024-04-09 ====
  * <WRAP important>
Direct login to frontend nodes has been disabled, [[oph:cluster:access#step_1connecting_to_the_cluster|use the Bastion service]]; nothing changed for connections from the internal network (wired or AlmaWifi)
</WRAP>
  * ophfe2 is going to be reinstalled: please leave it free ASAP; once reinstalled it will change its IP address from 50.177 to 50.72.

==== 2024-04-05 ====
  * Deployed new authorization config: please promptly report any slowdowns or other problems to difa.csi@unibo.it .
  * Bastion (137.204.50.15) is already working and direct access to the ophfe* nodes is being phased out. Usually you only need to add "-J name.surname@137.204.50.15" to the ssh command you're using (see the example below).
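For example, to reach the frontend at 137.204.50.71 through the Bastion (''name.surname'' is your username; the ''Host'' alias in the config snippet is just a local name of your choice):

<code bash>
# One-off jump through the Bastion:
ssh -J name.surname@137.204.50.15 name.surname@137.204.50.71

# Or put it in ~/.ssh/config once and then just run "ssh oph-frontend":
#   Host oph-frontend
#       HostName 137.204.50.71
#       User name.surname
#       ProxyJump name.surname@137.204.50.15
</code>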
  
==== 2024-03-12 ====

  * **New frontend available**: a new frontend can be reached at 137.204.50.71
  * Frontend at 137.204.50.177 is now **deprecated** and will be removed soon, to be replaced by a newer one at 137.204.50.72
  
==== 2024-02-21 ====

  * 11:50 Outage resolved.
  * 06:30 Cluster operation is currently stopped due to a slurmctld error (the daemon is not listening for network connections). I'm working to resolve the outage ASAP.
  
===== 2023 =====

==== 2023-11-10 ====

<wrap alert>
</wrap>
  
==== 2023-10-20 ====

Tentatively re-enabled read/write mode for /home/temp. **Archive and delete** old data before starting new writes!
  
==== 2023-10-13 ====

/home/temp filesystem is currently offline for technical issues. Trying to reactivate **readonly** access ASAP.
filesystem will be **wiped** soon.
</wrap>

==== 2023-08-10 ====

New login node available: ophfe3 (137.204.50.73) is now usable.
slurmtop is now in the PATH, so no need to specify /home/software/utils/ .
  
==== 2023-08-01 ====

:!: VSCode is bringing the login node to a halt. Use it on your client and transfer the files.
  
===== older (undated) =====

/scratch is now available, but readonly and only from the login nodes.