oph:cluster:messages

Differences

These are the differences between the selected revision and the current version of the page.

oph:cluster:messages [2024/12/17 09:41] – [2024-12-17] diego.zuccato@unibo.it
oph:cluster:messages [2025/08/25 07:22] (current version) – [2025-08-25] Cluster recovered diego.zuccato@unibo.it
Line 7: Line 7:
 </WRAP> </WRAP>
  
-<WRAP center round alert>
-New year's timeline:
-  * On **Jan 13, 2025** /archive will be set **read-only** to avoid potential data loss.
-  * On **Jan 15, 2025** the cluster will be **completely shut down** for about 2 weeks.
-Operations will be resumed ASAP.
-</WRAP>
 +
 +===== 2025 =====
 +
 +==== 2025-08-25 ====
 +  * All nodes except bld17 (which was already down) should be fully operational, but it's not possible for us to check for dataset coherence: please **check your datasets/results**, especially the ones you were working on at the time of the blackout; **double-check** (or discard and recreate) the ones you were writing to!
 +
 +==== 2025-08-22 ==== 
 +  * Started power-on. Some disks are corrupt and require some work to be recovered.
 +  * [22:42 GMT+1] Cluster is *mostly* operational; the downed nodes will be fixed in the next few days
 +==== 2025-08-21 ==== 
 +  * Power line is still unreliable: deferring power-on until tomorrow morning
 + 
 +==== 2025-05-09 ==== 
 +  * Slowness resolved (backup completed)
 + 
 +==== 2025-05-08 ==== 
 +  * Generalized slowness: due to an ongoing backup, access to /home is really slow. A concurrent check of the underlying RAID volume made it even worse. The check has been paused and the backup is nearing completion, so the system should soon return to normal
 +  * mtx12 is offline again due to RAM issues 
 + 
 +==== 2025-04-22 ==== 
 +  * Recreated missing reservations -- please check names with ''scontrol show res''
 +  * mtx12 is (temporarily, we hope) down due to RAM issues
 + 
 +==== 2025-04-18 ==== 
 +  * Maintenance (nearly) completed. Two nodes are still down (gpu01 and gpu02), and some jobs might have failed when scheduled on misbehaving nodes (bld17 and bld18, now fixed)
 +  * **15:30 Update**: all the nodes are currently working: don't break 'em! Happy Easter!
 +==== 2025-04-10 ==== 
 +  * Created a reservation to avoid having running jobs during maintenance (**2025-04-16T10:00** to **2025-04-18T15:00**); we'll do our best to reduce downtime, so the cluster //might// come back online sooner than planned
 + 
 +==== 2025-03-31 ==== 
 +  * <del>We're experiencing issues with logins on bastion-nav from domain STUDENTI.</del> Fixed 
 + 
 +==== 2025-03-27 ==== 
 + 
 +  * <del>/scratch and many nodes are currently down due to technical issues. We're working on it.</del> [9.00] Everything appears OK
 + 
 +==== 2025-01-13 ==== 
 + 
 +  * /archive is now **read-only** to avoid potential data loss during cluster move
  
 ===== 2024 ===== ===== 2024 =====
 +
 +==== 2024-12-18 ====
 +
 +  * Power sources are redundant again. There shouldn't be unexpected shutdowns till the "big one" on 2025-01-15
  
 ==== 2024-12-17 ==== ==== 2024-12-17 ====
oph/cluster/messages.1734428512.txt.gz · Last modified: by diego.zuccato@unibo.it
