oph:cluster:messages

Differences

These are the differences between the selected revision and the current version of the page.

oph:cluster:messages [2024/12/17 09:41] – [2024-12-17] diego.zuccato@unibo.it
oph:cluster:messages [2026/01/12 12:05] (current version) – [2026-01-12] diego.zuccato@unibo.it
Line 7: Line 7:
 </WRAP> </WRAP>
  
-<WRAP center round alert> +===== 2026 ===== 
-New year's timeline + 
-  * On **Jan 13, 2025** /archive will be set **read-only** to avoid potential data loss +==== 2026-01-12 ==== 
-  * On **Jan 15, 2025** the cluster will be **completely shut down** for about 2 weeks.  +  * /archive is writable again from the frontends and bld18 
-Operations will be resumed ASAP + 
-</WRAP> +==== 2026-01-07 ==== 
 +  * Recovered (partially) from the emergency shutdown on Xmas. /archive is currently read-only and it's possible that not everything is accessible -- we're investigating the issue 
 + 
 +===== 2025 ===== 
 + 
 +==== 2025-12-25 ==== 
 +  * Emergency shutdown: data center temperature too high (>50°C), which could cause long-lasting issues 
 + 
 +==== 2025-12-19 ==== 
 +  * Maintenance shutdown. The cluster will be unavailable till 2025-12-23 (if all goes well). 
 + 
 +==== 2025-12-02 ==== 
 +  * Users from STUDENTI aren't able to reach the cluster. We're trying to determine the cause. It will be fixed ASAP. 
 + 
 +==== 2025-10-15 ==== 
 + 
 +  * **/scratch is full**: please delete unneeded data, and archive the rest; if it doesn't get better in a couple of days, we'll have to run the enforcement script that deletes older data -- **IT'S NOT POSSIBLE TO RECOVER DATA DELETED BY THE SCRIPT!** See (again) [[oph:cluster:storage#scratch|the storage page]] for more info. 
 + 
 +==== 2025-08-25 ==== 
 +  * All nodes except bld17 (which was already down) should be fully operational, but it's not possible for us to check for dataset coherence: please **check your datasets/results**, especially the ones you were working on at the time of the blackout; **double-check** (or discard and recreate) the ones you were writing to
 + 
 +==== 2025-08-22 ==== 
 +  * Started power-on. Some disks are corrupt and require some work to be recovered
 +  * [22:42 GMT+1] Cluster is *mostly* operational: the downed nodes will be fixed in the next few days 
 +==== 2025-08-21 ==== 
 +  * Power line is still unreliable: deferring power-on till tomorrow morning 
 + 
 +==== 2025-05-09 ==== 
 +  * Slowness resolved (backup completed) 
 + 
 +==== 2025-05-08 ==== 
 +  * Generalized slowness: due to an ongoing backup, access to /home is really slow. A concurrent check of the underlying RAID volume made it even worse. The check has been paused and the backup is nearing completion, so the system should soon return to normal 
 +  * mtx12 is offline again due to RAM issues 
 + 
 +==== 2025-04-22 ==== 
 +  * Recreated missing reservations -- please check names with ''scontrol show res'' 
 +  * mtx12 is (temporarily, we hope) down due to RAM issues 
 + 
 +==== 2025-04-18 ==== 
 +  * Maintenance (nearly) completed. Two nodes are still down (gpu01 and gpu02), and some jobs might have failed when scheduled on misbehaving nodes (bld17 and bld18, now fixed) 
 +  * **15:30 Update** all the nodes are currently working: don't break 'em! Happy Easter! 
 +==== 2025-04-10 ==== 
 +  * Created a reservation to avoid having running jobs during maintenance (**2025-04-16T10:00** to **2025-04-18T15:00**); we'll do our best to reduce downtime, so the cluster //might// come back online sooner than planned 
 + 
 +==== 2025-03-31 ==== 
 +  * <del>We're experiencing issues with logins on bastion-nav from domain STUDENTI.</del> Fixed 
 + 
 +==== 2025-03-27 ==== 
 + 
 +  * <del>/scratch and many nodes are currently down due to technical issues. We're working on it.</del> [9.00] Everything appears OK 
 + 
 +==== 2025-01-13 ==== 
 + 
 +  * /archive is now **read-only** to avoid potential data loss during the cluster move
  
 ===== 2024 ===== ===== 2024 =====
 +
 +==== 2024-12-18 ====
 +
 +  * Power sources are redundant again. There shouldn't be unexpected shutdowns till the "big one" on 2025-01-15
  
 ==== 2024-12-17 ==== ==== 2024-12-17 ====
oph/cluster/messages.1734428512.txt.gz · Last modified: by diego.zuccato@unibo.it
