Differences

These are the differences between the selected revision and the current version of the page.

--- oph:cluster:messages [2024/12/13 12:52] – [2024] diego.zuccato@unibo.it
+++ oph:cluster:messages [2026/05/06 12:51] (current version) – [2026-04-05] Partial resume diego.zuccato@unibo.it
@@ Line 6: / Line 6: @@
 To report issues, please write **only** to difa.csi@unibo.it with a clear description of the problem ("my job doesn't work" is __not__ a clear description but is what we usually receive...), including jobID, misbehaving node(s), steps to reproduce, etc.
 </WRAP>
+<WRAP center round alert>
+Remember that bld[15-16] are reserved for courses during the day. Jobs started on these nodes outside a lab lesson will be terminated without further notice. If you need to run jobs to prepare for an exam, just add:
+  #SBATCH --exclude=bld[15-16]
+to your job script.
+</WRAP>
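For reference, a minimal sketch of a complete job script using the exclusion above. Only the ''--exclude'' line comes from this page; the job name, time limit, task count and payload are placeholder assumptions to adapt to your own job.

```shell
#!/bin/bash
#SBATCH --job-name=exam-prep        # placeholder name (assumption)
#SBATCH --time=00:10:00             # placeholder time limit (assumption)
#SBATCH --ntasks=1                  # placeholder task count (assumption)
#SBATCH --exclude=bld[15-16]        # keep the job off the nodes reserved for courses

# Placeholder payload: report which node the job landed on.
node="$(uname -n)"
echo "Job ${SLURM_JOB_ID:-not-under-slurm} running on ${node}"
```

Outside Slurm the ''#SBATCH'' lines are plain comments, so the script can be dry-run with ''bash'' before submitting it with ''sbatch''.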
+===== 2026 =====
+==== 2026-04-06 ====
+Started resuming some nodes. The biggest air conditioner is still broken, but the others have been fixed and are currently working. We hope not to have to shut down again.
+==== 2026-04-05 ====
+The server room is experiencing overtemperature due to a failed AC: many (not all) nodes are being drained and will be resumed ASAP.
+==== 2026-03-30 ====
+Possible (hopefully unlikely) service interruption due to the removal of the electrical bypass installed on 25/12.
+In case of emergency, the cluster will be shut down without further notice between 08.00 and 09.00 and reopened as soon as possible.
+==== 2026-03-23 ====
+Planned network interruption around 06.00. The network should be unreachable for about 5 minutes, but the AD integration (the one that lets you authenticate against bastion-nav and ophfe*) could have issues restarting afterwards and will be checked at about 07.45.
+Active connections and file transfers will be dropped. Running jobs should continue working unless they're using remote resources.
+==== 2026-03-10 ====
+<del>Possible problems due to electrical maintenance. The cluster might power off without further warning, even if we've been reassured that won't happen (yeah... sure... hope it's not just like the last time...).
+</del> No unplanned shutdown this time.
+==== 2026-01-12 ====
+  * /archive is writable again from the frontends and bld18
+==== 2026-01-07 ====
+  * Recovered (partially) from the emergency shutdown on Christmas Day. /archive is currently read-only and it's possible that not everything is accessible -- we're investigating the issue
+===== 2025 =====
+==== 2025-12-25 ====
+  * Emergency shutdown: data center temperature too high (>50°C) could cause long-lasting issues
+==== 2025-12-19 ====
+  * Maintenance shutdown. The cluster will be unavailable till 2025-12-23 (if all goes well).
+==== 2025-12-02 ====
+  * Users from STUDENTI aren't able to reach the cluster. We're trying to determine the cause. It will be fixed ASAP.
+==== 2025-10-15 ====
+  * **/scratch is full**: please delete unneeded data and archive the rest; if it doesn't get better in a couple of days, we'll have to run the enforcement script that deletes older data -- **IT'S NOT POSSIBLE TO RECOVER DATA DELETED BY THE SCRIPT!** See (again) [[oph:cluster:storage#scratch|the storage page]] for more info.
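A possible way to find candidates for cleanup is sketched below. The directory layout (''/scratch/$USER''), the 30-day threshold and the helper name ''list_old_files'' are assumptions for illustration only; the actual policy of the enforcement script is described on the storage page, not here.

```shell
# Hypothetical defaults: adjust to your own directory and age threshold.
SCRATCH_DIR="${SCRATCH_DIR:-/scratch/${USER:-$(id -un)}}"
MIN_AGE_DAYS="${MIN_AGE_DAYS:-30}"

list_old_files() {
    # $1 = directory to scan, $2 = age in days.
    # Prints regular files last modified more than $2 days ago.
    find "$1" -type f -mtime +"$2" -print
}

if [ -d "$SCRATCH_DIR" ]; then
    du -sh "$SCRATCH_DIR"                            # total space used
    list_old_files "$SCRATCH_DIR" "$MIN_AGE_DAYS"    # cleanup candidates
fi
```

The sketch only lists files: review the output, archive what you still need, then delete by hand.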
+==== 2025-08-25 ====
+  * All nodes except bld17 (which was already down) should be fully operational, but it's not possible for us to check dataset consistency: please **check your datasets/results**, especially the ones you were working on at the time of the blackout; **double-check** (or discard and recreate) the ones you were writing to!
+==== 2025-08-22 ====
+  * Started power-on. Some disks are corrupt and require some work to be recovered.
+  * [22:42 GMT+1] Cluster is *mostly* operational; the downed nodes will be fixed in the next few days
+==== 2025-08-21 ====
+  * Power line is still unreliable: deferring power-on till tomorrow morning
+==== 2025-05-09 ====
+  * slowness resolved (backup completed)
+==== 2025-05-08 ====
+  * generalized slowness: due to the ongoing backup, access to /home is really slow. A concurrent check of the underlying RAID volume made it even worse. The check has been paused and the backup is nearing completion, so the system should soon return to normal
+  * mtx12 is offline again due to RAM issues
+==== 2025-04-22 ====
+  * recreated missing reservations -- please check names with ''scontrol show res''
+  * mtx12 is (temporarily, we hope) down due to RAM issues
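To check the recreated reservation names mentioned above, something like the following sketch could help. The helper ''reservation_names'' is hypothetical, and it assumes each reservation record in the ''scontrol show res'' output carries a ''ReservationName=<name>'' field.

```shell
reservation_names() {
    # Reads `scontrol show res` output on stdin and prints one
    # reservation name per line (assumes a ReservationName=<name> field).
    tr ' ' '\n' | sed -n 's/^ReservationName=//p'
}

# Only query Slurm where the client tools are installed.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show res | reservation_names
fi
```

A job can then be submitted into one of the listed reservations with ''sbatch --reservation=<name>''.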
+==== 2025-04-18 ====
+  * Maintenance (nearly) completed. Two nodes are still down (gpu01 and gpu02), and some jobs might have failed when scheduled on misbehaving nodes (bld17 and bld18, now fixed)
+  * **15:30 Update** all the nodes are currently working: don't break'em! Happy Easter!
+==== 2025-04-10 ====
+  * Created a reservation to avoid having running jobs during maintenance (**2025-04-16T10:00** to **2025-04-18T15:00**); we'll do our best to reduce downtime, so the cluster //might// come back online sooner than planned
+==== 2025-03-31 ====
+  * <del>We're experiencing issues with logins on bastion-nav from domain STUDENTI.</del> Fixed
+==== 2025-03-27 ====
+  * <del>/scratch and many nodes are currently down due to technical issues. We're working on it.</del> [9.00] Everything appears OK
+==== 2025-01-13 ====
+  * /archive is now **read-only** to avoid potential data loss during cluster move
 ===== 2024 =====
+==== 2024-12-18 ====
+  * Power sources are redundant again. There shouldn't be unexpected shutdowns till the "big one" on 2025-01-15
+==== 2024-12-17 ====
+  * Possibly unstable power source: the rotary unit failed yesterday and is being worked on by CNAF. Some nodes powered off and have been restored. Uptime is currently not guaranteed in case of blackout.
 ==== 2024-11-04 ====