oph:cluster:messages
Differences
These are the differences between the selected revision and the current version of the page.
| oph:cluster:messages [2024/08/19 09:46] – [2024-08-19] added deprecation of /home/temp and news about /archive diego.zuccato@unibo.it | oph:cluster:messages [2025/10/15 06:35] (current version) – [2025-10-15] /scratch is full diego.zuccato@unibo.it | ||
|---|---|---|---|
| Line 3: | Line 3: | ||
| Newer messages at the top. | Newer messages at the top. | ||
| - | === 2024-08-19 === | + | <WRAP center round info> |
| + | To report issues, please write **only** to difa.csi@unibo.it including a clear description of the problem ("my job doesn' | ||
| + | </ | ||
| + | |||
| + | |||
| + | ===== 2025 ===== | ||
| + | |||
| + | ==== 2025-10-15 ==== | ||
| + | * **/scratch is full**: please delete unneeded data, and archive the rest; if it doesn't get better in a couple of days, we'll have to run the enforcement script that deletes older data -- **IT'S NOT POSSIBLE TO RECOVER DATA DELETED BY THE SCRIPT!** See (again) [[oph: | ||
| + | |||
| + | ==== 2025-08-25 ==== | ||
| + | * All nodes except bld17 (that was already down) should be fully operational, | ||
| + | |||
| + | ==== 2025-08-22 ==== | ||
| + | * Started power-on. Some disks are corrupt and require some work to be recovered. | ||
| + | * [22:42 GMT+1] Cluster is *mostly* operational, | ||
| + | ==== 2025-08-21 ==== | ||
| + | * Power line is still unreliable: deferring power-on till tomorrow morning | ||
| + | |||
| + | ==== 2025-05-09 ==== | ||
| + | * slowness resolved (backup completed) | ||
| + | |||
| + | ==== 2025-05-08 ==== | ||
| + | * generalized slowness: due to ongoing backup, access to /home is really slow. A concurrent check of the underlying RAID volume made it even worse. The check has been paused and backup is nearing completion, so the system should soon return to normal | ||
| + | * mtx12 is offline again due to RAM issues | ||
| + | |||
| + | ==== 2025-04-22 ==== | ||
| + | * recreated missing reservations -- please check names with '' | ||
| + | * mtx12 is (temporarily, | ||
| + | |||
| + | ==== 2025-04-18 ==== | ||
| + | * Maintenance (nearly) completed. Two nodes are still down (gpu01 and gpu02), and some jobs might have failed when scheduled on misbehaving nodes (bld17 and bld18, now fixed) | ||
| + | * **15:30 Update** all the nodes are currently working: don't break' | ||
| + | ==== 2025-04-10 ==== | ||
| + | * Created a reservation to avoid having running jobs during maintenance (**2025-04-16T10: | ||
| + | |||
| + | ==== 2025-03-31 ==== | ||
| + | * < | ||
| + | |||
| + | ==== 2025-03-27 ==== | ||
| + | |||
| + | * < | ||
| + | |||
| + | ==== 2025-01-13 ==== | ||
| + | |||
| + | * /archive is now **read-only** to avoid potential data loss during cluster move | ||
| + | |||
| + | ===== 2024 ===== | ||
| + | |||
| + | ==== 2024-12-18 ==== | ||
| + | |||
| + | * Power sources are redundant again. There shouldn' | ||
| + | |||
| + | ==== 2024-12-17 ==== | ||
| + | |||
| + | * Possibly unstable power source: rotative unit failed yesterday and is being worked on by CNAF. Some nodes powered off and have been restored. Uptime is currently not guaranteed in case of blackout. | ||
| + | |||
| + | ==== 2024-11-04 ==== | ||
| + | |||
| + | * Started cleanup of /home/work : everything outside sector folders is being moved inside the (hopefully) correct folder: might bring the sector folder over-quota -- just clean up (delete unneeded, move to /scratch or / | ||
| + | |||
| + | ==== 2024-10-30 ==== | ||
| + | |||
| + | * < | ||
| + | * mtx18 detected a problem with a DIMM module: powered off for HW checkup | ||
| + | * /home/temp is being decommissioned: | ||
| + | |||
| + | ==== 2024-10-22 ==== | ||
| + | |||
| + | * :!: <wrap hi> | ||
| + | |||
| + | ==== 2024-10-17 ==== | ||
| + | |||
| + | * Started transfer from /home/temp to /scratch of all the remaining files: / | ||
| + | * <wrap hi> | ||
| + | |||
| + | ==== 2024-10-03 ==== | ||
| + | |||
| + | * test e-mail | ||
| + | |||
| + | ==== 2024-09-16 ==== | ||
| + | |||
| + | * Completed data copy from / | ||
| + | * /archive is now available readwrite from frontends and <wrap hi> | ||
| + | * remember that quota on /archive is both about data size (20TB per sector) and inodes (max 10k files/dirs per sector)< | ||
| + | A "disk full" error means that one of the two limits has been reached. | ||
| + | </ | ||
| + | * the data size can be checked by using 'ls -lh' on the parent directory, number of files requires a '' | ||
| + | |||
| + | ==== 2024-09-10 ==== | ||
| + | |||
| + | * /scratch should now be stable (< | ||
| + | * / | ||
| + | |||
| + | ==== 2024-09-09 ==== | ||
| + | |||
| + | * /scratch is temporarily unavailable: | ||
| + | |||
| + | ==== 2024-08-19 ==== | ||
| * /home/temp reactivated: | * /home/temp reactivated: | ||
| * still testing /archive, it should become available soon (aiming to have it ready this week if no new issues arise) | * still testing /archive, it should become available soon (aiming to have it ready this week if no new issues arise) | ||
| - | === 2024-06-10 === | + | ==== 2024-06-10 ==== |
| * new nodes for ECOGAL, RED-CARDINAL and ELSA projects are now available (mtx[33-40]), | * new nodes for ECOGAL, RED-CARDINAL and ELSA projects are now available (mtx[33-40]), | ||
| * ophfe3 is now available for regular use | * ophfe3 is now available for regular use | ||
| - | === 2024-06-10 === | + | |
| + | ==== 2024-06-10 ==== | ||
| * mtx03 is now OK | * mtx03 is now OK | ||
| - | * all data from old /scratch should now be available under /scratc**/ | + | * all data from old /scratch should now be available under /scratch**/ |
| + | |||
| + | ==== 2024-06-03 ==== | ||
| - | === 2024-06-03 === | ||
| * switched /scratch from GlusterFS to BeeGFS (SSD-backed): | * switched /scratch from GlusterFS to BeeGFS (SSD-backed): | ||
| * can now be used for jobs as a (faster) replacement for /home/temp | * can now be used for jobs as a (faster) replacement for /home/temp | ||
| Line 22: | Line 125: | ||
| * mtx03 is having hardware issues and will be down for some time | * mtx03 is having hardware issues and will be down for some time | ||
| - | === 2024-05-28 === | + | ==== 2024-05-28 |
| * ophfe3 is currently reserved for file transfer, login is not allowed: you can use ophfe1 and ophfe2 | * ophfe3 is currently reserved for file transfer, login is not allowed: you can use ophfe1 and ophfe2 | ||
| * <WRAP round important> | * <WRAP round important> | ||
| Line 31: | Line 134: | ||
| * /archive (the current /scratch) will be **empty** for some time, then it will undergo its own migration to a different architecture and / | * /archive (the current /scratch) will be **empty** for some time, then it will undergo its own migration to a different architecture and / | ||
| - | === 2024-05-24 === | + | ==== 2024-05-24 ==== |
| * New filesystem layout is being **__tested__ on OPHFE3**. What was /scratch is now mounted on /archive and the new (fast) scratch area is mounted in /scratch | * New filesystem layout is being **__tested__ on OPHFE3**. What was /scratch is now mounted on /archive and the new (fast) scratch area is mounted in /scratch | ||
| <WRAP center round important 60%> | <WRAP center round important 60%> | ||
| Line 37: | Line 141: | ||
| </ | </ | ||
| + | ==== 2024-05-06 ==== | ||
| - | === 2024-05-06 === | ||
| * /home/temp temporarily unavailable: | * /home/temp temporarily unavailable: | ||
| - | === 2024-04-16 === | + | ==== 2024-04-16 ==== |
| * ophfe2 is now reinstalled on new HW; please test it and report issues (if any) | * ophfe2 is now reinstalled on new HW; please test it and report issues (if any) | ||
| - | === 2024-04-09 === | + | ==== 2024-04-09 ==== |
| * <WRAP important> | * <WRAP important> | ||
| Direct login to frontend nodes has been disabled, [[oph: | Direct login to frontend nodes has been disabled, [[oph: | ||
| </ | </ | ||
| * ophfe2 is going to be reinstalled: | * ophfe2 is going to be reinstalled: | ||
| - | === 2024-04-05 === | + | |
| + | ==== 2024-04-05 ==== | ||
| * Deployed new authorization config: please promptly report any slowdowns or other problems to difa.csi@unibo.it . | * Deployed new authorization config: please promptly report any slowdowns or other problems to difa.csi@unibo.it . | ||
| * Bastion (137.204.50.15) is already working and direct access to ophfe* nodes is being phased out. Usually you only need to add "-J name.surname@137.204.50.15" | * Bastion (137.204.50.15) is already working and direct access to ophfe* nodes is being phased out. Usually you only need to add "-J name.surname@137.204.50.15" | ||
| - | === 2024-03-12 === | + | ==== 2024-03-12 |
| * **New frontend available**: | * **New frontend available**: | ||
| * Frontend at 137.204.50.177 is now **deprecated** and will be removed soon, to be replaced by a newer one at 137.204.50.72 | * Frontend at 137.204.50.177 is now **deprecated** and will be removed soon, to be replaced by a newer one at 137.204.50.72 | ||
| - | === 2024-02-21 === | + | ==== 2024-02-21 |
| * 11:50 Outage resolved. | * 11:50 Outage resolved. | ||
| * 06:30 Cluster operation is currently stopped due to a slurmctld error (daemon is not listening to network connections). I'm working to try to resolve the outage ASAP. | * 06:30 Cluster operation is currently stopped due to a slurmctld error (daemon is not listening to network connections). I'm working to try to resolve the outage ASAP. | ||
| + | ===== 2023 ===== | ||
| - | === 2023-11-10 === | + | ==== 2023-11-10 |
| <wrap alert> | <wrap alert> | ||
| Line 72: | Line 181: | ||
| </ | </ | ||
| - | === 2023-10-20 === | + | ==== 2023-10-20 |
| Tentatively re-enabled read/write mode for /home/temp . **Archive and delete** old data before starting new writes! | Tentatively re-enabled read/write mode for /home/temp . **Archive and delete** old data before starting new writes! | ||
| - | === 2023-10-13 === | + | ==== 2023-10-13 |
| /home/temp filesystem is currently offline due to technical issues. Trying to reactivate **readonly** access ASAP. | /home/temp filesystem is currently offline due to technical issues. Trying to reactivate **readonly** access ASAP. | ||
| Line 84: | Line 193: | ||
| filesystem will be **wiped** soon. | filesystem will be **wiped** soon. | ||
| </ | </ | ||
| - | === 2023-08-10 === | + | |
| + | ==== 2023-08-10 | ||
| New login node available: ophfe3 (137.204.50.73) is now usable. | New login node available: ophfe3 (137.204.50.73) is now usable. | ||
| slurmtop is now in path, so no need to specify / | slurmtop is now in path, so no need to specify / | ||
| - | === 2023-08-01 === | + | ==== 2023-08-01 |
| :!: VSCode is bringing the login node to a halt. Use it on your client and transfer the files. | :!: VSCode is bringing the login node to a halt. Use it on your client and transfer the files. | ||
| - | === older === | + | ===== older (undated) ===== |
| /scratch is now available, but readonly and only from the login nodes. | /scratch is now available, but readonly and only from the login nodes. | ||
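
The 2024-04-05 message says that reaching the ophfe* frontends through the bastion usually only needs an extra `-J` option. As a minimal sketch of what that looks like, the snippet below prints (it does not run) the resulting `ssh` command. The helper name `jump_cmd` is hypothetical, and it assumes the same `name.surname` account is used on both the bastion and the frontend; the addresses are the bastion (137.204.50.15) and ophfe3 (137.204.50.73) as given in the messages above.

```shell
# Bastion address taken from the 2024-04-05 message.
BASTION="137.204.50.15"

# Hypothetical helper: print the ssh command that reaches a frontend
# through the bastion. $1 = user name, $2 = frontend address.
# Assumption: the same user name is valid on both hosts.
jump_cmd() {
    printf 'ssh -J %s@%s %s@%s\n' "$1" "$BASTION" "$1" "$2"
}

# Example: ophfe3 (137.204.50.73, from the 2023-08-10 message).
jump_cmd name.surname 137.204.50.73
```

Printing the command instead of executing it keeps the sketch safe to try anywhere; replace `name.surname` with your own account before running the printed line. OpenSSH also accepts the same setting as a `ProxyJump` entry in `~/.ssh/config`, which avoids typing `-J` every time.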
oph/cluster/messages.1724060769.txt.gz · Last modified: by diego.zuccato@unibo.it
