Login messages
Newer messages at the top.
To report issues, please write only to difa.csi@unibo.it with a clear description of the problem (“my job doesn't work” is not a clear description, but it's what we usually receive…): include the job ID, the misbehaving node(s), steps to reproduce, etc.
2026
2026-03-23
Planned network interruption around 06:00. The network should be unreachable for about 5 minutes, but the AD integration (the one that lets you authenticate against bastion-nav and ophfe*) could have issues restarting afterwards; it will be checked at about 07:45.
Active connections and file transfers will be dropped. Running jobs should continue working unless they're using remote resources.
2026-03-10
Possible problems due to electrical maintenance. The cluster might power off without further warning, even though we've been reassured that won't happen (yeah… sure… hope it's not just like last time…).
No unplanned shutdown this time
2026-01-12
- /archive is writable again from the frontends and bld18
2026-01-07
- Partially recovered from the XMas emergency shutdown. /archive is currently read-only and it's possible that not everything is accessible – we're investigating the issue
2025
2025-12-25
- Emergency shutdown: the data center temperature is too high (>50°C) and could cause long-lasting issues
2025-12-19
- Maintenance shutdown. The cluster will be unavailable till 2025-12-23 (if all goes well).
2025-12-02
- Users from STUDENTI aren't able to reach the cluster. We're trying to determine the cause. It will be fixed ASAP.
2025-10-15
- /scratch is full: please delete unneeded data and archive the rest; if it doesn't get better in a couple of days, we'll have to run the enforcement script that deletes older data – IT'S NOT POSSIBLE TO RECOVER DATA DELETED BY THE SCRIPT! See (again) the storage page for more info.
2025-08-25
- All nodes except bld17 (which was already down) should be fully operational, but it's not possible for us to check dataset coherence: please check your datasets/results, especially the ones you were working on at the time of the blackout; double-check (or discard and recreate) the ones you were writing to!
2025-08-22
- Started power-on. Some disks are corrupted and will require some work to recover.
- [22:42 GMT+1] Cluster is *mostly* operational, the downed nodes will be fixed in the next days
2025-08-21
- Power line is still unreliable: deferring power-on till tomorrow morning
2025-05-09
- slowness resolved (backup completed)
2025-05-08
- generalized slowness: due to an ongoing backup, access to /home is really slow. A concurrent check of the underlying RAID volume made it even worse. Checks have been paused and the backup is nearing completion, so the system should soon return to normal
- mtx12 is offline again due to RAM issues
2025-04-22
- recreated missing reservations – please check names with scontrol show res
- mtx12 is (temporarily, we hope) down due to RAM issues
2025-04-18
- Maintenance (nearly) completed. Two nodes are still down (gpu01 and gpu02), and some jobs might have failed when scheduled on misbehaving nodes (bld17 and bld18, now fixed)
- 15:30 UPDATE: all nodes are currently working – don't break 'em! Happy Easter!
2025-04-10
- Created a reservation to avoid having running jobs during maintenance (2025-04-16T10:00 to 2025-04-18T15:00); we'll do our best to reduce downtime, so the cluster might come back online sooner than planned
2025-03-31
We're experiencing issues with logins on bastion-nav from the STUDENTI domain. UPDATE: fixed
2025-03-27
/scratch and many nodes are currently down due to technical issues. We're working on it. [9:00] UPDATE: everything appears OK
2025-01-13
- /archive is now read-only to avoid potential data loss during cluster move
2024
2024-12-18
- Power sources are redundant again. There shouldn't be unexpected shutdowns till the “big one” on 2025-01-15
2024-12-17
- Possibly unstable power source: the rotary unit failed yesterday and is being worked on by CNAF. Some nodes powered off and have been restored. Uptime is currently not guaranteed in case of a blackout.
2024-11-04
- Started cleanup of /home/work: everything outside the sector folders is being moved into the (hopefully) correct folder; this might push some sector folders over quota – just clean up (delete unneeded data, move the rest to /scratch or /archive)!
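To see what is filling a sector folder, a quick check along these lines can help (a sketch; /home/work/SECTOR is a placeholder for your sector's folder):
du -sh /home/work/SECTOR/* | sort -h   # per-entry sizes, biggest last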
2024-10-30
- Slurm is experiencing an unexpected misbehaviour and does not accept job submissions: we're already working on it. UPDATE: reverted a problematic change, more tests required
- mtx18 detected a problem with a DIMM module: powered off for HW checkup
- /home/temp is being decommissioned: data have already been moved to /scratch
2024-10-22
Tomorrow nodes mtx[30-40] will be temporarily shut down for maintenance; jobs running on those nodes will be aborted without further notice
2024-10-17
- Started transferring all remaining files from /home/temp to /scratch: /home/temp/sector/name.surname → /scratch/sector/name.surname/name.surname (note the doubling of name.surname to avoid clashes with existing files)
- automatic deletion from /scratch is being worked on: all files older than 40 days will be deleted automatically very soon – archive what you need to keep! There will be no way to recover deleted files!
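To preview which of your files would be caught by the deletion, something like this may help (a sketch; the path is a placeholder for your own area):
find /scratch/sector/name.surname -type f -mtime +40   # files not modified in the last 40 days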
2024-10-03
- test e-mail
2024-09-16
- Completed data copy from /scratch/archive to /archive
- /archive is now available readwrite from frontends and readonly from nodes
- remember that the quota on /archive covers both data size (20TB per sector) and inodes (max 10k files/dirs per sector). A “disk full” error means that one of the two limits has been reached.
- the data size can be checked by using 'ls -lh' on the parent directory; counting files requires a find path/to/dir | wc -l (slow)
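Putting the two checks together (a sketch; SECTOR is a placeholder for your sector's directory):
ls -lh /archive                # data size as reported for each sector directory
find /archive/SECTOR | wc -l   # inode usage: counts files and dirs (slow on big trees)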
2024-09-10
- /scratch should now be stable (hopefully). UPDATE: NOPE, only partially accessible
- /scratch/archive has been taken offline to allow migration to the new storage
2024-09-09
- /scratch is temporarily unavailable: we are working to restore access. UPDATE: the issue has been fixed and /scratch is usable again.
2024-08-19
- /home/temp reactivated: it crashes when overfilled! ⇒ stop using /home/temp and migrate your data to /scratch ASAP (see the example after this list)
- still testing /archive, it should become available soon (aiming to have it ready this week if no new issues arise)
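A possible migration recipe (a sketch; sector/name.surname follows the path pattern used elsewhere on this page and is a placeholder):
rsync -a /home/temp/sector/name.surname/ /scratch/sector/name.surname/   # copy, preserving attributes
rm -rf /home/temp/sector/name.surname   # only after verifying the copy!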
2024-06-10
- new nodes for ECOGAL, RED-CARDINAL and ELSA projects are now available (mtx[33-40]), but they are currently without InfiniBand connection
- ophfe3 is now available for regular use
2024-06-10
- mtx03 is now OK
- all data from old /scratch should now be available under /scratch/archive, including many files that were unavailable or even deleted: check for duplicates/unneeded files and clean up! Currently there's no quota, but quota will be enforced when moving data to /archive!
2024-06-03
- switched /scratch from GlusterFS to BeeGFS (SSD-backed):
- can now be used for jobs as a (faster) replacement for /home/temp
- data from old /scratch is now under /scratch/archive (transfer is still proceeding)
- do not write under /scratch/archive
- auto-deletion is not active (yet)
- mtx03 is having hardware issues and will be down for some time
2024-05-28
- ophfe3 is currently reserved for file transfer, login is not allowed: you can use ophfe1 and ophfe2
- Planned maintenance starting on 2024-06-03T08:00:00: details in the following items
- data is currently being copied from /scratch to the new temporary storage that will be under /scratch/archive
- planned maintenance should be completed within 2h of the start, so the cluster will be usable on 2024-06-03 at 10am with the new mounts
- this maintenance only affects the /scratch area: /home is not touched (but the sooner you migrate your data from /home/temp/DDD to /scratch/DDD, the sooner your jobs get sped up)
- if you don't find something from the current /scratch/DDD in the new /scratch/archive/DDD, you can write to difa.csi@unibo.it to ask for an extra transfer (a sync would happen anyway, if you can wait a few days)
- /archive (the current /scratch) will be empty for some time, then it will undergo its own migration to a different architecture and /scratch/archive will be moved there
2024-05-24
- New filesystem layout is being tested on OPHFE3. What was /scratch is now mounted on /archive and the new (fast) scratch area is mounted in /scratch
Please do not use /scratch on ophfe3 or consider your data gone!
2024-05-06
- /home/temp temporarily unavailable: overfilled (again) and crashed. Now under maintenance. UPDATE: added another 4TB, but it will crash again when overfilled
2024-04-16
- ophfe2 is now reinstalled on new HW; please test it and report issues (if any)
2024-04-09
Direct login to frontend nodes has been disabled, use the Bastion service; nothing changed for connections from the internal network (wired or AlmaWifi)
- ophfe2 is going to be reinstalled: please leave it free ASAP; once reinstalled, it will change its IP address from 50.177 to 50.72.
2024-04-05
- Deployed new authorization config: please promptly report any slowdowns or other problems to difa.csi@unibo.it.
- Bastion (137.204.50.15) is already working and direct access to ophfe* nodes is being phased out. Usually you only need to add “-J name.surname@137.204.50.15” to the ssh command you're using.
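For example, to reach the frontend at 137.204.50.71 (a sketch; adapt the username and target host):
ssh -J name.surname@137.204.50.15 name.surname@137.204.50.71
The same can be made permanent in ~/.ssh/config ('ophfe' is a hypothetical alias):
Host ophfe
    HostName 137.204.50.71
    User name.surname
    ProxyJump name.surname@137.204.50.15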
2024-03-12
- New frontend available: a new frontend can be reached at 137.204.50.71
- Frontend at 137.204.50.177 is now deprecated and will be removed soon, to be replaced by a newer one at 137.204.50.72
2024-02-21
- 11.50 Outage resolved.
- 06:30 Cluster operation is currently stopped due to a slurmctld error (the daemon is not listening for network connections). I'm working to resolve the outage ASAP.
2023
2023-11-10
Old (pre-August) backups of data in /scratch will be deleted on 2023-11-27. If you have very important data there, verify it before the deadline.
2023-10-20
Tentatively re-enabled read/write mode for /home/temp. Archive and delete old data before starting new writes!
2023-10-13
/home/temp filesystem is currently offline due to technical issues. We're trying to reactivate readonly access ASAP.
11:05 UPDATE: /home/temp is now available in readonly mode. Please archive the data you need to keep; the filesystem will be wiped soon.
2023-08-10
New login node available: ophfe3 (137.204.50.73) is now usable. slurmtop is now in the PATH, so there's no need to specify /home/software/utils/.
2023-08-01
VSCode is bringing the login node to a halt. Run it on your own machine instead and transfer the files.
older (undated)
/scratch is now available, but readonly and only from the login nodes.
Verify the archived data. Usually you'll find a file.tar for every directory. Associated with that file you may also find:
- file-extra.tar, with files that were deleted from only one replica (they prevented removing “empty” folders). You can probably delete these safely.
- file-extra2.tar, which contains files that are (most likely, but not always) dupes of the ones in file.tar
Backups will be deleted before 2023-12-20 (precise date TBD), so be sure to verify your data ASAP: once the backups are deleted there will be no way to recover.
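To check an archive without unpacking it, something like this may help (a sketch; file.tar stands for any of the archives described above):
tar -tvf file.tar   # list the contents
tar -df file.tar    # compare with files on disk (works only if the originals still exist at the same paths)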
