Login messages

Newer messages at the top.

To report issues, please write only to difa.csi@unibo.it with a clear description of the problem (“my job doesn't work” is not a clear description, but it is what we usually receive…): include the job ID, the misbehaving node(s), steps to reproduce, etc.

2024

2024-11-04

  • Started cleanup of /home/work: everything outside the sector folders is being moved into the (hopefully) correct sector folder. This might push a sector folder over quota – just clean up (delete unneeded files, or move them to /scratch or /archive)!

2024-10-30

  • Slurm is experiencing an unexpected misbehaviour and does not accept job submissions: we're already working on it. UPDATE: reverted a problematic change, more tests required
  • mtx18 detected a problem with a DIMM module: powered off for HW checkup
  • /home/temp is being decommissioned: data have already been moved to /scratch

2024-10-22

  • :!: Tomorrow nodes mtx[30-40] will be temporarily shut down for maintenance; jobs running on those nodes will be aborted without further notice

2024-10-17

  • Started transfer from /home/temp to /scratch of all the remaining files: /home/temp/sector/name.surname → /scratch/sector/name.surname/name.surname (note the doubling of name.surname to avoid clashes with existing files)
  • automatic deletion from /scratch is being worked on: all files older than 40 days will soon be deleted automatically, so archive what you need to keep! There will be no way to recover deleted files!
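
A minimal way to preview which of your files would be affected, assuming the 40-day limit refers to the modification time; sector/name.surname is a placeholder path to adapt:

    # list your files on /scratch not modified in the last 40 days
    find /scratch/sector/name.surname -type f -mtime +40 -ls

    # pack the ones you need to keep into a single tar on /archive
    # (a single tar also helps staying under the /archive inode quota)
    find /scratch/sector/name.surname -type f -mtime +40 -print0 | tar czf /archive/sector/old-scratch.tar.gz --null -T -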

2024-10-03

  • test e-mail

2024-09-16

  • Completed data copy from /scratch/archive to /archive
  • /archive is now available readwrite from frontends and readonly from nodes
  • remember that quota on /archive is both about data size (20TB per sector) and inodes (max 10k files/dirs per sector)

    A “disk full” error means that one of the two limits has been reached.

  • the data size can be checked with 'ls -lh' on the parent directory; counting the files requires a 'find path/to/dir | wc -l' (:!: slow :!:)
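
For example, from a frontend (the path below is a placeholder; 'du -sh' is a common alternative for the total size, but like 'find' it walks the whole tree and can be slow):

    # total size of a sector folder (recursive)
    du -sh /archive/sector

    # number of files and directories (counts toward the 10k inode limit)
    find /archive/sector | wc -l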

2024-09-10

  • /scratch should now be stable (hopefully). UPDATE: NOPE, only partially accessible
  • /scratch/archive has been taken offline to allow migration to the new storage

2024-09-09

  • /scratch is temporarily unavailable: we are working to restore the access. UPDATE: the issue has been fixed now so /scratch is usable again.

2024-08-19

  • /home/temp reactivated: it crashes when overfilled! ⇒ stop using /home/temp and migrate your data to /scratch ASAP
  • still testing /archive, it should become available soon (aiming to have it ready this week if no new issues arise)

2024-06-10

  • new nodes for ECOGAL, RED-CARDINAL and ELSA projects are now available (mtx[33-40]), but they are currently without InfiniBand connection
  • ophfe3 is now available for regular use

2024-06-10

  • mtx03 is now OK
  • all data from the old /scratch should now be available under /scratch/archive, including many files that had been unavailable or even deleted: check for duplicates/unneeded files and clean up! Currently there's no quota, but quota will be enforced when moving data to /archive!

2024-06-03

  • switched /scratch from GlusterFS to BeeGFS (SSD-backed):
    • can now be used for jobs as a (faster) replacement for /home/temp
    • data from old /scratch is now under /scratch/archive (transfer is still proceeding)
    • do not write under /scratch/archive
    • auto-deletion is not active (yet)
  • mtx03 is having hardware issues and will be down for some time

2024-05-28

  • ophfe3 is currently reserved for file transfer, so login is not allowed: you can use ophfe1 and ophfe2
  • Planned maintenance starting on 2024-06-03T08:00:00: details in the following items

  • data is currently being copied from /scratch to the new temporary storage that will be under /scratch/archive
  • planned maintenance should be completed within 2h from the start, so the cluster will be usable on 2024-06-03 at 10am with the new mounts
  • this maintenance only affects the /scratch area: /home is not touched (but the sooner you migrate your data from /home/temp/DDD to /scratch/DDD, the sooner your jobs get sped up – see the sketch after this list)
  • if you don't find something from the current /scratch/DDD in the new /scratch/archive/DDD you can write to difa.csi@unibo.it to ask for an extra transfer (a sync would happen anyway, if you can wait a few days)
  • /archive (the current /scratch) will be empty for some time, then it will undergo its own migration to a different architecture and /scratch/archive will be moved there
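
A minimal migration sketch, where DDD is your sector and name.surname your own folder (both placeholders); verify the copy before deleting anything from /home/temp:

    # copy data to the new scratch, preserving permissions, times and links
    rsync -aHv /home/temp/DDD/name.surname/ /scratch/DDD/name.surname/

    # re-run as a dry-run to confirm nothing is left to transfer
    rsync -aHvn /home/temp/DDD/name.surname/ /scratch/DDD/name.surname/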

2024-05-24

  • New filesystem layout is being tested on OPHFE3. What was /scratch is now mounted on /archive and the new (fast) scratch area is mounted on /scratch

Please do not use /scratch on ophfe3, or consider your data gone!

2024-05-06

  • /home/temp temporarily unavailable: overfilled (again) and crashed. Now under maintenance. Added another 4TB but it will crash again when overfilled :!:

2024-04-16

  • ophfe2 is now reinstalled on new HW; please test it and report issues (if any)

2024-04-09

  • Direct login to frontend nodes has been disabled, use the Bastion service; nothing changed for connections from the internal network (wired or AlmaWifi)

  • ophfe2 is going to be reinstalled: please leave it free ASAP; once reinstalled it will change its IP address from 50.177 to 50.72.

2024-04-05

  • Deployed new authorization config: please promptly report any slowdowns or other problems to difa.csi@unibo.it.
  • Bastion (137.204.50.15) is already working and direct access to ophfe* nodes is being phased out. Usually you only need to add “-J name.surname@137.204.50.15” to the ssh command you're using.
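
For example (the addresses are taken from this page; the ~/.ssh/config entry is just one possible setup, using the ophfe2 address 137.204.50.72 mentioned elsewhere on this page):

    # one-off login through the bastion to a frontend
    ssh -J name.surname@137.204.50.15 name.surname@137.204.50.72

    # equivalent ~/.ssh/config entry (afterwards just: ssh ophfe2)
    # Host ophfe2
    #     HostName 137.204.50.72
    #     User name.surname
    #     ProxyJump name.surname@137.204.50.15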

2024-03-12

  • New frontend available: a new frontend can be reached at 137.204.50.71
  • Frontend at 137.204.50.177 is now deprecated and will be removed soon, to be replaced by a newer one at 137.204.50.72

2024-02-21

  • 11:50 Outage resolved.
  • 06:30 Cluster operation is currently stopped due to a slurmctld error (daemon is not listening to network connections). I'm working to try to resolve the outage ASAP.

2023

2023-11-10

Old (pre-August) backups of data in /scratch will be deleted on 2023-11-27. If you have very important data, verify it before the deadline.

2023-10-20

Tentatively re-enabled read/write mode for /home/temp. Archive and delete old data before starting new writes!

2023-10-13

/home/temp filesystem is currently offline for technical issues. Trying to reactivate readonly access ASAP.

11:05 UPDATE: /home/temp is now available in readonly mode. Please archive the data you need to keep: the filesystem will be wiped soon.

2023-08-10

New login node available: ophfe3 (137.204.50.73) is now usable. slurmtop is now in the PATH, so there is no need to specify /home/software/utils/.

2023-08-01

:!: VSCode is bringing the login node to a halt. Use it on your own client and transfer the files.

older (undated)

/scratch is now available, but readonly and only from the login nodes.

Verify the archived data (a quick way to inspect an archive is sketched after the list below). Usually you'll find a file.tar for every directory. Associated with that file you may also find:

  • file-extra.tar contains files that were deleted from only one replica (they prevented removing “empty” folders). You can probably safely delete these.
  • file-extra2.tar contains files that are (most likely, but not always) duplicates of the ones in file.tar
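
To inspect an archive before deciding what to keep (file names below are placeholders for your own tar files):

    # list the contents of a tar without extracting it
    tar -tvf file.tar

    # compare the two listings to spot the duplicates
    diff <(tar -tf file.tar | sort) <(tar -tf file-extra2.tar | sort)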

Backups will be deleted before 2023-12-20 (precise date TBD), so be sure to verify your data ASAP: once the backups are deleted there will be no way to recover.
