Stampede2 /scratch Update Monday 2 March 2020

Posted by Greg Umbay on Mar 2, 2020 12:09:44 PM

The Stampede2 admins are working diligently to recover the /scratch filesystem. This recovery is expected to take at least one more day. We will provide an update after the filesystem has been recovered. Below is a repeat of the notice that went out on Thursday afternoon.

TACC staff have deactivated the affected Lustre storage target on all login and compute nodes to avoid continued hangs when trying to access files residing on the offline storage target.

This should impact fewer than 2% of the files in the /scratch filesystem. However, any attempt to read or write a file on this target will result in an I/O error reporting "Cannot send after transport endpoint shutdown", and a listing of the problem files will show question marks in place of the permissions, ownership, size, and date, like this:

# ls -l
-????????? ? ?  ?      ?     ? job.1536_32N

Users should check their pending jobs to ensure that any files being read from or written to by those jobs are available in the filesystem without the above error message. Users will not encounter errors when creating new files on the filesystem. Also, before submitting new jobs, users should confirm that any input/output files or executables needed by the job (including shared libraries) are available.

The Slurm partitions/queues will be reopened at 2:00 PM US Central time to allow jobs to run.

TACC system administrators are continuing to work with the filesystem vendor to restore the offline storage target as soon as possible. User News will be updated with additional status updates.

Please submit any questions you may have via the TACC User Portal.
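
The file check described above can be scripted rather than done by eye. The following is only a minimal sketch, not a TACC-provided tool: the function name `find_unavailable` is hypothetical, and it assumes that a file on the offline storage target raises an `OSError` when it is stat'ed or opened for reading, which is consistent with the I/O error and question-mark listing described in this notice but has not been verified against the affected system.

```python
import os


def find_unavailable(paths):
    """Return (path, reason) pairs for paths that cannot be stat'ed or read.

    Assumption: files on the offline storage target fail with OSError
    on stat or read. Directories are stat'ed but not opened.
    """
    problems = []
    for path in paths:
        try:
            os.stat(path)  # fails for missing or unreachable files
            if os.path.isfile(path):
                with open(path, "rb") as handle:
                    handle.read(1)  # touch the data, not just the metadata
        except OSError as exc:
            problems.append((path, exc.strerror or str(exc)))
    return problems
```

Passing the list of input files, executables, and shared libraries a job depends on returns the ones that raised an error, along with the system's error text. Reading a single byte only exercises the start of each file, so a clean result is best treated as a first pass rather than proof that the whole file is readable.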