Stampede2 /scratch Update Thursday 27 February 2020

Posted by Greg Umbay on Feb 27, 2020 1:50:02 PM

The Stampede2 /scratch filesystem recovery work is anticipated to take another 2-3 days before the filesystem is completely available again. In the meantime, TACC staff have deactivated the affected Lustre storage target on all login and compute nodes to prevent further hangs when accessing files that reside on the offline storage target.

 

This should affect fewer than 2% of the files in the /scratch filesystem. However, any attempt to read or write a file on this target will result in an I/O error reporting “Cannot send after transport endpoint shutdown”, and a listing of the affected files will show question marks in place of the permissions, ownership, size, and date, like this:

# ls -l

-????????? ? ?   ?       ?      ? job.1536_32N

 

Users should check their pending jobs to ensure that any files those jobs read from or write to are available in the filesystem without the above error message. Creating new files on the filesystem will not produce errors. Before submitting new jobs, users should also confirm that any input/output files and executables the job needs (including shared libraries) are available. The Slurm partitions/queues will be reopened at 2:00 PM US Central time to allow jobs to run.
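One way to run the check described above over a list of files is a short script. The following is only a minimal sketch and not a TACC-provided tool: the function name find_unreadable is hypothetical, and it assumes that a file on the offline target raises an OSError when its metadata or its first byte is requested. It is a quick check, not a guarantee that a job will run cleanly.

```python
import os
import sys


def find_unreadable(paths):
    """Return (path, reason) pairs for files that cannot be stat'ed or read.

    Hypothetical helper: assumes an unavailable file raises OSError.
    """
    problems = []
    for path in paths:
        try:
            os.stat(path)                 # fails if the metadata is unavailable
            with open(path, "rb") as fh:  # fails if the data cannot be read
                fh.read(1)
        except OSError as exc:
            problems.append((path, exc.strerror or str(exc)))
    return problems


if __name__ == "__main__":
    # Pass the files a job depends on as command-line arguments.
    for path, reason in find_unreadable(sys.argv[1:]):
        print(f"UNAVAILABLE: {path}: {reason}")
```

Saved under any name, the script can be given the input files, executables, and libraries a job depends on as arguments. It prints one line per file that could not be accessed and prints nothing when every file is readable. Directories and missing paths are also reported, since they cannot be opened as regular files.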

 

TACC system administrators are continuing to work with the filesystem vendor to restore the offline storage target as soon as possible. User News will be updated as more information becomes available.

Please submit any questions via the TACC User Portal.

https://portal.tacc.utexas.edu/tacc-consulting