Managing I/O on TACC Resources
Last update: January 14, 2020

The TACC Global Shared File System, Stockyard, is mounted on (nearly) all TACC HPC resources as the /work ($WORK) directory. As TACC's user base continues to expand, the stress on the Global Shared File System ($WORK) increases daily. This file system is accessible to all TACC users, and therefore experiences a lot of I/O activity (reading and writing to disk) as users run their jobs, reading and generating data including intermediate and checkpoint files.

The greatest stressor of the file system is heavy input/output (I/O): a program or executable that reads from or writes to disk excessively. Examples of intensive I/O that could affect the system include, but are not limited to:

  • reading/writing 100+ GB to checkpoint or output files,
  • running with 4096+ MPI tasks that all read/write individual files,
  • Python jobs using more than 2-3 Python modules such as pandas, numpy, matplotlib, mpi4py, etc.

The stress on the $WORK file system has increased to the extent that TACC staff now recommends new file system and job submission guidelines in order to maintain file system stability. If your jobs or activities are stressing the $WORK file system, then every other user's jobs and activities are impacted, and the system admins may resort to cancelling your jobs and suspending your access to the queues.

If you know your jobs will require significant I/O, please submit a support ticket and an HPC consultant will work with you.

Recommended File Systems Usage

Consider that $HOME and $WORK are for storage and keeping track of important items. Actual job activity, reading and writing to disk, should be offloaded to your resource's $SCRATCH file system. You can start a job from anywhere but the actual work of the job should occur only on the $SCRATCH partition. You can save original items to $HOME or $WORK so that you can copy them over to $SCRATCH if you need to re-generate results.

This table outlines TACC staff's new recommended guidelines for file system usage:

| File System | Recommended Use | Notes |
|---|---|---|
| $HOME | cron jobs, scripts and templates, environment settings | Each user's $HOME directory is backed up. |
| $WORK | software installations, original datasets that can't be reproduced | The Stockyard file system is NOT backed up. Ensure that your data is backed up to [Ranch](/user-guides/ranch) long-term storage. |
| $SCRATCH | reproducible datasets, I/O files: temporary files, checkpoint/restart files, job output files | All $SCRATCH file systems are subject to purge. |

Best Practices for Minimizing I/O

Here we present some Best Practices aimed at minimizing I/O impact on all TACC resources.

Redirect I/O Away From $WORK

The purpose of these guidelines is to move I/O activity away from the shared file system, $WORK, onto each resource's own local storage, usually the $SCRATCH file system.

Run Jobs out of the Scratch File System

Each TACC resource has its own Scratch file system, /scratch/*username*, accessible by the $SCRATCH environment variable and the cds alias.

Scratch file systems are not shared across TACC and are specific to one resource. Scratch file systems have neither file count nor file size quotas, but are subject to periodic file purges should total disk usage exceed a safety threshold.

TACC staff recommends you run your jobs out of your resource's $SCRATCH file system instead of the global $WORK file system. To run your jobs out of $SCRATCH, copy (stage) the entire executable/package along with all needed job input files and/or needed libraries to your resource's $SCRATCH directory.

Compute nodes should not reference the $WORK file system except to stage data in or out, and then only before or after jobs.

Your job script should also direct the job's output to the local scratch directory:

# run from the resource's scratch file system
cd $SCRATCH

# stage executable and data
mkdir testrunA
cp $WORK/myprogram testrunA
cp $WORK/jobinputdata testrunA

# launch program, reading the staged input and writing output to scratch
ibrun testrunA/myprogram testrunA/jobinputdata > testrunA/output

# copy results back to permanent storage once job is done
cp testrunA/output $WORK/savetestrunA
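
A job script can also check that it is really running from $SCRATCH before launching anything. The following is a minimal sketch rather than a TACC-provided tool: the function name is_under_scratch is hypothetical, and the sketch assumes that $SCRATCH is set in the job environment.

```shell
#!/bin/bash
# Hypothetical helper (not part of any TACC tooling): succeed only if the
# given directory lies inside the directory named by $SCRATCH.
is_under_scratch() {
    local dir scratch
    # Resolve both paths physically so symbolic links do not fool the check.
    dir=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    scratch=$(cd "${SCRATCH:?SCRATCH is not set}" 2>/dev/null && pwd -P) || return 1
    case "$dir/" in
        "$scratch"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use near the top of a job script:
#   is_under_scratch "$PWD" || { echo "run this job from \$SCRATCH" >&2; exit 1; }
```

The trailing slash in the comparison keeps a sibling directory such as /scratch/user2 from matching /scratch/user.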

Avoid Writing One File Per Process

If your program regularly writes data to disk from each process, for instance for checkpointing, avoid writing output to a separate file for each process, as this will quickly overwhelm the metadata server. Instead, employ a library such as HDF5 or NetCDF to write a single parallel file for the checkpoint. A one-time generation of one file per process (for instance at the end of your run) is less serious, but even then you should consider writing parallel files.

Alternatively, you could write these per-process files to each compute node's /tmp directory (see below).
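
If per-process files are written to a node-local directory, copying them one by one to the shared file system afterwards still creates many files there. One option is to pack them into a single archive first. This is only a sketch under stated assumptions: the function name pack_node_output and the archive naming are hypothetical, and the function handles only the node on which it runs, so a multi-node job would need to run it once on every node.

```shell
#!/bin/bash
# Hypothetical helper: pack a node-local directory of per-process files into
# one tar archive inside a destination directory, and print the archive path.
pack_node_output() {
    local src_dir=$1 dest_dir=$2
    # bash sets HOSTNAME itself; including it keeps archives from different nodes apart
    local archive="$dest_dir/output_${HOSTNAME:-node}.tar"
    mkdir -p "$dest_dir" || return 1
    # -C changes into src_dir first, so the archive stores relative paths only
    tar -cf "$archive" -C "$src_dir" . || return 1
    echo "$archive"
}

# Possible use at the end of a job script (directory names are placeholders):
#   pack_node_output /tmp/my_rank_files "$SCRATCH/myOutputs"
```

The shared file system then sees one file creation per node instead of one per process.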

Use the Compute Node's /tmp Storage Space

Additionally, each compute node has its own /tmp directory. You can run I/O operations from that /tmp directory and then, at the end of the job, write the final results from /tmp to /scratch. This greatly reduces the load on the file system, and you may even see some performance improvement, as the job will not have to access the file system repeatedly during the run, which eliminates network latency and the impact of other users' jobs.

Data stored in the /tmp directory is as temporary as its name indicates, lasting only for the duration of your job. Only the first node in the job's hostlist will write output to its /tmp directory. Check the node's quota to ensure storage is adequate. To use the /tmp directory, you could put this in your job script:

# launch program and put output to /tmp
ibrun myprogram > /tmp/output.txt
# move data to file system at end of job
cp /tmp/output.txt $SCRATCH/myOutputs
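
Before relying on /tmp, a job script can confirm that enough space is available on the node. The sketch below assumes that the free space reported by df reflects what the job may actually use; the function names free_kb and has_free_kb are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the space available in a directory's file
# system, in KB (defaults to /tmp).
free_kb() {
    # -P forces the portable one-line-per-file-system format;
    # -k reports sizes in 1024-byte blocks. Column 4 is "Available".
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed only if at least $1 KB are free in directory $2.
has_free_kb() {
    local needed=$1 dir=${2:-/tmp} avail
    avail=$(free_kb "$dir") || return 1
    [ -n "$avail" ] && [ "$avail" -ge "$needed" ]
}

# Possible use before launching (the size is a placeholder, here about 10 GB):
#   has_free_kb 10485760 /tmp || { echo "not enough space in /tmp" >&2; exit 1; }
```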

Avoid Repeated Reading/Writing to the Same File

Jobs that have multiple tasks that read and/or write to the same file will often suspend the file in question in an open state in order to accommodate the changes happening to it. Please make sure that your I/O activity is not being directed to a single file repeatedly. You can use /tmp on the node to store this file if the condition cannot be avoided. If you require shared file operations, then please ensure your I/O is optimized.
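
One way to keep repeated updates off the shared file system is to work on a node-local copy and write it back once. This is a minimal sketch, not a TACC tool: the function name with_local_copy is hypothetical, and the approach only makes sense when all tasks touching the file run on the same node, since each node would otherwise hold a different copy.

```shell
#!/bin/bash
# Hypothetical helper: run a command against a node-local copy of a file,
# then write the result back to the original location once.
# usage: with_local_copy <shared_file> <command> [args...]
# The command receives the path of the local copy as its last argument.
with_local_copy() {
    local shared=$1; shift
    local local_copy status
    local_copy=$(mktemp /tmp/localcopy.XXXXXX) || return 1
    cp "$shared" "$local_copy" || { rm -f "$local_copy"; return 1; }
    "$@" "$local_copy"
    status=$?
    # Copy back only if the command succeeded, so a failed run leaves
    # the original file untouched.
    if [ "$status" -eq 0 ]; then
        cp "$local_copy" "$shared" || status=1
    fi
    rm -f "$local_copy"
    return "$status"
}
```

With this pattern the shared file is read once and written once, regardless of how many times the command updates its local copy.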

Monitor Your File System Quotas

If you are close to your file quota on either the $WORK or $HOME file system, your job may fail because it cannot write its output, and attempts to write beyond quota put additional stress on the file systems.

Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.

To display a summary of your TACC project balances and disk quotas at any time, execute:

login1$ /usr/local/etc/taccinfo # Generally more current than balances displayed on the portals.

Monitor your file system's usage using the taccinfo command. This same screen appears upon login to a TACC resource.

---------------------- Project balances for user  ----------------------
| Name           Avail SUs     Expires | Name           Avail SUs     Expires |
| Allocation            -1             | Alloc             -10037             |
------------------------ Disk quotas for user   -------------------------
| Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
| /home1              1.5      25.0     6.02          741      400000    0.19 |
| /work             107.5    1024.0    10.50         2434     3000000    0.08 |
| /scratch1           0.0       0.0     0.00            3           0    0.00 |
| /scratch3       41829.5       0.0     0.00       246295           0    0.00 |
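
The disk quota rows in this output can be checked automatically, for example at the start of a job script. The sketch below assumes that taccinfo prints quota rows in exactly the layout of the sample above; the function name quota_warnings and the default threshold of 80 percent are hypothetical choices.

```shell
#!/bin/bash
# Hypothetical helper: read taccinfo-style output on stdin and print every
# file system whose space or file-count usage exceeds a percentage threshold.
# Assumes quota rows have the layout shown in the sample output:
#   | /path  usageGB  limitGB  %used  files  filelimit  %used |
quota_warnings() {
    local threshold=${1:-80}
    awk -v t="$threshold" '
        # Quota rows start with "|" followed by an absolute path.
        $1 == "|" && $2 ~ /^\// && NF >= 9 {
            if ($5 + 0 > t || $8 + 0 > t)
                printf "%s: %s%% of space, %s%% of file count used\n", $2, $5, $8
        }'
}

# Possible use:
#   /usr/local/etc/taccinfo | quota_warnings 80
```

Rows with a limit of 0, such as the scratch file systems in the sample, report 0.00 percent used and therefore never trigger a warning.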

Reduce I/O with OOOPS

TACC staff has developed a tool called OOOPS (Optimal Overloaded I/O Protection System) to reduce the footprint of high I/O jobs. For jobs that have a particularly high I/O footprint, we now ask that users employ the OOOPS module to help govern their I/O activity. To deploy the OOOPS module, add the following lines to your job script after your Slurm commands, but before your executable.

module use /work/01255/siliu/stampede2/ooops/modulefiles/
module load ooops
export IO_LIMIT_CONFIG=/work/01255/siliu/stampede2/ooops/1.0/conf/config_low
set_io_param 0 low

The first argument to set_io_param can be set to either 0 to indicate the $SCRATCH file system, or 1 to indicate the $WORK file system. These instructions allow the system to modulate your job's I/O activity in a way that reduces the impact on the shared file system.

Python I/O Management

For jobs that make use of large numbers of Python modules, or jobs that use local installations of Python/Anaconda/Miniconda, we have an additional tool to help manage the I/O activity caused by library and module calls. To deploy this tool, add the following line to your job submission file after your Slurm commands, but before your Python executable.

export LD_PRELOAD=/work/00410/huang/share/patch/

Tracking Job I/O

Finally, if you wish to track the full extent of your I/O activity over the course of your job, you can employ another TACC tool that reports on open() and stat() calls during the run. Place these lines in your job submission script after your Slurm commands, so that the LD_PRELOAD setting is in effect when your executable launches.

export LD_PRELOAD=/work/00410/huang/share/patch/
ibrun my_executable