Managing I/O on TACC Resources
Last update: January 08, 2021
The TACC Global Shared File System, Stockyard, is mounted on nearly all TACC HPC resources as the /work ($WORK) directory. This file system is accessible to all TACC users, and therefore experiences a huge amount of I/O activity (reading and writing to disk) as users run their jobs. This document presents best practices for reducing and mitigating such activity in order to keep all systems running at maximum efficiency for all TACC users.
What is I/O?
I/O stands for Input/Output and refers to the idea that for every input to a computer (keyboard input, mouse click, external disk access), there is an output (to the screen, in game play, a write to disk). In the HPC environment, I/O refers almost exclusively to disk access: opening and closing files, and reading from, writing to, and searching within files. Each of these I/O operations (iops), e.g., "open", "write", "close", accesses the file system's MetaData Server (MDS). The metadata server coordinates access to the /work file system for all users. If a file system's MDS is overwhelmed by a user's I/O-intensive workflow, then that file system can go down for an indeterminate period, and all running jobs on that resource may fail.
Examples of intensive I/O activity that could affect the system include, but are not limited to:
- reading/writing 100+ GB to checkpoint or output files
- running with 4096+ MPI tasks all reading/writing individual files
- Python jobs using more than 2-3 Python modules such as pandas, numpy, matplotlib, mpi4py, etc.
As TACC's user base continues to expand, the stress on the resources' shared file systems increases daily. TACC staff now recommends new file system and job submission guidelines in order to maintain file system stability. If your jobs or activities stress the file system, then every other user's jobs and activities are impacted, and the system administrators may resort to cancelling your jobs and suspending your access to the queues.
If you know your jobs will require significant I/O, please submit a support ticket and an HPC consultant will work with you.
Recommended File Systems Usage
Consider that $HOME and $WORK are for storage and keeping track of important items. The $WORK file system is intended to be an area where you can build your code and store your input data, output data, and any intermediate results. The $WORK file system is not designed to handle jobs with large amounts of I/O or iops. Actual job activity, reading and writing to disk, should be offloaded to your resource's $SCRATCH file system. You can start a job from anywhere, but the actual work of the job should occur only on the $SCRATCH partition. You can save original items to $HOME or $WORK so that you can copy them over to $SCRATCH if you need to re-generate results.
Table 1 outlines TACC's new recommended guidelines for file system usage. Note that two TACC systems, Longhorn and Maverick2, differ in their file system configurations. Longhorn does not mount the Stockyard file system and therefore has no $WORK access. Maverick2 has no /scratch file system.
File System | Recommended Use | Notes |
---|---|---|
$HOME | cron jobs, scripts and templates, environment settings | each user's $HOME directory is backed up |
$WORK * | software installations, original datasets that can't be reproduced. | The Stockyard file system is NOT backed up. Ensure that your data is backed up to Ranch long-term storage. |
$SCRATCH ** | Reproducible datasets, I/O files: temporary files, checkpoint/restart files, job output files | All $SCRATCH file systems are subject to purge. |
* Longhorn is not connected to Stockyard. Consult the Longhorn User Guide's File Systems section for further guidance.
** Maverick2 does not have its own $SCRATCH file system. Consult the Maverick2 User Guide's File Systems section for further guidance.
Best Practices for Minimizing I/O
Here we present guidelines aimed at minimizing I/O impact on all TACC resources. Primarily this means redirecting I/O activity away from Stockyard ($WORK) and onto each resource's own local storage: usually the respective /tmp or $SCRATCH file systems.
Manipulate Data in Memory, not on Disk
Manipulate data in memory instead of in files on disk whenever possible. This means:
- For intermediate data that does not need to be preserved, process it directly in memory rather than writing it to disk.
- For commands in intermediate steps, pass data between them directly, for example through a pipe, instead of creating extra script or temporary files for them (see the example below).
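Here is a minimal sketch of the idea; input.pdb and the particular filtering commands are hypothetical stand-ins for your own workflow:

    # anti-pattern: each step writes an intermediate file (extra iops on the MDS)
    grep "^ATOM" input.pdb > atoms.tmp
    sort -k6,6n atoms.tmp > atoms.sorted.tmp
    wc -l atoms.sorted.tmp

    # better: stream the data through a pipe entirely in memory
    grep "^ATOM" input.pdb | sort -k6,6n | wc -l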
Use the Compute Nodes' /tmp Storage Space
Additionally, each compute node has its own local /tmp directory. You can use /tmp to read/write files that do not need to be accessed by other tasks. If this output data is needed at the end of the job, copy the files from /tmp to your $SCRATCH directory at the end of your batch script (see the sketch below). This will greatly reduce the load on the file system and may provide a performance improvement.
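The following is a minimal single-node sketch of this pattern; the program name, its --outdir flag, and output.dat are hypothetical stand-ins for your own application:

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 48
    #SBATCH -t 01:00:00

    # write intermediate and output files to node-local /tmp, not $WORK or $SCRATCH
    JOBTMP=/tmp/run_$SLURM_JOB_ID
    mkdir -p $JOBTMP
    ibrun ./myprogram --outdir $JOBTMP

    # copy only the results you need back to $SCRATCH before the job ends
    mkdir -p $SCRATCH/run_$SLURM_JOB_ID
    cp $JOBTMP/output.dat $SCRATCH/run_$SLURM_JOB_ID/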
Data stored in the /tmp directory is as temporary as its name indicates, lasting only for the duration of your job. Each MPI task will write output to the /tmp directory on the node on which it is running; MPI tasks cannot access data in /tmp on other nodes. Each system has a different amount of /tmp space. Use the table below as guidance to the amount of space in /tmp. Submit a support ticket for more help using this directory/storage.
Compute Resource | Storage per Compute Node |
---|---|
Frontera | 144 GB |
Stampede2 SKX | 144 GB |
Stampede2 KNL | 107 GB (normal, large queues); 32 GB (development queue) |
Lonestar5 | 27 GB |
Longhorn | 900 GB |
Maverick2 | 32 GB (p100, v100 queues); 60 GB (gtx queue) |
Run Jobs Out of Each Resource's Scratch File System
Each TACC resource (except Maverick2) has its own Scratch file system, /scratch/username, accessible via the $SCRATCH environment variable and the cds alias.
Scratch file systems are not shared across TACC resources and are specific to one system. Scratch file systems have neither file count nor file size quotas, but they are subject to periodic and unscheduled file purges should total disk usage exceed a safety threshold.
TACC staff recommends you run your jobs out of your resource's $SCRATCH file system instead of the global $WORK file system. To run your jobs out of $SCRATCH, copy (stage) the entire executable/package, along with all needed job input files and libraries, to your resource's $SCRATCH directory.
Compute nodes should not reference the $WORK file system unless it's to stage data in or out, and only before or after jobs.
Your job script should also direct the job's output to the local scratch directory:
    # stage executable and data
    cd $SCRATCH
    mkdir testrunA
    cp $WORK/myprogram testrunA
    cp $WORK/myinputdata testrunA

    # launch program
    ibrun testrunA/myprogram testrunA/myinputdata > testrunA/output

    # copy results back to permanent storage once the job is done
    cp testrunA/output $WORK/savetestrunA
Avoid Writing One File Per Process
If your program regularly writes data to disk from each process, for instance for checkpointing, avoid writing output to a separate file for each process, as this will quickly overwhelm the metadata server. Instead, employ a library such as hdf5 or netcdf to write a single parallel file for the checkpoint. A one-time generation of one file per process (for instance at the end of your run) is less serious, but even then you should consider writing parallel files.
Alternatively, you could write these per-process files to each compute node's /tmp directory, as described above and sketched below.
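On a single node, that alternative might look like the following sketch (the program name and its checkpoint flag are hypothetical); because MPI tasks cannot reach /tmp on other nodes, a multi-node job would have to archive each node's files from that node:

    # each MPI rank writes its own checkpoint files under node-local /tmp
    ibrun ./myprogram --checkpoint-dir /tmp/ckpt_$SLURM_JOB_ID

    # one-time archive of the per-process files back to $SCRATCH
    tar czf $SCRATCH/ckpt_$SLURM_JOB_ID.tar.gz -C /tmp ckpt_$SLURM_JOB_ID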
Avoid Repeated Reading/Writing to the Same File
Jobs with multiple tasks that read and/or write to the same file will often hold that file in an open state in order to accommodate the changes happening to it. Please make sure that your I/O activity is not directed at a single file repeatedly. If this cannot be avoided, you can use /tmp on the node to store the file. If you require shared file operations, then please ensure your I/O is optimized.
If you anticipate the need for multiple nodes or processes to write to a single file in parallel (aka single file with multiple writers/collective writers), please submit a support ticket for assistance.
Monitor Your File System Quotas
If you are close to your file quota on either the $WORK or $HOME file system, your job may fail because it cannot write output, and repeated attempts to write beyond quota put additional stress on the file systems.
Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.
To display a summary of your TACC project balances and disk quotas at any time, execute:

    login1$ /usr/local/etc/taccinfo    # Generally more current than balances displayed on the portals.
You can monitor your file system quotas and usage using the taccinfo command. This output also displays whenever you log on to a TACC resource.
    ---------------------- Project balances for user ----------------------
    | Name           Avail SUs Expires | Name           Avail SUs Expires |
    | Allocation             1         | Alloc                100         |
    ------------------------ Disk quotas for user -------------------------
    | Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
    | /home1              1.5      25.0     6.02          741      400000    0.19 |
    | /work             107.5    1024.0    10.50         2434     3000000    0.08 |
    | /scratch1           0.0       0.0     0.00            3           0    0.00 |
    | /scratch3       41829.5       0.0     0.00       246295           0    0.00 |
    -------------------------------------------------------------------------------
Reduce I/O with OOOPS
TACC staff has developed OOOPS, the Optimal Overloaded I/O Protection System, an easy-to-use tool to help HPC users optimize heavy I/O requests and reduce the impact of high-I/O jobs. For jobs that have a particularly high I/O footprint, we now ask users to employ the OOOPS module to help govern their I/O activity.
Employing OOOPS may slow down your job significantly if your job has a lot of I/O.
The OOOPS module is currently installed on TACC's Frontera and Stampede2 resources.
OOOPS on Frontera and Stampede2
To deploy OOOPS, load the "ooops" module in your job script or idev session after your Slurm commands, but before invoking your executable.
Job Script Example:

    #SBATCH -N 1
    #SBATCH -J myjob.%j
    ...
    module load ooops
    ibrun myprogram

Interactive Session Example:

    login1$ idev [options]
    ...
    c123-456$ module load ooops
    c123-456$ ibrun myprogram
OOOPS Functions
The OOOPS module contains two functions. Use these commands to adjust the maximum allowed frequency of "open" or "stat" calls on all nodes of one running job.
- set_io_param for single-node jobs
- set_io_param_batch for multi-node jobs
For both functions, set the file system index argument (idx_fs below) to either 0 to indicate the $SCRATCH file system, or 1 to indicate the $WORK file system. This instructs the system to modulate your job's I/O activity in a way that reduces the impact on the designated file system.
    Usage: [set_io_param_batch | set_io_param] jobid idx_fs t_open freq_open t_stat freq_stat

        jobid     - Slurm job id
        idx_fs    - file system index. Set this value to "0" or "1" where:
                      0 represents /scratch
                      1 represents /work
        t_open    - estimated time to finish open(). The unit is microseconds.
        freq_open - allowed max frequency of open() (times per second)
        t_stat    - estimated time to finish stat(). The unit is microseconds.
        freq_stat - allowed max frequency of stat() (times per second)
OOOPS Examples
To turn off throttling on TACC's shared $WORK file system:

    set_io_param_batch 12345 1 1000000 1000000 1000000 1000000
To limit the number of "open" and "stat" calls on TACC's $WORK file system to 200 times/sec and 500 times/sec respectively:

    set_io_param_batch 12345 1 1000 200 1000 500
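Since a file system index of 0 designates /scratch, the analogous (illustrative) invocation below would apply the same limits to job 12345's I/O on the $SCRATCH file system:

    set_io_param_batch 12345 0 1000 200 1000 500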
If OOOPS finds intensive I/O work in your job, it will print out warning messages and create an open/stat call report after the job finishes.
Contact the OOOPS developers, Lei Huang and Si Liu, with any questions.
Python I/O Management
For jobs that make use of large numbers of Python modules or jobs that use local installations of Python/Anaconda/MiniConda, TACC staff provides additional tools to help manage the I/O activity caused by library and module calls.
Stampede2: To deploy this tool, add the following line to your job submission file after your Slurm commands, but before invoking your Python executable:

    export LD_PRELOAD=/home1/apps/tacc-patches/python_cacher/myopen.so
Frontera: Load the "python_cacher" module in your job script:

    module load python_cacher
This library will cache Python modules to local disk so Python programs won't keep pulling the modules over and over from the /scratch or /work file systems.
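A minimal Frontera job script sketch putting this together (the queue choice and my_analysis.py are hypothetical, and a Python installation is assumed to already be on your path):

    #!/bin/bash
    #SBATCH -J pycache-demo
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH -p small
    #SBATCH -t 00:30:00

    module load python_cacher    # cache Python imports on node-local disk

    # heavy imports (pandas, numpy, matplotlib, ...) now hit the local cache
    python3 my_analysis.py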
Tracking Job I/O
Stampede2 and Frontera: Finally, if you wish to track the full extent of your I/O activity over the course of your job, you can employ another TACC tool that reports on open() and stat() calls during your job's run. Place the following lines in your job submission script after your Slurm commands, wrapping your executable:
    export LD_PRELOAD=/home1/apps/tacc-patches/io_monitor/io_monitor.so:/home1/apps/tacc-patches/io_monitor/hook.so
    ibrun my_executable
    unset LD_PRELOAD
During the job run, log files will be generated in the working directory with the prefix "log_io_".
Note: Since the io_monitor tool may itself generate a lot of files, we highly recommend you profile your job beginning with trivial cases, then ramping up to the desired number of nodes/tasks.
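For example, a first profiling pass might wrap a trivial single-task run before scaling up (the executable and input names are hypothetical, and ibrun's -n flag is assumed available to limit the task count):

    # profile a trivial one-task case first to keep the number of log files small
    export LD_PRELOAD=/home1/apps/tacc-patches/io_monitor/io_monitor.so:/home1/apps/tacc-patches/io_monitor/hook.so
    ibrun -n 1 my_executable small_input
    unset LD_PRELOAD

    # inspect the generated per-process I/O reports
    ls log_io_*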