Stampede User Guide
Last update: February 13, 2017. See revision history.


  • 09/27/16 This user guide has been updated substantially to reflect the new Knights Landing (KNL) Upgrade. Most of the older Knights Corner (KNC) coprocessor content has been moved to the new Stampede Archive: Knights Corner Technical Material document.
  • 09/01/16 Multi-Factor Authentication (MFA) is now mandated in order to access all TACC resources. Please see the Multi-Factor Authentication at TACC tutorial for assistance in setting up your account.


TACC's Stampede system, generously funded by the National Science Foundation (NSF) through award ACI-1134872, entered production in January 2013 as a 6,400+ node cluster of Dell PowerEdge server nodes featuring Intel Xeon E5 Sandy Bridge host processors and the Intel Knights Corner (KNC) coprocessor, the first generation of processors based on Intel's Many Integrated Core (MIC) architecture. Stampede's 2016 Intel Knights Landing (KNL) Upgrade prepares the way for Stampede 2 by adding 508 Intel Xeon Phi 7250 second-generation KNL MIC compute nodes. The KNL represents a radical break with the first-generation KNC MIC coprocessor. Unlike the legacy KNC, a Stampede KNL is not a coprocessor: each KNL is a stand-alone, self-booting processor that is the sole processor in its node.

While the KNL and Sandy Bridge nodes share the same three Lustre file systems, the Stampede KNL Upgrade is largely its own independent cluster. In fact it can be helpful to think of Stampede as two related but largely independent sub-systems (Figure 1): the Sandy Bridge cluster and the KNL cluster. Note that the KNL upgrade adds new nodes (and new capabilities) to the system, but leaves the original Stampede hardware intact. The Sandy Bridge cluster, which is the original Stampede system consisting of Sandy Bridge compute nodes with their KNC coprocessors, remains available for production use. When you initiate a login session you do so on either the Sandy Bridge or KNL cluster. When you submit a job on the Sandy Bridge side, for example, it will run only on the Sandy Bridge compute nodes.

The early sections of this User Guide address information important to all Stampede users as well as material specific to the Sandy Bridge cluster. A single self-contained section below focuses on the KNL cluster. For simplicity we have migrated most of the older KNC material to its own stand-alone legacy document, Stampede Archive: Knights Corner (KNC) Technical Material.

Figure 1. Stampede Sandy Bridge and KNL clusters

Sandy Bridge: System Overview

Stampede began production as a 10 petaflop (PF) Dell Linux cluster based on 6400+ Dell PowerEdge server nodes, most of which contain two Intel Xeon E5 Sandy Bridge processors and a first generation KNC coprocessor. These components are still in production. The aggregate peak performance of the Xeon Sandy Bridge E5 processors is 2+PF, while the KNC coprocessors deliver an additional aggregate peak performance of 7+PF. The system also includes login nodes, large-memory nodes, graphics nodes (for both remote visualization and computation), and dual-coprocessor nodes. Additional nodes (not directly accessible to users) provide management and file system services. The KNL Upgrade, which adds 1.5 PF to Stampede's capabilities, consists of 508 Intel Xeon Phi 7250 KNL compute nodes.

One of the important design considerations for Stampede was to create a multi-use cyberinfrastructure resource offering large-memory, large-data-transfer, and GPU capabilities for data-intensive, accelerated, or visualization computing. Because some of the compute-intensive nodes within the system are augmented with very large memory and GPUs, there is no need to move data for data-intensive computing, remote visualization, or GPGPU computing. For situations requiring large data transfers from other sites, 4 high-speed data servers have been integrated into the Lustre file systems.

Stampede's Sandy Bridge cluster includes the following components:

Compute Nodes: The majority of the 6400 Sandy Bridge nodes include two Xeon E5-2680 8-core Sandy Bridge processors and one first-generation Intel Xeon Phi SE10P KNC MIC coprocessor on a PCIe card connected to its Sandy Bridge host. These compute nodes are configured with 32GB of "host" memory and an additional 8GB of memory on the Xeon Phi coprocessor card. A smaller number of compute nodes are configured with two Xeon Phi coprocessors; the specific number of nodes available in each configuration at any time is available from the batch queue summary.

Large Memory Nodes: There are an additional 16 large-memory Sandy Bridge nodes, each with 32 cores and 1TB of memory, for data-intensive applications requiring disk caching to memory and large-memory methods.

Visualization Nodes: For visualization and GPGPU processing, 128 compute nodes are each augmented with a single NVIDIA K20 GPU with 5GB of on-board GDDR5 memory.

Interconnect: The network supporting the Sandy Bridge cluster is built on Mellanox FDR InfiniBand technology in a two-level (core and leaf) fat-tree topology. See below for information on the KNL cluster.

File Systems: The Stampede system supports 14PB of global, parallel file storage managed as three Lustre file systems. All three shared Lustre file systems are available from both the Sandy Bridge and KNL clusters. Each Sandy Bridge node also contains a local 250GB disk. Additionally, the TACC Ranch tape archival system (60 PB capacity) is accessible from Stampede.

Figure 2. Stampede System

Sandy Bridge: Configuration

All nodes on the Sandy Bridge cluster run CentOS 6.3 and are managed with batch services through Slurm 2.4. Global $HOME, $WORK and $SCRATCH storage areas are supported by three Lustre parallel distributed file systems with 76 IO servers. Inter-node communication (MPI/Lustre) is through an FDR Mellanox InfiniBand network.

The 6400 Dell Zeus C8220z compute nodes are housed in 160 racks (40 nodes/rack), along with two 36-port Mellanox leaf switches. Each node has two Intel E5 8-core (Sandy Bridge) processors and an Intel Xeon Phi 61-core (Knights Corner) coprocessor, connected by an x16 PCIe bus. The host and coprocessor are configured with 32GB of DDR3 and 8GB of GDDR5 memory, respectively. The 16 large-memory nodes are in a single rack. Each of these is a Dell PowerEdge R820 server with 4 E5-4650 8-core processors and 1TB of DDR3 memory.

The interconnect is an FDR InfiniBand network of Mellanox switches, consisting of a fat tree topology of eight core-switches and over 320 leaf switches with a 5/4 oversubscription. The network configuration for the compute nodes is shown in Figure 4.

The configuration and features for the compute nodes, interconnect and I/O systems are described below, and summarized in Tables 1 through 4.

Table 1. System Configuration & Performance

Component Technology Performance/Size
Nodes(sled) 2 8-core Xeon E5 processors, 1 61-core Xeon Phi coprocessor 6400 Nodes
Memory Distributed, 32GB/node 205TB (Aggregate)
Shared Disk Lustre 2.1.3, parallel File System 14 PB
Local Disk SATA (250GB) 1.6PB (Aggregate)
Interconnect InfiniBand Mellanox Switches/HCAs FDR 56 Gb/s

Compute nodes

A compute node consists of a Dell C8220z double-wide sled in a 4 rack-unit chassis shared with 3 other sleds. Each node runs CentOS 6.3 with the 2.6.32 x86_64 Linux kernel. Each node contains two 8-core 64-bit Intel Xeon E5-2680 processors (16 cores in all) on a single board, operating as an SMP unit. Each core runs at 2.7GHz and supports 8 floating-point operations per clock period, for a peak performance of 21.6 GFLOPS/core or 346 GFLOPS/node. Each node contains 32GB of memory (2GB/core). The memory subsystem has 4 channels from each processor's memory controller to 4 DDR3 ECC DIMMs, each rated at 1600 MT/s (51.2GB/s for all four channels in a socket). The processor interconnect, QPI, runs at 8.0 GT/s between sockets. The Intel Xeon Phi is a special production model with 61 cores at 1.1 GHz and a peak performance of 16.2 DP GFLOPS/core or 1.0 DP TFLOPS per coprocessor. Each coprocessor contains 8GB of GDDR5 memory served by 8 dual-channel controllers, with a peak memory bandwidth of 320GB/s.

Table 2. Dell DCS (Dell Custom Solution) C8220z Compute Node

Component Technology
Sockets per Node/Cores per Socket 2/8 Xeon E5-2680 2.7GHz (turbo, 3.5)
  1/61 Xeon Phi SE10P 1.1GHz
Motherboard Dell C8220, Intel QPI, C600 Chipset
Memory per Host 32GB 8x4GB 4 channels DDR3-1600MHz
Memory per Coprocessor 8GB GDDR5
QPI 8.0 GT/s
PCI Express Processor x40 lanes, Gen 3
PCI Express Coprocessor x16 lanes, Gen 2 (extended)
Disk 250GB 7.5K RPM SATA
Figure 3a. Stampede Zeus Node: 2 Xeon E5 processors and 1 Xeon Phi coprocessor
Figure 3b. Intel Xeon Phi Coprocessor

Table 3. PowerEdge R720 Login Nodes

Component Technology
4 login nodes
Sockets per Node/Cores per Socket 2/8 Xeon E5-2680 2.7GHz
Motherboard Dell R720, Intel QPI, C600 Chipset
Memory per Node 32GB 8x4GB 4 channels/CPU DDR3-1600 (MT/s)
Cache 256KB/core L2; 20MB/CPU L3
Global: Lustre, xxGB quota
Local: Shared, 432GB SATA 10K rpm

Intel E5 Sandy Bridge Processor

The E5 architecture includes the following features important to HPC:

  • 4-wide DP (double precision) vector units with the AVX instruction set
  • 4-channel on-chip (integrated) memory controllers
  • Support for 1600MT/s DDR3 memory
  • Dual Intel QuickPath links between Xeon dual-processor systems supporting 8.0GT/s
  • Turbo Boost version 2.0, up to a peak of 3.5GHz in turbo mode
  • In these Romley platforms, PCIe lanes are controlled by the CPUs (they do not pass through the chipset)
  • Gen 3.0 PCI Express
  • Improved Hyper-Threading (turned off on Stampede, but good for some HPC packages)
  • 64KB L1 cache/core (32KB L1 data and 32KB L1 instruction)


Interconnect

The 56Gb/s FDR InfiniBand interconnect consists of Mellanox switches, fiber cables and HCAs (Host Channel Adapters). Eight 648-port SX6536 core switches and over 320 36-port SX6025 endpoint switches (2 in each compute-node rack) form a 2-level Clos fat-tree topology, illustrated in Figure 4. Core and endpoint switches have 73 and 4.0 Tb/s capacities, respectively. There is a 5/4 oversubscription at the endpoint (leaf) switches (20 node input ports : 16 core-switch output ports). Any MPI message travels at most 5 hops from source to destination.

Figure 4. Stampede Interconnect

File Systems Overview

User-owned storage on the Stampede system is available in three directories, identified by the $HOME, $WORK and $SCRATCH environment variables. These directories are part of separate Lustre shared file systems, and accessible from any node in the system.

Table 4. Storage Systems

Storage Class Size Architecture Features
Local (each node) Login: 1TB; Compute: 250GB; Big Mem: 600GB
  Login: 432GB partition mounted on /tmp
  Compute: 80GB partition mounted on /tmp
  Big Mem: 398GB partition mounted on /tmp
Parallel 14PB Lustre
  72 Dell R610 data servers (OSS) accessed through InfiniBand
  user striping allowed
  MPI-IO; XPB, YPB, and ZPB partitions on $HOME/$WORK/$SCRATCH
  4 Dell R710 metadata servers with 2 Dell MD3220 storage arrays
Ranch (Tape Storage) 60PB SAM-FS (Storage Archive Manager)
  10GB/s connection through 4 GridFTP servers

Sandy Bridge: System Access

Access to all TACC systems now requires Multi-Factor Authentication (MFA). You can create an MFA pairing on the TACC User Portal. After logging in to the portal, go to your account profile (Home->Account Profile), then click the "Manage" button under "Multi-Factor Authentication" on the right side of the page. See Multi-Factor Authentication at TACC for further information.

Secure Shell

To create a login session from a local machine it is necessary to have an SSH client. Wikipedia is a good source of information on SSH, and provides information on the various clients available for your particular operating system. To ensure a secure login session, users must connect to machines using the secure shell ssh command. Data movement must be done using the secure shell commands scp and sftp.

Do not run the optional ssh-keygen command to set up public-key authentication. This command sets up a passphrase that will interfere with the execution of job scripts in the batch system. If you have already done this, remove the .ssh directory (and the files under it) from your home directory, then log out and log back in to regenerate the keys.

To initiate an ssh connection to a Stampede login node from your local system, use an SSH client supporting the SSH-2 protocol (e.g., OpenSSH, PuTTY, SecureCRT), then execute the following command:

localhost$ ssh username@stampede.tacc.utexas.edu

Login passwords (which are identical to TACC portal passwords, not XUP passwords) can be changed in the TACC Portal. Select "Change Password" under the "HOME" tab after you log in. If you've forgotten your password, go to the TACC portal home page and select the "? Forgot Password" button in the Sign In area.

Please see below for information on accessing Stampede's KNL login node.

GSI-OpenSSH (gsissh)

The following commands authenticate using the XSEDE MyProxy server, then connect to the gsissh port (2222) on Stampede:

localhost$ myproxy-logon -s myproxy.xsede.org -l XUP_username
localhost$ gsissh -p 2222 stampede.tacc.utexas.edu

Please consult NCSA's detailed documentation on installing and using myproxy and gsissh, as well as the GSI-OpenSSH User's Guide, for more information.

XSEDE Single Sign-On Hub

XSEDE users may also access Stampede via the XSEDE Single Sign-On Hub.

To report a problem, please run the ssh or gsissh command with the "-vvv" option and include the verbose output when submitting a help ticket.

Using Stampede

Good Citizenship

The Stampede cluster is a shared resource. Hundreds of users may be logged on at one time accessing the filesystem, hundreds of jobs may be running on all compute nodes, with a hundred more jobs queued up. All users must practice good citizenship and limit activities that may impact the system for other users. Stampede's four login nodes as well as the three Lustre file systems ($HOME, $WORK, and $SCRATCH) are shared among all users. Good citizenship can be boiled down to two items:

  1. Do not abuse the shared filesystems
  2. Do not run programs on the login nodes

Respect the shared filesystems

  • Avoid running jobs in the $HOME directory. Run jobs in $WORK or $SCRATCH.
  • Avoid too many simultaneous file transfers. Three concurrent scp or globus-url-copy sessions (see Transferring Files) are probably fine. One hundred concurrent file sessions are not.
  • Limit I/O intensive sessions (lots of reads and writes to disk)

Don't run programs on the login nodes

The four login nodes are shared among all users currently logged into Stampede. A single user running computationally expensive or disk-intensive tasks will negatively impact performance for other users. Running on the login nodes is one of the fastest routes to account suspension. Don't do this:

login1$ ./a.out

or this:

login1$ ibrun myprogram

or this:

login1$ mpirun myprogram

Instead, submit batch jobs or request interactive access to the compute nodes as detailed below in Accessing Compute Nodes.

Accessing Compute Nodes

Once logged into Stampede, users are automatically placed on one of four "front-end" login nodes. To determine what type of node you're on, simply issue the "hostname" command. Stampede's four login nodes are labeled login[1-4]; the 6400 compute nodes are labeled something like "c428-403".

The four login nodes provide an interface to the 6400 "back-end" compute nodes. Think of the login nodes as a prep area, where users may edit, compile, perform file management, issue transfers, submit batch jobs etc.

Figure 5. Login and compute nodes

The compute nodes are where actual computations occur and where research is done. All batch jobs and executables, as well as development and debugging sessions, are run on the compute nodes.

Batch jobs are programs scheduled for execution on the compute nodes, to be run without human interaction. A job script (also called "batch script") contains all the commands necessary to run the program: the path to the executable, program parameters, number of nodes and tasks needed, maximum execution time, and any environment variables needed. Batch jobs are submitted to a queue and then managed by a scheduler. The scheduler manages all pending jobs in the queue and allocates exclusive access to the compute nodes for a particular job. The scheduler also provides an interface allowing the user to submit, cancel, and modify jobs. Stampede's job scheduler is Slurm, documented extensively in the Running section.

To access compute nodes on TACC resources, one must either submit a job to a batch queue or initiate an interactive session.

Submit a Slurm job script

Use Slurm's sbatch command to submit a job to the batch queue. Specify the resources needed for your job (e.g., number of nodes/tasks needed, job run time) in a job script. See "/share/doc/slurm" for examples.

login1$ sbatch myscript

Please see the Running Jobs section below for extensive information on Slurm and running jobs.
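As a sketch, a minimal job script might look like the following. The #SBATCH directives follow standard Slurm syntax; the queue name, node/task counts, time limit, and executable here are illustrative and should be adjusted for your own job.

```shell
#!/bin/bash
#SBATCH -J mympi              # job name
#SBATCH -o mympi.%j.out       # output file name (%j expands to the job ID)
#SBATCH -p normal             # queue (partition) to submit to
#SBATCH -N 2                  # number of nodes requested
#SBATCH -n 32                 # total number of MPI tasks (16 cores/node)
#SBATCH -t 01:00:00           # maximum run time (hh:mm:ss)

ibrun ./a.out                 # launch the MPI executable with TACC's ibrun
```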

Start an Interactive Session

Start an interactive session to a compute node using TACC's idev utility or Slurm's srun command. Both utilities are useful for code development and debugging.
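For example, either of the following requests an interactive session on a single compute node. The queue name, task count, and time limit are illustrative; verify the exact options with "idev --help" and the Slurm documentation.

```shell
# TACC's idev utility: one node for 30 minutes (queue and time assumed)
login1$ idev -p normal -N 1 -m 30

# Slurm's srun: an interactive login shell on one node
login1$ srun -p normal -N 1 -n 16 -t 00:30:00 --pty /bin/bash -l
```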

Login to an Assigned Compute Node

Users may also ssh to a compute node from a login node, but only while that user's batch job, idev session, or srun session is active on that node. Once the batch job or interactive session ends, the user will no longer have access to that node.

In the following example session, user slindsey submits a batch job (sbatch), queries which compute nodes the job is running on (squeue), then ssh's to one of the job's compute nodes. The displayed node list, c428-[401,403],c451-103, is truncated for brevity. Notice that the user attempts to ssh to compute node c428-402, which is NOT in the node list. Since that node is not assigned to this user for this job, the connection is refused. When the job terminates, the connection will also be closed even if the user is still logged into the node.

User submits a job using Slurm's sbatch command:

login4 [709]~> sbatch myjobscript
Submitted batch job 5462435

User polls the queue, waiting till the job runs (state "R") and nodes are assigned.

login4 [710]~> squeue -u slindsey
5462435 normal mympi slindsey R 0:39 16 c428-[401,403],c451-103,... 

User attempts to logon to an unassigned node. The connection is denied.

login4 [711]~> ssh c428-402
Access denied: user slindsey (uid=804387) has no active jobs.
Connection closed by

User logs in to an assigned compute node and does work. Once the job has finished running, the connection is automatically closed.

login4 [712]~> ssh c428-403
TACC Stampede System
LosF 0.40.0 (Top Notch)
Provisioned on 10-Oct-2012 at 00:46
c428-403 [665]~> do science; attach debuggers etc.
Connection to c428-403 closed by remote host.
Connection to c428-403 closed.
login4 [713]~>

Unix Shell

The most important component of a user's environment is the login shell. It interprets text on each interactive command line and statements in shell scripts. Each user has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. The default shell is Bash. To determine your current login shell:

login1$ echo $SHELL
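The shell appears as the last (seventh) colon-separated field of the passwd entry. The snippet below extracts it from a sample line; the account details shown are made up for illustration.

```shell
# Extract the login shell (7th field) from a sample /etc/passwd line.
# The account values here are illustrative, not a real entry.
line='username:x:804387:800:Example User:/home1/01234/username:/bin/bash'
echo "$line" | cut -d: -f7    # prints /bin/bash
```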

You can change your default login shell by submitting a ticket via TACC Portal. Select one of the available shells listed in the "/etc/shells" file, and include it in the ticket. To display the list of available shells, execute:

login1$ cat /etc/shells

After your support ticket is closed, please allow several hours for the change to take effect.

Environment Variables

Another important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:

login1$ env

The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by sample $HOME and $PATH variables (the values shown are illustrative):

HOME=/home1/01234/username
PATH=/usr/local/bin:/usr/bin:/bin

Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C-type shells and export for Bourne-type shells) are carried to the environment of shell scripts and new shell invocations, while normal shell variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute "set" to see the (normal) shell variables.
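The distinction matters in practice: only exported variables reach child shells, and hence the scripts and programs you launch. A quick sketch in Bourne-shell syntax:

```shell
# A plain shell variable is visible only in the current shell.
MYVAR="visible"
sh -c 'echo "child sees: [$MYVAR]"'   # prints "child sees: []" -- not inherited

# After export, the variable is part of the environment and is inherited.
export MYVAR
sh -c 'echo "child sees: [$MYVAR]"'   # prints "child sees: [visible]"
```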

Startup Scripts

Unix shells allow users to customize their environment via startup files containing scripts. Customizing your environment with startup scripts is not entirely trivial. Below are some simple instructions, as well as an explanation of the shell set up operations.

TACC Bash users should consult the Bash Users' Startup Files: Quick Start Guide document for instructions on how best to set up the user environment.

Technical Background

All UNIX systems set up a default environment that provides administrators and users with the ability to execute additional UNIX commands to alter that environment. These commands are sourced; that is, they are executed by the login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment. The Xeon E5 hosts on Stampede support the Bourne shell and its variants (/bin/sh, /bin/bash , /bin/zsh) and the C shell and its variants (/bin/csh, /bin/tcsh). The Linux operating system on the Xeon Phi coprocessors supports only the Bash (/bin/bash) and Bourne (/bin/sh) shells. Each shell's environment is controlled by system-wide and user startup files. TACC deploys system-specific startup files in the /etc/profile.d/ directory. User owned startup files are dot files (begin with a period and are viewed with the "ls -a" command) in the user's $HOME directory.

Each UNIX shell may be invoked in three different ways: as a login shell, as an interactive shell or as a non-interactive shell. The differences between a login and interactive shell are rather arcane. For our purposes, just be aware that each type of shell runs different startup scripts at different times depending on how it's invoked. Both login and interactive shells are shells in which the user interacts with the operating system via a terminal window. A user issues standard command-line instructions interactively. A non-interactive shell is launched by a script and does not interact with the user, for example, when a queued job script runs.

Bash shell users should understand that login shells, for example shells launched via ssh, source one and only one of the files ~/.bash_profile, ~/.bash_login, or ~/.profile (whichever the shell finds first, in that order), and will not automatically source ~/.bashrc. Interactive non-login shells, for example shells launched by typing "bash" on the command line, will source ~/.bashrc and nothing else.

TACC staff recommends that Bash shell users use ~/.profile rather than ~/.bash_profile or ~/.bash_login. Please see the Bash Users' Startup Files: Quick Start Guide.

Both Bash on the Xeon E5 host and Bash on the Xeon Phi coprocessor will source ~/.profile when you login via ssh. You may also want to restrict yourself to POSIX-compliant syntax so both shells correctly interpret your commands.
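For instance, sticking to POSIX test syntax keeps a snippet valid under both /bin/bash and a strict /bin/sh, whereas bash-only constructs such as "[[ ... ]]" may not be:

```shell
#!/bin/sh
# POSIX-compliant comparison: single brackets and '=' work in sh and bash alike.
current="production"
if [ "$current" = "production" ]; then
    echo "POSIX test succeeded"
fi
```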

The system-wide startup scripts, /etc/profile for Bash and /etc/csh.cshrc for C type shells, set system-wide variables such as ulimit, and umask, and environment variables such as $HOSTNAME and the initial $PATH. They also source command scripts in the /etc/profile.d/ directory that site administrators may use to set up the environments for common user tools (e.g., vim, less) and system utilities (e.g., Ganglia, Modules, Globus).


Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, module commands set up a basic environment for the default compilers, tools, and libraries. For example, the $PATH, $MANPATH, and $LIBPATH environment variables, directory locations (e.g., $WORK, $HOME), aliases (e.g., cdw, cdh), and license paths are set by the login modules. Therefore, there is no need for you to set or update them when system and application software is updated.

Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

The environment for executing each major TACC application can be set with a module command. The specifics are defined in a modulefile, which sets, unsets, appends to, or prepends to environment variables (e.g., $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH) for the specific application. Each modulefile also sets functions or aliases for use with the application. You only need to invoke a single command to configure the application/programming environment properly. The general format of this command is:

module load modulename

where modulename is the name of the module to load. If you often need a specific application, see Controlling Modules Loaded at Login below for details.

Most of the package directories are in /opt/apps/ ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set in your shell environment by loading the fftw3 module:

login1$ module load fftw3

To look at a synopsis about using an application in the module's environment (in this case, fftw3), or to see a list of currently loaded modules, execute the following commands:

login1$ module help fftw3
login1$ module list
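Modules typically export package-specific variables that can then be used on compile lines. The variable names below ($TACC_FFTW3_INC, $TACC_FFTW3_LIB) follow TACC's usual naming convention but are assumptions here; verify them with "module help fftw3".

```shell
# Load the package, then compile against its headers and libraries
# using the environment variables the module is assumed to define.
login1$ module load fftw3
login1$ icc -I$TACC_FFTW3_INC myfft.c -L$TACC_FFTW3_LIB -lfftw3 -o myfft
```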

Available Modules

TACC's module system is organized hierarchically to prevent users from loading software that will not function properly with the currently loaded compiler/MPI environment (configuration). Two methods exist for viewing the availability of modules: Looking at modules available with the currently loaded compiler/MPI, and looking at all of the modules installed on the system.

To see a list of modules available to the user with the current compiler/MPI configuration, users can execute the following command:

login1$ module avail

This will allow the user to see which software packages are available with the current compiler/MPI configuration.

To see a list of modules available to the user with any compiler/MPI configuration, users can execute the following command:

login1$ module spider

This command will display all available packages on the system. To get specific information about a particular package, including the possible compiler/MPI configurations for that package, execute the following command:

login1$ module spider modulename

Software upgrades and adding modules

During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will generally announce upgrades and module changes in advance.

Controlling Modules Loaded at Login

Each user's computing environment is initially loaded with a default set of modules. This module set may be customized at any time. During login startup, the following command is run:

login1$ module restore

This command loads the user's personal set of modules (if it exists) or the system default. Users who wish to have their own personal collection of modules can create one by loading the modules they want, unloading the modules they don't, and then running:

login1$ module save

This marks the collection as the personal default set of modules loaded at every login. It is also possible to have named collections; run "module help" for more details.

There is a second method for controlling the modules loaded at login. The ".modules" file is sourced by the startup scripts at TACC and is read after the "module restore" command. This file can contain any list of module commands required. You can also place module commands in shell scripts and batch scripts. We do not recommend putting module commands in personal startup files (.bashrc, .cshrc), however; doing so can cause subtle problems with your environment on compute nodes.
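A hypothetical ~/.modules file might look like the following; the module names are examples only and should be replaced with packages you actually use.

```shell
# ~/.modules: read after "module restore" at login (example contents)
module load fftw3       # a library used in most of my projects
module load gsl         # another frequently used package (illustrative)
```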

Please see below for KNL specific material regarding modules.

File Management

Stampede supports multiple file transfer mechanisms: common command-line utilities such as scp, sftp, and rsync, as well as services such as Globus Connect and Globus' globus-url-copy utility. XSEDE users can use both Globus Connect and globus-url-copy to achieve higher performance than the scp and rsync programs when transferring large files between other XSEDE systems, TACC clusters, and TACC storage systems (Corral and Ranch). During production, scp speeds between Stampede and Ranch average about 30MB/s, while globus-url-copy speeds are about 125MB/s; these values vary with I/O and network traffic. The following transfer methods are presented in order of ease of use.

Globus Connect

Globus Connect (formerly Globus Online) is recommended for transferring data between XSEDE sites. Globus Connect provides fast, secure transport via an easy-to-use web interface using pre-defined and user-created "endpoints". XSEDE users automatically have access to Globus Connect via their XUP username/password. Other users may sign up for a free Globus Connect Personal account.


scp

Data transfer from any Linux system can be accomplished using the scp utility to copy data to and from the login node. A file can be copied from your local system to the remote server with a command like the following (the destination path is illustrative):

localhost$ scp filename username@stampede.tacc.utexas.edu:/path/to/destination

Consult the man pages for more information on scp.

login1$ man scp


rsync

The rsync command is another way to keep your data up to date. In contrast to scp, rsync transfers only the changed parts of a file instead of transferring the entire file. Hence, this selective method of data transfer can be much more efficient than scp. The following example demonstrates usage of the rsync command for transferring a file named "myfile.c" from its current location on Stampede to Lonestar's $WORK directory (the destination path is illustrative):

login1$ rsync myfile.c username@lonestar.tacc.utexas.edu:/path/to/work

An entire directory can be transferred from source to destination by using rsync as well. For directory transfers the options "-avtr" will transfer the files recursively ("-r" option) along with the modification times ("-t" option) and in the archive mode ("-a" option) to preserve symbolic links, devices, attributes, permissions, ownerships, etc. The "-v" option (verbose) increases the amount of information displayed during any transfer. The following example demonstrates the usage of the "-avtr" options for transferring a directory named "gauss" from the present working directory on Stampede to a directory named "data" in the $WORK file system on Lonestar.

login1$ rsync -avtr ./gauss username@lonestar.tacc.utexas.edu:/path/to/work/data

For more rsync options and command details, run the command "rsync -h" or:

login1$ man rsync

When executing multiple instantiations of scp or rsync, please limit your transfers to no more than 2-3 processes at a time.
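One simple way to respect that limit when scripting many transfers is to launch them in small batches and wait for each batch to finish. The sketch below substitutes a placeholder function for the real scp or rsync invocation:

```shell
#!/bin/sh
# Run many transfers, at most 3 at a time.
transfer() {            # placeholder standing in for one scp/rsync call
    sleep 1
    echo "finished $1"
}

max=3
count=0
for f in a.dat b.dat c.dat d.dat e.dat f.dat; do
    transfer "$f" &     # start one transfer in the background
    count=$((count + 1))
    if [ "$count" -ge "$max" ]; then
        wait            # let the current batch of 3 drain before continuing
        count=0
    fi
done
wait                    # catch any remaining transfers
```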


XSEDE users may also use Globus' globus-url-copy command-line utility to transfer data between XSEDE sites. globus-url-copy, like Globus Connect described above, is an implementation of the GridFTP protocol, providing high speed transport between GridFTP servers at XSEDE sites. The GridFTP servers mount the specific file systems of the target machine, thereby providing access to your files or directories.

This command requires the use of an XSEDE certificate to create a proxy for passwordless transfers. To obtain a proxy, use the "myproxy-logon" command with your XSEDE User Portal (XUP) username and password to obtain a proxy certificate. The proxy is valid for 12 hours for all logins on the local machine. On Stampede, the myproxy-logon command is located in the CTSSV4 module (not loaded by default).

login1$ module load CTSSV4
login1$ myproxy-logon -T -l XUP_username

Each globus-url-copy invocation must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy [options] source_url destination_url

where each XSEDE URL will generally be formatted:

gsiftp://gridftp_server:port/path/to/file

Users may look up XSEDE GridFTP servers on the Data Transfers & Management page.

Note that globus-url-copy supports multiple protocols, e.g., HTTP and FTP, in addition to the GridFTP protocol. Please consult the Globus documentation for more information.

globus-url-copy Examples

The following command copies "directory1" from TACC's Stampede to Georgia Tech's Keeneland system, renaming it "directory2". Note that when transferring directories, the directory path must end with a slash ("/"):

login1$ globus-url-copy -r -vb \
    gsiftp://<stampede_gridftp_server>:2811`pwd`/directory1/ \
    gsiftp://<keeneland_gridftp_server>:2811/path/to/directory2/

The following command copies a single file, "file1" from TACC's Stampede to "file2" on SDSC's Trestles:

login1$ globus-url-copy -tcp-bs 11M -vb \
    gsiftp://<stampede_gridftp_server>:2811`pwd`/file1 \
    gsiftp://<trestles_gridftp_server>:2811/path/to/file2

Use the buffer size option, "-tcp-bs 11M", to explicitly set the FTP data channel buffer size; otherwise, the transfer will be about 20 times slower! Consult the Globus documentation to select the optimum value: How do I choose a value for the TCP buffer size (-tcp) option?

Advanced users may employ the "-stripe" option, which enables striped transfers on supported servers. Stampede's GridFTP servers each have a 10GbE interface adapter and are configured for a 4-way stripe since most deployed 10GbE interfaces are performance-limited by host PCI-X busses to ~6Gb/s.


Additional command-line transfer utilities supporting standard ssh and grid authentication are offered by the Globus GSI-OpenSSH implementation of OpenSSH. The gsissh, gsiscp, and gsisftp commands are analogous to the OpenSSH ssh, scp, and sftp commands. Grid authentication is provided to XSEDE users by first executing the myproxy-logon command (see above).

Users who need to transfer large amounts of data to Stampede may find it worthwhile to disable gsiscp's default data stream encryption. To do so, add the following three options:

  • -oTcpRcvBufPoll=yes
  • -oNoneEnabled=yes
  • -oNoneSwitch=yes

to your command-line invocation. Note that not all machines support these options. You must explicitly connect to port 2222 on Stampede. The following command copies "file1" from your local machine to Stampede, renaming it "file2":

localhost$ gsiscp -oTcpRcvBufPoll=yes -oNoneEnabled=yes -oNoneSwitch=yes \  
    -P2222 file1 stampede:file2

Please consult Globus' GSI-OpenSSH User's Guide for further info.

Sophisticated users may wish to apply HPN-SSH patches to their own local OpenSSH installations; see the HPN-SSH project documentation for more information.

Application Development

Programming Models

There are two primary memory models for computing: distributed-memory and shared-memory. A third model, hybrid programming, combines the two. In the distributed-memory model, the message passing interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the shared-memory model, threads (lightweight processes) are used to access memory in a common address space. HPC applications often use OpenMP for threading, although other methods such as Intel Threading Building Blocks (TBB), Cilk, and POSIX threads are viable alternatives. The majority of scientific codes that employ threading techniques use the OpenMP paradigm because it is portable (understood by most compilers) and supported on HPC systems. Hence we will emphasize the use of the OpenMP threading paradigm when discussing the shared-memory model.

Distributed-memory Model

For distributed memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used.

In the SPMD paradigm, the same program image (e.g., an a.out executable) is loaded onto cores as processes with their own private memory and address space. Each process executes and performs the same operations on different data in its own address space. This is the usual mechanism for MPI code: a single executable is available to all nodes (through a globally accessible file system such as $SCRATCH, $WORK, or $HOME) and launched on cores of each node through the TACC MPI launch command (e.g., "ibrun ./a.out").

In the MPMD paradigm at least two different program images are launched. This paradigm has been used in a few multi-physics applications that launch two or more different codes that communicate through MPI; increasingly, multiple programs are also interfaced to each other in workflows.

Parametric Sweeps

Two specialty cases of SPMD computing are Parametric Sweeps (PS) and Heterogeneous (architecture) computing. Parametric Sweeps are used to investigate the parameter space of model simulations; they launch tens or hundreds of independent executions simultaneously, each working on different data without any (MPI) communication. PS executables are launched through a special launcher job script designed to launch independent executables with appropriate inputs from a parameter list file. Details for launching parameter sweep jobs are described in the launcher module:

login1$ module help launcher

In SPMD heterogeneous computing the same program executes on different architectures. On Stampede nodes MPI executables can be launched on the E5 processors as well as the MIC Phi coprocessors. Unlike GPU accelerators, this is possible because the MICs run a micro OS (mOS) for launching processes, and also have an MPI interface to the Host Channel Adapter (HCA) on the node for external communication. Intel refers to this type of computing paradigm as "symmetric" computing to emphasize that the Phi coprocessor component of a node can be used separately as a stand-alone SMP system for executing MPI codes. On Stampede nodes, MPI applications can be launched solely on the E5 processors, or solely on the Phi coprocessors, or on both in a "symmetric" heterogeneous computing mode. For heterogeneous computing, an application is compiled for each architecture and the MPI launcher ("ibrun" at TACC) is modified to launch the executables on the appropriate processors according to the resource specification for each platform (number of tasks on the E5 component and the Phi component of a node).

Shared-memory Model

Shared-memory programming paradigms that employ threading can only be executed on a set of cores where the threads can access the same (shared) memory. Applications spawn threads on the cores to work on tasks in parallel and access the same memory.

Hybrid Model

In multi-core cluster systems with a small core count per node, applications are run in pure MPI mode in which all cores within the cluster have their own private (distributed) memory. For an MPI application this is accomplished by executing multiple a.out's on a node, with each process running in a separate address space using memory accessible only to the process. In this way, all cores appear as a set of distributed memory machines, even though each node has cores that share a single memory subsystem.

Within the last few years more HPC MPI applications (distributed memory) have been employing on-node threading with OpenMP (a shared memory paradigm) to minimize the number of MPI communicating tasks and to increase the memory per MPI task.

As HPC systems evolve from multi- to many-core systems we expect to see even more applications migrate to this hybrid paradigm (employing MPI tasks to manage memory access across nodes, reducing the number of MPI tasks per node, and employing OpenMP to use several to many threads per MPI task).

In the SMPP paradigm either compiler directives (as pragmas in C, and special comments in Fortran) or explicit threading calls (e.g., with Pthreads or Cilk) are employed. The majority of science codes now use OpenMP directives that are understood by most vendor compilers, as well as the GNU compilers.

In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.


The Stampede system's default programming environment uses the Intel C++ and Fortran compilers. These Intel compilers are the only compilers able to compile programs for the Phi coprocessors. This section explains and illustrates the basic uses of these compilers for all the programming models described above (OpenMP/MPI on the E5/Phi systems in native mode, and in offload and heterogeneous modes). The Intel compiler commands can be used for both compiling (making ".o" object files) and linking (making an executable from ".o" object files).

The Intel Compiler Suite

The Intel Fortran and C++ compilers are loaded as defaults upon login. TACC staff recommends using the Intel compilers whenever possible. The Intel suite has been installed with 64-bit standard libraries and compiles programs as 64-bit applications (as the default compiler mode). Since the E5's and Phi coprocessors are new architectures, which rely on optimizations in the newer compilers, any program compiled for another Intel system should be recompiled.

The Intel Fortran, C, and C++ compiler commands are "ifort", "icc", and "icpc", respectively. Use the "-help" option with any of these commands to display a list and explanation of all the compiler options, useful during debugging and optimization.

Basic Compiler Commands and Serial Program Compiling

Appropriate file name extensions are required for each compiler. By default, the executable filename is "a.out", but it may be renamed with the "-o" option. We use "a.out" throughout this guide to designate a generic executable file. The compiler command performs two operations: it makes a compiled object file (having a ".o" suffix) for each file listed on the command-line, and then combines them with system library files in a link step to create an executable. To compile without the link step, use the "-c" option.

The same code can be compiled to run either natively on the host or natively on the MIC. Use the same compiler commands for the host (E5) or the MIC (Phi) compiling, but include the "-mmic" option to create a MIC executable. We suggest you name MIC executables with a ".mic" suffix.

Table 5. Compiling Serial Programs

Language Compiler File Extension Example
C icc .c icc compiler_options prog.c
C++ icpc .C, .cc, .cpp, .cxx icpc compiler_options prog.cpp
F77 ifort .f, .for, .ftn ifort compiler_options prog.f
F90 ifort .f90, .fpp ifort compiler_options prog.f90

The following examples illustrate how to rename an executable (-o option), compile for the host (run on the E5 processors), and compile for the MIC (run natively on the MIC):

A C program example:

login1$ icc -xhost -O2 -o flamec.exe prog.c

A Fortran program example:

login1$ ifort -xhost -O2 -o flamef.exe prog.f90

Commonly used options may be placed in an "icc.cfg" or "ifc.cfg" file for compiling C and Fortran code, respectively.

For additional information, execute the compiler command with the "-help" option to display every compiler option, its syntax, and a brief explanation, or display the corresponding man page, as follows:

login1$ icc   -help
login1$ icpc  -help
login1$ ifort -help
login1$ man icc
login1$ man icpc
login1$ man ifort

Some of the more important options are listed in the Basic Optimization section of this guide.

Compiling OpenMP Programs

Since each Stampede node has many cores (16 E5 and 61 Phi cores), applications can take advantage of shared-memory parallelism by using threading paradigms such as OpenMP. For applications with OpenMP parallel directives, include the "-openmp" option on the compiler command line to enable parallel thread generation. Use the "-openmp-report" option to display diagnostic information.

Table 6. Compiling OpenMP Programs

Compiler Options
-openmp Enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
Use whenever OpenMP pragmas are present in code, for the E5 processor or Phi coprocessor.
-openmp-report[0|1|2] Controls the OpenMP parallelizer diagnostic level

Below are host compile examples for enabling OpenMP code directives.

login1$ icc   -xhost -openmp -O2 -o flamec.exe      prog.c
login1$ ifort -xhost -openmp -O2 -o flamef.exe      prog.f90

The Intel compiler accepts OpenMP pragmas and OpenMP API calls that adhere to the OpenMP 3.1 standard. The $KMP_AFFINITY and OpenMP environment variables that set thread affinity and thread control are described in the "running code" section below.

Compiling MPI Programs

Stampede supports two versions of MPI: MVAPICH2, an open-source MPI library from the Network-Based Computing Laboratory (NBCL) at Ohio State University, and the proprietary Intel MPI Library 4.1. Both libraries provide MPI compiler commands that invoke an appropriate compiler with the appropriate MPI include-file path and MPI library-file path options. At login, the MVAPICH2 (mvapich2) and Intel compiler (intel) modules are loaded to produce the default environment, which places the MVAPICH2 compiler commands (mpixxx) in the user execution path ($PATH). To use Intel MPI, swap in the impi module ("module swap mvapich2 impi"). Use the mpixxx compiler commands in Table 7 for both the MVAPICH2 and Intel MPI libraries. Communication performance may differ, and run-time controls are managed by a different set of run-time environment variables that are documented in the MVAPICH2 and Intel MPI user guides and reference manuals.

The MPI compiler commands for MVAPICH2 and Intel libraries are listed for each language in the table below:

Table 7. MPI Compiler Commands

Command  MPI               Language  File Extensions                     Example
mpicc    mvapich2, intel   C         .c                                  mpicc compiler_options myprog.c
mpicxx   mvapich2, intel   C++       .C, .cc, .cpp, .cxx, .c++, .i, .ii  mpicxx compiler_options myprog.cpp
mpif77   mvapich2, intel   F77       .f, .for, .ftn                      mpif77 compiler_options myprog.f
mpif90   mvapich2, intel   F90       .f90, .fpp                          mpif90 compiler_options myprog.f90

Appropriate file name extensions are required for each wrapper. By default, the executable is named "a.out"; you may rename it using the "-o" option. To compile without the link step, use the "-c" option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.

MVAPICH2 Compile examples

login1$ mpicc  -xhost -O2 -o simulate.exe mpi_prog.c
login1$ mpif90 -xhost -O2 -o simulate.exe mpi_prog.f90

Include linker options, such as library paths and library names, after the program module names, as explained in the Libraries section below. The Running Applications section explains how to execute MPI executables in batch scripts and interactive batch runs on compute nodes. Note: Use the Intel MPI (impi) or mvapich2-mic modules for running MPI applications on the MIC. The regular mvapich2 modules do not support MIC native applications.

Compiling with gcc

We recommend that you use the Intel compiler for optimal code performance. TACC does not support the use of the gcc compiler for production codes on the E5 processors, and there is no information from Intel about combining Intel and gcc binaries for the Phi coprocessor. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the "-cc=gcc" option on the MPI compiler command for the modules requiring gcc. (Since gcc- and Intel-compiled codes are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler for these two cases:

gcc Compile Examples

  • Create object file suba.o with gcc

    login1$ mpicc -O3 -xhost -c -cc=gcc suba.c
  • Create a.out: compile main with icc; load in suba.o

    login1$ mpicc -O3 -xhost mymain.c suba.o
  • Create object file suba.o with icc

    login1$ mpicc -O3 -xhost -c suba.c
  • Create a.out: compile main with gcc; load in suba.o

    login1$ mpicc -O3 -xhost -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o

Compiler Optimization Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for inter-procedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

Levels of optimization are set with the "-On" option, detailed in Table 8 below.

Table 8. Compiler optimization levels

Level -On  Description
n = 0      Fast compilation, full debugging support. Automatically enabled if using -g.
n = 1,2    Low to moderate optimization, partial debugging support.
n = 3      Aggressive optimization: enables -O2, plus more aggressive prefetching and loop
           transformations. Compile time/space intensive and/or marginal effectiveness; may
           change code semantics and results (sometimes even breaks code!).

Table 9 lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Table 9. Compiler Options Affecting Performance

Option                      Description
-assume buffered_io         Use this to ensure buffered I/O from your Fortran executables (recommended).
-c                          Compile the source file only (no link step).
-fpe0                       Enable floating-point exceptions. Useful for debugging.
-fp-model arg               Enable floating-point model variation:
                              • [no-]except : enable/disable floating-point exception semantics
                              • fast[=1|2]  : enables more aggressive floating-point optimizations
                              • precise     : allows only value-safe optimizations
                              • source      : allows intermediates in source precision
                              • strict      : enables fp-model precise and fp-model except, disables contractions, and enables pragma stdc fenv_access
                              • double      : rounds intermediates in 53-bit (double) precision
                              • extended    : rounds intermediates in 64-bit (extended) precision
-fast                       Combination of aggressive optimization options; implies -static and -no-prec-div (not recommended).
-g                          Debugging information; generates a symbol table (also disables using EBP as a general-purpose register).
-help                       Lists options.
-ip                         Enable single-file inter-procedural (IP) optimizations (within files).
-ipo                        Enable multi-file IP optimizations (between files).
-no-offload                 Disable any offload usage.
-O3                         Aggressive optimization (-O2 is the default).
-offload-option,mic,tool,"opts"  Appends additional options to the defaults for tool (compiler, ld, or as); opts is a comma-separated list of options.
-opt-prefetch               Enables data prefetching.
-opt-streaming-stores arg   Specifies whether streaming stores are generated:
                              • always : enable streaming stores, assuming the application is memory bound
                              • auto   : [DEFAULT] the compiler decides when streaming stores are used
                              • never  : disable generation of streaming stores
-openmp                     Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp-report[0|1|2]       Controls the OpenMP parallelizer diagnostic level.
-vec-report[0|...|6]        Controls the amount of vectorizer diagnostic information.
-xhost                      Includes specialized code for the AVX instruction set.

    Basic Optimization for Serial and Parallel Programming using OpenMP and MPI

    The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90; and icpc and mpicxx.

    Developers often experiment with the following options: "-pad", "-align", "-ip", "-no-rec-div" and "-no-rec-sqrt". In some codes these options may decrease performance. Please see the Intel compiler manual for a full description of each option.

    Use the "-help" option with the mpi commands for additional information:

    login1$ mpicc  -help
    login1$ mpicxx -help
    login1$ mpif90 -help
    login1$ mpirun -help

    Use the options listed for mpirun with the ibrun command in your job script. For details, consult the MPI standard documentation.


    Libraries

    Bindings to the most recent shared libraries are configured in the "/etc/ld.so.conf" file (and cached in the "/etc/ld.so.cache" file). Cat the "/etc/ld.so.conf" file to see the TACC-configured directories, or execute:

    login1$ /sbin/ldconfig -p

    To see a list of directories and candidate libraries, echo the $LD_LIBRARY_PATH environment variable; use the "-Wl,-rpath" loader option to override the default runtime bindings.

    Environment variables set by particular modules can be viewed with the "module show" command:

    login1$ module show modulefile

    Some of the more useful load flags/options for the host environment are listed below. For a more comprehensive list, consult the ld man page.

    • Use the "-l" loader option to link in a library at load time:

      login1$ ifort prog.f90 -lname

      This links in either the shared library "libname.so" (default) or the static library "libname.a", provided the library can be found in the loader's library search path or the $LD_LIBRARY_PATH environment variable paths.

    • To explicitly include a library directory, use the "-L" option:

      login1$ ifort prog.f -L/mydirectory/lib -lname

    In the above example, the user's libname.a library is not in the default search path, so the "-L" option is specified to point to the directory containing libname.a (only the library name is supplied in the "-l" argument; remove the "lib" prefix and the ".a" suffix.)

    Many of the modules for applications and libraries, such as the hdf5 library module, provide environment variables for compiling and linking commands. Execute the "module help module_name" command for a description, listing, and use cases of the assigned environment variables. The following example illustrates their use for the hdf5 library:

    login1$ icc -I$TACC_HDF5_INC hdf5_test.c -o hdf5_test \
                    -Wl,-rpath,$TACC_HDF5_LIB -L$TACC_HDF5_LIB -lhdf5 -lz

    Here, the module supplied environment variables $TACC_HDF5_LIB and $TACC_HDF5_INC contain the hdf5 library and header library directory paths, respectively. The loader option "-Wl,-rpath" specifies that the $TACC_HDF5_LIB directory should be included in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of the $LD_LIBRARY_PATH or the LDD dynamic cache of bindings between shared libraries and directory paths. This avoids having to set the $LD_LIBRARY_PATH (manually or through a module command) before running the executables. (This simple load sequence will work for some of the unthreaded MKL functions; see MKL Library section for using various packages within MKL.)

    You can view the full path of the dynamic libraries inserted into your binary with the ldd command. The example below shows a partial listing for the hdf5_test binary:

    login1$ ldd hdf5_test
    ...
    libhdf5.so.7 => /opt/apps/intel13/hdf5/1.8.9/lib/libhdf5.so.7 (0x00002abdc2...)
    libz.so.1 => /lib64/libz.so.1 (0x00000034a7a00000)
    libm.so.6 => /lib64/tls/libm.so.6 (...)
    ...

    Intel Math Kernel Library (MKL)

    The Intel Math Kernel Library (MKL) is a collection of highly optimized functions implementing some of the most important mathematical kernels used in computational science, including standardized interfaces to:

    • BLAS (Basic Linear Algebra Subroutines), a collection of low-level matrix and vector operations like matrix-matrix multiplication
    • LAPACK (Linear Algebra PACKage), which includes higher-level linear algebra algorithms like Gaussian Elimination
    • FFT (Fast Fourier Transform), including interfaces based on FFTW (Fastest Fourier Transform in the West)
    • ScaLAPACK (Scalable LAPACK), BLACS (Basic Linear Algebra Communication Subprograms), Cluster FFT, and other functionality that provide block-based distributed memory (multi-node) versions of selected LAPACK, BLAS, and FFT algorithms;
    • Vector Mathematics (VM) functions that implement highly optimized and vectorized versions of special functions like sine and square root.

    MKL with Intel C, C++, and Fortran Compilers

    There is no MKL module for the Intel compilers because you don't need one: the Intel compilers have built-in support for MKL. Unless you have specialized needs, there is no need to specify include paths and libraries explicitly. Instead, using MKL with the Intel modules requires nothing more than compiling and linking with the "-mkl" option, e.g.:

    login1$ icc   -mkl mycode.c
    login1$ ifort -mkl mycode.f90

    The "-mkl" switch is an abbreviated form of "-mkl=parallel", which links your code to the threaded version of MKL. To link to the unthreaded version, use "-mkl=sequential". A third option, "-mkl=cluster", which also links to the unthreaded libraries, is necessary and appropriate only when using ScaLAPACK or other distributed memory packages. For additional information, including advanced linking options, see the MKL documentation and Intel MKL Link Line Advisor.

    MKL with GNU C, C++, and Fortran Compilers

    When using a GNU compiler, load the MKL module before compiling or running your code, then specify explicitly the MKL libraries, library paths, and include paths your application needs. Consult the Intel MKL Link Line Advisor for details. A typical compile/link process on a TACC system will look like this:

    login1$ module load gcc
    login1$ module load mkl   # available/needed only for GNU compilers
    login1$ gcc -fopenmp -I$MKLROOT/include      \
            -Wl,-L${MKLROOT}/lib/intel64 \
            -lmkl_intel_lp64 -lmkl_core  \
            -lmkl_gnu_thread -lpthread   \
            -lm -ldl mycode.c

    For your convenience the mkl module file also provides alternative TACC-defined variables like $TACC_MKL_INCLUDE (equivalent to $MKLROOT/include). Execute "module help mkl" for more information.

    Using MKL as BLAS/LAPACK with Third-Party Software

    When your third-party software requires BLAS or LAPACK, you can use MKL to supply this functionality. Replace generic instructions that include link options like "-lblas" or "-llapack" with the simpler MKL approach described above. There is no need to download and install alternatives like OpenBLAS.

    Using MKL as BLAS/LAPACK with TACC's MATLAB, Python, and R Modules

    TACC's MATLAB, Python, and R modules all use threaded (parallel) MKL as their underlying BLAS/LAPACK library. This means that even serial codes written in MATLAB, Python, or R may benefit from MKL's thread-based parallelism. This requires no action on your part other than specifying an appropriate max thread count for MKL; see the section below for more information.

    Controlling Threading in MKL

    Any code that calls MKL functions can potentially benefit from MKL's thread-based parallelism; this is true even if your code is not otherwise a parallel application. If you are linking to the threaded MKL (using "-mkl", "-mkl=parallel", or the equivalent explicit link line), you need only specify an appropriate value for the max number of threads available to MKL. You can do this with either of the two environment variables MKL_NUM_THREADS or OMP_NUM_THREADS. The environment variable MKL_NUM_THREADS specifies the max number of threads available to each instance of MKL, and has no effect on non-MKL code. If MKL_NUM_THREADS is undefined, MKL uses OMP_NUM_THREADS to determine the max number of threads available to MKL functions. In either case, MKL will attempt to choose an optimal thread count less than or equal to the specified value. Note that OMP_NUM_THREADS defaults to 1 on TACC systems; if you use the default value you will get no thread-based parallelism from MKL.

    If you are running a single serial, unthreaded application (or an unthreaded MPI code involving a single MPI task per node) it is usually best to give MKL as much flexibility as possible by setting the max thread count to the total number of hardware threads on the node (16 on Sandy Bridge, 272 on KNL). Of course things are more complicated if you are running more than one process on a node: e.g. multiple serial processes, threaded applications, hybrid MPI-threaded applications, or pure MPI codes running more than one MPI rank per node. See the Intel MKL documentation and related Intel resources for examples of how to manage threading when calling MKL from multiple processes.


    DDT is a symbolic, parallel debugger that allows graphical debugging of MPI applications. For information on how to perform parallel debugging using DDT on Stampede, please see the DDT Debugging Guide.

    If you need to debug MPI applications, please see the idev section below. TACC's idev tool allows interactive access to a subset of compute nodes for debugging purposes.

    Code Tuning

    Memory Tuning

    There are a number of techniques for optimizing application code and tuning the memory hierarchy.

    Maximize cache reuse

    The following snippets of code illustrate the correct way to access contiguous elements, i.e., stride 1, for a matrix in both C and Fortran.

    Fortran example:

        real*8 :: a(m,n), b(m,n), c(m,n)
        ...
        do i=1,n
          do j=1,m
            a(j,i) = b(j,i) + c(j,i)   ! inner loop over first index: stride 1
          end do
        end do

    C example:

        double a[m][n], b[m][n], c[m][n];
        ...
        for (i=0; i<m; i++){
          for (j=0; j<n; j++){
            a[i][j] = b[i][j] + c[i][j];   /* inner loop over last index: stride 1 */
          }
        }

    Prefetching is the ability to predict the next cache line to be accessed and start bringing it in from memory. If data is requested far enough in advance, the latency to memory can be hidden. The compiler inserts prefetch instructions into loops -- instructions that move data from main memory into cache in advance of its use. Prefetching may also be specified by the user using directives.

    Example: In the following dot-product example, the number of streams prefetched is increased from 2, to 4, to 6, for the same functionality. However, prefetching a larger number of streams does not necessarily translate into increased performance; there is a threshold beyond which prefetching more streams can be counterproductive.

    2 streams:

        do i=1,n
          sum = sum + a(i)*b(i)
        end do

    4 streams:

        do i=1,n/2
          sum1 = sum1 + a(i)*b(i)
          sum2 = sum2 + a(i+n/2)*b(i+n/2)
        end do

    6 streams:

        do i=1,n/3
          sum1 = sum1 + a(i)*b(i)
          sum2 = sum2 + a(i+n/3)*b(i+n/3)
          sum3 = sum3 + a(i+2*(n/3))*b(i+2*(n/3))
        end do
        do i=3*(n/3)+1,n   ! remainder iterations
          sum1 = sum1 + a(i)*b(i)
        end do

    Fit the problem

    Make sure to fit the problem size to memory (32GB/node) as there is no virtual memory available for swap.

    1. Always minimize stride length. For the best-case scenario, stride length 1 is optimal for most systems, in particular vector systems. If that is not possible, then low-stride access should be the goal. This increases cache efficiency and sets up hardware and software prefetching. Stride lengths that are powers of two are typically the worst-case scenario, leading to cache misses.
    2. Another approach is data reuse in cache via cache blocking. The idea is to load chunks of data so that they fit maximally in the different levels of cache while in use. Otherwise the data has to be reloaded into cache from memory each time it is needed, since it is no longer in cache; this is commonly known as a cache miss. Cache misses are costly from the computational standpoint, since the latency for loading data from memory is a few orders of magnitude higher than from cache. The goal is to keep as much of the data as possible in cache while it is in use and to minimize loading it from memory.
    3. This concept is illustrated in the following matrix-matrix multiply example, where the i, j, k loop indices are arranged so that the largest possible sub-matrices remain in cache while the computation proceeds.

      Example: Matrix multiplication

            real*8 a(n,n), b(n,n), c(n,n)
            do ii=1,n,nb ! < nb is blocking factor
              do jj=1,n,nb
                do kk=1,n,nb
                  do i=ii,min(n,ii+nb-1)
                    do j=jj,min(n,jj+nb-1)
                      do k=kk,min(n,kk+nb-1)
                        c(i,j) = c(i,j) + a(i,k)*b(k,j)
                      end do
                    end do
                  end do
                end do
              end do
            end do
    4. Another standard issue is the leading dimension of stored arrays: it is always best to avoid leading dimensions that are a multiple of a high power of two . Users should be particularly aware of the cache line size and associativity. Performance degrades when the stride is a multiple of the cache line size.
    5. Example : Consider an L1 cache that is 16K in size and 4-way set associative, with a cache line of 64 Bytes.

      Problem : A 16K 4-way set-associative cache with 64-byte lines holds 256 cache lines, organized as 64 sets of 4 lines each. Addresses that are a multiple of 64 sets x 64 bytes = 4K apart map to the same set. With a leading dimension of 1024 (a stride of 1024 x 8 = 8192 bytes for real*8), every access in the loop below maps to the same set, effectively reducing the L1 from 256 usable cache lines to only 4. That results in a 256-byte cache, down from the original 16K, due to the non-optimal choice of leading dimension.

            real*8 :: a(1024,50)
            do i=1,n
              s = s + a(1,i)   ! consecutive accesses are 1024*8 = 8192 bytes apart
            end do

      Solution : Change the leading dimension to 1028 (1024 plus half a cache line, i.e. 4 real*8 elements).

    6. Encourage Data Prefetching to Hide Memory Latency
    7. Work within available physical memory
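
    The set-conflict arithmetic from the cache example above can be verified with simple shell integer math. The sketch below uses the same parameters as the example (16 KB, 4-way set-associative, 64-byte lines, so 64 sets) and prints which set each of the first eight strided real*8 accesses maps to:

```shell
# 16 KB, 4-way set-associative L1 with 64-byte lines: 256 lines in 64 sets of 4
line_bytes=64; num_sets=64
for ld in 1024 1028; do
  sets=""
  for i in 0 1 2 3 4 5 6 7; do
    addr=$(( i * ld * 8 ))      # byte offset of the i-th strided real*8 access
    sets="$sets $(( (addr / line_bytes) % num_sets ))"
  done
  echo "leading dimension $ld -> sets:$sets"
done
```

    With a leading dimension of 1024 every access lands in set 0, so only that set's 4 ways are usable; padding to 1028 walks successive accesses across different sets.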

    Floating-Point Tuning

    Unroll Inner Loops to Hide FP Latency

    The following dot-product example illustrates two points. If the inner loop trip count is small, loop overhead makes it profitable to unroll the inner loop. In addition, unrolling hides floating-point latency by keeping several independent partial sums in flight. A more advanced micro-level optimization metric is the ratio of floating-point operations (specifically fused multiply-adds) to data accesses in a compute step: the higher this ratio, the better.

          do i=1,n,k
            s1 = s1 + x(i)*y(i)
            s2 = s2 + x(i+1)*y(i+1)
            s3 = s3 + x(i+2)*y(i+2)
            s4 = s4 + x(i+3)*y(i+3)
            ! ...
            sk = sk + x(i+k-1)*y(i+k-1)
          end do
          dotp = s1 + s2 + s3 + s4 + ... + sk

    Avoid Divide Operations

    The following example illustrates a very common optimization: a floating-point divide is much more expensive than a multiply. If a divide by a loop-invariant value appears inside a loop, it is better to compute the reciprocal once outside the loop and multiply by it inside, provided no dependencies exist. Another alternative is to replace the loop with a call to an optimized vector intrinsics library, if one is available.

          do i=1,n
            y(i) = x(i)/a       ! divide in every iteration
          end do

          ainv = 1.0d0/a        ! single divide outside the loop
          do i=1,n
            y(i) = x(i)*ainv
          end do

    Fortran90 Performance Pitfalls

    Several coding issues impact the performance of Fortran90 applications. For example, consider the following two ways of writing the same array update, one with explicit loops and one with F90 array syntax:

    Case 1:

        do j = js,je
          do k = ks,ke
            do i = is,ie
              rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
            end do
          end do
        end do

    Case 2:

        rt(is:ie,ks:ke,js:je) = rt(is:ie,ks:ke,js:je) - &
                                smdiv*(rt(is:ie,ks:ke,js:je) - rtold(is:ie,ks:ke,js:je))

    Although it is more elegant, the array syntax in the second version incurs a significant performance penalty over explicit loops on cache-based systems (vector systems, by contrast, tend to favor array syntax). More importantly, the array syntax can generate large temporary arrays on the program stack.

    The way the arrays are declared also impacts performance. The following example shows two ways of declaring F90 dummy arrays. In the second case, the performance impact is severe: compile time increases almost seventy-fold, and run time per call grows by roughly 30%.

    Case 1:

        REAL, DIMENSION( ims:ime , kms:kme , jms:jme ) :: r, rt, rw, rtold
        Results in F77-style assumed-size arrays
        Compile time: 46 seconds
        Run time: .064 seconds / call

    Case 2:

        REAL, DIMENSION( ims: , kms: , jms: ) :: r, rt, rw, rtold
        Results in F90-style assumed-shape arrays
        Compile time: 3120 seconds!!
        Run time: .083 seconds / call

    Another issue arises when an F90 assumed-shape array is passed as an argument to a subroutine: the compiler may pass the subroutine a contiguous copy rather than the address of the array itself. This F90 copy-in/copy-out overhead is inefficient and may cause errors when calling external libraries.

    IO Tuning

    The TACC Lustre parallel I/O systems are collections of I/O servers and a large number of disks that act together as one very large disk. One special server, the Meta-Data Server (MDS), tracks where each file is located across the disks of the different I/O servers.

    Because Lustre aggregates a large collection of disks, large files can be read and written across many disks quickly. However, every open, close, and file lookup must go through the single Meta-Data Server, and this often becomes the bottleneck in I/O performance. With hundreds of jobs running at the same time and sharing the Lustre file system, there can be heavy contention for access to the Meta-Data Server.

    When considering I/O performance, follow the "avoid too often and too many" rule: writing many small files, opening and closing files frequently, and writing to a separate file from each task of a large parallel job all stress the MDS. It is best to aggregate I/O operations whenever possible. For the best I/O performance, consider libraries such as parallel HDF5 that write single files in parallel efficiently.

    A common-sense approach is to use what the vendor provides, i.e., to take advantage of the hardware. On Linux-based clusters, for example, this might mean using the Parallel Virtual File System (PVFS); on IBM systems, it would mean using the fast General Parallel File System (GPFS) provided by IBM.

    Another sensible approach is to be aware of where the file systems live, i.e., whether a file system is locally mounted or accessed through a remote file system. The former is much faster than the latter, due to network bandwidth limits, disk speed, and the overhead of accessing the file system over the network; local access should always be the goal at the design level.

    The remaining approaches involve choosing the best software options available. Some of them are enumerated below:

    • Read or write as much data as possible with a single READ/WRITE/PRINT. Avoid multiple writes of small records.
    • Use binary rather than ASCII format to avoid the overhead of converting between the internal representation of real numbers and character strings. ASCII files are also larger than the corresponding binary files.
    • In Fortran, prefer direct access to sequential access. Direct (random) access files do not carry record-length indicators at the beginning and end of each record.
    • If available, use asynchronous I/O to overlap reads and writes with computation.
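
    The size difference between ASCII and binary output is easy to demonstrate. The sketch below (hypothetical file names) writes the same 100,000 double-precision values both ways and compares the resulting sizes:

```shell
# write the same 100,000 doubles as ASCII text and as raw binary
python3 - <<'EOF'
import struct
vals = [i * 0.1 for i in range(100000)]
with open('ascii.dat', 'w') as f:
    f.write('\n'.join('%.15e' % v for v in vals))   # ~22 bytes per value
with open('binary.dat', 'wb') as f:
    f.write(struct.pack('%dd' % len(vals), *vals))  # exactly 8 bytes per value
EOF
wc -c ascii.dat binary.dat
```

    The binary file is exactly 800,000 bytes (8 bytes per value), while the ASCII version is roughly 2.2 MB, before counting any of the conversion cost.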

    General I/O Tips

    1. Don't open and close files with every I/O operation. Open the files needed at the beginning of execution and close them at the end. Each open/close operation carries an overhead cost that adds up, especially when multiple tasks open and close files at the same time.

    2. Use the $SCRATCH filesystem. $SCRATCH has more I/O servers than $WORK or $HOME. If you need to keep your input/output data on $WORK, you may add commands to your batch script to copy data to $SCRATCH at the beginning of a run and copy it out at the end.

    3. Limit the number of files in one directory. Directories containing hundreds or thousands of files respond slowly to I/O operations because of the overhead of indexing the files. Break up directories with large numbers of files into subdirectories with fewer files.

    4. Aggregate I/O operations as much as possible. Fewer large read/write operations are much more efficient than many small operations. This may be accomplished by reducing the number of writers from one per task to one per node or fewer, balancing the bandwidth of a node against the bandwidth of the I/O servers.

    Parallel I/O Striping

    There are different Lustre striping options you should use when writing to a single file or multiple files.

    • Multiple Files/Multiple I/O tasks -- Set the stripe count to 1. For a large number of files, set the stripe count for the output directory to 1:

      login1$ lfs setstripe -c 1 ./my_output_dir

      If you are writing the same amount of data with each write, you should try setting a default stripe size for the output files.

      login1$ lfs setstripe -s 16m ./my_output_dir

      Please limit these settings to the directory in which the output will be stored.

      Stripe counts higher than 80 are usually unnecessary. Please do not set the stripe count higher than 160. Attempting to do so may crash the Meta-Data Server.

    • Single File/Multiple tasks -- Increase the stripe count to the number of nodes. For example, if 1024 tasks are writing to the same file from 64 nodes, set the stripe count to 64:

      login1$ lfs setstripe -c 64 ./my_output_file

      Also, experiment with the size of the stripe to improve performance:

      login1$ lfs setstripe -s 16m ./my_output_file

    Writing to a Single File using Parallel Libraries

    There are three common ways to write a single file in parallel: MPI I/O, NetCDF-4 and HDF5. Any of these works well; developers with no existing preference should consider HDF5. An effective way to learn HDF5 is to start with the HDF5 tutorial.

    The T3PIO library improves the parallel performance of MPI I/O and HDF5 by setting the number of stripes and the stripe size at runtime for each file created. It is available as a module on TACC systems.

    Running Jobs

    Job Accounting

    Computing services across TACC are allocated and charged in Service Units (SUs). The number of SUs charged per job depends on the number of nodes used, the number of cores per node, and the wallclock duration of the job.

    Each Stampede node (16 E5 cores and a Phi coprocessor, plus K20 GPUs on the gpu nodes) can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all 16 cores whether or not all cores are used. Allocation usage is based solely on E5 core wall-clock hours; the Phi coprocessors and the NVIDIA GPUs are "free" components within the nodes. Codes run in the serial queue are charged the full rate of 16 cores/node.

    The SU queue multiplier for each queue is listed below in Table 10. The SUs charged for any job are:

    Stampede SUs charged (core hours) = # nodes * # cores/node * wallclock time * queue multiplier

    A queue multiplier greater than 1 is assigned to high-demand resources such as the largemem queue, where jobs are charged at twice the rate of the other queues. Since the large-memory nodes contain 32 cores/node, the effective cost of this queue is 4x that of the other queues:

    Stampede largemem queue SUs charged = # nodes * 32 cores/node * wallclock time * 2
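
    As a worked example (the job sizes are hypothetical; the rates come from the formulas above), two 12-hour jobs can be compared with shell arithmetic:

```shell
# normal queue: 8 nodes x 16 cores x 12 hours x multiplier 1
echo $(( 8 * 16 * 12 * 1 ))
# largemem queue: 2 nodes x 32 cores x 12 hours x multiplier 2
echo $(( 2 * 32 * 12 * 2 ))
```

    Both jobs cost 1536 SUs: two largemem nodes are charged as much as eight normal nodes for the same wall-clock time, which is the 4x effective per-node rate noted above.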

    Slurm Job Scheduler

    Batch facilities such as LoadLeveler, LSF, SGE and Slurm differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (cancel, resource request modification, etc.). Here we provide an overview of the Simple Linux Utility for Resource Management (Slurm) batch environment, describe the Stampede queue structure and associated queue commands, and list basic Slurm job control commands along with options.

    Sandy Bridge Cluster Production Queues

    The Sandy Bridge Cluster production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in Table 10 below. Queues that don't appear in the table (such as systest, sysdebug, and clean) are non-production queues for system and HPC group testing and special support.

    Table 10. Stampede Production Queues

    Queue Name Max Runtime Max Nodes/Procs Max Jobs in Queue Queue Multiplier Purpose
    normal 48 hrs 256 / 4K 50 1 normal production
    development 2 hrs 16 / 256 1 1 development nodes
    largemem 48 hrs 3 / 96 3 2 large memory 32 cores/node
    serial 12 hrs 1 / 16 8 1 serial/shared_memory
    large 24 hrs 1024 / 16K 50 1 large core counts (access by request, see note 1)
    request 24 hrs -- 50 1 special requests
    normal-mic 48 hrs 256 / 4k 50 1 production nodes with KNC coprocessors
    normal-2mic 24 hrs 128 / 2k 50 1 production nodes with two KNC coprocessors
    gpu 24 hrs 32 / 512 50 1 GPU nodes
    gpudev 4 hrs 4 / 64 5 1 GPU development nodes
    vis 8 hrs 32 / 512 50 1 GPU nodes + VNC service
    visdev 4 hrs 4 / 64 5 1 Vis development nodes (GPUs + VNC)

    Note 1: For access to the large queue (or for larger runs in the normal-mic queue), please submit a ticket to the TACC User Portal. Include in your request reasonable evidence of your readiness to run at scale on Stampede. In most cases this should include strong or weak scaling results summarizing experiments you have run on Stampede up to the limits of the normal (or normal-mic) queue. Please see the example Slurm job script for large jobs.

    Eight of the 128 GPU nodes have been reserved for development and are only accessible through the gpudev and visdev queues. The gpu and vis queues share the remaining 120 GPU nodes. The purpose of using two queues is to constrain the time limit for remote visualization and provide a VNC daemon.

    All 6400 compute nodes contain at least one KNC coprocessor. In addition, the 440 nodes in the normal-2mic queue contain two KNC coprocessors. The normal queue provides access to any of the 6,120 compute nodes (6400 less the GPU and development nodes), without regard to the presence of a KNC coprocessor. Use this queue if your application does not require a KNC coprocessor. Use the normal-mic or normal-2mic queues to guarantee access to nodes that contain KNC coprocessors.

    Slurm Environment Variables

    In addition to the environment variables that can be inherited by the job from the interactive login environment, Slurm provides environment variables for most of the values used in the #SBATCH directives. These are listed at the end of the sbatch man page. The environment variables SLURM_JOB_ID, SLURM_JOB_NAME, SLURM_SUBMIT_DIR and SLURM_NTASKS_PER_NODE may be useful for documenting run information in job scripts and output. Table 11 below lists some important Slurm-provided environment variables.

    Note that environment variables cannot be used in an #SBATCH directive within a job script. For example, the following directive will NOT work as expected:

    #SBATCH -o {$SLURM_JOB_ID}.out

    Instead, use the following directive:

    #SBATCH -o myMPI.o%j

    where "%j" expands to the jobID.

    Table 11. Slurm Environment Variables

    Environment Variable Description
    SLURM_JOB_ID batch job id assigned by Slurm upon submission
    SLURM_JOB_NAME user-assigned job name
    SLURM_NNODES number of nodes
    SLURM_NODELIST list of nodes
    SLURM_NTASKS total number of tasks
    SLURM_QUEUE queue (partition)
    SLURM_SUBMIT_DIR directory of submission
    SLURM_TASKS_PER_NODE number of tasks per node
    SLURM_TACC_ACCOUNT TACC project/allocation charged

    Job Submission with sbatch

    Use Slurm's sbatch command to submit job scripts:

    login1$ sbatch myjobscript

    where myjobscript is the name of a UNIX-format text file containing Slurm sbatch directives, resource specifications, and any shell commands needed. Some of the most common sbatch options are described below in Table 12 and in the example job scripts. Details are available in the man pages:

    login1$ man sbatch

    Options can be passed to sbatch on the command-line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used sbatch commands in a script file that will be reused several times rather than retyping the sbatch commands at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script. All batch submissions MUST specify a time limit and total tasks. Jobs that do not use the -t (time) and -n (total tasks) options will be rejected.

    Batch scripts contain two types of statements: scheduler directives and shell commands in that order. Scheduler directive lines begin with #SBATCH and are followed with sbatch options. Slurm stops interpreting #SBATCH directives after the first appearance of a shell command (blank lines and comment lines are okay). The UNIX shell commands are interpreted by the shell specified on the first line after the #! sentinel; otherwise the Bash shell (/bin/bash) is used. By default, a job begins execution in the directory of submission with the local (submission) environment. The job script below requests an MPI job with 32 cores and 1.5 hours of run time:

    #SBATCH -J myMPI           # job name
    #SBATCH -o myMPI.o%j       # output and error file name (%j expands to jobID)
    #SBATCH -n 32              # total number of mpi tasks requested
    #SBATCH -p development     # queue (partition) -- normal, development, etc.
    #SBATCH -t 01:30:00        # run time (hh:mm:ss) - 1.5 hours
    #SBATCH --mail-type=begin  # email me when the job starts
    #SBATCH --mail-type=end    # email me when the job finishes
    ibrun ./a.out              # run the MPI executable named a.out

    By default, stderr and stdout are both sent to a file named "slurm-%j.out", where %j is replaced by the job ID. With only an -o option, both streams are directed to the designated output file. If you don't want stderr and stdout directed to the same file, use both the -e and -o options to designate separate output files.

    Developers needing to debug their MPI applications should use TACC's idev tool.

    Table 12. Common sbatch Options

    Option Argument Comments
    -p queue_name Submits to queue (partition) designated by queue_name
    -J job_name Job Name
    -N total_nodes Required for KNL, encouraged for Sandy Bridge. Define the resources you need by specifying either: (1) "-N" and "-n"; or (2) "-N" and "--ntasks-per-node".
    -n total_tasks Total number of MPI tasks in this job. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as "-N".
    --ntasks-per-node= tasks_per_node MPI tasks per node. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1.
    -t hh:mm:ss Wall clock time for job. Required
    --mail-user= email_address Specify the email address to use for notifications.
    --mail-type= begin, end, fail, or all Specify when user notifications are to be sent (one option per line).
    --mem N/A Not available on Stampede. If you attempt to use this option, your job will not run.
    -o output_file Direct job standard output to output_file (without -e option error goes to this file)
    -e error_file Direct job error output to error_file
    -d afterok:jobid Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes
    -A projectnumber Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects.

    Projects and allocation balances are displayed upon command-line login.

    Job Dependencies

    Some workflows may have job dependencies, for example a user may wish to perform post-processing on the output of another job, or a very large job may have to be broken up into smaller pieces so as not to exceed maximum queue runtime. In such cases you may use Slurm's --dependency= options. The following command submits a job script that will run only upon successful completion of another previously submitted job:

     login1$ sbatch --dependency=afterok:jobid job_script_name

    Job Deletion with scancel

    The scancel command is used to remove pending and running jobs from the queue. Include a space-separated list of job IDs that you want to cancel on the command-line:

    login1$ scancel job_id1 job_id2 ...

    Use "showq -u" or "squeue -u username" to see your jobs.

    Example job scripts are available online in /share/doc/slurm . They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

    Sample Slurm Job Scripts

    Sample Slurm job submission scripts are available for a variety of run configurations.

    Viewing Job & Queue Status

    After job submission, users may monitor the status of their jobs in several ways. While a job waits, the system continuously monitors the number of nodes becoming available and applies fair-share and backfill algorithms to schedule jobs fairly and keep the machine running at optimum capacity. The latest queue information can be displayed in several ways:

    Quick view with showq

    TACC's "showq" job monitoring command-line utility displays jobs in the batch system in a manner similar to PBS' utility of the same name. showq summarizes running, idle, and pending jobs, also showing any advanced reservations scheduled within the next week. See Table 13 for some showq options.

    login1$ showq
    ACTIVE JOBS--------------------
    5201623   ms1b_MG_13 bernuzzi      Running 128     18:03:43  Mon May 18 04:24:41
    5224194   SET_6_0    minerj3       Running 32      18:05:01  Mon May 18 04:25:59
    5226688   5kTp4kp04  blandrum      Running 48      10:00:07  Mon May 18 20:21:05
    5256143   CUUG       tg817524      Running 16      19:43:59  Mon May 18 06:04:57
    5265360   NAMD       mahmoud       Running 256      5:20:05  Mon May 18 15:41:03
    login1$ showq -u janeuser
    WAITING JOBS------------------------
    1676351   helloworld janeuser      Waiting 4096    15:30:00  Wed Sep 11 11:59:53
    1676352   helloworld janeuser      Waiting 4096    15:30:00  Wed Sep 11 12:00:07
    1676354   helloworld janeuser      Waiting 4096    15:30:00  Wed Sep 11 12:00:09

    Table 13. showq options

    Option Description
    -l displays queue and node count columns
    -u only active and waiting jobs of the user are reported
    --help get more information on options

    Get some info with sinfo

    The "sinfo" command without arguments might give you more information than you want. Use the print options in in the snippet below with sinfo for a more readable listing that summarizes each queue on a single line. The column labeled "NODES(A/I/O/T)" of this summary listing displays the number of nodes with the Allocated, Idle, and Other states along with the Total node count for the partition. See "man sinfo" for more information. See also the squeue command detailed below.

    Lists the availability and status of queues:

    login1$ sinfo -o "%20P %5a %.10l %16F"

    Job Monitoring with squeue

    Both the showq and squeue commands with the "-u username" option display similar information:

    login1$ squeue -u janeuser
    1676351      normal hellowor janeuser  PD       0:00    256 (Resources)
    1676352      normal hellowor janeuser  PD       0:00    256 (Resources)
    1676354      normal hellowor janeuser  PD       0:00    256 (Resources)

    Each command's output lists the three jobs (1676351, 1676352 & 1676354) waiting to run. The showq command displays the cores and time requested, while the squeue command displays the partition (queue) and the state (ST) of the job, along with the node list once allocated. In this case, all three jobs are in the Pending (PD) state awaiting "Resources" (nodes to free up). Table 14 details common squeue options and Table 15 describes the command's output fields.

    Table 14. Common squeue Options
    Option Result
    -i <interval> Repeatedly report at the given interval (in seconds)
    -j <job_list> Display information for the specified job(s)
    -p <part_list> Display information for the specified partitions (queues)
    -t <state_list> Show jobs in the specified state(s): "all" or a list of {PD,R,S,CG,CD,CF,CA,F,TO,PR,NF} (see the squeue man page for state abbreviations)

    Each < > argument is a comma-separated list.

    The "squeue" command output includes a listing of jobs and the following fields for each job:

    Table 15. Columns in the squeue command output

    Field Description
    JOBID job id assigned to the job
    USER user that owns the job
    STATE current job status, including, but not limited to:
    CD (completed)
    CA (cancelled)
    F (failed)
    PD (pending)
    R (running)

    Using the squeue command with the --start and -j options can provide an estimate of when a particular job will be scheduled:

    login1$ squeue --start -j 1676354
    1676354    normal hellow   janeuser  PD  2013-08-21T13:42:03    256 (Resources)

    Even more extensive job information can be found using the "scontrol" command. The output shows quite a bit about the job: job dependencies, submission time, number of nodes, location of the job script and the working directory, etc. See the man page for more details.

    login1$ scontrol show job 1676354
    JobId=1676991 Name=mpi-helloworld
       UserId=slindsey(804387) GroupId=G-40300(40300)
       Priority=1397 Account=TG-STA110012S QOS=normal
       JobState=PENDING Reason=Resources Dependency=(null)
       Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
       RunTime=00:00:00 TimeLimit=15:30:00 TimeMin=N/A
       SubmitTime=2013-09-11T15:12:49 EligibleTime=2013-09-11T15:12:49
       StartTime=2013-09-11T17:40:00 EndTime=Unknown
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       Partition=normal AllocNode:Sid=login4:27520
       ReqNodeList=(null) ExcNodeList=(null)
       NumNodes=256-256 NumCPUs=4096 CPUs/Task=1 ReqS:C:T=*:*:*
       MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
       Features=(null) Gres=(null) Reservation=(null)
       Shared=0 Contiguous=0 Licenses=(null) Network=(null)

    About Pending Jobs

    Viewing the status of your jobs in the queue may reveal jobs in a pending (PD) state. Jobs submitted to Slurm may enter, and remain in, the pending state for many reasons:

    • A queue (partition) may be temporarily offline
    • The resources (number of nodes) requested exceed those available
    • Queues are being drained in anticipation of system maintenance.
    • The system is running other high priority jobs

    The Reason Codes summarized below identify the reason a job is awaiting execution. If a job is pending for multiple reasons, only one of those reasons is displayed. For a full list, view the squeue man page.

    Job Pending Codes Description
    Dependency This job is waiting for a dependent job to complete.
    NodeDown A node required by the job is down.
    PartitionDown The partition (queue) required by this job is in a DOWN state and temporarily accepting no jobs, for instance because of maintenance. Note that this message may persist for a time even after the system is back up.
    Priority One or more higher priority jobs exist for this partition or advanced reservation; other jobs in the queue have higher priority than yours.
    ReqNodeNotAvail No nodes satisfying the job's limits can be found, for instance because maintenance is scheduled and the job cannot finish before it begins.
    Reservation The job is waiting for its advanced reservation to become available.
    Resources The job is waiting for resources (nodes) to become available and will run when Slurm finds enough free nodes.
    SystemFailure Failure of the Slurm system, a file system, the network, etc.

    Launching Applications

    This section discusses how to launch jobs for each programming model: pure MPI, hybrid (OpenMP+MPI), symmetric, and serial codes. Stampede also provides interactive access to the development nodes for users still experimenting with their codes.

    In the examples below we use "a.out" as the executable name, but the name may of course be any valid application name, along with arguments and file redirections, e.g.:

    ibrun tacc_affinity ./myprogram myargs < myinput

    Please consult the sample Slurm job submission scripts below for various runtime configurations.

    Launching Scalable MPI Programs

    The MVAPICH2 MPI package provides a runtime environment that can be tuned for scalable codes. For codes dominated by short messages, there is a FAST_PATH option that can reduce communication costs, as well as a mechanism to share receive queues. There is also a Hot-Spot Congestion Avoidance option for quelling communication patterns that produce hot spots in the switch. See Chapter 9, Scalable features for Large Scale Clusters and Performance Tuning, and Chapter 10, MVAPICH2 Parameters, of the MVAPICH2 User Guide for more information.

    It is relatively easy to distribute tasks on nodes in the Slurm parallel environment when only the E5 cores are used. The -N option sets the number of nodes and the -n option sets the total MPI tasks.

    Launching MPI Applications with ibrun

    For all codes compiled with any MPI library, use the ibrun command (NOT mpirun) to launch the executable within the job script. The syntax is:

    ibrun ./myprogram <arguments>
    ibrun ./a.out

    The ibrun command supports options for advanced host selection: a subset of the processors from the list of all hosts can be selected to run an executable by applying an offset into the host list. The offset can also be used to run two different executables on two different subsets simultaneously. The option syntax is:

    ibrun -n number_of_cores -o hostlist_offset myprogram myprogram_args

    For the following advanced example, 64 cores were requested in the job script.

    ibrun -n 32 -o  0 ./a.out &
    ibrun -n 32 -o 32 ./a.out &
    wait

    The first call launches a 32-core run on the first 32 hosts in the hostfile, while the second call launches a 32-core run on the next 32 hosts; terminating each command with "&" makes them run concurrently. The wait command (required) pauses the script until all background processes finish; it works in all shells. Note that the -n and -o options must be used together.

    The ibrun command also supports the "-np" option which limits the total number of tasks used by the batch job.

    ibrun -np number_of_cores myprogram myprogram_args 

    Unlike the "-n" option, the "-np" option requires no offset. It is assumed that the offset is 0.

    Using a multiple of 16 cores per node

    For many pure MPI applications, the most cost-efficient choice is to use a multiple of 16 tasks per node, ensuring that every core on each node is assigned one task. Specify the total number of tasks as a value evenly divisible by 16, and Slurm will automatically place 16 tasks on each node. (If the number of tasks is not divisible by 16, one node will receive fewer than 16 tasks.)

    The following example will run on 4 nodes, 16 tasks per node:

    #SBATCH -n 64

    Do not use the -N (number of nodes) option alone; only a single task will be launched on each node in this case.
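    Putting this together, a minimal pure-MPI batch script might look like the sketch below. The job name, output file, queue, and executable are placeholders, not prescribed values:

```shell
#!/bin/bash
#SBATCH -J mympi           # job name (placeholder)
#SBATCH -o mympi.%j.out    # output file; %j expands to the job id
#SBATCH -p normal          # queue (partition); adjust as appropriate
#SBATCH -n 64              # 64 total tasks -> 16 tasks on each of 4 nodes
#SBATCH -t 00:30:00        # maximum run time hh:mm:ss

ibrun ./a.out              # launch with ibrun, not mpirun
```

    Submit the script with sbatch; the -n 64 request, being divisible by 16, fills four nodes completely.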

    Using fewer than 16 cores per node

    When fewer than 16 tasks are needed per node, use a combination of -n and -N. The following resource request

    #SBATCH -N 4 -n 32

    requests 4 nodes with 32 tasks distributed evenly across the nodes and sockets. The number of tasks per node is determined from the ratio tasks/nodes, and each node's tasks are divided evenly across its two sockets (one socket acquires an extra task when the count is odd). When the tasks/nodes ratio is not an integer, floor(tasks/nodes) tasks are placed on each node, and the remaining tasks are assigned sequentially, one each, to nodes at the front of the hostfile list. The distribution across sockets allows maximal memory bandwidth to each socket.
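    The distribution rule above can be sketched with a few lines of shell arithmetic. The node and task counts here are hypothetical, chosen so the ratio is not an integer:

```shell
#!/bin/bash
# Illustration of the floor(tasks/nodes) distribution rule, using a
# hypothetical request of 30 tasks on 4 nodes.
nodes=4
tasks=30
base=$(( tasks / nodes ))    # floor(tasks/nodes) tasks on every node
extra=$(( tasks % nodes ))   # leftover tasks, one each to the first nodes
for (( i = 0; i < nodes; i++ )); do
  if (( i < extra )); then
    echo "node $i: $(( base + 1 )) tasks"
  else
    echo "node $i: $base tasks"
  fi
done
```

    For a -N 4 -n 30 request this prints 8, 8, 7 and 7 tasks for the four nodes, matching the floor-plus-remainder rule.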

    Launching Hybrid Programs

    For hybrid jobs, specify a total-tasks/nodes ratio of 1, 2, 4, 6, 8, 12, or 16. Then set the $OMP_NUM_THREADS environment variable to the number of threads per task, and use tacc_affinity with ibrun. The Bourne-type shell example below illustrates the parameters for a hybrid job: it requests 2 nodes and 4 tasks, that is, 2 tasks per node with 8 threads per task.

    #SBATCH -n4 -N2
    export OMP_NUM_THREADS=8 #8 threads/task
    ibrun tacc_affinity ./a.out

    Please see the sample Slurm job scripts section below for a complete hybrid job script.

    Launching Serial Programs

    For serial batch executions, use one node and one task, launch the executable by name without the ibrun command, and submit your job to the serial queue (partition). The serial queue has a 12-hour runtime limit, allows up to 6 simultaneous runs per user, and has 148 nodes available.

    #SBATCH -N 1 -n 1  # one node and one task
    #SBATCH -p serial  # run in serial queue
    ./a.out            # execute your application (no ibrun)

    Interactive Sessions

    Interactive access to a single node on the supercomputer is extremely useful for developing and debugging codes that may not be ready for full-scale deployment. Interactive sessions are charged to projects just like normal batch jobs. Please restrict usage to the (default) development queue.

    idev on Stampede

    TACC's HPC staff have implemented the idev application on Stampede. idev provides interactive access to a single node, then spawns the resulting interactive environment to as many terminal sessions as needed for debugging purposes. idev is simple to use, bypassing the arcane syntax of the srun command.

    In the sample session below, a user requests interactive access to a single node for 15 minutes in order to debug the progindevelopment application. idev returns a compute node login prompt:

    login1$ idev -m 15    
    --> Sleeping for 7 seconds...OK
    --> Creating interactive terminal session (login) on master node c557-704.
    c557-704$ vim progindevelopment.c
    c557-704$ make progindevelopment

    Now the user may open another window to run the newly-compiled application, while continuing to debug in the original terminal session:

    WINDOW2 c557-704$ ibrun -np 16 ./progindevelopment
    WINDOW2 ...program output ...
    WINDOW2 c557-704$

    Use the "-h" switch to see more options:

    login1$ idev -h


    Slurm's srun command requests an interactive batch job, which is usually scheduled within a short period of time and returns a compute-node name as a prompt. Issue the srun command only from a login node. Command syntax is:

    srun --pty -A projectnumber -p queue -t hh:mm:ss -n tasks -N nodes /bin/bash -l

    The "-A", "-p", "-t", "-n" and "-N" batch options respectively specify the project/allocation number, queue (partition), maximum runtime, total number of tasks and number of nodes. The "-A" option is necessary only for users with multiple projects. The batch job is terminated when the shell is exited.

    The following example illustrates a request for 1 hour in the development queue on one compute node using the bash shell, followed by an MPI executable launch.

    login1$ srun --pty -p development -t 01:00:00 -n16 /bin/bash -l
    c423-001$ ibrun ./a.out

    Affinity and Memory Locality

    HPC workloads often benefit from pinning processes to hardware instead of allowing the operating system to migrate them at will. This is particularly important in multicore and heterogeneous systems, where process (and thread) migration can lead to suboptimal memory access and resource sharing patterns, and thus significant performance degradation. TACC provides an affinity script, tacc_affinity, that enforces strict local memory allocation and pins processes to sockets. For most HPC workloads, tacc_affinity ensures that processes do not migrate and that memory accesses are local. To use tacc_affinity with your MPI executable, use this command:

    c423-001$ ibrun tacc_affinity a.out

    This applies an affinity appropriate for the tasks_per_socket option (or a sensible default affinity if tasks_per_socket is not used) and a memory policy that forces memory allocations to the local socket. Try ibrun with and without tacc_affinity to determine whether your application runs better with the TACC affinity settings.

    However, there may be instances in which tacc_affinity is not flexible enough to meet your requirements. This section describes techniques to control process affinity and memory locality that can improve execution performance on Stampede and other HPC resources. In this section an MPI task is synonymous with a process.

    Do not use multiple methods to set affinity simultaneously as this can lead to unpredictable results.

    Using numactl

    numactl is a Linux command that allows explicit control of process affinity and memory policy. Since each MPI task is launched as a separate process, numactl can specify the affinity and memory policy for each task. There are two ways to exercise NUMA control when launching a batch executable:

    c423-001$ ibrun numactl options ./a.out
    c423-001$ ibrun my_affinity ./a.out

    The first command sets the same options for each task. Because the ranks for the execution of each a.out are not known to numactl it is not possible to use this command-line to tailor options for each individual task. The second command launches an executable script, my_affinity, that sets affinity for each task. The script will have access to the number of tasks per node and the rank of each task, and so it is possible to set individual affinity options for each task using this method. In general any execution using more than one task should employ the second method to set affinity so that tasks can be properly pinned to the hardware.
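    As an illustration, a minimal my_affinity-style wrapper might look like the sketch below. The environment variable names are assumptions: $PMI_RANK is the task rank exported by MVAPICH2, and $TASKS_PER_NODE is a hypothetical stand-in for the tasks-per-node count; check what your MPI stack actually exports before using such a script.

```shell
#!/bin/bash
# my_affinity sketch: pin each MPI task (and its memory) to one of the
# two Sandy Bridge sockets, based on its rank within the node.
rank=${PMI_RANK:-0}             # global MPI rank (assumed variable name)
ppn=${TASKS_PER_NODE:-16}       # tasks per node (assumed variable name)
local_rank=$(( rank % ppn ))    # this task's rank within its node
socket=$(( local_rank % 2 ))    # alternate tasks between sockets 0 and 1

if [ $# -gt 0 ] && command -v numactl >/dev/null; then
  # pin execution (-N) and memory (-m) to the chosen socket (see Table 16)
  exec numactl -N "$socket" -m "$socket" "$@"
else
  echo "task $rank -> socket $socket"   # no program given: just report
fi
```

    It would be launched as "ibrun ./my_affinity ./a.out"; with no arguments it simply reports the computed placement.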

    In threaded applications, the same numactl command may be used, but its scope is global to all threads, because every forked process or thread inherits the affinity and memory policy of the parent. This behavior can be modified from within a program using the NUMA API: the basic calls for binding tasks and threads are "sched_getaffinity" and "sched_setaffinity", while memory policy is controlled through the libnuma library. Note that on the login nodes the core numbers for masking are assigned round-robin to the sockets (cores 0, 2, 4, ... are on socket 0 and cores 1, 3, 5, ... are on socket 1), while on the compute nodes they are assigned contiguously (cores 0-7 are on socket 0 and cores 8-15 are on socket 1).

    The TACC-provided affinity script, tacc_affinity, enforces strict local memory allocation to the socket (forcing eviction of a previous user's IO buffers) and distributes tasks evenly across sockets. If your jobs need a custom affinity scheme, use this script as a template for implementing your own.

    Table 16. Common numactl options

    Option        Arguments             Description
    -N            0,1                   Socket affinity. Execute the process only on this socket (or these comma-separated sockets).
    -C            0-15                  Core affinity. Execute the process only on this core (or these comma-separated cores).
    -l            none                  Memory policy. Allocate only on the socket where the process runs; fall back to the other socket if full.
    -i            0,1                   Memory policy. Strictly allocate round-robin on these comma-separated sockets. No fallback; abort if no more allocation space is available.
    -m            0,1                   Memory policy. Strictly allocate on this socket (or these comma-separated sockets). No fallback; abort if no more allocation space is available.
    --preferred=  0,1 (select one)      Memory policy. Allocate on this socket; fall back to the other if full.
    Additional details on numactl are given in its man page and help information:
          login1$ man numactl
          login1$ numactl --help

    Using Intel's KMP_AFFINITY

    To alleviate the complexity of setting affinity in architectures that support multiple hardware threads per core, such as the MIC family of coprocessors, Intel provides a means of controlling thread pinning via the environment variables $KMP_AFFINITY and $MIC_KMP_AFFINITY. Set these variables to control affinity on the host cores and the Phi coprocessors, respectively.

    login1$ export KMP_AFFINITY=[<modifier>,...]type

    Table 17. KMP_AFFINITY types

    Affinity type   Description
    compact         Pack threads close to each other.
    explicit        Use the proclist modifier to pin threads.
    none            Does not pin threads.
    scatter         Round-robin threads to cores.
    balanced        (Phi coprocessor only) Use scatter, but keep OMP thread ids consecutive.

    KMP_AFFINITY type modifiers include:

    • norespect or respect (OS thread placement)
    • noverbose or verbose
    • nowarnings or warnings
    • granularity=[fine|core] where
      • fine - pinned to HW thread
      • core - able to jump between HW threads within the core
    • proclist={<proc-list>} used with explicit affinity type setting

    The meaning of the different affinity types is best explained with an example. Imagine a system with 4 cores and 4 hardware threads per core. If we place 8 threads, the assignments produced by the compact, scatter, and balanced types are shown in Figure 5 below. Notice that compact does not fully utilize all the cores in the system. For this reason it is recommended that applications be run using the scatter or balanced (Phi coprocessor only) types in most cases.
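    For example, the scatter placement in Figure 5 corresponds to settings along these lines (a sketch; the commented-out a.out stands for any OpenMP executable):

```shell
# Request 8 OpenMP threads, scattered round-robin across cores and
# pinned at hardware-thread granularity; "verbose" makes the runtime
# print each thread's placement at startup.
export OMP_NUM_THREADS=8
export KMP_AFFINITY="verbose,granularity=fine,scatter"
echo "$KMP_AFFINITY"
# ./a.out    # placeholder: launch your OpenMP program here
```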

    Figure 5. KMP Affinity

    Please see the KNL:Running Jobs section below for information on launching jobs on the KNL cluster.

    File Systems

    The TACC HPC platforms have several different file systems with distinct storage characteristics. There are pre-defined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.

    The $HOME, $WORK and $SCRATCH directories on Stampede are Lustre file systems, designed for parallel, high-performance access to large files from within applications. They have been configured to work well with MPI-IO and support access from many compute nodes. Since metadata services for each file system go through a single server (a limitation of Lustre), users should consider efficient strategies for minimizing file operations (opening and closing files) when scaling applications to large node counts. To determine the amount of disk space used in a file system, cd to the directory of interest and execute the df -k . command, including the dot that represents the current directory, as demonstrated below:
      login1$ cd mydirectory
      login1$ df -k .
      Filesystem   1K-blocks    Used      Available    Use%  Mounted on
                   15382877568  31900512  15350977056  1%    /home1
    In the command output above, the file system name appears on the left (an IP number followed by the file system name), the used and available space (-k, in units of 1 KBytes) appear in the middle columns, followed by the percent used and the mount point. To determine the amount of space consumed in a user-owned directory, cd to the directory and execute the du -sh command (s = summary, h = 'human readable' units):
      login1$ du -sh
    To determine quota limits and usage on $HOME and $WORK, execute the Lustre file system "lfs quota" command without any options (from any directory). Usage and quotas are also reported at each login.
      login1$ lfs quota $HOME
      login1$ lfs quota $WORK
    Stampede's major file systems, $HOME, $WORK, $SCRATCH, /tmp and $ARCHIVE, are detailed below.


    $HOME

    • At login, the system automatically sets the current working directory to your home directory.
    • Store your source code and build your executables here.
    • This directory has a quota limit of 5GB, 150K files.
    • This file system is backed up.
    • The login nodes and any compute node can access this directory.
    • Use the environment variable $HOME to reference your home directory in scripts.
    • Use the "cdh" or "cd" commands to change to $HOME .


    $WORK

    • This directory has a quota limit of 1TB, 3M files.
    • Store large files here.
    • Change to this directory in your batch scripts and run jobs in this file system.
    • The work file system is approximately 450TB.
    • This file system is not backed up.
    • The login nodes and any compute node can access this directory.
    • Purge Policy: not purged
    • Use the environment variable $WORK to reference this directory in scripts.
    • Use "cdw" to change to $WORK.


    $SCRATCH

    • Store large files here.
    • Change to this directory in your batch scripts and run jobs in this file system.
    • The scratch file system is approximately 8.5PB.
    • This file system is not backed up.
    • The login nodes and any compute node can access this directory.
    • Purge Policy: Files with production access times* greater than 10 days may be purged.
    • Use $SCRATCH to reference this directory in scripts.
    • Use the "cds" command to change to $SCRATCH.
    • NOTE: TACC staff may periodically delete files from the $SCRATCH file system even if files are less than 10 days old. A full file system inhibits use of the system for everyone. Using programs or scripts to actively circumvent the file purge policy will not be tolerated.

    * A file's access time is updated when that file is modified on a login or compute node. Read or execution of a file/script on a login node does not update the access time, however read or execution of a file/script on a compute node does update the access time. Preservation of access times on login nodes keeps utilities such as tar, scp, etc. from obscuring production usage for purging.

    To view files' access times:

    login1$ ls -ul .

    Do NOT install software in the $SCRATCH file system as it is subject to purging.
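    A related check, sketched below on a throwaway temporary directory so that it can run anywhere, uses find to list files whose access time exceeds the 10-day window; on Stampede you would point it at $SCRATCH instead:

```shell
# Demonstration on a temporary directory; replace "$dir" with
# "$SCRATCH" to audit your scratch space for purge candidates.
dir=$(mktemp -d)
touch -a -d "20 days ago" "$dir/old.dat"    # simulate a stale file
touch "$dir/new.dat"                        # a freshly accessed file
stale=$(find "$dir" -type f -atime +10)     # access time > 10 days ago
echo "$stale"
rm -rf "$dir"
```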


    /tmp

    • This is a directory in a local disk on each node where you can store files and perform local I/O for the duration of a batch job.
    • It is often more efficient to use and store files directly in $SCRATCH (to avoid moving files from /tmp at the end of a batch job).
    • The /tmp file system provides approximately 80GB to users on each node.
    • Files stored in the /tmp directory on each node are removed immediately after the job terminates.
    • Use "/tmp" to reference this file system in scripts.


    $ARCHIVE

    Stampede's archival storage system is Ranch and is accessible via the $ARCHIVER and $ARCHIVE environment variables.

    • Store permanent files here for archival storage.
    • This file system is NOT NFS-mounted (directly accessible) on any node.
    • Use the scp command to transfer data to this system.

      login1$ scp ${ARCHIVER}:$ARCHIVE/mybigfile $WORK
      login1$ scp mybigfile ${ARCHIVER}:
    • Use the ssh command to login to the Ranch system from any TACC machine. For example:

      login1$ ssh $ARCHIVER
    • Some files stored on the archiver may require staging prior to retrieval.
    • See the Ranch User Guide for more on archiving your files.

    Sharing Files

    Users often wish to collaborate with fellow project members by sharing files and data with each other. Project managers or delegates can create shared workspaces, areas that are private and accessible only to other project members, using UNIX group permissions and commands. Shared workspaces may be created as read-only or read-write, functioning as data repositories and providing a common work area to all project members. Please see Sharing Project Files on TACC Systems for step-by-step instructions.

    GPU Programming

    CUDA is available on the login nodes and the GPU-equipped compute nodes.

    GPU nodes are accessible through the gpu queue for production work and the gpudev queue for development work. Production job scripts should include the "module load cuda" command before executing CUDA code; likewise, load the cuda module before or after acquiring an interactive GPU development node with the "idev" command.

    Accelerator (CUDA) Programming

    NVIDIA's CUDA compiler and libraries are accessed by loading the CUDA module:

    $ module load cuda

    Use the nvcc compiler on the login node to compile code, and run executables on nodes with GPUs (there are no GPUs on the login nodes). Stampede's K20 GPUs are compute capability 3.5 devices. When compiling your code, make sure to specify this level of capability with:

    nvcc -arch=compute_35 -code=sm_35 ...

    The NVIDIA CUDA debugger is cuda-gdb. Applications must be debugged through a VNC session or an interactive srun session. Please see the relevant srun and VNC sections for more details.

    The NVIDIA Compute Visual Profiler, computeprof, can be used to profile both CUDA and OpenCL programs developed in the NVIDIA CUDA/OpenCL programming environment. Since the profiler is X-based, it must be run either within a VNC session or by ssh-ing into an allocated compute node with X-forwarding enabled. The profiler command and library paths are added to the $PATH and $LD_LIBRARY_PATH variables by the CUDA module.


    For further information on the CUDA compiler, programming, the API, and debugger, please see:

    • $TACC_CUDA_DIR/doc/nvcc.pdf
    • $TACC_CUDA_DIR/doc/CUDA_C_Programming_Guide.pdf
    • $TACC_CUDA_DIR/doc/CUDA_Toolkit_Reference_Manual.pdf
    • $TACC_CUDA_DIR/doc/cuda-gdb.pdf

    Heterogeneous (OpenCL) Programming

    The OpenCL heterogeneous computing language is supported on all Stampede computing platforms. The Intel OpenCL environment supports the Xeon processors and Xeon Phi coprocessors, and the NVIDIA OpenCL environment supports the Tesla accelerators.

    Note that the prompts in the examples below are deliberately generic. This is because you can compile an OpenCL application on essentially any Stampede node, but you can run it only on nodes with the hardware the application requires (e.g. GPU or KNC).

    Using the Intel OpenCL Environment

    The Intel OpenCL Drivers and runtimes are supported at TACC for all installed Intel compilers.

    Execute the compiler command with the "-lOpenCL" loader option to include the OpenCL libraries, and prepend the "/opt/apps/intel/opencl/lib64" path to the $LD_LIBRARY_PATH environment variable when running an OpenCL executable, as illustrated below.

    $ icc -lOpenCL -o ocl.out ocl_prog.c
    $ export LD_LIBRARY_PATH=/opt/apps/intel/opencl/lib64:$LD_LIBRARY_PATH
    $ ./ocl.out

    Using the NVIDIA OpenCL Environment

    The NVIDIA OpenCL environment supports the v1.1 API and is accessible through the cuda module:

    $ module load cuda

    For programming with NVIDIA OpenCL, please see the Khronos OpenCL specification.

    Use the g++ compiler to compile NVIDIA-based OpenCL. The include files are located in the $TACC_CUDA_DIR/include subdirectory. The OpenCL library is installed in the /usr/lib64 directory, which is on the default library path. Use this path and g++ options to compile OpenCL code:

    $ export OCL=$TACC_CUDA_DIR
    $ g++ -I $OCL/include -lOpenCL prog.cpp

    Visualization on Stampede

    While batch visualization can be performed on any Stampede node, a set of nodes have been configured for hardware-accelerated rendering. The vis queue contains a set of 128 compute nodes configured with one NVIDIA K20 GPU each. The largemem queue contains a set of 16 compute nodes configured with one NVIDIA Quadro 2000 GPU each.

    Remote Desktop Access

    Remote desktop access to Stampede is provided through a VNC connection to one or more visualization nodes. Users must first connect to a Stampede login node (see System Access) and submit a special interactive batch job that:

    • allocates a set of Stampede visualization nodes
    • starts a vncserver process on the first allocated node
    • sets up a tunnel through the login node to the vncserver access port

    Once the vncserver process is running on the visualization node and a tunnel through the login node is created, an output message identifies the access port for connecting a VNC viewer. A VNC viewer application is run on the user's remote system and presents the desktop to the user.

    Note: If this is your first time connecting to Stampede, you must run vncpasswd to create a password for your VNC servers. This should NOT be your login password! This mechanism only deters unauthorized connections; it is not fully secure, as only the first eight characters of the password are saved. All VNC connections are tunnelled through SSH for extra security, as described below.

    Follow the steps below to start an interactive session.

    1. Start a Remote Desktop

      TACC has provided a VNC job script (/share/doc/slurm/job.vnc) that requests one node in the vis queue for four hours, creating a VNC session.

      login1$ sbatch /share/doc/slurm/job.vnc

      You may modify or overwrite script defaults with sbatch command-line options:

      • "-t hours:minutes:seconds" modify the job runtime
      • "-A projectnumber" specify the project/allocation to be charged
      • "-N nodes" specify number of nodes needed
      • "-p partition" specify an alternate queue.

      See more sbatch options in Table 11

      All arguments after the job script name are sent to the vncserver command. For example, to set the desktop resolution to 1440x900, use:
      login1$ sbatch /share/doc/slurm/job.vnc -geometry 1440x900

      The job.vnc script starts a vncserver process and writes the connection port for the vncviewer to the output file, vncserver.out, in the job submission directory. Watch for the "To connect via VNC client" message at the end of the output file, or watch the output stream in a separate window with the commands:

      login1$ touch vncserver.out ; tail -f vncserver.out

      The lightweight window manager, xfce, is the default VNC desktop and is recommended for remote performance. Gnome is available; to use gnome, open the "~/.vnc/xstartup" file (created after your first VNC session) and replace "startxfce4" with "gnome-session". Note that gnome may lag over slow internet connections.

    2. Create an SSH Tunnel to Stampede

      TACC requires users to create an SSH tunnel from the local system to the Stampede login node to assure that the connection is secure. On a Unix or Linux system, execute the following command once the port has been opened on the Stampede login node:

      localhost$ ssh -f -N -L xxxx:stampede.tacc.utexas.edu:yyyy username@stampede.tacc.utexas.edu

      where:
      • "yyyy" is the port number given by the vncserver batch job
      • "xxxx" is a port on your local system. Generally, the port number reported on the Stampede login node, yyyy, is a good choice to use on your local system as well
      • "-f" instructs SSH to put itself in the background after connecting
      • "-N" instructs SSH not to execute a remote command, only to forward ports
      • "-L" forwards the port
      On Windows systems find the menu in the Windows SSH client where tunnels can be specified, and enter the local and remote ports as required, then ssh to Stampede.
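      As a concrete, hypothetical example of the tunnel on a Unix-like system: suppose the batch job reported port 5902 (yyyy), and we reuse the same number locally (xxxx). The port values and the "username" below are placeholders for illustration.

```shell
# Hypothetical port numbers: yyyy is reported in vncserver.out, and
# the same value is reused as the local port xxxx.
yyyy=5902
xxxx=5902
cmd="ssh -f -N -L ${xxxx}:stampede.tacc.utexas.edu:${yyyy} username@stampede.tacc.utexas.edu"
echo "$cmd"    # run this on your local system, not on Stampede
```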
    3. Connecting vncviewer

      Once the SSH tunnel has been established, use a VNC client to connect to the local port you created, which will then be tunneled to your VNC server on Stampede. Connect to localhost:xxxx, where xxxx is the local port you used for your tunnel. In the examples above, we would connect the VNC client to localhost::xxxx. (Some VNC clients accept localhost:xxxx).

      We recommend the TigerVNC VNC Client, a platform independent client/server application.

      Once the desktop has been established, two initial xterm windows are presented (which may be overlapping). One, which is white-on-black, manages the lifetime of the VNC server process. Killing this window (typically by typing "exit" or "ctrl-D" at the prompt) will cause the vncserver to terminate and the original batch job to end. Because of this, we recommend that this window not be used for other purposes; it is just too easy to accidentally kill it and terminate the session.

      The other xterm window is black-on-white, and can be used to start serial programs running on the node hosting the vncserver process, or parallel jobs running across the set of cores associated with the original batch job. Additional xterm windows can be created using the window-manager left-button menu.

    Running Applications on the VNC Desktop

    From an interactive desktop, applications can be run from icons or from xterm command prompts. Two special cases arise: running parallel applications, and running applications that use OpenGL.

    Running Parallel Applications from the Desktop

    Parallel applications are run on the desktop using the same ibrun wrapper described above (see Running Applications). The command:

    c442-001$ ibrun [ibrun options] application [application options]

    will run application on the associated nodes, as modified by the ibrun options.

    Running OpenGL/X Applications On The Desktop

    Running OpenGL/X applications on Stampede visualization nodes requires that the native X server be running on each participating visualization node. Like other TACC visualization servers, on Stampede the X servers are started automatically on each node (this happens for all jobs submitted to the vis and largemem queues).

    Once native X servers are running, several scripts are provided to enable rendering in different scenarios.

    • vglrun: Because VNC does not support OpenGL applications, VirtualGL is used to intercept OpenGL/X commands issued by application code and redirect them to a local native X display for rendering; rendered results are then automatically read back and sent to VNC as pixel buffers. To run an OpenGL/X application from a VNC desktop command prompt:

      c442-0011$ vglrun [vglrun options] application application-args
    • tacc_xrun: Some visualization applications present a client/server architecture, in which every process of a parallel server renders to local graphics resources, then returns rendered pixels to a separate, possibly remote client process for display. By wrapping server processes in the tacc_xrun wrapper, the $DISPLAY environment variable is manipulated to share the rendering load across the two GPUs available on each node. For example,

      c442-001$ ibrun tacc_xrun application application-args
      will cause the tasks to utilize each node, but will not render to any VNC desktop windows.
    • tacc_vglrun: Other visualization applications incorporate the final display function in the root process of the parallel application. This case is much like the one described above except for the root node, which must use vglrun to return rendered pixels to the VNC desktop. For example,

      c442-001$ ibrun tacc_vglrun application application-args

      will cause the tasks to utilize the GPU for rendering, but will transfer the root process' graphics results to the VNC desktop.

    Visualization Applications

    Stampede provides the visualization-specific modules described in the sections below.

    Parallel VisIt on Stampede

    VisIt was compiled under the Intel compiler with the mvapich2 MPI stack.

    After connecting to a VNC server on Stampede, as described above, load the VisIt module at the beginning of your interactive session before launching the Visit application:

    c442-001$ module load visit
    c442-001$ vglrun visit

    VisIt first loads a dataset and presents a dialog allowing you to select either a serial or parallel engine. Select the parallel engine. Note that this dialog also presents options for the number of processes to start and the number of nodes to use; these options are ignored in favor of the options specified when the VNC server job was started.

    Preparing data for Parallel Visit

    In order to take advantage of parallel processing, VisIt input data must be partitioned and distributed across the cooperating processes. This requires that the input data be explicitly partitioned into independent subsets at the time it is input to VisIt. VisIt supports SILO data, which incorporates a parallel, partitioned representation. Otherwise, VisIt supports a metadata file (with a .visit extension) that lists multiple data files of any supported format that are to be associated into a single logical dataset. In addition, VisIt supports a "brick of values" format, also using the .visit metadata file, which enables single files containing data defined on rectilinear grids to be partitioned and imported in parallel. Note that VisIt does not support VTK parallel XML formats (.pvti, .pvtu, .pvtr, .pvtp, and .pvts). For more information on importing data into VisIt, see Getting Data Into VisIt; though this documentation refers to VisIt version 2.0, it appears to be the most current available.

    Parallel ParaView on Stampede

    After connecting to a VNC server on Stampede, as described above, do the following:

    1. Set the $NO_HOSTSORT environment variable to 1

      csh shell:   login1$ setenv NO_HOSTSORT 1
      bash shell:  login1$ export NO_HOSTSORT=1
    2. Set up your environment with the necessary modules:

      If you intend to use the Python interface to ParaView via any of the following methods:

      • the Python scripting tool available through the ParaView GUI
      • pvpython
      • loading the paraview.simple module into python

      then load the python, qt and paraview modules in this order:

      c442-001$ module load python qt paraview

      else just load the qt and paraview modules in this order:

      c442-001$ module load qt paraview

      Note that the qt module is always required and must be loaded prior to the paraview module.

    3. Launch ParaView:

      c442-001$ vglrun paraview [paraview client options]
    4. Click the "Connect" button, or select File -> Connect
    5. If this is the first time you've used ParaView in parallel (or failed to save your connection configuration in your prior runs):

      1. Select "Add Server"
      2. Enter a "Name" e.g. "ibrun"
      3. Click "Configure"
      4. For "Startup Type" in the configuration dialog, select "Command" and enter the command:

        c442-001$ ibrun tacc_xrun pvserver [paraview server options]
        and click "Save"
      5. Select the name of your server configuration, and click "Connect"

    You will see the parallel servers being spawned and the connection established in the ParaView Output Messages window.

    Amira on Stampede

    Amira runs only on one specific Stampede node: c400-116. You must explicitly request this node when submitting a VNC job script:

    login1$ sbatch -w c400-116 -A project /share/doc/slurm/job.vnc

    After connecting to a VNC server, load Amira as follows:

    c400-116$ module load amira
    c400-116$ vglrun $AMIRA_BIN/start


    Timing Tools

    Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

    Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating point/integer operations, as well as memory access, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.

    The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.

    Package Timers

    The time command is available on most UNIX systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore you might have to use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax:

    c123-456$ /usr/bin/time -p ./a.out

    The -p option specifies traditional precision output, units in seconds. See the time man page for additional information.

    To use time with an MPI task, use:

    c123-456$ /usr/bin/time -p ibrun ./a.out

    This example provides timing information only for the rank 0 task on the master node (the node that executes the job script); however, the time output labeled real is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (not load balanced).

    Code Section Timers

    Section timing is another popular mechanism for obtaining timing information. Use these to measure the performance of individual routines or blocks of code by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below.

    Table 18. Code Section Timers

    Routine Type Resolution (usec) OS/Compiler
    times user/sys 1000 Linux/AIX/IRIX/UNICOS
    getrusage wall/user/sys 1000 Linux/AIX/IRIX
    gettimeofday wall clock 1 Linux/AIX/IRIX/UNICOS
    rdtsc wall clock 0.1 Linux
    read_real_time wall clock 0.001 AIX
    system_clock wall clock system dependent Fortran90 Intrinsic
    MPI_Wtime wall clock system dependent MPI Library (C and Fortran)

    For general purpose or coarse-grain timings, precision is not important; the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems, and hence can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and time as many loop iterations as possible to obtain a time duration at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accessed in the same way:


    Fortran:

    real*8, external :: x_timer
    real*8 :: sec0, sec1, tseconds
    sec0 = x_timer()
    sec1 = x_timer()
    tseconds = sec1 - sec0

    C:

    double x_timer(void);
    double sec0, sec1, tseconds;
    sec0 = x_timer();
    sec1 = x_timer();
    tseconds = sec1 - sec0;
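    A minimal sketch of how such a wrapper can be implemented using gettimeofday() (the name x_timer matches the usage above, but this implementation is illustrative, not TACC's actual library code):

    ```c
    /* Illustrative seconds-resolution wall-clock wrapper built on gettimeofday().
       "x_timer" is the placeholder name used in this guide, not a real TACC symbol. */
    #include <stdio.h>
    #include <sys/time.h>

    double x_timer(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);                     /* wall clock, ~1 usec resolution */
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }

    int main(void)
    {
        double sec0, sec1, tseconds, sum = 0.0;

        sec0 = x_timer();
        for (long i = 1; i <= 10000000L; i++)        /* enough iterations that the  */
            sum += 1.0 / ((double)i * (double)i);    /* duration dwarfs the timer's */
        sec1 = x_timer();                            /* resolution                  */

        tseconds = sec1 - sec0;
        printf("sum = %.6f computed in %.6f seconds\n", sum, tseconds);
        return 0;
    }
    ```

    Note that the timed loop runs many iterations, per the advice above, so that the measured duration is well above the microsecond resolution of gettimeofday().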

    Standard Profilers

    The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to the most time-consuming routines, the hotspots, where optimization is likely to be most beneficial. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code using the "-p" (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below with a sample Fortran program.

    Profiling Serial Executables

          login1$ ifort -p prog.f90   # instruments code
          login1$ idev
          c123-456$ a.out             # produces gmon.out trace file
          c123-456$ gprof             # reads gmon.out (default args: a.out gmon.out),report sent to STDOUT

    Profiling Parallel Executables

          login1$ mpif90 -p prog.f90               # instruments code
          login1$ setenv GMON_OUT_PREFIX gout      # each task writes its own gout.<pid> file
          login1$ idev
          c123-456$ ibrun a.out                    # produces one gout.* trace file per task
          c123-456$ gprof -s gout.*                # combines gout files into gmon.sum
          c123-456$ gprof a.out gmon.sum           # reads executable (a.out) and gmon.sum, report sent to STDOUT

    Detailed documentation is available in the gprof man page ("man gprof").

    Profiling with PerfExpert

    Source-code performance optimization has four stages: measurement, diagnosis of bottlenecks, determination of optimizations, and rewriting source code. Executing these steps for today's complex many-processor and heterogeneous computer architectures requires a wide spectrum of knowledge that many application developers would rather not have to learn. PerfExpert, an expert system built on generations of performance measurement and analysis tools, utilizes knowledge of architectures and compilers to implement (partial) automation of performance optimization for multicore chips and heterogeneous nodes of cluster computers. PerfExpert automates the first three performance optimization stages, then implements those optimizations as part of the fourth stage.

    PerfExpert is available on the Stampede Sandy Bridge nodes, but not yet on the MICs. PerfExpert depends on Java, HPCToolkit, and the PAPI hardware-counter library, and requires the papi, hpctoolkit, and perfexpert modules to be loaded. The "module help" command provides additional information.
    login1$ module load papi hpctoolkit perfexpert
    login1$ module help perfexpert

    Stampede KNL Cluster

    While the Stampede KNL Upgrade and the Sandy Bridge cluster share the /home1, /work, and /scratch file systems, the Stampede KNL Upgrade is largely an independent cluster. This KNL cluster has its own dedicated Haswell login node, a separate OmniPath network, a KNL-compatible software stack, its own Slurm scheduler, and KNL-specific queues. It also runs a newer Linux distribution (Centos 7) than the Sandy Bridge cluster (Centos 6). Moreover, there are implications associated with sharing $HOME across the Sandy Bridge and KNL clusters. The section below on KNL Cluster: Modules addresses the most important such implication. More generally, remember that a Linux $HOME directory contains startup and configuration files and directories (often invisible so-called "dotfiles" that begin with the "." character) that will be active on both sides of Stampede. This means, for example, that your .bash_history may contain commands that are meaningful or correct only on one side or the other.

    The KNL cluster includes 508 Intel Xeon Phi 7250 KNL compute nodes (68 cores per node, 4 hardware threads per core) housed in 9 racks. Each KNL is a self-hosted node, running CentOS 7 and supporting a KNL-compatible software stack. Each node also includes a 112GB local solid-state drive (SSD). The interconnect is a 100Gb/sec OmniPath network: a fat tree topology of eight core switches and 320 leaf switches with 5/4 oversubscription.

    The lightweight KNL cores have a clock frequency of 1.4 GHz, about half that of more conventional processors. This means that performance on KNL depends to a large degree on making effective use of a large number of cores. To put it another way, good performance requires a program or workflow that exposes a high degree of parallelism.

    Each of Stampede's KNL nodes includes 96GB of traditional DDR4 Random Access Memory (RAM). In addition, the KNL processors feature an additional 16 GB of high bandwidth, on-package memory known as Multi-Channel Dynamic Random Access Memory (MCDRAM) that is up to four times faster than DDR4. The KNL's memory is configurable in two important ways: there are BIOS settings that determine at boot time the processor's memory mode and cluster mode. The processor's memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The cluster mode determines the mechanisms for achieving cache coherency, which in turn determines latency: roughly speaking, this amounts to specifying whether and how one can think of some memory addresses as "closer" to a given core than others. Following Intel's defaults and recommendations, the nodes in the KNL cluster's normal and development queues are configured as "Cache-Quadrant" (memory mode set to "Cache", cluster mode set to "Quadrant"). See "KNL: Programming and Performance Considerations" below for a top-level description of these and other available memory and cluster modes.

    KNL: System Access

    The KNL cluster has a single dedicated login node. Unlike the other login and compute nodes in Stampede, this login node is a Haswell processor. Access this login node by executing

    localhost$ ssh login-knl1.stampede.tacc.utexas.edu

    Note that the characters after the "login-" prefix are the three lower-case letters "k", "n", and "l" (el), followed by the digit "1" (one).

    Access to this login node requires MFA. During TACC's transition to site-wide MFA, connecting to login-knl1 from the Sandy Bridge cluster may require MFA even if you have already authenticated.

    KNL: Modules

    The module system treats the Sandy Bridge and KNL clusters as separate systems. On the KNL cluster, for example, the module system will load modules that are appropriate for KNL. If you use the "module save" command to create personal collections of modules, the module system will ignore collections you created on the Sandy Bridge cluster when you are on the KNL cluster; in fact such collections are invisible on the KNL cluster. See Using Stampede:Modules above or execute "module help" for more information on these and other commands that allow you to define and manage personal collections of modules.

    KNL: Building Software

    The KNL architecture is binary compatible with earlier Intel architectures: applications built for Haswell and Sandy Bridge may in theory run on KNL without recompiling. However, the software stacks are different: currently the 2017 Intel compiler and Intel MPI (IMPI) libraries are installed only on the KNL cluster, so you will almost certainly need to rebuild all applications and libraries to achieve the best performance. You may be tempted to use the shared file systems to compile once and run on both sides of Stampede. This is not likely to work: even though the CPUs are compatible, the OmniPath network stack on KNL is not compatible with the Infiniband network stack on the Sandy Bridge cluster. Also note that KNL is not binary compatible with the legacy KNC coprocessors on the Sandy Bridge cluster. In particular, the "-mmic" flag (used to compile for KNC) is not supported on KNL. Finally, remember that the login node is a Haswell, not KNL, processor. This has important implications: in particular, it will affect the way you compile code for KNL.

    You will need to think about both compatibility and performance when deciding where and how to compile your code for KNL. You can compile for KNL on either the Haswell login node or any KNL compute node. Building on the login node is likely to be faster and is the approach we currently recommend. When building on the Haswell login node, it may be enough to cross-compile: use the "-xMIC-AVX512" switch at both compile and link time to produce compiled code appropriate only for the KNL; e.g.

    knl-login1$ icc   -xMIC-AVX512 -o mycode.exe mycode.c
    knl-login1$ ifort -xMIC-AVX512 -o mycode.exe mycode.f90

    You can also elect to build on the KNL compute nodes; again, use the "-xMIC-AVX512" flag to produce KNL-optimized code. Allow extra time to compile on KNL: the configure stage in particular may be many times slower than it would be on the Haswell login node.

    For applications whose build systems need to build and run their own test programs in the build environment (e.g. Autotools/configure, SCons, and CMake), you may need to specify flags that produce code that will run both on the Haswell login node (the build architecture, where these tests run) and on the KNL compute nodes (the actual target architecture). This is done through an Intel compiler feature called CPU dispatch, which produces binaries containing alternate paths with optimized code for multiple architectures. To produce such a binary containing optimized code for both Haswell and KNL, supply two flags when compiling and linking:

    -xCORE-AVX2 -axMIC-AVX512

    In a typical build system, it may be enough to add these flags to the CFLAGS, CXXFLAGS, FFLAGS, and LDFLAGS makefile variables. Expect the build to take longer than it would for one target architecture, and expect the resulting binary to be larger.
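    As a sketch (the file and target names below are hypothetical), such additions to a makefile might look like:

    ```
    # Hypothetical makefile fragment: CPU-dispatch build for Haswell + KNL
    CC        = icc
    FC        = ifort
    ARCHFLAGS = -xCORE-AVX2 -axMIC-AVX512   # Haswell baseline + KNL alternate path
    CFLAGS   += -O2 $(ARCHFLAGS)
    CXXFLAGS += -O2 $(ARCHFLAGS)
    FFLAGS   += -O2 $(ARCHFLAGS)
    LDFLAGS  += $(ARCHFLAGS)

    mycode.exe: mycode.o
    	$(CC) $(LDFLAGS) -o $@ mycode.o
    ```

    Because the flags appear in both the compile and link variables, configure-time test programs built this way will run on the Haswell login node while the final binary retains the KNL-optimized code path.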

    KNL: Running Jobs

    In general, the job submission process on the KNL cluster is the same as it is on the Sandy Bridge side: use sbatch to submit a batch job, and idev to begin an interactive session. A typical job script for KNL looks no different than its Sandy Bridge counterpart. Below is an example for an MPI job. When submitting KNL jobs, however, you do need to specify explicitly the total number of nodes your job requires. This means including in your script or submission command a value for "-N". This is to reduce the chance of accidentally assigning more tasks to a node than you actually intend. See KNL: Best Practices below for more information.

    Sample KNL Job Script

    #!/bin/bash
    #SBATCH -J myjob           # Job name
    #SBATCH -o myjob.o%j       # Name of stdout output file
    #SBATCH -e myjob.e%j       # Name of stderr error file
    #SBATCH -p normal          # Queue name
    #SBATCH -N 4               # Total # of nodes
    #SBATCH -n 32              # Total # of mpi tasks
    #SBATCH -t 01:30:00        # Run time (hh:mm:ss)
    #SBATCH --mail-type=all    # Send email at begin and end of job
    #SBATCH -A myproject       # Allocation name (req'd if more than 1)
    # Other commands must follow all #SBATCH directives...
    module reset
    module list
    # Launch MPI job...
    ibrun ./mycode.exe         # Use ibrun instead of mpirun or mpiexec

    As on other TACC systems the MPI launcher is ibrun. However the KNL cluster has its own independent Slurm scheduler and queue list. This means that commands like sinfo and squeue issued on the KNL cluster will show only the KNL queues. Each KNL queue reflects a specific configuration of its KNL nodes (memory and cluster mode). See above for a general description of KNL modes, and KNL: Programming and Performance Considerations below for additional detail.

    Despite the fact that KNL has 68 cores per node, note that the charge for a KNL node-hour is 16 SU (the same charge as a Sandy Bridge node-hour).

    Note: hyper-threading is enabled on the KNLs. While there are 68 active cores on each KNL, the operating system and scheduler see a total of 68 x 4 = 272 CPUs (hardware threads). The "showq" utility will report 272 "cores" (hardware threads) for each node associated with your job, regardless of the number of hardware threads you use.

    Table 19. KNL Production Queues

    Queue Max Runtime Max Nodes and Associated Cores per Job Max Jobs in Queue Charge per node hour Configuration (Memory-Cluster)
    development 2 hrs 4 nodes (272 cores) 1 16 SU Cache-Quadrant
    normal 48 hrs 80 nodes* (5440 cores) 10 16 SU Cache-Quadrant
    Flat-Quadrant 48 hrs 40 nodes* (2720 cores) 5 16 SU Flat-Quadrant
    Flat-All2All 12 hrs 2 nodes* (136 cores) 1 16 SU Flat-All-to-All
    Flat-SNC-4 12 hrs 2 nodes* (136 cores) 1 16 SU Flat-SNC-4

    *To make special arrangements for larger jobs, or for jobs requiring special non-hybrid node configurations, submit a ticket through the TACC User Portal. Include in your request reasonable evidence of your readiness to run under the conditions you are requesting. In most cases this should include strong or weak scaling results summarizing experiments you have run on the KNL cluster.

    KNL: Visualization

    The Stampede KNL cluster uses the KNL processors for all visualization and rendering operations. OpenGL-based graphics are rendered using the Intel OpenSWR library. This capability is harnessed by loading the "swr" module and prefixing the application command with "swr" (e.g. "swr glxgears"), similar to the "vglrun" syntax.

    There is no separate visualization queue on Stampede-KNL. All visualization apps are (or will be soon) available on all nodes. We are in the process of porting visualization application builds to KNL. If an application that you use on Stampede is not yet available, please submit a consulting ticket at the TACC or XSEDE portal.

    We expect that most users will notice little difference in visualization application experience on KNL compared to other Stampede nodes. Some users will see performance improvement due to data caching in MCDRAM.

    KNL: Programming and Performance Considerations


    KNL cores are grouped in pairs; each pair of cores occupies a tile. Since there are 68 cores on each node in Stampede's KNL cluster, each node has 34 active tiles. These 34 active tiles are connected by a two-dimensional mesh interconnect. Each KNL has 2 DDR memory controllers on opposite sides of the chip, each with 3 channels. There are 8 controllers for the fast, on-package MCDRAM, two in each quadrant.

    Each core has its own local L1 cache (32KB, data, 32KB instruction) and two 512-bit vector units. These vector units are almost identical, but only one of them can execute legacy (non-AVX512) vector instructions. This means that, in order to use both vector units, you must compile to generate AVX512 instructions. Each core can run up to 4 hardware threads. The two cores on a tile share a 1MB L2 cache. Different cluster modes specify the L2 cache coherence mechanism at the node level.
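    For example, a simple unit-stride loop like the one below (ordinary, portable C; nothing KNL-specific appears in the source) will be compiled to AVX512 instructions, and can therefore exploit both vector units, when built with the "-xMIC-AVX512" flag described under "KNL: Building Software":

    ```c
    /* An ordinary C loop the compiler can vectorize; built with
       "icc -xMIC-AVX512" it compiles to AVX-512, enabling both VPUs per core. */
    #include <stdio.h>

    void vec_add(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)    /* unit-stride loop: a good vectorization candidate */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        enum { N = 1024 };
        float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {  /* initialize operands */
            a[i] = (float)i;
            b[i] = 2.0f * (float)i;
        }
        vec_add(a, b, c, N);
        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }
    ```

    Adding "-qopt-report" to the compile line will show whether the compiler actually vectorized the loop for the target instruction set.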

    Memory Modes

    The processor's memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The output of commands like "top", "free", and "ps -v" reflects the memory mode in which the processor is actually running: such commands show the amount of RAM actually available to the operating system, not the total hardware (DDR + MCDRAM) installed on the processor.

    Cache Mode. In this mode, the fast MCDRAM is configured as an L3 cache. The operating system transparently uses the MCDRAM to move data from main memory. In this mode, the user has access to 96GB of RAM, all of it traditional DDR4. The KNL normal and development queues are configured in cache mode.

    Flat Mode. In this mode, DDR4 and MCDRAM act as two distinct Non-Uniform Memory Access (NUMA) nodes. It is therefore possible to specify the type of memory (DDR4 or MCDRAM) when allocating memory. In this mode, the user has access to 112GB of RAM: 96GB of traditional DDR and 16GB of fast MCDRAM. By default, memory allocations occur in DDR4. To use MCDRAM, use the numactl utility or the memkind library; see Managing Memory below for more information.

    Hybrid Mode (not supported on Stampede). In this mode, the MCDRAM is configured so that a portion acts as L3 cache and the rest as RAM (a second NUMA node supplementing DDR4).

    KNL Memory Modes
    Figure 6. KNL Memory Modes

    Cluster Modes

    The KNL's core-level L1 and tile-level L2 caches can reduce the time it takes for a core to access the data it needs. In order for the cores to share memory safely, however, there must be mechanisms in place to ensure cache coherency. Cache coherency means that all cores have a consistent view of the data: if data value x changes as a result of a calculation on a given core, there must be no risk of other cores using outdated values of x. This, of course, is essential on any multi-core chip, but it is especially difficult to achieve on many-core processors.

    The details for KNL are proprietary, but the key idea is this: each tile tracks an assigned range of memory addresses. It does so on behalf of all cores on the chip, maintaining a data structure (tag directory) that tells it which cores are using data from its assigned addresses. Coherence requires both tile-to-tile and tile-to-memory communication. Cores that read or modify data must communicate with the tiles that manage the memory associated with that data. Similarly, when cores need data from main memory, the tile(s) that manage the associated addresses will communicate with the memory controllers on behalf of those cores.

    The KNL can do this in several ways, each of which is called a cluster mode. Each cluster mode, specified in the BIOS as a boot-time option, represents a tradeoff between simplicity and control. There are three major cluster modes with a few minor variations:

    • All-to-All. This is the most flexible and most general mode, intended to work on all possible hardware and memory configurations of the KNL. But this mode also may have higher latencies than other cluster modes because the processor does not attempt to optimize coherency-related communication paths.

    • Quadrant (variation: hemisphere). This is Intel's recommended default, and the cluster mode in Stampede's normal and development queues. This mode attempts to localize communication without requiring explicit memory management by the programmer/user. It does this by grouping tiles into four logical/virtual (not physical) quadrants, then requiring each tile to manage MCDRAM addresses only in its own quadrant (and DDR addresses in its own half of the chip). This reduces the average number of "hops" that tile-to-memory requests require compared to all-to-all mode, which can reduce latency and congestion on the mesh.

    • Sub-NUMA 4 (variation: Sub-NUMA 2). This mode, abbreviated SNC-4, divides the chip into four NUMA nodes so that it acts like a four-socket processor. SNC-4 aims to optimize coherency-related on-chip communication by confining this communication to a single NUMA node when it is possible to do so. This requires explicit manual memory management by the programmer/user (in particular, allocating memory within the NUMA node that will use that memory) to achieve any performance benefit. See "Managing Memory" below for more information.

    KNL Cluster Modes
    Figure 7. KNL Cluster Modes

    TACC's early experience with the KNL suggests that there is little reason to deviate from Intel's recommended default memory and cluster modes. Cache-Quadrant tends to be a good choice for almost all workflows; it offers a nice compromise between performance and ease of use for the applications we have tested. Flat-Quadrant is the most promising alternative and sometimes offers moderately better performance, especially when memory requirements per node are less than 16GB (see "Managing Memory" below). We have not yet observed significant performance differences across cluster modes, and our current recommendation is that configurations other than Cache-Quadrant and Flat-Quadrant are worth considering only for very specialized needs. We have configured the KNL queues accordingly.

    Managing Memory

    By design, any application can run in any memory and cluster mode, and applications always have access to all available RAM. Moreover, regardless of memory and cluster modes, there are no code changes or other manual interventions required to run your application safely. However, there are times when explicit manual memory management is worth considering to improve performance. The Linux numactl utility allows you to specify at runtime where your code should allocate memory. For pure MPI or hybrid MPI-threaded codes launched with TACC's ibrun launcher, tacc_affinity manages the details for you by calling numactl under the hood.

    When running in flat-quadrant mode, launch your code with simple numactl settings to specify whether memory allocations occur in DDR or MCDRAM:

    numactl       --membind=0   ./a.out   # launch a.out (non-MPI); use DDR (default)
    ibrun numactl --membind=0   ./a.out   # launch a.out (MPI-based); use DDR (default)
    numactl       --membind=1   ./a.out   # use only MCDRAM
    numactl       --preferred=1 ./a.out   # use MCDRAM if possible; else DDR
    numactl       --hardware              # show numactl settings
    numactl       --help                  # list available numactl options

    Other settings (e.g. membind=4,5,6,7) specify fast memory within NUMA nodes when in Flat-SNC-4. Please consult TACC Training materials for additional information.

    Intel's new memkind library adds the ability to manage memory in source code with a special memory allocator for C code and a corresponding attribute for Fortran. This makes possible a level of control over memory allocation down to the level of the individual data element. As this library matures it will likely become an important tool for those whose needs require fine-grained control of memory.

    Best Known Practices and Preliminary Observations

    It may not be a good idea to use all 272 hardware threads simultaneously, and it's certainly not the first thing you should try. In most cases it's best to specify no more than 64-68 MPI tasks or independent processes per node, and 1-2 threads/core. One exception is worth noting: when calling threaded MKL from a serial code, it's safe to set OMP_NUM_THREADS or MKL_NUM_THREADS to 272. This is because MKL will choose an appropriate thread count less than or equal to the value you specify. See Controlling Threading in MKL above for more information.

    When measuring KNL performance against traditional processors, compare node-to-node rather than core-to-core. KNL cores run at lower frequencies than traditional multicore processors. Thus, for a fixed number of MPI tasks and threads, a given simulation may run 2-3x slower on KNL than the same submission on Sandy Bridge. A well-designed parallel application, however, should be able to run more tasks and/or threads on a KNL node than is possible on Sandy Bridge. If so, it may exhibit better performance per KNL node than it does on Sandy Bridge.

    General Expectations. From a pure hardware perspective, a single Stampede KNL node could improve performance by as much as 6x compared to Stampede's dual socket Sandy Bridge nodes; this is true for both memory bandwidth-bound and compute-bound codes. This assumes the code is running out of (fast) MCDRAM in the queues configured in flat mode (450 GB/s bandwidth vs 75 GB/s on Sandy Bridge) or using cache-contained workloads in the queues configured in cache mode (memory footprint < 16GB). It also assumes perfect scalability and no latency issues. In practice we have observed application improvements between 1.3x and 5x for several HPC workloads typically run in TACC systems. Codes with poor vectorization or scalability could see much smaller improvements. In terms of network performance, the OmniPath network provides 100 Gbit/s peak bandwidth, with point-to-point exchange performance measured at over 11 GB/s for a single task pair across nodes. Latency values will be higher than those for the Sandy Bridge FDR Infiniband network: on the order of 2-4 microseconds for exchanges across nodes.

    Affinity. In Cache-Quadrant mode (normal and development queues), default affinity settings are usually sensible and often optimal for threaded codes as well as MPI-threaded hybrid applications. In other modes, use tacc_affinity (for MPI codes) or investigate manual affinity settings.

    Profiling on KNL. Intel's VTune, Advisor, and ITAC tools are available as modules. For more information, see the module help messages (e.g. "module help vtune"), online documentation, and the KNL-related training materials on the TACC User Portal (see References).


    TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by TACC Usage Policies. Failure to do so may result in suspension or cancellation of the project and associated allocation and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities. The Usage Policies are documented on the TACC User Portal.


    Help is available 24/7. Please submit a helpdesk ticket via the TACC User Portal.

    Revision History

    The "Last Update" date is the date of the most recent change to this document. This revision history is a list of non-trivial updates; it excludes routine items such as corrected typos and minor format changes.
