Stampede2 User Guide
Last update: July 12, 2017 14:44 see revision history

Notices

  • Stampede2 has its own home and scratch file systems and a new $WORK directory. You'll need to transfer your files from Stampede1 to Stampede2. See Managing Files for information that will help you do so easily.
  • Stampede2's software stack is newer than the software on the decommissioned Stampede1 KNL sub-system. Be sure to recompile before running on Stampede2. See Building Software for more information.
  • Stampede2's accounting system is based on node-hours: one Service Unit (SU) represents a single compute node used for one hour (a node-hour) rather than a core-hour. See Job Accounting for more information.
  • Stampede2's KNL nodes have 68 cores, each with 4 hardware threads. But it may not be a good idea to use all 272 hardware threads simultaneously, and it's certainly not the first thing you should try. In most cases it's best to specify no more than 64-68 MPI tasks or independent processes per node, and 1-2 threads/core. See Best Known Practices for more information.
  • The Stampede2 Transition Guide is available for the first few weeks of Phase 1 production. The Transition Guide targets users familiar with Stampede1 who want a document that focuses on what's new.

Stampede2 System

Introduction

Stampede2, generously funded by the National Science Foundation (NSF) through award ACI-1134872, is the flagship supercomputer at the Texas Advanced Computing Center (TACC), University of Texas at Austin. It will enter full production in Fall 2017 as an 18-petaflop national resource that builds on the successes of the original Stampede system it replaces. The first phase of the Stampede2 rollout features the second generation of processors based on Intel's Many Integrated Core (MIC) architecture. Stampede2's 4,200 Knights Landing (KNL) nodes represent a radical break with the first-generation Knights Corner (KNC) MIC coprocessor. Unlike the legacy KNC, a Stampede2 KNL is not a coprocessor: each 68-core KNL is a stand-alone, self-booting processor that is the sole processor in its node. The Phase 1 KNL system is now available. Later this summer Phase 2 will add to Stampede2 a total of 1,736 Intel Xeon Skylake (SKX) nodes.

The older Stampede system (sometimes "Stampede1" for simplicity) will remain in production until Fall 2017. We are gradually reducing its size to make room for Stampede2 components. We will not decommission Stampede1, however, until Stampede2 is in full production; the two systems will coexist for an extended transition period.

The Stampede1 to Stampede2 Transition Guide includes material intended to help users manage their transition between the two systems: migrating files, differences between the two systems, temporary conditions, etc. This user guide, however, refers to Stampede1 only when necessary to understand Stampede2.

Phase 1 Compute Nodes (KNL)

Stampede2 hosts 4,200 KNL compute nodes, including the 508 KNL nodes that were formerly configured as a Stampede1 sub-system.

Table. Stampede2 KNL Compute Node Specifications

Model:  Intel Xeon Phi 7250
Total cores per KNL node:  68 cores on a single socket
Hardware threads per core:  4
Hardware threads per node:  68 x 4 = 272
Clock rate:  1.4GHz
RAM:  96GB DDR4 plus 16GB high-speed MCDRAM. Configurable in two important ways; see Programming and Performance for more info.
Local storage:  All but 508 KNL nodes have a 132GB /tmp partition on a 200GB Solid State Drive (SSD). The 508 KNLs originally installed as the Stampede1 KNL sub-system each have a 58GB /tmp partition on 112GB SSDs. The latter nodes currently make up the development, flat-quadrant and flat-snc4 queues.

Each of Stampede2's KNL nodes includes 96GB of traditional DDR4 Random Access Memory (RAM). In addition, the KNL processors feature an additional 16 GB of high bandwidth, on-package memory known as Multi-Channel Dynamic Random Access Memory (MCDRAM) that is up to four times faster than DDR4. The KNL's memory is configurable in two important ways: there are BIOS settings that determine at boot time the processor's memory mode and cluster mode. The processor's memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The cluster mode determines the mechanisms for achieving cache coherency, which in turn determines latency: roughly speaking, this amounts to specifying whether and how one can think of some memory addresses as "closer" to a given core than others. See Programming and Performance below for a top-level description of these and other available memory and cluster modes.

Phase 2 Compute Nodes (SKX)

When Phase 2 is complete, Stampede2 will host 1,736 Skylake (SKX) nodes.

Table. Stampede2 SKX Compute Node Specifications

Model:  Intel Xeon Platinum 8160 ("Skylake")
Total cores per SKX node:  48 cores on two sockets (24 cores/socket)
Hardware threads per core:  2
Hardware threads per node:  48 x 2 = 96
Clock rate:  2.1GHz
RAM:  192GB (2.67GHz)
Local storage:  132GB /tmp partition on a 200GB SSD

Network

The interconnect is a 100Gb/sec Intel Omni-Path (OPA) network with a fat tree topology employing six core switches. There is one leaf switch for each 28-node half rack, each with 20 leaf-to-core uplinks (28/20 oversubscription).

Shared File Systems

Stampede2's three Lustre-based shared file systems are available from all login and compute nodes.

Table. Stampede2 File Systems

$HOME
  Quota:  10GB, 200,000 files
  Key Features:  Not intended for parallel or high-intensity file operations. Backed up regularly. Overall capacity ~1PB. Two Meta-Data Servers (MDS), four Object Storage Targets (OSTs). Defaults: 1 stripe, 1MB stripe size.

$WORK
  Quota:  1TB, 3,000,000 files across all TACC systems, regardless of where on the file system the files reside
  Key Features:  Not intended for high-intensity file operations or jobs involving very large files. On the Global Shared File System that is mounted on most TACC systems. See the Stockyard system description for more information. Not backed up.

$SCRATCH
  Quota:  no quota
  Key Features:  Overall capacity ~30PB. Four MDSs, 66 OSTs. Defaults: 1 stripe, 1MB stripe size. Not backed up.

Specialized Nodes

We plan to add large-memory nodes in 2018. There are no plans to add Graphics Processing Units (GPUs) to the system.

Accessing the System

Access to all TACC systems now requires Multi-Factor Authentication (MFA). You can create an MFA pairing on the TACC User Portal. After logging into the portal, go to your account profile (Home->Account Profile), then click the "Manage" button under "Multi-Factor Authentication" on the right side of the page. See Multi-Factor Authentication at TACC for further information.

Secure Shell (SSH)

The "ssh" command (SSH protocol) is the standard way to connect to Stampede2. SSH also includes support for the file transfer utilities scp and sftp. Wikipedia is a good source of information on SSH. SSH is available within Linux and from the terminal app in the Mac OS. If you are using Windows, you will need an SSH client that supports the SSH-2 protocol: e.g. Bitvise, OpenSSH, Putty, or SecureCRT. Initiate a session using the ssh command or the equivalent; from the Linux command line the launch command looks like this:

localhost$ ssh myusername@stampede2.tacc.utexas.edu

The above command will rotate connections across all available login nodes and route your connection to one of them. To connect to a specific login node use its full domain name:

localhost$ ssh myusername@login2.stampede2.tacc.utexas.edu

To connect with X11 support on Stampede2 (usually required for applications with graphical user interfaces), use the "-X" or "-Y" switch:

localhost$ ssh -X myusername@stampede2.tacc.utexas.edu

Use your TACC password, not your XSEDE password, for direct logins to TACC resources. You can change your TACC password through the TACC Portal. Select "Change Password" under the "HOME" tab after login. If you've forgotten your password, go to the TACC User Portal home page and select the "Forgot your password?" link in the login area.

To report a connection problem, execute the ssh command with the "-vvv" option and include the verbose output when submitting a help ticket.

Do not run the "ssh-keygen" command on Stampede2. This command will create and configure a key pair that will interfere with the execution of job scripts in the batch system. If you do this by mistake, you can recover by renaming or deleting the .ssh directory located in your home directory. One good way to do this:

  1. execute "mv .ssh dot.ssh.old"
  2. log out
  3. log back in

After logging in again the system will generate a properly configured key pair.

GSI-OpenSSH (gsissh)

GSI-OpenSSH is a customized implementation of OpenSSH that includes support for Grid Security Infrastructure (GSI) authentication and credential forwarding.

Use port 2222 for gsissh connections to Stampede2. The following sequence of commands authenticates using the XSEDE myproxy server, then connects to Stampede2 via this port:

localhost$ myproxy-logon -s myproxy.teragrid.org
localhost$ gsissh -p 2222 userid@stampede2.tacc.utexas.edu

XSEDE Single Sign-On Hub

XSEDE users can also access Stampede2 via the XSEDE Single Sign-On Hub.

When reporting a problem to the help desk, please execute the gsissh command with the "-vvv" option and include the verbose output in your problem description.

Using Stampede2

Stampede2 nodes run Linux (Red Hat Enterprise Linux 7). Regardless of your research workflow, you will almost certainly need to be comfortable with Linux basics and a Linux-based text editor (e.g. vi, emacs, or gedit). Numerous resources in a variety of formats are available to help you acquire this understanding, including some listed on the TACC and XSEDE training sites. This user guide presumes you have that understanding. If you encounter a term or concept in the material below that is new to you, a quick internet search should help you resolve the matter quickly.

Linux Shell

The default login shell for your user account is Bash. To determine your current login shell, execute:

$ echo $SHELL

If you'd like to change your login shell to csh, sh, tcsh, or zsh, submit a ticket through the TACC or XSEDE portal. The "chsh" ("change shell") command will not work on TACC systems.

When you start a shell on Stampede2, system-level startup files initialize your account-level environment and aliases before the system sources your own user-level startup scripts. You can use these startup scripts to customize your shell by defining your own environment variables, aliases, and functions. These scripts (e.g. .profile and .bashrc) are generally hidden files: so-called dotfiles that begin with a period, visible when you execute: "ls -a".

Before editing your startup files, however, it's worth taking the time to understand the basics of how your shell manages startup. Bash startup behavior is very different from the simpler csh behavior, for example. The Bash startup sequence varies depending on how you start the shell (e.g. using ssh to open a login shell, executing "bash" to begin an interactive shell, or launching a script to start a non-interactive shell). Moreover Bash does not automatically source your .bashrc when you start a login shell by using ssh to connect to a node. Unless you have specialized needs, however, this is undoubtedly more flexibility than you need: you will probably want your environment to be the same regardless of how you start the shell. The easiest way to achieve this is to execute "source ~/.bashrc" from your ".profile", then put all your customizations in ".bashrc". The system-generated default startup scripts demonstrate this approach. We recommend that you use these default files as templates.
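
A minimal sketch of this arrangement (the alias and variable shown are illustrative; the system-generated default startup files contain additional logic):

# in ~/.profile: have login shells pick up .bashrc
if [ -f ~/.bashrc ]; then source ~/.bashrc; fi

# in ~/.bashrc: put your customizations here
alias ll='ls -l'                  # illustrative alias
export MYDATA=$SCRATCH/mydata     # illustrative environment variable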

For more information see the Bash Users' Startup Files: Quick Start Guide and other online resources that explain shell startup. To recover the originals that appear in a newly created account, execute "/usr/local/startup_scripts/install_default_scripts".

Environment Variables

Your environment includes the environment variables and functions defined in your current shell: those initialized by the system, those you define or modify in your account-level startup scripts, and those defined or modified by the modules that you load to configure your software environment. Be sure to distinguish between an environment variable's name (e.g. HISTSIZE) and its value ($HISTSIZE). Understand as well that a sub-shell (e.g. a script) inherits environment variables from its parent, but does not inherit ordinary shell variables or aliases. Use export (in Bash) or setenv (in csh) to define an environment variable.

Execute the env command to see the environment variables that define the way your shell and child shells behave.

Pipe the results of env into grep to focus on specific environment variables. For example, to see all environment variables that contain the string GIT (in all caps), execute:

$ env | grep GIT

The environment variables PATH and LD_LIBRARY_PATH are especially important. PATH is a colon-separated list of directory paths that determines where the system looks for your executables. LD_LIBRARY_PATH is a similar list that determines where the system looks for shared libraries.
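
For example, to put executables from a personal directory (here the hypothetical $HOME/bin) at the front of your search path, and similarly for a personal library directory:

$ export PATH=$HOME/bin:$PATH                          # Bash syntax
$ export LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH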

Account-Level Diagnostics

TACC's sanitytool module loads an account-level diagnostic package that detects common account-level issues and often walks you through the fixes. You should certainly run the package's sanitycheck utility when you encounter unexpected behavior. You may also want to run sanitycheck periodically as preventive maintenance. To run sanitytool's account-level diagnostics, execute the following commands:

login1$ module load sanitytool
login1$ sanitycheck

Execute "module help sanitytool" for more information.

Accessing the Compute Nodes

You connect to Stampede2 through one of four "front-end" login nodes. The login nodes are shared resources: at any given time there are many users logged into each of these login nodes, each preparing to access the "back-end" compute nodes (Figure. Login and Compute Nodes). What you do on the login nodes affects other users directly because you are competing for the same memory and processing power. This is the reason you should not run your applications on the login nodes or otherwise abuse them. Think of the login nodes as a prep area where you can manage files and compile code before accessing the compute nodes to perform research computations. See Good Citizenship for more information.

Use a node's hostname to tell you whether you are on a login node or a compute node. The hostname for a Stampede2 login node begins with the string "login" (e.g. login2.stampede2.tacc.utexas.edu), while compute node hostnames begin with the character "c" (e.g. c401-064.stampede2.tacc.utexas.edu). The default Linux command-line prompt, or any custom prompt containing "\h", displays the short form of the hostname (e.g. c401-064).

While some workflows, tools, and applications hide the details, there are three basic ways to access the compute nodes:

  1. Submit a batch job using the sbatch command. This directs the scheduler to run the job unattended when there are resources available. Until your batch job begins it will wait in a queue. You do not need to remain connected while the job is waiting or executing. See Running Jobs for more information. Note that the scheduler does not start jobs on a first come, first served basis; it juggles many variables to keep the machine busy while balancing the competing needs of all users. The best way to minimize wait time is to request only the resources you really need: the scheduler will have an easier time finding a slot for the two hours you need than for the 48 hours you unnecessarily request.
  2. Begin an interactive session using idev or srun. This will log you into a compute node and give you a command prompt there, where you can issue commands and run code as if you were doing so on your personal machine. An interactive session is a great way to develop, test, and debug code. When you request an interactive session, the scheduler submits a job on your behalf. You will need to remain logged in until the interactive session begins.
  3. Begin an interactive session using ssh to connect to a compute node on which you are already running a job. This is a good way to open a second window into a node so that you can monitor a job while it runs.

Be sure to request computing resources that are consistent with the type of application(s) you are running:

  • A serial (non-parallel) application can only make use of a single core on a single node, and will only see that node's memory.
  • A threaded program (e.g. one that uses OpenMP) employs a shared memory programming model and is also restricted to a single node, but the program's individual threads can run on multiple cores on that node.
  • An MPI (Message Passing Interface) program can exploit the distributed computing power of multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network. The tasks on a given node, however, can only directly access the memory on that node. Depending on the program's memory requirements, it may not be possible to run a task on every core of every node assigned to your job. If it appears that your MPI job is running out of memory, try launching it with fewer tasks per node to increase the amount of memory available to individual tasks.
  • A popular type of parameter sweep (sometimes called high throughput computing) involves submitting a job that simultaneously runs many copies of one serial or threaded application, each with its own input parameters ("Single Program Multiple Data", or SPMD). The "launcher" tool is designed to make it easy to submit this type of job. For more information:
$ module load launcher
$ module help launcher

Figure. Login and compute nodes

Using Modules to Manage your Environment

Lmod, a module system developed and maintained at TACC, makes it easy to manage your environment so you have access to the software packages and versions that you need to conduct your research. This is especially important on a system like Stampede2 that serves thousands of users with an enormous range of needs. Loading a module amounts to choosing a specific package from among available alternatives:

$ module load intel          # load the default Intel compiler
$ module load intel/17.0.4   # load a specific version of Intel compiler

A module does its job by defining or modifying environment variables (and sometimes aliases and functions). For example, a module may prepend appropriate paths to $PATH and $LD_LIBRARY_PATH so that you can find the executables and libraries associated with a given software package. The module creates the illusion that the system is installing software for your personal use. Unloading a module reverses these changes and creates the illusion that the system just uninstalled the software:

$ module load   ddt  # defines DDT-related env vars; modifies others
$ module unload ddt  # undoes changes made by load

The module system does more, however. When you load a given module, the module system can automatically replace or deactivate modules to ensure the packages you have loaded are compatible with each other. In the example below, the module system automatically unloads one compiler when you load another, and replaces Intel-compatible versions of IMPI and PETSc with versions compatible with gcc:

$ module load intel  # load default version of Intel compiler
$ module load petsc  # load default version of PETSc
$ module load gcc    # change compiler

Lmod is automatically replacing "intel/17.0.4" with "gcc/7.1.0".

Due to MODULEPATH changes, the following have been reloaded:
1) impi/17.0.3     2) petsc/3.7

On Stampede2, modules generally adhere to a TACC naming convention when defining environment variables that are helpful for building and running software. For example, the "papi" module defines TACC_PAPI_BIN (the path to PAPI executables), TACC_PAPI_LIB (the path to PAPI libraries), TACC_PAPI_INC (the path to PAPI include files), and TACC_PAPI_DIR (top-level PAPI directory). After loading a module, here are some easy ways to observe its effects:

$ module show papi   # see what this module does to your environment
$ env | grep PAPI    # see env vars that contain the string PAPI
$ env | grep -i papi # case-insensitive search for 'papi' in environment

To see the modules you currently have loaded:

$ module list

To see all modules that you can load right now because they are compatible with the currently loaded modules:

$ module avail

To see all installed modules, even if they are not currently available because they are incompatible with your currently loaded modules:

$ module spider   # list all modules, even those not available to load

To filter your search:

$ module spider slep             # all modules with names containing 'slep'
$ module spider sundials/2.5.0   # additional details on a specific module

Among other things, the latter command will tell you which modules you need to load before the module is available to load. You might also search for modules that are tagged with a keyword related to your needs (though your success here depends on the diligence of the module writers). For example:

$ module keyword performance

You can save a collection of modules as a personal default collection that will load every time you log into Stampede2. To do so, load the modules you want in your collection, then execute:

$ module save    # save the currently loaded collection of modules 

Two commands make it easy to return to a known, reproducible state:

$ module reset   # load the system default collection of modules
$ module restore # load your personal default collection of modules

On TACC systems, the command "module reset" is equivalent to "module purge; module load TACC". It's a safer, easier way to get to a known baseline state than issuing the two commands separately.
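
Lmod also supports named collections, which may be convenient if you switch between several software stacks (the collection name below is arbitrary):

$ module save mystack      # save the currently loaded modules under a name
$ module restore mystack   # later, reload that named collection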

Help text is available for both individual modules and the module system itself:

$ module help swr     # show help text for software package swr
$ module help         # show help text for the module system itself

See Lmod's online documentation for more extensive documentation. The online documentation addresses several topics (e.g. writing and using your own module files) that are beyond the scope of the help text.

It's safe to execute module commands in job scripts. In fact this is a good way to write self-documenting, portable job scripts that produce reproducible results. If you use "module save" to define a personal default module collection, it's rarely necessary to execute module commands in shell startup scripts, and it can be tricky to do so safely. If you do wish to put module commands in your startup scripts, see Stampede2's default startup scripts for a safe way to do so.
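
As a sketch (the module version and executable name are illustrative), the body of a job script might document and configure its environment like this:

module reset                # return to the system default module collection
module load intel/17.0.4    # record the exact compiler stack the job expects
module list                 # echo the loaded modules into the job's output file
ibrun ./mycode.exe          # illustrative MPI launch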

Good Citizenship

Under construction. Until this material is available here, please consult the Stampede1 User Guide.

Please do not use the $WORK file system when running IO-intensive jobs, or jobs involving large files. $SCRATCH is better able to handle high-intensity IO. Moreover, problems on $WORK can affect all users on all TACC systems.

File Systems and Quotas

Stampede2 mounts three Lustre file systems that are shared across all nodes: the home, work, and scratch file systems. Stampede2's startup mechanisms define corresponding account-level environment variables $HOME, $SCRATCH, and $WORK that store the paths to directories that you own on each of these file systems. Consult the Stampede2 File Systems table for the basic characteristics of these file systems, File Operations: Input/Output Performance for advice on performance issues, and Good Citizenship for file-related tips on good citizenship.

The system defines several account-level aliases that make it easy to navigate across the directories you own in these file systems:

Table. Built-in Account Level Aliases

Alias       Command
cd or cdh   cd $HOME
cdw         cd $WORK
cds         cd $SCRATCH

Stampede2's home and scratch file systems are mounted only on Stampede2, but the work file system mounted on Stampede2 is the Global Shared File System hosted on Stockyard. It is the same file system that is available on Stampede1, Maverick, Wrangler, Lonestar 5, and other TACC resources.

The $STOCKYARD environment variable points to the highest-level directory that you own on the file system. The definition of the $STOCKYARD environment variable is of course account-specific, but you will see the same value on all TACC systems (See Figure. Stockyard). This directory is an excellent place to store files you want to access regularly from multiple TACC resources.

Your account-specific $WORK environment variable varies from system to system and (except for Stampede1) is a sub-directory of $STOCKYARD. The sub-directory name corresponds to the associated TACC resource. The $WORK environment variable on Stampede2 points to the $STOCKYARD/stampede2 subdirectory, a convenient location for files you use and jobs you run on Stampede2. Remember, however, that all subdirectories contained in your $STOCKYARD directory are available to you from any system that mounts the file system. If you have accounts on both Stampede2 and Maverick, for example, the $STOCKYARD/stampede2 directory is available from your Maverick account, and $STOCKYARD/maverick is available from your Stampede2 account. Your quota and reported usage on the Global Shared File System reflects all files that you own on Stockyard, regardless of their actual location on the file system.

Note that resource-specific sub-directories of $STOCKYARD are nothing more than convenient ways to manage your resource-specific files. You have access to any such sub-directory from any TACC resources. If you are logged into Stampede2, for example, executing the alias cdw (equivalent to "cd $WORK") will take you to the resource-specific sub-directory $STOCKYARD/stampede2. But you can access this directory from other TACC systems by executing "cd $STOCKYARD/stampede2". This makes it particularly easy to share files across TACC systems.

$WORK: Stampede2 vs Stampede1

Stampede2 defines the $WORK environment variable differently than Stampede1 did: your Stampede2 $WORK directory is a sub-directory of your Stampede1 work directory. On Stampede2, your $WORK directory is $STOCKYARD/stampede2 (e.g. /work/01234/bjones/stampede2). On Stampede1, your $WORK directory was the $STOCKYARD directory itself (e.g. /work/01234/bjones).

Please see an example for fictitious user bjones in the figure below. All directories are accessible from all systems. A given sub-directory (e.g. wrangler, maverick) exists only if you have an allocation on the respective system.

Figure. Account-level directories on the work file system (Global Shared File System hosted on Stockyard).

Temporary Mounts of Stampede1 File Systems

For your convenience during the transition from Stampede1 to Stampede2, the Stampede1 home and scratch file systems are available as read-only Lustre file systems on the Stampede2 login nodes (and only the login nodes). The mount points on the Stampede2 logins are /oldhome1 and /oldscratch respectively, and your account includes the environment variables $OLDHOME and $OLDSCRATCH pointing to your Stampede1 $HOME and $SCRATCH directories respectively. The aliases "cdoh" and "cdos", defined in your Stampede2 account, are equivalent to "cd $OLDHOME" and "cd $OLDSCRATCH" (see the Built-In Account-level Aliases table above).

Do not submit Stampede2 jobs (sbatch, srun, or idev) from directories in $OLDHOME or $OLDSCRATCH. Because these directories are read-only, attempting to do so may lead to job failures or other subtle problems that may prove difficult to diagnose.

Your Stampede1 $WORK directory is, of course, available to you from Stampede2 (see Managing Files above). As a matter of convenience, however, your Stampede2 account includes the environment variable $OLDWORK (which has the same value as $STOCKYARD) and the associated alias cdow.

Transferring Files from Stampede1 to Stampede2

Transfers from $OLDHOME and $OLDSCRATCH: Stampede2's temporary mounts of the Stampede1 file systems make it easy to transfer files. For example, the following command:

$ cp -r $OLDHOME/mysrc $HOME

copies the directory mysrc and its contents from your Stampede1 home directory to your Stampede2 home directory. The rsync command is also available. Given the temporary mounts, there's little reason to use scp. In any case, please remember that recursive copy operations can put a significant strain on Lustre file systems: copy only the files you need, and don't execute more than one or two simultaneous recursive copies.
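
An rsync alternative might look like this (the directory name mydata is illustrative):

$ rsync -av $OLDSCRATCH/mydata/ $SCRATCH/mydata/   # preserve timestamps and permissions; copy only what you need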

Transfers involving $WORK: When transferring files on the Stockyard-hosted work file system, it's probably best to use mv rather than cp. If you use cp you will end up with two copies of your file(s) on the work file system; at the very least, this will put pressure on your file system quota. The mv command will not work when transferring files from $OLDHOME or $OLDSCRATCH because these two file systems are read-only on Stampede2.

Striping Large Files: Before copying large files to Stampede2 be sure to set an appropriate default stripe count on the receiving directory. To avoid exceeding your fair share of any given Object Storage Target (OST) on a file system, a good rule of thumb is to allow at least one stripe for each 100GB in the file. For example, to set the default stripe count on the current directory to 30 (a plausible stripe count for a directory receiving a file approaching 3TB in size), execute:

$ lfs setstripe -c 30 $PWD

Note that an "lfs setstripe" command always sets both stripe count and stripe size, even if you explicitly specify only one or the other. Since the example above does not explicitly specify stripe size, the command will set the stripe size on the directory to Stampede2's system default (1MB). In general there's no need to customize stripe size when transferring files.

Startup Files: It is generally safe to copy your startup files (e.g. .profile and .bashrc) from Stampede1 to Stampede2, though you may of course have to make some changes. Execute "/usr/local/startup_scripts/install_default_scripts" to recover the default startup files that appear in a newly created account.

Globus Connect

Globus Connect (formerly Globus Online) is recommended for transferring data between XSEDE sites. Globus Connect provides fast, secure transport via an easy-to-use web interface using pre-defined and user-created "endpoints". XSEDE users automatically have access to Globus Connect via their XUP username/password. Other users may sign up for a free Globus Connect Personal account.

globus-url-copy

XSEDE users may also use Globus' globus-url-copy command-line utility to transfer data between XSEDE sites. globus-url-copy, like Globus Connect described above, is an implementation of the GridFTP protocol, providing high speed transport between GridFTP servers at XSEDE sites. The GridFTP servers mount the specific file systems of the target machine, thereby providing access to your files or directories.

This command requires the use of an XSEDE certificate to create a proxy for passwordless transfers. Use the "myproxy-logon" command with your XSEDE User Portal (XUP) username and password to obtain a proxy certificate. The proxy is valid for 12 hours for all logins on the local machine. On Stampede2, the myproxy-logon command is located in the CTSSV4 module (not loaded by default).

login1$ module load CTSSV4
login1$ myproxy-logon -T -l XUP_username

Each globus-url-copy invocation must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy [options] source_url destination_url

where each XSEDE URL will generally be formatted:

gsiftp://gridftp_server/path/to/file

Note that globus-url-copy supports multiple protocols (e.g. HTTP and FTP) in addition to the GridFTP protocol. Please consult the following references for more information.

globus-url-copy Examples

The following command copies "directory1" from TACC's Stampede2 to the home filesystem of PSC's Data Supercell system, renaming it "directory2". Note that when transferring directories, the directory path must end with a slash ("/"):

login1$ globus-url-copy -r -vb \ 
    gsiftp://gridftp.stampede2.tacc.xsede.org:2811/`pwd`/directory1/ \ 
    gsiftp://gridftp.psc.xsede.org:2811/~/directory2/

The mapping of /~/ depends on the configuration of the GridFTP server but is typically the local user's home directory on Linux systems.

The following command copies a single file, "file1" from TACC's Stampede2 to "file2" on Stanford's XStream home filesystem:

login1$ globus-url-copy -tcp-bs 11M -vb \ 
    gsiftp://gridftp.stampede2.tacc.xsede.org:2811/`pwd`/file1 \ 
    gsiftp://xstream.stanford.xsede.org:2811/~/file2

Use the buffer size option, "-tcp-bs 11M", to explicitly set the FTP data channel buffer size; otherwise the transfer will be about 20 times slower. Consult the Globus documentation to select the optimum value: How do I choose a value for the TCP buffer size (-tcp) option?

Advanced users may employ the "-stripe" option, which enables striped transfers on supported servers. Stampede's GridFTP servers each have a 10GbE interface adapter and are configured for a 4-way stripe since most deployed 10GbE interfaces are performance-limited by host PCI-X busses to ~6Gb/s.

Sharing Files with Collaborators

If you wish to share files and data with collaborators in your project, see Sharing Project Files on TACC Systems for step-by-step instructions. Project managers or delegates can use Unix group permissions and commands to create read-only or read-write shared workspaces that function as data repositories and provide a common work area to all project members.

Basics of Compiling

Under construction. Until this material is available here, please consult the relevant section in the Stampede1 User Guide.

Intel Math Kernel Library (MKL)

The Intel Math Kernel Library (MKL) is a collection of highly optimized functions implementing some of the most important mathematical kernels used in computational science, including standardized interfaces to:

  • BLAS (Basic Linear Algebra Subroutines), a collection of low-level matrix and vector operations like matrix-matrix multiplication
  • LAPACK (Linear Algebra PACKage), which includes higher-level linear algebra algorithms like Gaussian Elimination
  • FFT (Fast Fourier Transform), including interfaces based on FFTW (Fastest Fourier Transform in the West)
  • ScaLAPACK (Scalable LAPACK), BLACS (Basic Linear Algebra Communication Subprograms), Cluster FFT, and other functionality that provide block-based distributed memory (multi-node) versions of selected LAPACK, BLAS, and FFT algorithms;
  • Vector Mathematics (VM) functions that implement highly optimized and vectorized versions of special functions like sine and square root.

MKL with Intel C, C++, and Fortran Compilers

There is no MKL module for the Intel compilers because you don't need one: the Intel compilers have built-in support for MKL. Unless you have specialized needs, there is no need to specify include paths and libraries explicitly. Instead, using MKL with the Intel modules requires nothing more than compiling and linking with the "-mkl" option; e.g.

login1$ icc -mkl mycode.c
login1$ ifort -mkl mycode.f90

The "-mkl" switch is an abbreviated form of "-mkl=parallel", which links your code to the threaded version of MKL. To link to the unthreaded version, use "-mkl=sequential". A third option, "-mkl=cluster", which also links to the unthreaded libraries, is necessary and appropriate only when using ScaLAPACK or other distributed memory packages. For additional information, including advanced linking options, see the MKL documentation and Intel MKL Link Line Advisor.

MKL with GNU C, C++, and Fortran Compilers

When using a GNU compiler, load the MKL module before compiling or running your code, then specify explicitly the MKL libraries, library paths, and include paths your application needs. Consult the Intel MKL Link Line Advisor for details. A typical compile/link process on a TACC system will look like this:

login1$ module load gcc
login1$ module load mkl                  # available/needed only for GNU compilers
login1$ gcc -fopenmp -I$MKLROOT/include   \
         -Wl,-L${MKLROOT}/lib/intel64     \
         -lmkl_intel_lp64 -lmkl_core      \
         -lmkl_gnu_thread -lpthread       \
         -lm -ldl mycode.c

For your convenience the mkl module file also provides alternative TACC-defined variables like $TACC_MKL_INCLUDE (equivalent to $MKLROOT/include). Execute "module help mkl" for more information.

Using MKL as BLAS/LAPACK with Third-Party Software

When your third-party software requires BLAS or LAPACK, you can use MKL to supply this functionality. Replace generic instructions that include link options like "-lblas" or "-llapack" with the simpler MKL approach described above. There is no need to download and install alternatives like OpenBLAS.
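
For example, in a Makefile-driven build with the Intel compilers, one might replace a generic link line with the MKL flag (a sketch; the variable names vary from package to package):

# LIBS = -llapack -lblas     # generic BLAS/LAPACK link line (replace this)
LIBS = -mkl                  # MKL supplies BLAS/LAPACK when using the Intel compilers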

Using MKL as BLAS/LAPACK with TACC's MATLAB, Python, and R Modules

TACC's MATLAB, Python, and R modules all use threaded (parallel) MKL as their underlying BLAS/LAPACK library. This means that even serial codes written in MATLAB, Python, or R may benefit from MKL's thread-based parallelism. This requires no action on your part other than specifying an appropriate max thread count for MKL; see the section below for more information.

Controlling Threading in MKL

Any code that calls MKL functions can potentially benefit from MKL's thread-based parallelism; this is true even if your code is not otherwise a parallel application. If you are linking to the threaded MKL (using "-mkl", "-mkl=parallel", or the equivalent explicit link line), you need only specify an appropriate value for the max number of threads available to MKL. You can do this with either of the two environment variables MKL_NUM_THREADS or OMP_NUM_THREADS. The environment variable MKL_NUM_THREADS specifies the max number of threads available to each instance of MKL, and has no effect on non-MKL code. If MKL_NUM_THREADS is undefined, MKL uses OMP_NUM_THREADS to determine the max number of threads available to MKL functions. In either case, MKL will attempt to choose an optimal thread count less than or equal to the specified value. Note that OMP_NUM_THREADS defaults to 1 on TACC systems; if you use the default value you will get no thread-based parallelism from MKL.

If you are running a single serial, unthreaded application (or an unthreaded MPI code involving a single MPI task per node) it is usually best to give MKL as much flexibility as possible by setting the max thread count to the total number of hardware threads on the node (272 on KNL). Of course things are more complicated if you are running more than one process on a node: e.g. multiple serial processes, threaded applications, hybrid MPI-threaded applications, or pure MPI codes running more than one MPI rank per node. See http://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications and related Intel resources for examples of how to manage threading when calling MKL from multiple processes.
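
For example, before running a single unthreaded application on a KNL node, one might give MKL access to all hardware threads (the executable name is illustrative; see the Intel guidance above when running multiple processes per node):

$ export MKL_NUM_THREADS=272   # max threads available to MKL on a 68-core KNL node
$ ./mycode.exe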

Building Software for Phase 1 Compute Nodes (KNL)

You can compile for the Stampede2 KNLs on either a Broadwell login node or any KNL compute node. Building on the login node is likely to be faster, and is the approach we currently recommend. In either case, use the "-xMIC-AVX512" switch at both compile and link time to produce compiled code targeting the KNL. In addition, you may want to specify an optimization level (e.g. "-O3"). You may want to avoid using "-xHost" when building on a Broadwell login node. If you do so you will produce an executable that will run on KNL but will not employ its optimized instruction set.
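
For example, to compile on a login node for execution on the KNLs (the file and executable names are illustrative):

login1$ icc   -xMIC-AVX512 -O3 -o mycode.exe mycode.c
login1$ ifort -xMIC-AVX512 -O3 -o mycode.exe mycode.f90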

When building on a login node using build systems that compile and run their own test programs (e.g. Autotools/configure, SCons, and CMake), you will need to specify flags that produce code that will run on both the Broadwell login node (the build architecture on which these tests run) and on the KNL compute nodes (the actual target architecture). This is done through an Intel compiler feature called CPU dispatch, which produces binaries containing alternate code paths optimized for multiple architectures. To produce such a binary containing optimized code for both Broadwell and KNL, supply two flags when compiling and linking (the same settings you would use when building on the Haswell login node that was the front end for the Stampede1 KNL sub-system):

-xCORE-AVX2 -axMIC-AVX512

In a typical build system, add these flags to the CFLAGS, CXXFLAGS, FFLAGS, and LDFLAGS variables. Expect the build to take longer than it would for one target architecture, and expect the resulting binary to be larger.
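
For an Autotools-based package, for example, one might pass the flags on the configure command line (a sketch; the exact variables a given package honors may differ):

login1$ ./configure CC=icc  CFLAGS="-xCORE-AVX2 -axMIC-AVX512 -O3" \
                    FC=ifort FFLAGS="-xCORE-AVX2 -axMIC-AVX512 -O3"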

Stampede2's Intel compilers are newer than those that were installed on the Stampede1 KNL sub-system. We therefore recommend rebuilding software originally compiled with the Intel compilers for the Stampede1 KNL sub-system.

Building Software for Phase 2 Compute Nodes (SKX)

Pending. This material will be available during the Phase 2 SKX early user period.

Job Accounting

Stampede2's accounting system is based on node-hours: one Service Unit (SU) represents a single compute node used for one hour (a node-hour). The total cost of any given job is the total node-hours consumed by that job, adjusted in some cases by a multiplier associated with a special-use queue:

SUs billed (node-hrs) = ( # nodes ) x ( job duration in wall clock hours ) x ( queue multiplier )
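
For example, a job that runs on 4 nodes for 2.5 wall clock hours in a queue with a multiplier of 1 is charged 4 x 2.5 x 1 = 10 SUs.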

The system tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. In general, your queue wait time will be less if you request only the time you need: the scheduler will have an easier time finding a slot for the 2 hours you really need than for the 48 hours you request in your job script.

Slurm Job Scheduler

Stampede2's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs.

Slurm Partitions (Queues)

Currently available queues include those listed in the Stampede2 Production Queues table below. See KNL Compute Nodes, Memory Modes, and Cluster Modes for more information on memory-cluster modes.

Table. Stampede2 Production Queues

Queue           Node Type   Max Nodes (assoc'd cores) per Job*   Max Duration   Max Jobs in Queue*   Charge (per node-hour)   Configuration (memory-cluster mode)***
development     KNL         4 nodes (272 cores)                  2 hrs          1                    1 Service Unit (SU)      cache-quadrant
normal          KNL         256 nodes (17,408 cores)             48 hrs         50                   1 SU                     cache-quadrant
large**         KNL         1024 nodes (69,632 cores)            48 hrs         50                   1 SU                     cache-quadrant
flat-quadrant   KNL         32 nodes (2,176 cores)               48 hrs         50                   1 SU                     flat-quadrant
flat-snc4       KNL         32 nodes (2,176 cores)               48 hrs         50                   1 SU                     flat-SNC4

* Queues and limits are likely to change frequently and without notice during the first few months of production. Execute "qlimits" on Stampede2 for real-time information regarding limits on available queues.

** To request more nodes than are available in the normal queue, submit a consulting (help desk) ticket through the TACC or XSEDE user portal. Include in your request reasonable evidence of your readiness to run under the conditions you're requesting. In most cases this should include strong or weak scaling results summarizing experiments you've run on KNL.

*** For non-hybrid memory-cluster modes or other special requirements, submit a ticket through the TACC or XSEDE user portal.

Submitting Batch Jobs with sbatch

Use Slurm's "sbatch" command to submit a batch job to one of the Stampede2 queues:

login1$ sbatch myjobscript

Here "myjobscript" is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run. Typical generic examples include:

These scripts are also available on Stampede2 in /share/doc/slurm. You can write your job script by copying the sample script that is most like your intended job, then editing it to meet your needs. If your job uses a software package provided by a Stampede2 module, you should also check that module's help text for additional information that may help you construct your job script.

The Common sbatch Options table below describes some of the most common sbatch command options. Slurm directives begin with "#SBATCH"; most have a short form (e.g. "-N") and a long form (e.g. "--nodes"). You can pass options to sbatch using either the command line or the job script; most users find the job script the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases "#!/bin/bash" or "#!/bin/csh" is the right choice. Avoid "#!/bin/sh" (its startup behavior can lead to subtle problems on Stampede2), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Stampede2.

Table. Common sbatch Options

Option Argument Comments
-p queue_name Submits to queue (partition) designated by queue_name
-J job_name Job Name
-N total_nodes Required. Define the resources you need by specifying either:
(1) "-N" and "-n"; or
(2) "-N" and "--ntasks-per-node".
-n total_tasks This is the total number of MPI tasks in this job. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as "-N".
--ntasks-per-node or --tasks-per-node tasks_per_node This is the number of MPI tasks per node. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1.
-t hh:mm:ss Required. Wall clock time for job.
--mail-user= email_address Specify the email address to use for notifications.
--mail-type= begin, end, fail, or all Specify when user notifications are to be sent (one option per line).
-o output_file Direct job standard output to output_file (without the -e option, error output goes to this file as well)
-e error_file Direct job error output to error_file
-d= afterok:jobid Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes
-A projectnumber Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects.
-a or --array N/A Not available. Use the launcher module for parameter sweeps and other collections of related serial/threaded jobs.
--mem N/A Not available. If you attempt to use this option, your job will not run.
--export= N/A Avoid this option on Stampede2. Using it is rarely necessary and can interfere with the way the system propagates your environment.

By default, Slurm writes all console output to a file named "slurm-%j.out", where %j is the numerical job ID. To specify a different filename use the "-o" option. To save stdout (standard out) and stderr (standard error) to separate files, specify both "-o" and "-e".
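
As an illustration only (the project name, node/task counts, module usage, and executable are placeholders, not an official template), a minimal MPI batch script for the normal queue might look like this:

#!/bin/bash
#SBATCH -J myjob           # job name
#SBATCH -o myjob.o%j       # standard output file ("%j" expands to the job ID)
#SBATCH -p normal          # queue (partition)
#SBATCH -N 2               # number of nodes
#SBATCH -n 128             # total MPI tasks (64 tasks/node)
#SBATCH -t 02:00:00        # wall clock time (hh:mm:ss)
#SBATCH -A myproject       # allocation/project name (placeholder)

module list                # document the loaded modules in the job output
ibrun ./mycode.exe         # launch the MPI executable with TACC's MPI launcher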

Interactive Sessions with idev and srun

TACC's own idev utility is the best way to begin an interactive session on one or more compute nodes. To launch a thirty-minute session on a single node in the development queue, simply execute:

login1$ idev

You'll then see output that includes the following excerpts:

...
-----------------------------------------------------------------
      Welcome to the Stampede2 Supercomputer          
-----------------------------------------------------------------
...

-> After your idev job begins to run, a command prompt will appear,
-> and you can begin your interactive development session. 
-> We will report the job status every 4 seconds: (PD=pending, R=running).

->job status:  PD
->job status:  PD
...
c449-001$

The "job status" messages indicate that your interactive session is waiting in the queue. When your session begins, you'll see a command prompt on a compute node (in this case, the node with hostname c449-001). If this is the first time you launch idev, the prompts may invite you to choose a default project and a default number of tasks per node for future idev sessions.

For command line options and other information, execute "idev --help". It's easy to tailor your submission request (e.g. shorter or longer duration) using Slurm-like syntax:

login1$ idev -p normal -N 2 -n 8 -m 150 # normal queue, 2 nodes, 8 total tasks, 150 minutes

For more information see the idev documentation.

You can also launch an interactive session with Slurm's srun command, though there's no clear reason to prefer srun to idev. A typical launch line would look like this:

login1$ srun --pty -N 2 -n 8 -t 2:30:00 -p normal /bin/bash -l # same conditions as above

Interactive Sessions using ssh

If you have a batch job or interactive session running on a compute node, you "own the node": you can connect via ssh to open a new interactive session on that node. This is an especially convenient way to monitor your applications' progress. One particularly helpful example: login to a compute node that you own, execute "top", then press the "1" key to see a display that allows you to monitor thread ("CPU") and memory use.

There are many ways to determine the nodes on which you are running a job, including feedback messages following your sbatch submission, the compute node command prompt in an idev session, and the squeue or showq utilities. The sequence of identifying your compute node then connecting to it would look like this:

login1$ squeue -u bjones
 JOBID       PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
858811     development idv46796   bjones  R       0:39      1 c448-004
login1$ ssh c448-004
...
c448-004$

Slurm Environment Variables

Be sure to distinguish between internal Slurm replacement symbols (e.g. "%j" described above) and Linux environment variables defined by Slurm (e.g. SLURM_JOBID). Execute "env | grep SLURM" from within your job script to see the full list of Slurm environment variables and their values. You can use Slurm replacement symbols like "%j" only to construct a Slurm filename pattern; they are not meaningful to your Linux shell. Conversely, you can use Slurm environment variables in the shell portion of your job script but not in an SBATCH directive. For example, the following directive will not work the way you might think:

#SBATCH -o myMPI.o${SLURM_JOB_ID}   # incorrect

Instead, use the following directive:

#SBATCH -o myMPI.o%j     # "%j" expands to your job's numerical job ID

Similarly, you cannot use paths like $WORK or $SCRATCH in an #SBATCH directive.
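
You can, however, use Slurm environment variables freely in the shell portion of the script; for example (the directory name is illustrative):

mkdir -p $SCRATCH/results.$SLURM_JOB_ID   # create a per-job output directory
cd $SCRATCH/results.$SLURM_JOB_ID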

For more information on this and other matters related to Slurm job submission, see the Slurm online documentation; the man pages for both Slurm itself ("man slurm") and its individual commands (e.g. "man sbatch"); as well as numerous other online resources.

Managing Jobs

Under construction. Until this material is available here, please consult the relevant section in the Stampede1 User Guide.

Visualization and Virtual Network Computing (VNC) Sessions

Stampede2 uses the KNL processors for all visualization and rendering operations. We use the Intel OpenSWR library to render graphics with OpenGL. On Stampede2 the swr application (e.g. "swr glxgears") replaces vglrun and uses similar syntax. Execute "module load swr" for access to this capability. We expect most users will notice little difference in visualization experience on KNL. MCDRAM may improve visualization performance for some users.

There is currently no separate visualization queue on Stampede2. All visualization apps are (or will be soon) available on all nodes. VNC sessions are available on any queue, either through the command line or via the TACC Visualization Portal. We are in the process of porting visualization application builds to Stampede2. If you are interested in an application that is not yet available, please submit a help desk ticket through the TACC or XSEDE User Portal.

Visualization Applications

Please consult the Stampede1 User Guide for information relating to various visualization packages.

Programming and Performance: General

Under construction. Until this material is available here, consult the Stampede1 User Guide.

Architecture

KNL cores are grouped in pairs; each pair of cores occupies a tile. Since there are 68 cores on each Stampede2 KNL node, each node has 34 active tiles. These 34 active tiles are connected by a two-dimensional mesh interconnect. Each KNL has 2 DDR memory controllers on opposite sides of the chip, each with 3 channels. There are 8 controllers for the fast, on-package MCDRAM, two in each quadrant.

Each core has its own local L1 cache (32KB data, 32KB instruction) and two 512-bit vector units. Both vector units can execute AVX512 instructions, but only one can execute legacy vector instructions (SSE, AVX, and AVX2). Therefore, to use both vector units, you must compile with -xMIC-AVX512.

Each core can run up to 4 hardware threads. The two cores on a tile share a 1MB L2 cache. Different cluster modes specify the L2 cache coherence mechanism at the node level.

Memory Modes

The processor's memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The output of commands like "top", "free", and "ps -v" reflects the consequences of the memory mode: such commands show the amount of RAM available to the operating system, not the total hardware memory (DDR + MCDRAM) installed.

KNL Memory Modes

  • Cache Mode. In this mode, the fast MCDRAM is configured as an L3 cache. The operating system transparently uses the MCDRAM to move data from main memory. In this mode, the user has access to 96GB of RAM, all of it traditional DDR4. Most Stampede2 queues are configured in cache mode.

  • Flat Mode. In this mode, DDR4 and MCDRAM act as two distinct Non-Uniform Memory Access (NUMA) nodes. It is therefore possible to specify the type of memory (DDR4 or MCDRAM) when allocating memory. In this mode, the user has access to 112GB of RAM: 96GB of traditional DDR and 16GB of fast MCDRAM. By default, memory allocations occur only in DDR4. To use MCDRAM in flat mode, use the numactl utility or the memkind library; see Managing Memory for more information. If you do not modify the default behavior you will have access only to the slower DDR4.

  • Hybrid Mode (not available on Stampede2). In this mode, the MCDRAM is configured so that a portion acts as L3 cache and the rest as RAM (a second NUMA node supplementing DDR4).

Cluster Modes

The KNL's core-level L1 and tile-level L2 caches can reduce the time it takes for a core to access the data it needs. To share memory safely, however, there must be mechanisms in place to ensure cache coherency. Cache coherency means that all cores have a consistent view of the data: if data value x changes on a given core, there must be no risk of other cores using outdated values of x. This, of course, is essential on any multi-core chip, but it is especially difficult to achieve on manycore processors.

The details for KNL are proprietary, but the key idea is this: each tile tracks an assigned range of memory addresses. It does so on behalf of all cores on the chip, maintaining a data structure (tag directory) that tells it which cores are using data from its assigned addresses. Coherence requires both tile-to-tile and tile-to-memory communication. Cores that read or modify data must communicate with the tiles that manage the memory associated with that data. Similarly, when cores need data from main memory, the tile(s) that manage the associated addresses will communicate with the memory controllers on behalf of those cores.

The KNL can do this in several ways, each of which is called a cluster mode. Each cluster mode, specified in the BIOS as a boot-time option, represents a tradeoff between simplicity and control. There are three major cluster modes with a few minor variations:

  • All-to-All. This is the most flexible and most general mode, intended to work on all possible hardware and memory configurations of the KNL. But this mode also may have higher latencies than other cluster modes because the processor does not attempt to optimize coherency-related communication paths.

  • Quadrant (variation: hemisphere). This is Intel's recommended default, and the cluster mode in most Stampede2 queues. This mode attempts to localize communication without requiring explicit memory management by the programmer/user. It does this by grouping tiles into four logical/virtual (not physical) quadrants, then requiring each tile to manage MCDRAM addresses only in its own quadrant (and DDR addresses in its own half of the chip). This reduces the average number of "hops" that tile-to-memory requests require compared to all-to-all mode, which can reduce latency and congestion on the mesh.

  • Sub-NUMA 4 (variation: Sub-NUMA 2). This mode, abbreviated SNC-4, divides the chip into four NUMA nodes so that it acts like a four-socket processor. SNC-4 aims to optimize coherency-related on-chip communication by confining this communication to a single NUMA node when it is possible to do so. To achieve any performance benefit, this requires explicit manual memory management by the programmer/user (in particular, allocating memory within the NUMA node that will use that memory). See Managing Memory below for more information.

Figure. KNL Cluster Modes

TACC's early experience with the KNL suggests that there is little reason to deviate from Intel's recommended default memory and cluster modes. Cache-quadrant tends to be a good choice for almost all workflows; it offers a nice compromise between performance and ease of use for the applications we have tested. Flat-quadrant is the most promising alternative and sometimes offers moderately better performance, especially when memory requirements per node are less than 16GB. We have not yet observed significant performance differences across cluster modes, and our current recommendation is that configurations other than cache-quadrant and flat-quadrant are worth considering only for very specialized needs. For more information see Managing Memory and Best Known Practices.

Managing Memory

By design, any application can run in any memory and cluster mode, and applications always have access to all available RAM. Moreover, regardless of memory and cluster modes, there are no code changes or other manual interventions required to run your application safely. However, there are times when explicit manual memory management is worth considering to improve performance. The Linux numactl (pronounced "NUMA Control") utility allows you to specify at runtime where your code should allocate memory.

When running in flat-quadrant mode, launch your code with simple numactl settings to specify whether memory allocations occur in DDR4 or MCDRAM. Other settings (e.g. --membind=4,5,6,7) select the fast MCDRAM NUMA nodes when running in flat-SNC-4 mode. See TACC Training Materials for additional information.
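
The launch lines below sketch what this looks like in practice for a serial or threaded executable (a.out stands in for your own program); the NUMA node numbers follow the flat-quadrant and flat-SNC-4 layouts described in this guide:

    numactl --membind=0 ./a.out          # flat-quadrant: allocate only in DDR4 (NUMA node 0)
    numactl --membind=1 ./a.out          # flat-quadrant: allocate only in the 16GB of MCDRAM (NUMA node 1)
    numactl --membind=4,5,6,7 ./a.out    # flat-SNC-4: allocate only in MCDRAM (NUMA nodes 4-7)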

Intel's new memkind library adds the ability to manage memory in source code: a special memory allocator for C code and a corresponding attribute for Fortran. This makes it possible to control memory allocation down to the individual data element. As this library matures it will likely become an important tool for those who need fine-grained control of memory.

When you're running in flat mode, the tacc_affinity script, rewritten for Stampede2, simplifies memory management by calling numactl "under the hood" to make plausible NUMA (Non-Uniform Memory Access) policy choices. For MPI and hybrid applications, the script attempts to ensure that each MPI process uses MCDRAM efficiently. To launch your MPI code with tacc_affinity, simply place "tacc_affinity" immediately after "ibrun":

    ibrun tacc_affinity a.out

Note that tacc_affinity is safe to use even when it will have no effect (e.g. in cache-quadrant mode). Note also that tacc_affinity and numactl cannot be used together.

Best Known Practices and Preliminary Observations

It may not be a good idea to use all 272 hardware threads simultaneously, and it's certainly not the first thing you should try. In most cases it's best to specify no more than 64-68 MPI tasks or independent processes per node, and 1-2 threads/core. One exception is worth noting: when calling threaded MKL from a serial code, it's safe to set OMP_NUM_THREADS or MKL_NUM_THREADS to 272. This is because MKL will choose an appropriate thread count less than or equal to the value you specify. See Controlling Threading in MKL in the Stampede1 User Guide for more information. In any case remember that the default value of OMP_NUM_THREADS is 1.
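
As a sketch, with a bash-style shell the relevant environment settings look like this; the values shown are examples, not recommendations for any particular code:

    export OMP_NUM_THREADS=2      # e.g. 1-2 threads per core for a hybrid MPI/OpenMP code (the default is 1)
    export MKL_NUM_THREADS=272    # safe upper bound when calling threaded MKL from an otherwise serial code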

When measuring KNL performance against traditional processors, compare node-to-node rather than core-to-core. KNL cores run at lower frequencies than traditional multicore processors. Thus, for a fixed number of MPI tasks and threads, a given simulation may run 2-3x slower on KNL than the same submission on Stampede1's Sandy Bridge nodes. A well-designed parallel application, however, should be able to run more tasks and/or threads on a KNL node than is possible on Sandy Bridge. If so, it may exhibit better performance per KNL node than it does on Sandy Bridge.

General Expectations. From a pure hardware perspective, a single Stampede2 KNL node could outperform Stampede1's dual-socket Sandy Bridge nodes by as much as 6x; this is true for both memory bandwidth-bound and compute-bound codes. This assumes the code is running out of (fast) MCDRAM on nodes configured in flat mode (450 GB/s bandwidth vs 75 GB/s on Sandy Bridge) or using cache-contained workloads on nodes configured in cache mode (memory footprint < 16GB). It also assumes perfect scalability and no latency issues. In practice we have observed application improvements of between 1.3x and 5x for several HPC workloads typically run on TACC systems. Codes with poor vectorization or scalability could see much smaller improvements. In terms of network performance, the Omni-Path network provides 100 Gbits per second peak bandwidth, with point-to-point exchange performance measured at over 11 GBytes per second for a single task pair across nodes. Latency values will be higher than those for the Sandy Bridge FDR InfiniBand network: on the order of 2-4 microseconds for exchanges across nodes.

MCDRAM in Flat-Quadrant Mode. Unless you have specialized needs, we recommend using tacc_affinity or launching your application with "numactl --preferred=1" when running in flat-quadrant mode (see Managing Memory above). If you mistakenly use "--membind=1", only the 16GB of fast MCDRAM will be available. If you mistakenly use "--membind=0", you will not be able to access fast MCDRAM at all.
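
Concretely, the difference between these options looks like the following sketch (a.out again stands in for your own executable):

    ibrun tacc_affinity a.out        # recommended for MPI and hybrid codes in flat-quadrant mode
    numactl --preferred=1 ./a.out    # recommended otherwise: prefer MCDRAM, fall back to DDR4 when the 16GB is exhausted
    numactl --membind=1 ./a.out      # caution: restricts allocations to the 16GB of MCDRAM
    numactl --membind=0 ./a.out      # caution: DDR4 only; the fast MCDRAM is never used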

Affinity. Default affinity settings are usually sensible and often optimal for both threaded codes and MPI-threaded hybrid applications. See TACC training materials for more information.

MPI Initialization. Our preliminary scaling tests with Intel MPI on Stampede2 suggest that the time required to complete MPI initialization scales quadratically with the number of MPI tasks (lower case "-n" in your Slurm submission script) and linearly with the number of nodes (upper case "-N").
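
For reference, these are the two Slurm parameters in question; the node and task counts below are purely illustrative:

    #SBATCH -N 8      # upper case -N: number of nodes
    #SBATCH -n 544    # lower case -n: total number of MPI tasks (here 68 tasks per node)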

Programming and Performance: Phase 2 System (SKX)

Pending. This material will be available during the Phase 2 SKX early user period.

File Operations: Input/Output (I/O Performance)

Under Construction. Until this material is available here, consult the Stampede1 User Guide.

References

Under construction. Until this material is available here, consult the Stampede1 User Guide.

Revision History

"Last Update" at the top of this document is the date of the most recent change to this document. This revision history is a list of non-trivial updates; it excludes routine items such as corrected typos and minor format changes.
