Lonestar 4 is now decommissioned. The Lonestar 4 login nodes will remain available through Monday, March 14, 2016 for access to the $WORK and $SCRATCH file systems.

Please see the Transition from Lonestar 4 to Lonestar 5 guide for further information.


Lonestar User Guide

System Overview

The Lonestar Linux cluster consists of 1,888 compute nodes, each with two 6-core processors, for a total of 22,656 cores. It is configured with 44 TB of total memory and 276 TB of local disk space. The theoretical peak compute performance is 302 TFLOPS. In addition, for users interested in large-memory, high-throughput computing, Lonestar provides five nodes with 24 cores and 1TB of memory each. For visualization and GPU programming, Lonestar also provides eight GPU nodes, each with 12 CPU cores. The system supports 1 PB of global, parallel file storage, managed by the Lustre file system. Nodes are interconnected with InfiniBand technology in a fat-tree topology with 40 Gbit/s point-to-point bandwidth. A 10 PB capacity archival system is available for long-term storage and backups.

Figure 1. Lonestar System: Front row of frames

System Configuration

All Lonestar nodes run the CentOS 5.5 Linux OS and support batch services through SGE 6.2. Global, data-intensive I/O is supported by a Lustre file system, while home directories are served by an NFS file system with global access. Other than the $HOME directory file system, all inter-node communication (MPI/Lustre) is carried over a Mellanox InfiniBand network. The configuration and features of the compute nodes, interconnect, and I/O systems are described below and summarized in Tables 1.1 through 1.7.

There are 1,888 blades (nodes), 16 per chassis, with each blade housing two 6-core Westmere processors, 24GB of memory, and a 146GB disk. Four 648-port Mellanox IS5600 switches and a 16-port endpoint switch in each chassis form a Clos fat-tree topology.

Figure 2. Dell M610 Blade motherboard.

Lonestar Upgrade: Large Memory and GPU Nodes Enhance Configuration

As of August 2011, an upgrade to the Lonestar cluster provides users access to large-memory nodes and graphics processing units (GPUs) in addition to the initially deployed 1,888 compute nodes (22,656 compute cores). With this upgrade, Lonestar is now the first multi-use cyberinfrastructure resource, offering cluster, large-memory, high-throughput, GPU, and visualization capabilities. The additional nodes extend the functionality of Lonestar for GPGPU programs and for programs requiring large amounts of memory. Users can now perform trillions of computations per second, solve problems that require huge amounts of memory, utilize GPUs to solve their computational problems, and remotely visualize their results, all on the same system.

GPU nodes

As part of this upgrade, Lonestar now has a queue called "gpu" which provides access to eight nodes, each with two NVIDIA M2090 GPUs. The eight logical nodes are contained in two Dell C6100 servers. Each logical node contains two Intel Xeon X5670 2.93GHz 6-core processors, 24GB of memory, an InfiniBand QDR card and 16-lane PCI-Express interface card that connects to a Dell C410x PCI expansion box that houses the NVIDIA Tesla M2090 GPUs. Each logical node has exclusive access to two M2090s.

The Lonestar login nodes have the latest NVIDIA CUDA 4.0 stack available through the modules interface giving users access to the NVIDIA CUDA tools and libraries. For additional information on compiling and running GPU codes, please see the CUDA-specific content under the Tools section of this guide. The GPU nodes also support remote interactive visualization. Please consult the visualization section.

Please note that the "gpu" queue has a charge factor of 2.0 for the service units charged when jobs are run in this queue. User jobs are limited to a maximum of 4 nodes/48 cores/8 GPUs in normal production.

Large memory nodes

In addition to the gpu queue, another new queue called largemem with access to five large memory nodes has been made available for all Lonestar users. These nodes are Dell PowerEdge R910 servers with four Intel Xeon E7540 2.0GHz 6-core processors, 1TB of memory and InfiniBand QDR card in each node. These large memory nodes are ideally suited to tasks that need lots of memory, such as pre- and post-processing of data for large scale compute jobs.

Please note that the largemem queue has a charge factor of 4.0 for the service units when jobs are run in this queue. User jobs are limited to a maximum of 2 nodes/48 cores/2 TB of memory in normal production.

Technical Specifications

Table 1.1 System Configuration & Performance

Component Technology Performance/Size
Nodes(blades) 2 Hex-core Xeon 5680 processors 1,888 Nodes / 22,656 Cores
Memory Distributed 45TB (Aggregate)
Shared Disk Lustre, parallel File System 1 PB
Local Disk SATA (146GB) 276TB (Aggregate)
Interconnect InfiniBand Mellanox Switch QDR 40 Gbit/s

Compute nodes

A regular node consists of a Dell PowerEdge M610 blade running the 2.6 x86_64 Linux kernel from kernel.org. Each node contains two Intel Xeon hexa-core 64-bit Westmere processors (12 cores in all) on a single board, as an SMP unit. The core frequency is 3.33GHz, and each core supports 4 floating-point operations per clock period, for a peak performance of 13.3 GFLOPS/core or 160 GFLOPS/node. Each node contains 24GB of memory (2GB/core). The memory subsystem has 3 channels from each processor's memory controller to 3 DDR3 ECC DIMMs, running at 1333 MHz. The processor interconnect, QPI, runs at 6.4 GT/s.

Table 1.2 PowerEdge M610 Regular Compute Node
Component Technology
Sockets per Node/Cores per Socket 2/6
Motherboard Dell M610, Intel QPI 5520 Chipset
Memory Per Node 24GB 6x4G 3 channels DDR3-1333MHz
Processor Interconnect 2x QPI 6.4 GT/s
PCI Express 36 lanes, Gen 2
146GB Disk 10K RPM SAS-SATA

GPU compute nodes

A GPU node consists of a dedicated partition of a Dell PowerEdge C6100 server running the 2.6 x86_64 Linux kernel from kernel.org. Each node contains two Intel Xeon hexa-core 64-bit Westmere processors (12 cores in all) on a single board, as an SMP unit. The core frequency is 2.93GHz, and each core supports 4 floating-point operations per clock period, for a peak performance of 11.72 GFLOPS/core or 140.6 GFLOPS/node. Each node contains 24GB of memory (2GB/core). The memory subsystem has 3 channels from each processor's memory controller to 3 DDR3 ECC DIMMs, running at 1333 MHz. The processor interconnect, QPI, runs at 6.4 GT/s. Each NVIDIA Tesla M2090 GPU has 512 CUDA cores and 6GB of ECC-protected memory. Each M2090 is capable of 1331 GFLOPS peak single-precision and 665 GFLOPS peak double-precision performance.

Table 1.3 PowerEdge C6100 GPU Compute Node

Component Technology
Sockets per Node/Cores per Socket 2/6
Motherboard Dell C6100, Intel QPI 5520 Chipset
Memory per Node 24GB 6x4G 3 channels DDR3-1333MHz
GPUs per Node 2x NVIDIA Tesla M2090, 6GB RAM, 512 CUDA cores, 177GB/s maximum memory bandwidth
Processor Interconnect 2x QPI 6.4 GT/s
PCI Express 36 lanes, Gen 2
146GB Disk 10K RPM SAS-SATA

Large Memory nodes

Table 1.4 Large Memory nodes

Component Technology
Sockets per Node/Cores per Socket 4/6
Motherboard Dell R910, Intel E7510 Chipset
Memory Per Node 1TB 64x16GB DDR3-1066MHz
Processor Interconnect 4x QPI 6.4GT/s
PCI Express 8x, Gen 2
(2) 146GB Disk 10K RPM SAS configured in RAID-1

Login nodes

Table 1.5 PowerEdge M610 Login Nodes

Component Technology
2 login nodes lonestar.tacc.utexas.edu
Sockets per Node/Cores per Socket 2/6
Motherboard Dell M610, Intel QPI 5520 Chipset
Memory Per Node 24GB 3x DDR3-1333 MHz
15TB $HOME Disk Dell 1200 PowerVault
1GB quota  

Intel Westmere processor

The new Nehalem/Westmere architectures include the following features important to HPC:

  • On-chip (integrated) memory controllers
  • Two, three and four channel DDR3 SDRAM memory
  • Intel QuickPath Interconnect (replaces legacy front side bus)
  • Hyper-threading (reintroduced, but turned off for HPC computing)
  • 64 KB L1 Cache/core (32KB L1 Data and 32KB L1 Instruction)
  • 256KB L2 Cache/core
  • 12MB L3 cache shared by all cores (for Lonestar's 5680 processors)
  • 2nd level TLB caching
  • 1GB memory-page size (under investigation for HPC performance)

The Intel Xeon 5680 (Westmere-EP) processors are employed in the Dell M610 blades. At 3.33GHz and 4 FLOPS per clock period, the peak performance per node is 12 cores x 13.3 GFLOPS/core = 160 GFLOPS.

Table 1.6 Intel Westmere 5680 Processor

Technology 64-bit (Intel EM64T)
Clock Speed 3.33GHz
FP Results/Clock Period 4
Peak Performance/core 13.3GFLOPS/core
L3 Cache (shared) 12MB
L1 Cache 64KB

Interconnect

The InfiniBand (IB) interconnect topology is a Clos fat tree (no oversubscription). Each of the 16 blades in a chassis is connected to a Mellanox M3601Q InfiniBand switch blade (leaf switch) within the chassis. From 16 ports (40Gb/s per port) of each switch blade, a bundle of 4 links is connected to each of the 4 large core switches, as shown in Figure 3. Each core switch is a Mellanox IS5600 648-port (40Gb/s each) switch delivering up to 51.8Tb/s of bandwidth. Even with I/O servers connected to the switches, there is still about 25% capacity remaining for system growth. Any processor is only 3 hops away from any other processor.

Figure 3. InfiniBand Switch Topology (1,888 nodes).

File Systems

Lonestar storage includes a 146GB SATA drive (65GB usable by users) on each node. $HOME directories are NFS-mounted to all nodes and limited by quota to 1GB per user. The $SCRATCH file system, also accessible from all nodes, is a parallel file system supported by Lustre and 841TB of DataDirect storage. Archival storage is not directly available from the login nodes, but is accessible through scp.

Table 1.7 Storage Systems

Storage Class Size Architecture Features
Local 146GB/node (65GB available to users) SAS-SATA mounted on /tmp
Parallel 1PB Lustre, DataDirect with 2 SFA10000 Controllers 15 Dell R610 data servers (OSS), through IB, user striping, MPI-IO, services `$SCRATCH/$WORK`, 4 Dell R710 MDS servers with 2 DellMD 3200 Storage Arrays
$HOME 15TB NFS, RAID-6, 1GB/user 1 Dell 1200 PowerVault, 10 2TB disks
Ranch (Tape Storage) 10PB SAM-FS (Storage Archive Manager) 10GB/s connection through 4 GridFTP Servers

The TACC HPC platforms have several different file systems with distinct storage characteristics. There are predefined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.

The $WORK and $SCRATCH directories on Lonestar are Lustre file systems. They are designed for parallel and high performance data access from within applications. They have been configured to work well with MPI-IO, accessing data from many compute nodes.

Home directories use the NFS (Network File System) protocol, and their file systems are designed for smaller and less intense data access: a place for storing executables and utilities. Use MPI-IO only on the $WORK and $SCRATCH file systems; your $HOME directory does not support parallel I/O with MPI-IO.

To determine the amount of disk space used in a file system, cd to the directory of interest and execute the "df -k ." command, including the dot that represents the current directory. Without the dot all file systems are reported.

In the command output below, the file system name appears on the left (IP number, followed by the file system name), and the used and available space (-k, in units of 1 KBytes) appear in the middle columns, followed by the percent used, and the mount point:

login1$ df -k .
Filesystem 1K-blocks Used Available Use% Mounted on
206.76.192.2:/home1 15382877568 31900512 15350977056 1% /home1

To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sh option (s=summary, h='human readable' units):

login1$ du -sh

To determine quota limits and usage on $HOME, execute the quota command without any options (from any directory):

login1$ quota
Disk quotas for user jlockman (uid 801913):
Filesystem blocks quota limit grace files quota limit grace
206.76.192.2:/home1 68 1048576 1200000 18 1000000 101000

The major file systems available on Lonestar are:

$HOME

  • At login, the system automatically sets the current working directory to your home directory.
  • Store your source code and build your executables here.
  • This directory has a quota limit of 1GB.
  • This file system is backed up.
  • The frontend nodes and any compute node can access this directory.
  • Use $HOME to reference your home directory in scripts.
  • Use cdh to change to $HOME.

$WORK

  • This directory has a quota limit of 250GB.
  • Store large files here.
  • Change to this directory in your batch scripts and run jobs in this file system.
  • The work file system is approximately 450TB.
  • This file system is not backed up.
  • The frontend nodes and any compute node can access this directory.
  • Purge Policy: not purged
  • Use $WORK to reference this directory in scripts.
  • Use cdw to change to $WORK.
  • NOTE: TACC staff may delete files from $WORK if the file system becomes full. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.

$SCRATCH

  • Store large files here.
  • Change to this directory in your batch scripts and run jobs in this file system.
  • The scratch file system is approximately 1.4PB.
  • This file system is not backed up.
  • The frontend nodes and any compute node can access this directory.
  • Purge Policy: Files with access times greater than 10 days are purged.
  • Use $SCRATCH to reference this directory in scripts.
  • Use cds to change to $SCRATCH.
  • NOTE: TACC staff may delete files from $SCRATCH if the file system becomes full, even if files are less than 10 days old. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.

/tmp

  • This is a directory on the local disk of each node where you can store files and perform local I/O for the duration of a batch job (see the sketch after this list).
  • It is often more efficient to use and store files directly in $SCRATCH (to avoid moving files from /tmp at the end of a batch job).
  • The local /tmp file system provides approximately 65 GB of space available to users.
  • Files stored in /tmp on each node are removed immediately after the job terminates.
  • Use /tmp to reference this file system in scripts.
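
For jobs that perform heavy local I/O, a typical pattern is to stage input into /tmp, run there, and copy results back to $SCRATCH before the job ends. The sketch below uses hypothetical file, directory, and program names:

# In a batch script: stage data to node-local /tmp, compute, then save results.
# config.dat, myprog, and the directory names are placeholders.
cp $SCRATCH/inputs/config.dat /tmp/
cd /tmp
./myprog config.dat > results.out         # myprog is a hypothetical serial executable
cp /tmp/results.out $SCRATCH/results/     # copy results back before the job terminates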

$ARCHIVE

  • Store permanent files here for archival storage.
  • This file system is NOT NFS-mounted (directly accessible) on any node.
  • Use the $ARCHIVE file system only for long-term file storage to the $ARCHIVER system; it is not appropriate to use it as a staging area.
  • Use the scp command to transfer data to this system. For example:
    login1$ scp ${ARCHIVER}:$ARCHIVE/myfile $WORK
  • Use the ssh command to login to the $ARCHIVER system from any TACC machine. For example:
    login1$ ssh $ARCHIVER
  • See the Ranch User Guide for more on archiving.

System Access

To ensure a secure login session, users must connect to machines using the secure shell (SSH) program. Data movement must be done using the secure commands scp and sftp.

Before any login sessions can be initiated using ssh, a working SSH client must be present on the local machine. Wikipedia is a good source of information on SSH in general and provides information on the various clients available for your particular operating system.

Do not run the optional ssh-keygen command to set up public-key authentication. This option sets up a passphrase that will interfere with submitting job scripts to the batch queues. If you have already done this, remove the .ssh directory (and the files under it) from your home directory, then log out and log back in to test.
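
For example, to remove the previously generated keys and configuration from your home directory:

login1$ rm -r ~/.ssh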

To initiate an SSH connection to a Lonestar login node from a UNIX or Linux system with SSH already installed, execute the following command:

login1$ ssh userid@lonestar.tacc.utexas.edu

Note: userid is needed only if the user name on the local machine and the TACC machine differ.

You may reset your TACC password here after logging into the TACC User Portal.

Computing Environment

Unix shell

The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, use:

login1$ echo $SHELL

You can use the chsh command to change your login shell. Full instructions are in the chsh man page. Available shells are defined in the /etc/shells file, along with their full paths.

To display the list of available shells with chsh and to change your login shell to bash, execute the following:

login1$ chsh -l
login1$ chsh -s /bin/bash

Environment variables

The next most important component of a user's environment is the set of environment variables. Many of the Linux commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:

login1$ env

The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the $HOME and $PATH variables.

HOME=/home/utexas/username
PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin

Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C shells and export for Bourne shells) are carried to the environment of shell scripts and new shell invocations, while normal shell variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute set to see the (normal) shell variables.
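
For example, in a Bourne-type shell (bash), the difference between a normal shell variable and an exported environment variable can be seen as follows (MYDIR and MYDATA are arbitrary example names):

login1$ MYDIR=$HOME/data            # normal shell variable: visible only in the current shell
login1$ export MYDATA=$WORK/sim1    # environment variable: inherited by child shells and scripts
login1$ env | grep MYDATA           # appears in the env output
login1$ env | grep MYDIR            # produces no output, since MYDIR was not exported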

Startup Scripts

All UNIX systems set up a default environment and provide administrators and users with the ability to execute additional UNIX commands to alter that environment. These commands are sourced; that is, they are executed by the login shell, and the variables (both normal and environment), as well as aliases and functions, are included in the present environment. Lonestar supports the Bourne shell and its variants (/bin/sh, /bin/bash, /bin/zsh) and the C shell and its variants (/bin/csh, /bin/tcsh). Each shell's environment is controlled by system-wide and user startup files. TACC deploys system-specific startup files in the /etc/profile.d/ directory. User-owned startup files are dot files (they begin with a period and can be viewed with the "ls -a" command) in the user's $HOME directory.

Each UNIX shell may be invoked in three different ways: as a login shell, as an interactive shell, or as a non-interactive shell. Be aware that each type of shell runs different startup scripts at different times depending on how it is invoked. Both login and interactive shells are shells in which the user interacts with the operating system via a terminal window. A user issues standard command-line instructions interactively in these types of shells. A non-interactive shell is launched by a script execution and does not interact with the user, for example, when a queued job script runs.

Bash shell users should understand that login shells (for example, shells launched via an SSH client) source one and only one of the files "~/.bash_profile", "~/.bash_login", or "~/.profile" (whichever the system finds first in file-list order), and will not automatically source "~/.bashrc". Interactive non-login shells, for example shells launched by typing "bash" on the command line, will source "~/.bashrc" and nothing else. The execution of a non-interactive Bash script (normal or job script) does not source any of the mentioned startup files.
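
A common way to get consistent behavior, assuming you keep your personal settings in ~/.bashrc, is to have ~/.bash_profile source it. This is a conventional pattern, not a TACC requirement:

# ~/.bash_profile -- read only by login shells
if [ -f ~/.bashrc ]; then
    . ~/.bashrc    # also pick up the settings used by interactive non-login shells
fi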

Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, modules commands set up a basic environment for the default compilers, tools, and libraries. For example: the $PATH, $MANPATH, $LIBPATH environment variables, directory locations ($WORK, $HOME, etc.), aliases (cdw, cdh, etc.) and license paths are set by the login modules. Therefore, there is no need for you to set them or update them when updates are made to system and application software.

Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

Each of the major TACC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only to invoke a single command to configure the application/programming environment properly. The general format of this command is:

login1$ module load modulename

where modulename is the name of the module to load. If you often need an application environment, place the required module commands in your ~/.cshrc and/or ~/.bashrc shell startup file, as shown below.
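
Assuming your startup files already initialize the modules system (the TACC defaults do), lines such as the following could be added to ~/.bashrc; the modules named are only examples:

# ~/.bashrc -- load frequently used application environments at login
module load fftw3    # FFTW3 libraries (example)
module load mkl      # Intel MKL (example)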

Most of the package directories are in /opt/apps/ ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set up by loading the fftw3 module:

login1$ module load fftw3

To look at a synopsis of a particular modulefile's operations (in this case, fftw3), or to see a list of currently loaded modules, execute the following commands:

login1$ module help fftw3
login1$ module list

Available Modules

TACC's module system is organized hierarchically to prevent users from loading software that will not function properly with the current compiler/MPI configuration. Two methods exist for looking at the available modules: Looking at modules available with the current compiler/MPI configuration, and looking at all of the modules installed on the system.

To see a list of modules available to the user with the current compiler/MPI configuration, users can execute the following command:

login1$ module avail

This will allow the user to see which software packages are available with the current compiler/MPI configuration (e.g. Intel 11.1 with MVAPICH2).

To see a list of modules available to the user with any compiler/MPI configuration, users can execute the following command:

login1$ module spider

This command will display all available packages on the system. To get specific information about a particular package, including the available compiler/MPI configurations for that package, execute the following command:

login1$ module spider modulename

Software upgrades and adding modules

During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will always announce upgrades and module changes in advance.

Transferring files to Lonestar

TACC supports the use of ssh/scp and globus-url-copy for file transfers.
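
For example, to copy a file or a directory from your local workstation into your Lonestar home directory with scp, run the following from the local machine (replace username and the file names with your own):

localhost$ scp mydata.tar.gz username@lonestar.tacc.utexas.edu:
localhost$ scp -r myproject/ username@lonestar.tacc.utexas.edu: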

globus-url-copy

To transfer data between XSEDE sites, use globus-url-copy. Complete globus-url-copy documentation is available online. The globus-url-copy command requires the use of an XSEDE certificate to create a proxy for password-less transfers. It has a complex syntax, but provides high-speed access to other XSEDE machines that support GridFTP services (the protocol underlying globus-url-copy). High-speed transfers of a file or directory occur between the GridFTP servers at the XSEDE sites. The GridFTP servers mount the file systems of the compute machines, thereby providing access to your files or directories. Third-party transfers are possible (transfers initiated between two machines from another machine). For a list of XSEDE GridFTP servers and mounted directory names, please see the XSEDE Data Transfers & Management page.

Use myproxy-logon to obtain a proxy certificate. For example:

login1$ myproxy-logon

This command will prompt for the certificate password. The proxy is valid for 12 hours for all logins on the local machine. With globus-url-copy, you must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy <options>
   gsiftp://<gridftp_server1>/<directory>|<file> \
   gsiftp://<gridftp_server2>/<directory>|<file>

An example file transfer might look like this:

login1$ globus-url-copy -stripe -tcp-bs 11M -vb \
   gsiftp://gridftp.ranger.tacc.xsede.org/`pwd`/file1 \
   gsiftp://tg-gridftp.ncsa.xsede.org/home/ncsa/johndoe/file2

An example directory transfer might look like this:

login1$ globus-url-copy -stripe -tcp-bs 11M -vb \
   gsiftp://gridftp.ranger.tacc.xsede.org/`pwd`/directory1/ \
   gsiftp://gridftp.ranch.tacc.xsede.org/home/00000/johndoe/directory2/

Use the stripe and buffer size options (-stripe to use multiple service nodes, -tcp-bs 11M to set ftp data channel buffer size). Otherwise, the speed will be about 20 times slower! When transferring directories, the directory path must end with a slash (/). The -rp option (not shown) allows paths relative to the user's "starting" directory of the filesystem mounted on the server. Without this option, you must specify the full path to the file or directory.

Application Development

Programming Models

There are two distinct memory models for computing: distributed-memory and shared-memory. In the former, the Message Passing Interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, OpenMP (OMP) programming techniques are employed for multiple threads (lightweight processes) to access memory in a common address space.

For distributed memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used. In the SPMD paradigm, each processor core loads the same program image and executes and operates on data in its own address space (different data). This is illustrated in Figure 4. It is the usual mechanism for MPI code: a single executable (a.out in the figure) is available on each node (through a globally accessible file system such as $WORK or $HOME), and launched on each node via the batch MPI launch command, "ibrun a.out".

Figure 4. Single-Program Multiple-Data (SPMD) and Multiple-Program Multiple-Data (MPMD) programming models

In the MPMD paradigm, each processor core loads and executes a different program image and operates on different data sets, as illustrated in Figure 4. This paradigm is often used by researchers who are investigating the parameter space (parameter sweeps) of certain models and need to launch tens or hundreds of single-processor executions on different data. (This is a special case of MPMD in which the same executable is used and there is NO MPI communication.) The executables are launched through the same mechanism as SPMD jobs, but a Linux script is used to assign input parameters to the execution command (through the batch MPI launcher, "ibrun myscript"). Details of the batch mechanism for parameter sweeps are described in the help information for the launcher module:

login1$ module help launcher
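
As a rough sketch of the wrapper-script idea only (the actual mechanism and variable names are defined by the launcher module; consult its help output), each task selects its own input based on a per-task identifier:

#!/bin/bash
# my_sweep.sh -- hypothetical wrapper launched on every task via "ibrun my_sweep.sh".
# TASK_ID stands in for whatever per-task rank/ID variable the launch environment provides.
TASK_ID=${PMI_RANK:-0}                  # assumption: the MPI stack exports a rank variable
./myprog --input params.${TASK_ID}.in   # myprog and params.*.in are placeholder names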

The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes. Each node on this system contains 12 cores with a single 24GB memory subsystem. The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP) illustrated in Figure 5. The former name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem. There is no need for explicit messaging between processors as with MPI coding.

Figure 5. Parallel Vector Processing (PVP) programming model

In the SMPP paradigm, either compiler directives (pragmas in C, special comments in Fortran) or explicit threading calls (e.g., with Pthreads) are employed. The majority of science codes now use OpenMP directives, which are understood by most vendor compilers as well as the GNU compilers.

In cluster systems that have SMP nodes and a high speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node an MPI executable is launched on each CPU and runs within a separate address space. In this way, all CPUs appear as a set of distributed memory machines, even though each node has CPUs that share a single memory subsystem.

In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.

The number of applications that benefit from hybrid programming on dual-processor nodes (e.g., on Lonestar) is very small. The programming and support of hybrid codes is complicated by compiler and platform support for both paradigms. However, with new multi-core, multi-socket commodity systems on the horizon, there may be a resurgence in hybrid programming if these systems provide enhanced performance with SMPP (OMP) algorithms.

For further information, please see the Reference section of this document.

Compiling

The Lonestar programming environment uses the Intel C++ and Intel Fortran compilers by default. This section highlights the important HPC aspects of using the Intel compilers. The Intel compiler commands can be used for both compiling (making ".o" object files) and linking (making an executable from ".o" object files). For information on compiling GPU programs, please see the CUDA and OpenCL sections of this guide under the Tools section.

The Intel Compiler Suite

The Intel compiler version 11.1 is loaded as the default at login with the intel module. (The previous version of the compiler is available for special porting needs. Use the "module avail" command to list all installed modules, including versions and default information where applicable.) The gcc compiler and module are also available. Use "gcc --version" to display version information. We recommend using the Intel suite whenever possible. The Intel suite is installed with the EM64T 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode). Any programs compiled on 32-bit systems need to be recompiled to run natively on Lonestar. Any pre-compiled packages should be EM64T (x86-64) compiled or errors may occur. Since only 64-bit versions of the MPI libraries have been built on Lonestar, programs compiled in 32-bit mode will not execute MPI code.

The Intel Fortran compiler command is "ifort". Use "ifort -V" to display current version information.

Web accessible Intel manuals are available from the Intel website.

Compiling Serial Programs

Table 4.1 Compiling serial programs

Compiler Language File Extension Example
icc C .c icc [compiler_options] prog.c
icpc C++ .C, .cc, .cpp, .cxx icpc [compiler_options] prog.cpp
ifort F77 .f, .for, .ftn ifort [compiler_options] prog.f
ifort F90 .f90, .fpp ifort [compiler_options] prog.f90

Appropriate file name extensions are required for each compiler. By default, the executable name is a.out; and it may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimizations.

A C program example:

login1$ icc -o flamec.exe -O3 -xSSE4.2 prog.c

A Fortran program example:

login1$ ifort -o flamef.exe -O3 -xSSE4.2 prog.f90

Commonly used options may be placed in an icc.cfg or ifc.cfg file for compiling C and Fortran code, respectively.
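
For example, a minimal icc.cfg might simply list the default options you want applied to every compilation; this sketch assumes the configuration file is read as a plain list of options:

-O3
-xSSE4.2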

For additional information, execute the compiler command with the -help option to display all compiler options, their syntax, and a brief explanation, or display the man page, as follows:

login1$ icc -help
login1$ icpc -help
login1$ ifort -help
login1$ man icc
login1$ man icpc
login1$ man ifort

Some of the more important options are listed in the Basic Optimization section of this guide. Additional documentation, references, and a number of user guides (pdf, html) are available in the Fortran and C++ compiler home directories ($IFC_DOC and $ICC_DOC).

Compiling Parallel Programs

Since each of the PowerEdge blades (nodes) of the Lonestar cluster is a dual-processor Xeon system, applications can use the shared-memory programming paradigm on each node. However, because of the limited number of cores in each node, the maximum theoretical benefit of using a shared-memory model on a node is a factor of 12.

The OpenMP compiler options are listed in the Basic Optimization section of this guide for those who need SMP support on the nodes. For hybrid programming, use the MPI compiler commands and include the "-openmp" option, as shown below.
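
For example (hybrid_prog.f90 is a placeholder source file name):

login1$ mpif90 -openmp -O3 -xSSE4.2 -o hybrid.exe hybrid_prog.f90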

MPI Programs

The mpicmds commands support the compilation and execution of parallel MPI programs for specific interconnects and compilers. At login, MPI MVAPICH2 (mvapich2) and Intel compiler (intel) modules are loaded to produce the default environment that provides the location to the corresponding mpicmds.

The mpicc, mpicxx, mpif77, and mpif90 compiler scripts (wrappers) compile MPI code and automatically link startup and message passing libraries into the executable. The following table lists the MPI compiler wrappers for each language:

Table 4.2 Compiling MPI programs

Compiler Language File Extension Example
mpicc C .c mpicc [compiler_options] myprog.c
mpicxx C++ intel: .C/c/cc/cpp/cxx/c++/i/ii mpicxx [compiler_options] myprog.cc
mpif77 F77 .f, .for, .ftn mpif77 [compiler_options] myprog.f
mpif90 F90 .f90, .fpp mpif90 [compiler_options] myprog.f90

Appropriate file name extensions are required for each wrapper. By default, the executable name is a.out. You may rename it using the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.

A C program example:

login1$ mpicc -o prog.exe -O3 -xSSE4.2 prog.c

A Fortran program example:

login1$ mpif90 -o prog.exe -O3 -xSSE4.2 prog.f90

Include linker options, such as library paths and library names, after the program module names, as explained in the Loading Libraries section below. The Running Code section explains how to execute MPI executables in batch scripts and interactive batch runs on compute nodes.

We recommend that you use the Intel compiler for optimal code performance. TACC does not support the use of the gcc compiler for production code on the Lonestar 4 system. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the "-cc=gcc" option for modules requiring gcc. Since gcc- and Intel-compiled codes are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler. When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler for the two cases:

login1$ mpicc -O3 -xSSE4.2 -c -cc=gcc suba.c
login1$ mpicc -O3 -xSSE4.2 mymain.c suba.o
login1$ mpicc -O3 -xSSE4.2 -c suba.c
login1$ mpicc -O3 -xSSE4.2 -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o

Compiler Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

The most basic level of optimization the compiler can perform is selected with the -On options, explained below:

Table 4.3 Compiler optimization levels

Level -On Description
n = 0: Fast compilation, full debugging support. Automatically enabled if using `-g`.
n = 1,2: Low to moderate optimization, partial debugging support:
  • instruction rescheduling
  • copy propagation
  • software pipelining
  • common subexpression elimination
  • prefetching, loop transformations
n = 3: Aggressive optimization - compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks code!):
  • enables -O2
  • more aggressive prefetching, loop transformations

The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Table 4.4 Compiler Options Affecting Performance

Option Description
-c For compilation of source file only.
-O3 Aggressive optimization (-O2 is default).
-xSSE4.2 Generates code with streaming SIMD extensions SSE2/3/4 for Intel Core architecture.
-g Debugging information, generates symbol table.
-fpe0 Enable floating point exceptions. Useful for debugging.
-fp-model <arg> Enable floating-point model variation:
  • [no-]except : enable/disable floating-point exception semantics
  • fast[=1|2] : enables more aggressive floating-point optimizations
  • precise : allows value-safe optimizations
  • source : allows intermediates in source precision
  • strict : enables fp-model precise, fp-model except, disables contractions and enables pragma stdc fenv_access
  • double : rounds intermediates in 53-bit (double) precision
  • extended : rounds intermediates in 64-bit (extended) precision
-ip Enable single-file interprocedural (IP) optimizations (within files).
-ipo Enable multi-file IP optimizations (between files).
-opt-prefetch Enables data prefetching.
-opt-streaming-stores <arg> Specifies whether streaming stores are generated:
  • Always : Enable streaming stores under the assumption that the application is memory bound
  • Auto : [DEFAULT] Compiler decides when streaming stores are used
  • Never : Disable generation of streaming stores
-openmp Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp-report[0|1|2] Controls the OpenMP parallelizer diagnostic level.

Basic Optimization for Serial and Parallel Programming using OpenMP and MPI

The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90; and icpc and mpicxx. Below are some of the common serial compiler options with descriptions.

Table 4.5 More Compiler Options

Compiler Options Description
-O3 performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-vec_report[0|...|5] controls amount of vectorizer diagnostic information.
-xSSE4.2 includes specialized code for SSE4 instruction set.
-fast NOT RECOMMENDED (implies static linking).
-g -fp generates debugging information, disables using EBP as general-purpose register.
-help lists options.

Developers often experiment with the following options: -pad, -align, -ip, -no-rec-div and -no-rec-sqrt. In some codes performance may decrease. Please see the Intel compiler manual (below) for a full description of each option.

Use the -help option with the mpicmds commands for additional information:

login1$ mpicc -help
login1$ mpicxx -help
login1$ mpif90 -help
login1$ mpirun -help

Use the options listed for mpirun with the ibrun command in your job script. For details on the MPI standard, go to: http://www.mcs.anl.gov/mpi.
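
The SGE Batch Environment section covers job submission in detail; as a rough sketch only (the queue name, parallel-environment name, and slot count below are assumptions and must be adjusted to the actual Lonestar queue definitions), a minimal job script that launches an MPI executable with ibrun might look like this:

#!/bin/bash
#$ -V                      # export the current environment to the job
#$ -cwd                    # start the job in the submission directory
#$ -N mympijob             # job name (placeholder)
#$ -j y                    # join stderr with stdout
#$ -o $JOB_NAME.o$JOB_ID   # output file name
#$ -q normal               # queue name (assumed)
#$ -pe 12way 24            # parallel environment and total slots (assumed values)
#$ -l h_rt=01:00:00        # wall-clock time limit

ibrun ./a.out              # launch the MPI executable on the allocated cores

Such a script would be submitted with qsub; see the SGE Batch Environment section for the definitive directives and queue names.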

Libraries

Some of the more useful load flags/options are listed below. For a more comprehensive list, consult the ld man page.

  • Use the "-l" loader option to link in a library at load time: e.g. ifort prog.f90 -lname
  • This links in either the shared library libname.so (default) or the static library libname.a, provided it can be found in the linker's default library search path or in the directories listed in the $LD_LIBRARY_PATH environment variable.
  • To explicitly include a library directory, use the "-L" option, e.g. ifort prog.f -L/mydirectory/lib -lname

In the above examples, the user's libname.a library is not in the default search path, so the -L option is specified to point to the libname.a directory. (Only the library name is supplied in the -l argument; remove the lib prefix and the .a suffix.)

Many of the modules for applications and libraries, such as the mkl library module, provide environment variables for compiling and linking commands. Execute the "module help modulename" command for a description, listing, and use cases of the assigned environment variables. The following example illustrates their use for the mkl library:

login1$ mpicc -Wl,-rpath,$TACC_MKL_LIB -I$TACC_MKL_INC \
    -L$TACC_MKL_LIB mkl_test.c -Wl,--start-group -lmkl_intel_lp64 \
    -lmkl_sequential -lmkl_core -Wl,--end-group

Here, the module-supplied environment variables TACC_MKL_LIB and TACC_MKL_INC contain the MKL library and header file directory paths, respectively. The loader option -Wl,-rpath specifies that the $TACC_MKL_LIB directory should be included in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from $LD_LIBRARY_PATH or the ld.so cache of bindings between shared libraries and directory paths. This avoids having to set $LD_LIBRARY_PATH (manually or through a module command) before running the executables. (This simple load sequence will work for some of the unthreaded MKL functions; see the MKL Library section for using various packages within MKL.)

You can view the full path of the dynamic libraries inserted into your binary with the ldd command. The example below shows a partial listing for the a.out binary:

login1$ ldd a.out
libmkl_intel_lp64.so => /opt/apps/intel/11.1/mkl/lib/em64t/libmkl_intel_lp64.so
...
libm.so.6 => /lib64/tls/libm.so.6
libdl.so.2 => /lib64/libdl.so.2
...

A load map, which shows the library module for each static routine and a cross reference (-cref) can be used to validate which libraries are being used in your program. The following example shows that the ddot function used in the mdot.f90 code comes from the MKL library:

login1$ mpif90 mdot.f90 -Wl,-Map,mymap,-cref \
    -Wl,-rpath,$TACC_MKL_LIB -L$TACC_MKL_LIB ... -lmkl_intel_lp64 ...

Output:

...
ddot_ /opt/apps/intel/11.1/mkl/lib/em64t/libmkl_intel_lp64.so
daxpy_ /opt/apps/intel/11.1/mkl/lib/em64t/libmkl_intel_lp64.so ...

Performance Libraries

ISPs (Independent Software Providers) and HPC vendors provide high performance math libraries that are tuned for specific architectures. Many applications depend on these libraries for optimal performance. Intel has developed performance libraries for most of the common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) for the EM64T architectures. Details of the Intel libraries and specific loader/linker options are given below.

MKL library

The Math Kernel Library consists of functions with Fortran, C, and C++ interfaces for the following computational areas:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations) and extended BLAS for sparse computations
  • LAPACK for linear algebraic equation solvers and eigensystem analysis
  • Fast Fourier Transforms
  • Transcendental Functions

In addition, MKL offers a set of functions collectively known as the Vector Math Library (VML). VML is a set of vectorized transcendental functions that offer both high performance and excellent accuracy compared to the libm functions (for most of the Intel architectures). The vectorized functions are considerably faster than standard library functions for vectors longer than a few elements.

To use MKL and VML, first load the MKL module using the command module load mkl. This will set the TACC_MKL_LIB and TACC_MKL_INC environment variables to the directories containing the MKL libraries and the MKL header files. Below is an example command for compiling and linking a program that contains calls to BLAS functions (in MKL). Note that the library is for use in a single node, hence it can be used by both serial compilers or by MPI wrapper scripts.

The following C and Fortran examples illustrate linking against the MKL library after loading the mkl module (module load mkl):

login1$ mpicc myprog.c -I$TACC_MKL_INC -Wl,-rpath,$TACC_MKL_LIB \
    -L$TACC_MKL_LIB \
    -Wl,--start-group \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core \
    -Wl,--end-group -lpthread

login1$ mpif90 myprog.f90 -Wl,-rpath,$TACC_MKL_LIB \
    -L$TACC_MKL_LIB \
    -Wl,--start-group \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core \
    -Wl,--end-group -lpthread

Notice the use of the linker commands "--start-group" and "--end-group", which are used to resolve dependencies between the libraries enclosed within. This useful option avoids having to find the correct linking order of the libraries and, in cases of circular dependencies, having to include a library more than once in a single link line.

TACC has summarized the MKL document in a quick guide. Assistance in constructing the MKL linker options is provided by the MKL Link Line Advisor utility.

GotoBLAS

The GotoBLAS Library (pronounced goat-toe) contains highly optimized BLAS routines for the Nehalem microarchitecture. The GotoBLAS routines are also supported on other architectures, and the source code is available free for academic use (download). We recommend the GotoBLAS libraries for performing the following linear algebra operations and solving matrix equations:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations)
  • LAPACK for linear algebraic equation solvers and eigensystem analysis

The GotoBLAS library contains statically compiled routines and does not require the -Wl,-rpath option to locate the library directory at run time. The multi-threaded library (for hybrid computing) has a "_mp" suffix. Also, codes that use long long int (C) or integer kind=8 (Fortran) should use the libraries differentiated with the _ilp64 moniker. Otherwise, for normal, single-core executions within MPI or serial code, use the libgoto_lp64.a library, as shown below (after loading the gotoBLAS module with the command "module load gotoblas"). If you use the ifort compiler command in your make files, you will need to include the "-shared-intel" option.

It is recommended that developers load the GotoBLAS statically with the following commands.

login1$ mpicc prog.c $TACC_GOTOBLAS_LIB/libgoto_lp64.a
login1$ mpif90 prog.f90 $TACC_GOTOBLAS_LIB/libgoto_lp64.a

Use these commands if you want to use the shared libraries.

login1$ mpicc prog.c -L$TACC_GOTOBLAS_LIB -lgoto_lp64
login1$ mpif90 prog.f90 -L$TACC_GOTOBLAS_LIB -lgoto_lp64

using ifort directly:

login1$ ifort prog.f90 -Wl,-rpath,$TACC_GOTOBLAS_LIB \
    -L$TACC_GOTOBLAS_LIB -lgoto_lp64 \
    -shared-intel

If you are using the GotoBLAS in conjunction with other libraries that include the BLAS routines, make sure to include the gotoBLAS reference (-L$TACC_GOTOBLAS_LIB -lgoto_lp64) before any other library specification.

ScaLAPACK

Since the Intel LAPACK routines are highly optimized for EM64T-type architectures, TACC recommends loading the Intel MKL library to satisfy the LAPACK references within ScaLAPACK routines. In this case it is important to specify the ScaLAPACK library options BEFORE MKL and other libraries on the loader/compiler command line. Because ScaLAPACK uses two static libraries in the APIs for accessing the BLACS routines, it may be necessary to explicitly reference them more than once, since the loader makes only a single pass through the library path list while loading libraries. If you need help determining a path sequence, please contact TACC staff through the portal consulting system or the XSEDE Help Desk. The examples below illustrate the normal library specification (sequential, not multi-threaded) to try for C codes, and the loading sequence found to work for Fortran codes (don't forget to load the scalapack and mkl modules with the command module load scalapack mkl):

login1$ mpicc prog.c -I$TACC_SCALAPACK_INC -Wl,-rpath,$TACC_MKL_LIB\
    -L$TACC_SCALAPACK_LIB -lscalapack \
    $TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
    $TACC_SCALAPACK_LIB/blacsCinit_MPI-LINUX-0.a \
    $TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
    -L$TACC_MKL_LIB -Wl,--start-group -lmkl_intel_lp64 \
    -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread \
    -L$IFC_LIB -lifcore

login1$ mpif90 prog.f90 -Wl,-rpath,$TACC_MKL_LIB \
    -L$TACC_SCALAPACK_LIB -lscalapack \
    $TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
    $TACC_SCALAPACK_LIB/blacsF77init_MPI-LINUX-0.a \
    $TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
    -L$TACC_MKL_LIB -Wl,--start-group -lmkl_intel_lp64 \
    -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread

Note, the C library list includes the Fortran libifcore.a to satisfy miscellaneous Intel run-time routines; other situations may require the portability or POSIX libraries (libifport.a and libifposix.a).

Runtime Environment

Bindings to the most recent shared libraries are configured in the file /etc/ld.so.conf (and cached in the /etc/ld.so.cache file). Execute "cat /etc/ld.so.conf" to see the TACC-configured directories, or execute:

login1$ /sbin/ldconfig -p

to see a list of directories and candidate libraries. Use the -Wl,-rpath loader option or the $LD_LIBRARY_PATH environment variable to override the default runtime bindings. The Intel compiler, MKL math libraries, GotoBLAS libraries, and other application libraries are located in directories specified by their respective environment variables, which are set when the corresponding module is loaded. Use the following command:

login1$ module help <modulename>

to display instructions and examples on loading libraries for a particular modulefile.

Debugging

DDT is a symbolic, parallel debugger that allows graphical debugging of MPI applications. For information on how to perform parallel debugging using DDT on Lonestar, please see the DDT Debugging Guide.

Code Tuning

Memory Subsystem Tuning

There are a number of techniques for optimizing application code and tuning the memory hierarchy.

Maximize cache reuse

The following snippets of code illustrate the correct way to access contiguous elements, i.e., with stride 1, for a matrix in both C and Fortran.

Fortran example:

Real*8 :: a(m,n), b(m,n), c(m,n)
...
do i=1,n
  do j=1,m
    a(j,i)=b(j,i)+c(j,i)
  end do
end do

C example:

double a[m][n], b[m][n], c[m][n];
...
for (i=0; i<m; i++){
  for (j=0; j<n; j++){
    a[i][j]=b[i][j]+c[i][j];
  }
}

Prefetching is the ability to predict the next cache line to be accessed and to start bringing it in from memory. If data are requested far enough in advance, the latency to memory can be hidden. The compiler inserts prefetch instructions into loops: instructions that move data from main memory into cache in advance of their use. Prefetching may also be requested explicitly by the user through directives.

Example: In the following dot-product example, the number of streams prefetched is increased from 2, to 4, to 6, with the same functionality. However, prefetching a larger number of streams does not necessarily translate into increased performance. There is a threshold value beyond which prefetching more streams can be counterproductive.

2 streams:

do i=1,n
  s=s+x(i)*y(i)
end do
dotp=s

4 streams:

do i=1,n/2
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/2)*y(i+n/2)
end do
s0=s0+x(i)*y(i)
dotp=s0+s1

6 streams:

do i=1,n/3
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/3)*y(i+n/3)
  s2=s2+x(i+2*n/3)*y(i+2*n/3)
end do
do i=3*(n/3)+1,n
  s0=s0+x(i)*y(i)
end do
dotp=s0+s1+s2

Make sure to fit the problem size to memory (24GiB/node) as there is no virtual memory available for swap.

  1. Always minimize stride length. In the best case, stride length 1 is optimal for most systems, and in particular for vector systems. If that is not possible, then low-stride access should be the goal; it increases cache efficiency and sets up hardware and software prefetching. Stride lengths that are powers of two are typically the worst-case scenario, leading to cache misses.
  2. Another approach is data reuse in cache through cache blocking. The idea is to load chunks of the data so that they fit maximally in the different levels of cache while in use. Otherwise, the data has to be loaded into cache from memory every time it is needed, because it is no longer in cache. This phenomenon is commonly known as a cache miss. It is costly from a computational standpoint, since the latency for loading data from memory is a few orders of magnitude higher than from cache. The goal is to keep as much of the data in cache as possible while it is in use and to minimize loading it from memory.

    This concept is illustrated in the following matrix-matrix multiply example where the indices for the i, j, k loops are set up in such a way so as to fit the greatest possible sizes of the different submatrices in cache while the computation is ongoing.

    Example: Matrix multiplication

    Real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb ! nb is the blocking factor
      do jj=1,n,nb
        do kk=1,n,nb
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(i,k)*b(k,j)
              end do
            end do
          end do
        end do
      end do
    end do
  3. Another standard issue is the dimension of arrays when they are stored and it is always best to avoid leading dimensions that are a multiple of a high power of two. More particularly, users should be aware of the cache line and associativity. Performance degrades when the stride is a multiple of the cache line size.

    Example: Consider an L1 cache that is 16 KB in size and 4-way set associative, with a cache line of 64 bytes.

    Problem: A 16 KB 4-way set-associative cache has 4 ways of 4 KB each. With 64-byte cache lines, there are 64 cache lines per way (256 in total). In the code below, the leading dimension of 1024 gives a stride of 8 KB between successive a(1,i) accesses, so every access maps to the same set. This effectively reduces the L1 from 256 cache lines to only 4, i.e., a 256-byte cache instead of the original 16 KB, due to the non-optimal choice of leading dimension.

    Real*8 :: a(1024,50)
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)
    end do

    Solution: Change leading dimension to 1028 (1024 + 1/2 cache line)

  4. Encourage Data Prefetching to Hide Memory Latency
  5. Work within available physical memory

Floating-Point Tuning

Unroll Inner Loops to Hide FP Latency

The following dot-product example illustrates two points. If the inner loop trip count is small, the loop overhead makes it worthwhile to unroll the inner loop. In addition, unrolling inner loops hides floating-point latency. A more advanced measure at the micro level is the ratio of floating-point operations (specifically floating multiply-adds) to data accesses in a compute step; the higher this ratio, the better.

...
do i=1,n,k
  s1 = s1 + x(i)*y(i)
  s2 = s2 + x(i+1)*y(i+1)
  s3 = s3 + x(i+2)*y(i+2)
  s4 = s4 + x(i+3)*y(i+3)
  ...
  sk = sk + x(i+k-1)*y(i+k-1)
end do
...
dotp = s1 + s2 + s3 + s4 + ... + sk

Avoid Divide Operations

The following example illustrates a very common step, since a floating point divide is more expensive than a multiply. If the divide step is inside a loop, it is better to substitute that step by a multiply outside of the loop, provided no dependencies exist. Another alternative is to replace the loop by optimized vector intrinsics library, if available.

a=...
do i=1,n
  x(i)=x(i)/a
end do
   becomes   
a=...
ainv=1.0/a
do i=1,n
  x(i)=x(i)*ainv
end do

I/O Subsystem Tuning

A common-sense approach is to use what the vendor provides, i.e. take advantage of the hardware. On Linux clusters, for example, this can mean using a parallel file system such as PVFS; on IBM systems it means using IBM's General Parallel File System (GPFS).

Another sensible approach to optimizing I/O is to be aware of where the file systems live, i.e. whether a file system is mounted locally or accessed over the network. The former is much faster than the latter because of network bandwidth limits, disk speed, and the overhead of accessing the file system over the network; local I/O should always be the goal at the design level.

The other approaches include considering the best software options available. Some of them are enumerated below:

  1. Read or write as much data as possible with a single READ/WRITE/PRINT. Avoid performing multiple writes of small records.
  2. Use binary instead of ASCII format because of the overhead incurred converting from the internal representation of real numbers to character strings. In addition, ASCII files are larger than the corresponding binary files.
  3. In Fortran, prefer direct access to sequential access. Direct or random access files do not have record length indicators at the beginning and end of each record.
  4. If available, use asynchronous I/O to overlap reads/writes with computation.

Fortran90 Performance Pitfalls

Several coding issues impact the performance of Fortran90 applications. For example, consider the two cases of using different F90 Array syntax for the two dimensional arrays below:

Case 1:

do j = js,je
  do k = ks,ke
    do i = is,ie
      rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
    enddo
  enddo
enddo

Case 2:

rt(is:ie,ks:ke,js:je)=rt(is:ie,ks:ke,js:je) - &
smdiv * (rt(is:ie,ks:ke,js:je) - rtold(is:ie,ks:ke,js:je))

Although the array syntax in Case 2 is more elegant, on cache-based systems it incurs a significant performance penalty relative to the explicit loops; vector systems, by contrast, tend to prefer the array syntax. More importantly, the array syntax generates larger temporary arrays on the program stack.

The way the arrays are declared also impacts performance. The following example shows two ways of declaring dummy array arguments in F90; in the second case the compile time is roughly 70 times longer and the run time about 30% higher.

Case 1:

REAL, DIMENSION( ims:ime , kms:kme , jms:jme ) :: r, rt, rw, rtold
	
Results in F77-style assumed-size arrays
Compile time: 46 seconds
Run time: .064 seconds / call

Case 2:

REAL, DIMENSION( ims: , kms: , jms: ) :: r, rt, rw, rtold

Results in F90-style assumed-shape arrays
Compile time: 3120 seconds!!
Run time: .083 seconds / call

Another issue arises when an F90 assumed-shape array is passed as an argument to a subroutine: the compiler may pass the subroutine a copy of the array rather than the address of the array itself. This F90 copy-in/copy-out overhead is not only inefficient, it may cause errors when calling external libraries.

Running Applications

SGE Batch Environment

Batch facilities such as LoadLeveler, LSF, and SGE differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (hold, delete, resource request modification). The following paragraphs list the basic batch operations and their options, explain how to use the SGE batch environment, and describe the queue structure.

In addition to the environment variables inherited by the job from the interactive login environment, SGE sets additional variables in every batch session. The following table lists some of the important SGE variables:

SGE Batch Environment Variables

Table 5.1. SGE Batch Environment Variables

Environment Variable Contains
JOB_ID batch job id
TASK_ID task id of an array job sub-task
JOB_NAME name user assigned to the job

Lonestar Queue Structure

The Lonestar production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in the table below. Queues that don't appear in the table (such as systest, sysdebug, and clean) are non-production queues for system and HPC group testing and special support.

Queue Limits: Users are restricted to a maximum of 50 jobs total over all queues plus a special limit of 5 jobs at a time in the development queue.

Table 5.2 SGE Batch Environment Queues

Queue Name Max Runtime Max Procs SU Charge Rate Purpose
normal 24 hrs 4104 1 normal priority
development 1 hr 264 1 development
largemem 24 hrs 48 4 * large memory jobs
gpu 24 hrs 48 2 * GPGPU jobs
vis 24 hrs 48 2 * visualization jobs
serial 12 hrs 12 1 serial/shared memory

* Please note: The large memory nodes have close to 1TB of memory each. Jobs submitted to the largemem queue are charged 4x the usual rate; jobs submitted to the gpu and vis queues are charged 2x the normal rate.

The latest queue information can be determined using the following commands:

Table 5.3 SGE queue information commands

Command Comment
qconf -sql Lists the available queues.
qconf -sq <queue_name> The s_rt and h_rt values are the soft and hard wall-clock limits for the queue.
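For example, to list the queues and check the wall-clock limits of the normal queue:

login1$ qconf -sql                    # list the available queues
login1$ qconf -sq normal | grep _rt   # show the s_rt and h_rt limits for the normal queue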

Job control

SGE provides the qsub command for submitting batch jobs:

login1$ qsub job_script

where job_script is the name of a Linux format text file containing job script commands. This file should contain both shell commands and special statements that include qsub options and resource specifications. Some of the most common options are described below in Table 5.4. Details on using these options and examples of job scripts follow. The qsub command has too many options to list here. Please consult the man page, "man qsub" for complete documentation.

Table 5.4 List of Common qsub Options

Option Argument Function
-q <queue_name> Submits to queue designated by <queue_name>.
-pe <TpN>way <NoN x 12> Executes the job using the specified number of tasks (cores to use) per node (wayness and the number of nodes times 12). See example script below.
-N <job_name> Names the job <job_name>
-M <email_address> Specify the email address to use for notifications.
-m {b|e|a|s|n} Specify when notifications are sent: when the job b)egins, e)nds, a)borts, is s)uspended, or n)o mail
-V   Use current environment setting in batch job.
-cwd   Use current directory as the job's working directory.
-j y Join stderr output with the file specified by the -o option. (Do not use in conjunction with the -e option.)
-o <output_file> Direct job output to <output_file>
-e <error_file> Direct job error to <error_file>. (Do not use in conjunction with the -j option.)
-A <project_account_name> Charges run to <project_account_name>. Used only for multi-project logins. Account names and reports are displayed at login.
-l <resource>=<value> Specify resource limits
‑hold_jid <job_id_list> Job dependency; wait for all jobs in the comma-separated <job_id_list> to complete
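For example, to make one job wait for another to finish using -hold_jid (the job id and script names below are placeholders):

login1$ qsub preprocess.job
Your job 123456 ("preprocess") has been submitted
login1$ qsub -hold_jid 123456 postprocess.job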

Options can be passed to qsub on the command line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used qsub commands in a script file that will be reused several times rather than retyping the qsub commands at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script.

Batch scripts contain two types of statements: special comments and shell commands. Special comment lines begin with #$ and are followed by qsub options. The SGE shell_start_mode has been set to "unix_behavior", which means the shell commands are interpreted by the shell specified on the first line after the #! sentinel; otherwise the Bourne shell (/bin/sh) is used. The SGE job script below requests an MPI job with 24 cores and 1.5 hours of run time:

#!/bin/bash  
#$ -V #Inherit the submission environment
#$ -cwd # Start job in submission directory
#$ -N myMPI # Job Name
#$ -j y # Combine stderr and stdout
#$ -o $JOB_NAME.o$JOB_ID # Name of the output file (eg. myMPI.oJobID)
#$ -pe 12way 24 # Requests 12 tasks/node, 24 cores total
#$ -q normal # Queue name normal
#$ -l h_rt=01:30:00 # Run time (hh:mm:ss) - 1.5 hours
#$ -M <email_address> # Address for email notification
#$ -m be # Email at Begin and End of job
set -x # Echo commands, use set echo with csh
ibrun ./a.out # Run the MPI executable named a.out

If you don't want stderr and stdout directed to the same file, replace the -j option line with a -e option to name a separate output file for stderr (but don't use both). By default, stdout and stderr are sent to files named <job_name>.o<job_id> and <job_name>.e<job_id>, respectively.

SGE provides several environment variables for the #$ options lines that are evaluated at submission time. The above $JOB_ID string is substituted with the job id. The job name (set with -N) is assigned to the environment variable JOB_NAME. The memory limit per task on a node is automatically adjusted to the maximum memory available to a user application (for serial and parallel codes).

Example job scripts are available online in /share/doc/sge. They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

MPI Environment for Scalable Code

The MVAPICH-2 MPI package provides a runtime environment that can be tuned for scalable code. For applications with short messages, there is a FAST_PATH option that can reduce communication costs, as well as a mechanism to Share Receive Queues. Also, there is a Hot-Spot Congestion Avoidance option for quelling communication patterns that produce hot spots in the switch. See Chapter 9, Scalable features for Large Scale Clusters and Performance Tuning and Chapter 10, MVAPICH2 Parameters of the MVAPICH2 User Guide for more information.

The SGE Parallel Environment

Each Lonestar node (of 12 cores) can be assigned to only one user at a time; hence, a complete node is dedicated to a user's job and accrues wall-clock time for 12 cores whether they are used or not. The SGE parallel environment option, pe, sets the number of MPI Tasks per Node (TpN), and the Number of Nodes (NoN). The syntax is:

-pe <TpN>way <NoN x 12>

where <TpN> is the number of Tasks per Node, and <NoN x 12> is the total number of cores requested (Number of Nodes times 12). For example:

#$ -pe 12way 48

Regardless of the value of <TpN>, the second argument is always 12 times the number of nodes that you are requesting.

Using a multiple of 12 cores per node

For pure MPI applications, the most cost-efficient choices are: 12 tasks per node (12way) and a total number of tasks that is a multiple of 12, as described above. This will ensure that each core on all the nodes is assigned one task.

Using a large number of cores

For core counts above 4,000, cache the executable in each node's memory immediately before the ibrun statement with the following command:

cache_binary $PWD ./a.out
ibrun tacc_affinity ./a.out

In this example ./a.out is the executable, $PWD is evaluated to the present working directory, and cache_binary is a perl script that caches the executable, libraries, and other files on each node. Set the $DEBUG_CACHE_BINARY environment variable to any value to get a report of the files cached by the cache_binary command.

Using fewer than 12 cores per node

When you want to use fewer than 12 MPI tasks per node, the choice of tasks per node is limited to the set {1, 2, 4, 6, 8}. When the total number of tasks you need equals (Number of Tasks per Node) x (Number of Nodes), use the following command:

#$ -pe <TpN>way <NoN x 12>

where <TpN> is a number in the set {1, 2, 4, 6, 8, 12}.
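For example, to run 4 tasks on each of 2 nodes (8 MPI tasks in total):

#$ -pe 4way 24    # 4 tasks/node; 24 cores = 2 nodes x 12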

Using an arbitrary number of cores

If the Total number of Tasks that you need is less than (Number of Tasks per Node) x (Number of Nodes), then set the $MY_NSLOTS environment variable to the Total number of Tasks <TnoT> needed. In a job script, use the following -pe option and environment variable statement:

#$ -pe <TpN>way <NoN x 12>
export MY_NSLOTS=<TnoT> # For Bourne shells

or

setenv MY_NSLOTS <TnoT> # For C shells

where <TpN> is a number in the set {1, 2, 4, 6, 8, 12}. For example, using a Bourne shell:

#$ -pe 6way 48 #Use 6 Tasks per Node, 4 Nodes requested
export MY_NSLOTS=20 #20 tasks are launched

Program Environment for Serial Programs

For serial batch executions, use the 1way program environment (#$ -pe 1way 12), don't use the ibrun command to launch the executable (just use "my_executable my_arguments"), and submit your job to the serial queue (#$ -q serial). The serial queue has a 12-hour runtime limit and allows up to 6 simultaneous runs per user; 12 nodes are reserved for the serial queue.

#$ -pe 1way 12 # 1 execution on node of 12 cores
#$ -q serial # run in serial queue
./my_executable # execute your applications (no ibrun)

Program Environment for Hybrid Programs

For hybrid jobs, specify the MPI Tasks per Node through the first -pe option (1/2/4/6/8/12way) and the Number of Nodes through the second -pe argument (the number of assigned cores = Number of Nodes x 12). Then use the $OMP_NUM_THREADS environment variable to set the number of threads per task. (Make sure that Tasks per Node x Threads per Task does not exceed 12, the number of cores on a node.) The hybrid Bourne shell example below illustrates the use of these parameters: it requests 2 tasks per node, 6 threads per task, and a total of 24 cores (2 nodes x 12 cores).

#$ -pe 2way 24 #2 tasks/node, 24 cores total (2 nodes)
...
export OMP_NUM_THREADS=6 #6 threads/task
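Putting these pieces together, a minimal hybrid job script might look like the following sketch (the executable name myHybrid is a placeholder):

#!/bin/bash
#$ -V                      # Inherit the submission environment
#$ -cwd                    # Start job in submission directory
#$ -N myHybrid             # Job name
#$ -j y                    # Combine stderr and stdout
#$ -o $JOB_NAME.o$JOB_ID   # Output file name
#$ -q normal               # Queue
#$ -pe 2way 24             # 2 MPI tasks/node, 24 cores total (2 nodes)
#$ -l h_rt=01:00:00        # Run time (hh:mm:ss) - 1 hour
export OMP_NUM_THREADS=6   # 6 OpenMP threads per MPI task
ibrun ./myHybrid           # Launch the hybrid executable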

Batch query

After job submission, users can monitor the status of their jobs with the qstat command.

Table 5.5 List of Common qstat Options

Option Result
-t Show additional information about subtasks
-r Shows resource requirements of jobs
-ext Displays extended information about jobs
-j Displays information for specified job
-f Shows full list of queue/job details
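For example (the job id below is a placeholder):

login1$ qstat              # list your jobs
login1$ qstat -j 123456    # detailed information for a single job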

The qstat command output includes a listing of jobs and the following fields for each job:

Table 5.6 Some of the fields in the qstat command output

Field Description
JOBID job id assigned to the job
USER user that owns the job
STATE current job status, including, but not limited to: w(aiting) s(uspended) r(unning) h(old) E(rror) d(eletion)

For convenience, TACC has created an additional job monitoring utility which summarizes jobs in the batch system in a manner similar to the showq utility from PBS. Execute:

login1$ showq

to summarize running, idle, and pending jobs, along with any advanced reservations scheduled within the next week. The command "showq -u" will show jobs associated with your userid only (issue showq --help to obtain more information on available options).

Job control

Control of job behavior takes many forms:

Job modification while in the pending/run state

Users can reset the qsub options of a pending job with the qalter command, using the following syntax:

qalter [options] <job_id>

where "options" refers only to the following qsub resource options:

-l h_rt=<hh:mm:ss>   per-job wall-clock time
-o <output_file>     output file
-e <error_file>      error file
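For example, to extend the wall-clock limit of a pending job (the job id is a placeholder):

login1$ qalter -l h_rt=02:00:00 123456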

Job deletion

The qdel command is used to remove pending and running jobs from the queue. The following table explains the different qdel invocations:

login1$ qdel <job_id>      Removes a pending or running job.
login1$ qdel -f <job_id>   Forces immediate dequeuing of a running job.

Use forced job deletion (the "-f" option) only if all else fails, and immediately report forcibly deleted job IDs to TACC staff through the portal consulting system or the XSEDE HelpDesk, as forced deletion may leave hung processes that can interfere with the next job.

Job suspension/resumption

The qhold command allows users to prevent jobs from running. This command may be used to hold serial or parallel jobs and can be invoked by the job owner. A user cannot resume a job that was suspended by a system administrator, nor suspend or resume a job owned by another user.

Jobs that have been placed on hold by qhold can be resumed by using the qalter -h U ${JOB_ID} command. (Note the character spacing is mandatory.)
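For example (the job id is a placeholder):

login1$ qhold 123456            # place a user hold on the job
login1$ qalter -h U 123456      # release the hold so the job can run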

Launching MPI Applications with ibrun

For all codes compiled with any MPI library, use the ibrun command to launch the executable within the job script. The syntax is:

login1$ ibrun ./my_exec <code_options>
login1$ ibrun ./a.out 2000

The ibrun command supports options for advanced host selection. A subset of the processors from the list of all hosts can be selected to run an executable. An offset must be applied. This offset can also be used to run two different executables on two different subsets, simultaneously. The option syntax is:

Advanced Syntax

login1$ ibrun -n <# of cores> -o <hostlist offset> my_exec <code_options>

For the following advanced example, 48 cores were requested in the job script.

login1$ ibrun -n 24 -o 0 ./a.out &
login1$ ibrun -n 24 -o 24 ./a.out &
login1$ wait

The first call launches a 24-core run on the first 24 hosts in the hostfile, while the second call launches a 24-core run on the second 24 hosts in the hostfile, concurrently (by terminating each command with &). The wait command (required) waits for all processes to finish before the shell continues. The wait command works in all shells. Note that the -n and -o options must be used together.

Controlling Process Affinity and Memory Locality

While many applications will run optimally using 12 MPI tasks on each node (a single task on each core), certain applications will run more efficiently with fewer than 12 cores per node and/or with a non-default memory policy. In these cases, consider using the techniques described in this section for a specific arrangement of tasks (process affinity) and memory allocation (memory policy). In HPC batch systems an MPI task is synonymous with a process, and this section uses the two terms interchangeably. The wayness of a job (specified by the -pe option) determines the number of tasks, or processes, launched on a node. For example, the SGE batch option "#$ -pe 12way 120" will launch 12 tasks on each of the nodes assembled from a set of 120 cores (10 nodes).

Since each MPI task is launched on a node as a separate process, the process affinity and memory policy can be specified for each task as it is launched with the numactl wrapper command. When a task is launched it can be bound to a socket or a specific core; likewise, its memory allocation can be bound to any socket. The assignment of tasks to sockets (and cores) and memory to socket memories are specified as options of the numactl wrapper command. There are two forms that can be used for numa control when launching a batch executable (a.out). They are:

login1$ ibrun numactl options ./a.out #numactl command; options apply to all a.out's
login1$ ibrun my_affinity ./a.out #my_affinity shell script contains numactl ./a.out

In the first command, ibrun executes "numactl options a.out" for all tasks (a.out's) using the same options. Because the rank of each a.out launch is not known to the job script, it is impossible on this command line to tailor the numactl options to individual tasks. In the second command, ibrun launches an executable Linux script (here named my_affinity) for each task on a node and provides the task's rank and the job's wayness to the script in environment variables. The script can therefore use these two variables to derive per-task numactl options for its own "numactl options a.out" command. Additional details on numactl are given in the numactl man page and help output:

login1$ man numactl
login1$ numactl --help

Numactl Command and Options

The table below lists important options for assigning processes and memory allocation to sockets, and for assigning processes to specific cores. The "no fallback" condition means a process will abort if no more allocation space is available on the specified socket(s).

Table 5.7 numactl commands

Control Type Command Option Arguments Description
Socket Affinity numactl -N {0,1} Only execute process on cores of this (these) socket(s).
Memory Policy numactl -l none Allocate only on socket where process runs; fallback to another if full.
Memory Policy numactl -i {0,1} Strictly allocate round robin (interleave) on these (comma separated list) sockets; no fallback.
Memory Policy numactl --preferred= {0,1} select only one Allocate on this (only one) socket; fallback to another if full.
Memory Policy numactl -m {0,1} Strictly allocate on this (these, comma separated list) socket(s); no fallback.
Core Affinity numactl --physcpubind {0,1,2,3, 4,5,6,7, 8,9,10,11} Only execute process on this (these, comma separated list) core(s). Cores 0,2,4,6,8,10 are on socket 0 and cores 1,3,5,7,9,11 are on socket 1

In threaded applications the same numactl command can be used, but it applies globally to ALL threads. In an executable launched with the numactl command, any forked process or thread inherits the process affinity and memory policy of the parent. Hence, in OpenMP codes, all threads at a parallel region take on the affinity and policy of the master process (thread). Even though multiple threads are forked (or spawned) at run time from a single process, you can still control (migrate) each thread after it is spawned through a numa API (application program interface). The basic underlying API utilities for binding processes (MPI tasks) and threads (OpenMP/Pthreads threads) are sched_get/setaffinity and the numalib memory functions, respectively.

There are three levels of numa control that can be used when submitting batch jobs. These are:

  1. Global (usually with 12 tasks)
  2. Socket level (usually with 2, 6, 8, and 12 tasks)
  3. Core level (usually with an odd number of tasks)

Launch commands will be illustrated for each. Numactl control can be directed either at a global level on the ibrun command line (usually for memory allocation only), or within a script (which accesses MPI variables) to specify the socket (-N) or core (--physcpubind) mapping for the MPI ranks assigned to the execution. (Note: the numactl man page refers to sockets as "nodes". In HPC systems a node is a blade; to avoid confusion, we will only use "node" when referring to a blade.)

The default memory policy is local allocation; that is, allocation preferentially occurs on the socket where the process is executing. Note that (physical) memory allocation occurs when variables/arrays are first touched (assigned a value) in a program, not when the memory is malloc'd (C) or allocated (F90). Task assignments are set by numa control within MPI (and need not concern 12way MPI and 1way pure OMP runs); nevertheless, some applications can benefit from their own numa control. The syntax and script templates for the three levels of control are presented below.

Numa Control in Batch Scripts

Global control

Global control will affect every execution. This is often only used to control the layout of memory for 12-way MPI executions and SMP executions of 12 threads on a node. Since allocated I/O buffers may remain on a node from a previous job, TACC provides a script, tacc_affinity, to enforce a strict local memory allocation to the socket (-m memory policy), thereby removing I/O buffers (the default local memory policy does not evict the I/O memory allocation, but simply overflows to another socket's memory and incurs a higher overhead for memory access). The tacc_affinity script also distributes 2-, 4-, 6-, 8- and 12way executions evenly across sockets. Use this script as a template for implementing your own control. The examples below illustrate the syntax.

Example: Uses local memory; MPI does core affinity, # -pe 12way 12

login1$ ibrun ./a.out

Example: Interleaved; possibly use for OMP, # -pe 1way 12

login1$ ibrun numactl -i all ./a.out

Example: Assigns MPI tasks round robin to sockets, mandatory memory allocation to socket, # -pe 2/4/6/8/12way

login1$ ibrun tacc_affinity ./a.out

Socket Control

Often socket level affinity is used with hybrid applications (such as 2 tasks and 6 threads/task on a node = 2x6), and when the memory per process must be two to four times the default (~2GB/task). In this scenario, it is important to distribute 6, 4, 2 or 1 tasks per socket. In the 2x6 hybrid model, it is critically important to launch a single task per socket and control 6 threads on the socket. For the usual cases, TACC provides the tacc_affinity script, which implements the recommended behavior.

An example numa control script and related job commands are shown to illustrate how a task (rank) is mapped to a socket and how process memory policy is assigned. The my_affinity script below (based on TACC's tacc_affinity script) is executed once for each task on the node. This script captures the assigned rank (from the particular MPI stack, one of MPIRUN_RANK, PMI_RANK, etc.) and wayness from ibrun and uses them to assign a socket to the task. From the list of the nodes, ranks are assigned sequentially in block sizes determined by the wayness of the Program Environment (assigned in the PE environment variable). So, for example, in the job below, a 24 core, 2way job (#$ -pe 2way 24) will have ranks {0,1} assigned to the first node and {2,3} assigned to the next node. The export OMP_NUM_THREADS=6 specifies a hybrid code using 6 threads per task (for hybrid executions). For the given job script parameters, a task is assigned to each of the sockets 0 and 1 on each of two nodes in the my_affinity script. Also, the script works for the SGE allowed waynesses: 1way, 2way, 4way, 6way, 8way and 12way jobs.

Bourne shell based job script

...
#$ -pe 2way 24
...
export OMP_NUM_THREADS=6
ibrun numa.sh
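TACC's tacc_affinity script already implements socket-level binding for the standard waynesses. For users who want to customize it, the following is a minimal sketch of a per-task socket-binding script (not the supported tacc_affinity script); it assumes the same MPI rank variables used by the numa.sh script shown under Core Control below and the Lonestar two-socket node layout:

#!/bin/bash
# socket_affinity.sh -- illustrative sketch only
export MV2_ENABLE_AFFINITY=0                 # let numactl, not the MPI stack, set affinity
[ "x$MPIRUN_RANK" != "x" ] && myrank=$MPIRUN_RANK
[ "x$PMI_ID" != "x" ] && myrank=$PMI_ID
TPN=`echo $PE | sed 's/way//'`               # tasks per node from the SGE wayness (e.g. 2way -> 2)
localrank=$(( myrank % TPN ))                # rank of this task on its node
socket=$(( localrank % 2 ))                  # alternate local tasks between sockets 0 and 1
exec numactl -N $socket -m $socket ./a.out   # bind the task and its memory to one socket

Such a script would be launched in the job script with "ibrun socket_affinity.sh" in place of "ibrun numa.sh".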

Core Control

When there is no simple arithmetic algorithm to map the ranks, lists may be used. The template below illustrates the use of a list for mapping tasks and memory allocation onto sockets.

We use the numactl --physcpubind option for assigning tasks to cores. Each element of the task2core array holds the core assigned to a task, indexed by local rank; using localrank as the index of task2core maps the task (for the local rank) onto a core ID. Similarly, the task2socket array uses the 0-11 local rank values as indexes to provide socket IDs for the -m memory argument. For any wayness up to 12way, the set of local ranks {0,1,2,3,4,5,6,7,8,9,10,11} is mapped onto the cores {11,0,1,2,3,4,5,6,7,8,9,10} and the corresponding sockets {1 0 1 0 1 0 1 0 1 0 1 0}, respectively. For the special case of 1way, memory is allowed on both sockets with the option -m 0,1.

Task 0, which is often the root rank in MPI broadcasts, is assigned to core 11, with its memory allocated on socket 1. Such a mapping may actually degrade performance for other reasons; hence, custom mappings should be thoroughly tested for their benefits before being used on a production application. Nevertheless, the approach is quite general and provides a mechanism for list-directed control over any desired mapping.

NOTE: When the number of cores is not a multiple of 12 (e.g. 28 in this case), the user must set the environment variable $MY_NSLOTS to the number of cores within the job script, as shown below, AND the second argument of the -pe option (36 below) must equal the value of $MY_NSLOTS rounded up to the nearest multiple of 12.

Bourne shell based job script

...
#$ -pe 8way 36
...
export MY_NSLOTS=28
ibrun numa.sh

numa.sh

#!/bin/bash
# Unset any MPI Affinities
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
export VIADEV_USE_AFFINITY=0
export VIADEV_ENABLE_AFFINITY=0

# Get rank from appropriate MPI API variable
[ "x$MPIRUN_RANK" != "x" ] && myrank=$MPIRUN_RANK
[ "x$PMI_ID" != "x" ] && myrank=$PMI_ID
[ "x$OMPI_COMM_WORLD_RANK" != "x" ] && myrank=$OMPI_COMM_WORLD_RANK
[ "x$OMPI_MCA_ns_nds_vpid" != "x" ] && myrank=$OMPI_MCA_ns_nds_vpid

# TasksPerNode
TPN=`echo $PE | sed 's/way//'`
[ ! $TPN ] && echo "TPN NOT defined!" && exit 1

# local_rank = 0...wayness = Rank modulo wayness
if [[ $TPN -le 12 ]]; then
  task2socket=( 1 0 1 0 1 0 1 0 1 0 1 0 )
  task2core=( 11 0 1 2 3 4 5 6 7 8 9 10)
else
  echo "Tasks per node gt 12."; exit 1
fi
localrank=$(( $myrank % $TPN ))
# sh: 1st element is 0
socket=${task2socket[$localrank]}
core=${task2core[$localrank]}
myway=`echo $PE | sed s/way//`
[[ $myway = 1 ]] && socket=0,1 #mem on 0 and 1
exec numactl --physcpubind $core -m $socket ./a.out

In hybrid codes, a single MPI task (process) is launched and becomes the 'master thread'. It uses any numactl options specified on the launch command. When a parallel region forks the worker threads, the workers inherit the affinity and memory policy of the parent, the master thread (launch process). The usual case is for an application to launch a single task on a node without numactl process binding, but to provide a memory policy for the task (and hence all of the threads).

For instance, ibrun numactl -i all ./a.out would be used to assign interleave as the memory policy. Another hybrid scenario is to assign a single task on each socket (a 2x6 hybrid). In this case, a socket binding and some form of local memory policy should be employed, as in the example above.

In the first case, a socket could have been assigned to the single task (socket 0 is the usual default). In both cases, the memory policy could have been omitted, and that may even be the most appropriate action. Without an explicit memory policy, a local, or "first touch", mechanism is used: for any shared array, the first time an element within a block of memory (block = a page of 4096 bytes) is accessed, the page is assigned to the memory of the socket on which the accessing thread is running, regardless of the location of the master thread that performed the allocation statement. (For special cases, you can force touching at allocation time with the --touch option; but that should not be used for a shared array because it assigns all the memory to a single socket.)

Visualization on Lonestar

Lonestar is configured with a set of GPU-enhanced nodes that are available for interactive visualization. These nodes are M610 blades (see System Configuration) that are attached to a Dell C410x PCIe extension chassis that houses NVIDIA M2090 GPUs. Each host has dedicated access to two M2090s in the expansion chassis, connected through a daughter card in a gen2 x16 slot.

Interactive System Access

Interactive access to Lonestar is provided using VNC to provide a remote desktop running on a Lonestar visualization node. To initiate an interactive session on Lonestar, connect to a Lonestar login node (see System Access) and use the standard SGE batch-mode environment (see Running Applications) to run a specialized batch job that:

  • allocates a set of Lonestar visualization nodes, and
  • starts a vncserver process on the first allocated node, and
  • sets up a tunnel through the login node to the vncserver access port.

Once the vncserver process is running on the visualization node and the tunnel through the login node is created, an output message identifies the access port. The user then connects a vncviewer, running on the user's remote system, to that port to obtain a desktop.

In detail, the steps to start an interactive session are as follows:

  1. Starting A Remote Desktop: The first step is to run a batch job that allocates Lonestar visualization nodes and runs a vncserver process on the first allocated node. This is done using the standard qsub batch submission interface to submit a VNC job script to the vis queue:
    login1$ qsub -q vis [qsub options] vnc_job_script

    A prototypical VNC job script is available at /share/doc/sge/job.vnc. This script actually specifies the vis queue using internal directives, making the "-q vis" option in the above example unnecessary. The prototype script can either be qsub'd directly or copied to a private location and customized. This is particularly convenient if you would like to add your account information or change the default runtime of your job (currently limited to 24 hours). You can also change the job runtime using the qsub command line option "-l h_rt=<hours:minutes:seconds>". Note that the very lightweight twm window manager is the default window manager on the VNC desktop. The Gnome window manager is also available; to use gnome, open the file ~/.vnc/xstartup and replace "twm" with "gnome-session".

    The VNC job script starts the vncserver process and outputs a message indicating the port to connect vncviewer to. By default, the batch program output is written to $HOME/vncserver.out (based on the "-N" and "-o" directives of the job scripts or by their specification on the qsub command line). The relevant line in this output file is found near the end of the file; typically, a user would run:

    login1$ touch vncserver.out; tail -f vncserver.out

    in a separate window to watch for the message to be written.

    All standard qsub parameters are available. Of these, two are particularly relevant. The "-q queue_name" parameter enables the user to allocate GPU enhanced nodes (the gpu queue) rather than the general purpose nodes (normal, development and serial queues). The "-pe TpNway NoN" parameter determines the number of nodes to be allocated and their wayness.

  2. Creating an SSH Tunnel to Lonestar: Once the port is open on the Lonestar login node, TACC requires that the user create an ssh tunnel from the local system to the Lonestar login node. From a Unix or Linux system, this is done from a command prompt:
    localhost$ ssh -f -N -L yyyy:lonestar.tacc.utexas.edu:xxxx username@lonestar.tacc.utexas.edu

    where xxxx is the port number reported by the vncserver batch job, and yyyy is a port on your local system (generally, reusing the port number xxxx reported on the Lonestar login node is a good choice). The "-f" option puts the ssh command into the background after connecting; the "-N" option instructs ssh not to execute a remote command, only to forward ports; and the "-L" option specifies the port forwarding.

    On Windows systems, use a Windows ssh client: find the menu where tunnels can be specified, specify the local and remote ports as required, then launch the ssh connection to Lonestar.

  3. Connecting vncviewer: Once the ssh tunnel has been established, use a VNC client to connect to the local port you created, which will then be tunneled to your VNC server on Lonestar. Connect to localhost::yyyy, where yyyy is the local port you used for your tunnel. (Some VNC clients accept localhost:yyyy.)

    If you do not have a VNC client, the following are available for Windows/Linux:

    TightVNC http://www.tightvnc.com TurboVNC http://sourceforge.net/projects/turbovnc/ UltraVNC http://www.uvnc.com

    For Mac, we recommend Chicken of the VNC: http://sourceforge.net/projects/cotvnc/.

    Once the desktop has been established, two initial xterm windows are presented (which may be overlapping). One, which is white-on-black, manages the lifetime of the VNC server process. Killing this window (typically by typing "exit" or "ctrl-D" at the prompt) will cause the vncserver to terminate and the original batch job to end. Because of this, we recommend that this window not be used for other purposes; it is just too easy to accidentally kill it and terminate the session.

    The other xterm window is black-on-white and can be used to start serial programs on the node hosting the vncserver process, or parallel jobs running across the set of cores associated with the original batch job. Additional xterm windows can be created using the window manager's left-button menu.

Running Applications on the VNC Desktop

From an interactive desktop, applications can be run from icons or from xterm command prompts. Before launching your first VNC job, set up your environment as follows:

login1$ module load tightvnc
login1$ vncpasswd

Running Parallel Applications from the Desktop

Parallel applications are run on the desktop using the same ibrun wrapper used in Lonestar batch job scripts (see Running Applications). The ibrun command:

login1$ ibrun [ibrun options] application [application options]

will run application on the associated nodes, as modified by the ibrun options.

Running OpenGL/X Applications On The Desktop

Running OpenGL/X applications on Lonestar visualization nodes requires that the native X server be running on each participating visualization node. Like other TACC visualization servers, on Lonestar the X servers are started automatically on each node (this happens for all jobs submitted to the vis queue). Because each visualization node has two GPUs, two separate X displays (:0.0 and :0.1) are created on each X server, each using one of the GPUs for rendering.

Once native X servers are running, several scripts are provided to enable rendering in different scenarios.

  • vglrun: Because VNC does not support OpenGL applications, VirtualGL is used to intercept OpenGL/X commands issued by application code and redirect them to a local native X display for rendering. Rendered results are then automatically read back and sent to VNC as pixel buffers. To run an OpenGL/X application from a VNC desktop command prompt:
    login1$ vglrun [vglrun options] application application-args
    For more information on VirtualGL, see VirtualGL.
  • tacc_xrun: Some visualization applications present a client/server architecture, in which every process of a parallel server renders to local graphics resources, then returns rendered pixels to a separate, possibly remote client process for display. By wrapping server processes in the tacc_xrun wrapper, the $DISPLAY environment variable is manipulated to share the rendering load across the two GPUs available on each node. For example,
    login1$ ibrun tacc_xrun application application-args
    will cause the tasks to utilize both GPUs on each node, but will not render to any VNC desktop windows.
  • tacc_vglrun: Other visualization applications incorporate the final display function in the root process of the parallel application. This case is much like the one described above except for the root node, which must use vglrun to return rendered pixels to the VNC desktop. For example,
    login1$ ibrun tacc_vglrun application application-args
    will cause the tasks to utilize both GPUs for rendering, but will transfer the root process' graphics results to the VNC desktop.

Visualization Applications

Modules

Lonestar provides a set of visualization-specific modules. Visualization modules available on Lonestar include:

  • cuda: Run-time libraries for Cuda
  • cuda-sdk: Software development toolkit and examples for Cuda
  • VisIt: Access to Visit visualization application
  • paraview: Access to Paraview visualization application

Running Parallel VisIt on Lonestar

VisIt was compiled under the Intel v11 compiler and both the mvapich2 v1.4 and the openmpi v1.3 MPI stacks.

After connecting to a VNC server on Lonestar, as described above, do the following:

  • Load the VisIt module
    login1$ module load visit
  • Launch VisIt
    login1$ vglrun visit

When VisIt first loads a dataset, a dialog prompts the user to select either a serial or parallel engine. Select the parallel engine. Note that this dialog will also present options for the number of processes to start and the number of nodes to use; these options are actually ignored in favor of the options specified when the VNC server job was started.

Preparing data for Parallel Visit

In order to take advantage of parallel processing, VisIt input data must be partitioned and distributed across the cooperating processes. This requires that the input data be explicitly partitioned into independent subsets at the time it is input to VisIt. VisIt supports SILO data (see SILO), which incorporates a parallel, partitioned representation. Otherwise, VisIt supports a metadata file (with a .visit extension) that lists multiple data files of any supported format that are to be associated into a single logical dataset. In addition, VisIt supports a "brick of values" format, also using the .visit metadata file, which enables single files containing data defined on rectilinear grids to be partitioned and imported in parallel. Note that VisIt does not support VTK parallel XML formats (.pvti, .pvtu, .pvtr, .pvtp, and .pvts). For more information on importing data into VisIt, see Getting Data Into VisIt. Though this documentation refers to VisIt version 1.5.4, it appears to be the most current available.

For more information on VisIt, see http://wci.llnl.gov/codes/visit/home.html

Running Parallel ParaView on Lonestar

After connecting to a VNC server on Lonestar, as described above, do the following:

  • Set the "NO_HOSTSORT" environment variable to 1. This can be done either in the startup scripts (see Startup Scripts) or on the command line prior to prior to running Paraview (see Environment Variables).
  • Load the Python and ParaView modules, first swapping out the default mvapich2 module for environment compatibility:
    	login1$ module swap mvapich2 openmpi/1.4.3
    	login1$ module load python paraview
  • Launch ParaView:
    	login1$ vglrun paraview [paraview client options]
  • Click the "Connect" button, or select File -> Connect
  • If this is the first time you've used ParaView in parallel (or failed to save your connection configuration in your prior runs):

    1. Select "Add Server"
    2. Enter a "Name" e.g. "ibrun"
    3. Click "Configure"
    4. For "Startup Type" in the configuration dialog, select "Command" and enter the command:

    		login1$ ibrun tacc_xrun pvserver [paraview server options]
    and click "Save"
  • Select the name of your server configuration, and click "Connect"
  • You will see the parallel servers being spawned and the connection established in the ParaView Output Messages window.

Preparing data for Parallel ParaView

In order to take advantage of parallel processing, ParaView data must be partitioned and distributed across the cooperating processes. While ParaView will import unpartitioned data and then partition and distribute it, best performance (by far) is attained when the input data is explicitly partitioned into independent subsets at the time it is loaded, enabling ParaView to import data in parallel. ParaView supports SILO data (see SILO), which incorporates a parallel, partitioned representation, as well as a comprehensive set of parallel XML formats, which utilize a metadata file to associate partitions found in separate files into a single logical dataset. In addition, ParaView supports a "brick of values" format enabling single files containing data defined on rectilinear grids to be partitioned and imported in parallel. This is not done with a metadata file; rather, the file is described to ParaView using a dialog that is presented when a file with a .raw extension is imported (this importer is also among the options presented when an unrecognized file type is imported). For more information on ParaView file formats, see VTK File Formats.

For more information on ParaView see http://www.paraview.org

Tools

Using CUDA on Lonestar

NVIDIA's CUDA compiler and libraries are accessed by loading the CUDA module:

login1$ module load cuda

This puts nvcc in the $PATH and the CUDA libraries in the $LD_LIBRARY_PATH environment variables. Applications should be compiled on the Lonestar login nodes, but these must be run by submitting an SGE job to the compute nodes, both in accordance with TACC user policies and because the login nodes have no GPUs. The CUDA module should be loaded within your job script to ensure access to the proper libraries when your program runs.

Lonestar's Fermi GPUs are compute capability 2.0 devices. When compiling your code, make sure to specify this level of capability with:

nvcc -arch=compute_20 -code=sm_20
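For example, to build a kernel source file into an executable (the file names are placeholders):

login1$ module load cuda
login1$ nvcc -arch=compute_20 -code=sm_20 -o simple_kernel simple_kernel.cu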

The following demonstrates a sample CUDA job submission script:

#!/bin/bash
#$ -V
#$ -cwd
#$ -j y                      # combine the stdout and stderr streams
#$ -N simple_kernel          # specify the job name
#$ -l h_rt=1:30:00           # run time not to exceed one hour and 30 minutes
#$ -o $JOB_NAME.out$JOB_ID   # specify stdout & stderr output
#$ -q gpu                    # specify the GPU queue
#$ -pe 1way 12               # request one node (12 cores)
module load cuda
set -x
./simple_kernel

For further information on the CUDA compiler, please see: $TACC_CUDA_DIR/doc/nvcc.pdf.

For more information about using CUDA, please see: $TACC_CUDA_DIR/doc/CUDA_C_Programming_Guide.pdf.

For the complete CUDA API, please see: $TACC_CUDA_DIR/doc/CUDA_Toolkit_Reference_Manual.pdf.

Using the CUDA SDK on Lonestar

The NVIDIA CUDA SDK can be accessed by loading the CUDA SDK module:

login1$ module load cuda_SDK

This defines the environment variable $TACC_CUDASDK_DIR which can be used to access the libraries and executables in the CUDA SDK.

Using multiple GPUs in CUDA

CUDA contains functions to query the number of devices connected to each host, and to select among devices. CUDA commands are sent to the current device, which is GPU 0 by default. To query the number of available devices, use the function:

int devices;
cudaGetDeviceCount( &devices );

To set a particular device, use the function:

int device = 0;
cudaSetDevice( device );

Remember that any calls after cudaSetDevice() typically pertain only to the device that was set. Please see the CUDA C Programming Guide and Toolkit Reference Manual for more details. For a multi-GPU CUDA example, please see the code at: $TACC_CUDASDK_DIR/C/src/simpleMultiGPU.

Debugging CUDA kernels

The NVIDIA CUDA debugger, cuda-gdb, is included in the CUDA module. Applications must be debugged from within a job, using either idev or a VNC session. Please see the relevant sections for more information on idev and launching a VNC session. For more information on the CUDA debugger, see: $TACC_CUDA_DIR/doc/cuda-gdb.pdf.

Using OpenCL on Lonestar

Lonestar has the NVIDIA implementation of the OpenCL v. 1.0 standard that is included in the NVIDIA CUDA SDK. To access it, first load both the CUDA and the CUDA SDK modules:

login1$ module load cuda
login1$ module load cuda_SDK

OpenCL is contained within the $TACC_CUDA_DIR/OpenCL directory. When compiling, you should use the following include directory on the compile line:

login1$ g++ -I${TACC_CUDA_DIR}/include

The OpenCL library is in /usr/lib64 and should be found automatically by the compiler.
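A complete compile line might therefore look like the following (source and executable names are placeholders):

login1$ g++ -I${TACC_CUDA_DIR}/include -o cl_query cl_query.cpp -lOpenCL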

For more information on OpenCL, please see the OpenCL specification at: https://www.khronos.org/registry/cl/.

Using multiple GPUs in OpenCL

OpenCL contains functions to query the number of GPU devices connected to each host, and to select among devices. OpenCL commands are sent to the specified device. To query the number of available devices, use the following code:

cl_platform_id platform;
cl_device_id* devices;
cl_uint device_count;

oclGetPlatformID(&platform);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &device_count);
devices = (cl_device_id*)malloc(device_count * sizeof(cl_device_id));
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, device_count, devices, NULL);

In OpenCL, multiple devices can be a part of a single context. To create a context with all available GPUs and to create a command queue for each device, use the above code snippet to detect the GPUs, and the following to create the context and command queues:

cl_context context;
cl_command_queue* command_queues;
cl_int err;
int i;

context = clCreateContext(0, device_count, devices, NULL, NULL, &err);
command_queues = (cl_command_queue*)malloc(device_count * sizeof(cl_command_queue));
for (i = 0; i < device_count; ++i) {
  command_queues[i] = clCreateCommandQueue(context, devices[i], 0, &err);
}

For a multi-GPU OpenCL example, please see the code at: $TACC_CUDASDK_DIR/OpenCL/src/oclSimpleMultiGPU/.

Using the NVIDIA Compute Visual Profiler

The NVIDIA Compute Visual Profiler, computeprof, can be used to profile both CUDA programs and OpenCL programs that are run using the NVIDIA OpenCL implementation. Since the profiler is X based, it must be run either within a VNC session or by ssh-ing into an allocated compute node with X-forwarding enabled. The profiler executable path should be set by the CUDA module. If the computeprof executable cannot be located, define the following environment variables:

login1$ export PATH=$TACC_CUDA_DIR/computeprof/bin:$PATH
login1$ export LD_LIBRARY_PATH=$TACC_CUDA_DIR/computeprof/bin:$LD_LIBRARY_PATH

Timing Tools

Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating point/integer operations, as well as memory access, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.

The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.

Package Timers

The time command is available on most Linux systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore you might have to use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax:

login1$ /usr/bin/time -p ./a.out

The -p option specifies traditional precision output, units in seconds. See the time man page for additional information.

To use time with an MPI task, use:

login1$ /usr/bin/time -p mpirun -np 4 ./a.out

This example provides timing information only for the rank 0 task on the master node (the node that executes the job script); however, the time output labeled real is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (not load balanced).

Code Section Timers

Section timing is another popular mechanism for obtaining timing information. Use these to measure the performance of individual routines or blocks of code by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below.

Table 6.1 Code Section Timers

Routine Type Resolution (usec) OS/Compiler
times user/sys 1000 Linux/AIX/IRIX/UNICOS
getrusage wall/user/sys 1000 Linux/AIX/IRIX
gettimeofday wall clock 1 Linux/AIX/IRIX/UNICOS
rdtsc wall clock 0.1 Linux
read_real_time wall clock 0.001 AIX
system_clock wall clock system dependent Fortran90 Intrinsic
MPI_Wtime wall clock system dependent MPI Library (C and Fortran)

For general purpose or coarse-grain timings, precision is not important; therefore, the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems and hence can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and time as many loop iterations as possible to obtain a duration at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accessed in the same way:

Fortran code:

real*8, external :: x_timer
real*8 :: sec0, sec1, tseconds
...
sec0 = x_timer()
sec1 = x_timer()
tseconds = sec1-sec0

C code:

double x_timer(void);
double sec0, sec1, tseconds;
...
sec0 = x_timer();
sec1 = x_timer();
tseconds = sec1-sec0;

Standard Profilers

The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. Gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to the most time-consuming routines, the "hotspots", where optimization might be most beneficial. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code with the -qp (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below with a sample Fortran program:

Profiling Serial Executables
login1$ ifort -qp prog.f90 ;instruments code
login1$ a.out              ;produces gmon.out trace file
login1$ gprof              ;reads gmon.out (default args: a.out gmon.out), report sent to STDOUT
Profiling Parallel Executables
login1$ mpif90 -qp prog.f90           ;instruments code
login1$ setenv GMON_OUT_PREFIX gout   ;forces each task to produce a gout.<pid> file
login1$ mpirun -np <#> a.out          ;produces gmon.out trace file
login1$ gprof -s gout.*               ;combines gout files into gmon.sum
login1$ gprof a.out gmon.sum          ;reads executable (a.out) and gmon.sum

Detailed documentation is available at www.gnu.org.

Reference

The following manuals and other reference documents were used to gather information for this User Guide and may contain additional information of use.

Policies

TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by TACC Usage Policies. Failure to do so may result in suspension or cancellation of the project and associated allocation and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities.

Last updated: February 4, 2016