Lonestar 5 User Guide
Last update: January 11, 2017

Updates & Notices

Lonestar 5 Architecture

The Lonestar 5 (LS5) system is designed for academic researchers in Austin and across Texas. It will continue to serve as the primary high performance computing resource in the University of Texas Research Cyberinfrastructure (UTRC) initiative, sponsored by The University of Texas System, as well as partner institutions Texas Tech and Texas A&M. Lonestar 5 is a Cray XC40 customized by the TACC staff to provide a unique software environment designed to meet the needs of our diverse user community.

Figure 1. Lonestar 5 cabinets

Lonestar 5 provides 1252 twenty-four-core general compute nodes (just over 30,000 processor cores in total), 16 GPU nodes, and 10 large memory nodes. The system is configured with over 80 TB of memory and 5 PB of disk storage, and has a peak performance of 1.2 PF.

System Configuration

All Lonestar 5 nodes run SuSE 11 and are managed with batch services through native Slurm 15.08. Global storage areas are supported by an NFS file system ($HOME) and two Lustre parallel distributed file systems ($WORK and $SCRATCH). Inter-node communication is through an Aries network with Dragonfly topology. The TACC Ranch tape archival system is also available from Lonestar 5.

The 1252 compute nodes are housed in 7 water-cooled cabinets, with three chassis per cabinet. Each chassis contains 16 blades, and each blade consists of 4 dual-socket nodes and an Aries interconnect chip. Each node has two Intel E5-2690 v3 12-core (Haswell) processors and 64 GB of DDR4 memory. Twenty-four of the compute nodes are reserved for development and are accessible interactively for up to two hours. The 16 GPU nodes are distributed across the cabinets. Each GPU node has a single-socket E5-2680 v2 (Ivy Bridge) 10-core processor and 64 GB of memory; see the GPU Nodes section below for details.

The Aries network provides dynamic routing, enabling optimal use of the overall system bandwidth under load. See the Interconnect section below for additional information.

Login nodes

  • Dual Socket
  • Xeon CPU E5-2650 v3 (Haswell): 10 cores per socket (20 cores/node), 2.30GHz
  • 128 GB DDR4-2133
  • Hyperthreading Disabled

Compute Nodes

  • Dual Socket
  • Xeon E5-2690 v3 (Haswell) : 12 cores per socket (24 cores/node), 2.6 GHz
  • 64 GB DDR4-2133 (8 x 8GB dual rank x8 DIMMS)
  • No local disk
  • Hyperthreading Enabled - 48 threads (logical CPUs) per node

Compute nodes lack a local disk, but users have access to a 32 GB /tmp RAM disk to accelerate I/O operations. Note that any space used in /tmp reduces the total amount of memory available on the node: if 8 GB of data are written to /tmp, the maximum memory available for applications and the OS on that node becomes 64 GB - 8 GB = 56 GB.

Interconnect

Lonestar 5 uses an Aries Dragonfly interconnect. This high-performance network has three levels. The rank-1 level provides all-to-all connectivity in the backplane of each 16-blade chassis (64 nodes) and per-packet adaptive routing. The rank-2 level consists of sets of 6 chassis backplanes connected by passive copper cables, forming two-cabinet groups (384 nodes). The rank-3 level connects the seven cabinets in an all-to-all fashion with active optical links. Point-to-point network bandwidth is expected to reach 70 Gbit/s.

Figure 2. Lonestar 5 Network: Four nodes within a blade (dark green boxes) are connected to an Aries router (larger dark blue box). Nodes within a chassis (16 blades shown arranged in a row) are connected (rank-1 routes) by the chassis backplane (light green line). Each blade of a six-chassis group, 3 chassis from each of 2 racks (indicated by perforated lines), is connected (rank-2 routes, blue lines) through 5 ports of the router. Inter-group connections (rank-3 routes) are formed through fiber (orange lines) to each router. There are 7 racks of normal-queue compute nodes (racks 3, 4, and 7 not shown).

GPU Nodes

  • Single Socket
  • Xeon E5-2680 v2 (Ivy Bridge) : 10 cores, 2.8 GHz, 115W
  • 64 GB DDR3-1866 (4 x 16GB DIMMS)
  • Nvidia K40 GPU 12 GB GDDR5 (4.2 TF SP, 1.4TF DP)
  • Hyperthreading Enabled - 20 threads (logical CPUs) per node

Large Memory Nodes

Lonestar 5 provides two types of large memory nodes:

  • Eight Haswell nodes available to all users through the largemem512GB queue:

    • 512 GB RAM
    • dual socket Xeon E5-2698 v3
    • 16 cores per socket (32 cores/node), 2.3 GHz
    • Hyperthreading Enabled - 64 threads (logical CPUs) per node
  • Two Ivy Bridge nodes available to approved users through the largemem1TB queue:

    • 1TB RAM
    • quad socket Xeon E7-4860 v2
    • 12 cores per socket (48 cores/node), 2.26 GHz
    • Hyperthreading Enabled - 96 threads (logical CPUs) per node

The large memory nodes are accessible from the LS5 login nodes and mount the same three shared file systems ($HOME, $WORK, and $SCRATCH) as the rest of LS5. In many ways, however, the large memory nodes form a separate, largely independent cluster. This large memory cluster has its own Infiniband connectivity, Slurm job scheduler, queues, and software stack. The "TACC-largemem" module controls access to the large memory nodes. See the Using the Large Memory Nodes section for more information.

Accessing Lonestar 5

The standard way to access Lonestar 5 (ls5.tacc.utexas.edu) and other TACC resources from your local machine is to use an SSH (Secure Shell) client. Please visit the Wikipedia SSH page for more background. SSH clients must support the SSH-2 protocol. Mac users may use the built-in Terminal application. Windows users may choose from the many lightweight, free SSH clients available for download.

Users must connect to Lonestar 5 using the Secure Shell "ssh" command to ensure a secure login session. Use the secure shell commands "scp" and "sftp", and/or standard "rsync" command to transfer files. Initiate an ssh connection to a Lonestar 5 login node from your local system:

localhost$ ssh taccuserid@ls5.tacc.utexas.edu

Login passwords can be changed in the TACC User Portal (TUP). Select "Change Password" under the "HOME" tab after you log in. If you've forgotten your password, go to the TUP home page and select the "? Forgot Password" button in the Sign In area. To report a problem, run the ssh command with the "-vvv" option and include the verbose output when submitting a help ticket.

Do not run the optional ssh-keygen command to set up Public-key authentication. This command sets up a passphrase that will interfere with the specially configured .ssh directory that makes it possible for you to execute jobs on the compute nodes. If you have already done this, remove the ".ssh" directory (and the files under it) from your home directory. Log out and log back in to regenerate the keys.

Building your Applications

This section discusses the steps necessary to compile and/or re-build your applications.

Compiling your Applications

The default programming environment is based on the Intel compiler and the Cray Message Passing Toolkit, sometimes referred to as Cray MPICH. For compiling MPI codes, the familiar commands "mpicc", "mpicxx", "mpif90" and "mpif77" are available. Also, the compilers "icc", "icpc", and "ifort" are directly accessible.

To access the most recent versions of GCC, load one of the gcc modules.

                Serial                  MPI Parallel
Language        Intel      GNU GCC      Intel      GNU GCC
C               icc        gcc          mpicc      mpicc
C++             icpc       g++          mpicxx     mpicxx
Fortran         ifort      gfortran     mpif90     mpif90

Compiling OpenMP Applications

For pure OpenMP jobs, or other jobs that use threads but not MPI, specify a single node using the "-N" option and a single task using the "-n" option. Then, set the $OMP_NUM_THREADS environment variable to the desired number of threads. As usual with threaded jobs you may also wish to set an affinity using the $KMP_AFFINITY variable. See Using Intel's KMP_AFFINITY section.
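
A minimal sketch of this workflow; the source file name, executable name, and thread count are placeholders:

login1$ icc -qopenmp omp_code.c -o omp_code

Then, in the job script (or an idev session on a compute node):

#SBATCH -N 1                # single node
#SBATCH -n 1                # single task
...
export OMP_NUM_THREADS=24   # number of OpenMP threads for this run
./omp_code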

Linking your Applications

Some of the more useful load flags/options for the host environment are listed below. For a more comprehensive list, consult the ld man page.

  • Use the "-l" loader option to link in a library at load time. This links in either the shared library "lib``name``.so" (default) or the static library "lib``name``.a", provided the library can be found in ldd's library search path or the $LD_LIBRARY_PATH environment variable paths.

    login1$ ifort prog.f90 -lname
  • To explicitly include a library directory, use the "-L" option:

    login1$ ifort prog.f -L/mydirectory/lib -lname

    In the above example, the user's libname.a library is not in the default search path, so the "-L" option is specified to point to the directory containing libname.a. (Only the library name is supplied in the "-l" argument; remove the "lib" prefix and the ".a" suffix.)

Many of the modules for applications and libraries, such as the hdf5 library module, provide environment variables for compiling and linking commands. Execute the "module help module_name" command for a description, listing, and use cases of the assigned environment variables. The following example illustrates their use for the hdf5 library:

login1$ icc -I$TACC_HDF5_INC hdf5_test.c -o hdf5_test \
    -Wl,-rpath,$TACC_HDF5_LIB -L$TACC_HDF5_LIB -lhdf5 -lz

Here, the module-supplied environment variables $TACC_HDF5_LIB and $TACC_HDF5_INC contain the hdf5 library and header directory paths, respectively. The loader option "-Wl,-rpath" embeds the $TACC_HDF5_LIB directory in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from $LD_LIBRARY_PATH or the dynamic loader's cache of bindings between shared libraries and directory paths. This avoids having to set $LD_LIBRARY_PATH (manually or through a module command) before running the executable. (This simple load sequence will work for some of the sequential MKL functions; see the MKL section for using various packages within MKL.)

You can view the full path of the dynamic libraries inserted into your binary with the ldd command. The example below shows a partial listing for the h5stat binary:

login1$ ldd h5stat
...  
libhdf5.so.10 => /opt/apps/intel16/hdf5/1.8.16/x86_64/lib/libhdf5.so.10 (0x00002b594a8b0000)
libsz.so.2 => /opt/apps/intel16/hdf5/1.8.16/x86_64/lib/libsz.so.2 (0x00002b594b0b9000)
...
libz.so.1 => /lib64/libz.so.1 (0x00002b594b2d6000)

Intel Math Kernel Library (MKL)

The Intel Math Kernel Library (MKL) is a collection of highly optimized functions implementing some of the most important mathematical kernels used in computational science, including standardized interfaces to:

  • BLAS (Basic Linear Algebra Subroutines), a collection of low-level matrix and vector operations like matrix-matrix multiplication
  • LAPACK (Linear Algebra PACKage), which includes higher-level linear algebra algorithms like Gaussian Elimination
  • FFT (Fast Fourier Transform), including interfaces based on FFTW (Fastest Fourier Transform in the West)
  • ScaLAPACK (Scalable LAPACK), BLACS (Basic Linear Algebra Communication Subprograms), Cluster FFT, and other functionality that provide block-based distributed memory (multi-node) versions of selected LAPACK, BLAS, and FFT algorithms;
  • Vector Mathematics (VM) functions that implement highly optimized and vectorized versions of special functions like sine and square root.

MKL with Intel C, C++, and Fortran Compilers

There is no mkl module for the Intel compilers because you don't need one: the Intel compilers have built-in support for MKL. Unless you have specialized needs, there is no reason to specify include paths and libraries explicitly. Instead, using MKL with the Intel modules requires nothing more than compiling and linking with the "-mkl" option, e.g.:

login1$ icc -mkl mycode.c
login1$ ifort -mkl mycode.f90

The "‑mkl" switch is an abbreviated form of "‑mkl=parallel", which links your code to the threaded version of MKL. To link to the unthreaded version, use "‑mkl=sequential". A third option, "‑mkl=cluster", which also links to the unthreaded libraries, is necessary and appropriate only when using ScaLAPACK or other distributed memory packages. For additional information, including advanced linking options, see the MKL documentation and Intel MKL Link Line Advisor.

MKL with GNU C, C++, and Fortran Compilers

When using a GNU compiler, load the mkl module before compiling or running your code, then explicitly specify the MKL libraries, library paths, and include paths your application needs. Consult the Intel MKL Link Line Advisor for details. A typical compile/link process on a TACC system will look like this:

login1$ module load gcc
login1$ module load mkl   # available/needed only for GNU compilers
login1$ gcc -fopenmp -I$MKLROOT/include mycode.c \
        -L${MKLROOT}/lib/intel64          \
        -lmkl_intel_lp64 -lmkl_gnu_thread \
        -lmkl_core -lpthread -lm -ldl

For your convenience the mkl module file also provides alternative TACC-defined variables like $TACC_MKL_INCLUDE (equivalent to $MKLROOT/include). Execute "module help mkl" for more information.

Using MKL as BLAS/LAPACK with Third-Party Software

When your third-party software requires BLAS or LAPACK, you can use MKL to supply this functionality. Replace generic instructions that include link options like "‑lblas" or "‑llapack" with the simpler MKL approach described above. There is no need to download and install alternatives like OpenBLAS.

Using MKL as BLAS/LAPACK with TACC's MATLAB, Python, and R Modules

TACC's MATLAB, Python, and R modules all use threaded (parallel) MKL as their underlying BLAS/LAPACK library. This means that even serial codes written in MATLAB, Python, or R may benefit from MKL's thread-based parallelism. This requires no action on your part other than specifying an appropriate max thread count for MKL; see the section below for more information.

Controlling Threading in MKL

Any code that calls MKL functions can potentially benefit from MKL's thread-based parallelism; this is true even if your code is not otherwise a parallel application. If you are linking to the threaded MKL (using "‑mkl", "‑mkl=parallel", or the equivalent explicit link line), you need only specify an appropriate value for the max number of threads available to MKL. You can do this with either of the two environment variables MKL_NUM_THREADS or OMP_NUM_THREADS. The environment variable MKL_NUM_THREADS specifies the max number of threads available to each instance of MKL, and has no effect on non-MKL code. If MKL_NUM_THREADS is undefined, MKL uses OMP_NUM_THREADS to determine the max number of threads available to MKL functions. In either case, MKL will attempt to choose an optimal thread count less than or equal to the specified value. Note that OMP_NUM_THREADS defaults to 1 on TACC systems; if you use the default value you will get no thread-based parallelism from MKL.

If you are running a single serial, unthreaded application (or an unthreaded MPI code involving a single MPI task per node) it is usually best to give MKL as much flexibility as possible by setting the max thread count to the total number of hardware threads on the node (48 on the typical Haswell LS5 compute node). Of course things are more complicated if you are running more than one process on a node: e.g. multiple serial processes, threaded applications, hybrid MPI-threaded applications, or pure MPI codes running more than one MPI rank per node. See http://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications and related Intel resources for examples of how to manage threading when calling MKL from multiple processes.
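
For example, for a serial code that calls threaded MKL on a standard compute node, you might set the thread count to the node's 48 hardware threads; the executable name is a placeholder:

nid00181$ export MKL_NUM_THREADS=48
nid00181$ ./my_mkl_app      # hypothetical serial executable that calls MKL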

Good Citizenship

The Lonestar 5 system is a shared resource. Hundreds of users may be logged on to the two login nodes at one time accessing the file systems, hundreds of jobs may be running on the compute nodes, and hundreds more may be queued. All users must follow good practices that limit activities that could impact the system for other users. Lonestar 5's login nodes, as well as the three shared file systems ($HOME, $WORK, and $SCRATCH), are resources shared among all users. Good practice boils down to two rules: don't run jobs on the login nodes, and respect the shared file systems.

The two login nodes provide an interface to the "back-end" compute nodes. Think of the login nodes as a prep area, where users may edit, compile, perform file management, issue transfers, submit batch jobs etc.

The compute nodes are where actual computations occur and where research is done. All batch jobs and executables, as well as development and debugging sessions, are run on the compute nodes.

To access compute nodes on TACC resources, one must either submit a job to a batch queue or initiate an interactive session using the idev utility. You can also access a compute node via ssh if you are already running a job on that node.

Don't Run Jobs on Login Nodes

The login nodes are meant to be used exclusively for file management, editing, and compiling. DO NOT run programs on the login nodes. The login nodes are shared among all users currently logged into LS5. A single user running computationally expensive or disk-intensive tasks will negatively impact performance for other users. Running jobs on the login nodes is one of the fastest routes to account suspension. Instead, run on the compute nodes via an interactive session (idev) or by submitting a batch job.

This rule also applies to, for example, MATLAB and computationally intensive Python scripts.

DO THIS: build and submit a job

login1$ make
login1$ sbatch myjobscript

DO NOT DO THIS: invoke multiple make sessions, run an executable on a login node.

login1$ make -j 12
login1$ ./myapp.exe

DO THIS: Start an interactive session and run Matlab on a compute node

login1$ idev
nid00181$ matlab

DO NOT DO THIS: Run Matlab or other software packages on a login node

login1$ matlab

Interactive programs such as R and MATLAB must be run on the compute nodes, NOT the login nodes, requiring an interactive session. Please see the idev utility.

Respect the Shared File Systems

  • Avoid running jobs in the $HOME directory. Run jobs in $WORK or $SCRATCH such that all job output, both stdout and stderr, is directed to those filesystems.
  • Avoid too many simultaneous file transfers. Three concurrent scp sessions is probably fine. One hundred concurrent file sessions is not.
  • Limit I/O intensive sessions (lots of reads and writes to disk)
  • Be sure to stripe the directories in which you will place large files (see the example following this list).
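
A minimal striping sketch; the directory name and stripe count of 4 are illustrative, so choose values appropriate for your file sizes:

login1$ mkdir $SCRATCH/large_files
login1$ lfs setstripe -c 4 $SCRATCH/large_files
login1$ lfs getstripe $SCRATCH/large_files    # verify the stripe settings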

Computing Environment

Lonestar 5's default login shell is Bash. The csh and zsh shells are also available. Submit a support ticket to change your default login shell; the chsh command is not supported.

Lonestar 5 does not support ".profile_user", ".cshrc_user", or ".login_user" startup files. Put your aliases and other customizations directly in the standard startup files. See the default templates in your account for further instructions and examples. Unless you have specialized needs, it is generally best to leave the bash ".profile" file alone and place all customizations in the ".bashrc" file.
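
For example, a few illustrative customizations in "~/.bashrc" (the alias and path below are placeholders only):

alias cdd='cd $WORK/data'        # shortcut to a frequently used directory
export PATH=$HOME/bin:$PATH      # pick up personal scripts and tools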

Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different versions of software, TACC employs Lmod for environment management.

At login, module commands set up a basic environment for the default compilers, tools, and libraries: for example, the $PATH, $MANPATH, and $LD_LIBRARY_PATH environment variables, directory locations (e.g., $WORK, $HOME), and aliases (e.g., cdw, cdh). Therefore, there is no need for you to set or update these when system and application software is updated.

Users that require third-party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

The environment for executing each major TACC application can be set with a module command. The specifics are defined in a modulefile, which sets, unsets, appends to, or prepends to environment variables (e.g., $PATH, $LD_LIBRARY_PATH) for the specific application. Each modulefile also sets functions or aliases for use with the application. You only need to invoke a single command to configure the application/programming environment properly. The general format of this command is:

login1$ module load modulename

To look at the available modules, you can execute the following command:

login1$ module avail

Once you know the module that you want to load, you can simply use the load option. Or you can get more information about that module with the command (in this case Python):

login1$ module spider python/2.7.10

The spider command is also used to find modules. For example, to find all the hdf5 modules, type:

login1$ module spider hdf5

To look at a synopsis about using an application in the module's environment (in this case, fftw2), or to see a list of currently loaded modules, execute the following commands:

login1$ module help fftw2
login1$ module list

Managing your Files

Lonestar 5 supports multiple file transfer programs such as scp, sftp, and rsync. During production, transfer speeds between Lonestar 5 and other resources vary with I/O and network traffic.

File Systems & Quotas

Lonestar 5 mounts three file systems that are shared across all nodes: home, work, and scratch. The system also defines for you the corresponding account-level environment variables $HOME, $WORK, and $SCRATCH. Consult Table 2 for quota and purge policies on these file systems.

Several aliases are provided for users to move easily between file systems:

  • Use the "cdh" or "cd" commands to change to $HOME
  • Use "cdw" to change to $WORK
  • Use the "cds" command to change to $SCRATCH

The $WORK file system mounted on Lonestar 5 is the Global Shared File System hosted on the Stockyard system. It is the same file system that is available on Stampede, Maverick, Wrangler, and other TACC resources. The $STOCKYARD environment variable points to a directory on the file system that is associated with your account; this variable has the same definition on all TACC systems. The $WORK environment variable on Lonestar 5 points to the lonestar subdirectory, a convenient location for activity on Lonestar 5; the value of the $WORK environment variable will vary from system to system. Your quota and reported usage on this file system is the sum of all files stored on Stockyard regardless of their actual location on the work file system.
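
For example (the account number and username shown are illustrative):

login1$ echo $STOCKYARD
/work/01698/username
login1$ echo $WORK
/work/01698/username/lonestar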

Scratch storage is provided by DataDirect Networks and has a raw unformatted capacity of over 5 PB. The $SCRATCH file system comprises:

  • 168 Object Storage Targets
  • 1 MetaData Target
  • 5.472 PB raw storage

TACC's Corral storage system is available as a mount point on Lonestar 5's compute and login nodes.

Table 2. File System Quotas & Purge Policies

File System    Quota                              Purge Policy
$HOME          5 GB                               none
$WORK          1 TB (across all TACC systems)     none
$SCRATCH       no quota                           Files may be purged periodically at the discretion of TACC staff.

Sharing Files

Users often wish to collaborate with fellow project members by sharing files and data with each other. Project managers or delegates can create shared workspaces, areas that are private and accessible only to other project members, using UNIX group permissions and commands. Shared workspaces may be created as read-only or read-write, functioning as data repositories and providing a common work area to all project members. Please see Sharing Project Files on TACC Systems for step-by-step instructions.

scp

Use the Secure Shell scp utility to transfer data from any Linux system to and from the login node. A file can be copied from your local system to the remote server by using the command:

localhost% scp filename \
    username@ls5.tacc.utexas.edu:/path/to/project/directory

Consult the man pages for more information on scp:

login1$ man scp

rsync

The rsync command is another way to keep your data up to date. In contrast to scp, rsync transfers only the actual changed parts of a file (instead of transferring an entire file). Hence, this selective method of data transfer can be much more efficient than scp. The following example demonstrates usage of the rsync command for transferring a file named "myfile.c" from the current location on Lonestar 5 to Stampede's $WORK directory.

login1$ rsync myfile.c \  
    username@stampede.tacc.utexas.edu:/work/01698/username/data

An entire directory can be transferred from source to destination by using rsync as well. For directory transfers the options "-avtr" will transfer the files recursively ("-r" option) along with the modification times ("-t" option) and in the archive mode ("-a" option) to preserve symbolic links, devices, attributes, permissions, ownerships, etc. The "-v" option (verbose) increases the amount of information displayed during any transfer. The following example demonstrates the usage of the "-avtr" options for transferring a directory named "gauss" from a local machine to a directory named "data" in the $WORK file system on Lonestar 5.

login1$ rsync -avtr ./gauss \  
    username@ls5.tacc.utexas.edu:/work/01698/username/data

When executing multiple instantiations of scp or rsync, please limit your transfers to no more than 2-3 processes at a time.

For more rsync options and command details, consult the man page or help options

login1$ man rsync
login1$ rsync -h

Running Jobs on Lonestar 5

This section provides an overview of how compute jobs are charged to allocations, describes the Simple Linux Utility for Resource Management (Slurm) batch environment and the Lonestar 5 queue structure, and lists basic Slurm job control and monitoring commands and their options.

Job Accounting

Lonestar 4 users have been accustomed to requesting allocations and seeing their usage reported in service units (SUs), where an SU is defined as a wallclock core hour. Lonestar 5 will measure usage and calculate SUs differently. On Lonestar 5 an SU will be defined as a wallclock node hour - the use of one node (and all its cores) for one hour of wallclock time plus any additional charges for the use of specialized queues, e.g. largemem, gpu.

Lonestar 5 SUs billed (Node hours) = # nodes * wallclock time * queue multiplier
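
For example, a job that uses 4 nodes for 10 hours in the normal queue (multiplier 1) is charged 4 * 10 * 1 = 40 SUs, while a one-node, 10-hour job in the largemem1TB queue (multiplier 5) is charged 1 * 10 * 5 = 50 SUs.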

Table 3a and Table 3b below list production queues and their multipliers.

Measuring usage in Node Hours vs Core Hours results in significantly lower SU values as the "cores per node" scaling factor is not included. Users will therefore see a lower "Available SUs" balance when logging into Lonestar 5 than they have in the past. In the future, users must submit allocation requests, renewals and increases for Lonestar 5 using the Node Hours metric.

Users can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.

Interactive vs Batch Jobs

Once logged into Lonestar 5 users are automatically placed on one of two "front-end" login nodes. To determine what type of node you're on, simply issue the "hostname" command. Lonestar 5's login nodes will be labeled login[1-2].ls5.tacc.utexas.edu. The compute nodes will be labeled something like nid00181.

The two login nodes provide an interface to the "back-end" compute nodes. Think of the login nodes as a prep area, where users may edit, compile, perform file management, issue transfers, submit batch jobs etc.

The compute nodes are where actual computations occur and where research is done. All batch jobs and executables, as well as development and debugging sessions, are run on the compute nodes.

To run jobs and access compute nodes on TACC resources, one must either submit a job to a batch queue or initiate an interactive session using the idev utility. You can also access a compute node via ssh if you are already running a job on that node.

Slurm Scheduler

Schedulers such as LoadLeveler, SGE, and Slurm differ in their user interfaces as well as in their implementations of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (cancel, resource request modification, etc.). The scheduler on Lonestar 5 is Slurm. Lonestar 4 used SGE, but if you've used any of the newer systems at TACC, you will be familiar with Slurm.

Batch jobs are programs scheduled for execution on the compute nodes, to be run without human interaction. A job script (also called "batch script") contains all the commands necessary to run the program: the path to the executable, program parameters, number of nodes and tasks needed, maximum execution time, and any environment variables needed. Batch jobs are submitted to a queue and then managed by a scheduler. The scheduler manages all pending jobs in the queue and allocates exclusive access to the compute nodes for a particular job. The scheduler also provides an interface allowing the user to submit, cancel, and modify jobs.

All users must wait their turn; scheduling is not first come, first served. The Slurm scheduler fits jobs in as the requested resources (nodes, runtime) become available. Do not request more resources than needed, or the job will wait longer in the queue.

Production Queues

The Lonestar 5 production queues, Standard Memory and Large Memory, and their characteristics (wall-clock and processor limits; charge factor; and purpose) are listed in Table 3a and Table 3b below. Queues that don't appear in the table (such as systest) are non-production queues for system and HPC group testing and special support.

Table 3a. Lonestar 5 Standard Memory Queues

Queue                  Max Runtime   Max Nodes (Cores) per Job   Max Jobs in Queue   Queue Multiplier   Purpose
normal                 48 hrs        171 nodes (4104 cores)      50                  1                  normal production
large (by request*)    24 hrs        342 nodes (8208 cores)      1                   1                  large runs
development            2 hrs         11 nodes (264 cores)        1                   1                  development nodes
gpu                    24 hrs        4 nodes (40 cores)          4                   1                  GPU nodes
vis                    8 hrs         4 nodes (40 cores)          4                   1                  GPU nodes + VNC service

*For access to the large queue, please submit a ticket to the TACC User Portal. Include in your request reasonable evidence of your readiness to run at scale on Lonestar 5. In most cases this should include strong or weak scaling results summarizing experiments you have run on Lonestar 5 up to the limits of the normal queue.

An important note on scheduling: hyper-threading is currently enabled on Lonestar 5. While there are 24 cores on each non-GPU standard memory node, the operating system and scheduler will report a total of 48 CPUs (hardware threads).

Table 3b. Lonestar 5 Large Memory Queues

These queues are available through the TACC-largemem module. See the "Using the Large Memory Nodes" section below for more information.

Queue                        Max Runtime   Max Nodes (Cores) per Job   Max Jobs in Queue   Queue Multiplier   Purpose
largemem512GB                48 hrs        2 nodes (64 cores)          3                   3                  large memory (512GB), 32 cores/node
largemem1TB (by request*)    48 hrs        1 node (48 cores)           2                   5                  large memory (1TB), 48 cores/node

*For access to the largemem1TB queue, please submit a ticket to the TACC User Portal that includes reasonable evidence of your need for this queue.

An important note on scheduling: hyper-threading is currently enabled on Lonestar 5. While there are 32 cores on each 512GB Haswell node, the operating system and scheduler will report a total of 64 CPUs (hardware threads). Similarly, there are 48 cores on each 1TB node, but the operating system and scheduler will report a total of 96 CPUs.

Submit a batch job with sbatch

Use Slurm's sbatch command to submit a job. Specify the resources needed for your job (e.g., number of nodes/tasks needed, job run time) in a Slurm job script. See "/share/doc/slurm" for example Slurm job submission scripts.

login1$ sbatch myjobscript

where "myjobscript" is the name of a UNIX format text file containing job script commands. This file can contain both shell commands and special statements that include #SBATCH options and resource specifications; shell commands other than the initial parser line (e.g. #!/bin/bash) must follow all #SBATCH Slurm directives. Some of the most common options are described in Table 4 below and in the example job scripts. Details are available online in man pages (e.g., execute "man sbatch" on Lonestar 5).

Options can be passed to sbatch on the command-line or specified in the job script file; we recommend, however, that you avoid using the "--export" flag because there are subtle ways in which it can interfere with the automatic propagation of your environment. As a general rule it is safer and easier to store commonly used #SBATCH directives in a submission script that will be reused several times rather than retyping the options at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script. All batch submissions MUST specify a time limit, number of nodes, and total tasks. Jobs that do not use the -t (time), -N (nodes) and -n (total tasks) options will be rejected.

Batch scripts contain two types of statements: scheduler directives and shell commands in that order. Scheduler directive lines begin with #SBATCH and are followed with sbatch options. Slurm stops interpreting #SBATCH directives after the first appearance of a shell command (blank lines and comment lines are okay). The UNIX shell commands are interpreted by the shell specified on the first line after the #! sentinel; otherwise the Bash shell (/bin/bash) is used. By default, a job begins execution in the directory of submission with the local (submission) environment.

If you don't want stderr and stdout directed to the same file, use both "-e" and "-o" options to designate separate output files. By default, stderr and stdout are sent to a file named "slurm-%j.out", where "%j" is replaced by the job ID; and with only an "-o" option, both stderr and stdout are directed to the same designated output file.

The job script below requests an MPI job with 48 cores spread over 2 nodes and 1.5 hours of run time in the development queue:

#!/bin/bash
#SBATCH -J myMPI            # job name
#SBATCH -o myMPI.o%j        # output and error file name (%j expands to jobID)
#SBATCH -N 2                # number of nodes requested
#SBATCH -n 48               # total number of mpi tasks requested
#SBATCH -p development      # queue (partition) -- normal, development, etc.
#SBATCH -t 01:30:00         # run time (hh:mm:ss) - 1.5 hours

# Slurm email notifications are now working on Lonestar 5 
#SBATCH --mail-user=username@tacc.utexas.edu
#SBATCH --mail-type=begin   # email me when the job starts
#SBATCH --mail-type=end     # email me when the job finishes

# run the executable named a.out
ibrun ./a.out               

Sample Slurm Batch Scripts

Additional example Slurm batch scripts for common cases are available in /share/doc/slurm; one additional sketch appears below.
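
As one illustration, a hybrid MPI/OpenMP job script that places 2 tasks on each of 2 nodes, with 12 threads per task, might look like the following (the queue, run time, and executable name are placeholders):

#!/bin/bash
#SBATCH -J myHybrid         # job name
#SBATCH -o myHybrid.o%j     # output and error file name (%j expands to jobID)
#SBATCH -N 2                # number of nodes requested
#SBATCH -n 4                # total number of MPI tasks (2 per node)
#SBATCH -p normal           # queue (partition)
#SBATCH -t 04:00:00         # run time (hh:mm:ss)

export OMP_NUM_THREADS=12   # OpenMP threads per MPI task
ibrun tacc_affinity ./my_hybrid_app   # hypothetical hybrid executable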

Table 4. Common sbatch Options

Option             Argument                  Function
-p                 queue_name                Submits to the queue (partition) designated by queue_name.
-J                 job_name                  Job name.
-n                 total_tasks               The job acquires enough nodes to execute total_tasks tasks (launching 48 tasks/node). Always use the -N option with the -n option; together they also allow fewer than 48 tasks/node to be specified (e.g. for hybrid codes).
-N                 nodes                     Use only in conjunction with the -n option (above), to launch fewer than 48 tasks per node. The job acquires nodes nodes, and total_tasks/nodes tasks are launched on each node.
--ntasks-per-xxx   N/A                       The ntasks-per-core/socket/node options are not available on Lonestar 5. The -N and -n options provide all the functionality needed for specifying a task layout on the nodes.
-t                 hh:mm:ss                  Wall clock time for the job. Required.
--mail-user=       email_address             Specify the email address to use for notifications.
--mail-type=       {begin, end, fail, all}   Specify when user notifications are to be sent (one option per line).
-o                 output_file               Direct job standard output to output_file (without the -e option, error output also goes to this file).
-e                 error_file                Direct job error output to error_file.
-d=                afterok:jobid             Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes. NOTE: This option works only on the command line; Slurm will not process this directive within a job script.
-A                 projectnumber             Charge the job to the specified project/allocation number. This option is necessary only for logins associated with multiple projects.
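
For instance, the options in Table 4 may also be given on the sbatch command line, where they take precedence over the corresponding #SBATCH directives in the script (the values shown are placeholders):

login1$ sbatch -p normal -N 2 -n 48 -t 02:00:00 -A myproject myjobscript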

Interactive sessions via idev

Request an interactive session on a compute node using TACC's idev utility. This is especially useful for development and debugging.

TACC's idev provides interactive access to a node and captures the resulting batch environment, which is automatically inherited by any additional terminal sessions that ssh to the node. idev is simple to use: its default resource options avoid the required options of the Slurm "srun" command for interactive access.

In the sample session below, a user requests interactive access to a single node (default) for 15 minutes (default is 30) in the development queue (idev's default) in order to debug the myprog application. idev returns a compute node login prompt:

WINDOW1 login2$ idev -m 15
WINDOW1 ...  
WINDOW1 --> Sleeping for 7 seconds...OK  
WINDOW1 ...  
WINDOW1 --> Creating interactive terminal session (login) on master node nid00181.  
WINDOW1 ...  
WINDOW1 nid00181$ vim myprog.c
WINDOW1 nid00181$ make myprog

Now the user may open another window to run the newly-compiled application, while continuing to debug in the original terminal session:

WINDOW2 login2$ ssh -Y nid00181
WINDOW2 ...  
WINDOW2 nid00181$ ibrun ./myprog
WINDOW2 ...output  
WINDOW2 nid00181$

Use the "-h" switch to see more options:

login2$ idev -h

Interactive sessions via ssh

Users may also ssh to a compute node from a login node, but only when that user's batch job is running, or the user has an active interactive session, on that node. Once the batch job or interactive session ends, the user will no longer have access to that node.

In the following example session user slindsey submits a batch job (sbatch), queries which compute nodes the job is running on (squeue), then ssh's to one of the job's compute nodes. The displayed node list, nid0[1312-1333], nid01335 is truncated for brevity. Notice that the user attempts to ssh to compute node nid01334 which is NOT in the node list. Since that node is not assigned to this user for this job, the connection is refused. When the job terminates the connection will also be closed even if the user is still logged into the node.

User submits the job described in the "myjobscript" file:

login4 [709]~> sbatch myjobscript
...
Submitted batch job 5462435 

User polls the queue, waiting till the job runs (state "R") and nodes are assigned.

login4 [710]~> squeue -u slindsey

JOBID  PARTITION  NAME   USER      ST  TIME  NODES  NODELIST
24144  normal     mympi  slindsey  R   0:39  16     nid0[1312-1333],nid01335

User attempts to logon to an unassigned node. The connection is denied.

login4 [711]~> ssh nid01334

Access denied: user slindsey (uid=804387) has no active jobs.
Connection closed by 129.114.70.82

User logs in to an assigned compute node and does work. Once the job has finished running, the connection is automatically closed.

login4 [712]~> ssh nid01333
nid01333 [665]~> do science; attach debuggers etc.
...
Connection to nid01333 closed by remote host.
Connection to nid01333 closed.
login4 [713]~>

Parameter Sweeps and High Throughput Jobs

Parameter sweeps, where the same executable is run with multiple input data sets, as well as other high throughput scenarios, can be combined into a single job using TACC's launcher and pylauncher utilities.

The Launcher is a simple shell-based utility for bundling large numbers of independent, single-process runs into one multi-node batch submission. This allows users to run more simultaneous serial jobs than would otherwise be possible and improves turnaround time (see the job file sketch below).

Please see TACC's Launcher documentation or the module help for more information.

login1$ module help launcher
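
The core of a Launcher run is a plain text job file listing one command per line; each line is executed as an independent run. A hypothetical job file for a parameter sweep might contain:

./my_sim -input case01.dat > case01.out
./my_sim -input case02.dat > case02.out
./my_sim -input case03.dat > case03.out

The environment variable that points the Launcher at this file and the accompanying batch script template vary between Launcher versions; consult "module help launcher" for the details on Lonestar 5.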

Affinity and Memory Locality

Lonestar 5 has Hyperthreading (HT) enabled. When HT is enabled, the OS addresses two virtual threads per core. These logical threads share core resources and thus may not improve performance for all workloads. On HT systems it is critical to pay attention to thread binding in multithreaded processes. We are interested in feedback regarding the use of 24 vs 48 threads/tasks per node and are willing to provide assistance in setting up these tests.

HPC workloads often benefit from pinning processes to hardware instead of allowing the operating system to migrate them at will. This is particularly important on multicore and heterogeneous systems, where process (and thread) migration can lead to suboptimal memory access and resource sharing patterns and thus to significant performance degradation. TACC provides an affinity script, tacc_affinity, that enforces strict local memory allocation and pins processes to sockets. For most HPC workloads, using tacc_affinity ensures that processes do not migrate and that memory accesses are local. To use tacc_affinity with your MPI executable, use this command:

nid00181$ ibrun tacc_affinity a.out

or place the command in a job script:

ibrun tacc_affinity a.out

This applies an affinity based on the tasks_per_socket option (or an appropriate affinity if tasks_per_socket is not used) and a memory policy that forces memory assignments to the local socket. Try ibrun with and without tacc_affinity to determine whether your application runs better with TACC's affinity settings.

However, there may be instances in which tacc_affinity is not flexible enough to meet the user's requirements. This section describes techniques for controlling process affinity and memory locality that can be used to improve execution performance on Lonestar 5 and other HPC resources. In this section an MPI task is synonymous with a process. For a pure MPI job (i.e. no threading), it is strongly recommended to use 24 tasks per node, one per core.

Do not use multiple methods to set affinity simultaneously as this can lead to unpredictable results.

Using numactl

numactl is a Linux command that allows explicit control of process affinity and memory policy. Since each MPI task is launched as a separate process, numactl can be used to specify the affinity and memory policy for each task. There are two ways to exercise NUMA control when launching a batch executable:

nid00181$ ibrun numactl options ./a.out
nid00181$ ibrun my_affinity ./a.out

The first command sets the same options for each task. Because the ranks for the execution of each a.out are not known to numactl, it is not possible to use this command line to tailor options for each individual task. The second command launches an executable script, my_affinity, that sets the affinity for each task. The script has access to the number of tasks per node and the rank of each task, so it is possible to set individual affinity options for each task with this method. In general, any execution using more than one task should employ the second method so that tasks can be properly pinned to the hardware.

In threaded applications, the same numactl command may be used, but its scope applies globally to all threads, because every forked process or thread inherits the affinity and memory policy of the parent. This behavior can be modified from within a program using the NUMA API: the basic calls for querying and setting bindings are "sched_getaffinity" and "sched_setaffinity", and the libnuma library provides finer-grained control. Note that on the login nodes the core numbers for masking are assigned round-robin to the sockets (cores 0, 2, 4, ... are on socket 0 and cores 1, 3, 5, ... are on socket 1), while on the compute nodes they are assigned contiguously (cores 0-11 are on socket 0 and cores 12-23 are on socket 1).

The TACC-provided affinity script, tacc_affinity, enforces strict local memory allocation to the socket (forcing eviction of the previous user's I/O buffers) and distributes tasks evenly across sockets. Use this script as a template if a custom affinity script is needed for your jobs.

Table 5. Common numactl Options

Option Arguments Description
-N 0,1 Socket Affinity. Execute process only on this (these) socket(s)
-C [0-23] Core Affinity. Execute process on this (these, comma separated list) core(s).
-l None Memory Policy. Allocate only on socket where process runs. Fallback to another if full.
-i 0,1 Memory Policy. Strictly allocate round robin on these (comma separated list) sockets. No fallback; abort if no more allocation space is available.
-m 0,1 Memory Policy. Strictly allocate on this (these, comma separated list) sockets. No fallback; abort if no more allocation space is available.
--preferred= 0,1 Memory Policy. Allocate on this socket. Fallback to the other if full.
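
For example, to run a serial executable only on socket 0 with all of its memory allocated locally (the executable name is a placeholder):

nid00181$ numactl -N 0 -l ./a.out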

Additional details on numactl are given in its man page and help information:

login1$ man numactl
login1$ numactl --help

Using Intel's KMP_AFFINITY

To alleviate the complexity of setting affinity on architectures that support multiple hardware threads per core, Intel provides a means of controlling thread pinning via the environment variable $KMP_AFFINITY.

login1$ export KMP_AFFINITY=[<modifier>,...]type

Table 6. KMP_AFFINITY types

Option Description
none Does not pin threads.
compact Pack threads close to each other.
scatter Round-robin threads to cores.

KMP_AFFINITY type modifiers include:

  • norespect or respect (OS thread placement)
  • noverbose or verbose
  • nowarnings or warnings
  • granularity=[fine|core] where
    • fine - pinned to HW thread
    • core - able to jump between HW threads within the core
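
A concrete setting that pins each OpenMP thread to its own hardware thread, packs threads close together, and prints the resulting placement (this particular combination of modifiers is illustrative):

login1$ export KMP_AFFINITY=verbose,granularity=fine,compact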

Managing and Monitoring Jobs

After job submission, users may monitor the status of their jobs in several ways. While a job is waiting, the system continuously monitors the nodes that become available and applies fair-share and backfill algorithms to schedule jobs fairly and keep the machine running at optimum capacity. The latest queue information can be displayed in several ways using the showq and squeue commands.

Job Monitoring with showq

TACC's "showq" job monitoring command-line utility displays jobs in the batch system in a manner similar to PBS' utility of the same name. showq summarizes running, idle, and pending jobs, also showing any advanced reservations scheduled within the next week. See Table 7 for more showq options.

Note that the number of cores reported is always 48 x (Number of nodes), independently of how many cores are actually requested per node. The exceptions are the vis and gpu queues which will report 20 x (Number of nodes).

login1$ showq
ACTIVE JOBS--------------------
JOBID     JOBNAME    USERNAME      STATE   CORE   REMAINING  STARTTIME
================================================================================
24820     plascomcm  oliver        Running 7680    20:38:50  Mon Dec 14 09:29:45
24827     noramp270_ mscott        Running 960      4:38:50  Mon Dec 14 09:29:45
24828     t13_job    mscott        Running 576      4:38:50  Mon Dec 14 09:29:45
24830     z_06_SL16  slindsey      Running 12000   20:38:50  Mon Dec 14 09:29:45
24856     z_07_SL16  slindsey      Running 8160    21:09:48  Mon Dec 14 10:00:43
24857     z_08_SL16  slindsey      Running 8160    21:20:55  Mon Dec 14 10:11:50
24863     idv88476   viennej       Running 768      3:39:56  Mon Dec 14 11:30:51
24864     terawrite  djames        Running 2400    23:59:42  Mon Dec 14 12:50:37

    8 active jobs

Total Jobs: 8     Active Jobs: 8     Idle Jobs: 0     Blocked Jobs: 0

Use the "-U" option with showq to display information for a single user:

login1$ showq -U slindsey

SUMMARY OF JOBS FOR USER: <slindsey>

ACTIVE JOBS--------------------
JOBID     JOBNAME    USERNAME      STATE   CORE   REMAINING  STARTTIME
================================================================================
28940     myjob1     slindsey      Running 480      6:44:22  Thu Jan  7 08:13:45
28941     myjob2     slindsey      Running 192      6:44:41  Thu Jan  7 08:14:04

Total Jobs: 2     Active Jobs: 2     Idle Jobs: 0     Blocked Jobs: 0

Table 7. showq Options

Option         Description
--help         display help message and exit
-l | --long    display verbose (long) listing
-u | --user    display jobs for the current user only
-U username    display jobs for username only

Job Monitoring with squeue

Both the showq -U and squeue -u username commands display similar information:

login1$ squeue -u slindsey

JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
28941      normal   myjob1 slindsey  R    3:16:37      4 nid0[1112-1115]
28940      normal   myjob2 slindsey  R    3:16:56     10 nid0[1102-1111]

The showq command displays the cores and time requested, while the squeue command displays the partition (queue) and state (ST) of each job along with the node list once nodes are allocated; pending jobs instead show a state of PD and a reason such as "Resources" (waiting for nodes to free up) in the NODELIST(REASON) column. Table 8 below details common squeue options and Table 9 describes the command's output fields.

Table 8. Common squeue Options

Option          Result
-i interval     Repeatedly report at intervals (in seconds).
-j job_list     Display information for the specified job(s).
-p part_list    Display information for the specified partitions (queues).
-t state_list   Show jobs in the specified state(s). See the squeue man page for state abbreviations: "all" or a list from {PD,R,S,CG,CD,CF,CA,F,TO,PR,NF}.

Table 9. squeue Output Field Descriptions

Field    Description
JOBID    job id assigned to the job
USER     user that owns the job
STATE    current job status, including, but not limited to:
         CD (completed)
         CA (cancelled)
         F  (failed)
         PD (pending)
         R  (running)

Using the squeue command with the --start and -j options can provide an estimate of when a particular job will be scheduled:

login1$ squeue --start -j 1676354
JOBID    PARTITION  NAME    USER   ST  START_TIME           NODES  NODELIST(REASON)
1676354  normal     hellow  user3  PD  2013-08-21T13:42:03  256    (Resources)

Even more extensive job information can be found using the "scontrol" command. The output shows quite a bit about the job: job dependencies, submission time, number of nodes, location of the job script, the working directory, etc. See the man page for more details.
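
For example (the job ID is a placeholder):

login1$ scontrol show job 1676354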

Job Deletion with scancel

The scancel command is used to remove pending and running jobs from the queue. Include a space-separated list of job IDs that you want to cancel on the command-line:

login1$ scancel job_id1 job_id2 ...

Use "showq -u" or "squeue -u username" to see your jobs.

Example job scripts are available online in /share/doc/slurm . They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

About Pending Jobs

Viewing queue status may reveal jobs in a pending (PD) state. Jobs submitted to Slurm may be, and remain, in a pending state for many reasons such as:

  • A queue (partition) may be temporarily offline
  • The resources (number of nodes) requested exceed those available
  • Queues are being drained in anticipation of system maintenance.
  • The system is running other high priority jobs

The Reason Codes summarized below identify the reason a job is awaiting execution. If a job is pending for multiple reasons, only one of those reasons is displayed. For a full list, view the squeue man page.

Job Pending Codes Description
Dependency This job is waiting for a dependent job to complete.
NodeDown A node required by the job is down.
PartitionDown The partition (queue) required by this job is in a DOWN state and temporarily accepting no jobs, for instance because of maintenance. Note that this message may be displayed for a time even after the system is back up.
Priority One or more higher priority jobs exist for this partition or advanced reservation. Other jobs in the queue have higher priority than yours.
ReqNodeNotAvail No nodes can be found satisfying your limits, for instance because maintenance is scheduled and the job can not finish before it
Reservation The job is waiting for its advanced reservation to become available.
Resources The job is waiting for resources (nodes) to become available and will run when Slurm finds enough free nodes.
SystemFailure Failure of the Slurm system, a file system, the network, etc.

Slurm Environment Variables

In addition to the environment variables that can be inherited by the job from the interactive login environment, Slurm provides environment variables for most of the values used in the #SBATCH directives. These are listed at the end of the sbatch man page. The environment variables SLURM_JOB_ID, SLURM_JOB_NAME, SLURM_SUBMIT_DIR and SLURM_NTASKS_PER_NODE may be useful for documenting run information in job scripts and output. Table 11 below lists some important Slurm-provided environment variables.

Note that environment variables cannot be used in an #SBATCH directive within a job script. For example, the following directive will NOT work as expected:

#SBATCH -o {$SLURM_JOB_ID}.out

Instead, use the following directive:

#SBATCH -o myMPI.o%j

where "%j" expands to the jobID.

Table 11. Slurm Environment Variables

Environment Variable    Description
SLURM_JOB_ID            batch job id assigned by Slurm upon submission
SLURM_JOB_NAME          user-assigned job name
SLURM_NNODES            number of nodes
SLURM_NODELIST          list of nodes
SLURM_NTASKS            total number of tasks
SLURM_QUEUE             queue (partition)
SLURM_SUBMIT_DIR        directory of submission
SLURM_TASKS_PER_NODE    number of tasks per node
SLURM_TACC_ACCOUNT      TACC project/allocation charged

Job Dependencies

Some workflows may have job dependencies, for example a user may wish to perform post-processing on the output of another job, or a very large job may have to be broken up into smaller pieces so as not to exceed maximum queue runtime. In such cases you may use Slurm's command-line "--dependency=" options. Slurm will not process this option within a job script.

The following command submits a job script that will run only upon successful completion of another previously submitted job:

login1$ sbatch --dependency=afterok:jobid job_script_name

Monitor queue status with sinfo

The "sinfo" command gives a wealth of information about the status of the queues, but the command without arguments might give you more information than you want. Use the print options in in the snippet below with sinfofor a more readable listing that summarizes each queue on a single line:

login1$ sinfo -o "%20P %5a %.10l %16F"

The column labeled "NODES(A/I/O/T)" of this summary listing displays the number of nodes with the Allocated, Idle, and Other states along with the Total node count for the partition. See "man sinfo" for more information.

Software on Lonestar 5

Use TACC's Software Search tool or the "module spider" command to discover available software packages.

Users are welcome to install packages in their Home or Work directories. No super-user privileges are needed; simply use the "--prefix" option when configuring and building the package.
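
A typical sequence for an autoconf-based package might be (the installation path is illustrative):

login1$ ./configure --prefix=$WORK/apps/mypackage
login1$ make
login1$ make install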

Users must provide their own license for commercial packages.

Visualization

Lonestar 5 supports both interactive and batch visualization on any compute node using the Software-Defined Visualization stack (sdvis.org). Traditional hardware-accelerated rendering is available on the 16 GPU nodes through the vis queue, where each node is configured with one NVIDIA K40s GPU.

Remote Desktop Access

Remote desktop access to Lonestar 5 is provided through a Virtual Network Computing (VNC) connection to one or more nodes. Users must first connect to a Lonestar login node (see Accessing Lonestar 5) and from there submit a job that:

  • allocates a set of Lonestar nodes
  • starts a vncserver process on the first allocated node
  • identifies via an output message a vncserver access port to connect to

Once the vncserver process is running, the user establishes a secure SSH tunnel to the specified vncserver access port and starts a VNC viewer application on their local system, which presents the virtual desktop to the user.

Note: If this is your first time connecting to LS5, you must run vncpasswd to create a password for your VNC servers. This should NOT be your login password! This mechanism only deters unauthorized connections; it is not fully secure, as only the first eight characters of the password are saved. All VNC connections are tunneled through SSH for extra security, as described below.

Follow the steps below to start an interactive session.

  1. Start a Remote Desktop

    TACC has provided a VNC Slurm job script (/share/doc/slurm/job.vnc) that requests one node in the vis queue for four hours, creating a VNC session. Submit this job with the sbatch command:

    login1$ sbatch /share/doc/slurm/job.vnc

    You may modify or overwrite script defaults with sbatch command-line options:

    • "-t hours:minutes:seconds" - modify the job runtime
    • "-A projectnumber" - specify the project/allocation to be charged
    • "-N nodes" - specify number of nodes needed
    • "-n processes" - specify the number of processes per node NEW to LS5
    • "-p partition" - specify alternate queue (default queue is normal)

    All arguments after the job script name are sent to the vncserver command. For example, to set the desktop resolution to 1440x900, use:

    login1$ sbatch /share/doc/slurm/job.vnc -geometry 1440x900

    The job.vnc script starts a vncserver process and writes the connection port for the vncviewer to the output file vncserver.out in the job submission directory. Watch for the "To connect via VNC client" message at the end of the output file, or watch the output stream in a separate window with the commands:

    login1$ touch vncserver.out ; tail -f vncserver.out

    The spartan window manager twm is the default VNC desktop. The lightweight xfce window manager is available as a module and is recommended for better remote performance. To use Xfce, open the "~/.vnc/xstartup" file (created after your first VNC session) and replace "twm" with "startxfce4", as in the sketch following the command below. You will also need to load the xfce module before submitting the vnc job script:

    login1$ module load xfce
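
    After the edit, a typical "~/.vnc/xstartup" might look like the sketch below; your file may contain additional lines, and only the window manager line needs to change:

    #!/bin/sh
    xrdb $HOME/.Xresources
    xsetroot -solid grey
    startxfce4 &    # previously "twm &"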
  2. Create an SSH Tunnel to Lonestar 5

    TACC requires users to create a secure SSH tunnel from their local system to one of the two LS5 login nodes, login1.ls5.tacc.utexas.edu or login2.ls5.tacc.utexas.edu. On Unix or Linux systems, execute the following command once the port has been opened on an LS5 login node:

    localhost$ ssh -f -N -L xxxx:ls5.tacc.utexas.edu:yyyy username@login1.ls5.tacc.utexas.edu

    or

    localhost$ ssh -f -N -L xxxx:ls5.tacc.utexas.edu:yyyy username@login2.ls5.tacc.utexas.edu

    where:

    • "yyyy" is the port number given by the vncserver batch job
    • "xxxx" is a port on the remote system. Generally, the port number specified on one of the LS5 login nodes, yyyy, is a good choice to use on your local system as well
    • "-f" instructs SSH to only forward ports, not to execute a remote command
    • "-N" backgrounds the ssh command after connecting
    • "-L" forwards the port

    On Windows systems, find the menu in the SSH client where tunnels can be specified, and enter the local and remote ports as required, then ssh to LS5.

  3. Connecting vncviewer

    Once the SSH tunnel has been established, use a VNC client to connect to the local port you created, which will then be tunneled to your VNC server on Lonestar 5. In the examples above, we would connect the VNC client to "localhost::xxxx", where "xxxx" is the local port used for the tunnel (note the "::"; some VNC clients accept "localhost:xxxx").
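
    For example, with the hypothetical port 5902 used above and the TigerVNC command-line client installed on your local system, the connection command would be:

    localhost$ vncviewer localhost::5902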

    TACC staff recommends the following VNC clients:

    • TigerVNC VNC Client, a platform independent application
    • TightVNC for Windows and Linux
    • Chicken of the VNC for Mac

    Before the desktop is presented, the user will be prompted for their VNC server password (the password created before your first session using vncpasswd as explained above). Depending on your local system this prompt's location may or may not be obvious. If you don't see it immediately take a good look around your desktop. The virtual desktop should appear at this point and includes one or two initial xterm windows (which may be overlapping). One, which is white-on-black, manages the lifetime of the VNC server process. Killing this window (by typing "exit" or "ctrl-D" at the prompt, or selecting the "X" in the upper corner) will cause the vncserver process to terminate and the original batch job to end. Because of this, we recommend that this window not be used for other purposes; it is just too easy to accidentally kill it and terminate the session. Move it off to one side out of the way.

    The other xterm window is black-on-white and can be used to start serial programs on the node hosting the vncserver process, as well as parallel jobs running across the set of cores associated with the original batch job. Additional xterm windows can be created using the window manager's left-button menu.

Visualization Applications

TACC staff are working on installing parallel VisIt and ParaView on Lonestar 5. Documentation will be forthcoming.

Assessing the Need

The large memory nodes are for jobs that require more than the 64GB of RAM available on the standard memory nodes. Demand for these nodes is high and wait times can be long. Please assess your memory needs carefully before submitting jobs to the large memory queues: in many cases there are ways to resolve memory issues that do not require moving to large memory nodes. For MPI applications, it is often enough to run with fewer MPI tasks per node. For example, instead of "-N 4 -n 96" (96 tasks spread across 4 nodes, or 24 tasks per node), one can increase the memory available to each task by running "-N 8 -n 96" (96 tasks spread across 8 nodes, or 12 tasks per node) or even "-N 96 -n 96" (1 task per node). TACC's Remora tool allows you to examine your memory needs easily. For more information about remora:

login1$ module load remora; module help remora
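
Typical usage is to prefix the launch command with remora inside your job script; remora then reports memory usage (among other metrics) for the run. A minimal sketch, where "./mycode" is a placeholder executable and ibrun is TACC's MPI launcher:

module load remora
remora ibrun ./mycode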

Accessing the Large Memory Queues

The TACC-largemem module provides access to the large memory queues. It is the large memory equivalent of the (default) "TACC" module that manages access to the standard memory queues. To configure your environment for the large memory queues, execute:

login1$ module load TACC-largemem

to automatically swap out the TACC module. Loading the TACC-largemem module has two effects:

  1. It configures Slurm so it points to the two large memory queues (and only those queues)
  2. It swaps out any Aries-based MPI module and automatically loads the mvapich2-largemem module, an MPI stack that is compatible with the Infiniband network connecting the large memory nodes.

To reconfigure your environment for the standard memory queues, execute

 login1$ module load TACC

to automatically swap out the TACC-largemem module and replace it with the default TACC module. You can of course achieve the same outcome in other ways (e.g. executing "module reset" to return to system defaults). Do not, however, simply unload the TACC-largemem module; doing so would have the effect of leaving you without access to any Slurm commands.

MPI Applications on the Large Memory Queues

To run an MPI application in either of the large memory queues you will need to rebuild the application using the mvapich2-largemem module. This module provides an MPI stack compatible with the Infiniband network on the large memory queues. The "cray_mpich" module targets the Aries network and will not work in the large memory queues.

Building for the 512GB Haswell Nodes

The login nodes and the 512GB compute nodes share the same Haswell architecture. Be sure to load the mvapich2-largemem module when building MPI applications for the large memory nodes. Beyond this requirement, there are no other special considerations when using the login nodes to build software for the nodes in the largemem512GB queue.

Building for the 1TB Ivy Bridge Nodes

The Ivy Bridge 1TB compute nodes have a different processor architecture than the Haswell login nodes. This can affect the way you use the Intel compiler and Haswell login nodes to build software targeting the 1TB Ivy Bridge nodes. Among the plausible approaches:

  • Single Binary Optimized for Both Architectures (recommended): Specify "-xAVX -axCORE-AVX2" with the Intel compiler to build a multi-architecture binary that will detect the processor architecture at runtime and select processor-specific optimized code. In a typical build system, add these flags to the CFLAGS, CXXFLAGS, FFLAGS, and LDFLAGS variables. The resulting binary may be up to 2x larger than a single-target executable, and compilation will take more time. But this approach is otherwise an excellent choice for codes you want to run in both large memory queues.

  • Generic Binary (supported): If you compile with no architecture flags (e.g. "icc mycode.c"), the compiler defaults to "-msse2", producing a binary that will run on essentially any modern Intel processor. The executable, however, will not exploit new hardware features or the best processor-specific optimizations.

When building MPI applications, be sure to load the "mvapich2-largemem" module.
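
As a sketch, a build targeting both large memory architectures might look like the following, assuming the mvapich2-largemem module supplies the usual mpicc wrapper and "mycode.c" stands in for your source file:

login1$ module load TACC-largemem
login1$ mpicc -xAVX -axCORE-AVX2 -O2 -o mycode mycode.c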

Running Jobs in the Large Memory Queues

Submitting a batch job to a large memory queue is similar to doing so on the standard queues. Interactive sessions using idev are also possible, but queue wait times may make interactive sessions impractical. To submit a batch job or request an interactive session on a large memory queue:

  1. First load the TACC-largemem module so that Slurm points to the large memory queues. You must load this module before executing your sbatch or idev command.

  2. For MPI jobs, check to make sure you have the mvapich2-largemem module loaded. Loading the TACC-largemem module loads it automatically; you can also load it explicitly before calling sbatch/idev, in your job script itself, or from within your interactive idev session.

  3. Specify the appropriate queue in one of three ways:

    • in a batch script:

      #SBATCH -p largemem512GB
    • as a command-line option to the sbatch command:

      login1$ sbatch -p largemem512GB mybatchscript
    • or as an idev option:

      login1$ idev -p largemem512GB
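
Putting these steps together, a minimal batch script sketch for the largemem512GB queue might look like the following. The job name, allocation, task count, runtime, and executable are placeholders, and the TACC-largemem module must still be loaded on the login node before running sbatch:

#!/bin/bash
#SBATCH -J bigmem_job            # job name (placeholder)
#SBATCH -o bigmem_job.o%j        # output file; %j expands to the job id
#SBATCH -p largemem512GB         # large memory queue
#SBATCH -N 1                     # number of nodes
#SBATCH -n 24                    # total number of tasks (adjust to your needs)
#SBATCH -t 02:00:00              # run time (hh:mm:ss)
#SBATCH -A projectnumber         # project/allocation to charge

module load mvapich2-largemem    # MPI stack for the large memory nodes (loaded automatically by TACC-largemem)

ibrun ./mycode                   # launch the MPI application (placeholder executable)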

TACC and Cray Environments

The Lonestar 5 environment includes two distinct modes: a TACC Environment and a Cray Environment. This user guide describes the TACC Environment, which provides a user experience similar to other TACC resources and a software stack built and maintained by TACC staff. The Cray Environment provides a user experience similar to other Cray systems and the standard software stack supplied by the vendor.

Help

For assistance, please submit a ticket via the TACC User Portal with "Lonestar 5" in the Resource field. The ticket will be directed to the appropriate staff member for support.