Hikari User Guide
Last update: March 22, 2018 (see Revision History)

Material in red represents temporary conditions, placeholders, or other content subject to change in the near future.

Notices

  • Hikari is part of TACC's computing ecosystem, sharing many attributes with other systems. This guide may refer to sections in other user guides or documentation where appropriate.
  • Hikari's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour) rather than a core-hour. We then multiply by a queue-specific charge rate to adjust for supply and demand. See Job Accounting for more information.
Figure 1. Hikari System

Introduction

Hikari is a large-scale compute resource deployed through a collaboration between TACC, the New Energy and Industrial Technology Development Organization (NEDO), a Japanese government agency, and NTT FACILITIES INC. It is designed to support secure/compliant computing needs (HIPAA/FISMA compliant data), as well as the increasing demand for fast turnaround on jobs submitted through web APIs from Science Gateways. Hikari features a secure data management workflow for protected data. Additionally, Hikari is the "greenest" of the TACC systems, demonstrating the potential of High Voltage Direct Current (HVDC) data centers.

This user guide provides an overview of the system architecture, how to access the machine, how to compute, what software is natively available to the user community, and how to move data off and on the connected filesystem.

Sustainable Supercomputing

The Hikari system is part of a demonstration project showing the potential of High Voltage Direct Current (HVDC) data centers. Hikari operates on a 380V DC distribution system with 250kW of solar panels, an HVDC UPS battery system, HVDC CRAC (Computer Room Air Conditioner) units, and the associated power distribution equipment. In addition to the HVDC components, the HPE Apollo 8000 has a number of other "green" features, including self-contained water-cooled racks that eject no heat into the datacenter air. Separate water loops carry coolant directly across the processors; these loops run at near-vacuum pressures so the water boils at low temperatures (around 105F or 40C), convects to the sides of each blade, hits a heat exchanger, condenses, and is pushed back over the processors without the need for active pumps in each blade.

Architecture

Hikari is an HPE Apollo 8000 system, providing 432 24-core general compute nodes (for a total of 10,368 processor cores). The system is configured with over 27TB of memory and 13TB of disk storage, and has a peak performance of ~0.4PF.

All Hikari nodes run CentOS 7 and are managed with batch services through native Slurm 17.02. Global storage areas are supported by an NFS system ($HOME) and a Lustre parallel distributed file system ($WORK). Inter-node communication is through a Mellanox EDR InfiniBand interconnect.

Compute Nodes

Hikari hosts 432 Haswell compute nodes.

Model: Intel Xeon E5-2690 v3 ("Haswell")
Total cores per Haswell node: 24 cores on two sockets (12 cores / socket)
Hardware threads per core: 2
Hardware threads per node: 24 x 2 = 48
Clock rate: 2.60GHz
Disk: 120GB SATA M.2 Solid State Drive
Memory: 64GB DDR4-2133 (8 x 8GB single rank x4 DIMMs)
Local storage: Users have access to a 64 GB /tmp disk to accelerate IO operations.

Login Nodes

Hikari hosts 2 login nodes:

  • Dual Socket
  • Intel Xeon CPU E5-2660 v3 (Haswell) @ 2.60GHz: 10 cores/socket (20 cores/node)
  • 128 GB DDR4-2133 (8 x 16GB dual rank x4 DIMMS)
  • Hyperthreading Disabled

Network

  • Mellanox EDR Infiniband
  • Fat Tree Interconnect
  • HP Infiniband EDR 1-port 840 Apollo 8000 Adapter
  • up to 100Gbps bandwidth with sub-microsecond latency
  • Intel Ethernet Controller I350 IEEE 802.3 1Gbps Adapter

File Systems

Hikari mounts two file systems that are shared across all nodes: /home and /work. The system also defines the corresponding account-level environment variables $HOME and $WORK. $HOME has a strict quota of 5GB; place large data on $WORK. Hikari does not mount a scratch file system.

File System   Quota                             Purge Policy
$HOME         5GB                               none
$WORK         1TB (across all TACC systems)     none

Several aliases are provided for users to move easily between file systems:

Use the "cdh" or "cd" commands to change to $HOME Use "cdw" to change to $WORK

Mounted Storage Systems

TACC's Ranch tape archival system is available from Hikari via remote access. Private, secure mount points will be offered to user groups with Protected Data requirements.

Accessing the System

Access to all TACC systems now requires Multi-Factor Authentication (MFA). You can create an MFA pairing on the TACC User Portal. After login on the portal, go to your account profile (Home->Account Profile), then click the "Manage" button under "Multi-Factor Authentication" on the right side of the page. See Multi-Factor Authentication at TACC for further information.

Secure Shell (SSH)

The "ssh" command (SSH protocol) is the standard way to connect to Hikari. SSH also includes support for the file transfer utilities scp and sftp. Wikipedia is a good source of information on SSH. SSH is available within Linux and from the terminal app in the Mac OS. If you are using Windows, you will need an SSH client that supports the SSH-2 protocol: e.g. Bitvise, OpenSSH, PuTTY, or SecureCRT. Initiate a session using the ssh command or the equivalent; from the Linux command line the launch command looks like this:

localhost$ ssh myusername@hikari.tacc.utexas.edu

Use your TACC password for direct logins to TACC resources. You can change your TACC password through the TACC User Portal. Log into the portal, then select "Change Password" under the "HOME" tab. If you've forgotten your password, go to the TACC User Portal home page and select "Password Reset" under the Home tab.

To report a connection problem, execute the ssh command with the "-vvv" option and include the verbose output when submitting a help ticket.

Do not run the "ssh-keygen" command on Hikari. This command will create and configure a key pair that will interfere with the execution of job scripts in the batch system. If you do this by mistake, you can recover by renaming or deleting the .ssh directory located in your home directory; the system will automatically generate a new one for you when you next log into Hikari.

  1. execute "mv .ssh dot.ssh.old"
  2. log out
  3. log into Hikari again

After logging in again the system will generate a properly configured key pair.

Using Hikari

Hikari nodes run CentOS 7. Regardless of your research workflow, you'll need to master Linux basics and a Linux-based text editor (e.g. emacs, nano, gedit, or vi/vim) to use the system properly. This user guide does not address these topics, however. There are numerous resources in a variety of formats available to help you learn Linux, including some listed on the TACC training sites. If you encounter a term or concept in this user guide that is new to you, a quick internet search should help you resolve the matter quickly.

Configuring Your Account

Linux Shell

The default login shell for your user account is Bash. To determine your current login shell, execute:

$ echo $SHELL

If you'd like to change your login shell to csh, sh, tcsh, or zsh, submit a help desk ticket through the TACC User Portal. The "chsh" ("change shell") command will not work on TACC systems.

When you start a shell on Hikari, system-level startup files initialize your account-level environment and aliases before the system sources your own user-level startup scripts. You can use these startup scripts to customize your shell by defining your own environment variables, aliases, and functions. These scripts (e.g. .profile and .bashrc) are generally hidden files: so-called dotfiles that begin with a period, visible when you execute: "ls -a".

Before editing your startup files, however, it's worth taking the time to understand the basics of how your shell manages startup. Bash startup behavior is very different from the simpler csh behavior, for example. The Bash startup sequence varies depending on how you start the shell (e.g. using ssh to open a login shell, executing the "bash" command to begin an interactive shell, or launching a script to start a non-interactive shell). Moreover, Bash does not automatically source your .bashrc when you start a login shell by using ssh to connect to a node. Unless you have specialized needs, however, this is undoubtedly more flexibility than you want: you will probably want your environment to be the same regardless of how you start the shell. The easiest way to achieve this is to execute "source ~/.bashrc" from your ".profile", then put all your customizations in ".bashrc". The system-generated default startup scripts demonstrate this approach. We recommend that you use these default files as templates.
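A minimal sketch of this approach (the system-generated defaults on Hikari may differ in detail):

$ cat ~/.profile
if [ -f ~/.bashrc ]; then . ~/.bashrc; fi   # source .bashrc from .profile

$ cat ~/.bashrc
# ...your aliases, environment variables, and other customizations go here...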

For more information see the Bash Users' Startup Files: Quick Start Guide and other online resources that explain shell startup. To recover the originals that appear in a newly created account, execute "/usr/local/startup_scripts/install_default_scripts".

Environment Variables

Your environment includes the environment variables and functions defined in your current shell: those initialized by the system, those you define or modify in your account-level startup scripts, and those defined or modified by the modules that you load to configure your software environment. Be sure to distinguish between an environment variable's name (e.g. HISTSIZE) and its value ($HISTSIZE). Understand as well that a sub-shell (e.g. a script) inherits environment variables from its parent, but does not inherit ordinary shell variables or aliases. Use export (in Bash) or setenv (in csh) to define an environment variable.
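For example, to define an environment variable that scripts and sub-shells will inherit (the variable name MYDATA is just an illustration):

$ export MYDATA=$WORK/projectA/data    # Bash
% setenv MYDATA $WORK/projectA/data    # csh/tcsh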

Execute the "env" command to see the environment variables that define the way your shell and child shells behave.

Pipe the results of env into grep to focus on specific environment variables. For example, to see all environment variables that contain the string GIT (in all caps), execute:

$ env | grep GIT

The environment variables PATH and LD_LIBRARY_PATH are especially important. PATH is a colon-separated list of directory paths that determines where the system looks for your executables. LD_LIBRARY_PATH is a similar list that determines where the system looks for shared libraries.
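For example, to make your own executables and shared libraries visible (the directory names are placeholders):

$ export PATH=$HOME/bin:$PATH                          # look in $HOME/bin first
$ export LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH    # look in $HOME/lib first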

Accessing the Compute Nodes

You connect to Hikari through one of two "front-end" login nodes. The login nodes are shared resources: at any given time, many users are logged into each login node, each preparing to access the "back-end" compute nodes (Figure 2. Login and Compute Nodes). What you do on the login nodes affects other users directly because you are competing for the same memory and processing power. This is why you should not run your applications on the login nodes or otherwise abuse them. Think of the login nodes as a prep area where you can manage files and compile code before accessing the compute nodes to perform research computations. See Good Citizenship for more information.

You can use your command-line prompt, or the "hostname" command, to tell you whether you are on a login node or a compute node. The default prompt, or any custom prompt containing "\h", displays the short form of the hostname (e.g. c401-064). The hostname for a Hikari login node begins with the string "login" (e.g. login2.hikari.tacc.utexas.edu), while compute node hostnames begin with the character "c" (e.g. c401-064.hikari.tacc.utexas.edu).
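For example (the hostnames shown are illustrative):

login1$ hostname -f            # on a login node
login1.hikari.tacc.utexas.edu
c401-064$ hostname -f          # on a compute node
c401-064.hikari.tacc.utexas.edu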

While some workflows, tools, and applications hide the details, there are three basic ways to access the compute nodes:

  1. Submit a batch job using the sbatch command. This directs the scheduler to run the job unattended when there are resources available. Until your batch job begins it will wait in a queue. You do not need to remain connected while the job is waiting or executing. See Running Jobs for more information. Note that the scheduler does not start jobs on a first come, first served basis; it juggles many variables to keep the machine busy while balancing the competing needs of all users. The best way to minimize wait time is to request only the resources you really need: the scheduler will have an easier time finding a slot for the two hours you need than for the 48 hours you unnecessarily request.
  2. Begin an interactive session using idev or srun. This will log you into a compute node and give you a command prompt there, where you can issue commands and run code as if you were doing so on your personal machine. An interactive session is a great way to develop, test, and debug code. When you request an interactive session, the scheduler submits a job on your behalf. You will need to remain logged in until the interactive session begins.
  3. Begin an interactive session using ssh to connect to a compute node on which you are already running a job. This is a good way to open a second window into a node so that you can monitor a job while it runs.

Be sure to request computing resources that are consistent with the type of application(s) you are running:

  • A serial (non-parallel) application can only make use of a single core on a single node, and will only see that node's memory.
  • A threaded program (e.g. one that uses OpenMP) employs a shared memory programming model and is also restricted to a single node, but the program's individual threads can run on multiple cores on that node.
  • An MPI (Message Passing Interface) program can exploit the distributed computing power of multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network. The tasks on a given node, however, can only directly access the memory on that node. Depending on the program's memory requirements, it may not be possible to run a task on every core of every node assigned to your job. If it appears that your MPI job is running out of memory, try launching it with fewer tasks per node to increase the amount of memory available to individual tasks.
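For example, a sketch of #SBATCH directives that run an MPI code with half as many tasks per node, doubling the memory available to each task (the node and task counts are placeholders):

#SBATCH -N 4                   # 4 nodes
#SBATCH --ntasks-per-node=12   # 12 MPI tasks per node instead of 24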

Figure 2. Login and compute nodes

Using Modules to Manage your Environment

Lmod, a module system developed and maintained at TACC, makes it easy to manage your environment so you have access to the software packages and versions that you need to conduct your research. This is especially important on a system like Hikari that serves many users with a wide range of needs. Loading a module amounts to choosing a specific package from among available alternatives:

$ module load intel          # load the default Intel compiler
$ module load intel/16.0.1   # load a specific version of Intel compiler

A module does its job by defining or modifying environment variables (and sometimes aliases and functions). For example, a module may prepend appropriate paths to $PATH and $LD_LIBRARY_PATH so that the system can find the executables and libraries associated with a given software package. The module creates the illusion that the system is installing software for your personal use. Unloading a module reverses these changes and creates the illusion that the system just uninstalled the software:

$ module load   ddt  # defines DDT-related env vars; modifies others
$ module unload ddt  # undoes changes made by load

The module system does more, however. When you load a given module, the module system can automatically replace or deactivate modules to ensure the packages you have loaded are compatible with each other. In the example below, the module system automatically unloads one compiler when you load another, and replaces Intel-compatible versions of IMPI and PETSc with versions compatible with gcc:

$ module load intel  # load default version of Intel compiler
$ module load petsc  # load default version of PETSc
$ module load gcc    # change compiler

Lmod is automatically replacing "intel/17.0.4" with "gcc/7.1.0".

Due to MODULEPATH changes, the following have been reloaded:
1) impi/17.0.3     2) petsc/3.7

To see the modules you currently have loaded:

$ module list

To see all modules that you can load right now because they are compatible with the currently loaded modules:

$ module avail

To see all installed modules, even if they are not currently available because they are incompatible with your currently loaded modules:

$ module spider   # list all modules, even those not available to load

To filter your search:

$ module spider gcc              # all modules with names containing 'gcc'
$ module spider sundials/2.5.0   # additional details on a specific module

Among other things, the latter command will tell you which modules you need to load before the module is available to load. You might also search for modules that are tagged with a keyword related to your needs (though your success here depends on the diligence of the module writers). For example:

$ module keyword performance

You can save a collection of modules as a personal default collection that will load every time you log into Hikari. To do so, load the modules you want in your collection, then execute:

$ module save    # save the currently loaded collection of modules 

Two commands make it easy to return to a known, reproducible state:

$ module reset   # load the system default collection of modules
$ module restore # load your personal default collection of modules

On TACC systems, the command "module reset" is equivalent to "module purge; module load TACC". It's a safer, easier way to get to a known baseline state than issuing the two commands separately.

Help text is available for both individual modules and the module system itself:

$ module help tophat  # show help text for software package tophat
$ module help         # show help text for the module system itself

See Lmod's online documentation for more extensive documentation. The online documentation addresses the basics in more detail, but also covers several topics beyond the scope of the help text (e.g. writing and using your own module files).

It's safe to execute module commands in job scripts. In fact, this is a good way to write self-documenting, portable job scripts that produce reproducible results. If you use "module save" to define a personal default module collection, it's rarely necessary to execute module commands in shell startup scripts, and it can be tricky to do so safely. If you do wish to put module commands in your startup scripts, see Hikari's default startup scripts for a safe way to do so.

Good Citizenship

You share Hikari with many users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb.

Login Nodes

When you connect to Hikari you share the login node with dozens of other users.

  • Know when you're on a login node. You can use your Linux prompt, the "hostname" command, or other mechanisms to do so. See Accessing the Compute Nodes for more information.

  • Know what's appropriate on a login node. A login node is a good place to edit and manage files, initiate file transfers, compile code, submit new jobs, and track existing jobs.

  • Avoid computationally intensive activity on login nodes. This means:

    • Don't run research applications on the login nodes; this includes frameworks like MATLAB and R.
    • Don't launch too many simultaneous processes: while it's fine to compile on a login node, "make -j 16" (which compiles on 16 cores) may be a bit rude.
    • That script you wrote to check job status should probably do so every few minutes rather than several times a second.

Shared Lustre File Systems

This section focuses on ways to avoid causing problems on $HOME and $WORK. The File Systems section above gives a brief overview of these file systems. Configuring Your Account covers environment variables and aliases that help you navigate the file systems.

  • Stripe the receiving directory before creating large files in the directory or transferring large files to the directory. See Striping Large Files for more information.

  • Don't run jobs in $HOME. The $HOME file system is for routine file management, not parallel jobs.

  • Limit I/O intensive work on $WORK, and avoid unnecessary reads and writes. If you stress $WORK, you affect every user on every TACC system.

  • Don't get greedy. If you know or suspect your workflow is I/O intensive, don't submit a pile of simultaneous jobs. Writing restart/snapshot files can stress the file system; avoid doing so too frequently.

  • Watch your file system quotas (see the usage example after this list). If you're near your quota in $WORK and your job is repeatedly trying (and failing) to write to $WORK, you will stress the file system. If you're near your quota in $HOME, jobs run on any file system may fail, because all jobs write some data to the hidden $HOME/.slurm directory.

  • Avoid opening and closing files repeatedly in tight loops. Every open/close operation requires the MDS, which is a potential point of congestion. If possible, open files once at the beginning of your program/workflow, then close them at the end.

  • Avoid storing many small files in a single directory, and avoid workflows that require many small files. A few hundred files in a single directory is probably fine; tens of thousands is almost certainly too many. If you must use many small files, group them in separate directories of manageable size.
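One quick way to keep an eye on your usage (a sketch; the "lfs quota" syntax is standard Lustre and applies to $WORK):

login1$ /usr/local/etc/taccinfo      # project balances and disk quotas (see Job Accounting)
login1$ lfs quota -u bjones $WORK    # detailed usage report for your $WORK directory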

Internal and External Networks

  • Avoid too many simultaneous file transfers. You share the network bandwidth with other users; don't use more than your fair share. Two or three concurrent scp sessions is probably fine. Twenty is probably not.

  • Avoid recursive file transfers, especially those involving many small files. Create a tar archive before transfers. This is especially true when transferring files to or from Ranch.

Submitting Jobs

  • When you submit a job to the scheduler, don't ask for more time than you really need. The scheduler will have an easier time finding a slot for the 2 hours you need than the 48 hours you request. This means shorter queue wait times for you and everybody else.

  • Test your submission scripts. Start small: make sure everything works on 2 nodes before you try 200. Work out submission bugs and kinks with 5 minute jobs that won't wait long in the queue and involve short, simple substitutes for your real workload: simple test problems; "hello world" codes; one-liners like "ibrun hostname"; or an ldd on your executable.

  • Respect memory limits and other system constraints. If your application needs more memory than is available, your job will fail, and may leave nodes in unusable states. Monitor your application's needs. Execute "module load remora" followed by "module help remora" for more information on a particularly handy monitoring tool.

Help Desk Tickets

  • Do your homework before submitting a help desk ticket. What do the user guide and other documentation say? Search the internet for key phrases in your error logs; that's probably what the consultants answering your ticket will do. What have you changed since the last time your job succeeded?

  • Subscribe to TACC User News. This is the best way to keep abreast of maintenance schedules, system outages, and other general interest items.

  • Have realistic expectations. Consultants can address system issues and answer questions about Hikari. But they can't teach parallel programming in a ticket, and may know nothing about the package you downloaded. They may offer general advice that will help you build, debug, optimize, or modify your code, but you shouldn't expect them to do these things for you.

  • Describe your issue as precisely and completely as you can: what you did, what happened, verbatim error messages, other meaningful output. When appropriate, include the information a consultant would need to find your artifacts and understand your workflow: e.g. the directory containing your build and/or job script; the modules you were using; relevant job numbers; and recent changes in your workflow that could affect or explain the behavior you're observing.

  • Be patient. It may take a business day for a consultant to get back to you, especially if your issue is complex. It might take an exchange or two before you and the consultant are on the same page. If the admins disable your account, it's not punitive. When the file system is in danger of crashing, or a login node hangs, they don't have time to notify you before taking action.

Managing Your Files

Hikari mounts two file systems that are shared across all nodes: the home and work file systems. Hikari's startup mechanisms define corresponding account-level environment variables, $HOME and $WORK, that store the paths to directories that you own on each of these file systems. Consult the Hikari File Systems table above for the basic characteristics of these file systems and Good Citizenship for tips on file system etiquette.

Hikari's home file system is mounted only on Hikari, but the work file system mounted on Hikari is the Global Shared File System hosted on Stockyard. It is the same file system that is available on Stampede2, Maverick, Wrangler, Lonestar5, and other TACC resources.

The $STOCKYARD environment variable points to the highest-level directory that you own on the Global Shared File System. The definition of the $STOCKYARD environment variable is of course account-specific, but you will see the same value on all TACC systems (see Figure 3). This directory is an excellent place to store files you want to access regularly from multiple TACC resources.

Your account-specific $WORK environment variable varies from system to system and is a sub-directory of $STOCKYARD (Figure 3). The sub-directory name corresponds to the associated TACC resource. The $WORK environment variable on Hikari points to the $STOCKYARD/hikari subdirectory, a convenient location for files you use and jobs you run on Hikari. Remember, however, that all subdirectories contained in your $STOCKYARD directory are available to you from any system that mounts the file system. If you have accounts on both Hikari and Maverick, for example, the $STOCKYARD/hikari directory is available from your Maverick account, and $STOCKYARD/maverick is available from your Hikari account. Your quota and reported usage on the Global Shared File System reflects all files that you own on Stockyard, regardless of their actual location on the file system.

Note that resource-specific sub-directories of $STOCKYARD are nothing more than convenient ways to manage your resource-specific files. You have access to any such sub-directory from any TACC resource. If you are logged into Hikari, for example, executing the alias cdw (equivalent to "cd $WORK") will take you to the resource-specific sub-directory $STOCKYARD/hikari. But you can access this directory from other TACC systems as well by executing "cd $STOCKYARD/hikari". These commands allow you to share files across TACC systems. In fact, the account-level aliases introduced earlier (e.g. "cdh" and "cdw") make it easy to navigate the directories you own in the shared file systems.
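For example:

login1$ cdw                       # same as "cd $WORK", i.e. $STOCKYARD/hikari
login1$ cd $STOCKYARD/hikari      # same destination, reachable by this path from any system that mounts Stockyard
login1$ cd $STOCKYARD/maverick    # your Maverick work directory (exists only if you have a Maverick allocation)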

Figure 3. Account-level directories on the work file system (Global Shared File System hosted on Stockyard). Example for fictitious user bjones. All directories usable from all systems. Sub-directories (e.g. wrangler, maverick) exist only when you have allocations on the associated system.

Transferring Files

You can transfer files between Hikari and Linux-based systems using either scp or rsync. Both scp and rsync are available in the Mac Terminal app. Windows ssh clients typically include scp-based file transfer capabilities.

The Linux scp (secure copy) utility is a component of the OpenSSH suite. Assuming your Hikari username is bjones, a simple scp transfer that pushes a file named "myfile" from your local Linux system to Hikari $HOME would look like this:

localhost$ scp ./myfile bjones@hikari.tacc.utexas.edu:  # note colon after net address

You can use wildcards, but you need to be careful about when and where you want wildcard expansion to occur. For example, to push all files ending in ".txt" from the current directory on your local machine to /work/01234/bjones/hikari on Hikari:

localhost$ scp *.txt bjones@hikari.tacc.utexas.edu:/work/01234/bjones/hikari

To delay wildcard expansion until reaching Hikari, use a backslash ("\") as an escape character before the wildcard. For example, to pull all files ending in ".txt" from /work/01234/bjones/hikari on Hikari to the current directory on your local system:

localhost$ scp bjones@hikari.tacc.utexas.edu:/work/01234/bjones/hikari/\*.txt .

You can of course use shell or environment variables in your calls to scp. For example:

localhost$ destdir="/work/01234/bjones/hikari/data"
localhost$ scp ./myfile bjones@hikari.tacc.utexas.edu:$destdir

You can also issue scp commands on your local client that use Hikari environment variables like $HOME and $WORK. To do so, use a backslash ("\") as an escape character before the "$"; this ensures that expansion occurs after establishing the connection to Hikari:

localhost$ scp ./myfile bjones@hikari.tacc.utexas.edu:\$WORK/data   # Note backslash

Avoid using scp for recursive ("-r") transfers of directories that contain nested directories of many small files:

localhost$ scp -r  ./mydata     bjones@hikari.tacc.utexas.edu:\$WORK  # DON'T DO THIS

Instead, use tar to create an archive of the directory, then transfer the directory as a single file:

localhost$ tar cvf ./mydata.tar mydata                                # create archive
localhost$ scp     ./mydata.tar bjones@hikari.tacc.utexas.edu:\$WORK  # transfer archive

The rsync (remote synchronization) utility is a great way to synchronize files that you maintain on more than one system: when you transfer files using rsync, the utility copies only the changed portions of individual files. As a result, rsync is especially efficient when you only need to update a small fraction of a large dataset. The basic syntax is similar to scp:

localhost$ rsync       mybigfile bjones@hikari.tacc.utexas.edu:\$WORK/data
localhost$ rsync -avtr mybigdir  bjones@hikari.tacc.utexas.edu:\$WORK/data

The options on the second transfer are typical and appropriate when synching a directory: this is a recursive update ("-r") with verbose ("-v") feedback; the synchronization preserves time stamps ("-t") as well as symbolic links and other meta-data ("-a"). Because rsync only transfers changes, recursive updates with rsync may be less demanding than an equivalent recursive transfer with scp.

See Good Citizenship for additional important advice about striping the receiving directory when transferring large files; watching your quota on $HOME and $WORK; and limiting the number of simultaneous transfers. Remember also that $STOCKYARD (and your $WORK directory on each TACC resource) is available from all major TACC systems: there's no need for scp when both the source and destination involve sub-directories of $STOCKYARD. See Managing Your Files for more information about transfers on $STOCKYARD.

Sharing Files with Collaborators

If you wish to share files and data with collaborators in your project, see Sharing Project Files on TACC Systems for step-by-step instructions. Project managers or delegates can use Unix group permissions and commands to create read-only or read-write shared workspaces that function as data repositories and provide a common work area to all project members.

Striping Large Files

Before transferring large files to Hikari, or creating new large files, be sure to set an appropriate default stripe count on the receiving directory. To avoid exceeding your fair share of any given OST, a good rule of thumb is to allow at least one stripe for each 100GB in the file. For example, to set the default stripe count on the current directory to 30 (a plausible stripe count for a directory receiving a file approaching 3TB in size), execute:

$ lfs setstripe -c 30 $PWD

Note that an "lfs setstripe" command always sets both stripe count and stripe size, even if you explicitly specify only one or the other. Since the example above does not explicitly specify stripe size, the command will set the stripe size on the directory to Hikari's system default (1MB). In general there's no need to customize stripe size when creating or transferring files.

Remember that it's not possible to change the striping on a file that already exists. Moreover, the "mv" command has no effect on a file's striping if the source and destination directories are on the same file system. You can, of course, use the "cp" command to create a second copy with different striping; to do so, copy the file into a directory with the intended stripe parameters.
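To inspect the current stripe settings on a directory or file, use the standard Lustre query command:

$ lfs getstripe $PWD         # report stripe count and stripe size for the current directory
$ lfs getstripe mybigfile    # report striping for an existing file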

Software on Hikari

You can discover already installed software using TACC's Software Search tool or the "module spider" or "module avail" commands.

You are welcome to install packages in your own $HOME or $WORK directories. No super-user privileges are needed; simply use the "--prefix" option (or your package's equivalent) when configuring and building the package.
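For example, a typical Autotools-style installation into $WORK (the package directory name is a placeholder):

login1$ ./configure --prefix=$WORK/apps/mypackage   # stage the install under $WORK
login1$ make
login1$ make install
login1$ export PATH=$WORK/apps/mypackage/bin:$PATH  # make the new executables visible to your shell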

Users must provide their own license for commercial packages.

At this time, the following software packages are available:

------------------------------ /opt/apps/gcc5_2/modulefiles -------------------------------
python/2.7.11

---------------------------- /opt/apps/intel16/modulefiles -----------------------------
impi/5.1.2 (L)    ompi/1.10.2

-------------------------------- /opt/apps/modulefiles ---------------------------------
TACC/1.0        (L)    cmake/3.10.2 (D)    idev/1.0               settarg/7.7
autotools/1.1          dmtcp/2.4.4         intel/16.0.1    (L)    spades/3.10.1
bedtools/2.25.0        gatk/3.4.46         jellyfish/2.2.6        tacc-singularity/2.3.1
boost/1.61.0           gcc/4.9.3           lmod/7.7               tophat/2.1.1
bowtie/2.2.6           gcc/5.2.0    (D)    picard/1.137           trinityrnaseq/2.0.6
bwa/0.7.12             git/2.8.3           remora/1.7             vcftools/0.1.13
cmake/3.7.1            hwloc/1.11.2        samtools/1.3           verifybamid/1.1.2

Building Software

As on Stampede2, Hikari's default programming environment is based on the Intel compiler and Intel MPI library. For compiling MPI codes, the familiar commands "mpicc", "mpicxx", "mpif90" and "mpif77" are available. The compilers "icc", "icpc", and "ifort" are also directly accessible. To access the most recent versions of GCC, load the gcc module.
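For example (the optimization flags shown are typical choices for Hikari's Haswell processors, not requirements):

login1$ icc    -O3 -xCORE-AVX2 -o mycode.exe mycode.c     # serial C code with the Intel compiler
login1$ mpif90 -O3 -xCORE-AVX2 -o mympi.exe  mympi.f90    # MPI Fortran code with the Intel MPI wrapper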

You're welcome to download third-party research software and install it in your own account. See the Stampede2 user guide for more information.

Hikari has no MPI library available for use with the GCC compilers.

Consult the Stampede2 User Guide for detailed information on building software.

Intel Math Kernel Library (MKL)

The Intel Math Kernel Library (MKL) is a collection of highly optimized functions implementing some of the most important mathematical kernels used in computational science. For in depth information on MKL, consult the Stampede2 User Guide. Note that MKL differs on Hikari in that:

  • MKL is not available with the GNU compilers
  • Total number of hardware threads is 48 on Hikari's Haswell nodes
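When using the Intel compilers, the simplest way to link MKL is the "-mkl" compiler flag; for more complex configurations consult the Stampede2 User Guide or Intel's MKL Link Line Advisor. A minimal sketch:

login1$ icc   -mkl -o mycode.exe mycode.c      # C code linked against MKL
login1$ ifort -mkl -o mycode.exe mycode.f90    # Fortran code linked against MKL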

Running Jobs on the Hikari Compute Nodes

Job Accounting

Hikari's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). We then multiply by a charge rate that reflects supply and demand for the type of node you use. For any given job, the total cost in SUs is:

SUs billed (node-hrs) = ( # nodes ) x ( job duration in wall clock hours ) x ( charge rate per node-hour )
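For example, a job that uses 10 nodes for 2.5 hours in the normal queue (charge rate of 1 SU per node-hour) is billed 10 x 2.5 x 1 = 25 SUs.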

The system tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. In general, your queue wait time will be less if you request only the time you need: the scheduler will have an easier time finding a slot for the 2 hours you really need than for the 48 hours you request in your job script.

Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.

To display a summary of your TACC project balances and disk quotas at any time, execute:

login1$ /usr/local/etc/taccinfo # Generally more current than balances displayed on the portals.

Slurm Job Scheduler

Hikari's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs.

Slurm Partitions (Queues)

Currently available queues include those listed in the Hikari Production Queues table below.

Queue Name   Max Nodes per Job (assoc'd cores)*   Max Duration   Max Jobs in Queue*   Charge Rate (per node-hour)
normal       432 nodes (10,368 cores)*            48 hrs         50*                  1 Service Unit (SU)

* Queue status as of March 5, 2018. Queues and limits are subject to change without notice.

Submitting Batch Jobs with sbatch

Use Slurm's "sbatch" command to submit a batch job to one of the Hikari queues:

login1$ sbatch myjobscript

Here "myjobscript" is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run.

In your job script you (1) use #SBATCH directives to request computing resources (e.g. 10 nodes for 2 hrs); and then (2) use shell commands to specify what work you're going to do once your job begins. There are many possibilities: you might elect to launch a single application, or you might want to accomplish several steps in a workflow. You may even choose to launch more than one application at the same time. The details will vary, and there are many possibilities. But your own job script will probably include at least one launch line that is a variation of one of the examples described here.

Hikari MPI job script
Hikari OpenMP job script
Hikari Hybrid (MPI + OpenMP) job script
Hikari Serial job script
Hikari Protected Data job script
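As a concrete illustration, below is a minimal sketch of what an MPI job script on Hikari might look like; the executable name (mycode.exe) and project name (myproject) are placeholders, and you should adapt the resource requests to your own work:

#!/bin/bash
#SBATCH -J mympi             # job name
#SBATCH -o mympi.o%j         # output and error file name (%j expands to the job ID)
#SBATCH -N 4                 # number of nodes requested
#SBATCH -n 96                # total number of MPI tasks (24 per Haswell node)
#SBATCH -p normal            # queue (partition)
#SBATCH -t 02:00:00          # run time (hh:mm:ss)
#SBATCH -A myproject         # allocation to charge (needed only if you have more than one)

module list                  # document the modules in effect for reproducibility
ibrun ./mycode.exe           # launch the MPI executable with ibrun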

Your job will run in the environment it inherits at submission time; this environment includes the modules you have loaded and the current working directory. In most cases you should run your application(s) after loading the same modules that you used to build them. You can of course use your job submission script to modify this environment by defining new environment variables; changing the values of existing environment variables; loading or unloading modules; changing directory; or specifying relative or absolute paths to files. Do not use the Slurm "--export" option to manage your job's environment: doing so can interfere with the way the system propagates the inherited environment.

The Common sbatch Options table below describes some of the most common sbatch command options. Slurm directives begin with "#SBATCH"; most have a short form (e.g. "-N") and a long form (e.g. "--nodes"). You can pass options to sbatch using either the command line or job script; most users find that the job script is the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases "#!/bin/bash" or "#!/bin/csh" is the right choice. Avoid "#!/bin/sh" (its startup behavior can lead to subtle problems on Hikari), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Hikari.

Table 4. Common sbatch Options

Option Argument Comments
-p queue_name Submits to queue (partition) designated by queue_name
-J job_name Job Name
-N total_nodes Required. Define the resources you need by specifying either:
(1) "-N" and "-n"; or
(2) "-N" and "--ntasks-per-node".
-n total_tasks This is total MPI tasks in this job. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as "-N".
--ntasks-per-node (or --tasks-per-node) tasks_per_node This is MPI tasks per node. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1.
-t hh:mm:ss Required. Wall clock time for job.
--mail-user= email_address Specify the email address to use for notifications.
--mail-type= begin, end, fail, or all Specify when user notifications are to be sent (one option per line).
-o output_file Direct job standard output to output_file (without -e option error goes to this file)
-e error_file Direct job error output to error_file
-d= afterok:jobid Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes
-A projectnumber Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects.
-a (or --array) N/A Not available. Use the launcher module for parameter sweeps and other collections of related serial jobs.
--mem N/A Not available. If you attempt to use this option, the scheduler will not accept your job.
--export= N/A Avoid this option on Hikari. Using it is rarely necessary and can interfere with the way the system propagates your environment.

By default, Slurm writes all console output to a file named "slurm-%j.out", where %j is the numerical job ID. To specify a different filename use the "-o" option. To save stdout (standard out) and stderr (standard error) to separate files, specify both "-o" and "-e".

Launching Applications

The primary purpose of your job script is to launch your research application. How you do so depends on several factors, especially (1) the type of application (e.g. MPI, OpenMP, serial), and (2) what you're trying to accomplish (e.g. launch a single instance, complete several steps in a workflow, run several applications simultaneously within the same job). While there are many possibilities, your own job script will probably include a launch line that is a variation of one of the examples described in this section:

Launching One Serial Application

To launch a serial application, simply call the executable. Specify the path to the executable in either the PATH environment variable or in the call to the executable itself:

mycode.exe                   # executable in a directory listed in $PATH
$WORK/apps/myprov/mycode.exe # explicit full path to executable
./mycode.exe                 # executable in current directory
./mycode.exe -m -k 6 input1  # executable with notional input options

Launching One Multi-Threaded Application

Launch a threaded application the same way. Be sure to specify the number of threads. Note that the default OpenMP thread count is 1.

export OMP_NUM_THREADS=24    # 24 total OpenMP threads (1 per Haswell core)
./mycode.exe

Launching One MPI Application

To launch an MPI application, use the TACC-specific MPI launcher "ibrun", which is a Hikari-aware replacement for generic MPI launchers like mpirun and mpiexec. In most cases the only arguments you need are the name of your executable followed by any options your executable needs. When you call ibrun without other arguments, your Slurm #SBATCH directives will determine the number of ranks (MPI tasks) and number of nodes on which your program runs.

ibrun ./mycode.exe           # use ibrun instead of mpirun or mpiexec

Launching One Hybrid (MPI+Threads) Application

When launching a single application you generally don't need to worry about affinity: both Intel MPI and MVAPICH2 will distribute and pin tasks and threads in a sensible way.

export OMP_NUM_THREADS=8    # 8 OpenMP threads per MPI rank
ibrun ./mycode.exe          # use ibrun instead of mpirun or mpiexec

More Than One Serial Application in the Same Job

TACC's "launcher" utility provides an easy way to launch more than one serial application in a single job. This is a great way to engage in a popular form of High Throughput Computing: running parameter sweeps (one serial application against many different input datasets) on several nodes simultaneously. The launcher utility will execute your specified list of independent serial commands, distributing the tasks evenly, pinning them to specific cores, and scheduling them to keep cores busy. Execute "module load launcher" followed by "module help launcher" for more information.

MPI Applications One at a Time

To run one MPI application after another (or any sequence of commands one at a time), simply list them in your job script in the order in which you'd like them to execute. When one application/command completes, the next one will begin.

module load git
module list
./preprocess.sh
ibrun ./mycode.exe input1    # runs after preprocess.sh completes
ibrun ./mycode.exe input2    # runs after previous MPI app completes

Interactive Sessions with idev and srun

TACC's own idev utility is the best way to begin an interactive session on one or more compute nodes. To launch a thirty-minute session on a single node in the development queue, simply execute:

login1$ idev

You'll then see output that includes the following excerpts:

...
-----------------------------------------------------------------
      Welcome to the HIKARI Supercomputer          
-----------------------------------------------------------------
...

-> After your idev job begins to run, a command prompt will appear,
-> and you can begin your interactive development session. 
-> We will report the job status every 4 seconds: (PD=pending, R=running).

->job status:  PD
->job status:  PD
...
c449-001$

The "job status" messages indicate that your interactive session is waiting in the queue. When your session begins, you'll see a command prompt on a compute node (in this case, the node with hostname c449-001). If this is the first time you launch idev, the prompts may invite you to choose a default project and a default number of tasks per node for future idev sessions.

For command line options and other information, execute "idev --help". It's easy to tailor your submission request (e.g. shorter or longer duration) using Slurm-like syntax:

login1$ idev -p normal -N 2 -n 8 -m 150 # normal queue, 2 nodes, 8 total tasks, 150 minutes

For more information see the idev documentation.

You can also launch an interactive session with Slurm's srun command, though there's no clear reason to prefer srun to idev. A typical launch line would look like this:

login1$ srun --pty -N 2 -n 8 -t 2:30:00 -p normal /bin/bash -l # same conditions as above

Interactive Sessions using ssh

If you have a batch job or interactive session running on a compute node, you "own the node": you can connect via ssh to open a new interactive session on that node. This is an especially convenient way to monitor your applications' progress. One particularly helpful example: login to a compute node that you own, execute "top", then press the "1" key to see a display that allows you to monitor thread ("CPU") and memory use.

There are many ways to determine the nodes on which you are running a job, including feedback messages following your sbatch submission, the compute node command prompt in an idev session, and the squeue or showq utilities. The sequence of identifying your compute node then connecting to it would look like this:

login1$ squeue -u bjones
 JOBID       PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
858811     development idv46796   bjones  R       0:39      1 c448-004
login1$ ssh c448-004
...
c448-004$

SLURM Environment Variables

Be sure to distinguish between internal Slurm replacement symbols (e.g. "%j" described above) and Linux environment variables defined by Slurm (e.g. SLURM_JOBID). Execute "env | grep SLURM" from within your job script to see the full list of Slurm environment variables and their values. You can use Slurm replacement symbols like "%j" only to construct a Slurm filename pattern; they are not meaningful to your Linux shell. Conversely, you can use Slurm environment variables in the shell portion of your job script but not in an #SBATCH directive. For example, the following directive will not work the way you might think:

#SBATCH -o myMPI.o${SLURM_JOB_ID}   # incorrect

Instead, use the following directive:

#SBATCH -o myMPI.o%j     # "%j" expands to your job's numerical job ID

Similarly, you cannot use paths like $HOME or $WORK in an #SBATCH directive.

For more information on this and other matters related to Slurm job submission, see the Slurm online documentation; the man pages for both Slurm itself ("man slurm") and its individual command (e.g. "man sbatch"); as well as numerous other online resources.

Protected Data on Hikari

IMPORTANT: If your project contains protected data, contact help@tacc.utexas.edu to obtain a private and explicit path to your own protected storage location.

This path will look something like /corral-secure/projects/xxxx/yyy/. This path will only be visible to your project group. You MUST always be sure you are using that path, and only that path, for storing and accessing protected data.

"Category 1", HIPAA-PHI, and other Restricted Data Types

Hikari storage can be used for data subject to special security controls, such as HIPAA Protected Health Information and data subject to FERPA controls, but only in controlled circumstances after appropriate review and approval by both TACC and the organization or PI that owns the data. Users with sensitive data are required to contact the administrators through the help system, or by sending an e-mail to help@tacc.utexas.edu, to discuss their needs before storing such data on Hikari.

Data Retention Policies

Files on Hikari are never "purged" by automated processes. Once an allocation has expired, data will typically be retained for 6 months; however, data may be deleted at any time at the discretion of the system administrators unless an allocation request is pending. Important data should never be stored on only one system, and users are encouraged to maintain a second copy of their most important data on another system at TACC or elsewhere.

Running Jobs with Protected Data

Running batch jobs with protected data is similar to running normal batch jobs, with one important restriction: all I/O involving protected data must use your project's designated protected storage path.

A typical workflow begins by staging protected data to the dedicated volume on Hikari using an encrypted file transfer protocol such as scp:

localhost$ scp sensitivefile \ 
    username@hikari.tacc.utexas.edu:/corral-secure/projects/project_directory/

Then write a job script in which all I/O occurs on the designated volume, and submit it to the appropriate queue:

#!/bin/bash
#SBATCH -J myMPI            # job name
#SBATCH -o myMPI.o%j        # output and error file name
#SBATCH -N 2                # number of nodes requested
#SBATCH -n 48               # total number of mpi tasks requested
#SBATCH -t 01:30:00         # run time (hh:mm:ss) - 1.5 hours
#SBATCH -p normal           # job will run in the normal queue

#SBATCH --mail-user=username@tacc.utexas.edu
#SBATCH --mail-type=begin   # email me when the job starts
#SBATCH --mail-type=end     # email me when the job finishes

# run the executable named a.out
ibrun ./a.out --input /corral-secure/projects/project_directory/sensitivefile \
              --output /corral-secure/projects/project_directory/sensitiveoutput

Submit the job in the usual way:

login1$ sbatch myjobscript

Monitoring Jobs and Queues

Several commands are available to help you plan and track your job submissions as well as check the status of the Slurm queues.

When interpreting queue and job status, remember that Hikari doesn't operate on a first-come-first-served basis. Instead, the sophisticated, tunable algorithms built into Slurm attempt to keep the system busy, while scheduling jobs in a way that is as fair as possible to everyone. At times this means leaving nodes idle ("draining the queue") to make room for a large job that would otherwise never run. It also means considering each user's "fair share", scheduling jobs so that those who haven't run jobs recently may have a slightly higher priority than those who have.

Monitoring Queue Status with sinfo and qlimits

To display resource limits for the Hikari queues, execute "qlimits". The result is real-time data; the corresponding information in this document's table of Hikari queues may lag behind the actual configuration that the qlimits utility displays.

Slurm's "sinfo" command allows you to monitor the status of the queues. If you execute sinfo without arguments, you'll see a list of every node in the system together with its status. To skip the node list and produce a tight, alphabetized summary of the available queues and their status, execute:

login1$ sinfo -S+P -o "%18P %8a %20F"    # compact summary of queue status

An excerpt from this command's output looks like this:

PARTITION          AVAIL    NODES(A/I/O/T)
development*       up       41/70/1/112
normal             up       3685/8/3/3696

The AVAIL column displays the overall status of each queue (up or down), while the column labeled "NODES(A/I/O/T)" shows the number of nodes in each of several states ("Allocated", "Idle", "Other", and "Total"). Execute "man sinfo" for more information. Use caution when reading the generic documentation, however: some available fields are not meaningful or are misleading on Hikari (e.g. TIMELIMIT, displayed using the "%l" option).

Monitoring Job Status with squeue

Slurm's squeue command allows you to monitor jobs in the queues, whether pending (waiting) or currently running:

login1$ squeue             # show all jobs in all queues
login1$ squeue -u bjones   # show all jobs owned by bjones
login1$ man squeue         # more info

An excerpt from the default output looks like this:

 JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
170361      normal   spec12   bjones PD       0:00     32 (Resources)
170356      normal    mal2d slindsey PD       0:00     30 (Priority)
170204      normal   rr2-a2 tg123456 PD       0:00      1 (Dependency)
170250 development idv59074  aturing  R      29:30      1 c455-044
169669      normal  04-99a1  aturing CG    2:47:47      1 c425-003

The column labeled "ST" displays each job's status:

  • "PD" means "Pending" (waiting);
  • "R" means "Running";
  • "CG" means "Completing" (cleaning up after exiting the job script).

Pending jobs appear in order of decreasing priority. The last column includes a nodelist for running/completing jobs, or a reason for pending jobs. If you submit a job before a scheduled system maintenance period, and the job cannot complete before the maintenance begins, your job will run when the maintenance/reservation concludes. The squeue command will report "ReqNodeNotAvailable" ("Required Node Not Available"). The job will remain in the PD state until Hikari returns to production.

The default format for squeue now reports total nodes associated with a job rather than cores, tasks, or hardware threads.

The default format lists all nodes assigned to displayed jobs; this can make the output difficult to read. A handy variation that suppresses the nodelist is:

login1$ squeue -o "%.10i %.12P %.12j %.9u %.2t %.9M %.6D"  # suppress nodelist

The "--start" option displays job start times, including very rough estimates for the expected start times of some pending jobs that are relatively high in the queue:

login1$ squeue --start -j 167635     # display estimated start time for job 167635

Monitoring Job Status with showq

TACC's "showq" utility mimics a tool that originated in the PBS project, and serves as a popular alternative to the Slurm "squeue" command:

login1$ showq            # show all jobs; default format
login1$ showq -u         # show your own jobs
login1$ showq -U bjones  # show jobs associated with user bjones
login1$ showq -h         # more info

The output groups jobs in four categories: ACTIVE, WAITING, BLOCKED, and COMPLETING/ERRORED. A BLOCKED job is one that cannot yet run due to temporary circumstances (e.g. a pending maintenance or other large reservation).

If your waiting job cannot complete before a maintenance/reservation begins, showq will display its state as "WaitNod" ("Waiting for Nodes"). The job will remain in this state until Hikari returns to production.

The default format for showq now reports total nodes associated with a job rather than cores, tasks, or hardware threads.

Other Job Management Commands (scancel, scontrol, and sacct)

It's not possible to add resources to a job (e.g. allow more time) once you've submitted the job to the queue.

To cancel a pending or running job, first determine its jobid, then use scancel:

login1$ squeue -u bjones    # one way to determine jobid
   JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  170361      normal   spec12   bjones PD       0:00     32 (Resources)
login1$ scancel 170361      # cancel job

For detailed information about the configuration of a specific job, use scontrol:

login1$ scontrol show job=170361

To view some accounting data associated with your own jobs, use sacct:

login1$ sacct --starttime 2017-08-01  # show jobs that started on or after this date

Dependent Jobs using sbatch

You can use sbatch to help manage workflows that involve multiple steps: the "--dependency" option allows you to launch jobs that depend on the completion (or successful completion) of another job. For example you could use this technique to split into three jobs a workflow that requires you to (1) compile on a single node; then (2) compute on 40 nodes; then finally (3) post-process your results using 4 nodes.

login1$ sbatch --dependency=afterok:173210 myjobscript

For more information see the Slurm online documentation. Note that you can use $SLURM_JOBID from one job to find the jobid you'll need to construct the sbatch launch line for a subsequent one. But also remember that you can't use sbatch to submit a job from a compute node.
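One convenient pattern is to capture each job ID at submission time; the "--parsable" option tells sbatch to print just the job ID (the script names below are placeholders):

login1$ JOBID=$(sbatch --parsable compile.sh)                                # step 1: compile on 1 node
login1$ JOBID=$(sbatch --parsable --dependency=afterok:$JOBID compute.sh)    # step 2: compute on 40 nodes
login1$ sbatch --dependency=afterok:$JOBID postprocess.sh                    # step 3: post-process on 4 nodes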

Policies

  • TACC HIPAA Policy Statement
  • TACC Usage Policy

Revision History

"Last Update" at the top of this document is the date of the most recent change to this document. This revision history is a list of non-trivial updates; it excludes routine items such as corrected typos and minor format changes.
