Stampede Archive: Knights Corner Technical Material
Last update: October 3, 2016 13:14

This document contains technical material on Stampede's Knights Corner (KNC) coprocessors, the first generation of Intel's Many Integrated Core (MIC) architecture. It is a lightly edited collection of KNC material from the Stampede User Guide that was in place just prior to Stampede's Knights Landing (KNL) Upgrade. The KNL Upgrade introduces next generation MIC technology to Stampede.

Note that the KNL upgrade adds new nodes (and new capabilities) to the system, but leaves the original Stampede hardware intact. The current Stampede User Guide focuses on the newer KNL technology, but links to this document for those users whose research needs continue to include the legacy KNC coprocessors. There is no KNL content in this document: all references to terms like MIC, Xeon Phi, etc. refer to the legacy KNC coprocessors available on the Sandy Bridge side of Stampede.

Innovative Computing Capability with Intel KNC Coprocessor

As directed in the NSF award, the system is equipped with an innovative computing component. The TACC innovative solution features an Intel® Many Integrated Core Architecture (Intel® MIC Architecture) coprocessor in each compute node. The Xeon Phi Coprocessor (often called a MIC, pronounced "Mike") has significantly more cores (61), and vector registers four times the width of those on TACC's Lonestar IV system. The basis of the Phi coprocessor is a light-weight x86 core with in-order instruction processing, coupled with heavy-weight 512-bit SIMD registers and instructions. With these two features the Phi die can support 60+ cores, and each vector instruction can operate on 8 double precision (DP) operands. The core count and vector lengths are basic extensions of an x86 processor, and allow the same programming paradigms (serial, threaded and vector) used on other Xeon (E5) processors. Unlike the GPGPU accelerator model, the same program code can be used efficiently on the host and the coprocessor. Also, the same Intel compilers, tools, libraries, etc. that you use on Intel and AMD systems are available for the Phi processors.

These coprocessors contain a large number of (relatively) simple cores running at lower frequency to deliver much higher peak performance per chip than is available using more traditional multi-core approaches. In the case of the Xeon Phi Coprocessor SE10P used in the Stampede system, each coprocessor chip has a peak performance of ~1070 GFLOPS, approximately six times the peak performance of a single Xeon E5 processor, or three times the aggregate peak performance of the two Xeon E5 processors in each Stampede compute node. Each coprocessor is equipped with 8GB of GDDR5 DRAM with a peak bandwidth of 352GB/s, also significantly higher than the 51.2GB/s peak bandwidth available to each Xeon E5 processor chip.

A critical advantage of the Xeon Phi coprocessor is that, unlike GPU-based coprocessors, the processing cores in the Xeon Phi coprocessor run the Intel x86 instruction set (with 64-bit extensions), allowing the use of familiar programming models, software, and tools.

The many-core Phi processor is integrated into each node as a coprocessor, connected to the E5 processors and the external network (HCA card) through a PCI Express (PCIe) interface as shown in Figure 1; the connectivity is similar to the way GPU accelerators are configured in a node.

Figure 1. Stampede Zeus Node: 2 Xeon E5 processors and 1 Xeon Phi coprocessor

The MIC coprocessor, however, runs a Linux Operating System (OS), making the MIC function as a separate Symmetric Multiprocessor (SMP). So, while the MIC can be used as a work offload engine by the E5 processors, it is also capable of working as another independent (SMP) processor. In the latter mode MPI processes can be launched on the MIC and/or the E5 processors. In this "symmetric" mode the MIC appears as an extra node for launching MPI tasks.

System Configuration

All Stampede nodes run CentOS 6.3 and are managed with batch services through Slurm 2.4. Global $HOME, $WORK and $SCRATCH storage areas are supported by three Lustre parallel distributed file systems with 76 IO servers. Inter-node communication (MPI/Lustre) is through an FDR Mellanox InfiniBand network.

The 6400 Dell Zeus C8220z compute nodes are housed in 160 racks (40 nodes/rack), along with two 36-port Mellanox leaf switches. Each node has two Intel E5 8-core (Sandy Bridge) processors and an Intel Xeon Phi 61-core (Knights Corner) coprocessor, connected by an x16 PCIe bus. The host and coprocessor are configured with 32GB of DDR3 and 8GB of GDDR5 memory, respectively. Forty of the compute nodes are reserved for development and are accessible interactively for a limited time. The 16 large-memory nodes are in a single rack. Each node is a Dell PowerEdge R820 server with 4 E5-4650 8-core processors and 1TB of DDR3 memory.

The interconnect is an FDR InfiniBand network of Mellanox switches, consisting of a fat tree topology of eight core-switches and over 320 leaf switches with a 5/4 oversubscription. The network configuration for the compute nodes is shown in Figure 3.

The configuration and features for the compute nodes, interconnect and I/O systems are described below, and summarized in Tables 1 through 4.

Table 1. System Configuration & Performance

Component Technology Performance/Size
Nodes (sleds)  2 8-core Xeon E5 processors, 1 61-core Xeon Phi coprocessor  6400 nodes
Memory  Distributed, 32GB/node  205TB (aggregate)
Shared Disk  Lustre 2.1.3 parallel file system  14PB
Local Disk  SATA (250GB)  1.6PB (aggregate)
Interconnect InfiniBand Mellanox Switches/HCAs FDR 56 Gb/s

Compute nodes

A compute node consists of a Dell C8220z double-wide sled in a 4 rack-unit chassis with 3 other sleds. Each node runs CentOS 6.3 with the 2.6.32 x86_64 Linux kernel. Each node contains two Intel Xeon 8-core 64-bit E5 processors (16 cores in all) on a single board, as an SMP unit. The cores run at 2.7GHz and each supports 8 floating-point operations per clock period, for a peak performance of 21.6 GFLOPS/core or 346 GFLOPS/node. Each node contains 32GB of memory (2GB/core). The memory subsystem has 4 channels from each processor's memory controller to 4 DDR3 ECC DIMMs, each rated at 1600 MT/s (51.2GB/s for all four channels in a socket). The processor interconnect, QPI, runs at 8.0 GT/s between sockets. The Intel Xeon Phi is a special production model with 61 cores running at 1.1 GHz, with a peak performance of 16.2 DP GFLOPS/core or 1.0 DP TFLOPS/Phi. Each coprocessor contains 8GB of GDDR5 memory, with 8 dual-channel controllers, with a peak memory performance of 320GB/s.

Table 2. Dell DCS (Dell Custom Solution) C8220z Compute Node

Component  Technology
Sockets per Node/Cores per Socket  2/8 Xeon E5-2680 2.7GHz (turbo, 3.5)
                                   1/61 Xeon Phi SE10P 1.1GHz
Motherboard  Dell C8220, Intel QPI, C600 Chipset
Memory per Host  32GB 8x4GB 4 channels DDR3-1600MHz
Memory per Coprocessor  8GB GDDR5
QPI  8.0 GT/s
PCI Express Processor  x40 lanes, Gen 3
PCI Express Coprocessor  x16 lanes, Gen 2 (extended)
Disk  250GB 7.5K RPM SATA

Figure 2.

Stampede Zeus Node: 2 Xeon E5 processors and 1 Xeon Phi coprocessor
Intel Xeon Phi Coprocessor

Table 3. PowerEdge R720 Login Nodes

Component  Technology
Nodes  4 login nodes
Processors (Sockets per Node/Cores per Socket)  2/8 Xeon E5-2680, 2.7GHz
Motherboard  Dell R720, Intel QPI, C600 Chipset
Memory per Node  32GB 8x4GB 4 channels/CPU DDR3-1600 (MT/s)
Cache  256KB/core L2; 20MB/CPU L3
Disk (Global)  Lustre, xxGB quota
Disk (Local)  Shared, 432GB SATA 10K rpm

Intel Knights Corner Coprocessor

The E5 architecture includes the following features important to HPC:

  • 4-wide DP (double precision) vector units with the AVX (Advanced Vector Extensions) instruction set
  • 4-channel on-chip (integrated) memory controllers
  • Support for 1600MT/s DDR3 memory
  • Dual Intel QuickPath links between Xeon dual-processor systems, supporting 8.0GT/s
  • Turbo Boost version 2.0, up to a peak of 3.5GHz in turbo mode
  • In these Romley platforms PCIe lanes are controlled by the CPUs (they do not pass through the chip set)
  • Gen 3.0 PCI Express
  • Improved Hyper-Threading (turned off on Stampede, but good for some HPC packages)
  • 64KB L1 cache/core (32KB L1 data and 32KB L1 instruction)


The 56Gb/s FDR InfiniBand interconnect consists of Mellanox switches, fiber cables and HCAs (Host Channel Adapters). Eight 648-port SX6536 core switches and over 320 36-port SX6025 endpoint switches (2 in each compute-node rack) form a 2-level Clos fat tree topology, illustrated in Figure 3. Core and endpoint switches have 73 and 4.0 Tb/s capacities, respectively. There is a 5/4 oversubscription at the endpoint (leaf) switches (20 node input ports : 16 core-switch output ports). Any MPI message travels at most 5 hops from source to destination.

Figure 3.

Stampede Interconnect

File Systems Overview

User-owned storage on the Stampede system is available in three directories, identified by the $HOME, $WORK and $SCRATCH environment variables. These directories are part of separate Lustre shared file systems, and accessible from any node in the system.

Table 4. Storage Systems

Storage Class Size Architecture Features
Local (each node)
  Login: 1TB (432GB partition mounted on /tmp)
  Compute: 250GB (80GB partition mounted on /tmp)
  Big Mem: 600GB (398GB partition mounted on /tmp)
Parallel  14PB  Lustre
  72 Dell R610 data servers (OSS) through IB; user striping allowed
  MPI-IO; XPB, YPB, and ZPB partitions on $HOME/$WORK/$SCRATCH
  4 Dell R710 metadata servers with 2 Dell MD3220 storage arrays
Ranch (Tape Storage)  60PB  SAM-FS (Storage Archive Manager)
  10GB/s connection through 4 GridFTP servers

Building Software

Unix shells allow users to customize their environment via startup files containing scripts. Customizing your environment with startup scripts is not entirely trivial. Below are some simple instructions, as well as an explanation of the shell set up operations.

TACC Bash users should consult the Bash Users' Startup Files: Quick Start Guide document for instructions on how best to set up the user environment.

Compiling for the KNC

The Stampede system's default programming environment uses the Intel C++ and Fortran compilers. These Intel compilers are the only compilers able to compile programs for the Phi coprocessors. This section explains and illustrates the important and basic uses of these compilers for all the programming models covered in this document (OpenMP/MPI on the E5/Phi systems in native mode, offloading, and heterogeneous modes). The Intel compiler commands can be used for both compiling (making ".o" object files) and linking (making an executable from ".o" object files).

Basic Compiler Commands and Serial Program Compiling

Appropriate file name extensions are required for each compiler. By default, the executable filename is "a.out", but it may be renamed with the "-o" option. We use "a.out" throughout this guide to designate a generic executable file. The compiler command performs two operations: it makes a compiled object file (having a ".o" suffix) for each file listed on the command-line, and then combines them with system library files in a link step to create an executable. To compile without the link step, use the "-c" option.
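For example, the two steps can be performed separately by compiling with "-c" and then linking the resulting object file (a generic "prog.c" is assumed here):

login1$ icc -c prog.c
login1$ icc -o a.out prog.o

The first command produces only the "prog.o" object file; the second performs the link step to create the "a.out" executable.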

The same code can be compiled to run either natively on the host or natively on the MIC. Use the same compiler commands for the host (E5) or the MIC (Phi) compiling, but include the "-mmic" option to create a MIC executable. We suggest you name MIC executables with a ".mic" suffix.

Table 5. Compiling Serial Programs

Language Compiler File Extension Example
C icc .c icc compiler_options prog.c
C++ icpc .C, .cc, .cpp, .cxx icpc compiler_options prog.cpp
F77 ifort .f, .for, .ftn ifort compiler_options prog.f
F90 ifort .f90, .fpp ifort compiler_options prog.f90

Table 6. Host, MIC and Host+MIC Offload Compilations

Mode  Required Options  Notes
Host  none  Use -xhost to generate AVX (Advanced Vector Extensions) instructions.
Phi (MIC)  -mmic  Suggestion: name executables with a ".mic" suffix to differentiate them from E5 executables, e.g., "ifort -mmic prog.f90 -o prog.mic"
Host + Offload  none  For automatic offloading of MKL library functions use environment variables; for direct offloading use pragmas.

The following examples illustrate how to rename an executable (-o option), compile for the host (run on the E5 processors), and compile for the MIC (run natively on the MIC):

A C program example:

login1$ icc -xhost -O2 -o flamec.exe     prog.c
login1$ icc -mmic  -O2 -o flamec.exe.mic prog.c

A Fortran program example:

login1$ ifort -xhost -O2 -o flamef.exe     prog.f90
login1$ ifort -mmic  -O2 -o flamef.exe.mic prog.f90

Commonly used options may be placed in an "icc.cfg" or "ifc.cfg" file for compiling C and Fortran code, respectively.

For additional information, execute the compiler command with the "-help" option to display every compiler option, its syntax, and a brief explanation, or display the corresponding man page, as follows:

login1$ icc  -help
login1$ icpc -help
login1$ ifort -help
login1$ man icc
login1$ man icpc
login1$ man ifort

Some of the more important options are listed in the Basic Optimization section of this guide.

Compiling OpenMP Programs

Since each Stampede node has many cores (16 E5 and 61 Phi cores), applications can take advantage of multi-core shared-memory parallelism by using threading paradigms such as OpenMP. For applications with OpenMP parallel directives, include the "-openmp" option on the compiler command to enable parallel thread generation. Use the "-openmp_report" option to display diagnostic information.

Table 7. Important OpenMP Compiler Options

Compiler Options
-openmp Enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
Use whenever OpenMP pragmas are present in code for the E5 processor or Phi coprocessor.
-openmp_report[0|1|2] Controls the OpenMP parallelizer diagnostic level

Below are host and MIC compile examples for enabling OpenMP code directives.

  login1$ icc   -xhost -openmp -O2 -o flamec.exe      prog.c
  login1$ icc   -mmic  -openmp -O2 -o flamec.exe.mic  prog.c
  login1$ ifort -xhost -openmp -O2 -o flamef.exe      prog.f90
  login1$ ifort -mmic  -openmp -O2 -o flamef.exe.mic  prog.f90

The Intel compiler accepts OpenMP pragmas and OpenMP API calls that adhere to the OpenMP 3.1 standard. The $KMP_AFFINITY and OpenMP environment variables that set thread affinity and thread control are described in the "running code" section below.

Compiling MPI Programs

Note: Use the impi (Intel MPI) or mvapich2-mic modules for running MPI applications on the MIC. The regular mvapich2 modules do not support MIC native applications.

Compiling for Phi Offloading

During the execution of an application, work can be redirected to execute on the Phi coprocessor. There are two ways to offload work onto the MIC: automatic offloading, and compiler-assisted offloading. Automatic offloading of a growing set of Intel's MKL (Math Kernel Library) routines can be invoked by simply setting environment variables prior to execution. Compiler-assisted offloading requires the application developer to insert directives within the application code. Even in this explicit form, synchronous offloading can automatically determine and move data between the host and the MIC. Direct control of data allocation, data transfers and asynchronous offloading allows developers to specifically optimize offloading for the spatial and temporal data requirements of their application. The Offload Coding section explains and illustrates the directives used in compiler-assisted offloading.

Automatic Offloading (AO)

Certain routines in the MKL library have been redesigned to run on the host, run as an offloaded routine on the MIC, or run on both with the workload split between the two. Users can control the mode at run time with environment variables. (There is also an API for developers to code the control directly inside the application.) In the example below the "has_dgemm" program is compiled and the "dgemm" routine is loaded from the MKL library in the usual way. The following lines show how the offload is controlled (in a batch job script or an interactive session). The $OMP_NUM_THREADS and $MIC_OMP_NUM_THREADS variables set the number of threads on the host and the MIC, respectively.

Compile and run example for AO:

login1$ icc   -O3 -xhost  -mkl has_dgemm.c
login1$ ifort -O3 -xhost  -mkl has_dgemm.f90
login1$ idev
c401-001$ export MKL_MIC_ENABLE=1
c401-001$ export OMP_NUM_THREADS=16
c401-001$ ./a.out

Compiler-assisted Offloading

The Intel 2013 C/C++ and Fortran compilers support directives that offload regions of work to a Phi coprocessor. When the execution of an offload-enabled application on the host encounters an offload region, the data and instructions are sent to a Phi coprocessor for execution and the host waits for completion. If a Phi coprocessor is not available, the application will run an E5-compiled version of the region on the host.

Specific compiler and loader options for offloaded code can be set as a comma-separated list in a string within the following compiler command options:

-offload-option,mic,compiler,"compiler options"
-offload-option,mic,link,"linker options"

In the following example the host code is compiled with "-O3", and the offloaded code is compiled with "-O2 -fma" and linked with "-g".

login1$ icc -O3 -offload-option,mic,compiler,"-O2 -fma" \
            -offload-option,mic,link,"-g" prog.c

Libraries on the KNC Coprocessor

You can build libraries that run natively on the Phi coprocessor by including the "-mmic" compiler option when invoking the compiler. The $MIC_LD_LIBRARY_PATH environment variable can be used to point to the directory containing the native library. Static native library archives can be created with the "ar" archiver, as is the case with host static library archives. Similarly, dynamic native library archives are created the same way they are created for the host by adding the "-mmic" and "-fPIC" options to the compilation of the object file and using the "-shared" flag when creating the shared library:

login1$ icc -mmic -fPIC -c fun1.c fun2.c
login1$ ar rcs libsampleMIC.a fun1.o fun2.o
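A shared (dynamic) native library can be built from the same position-independent object files by using the "-shared" flag; this sketch reuses the fun1.c and fun2.c files from the example above:

login1$ icc -mmic -fPIC -c fun1.c fun2.c
login1$ icc -mmic -shared -o libsampleMIC.so fun1.o fun2.o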

Static Libraries in Offloaded Code

Conceptually, the compiler, linker and standard library environments on the host and coprocessor are thinly separated. Code for the host is compiled within the host environment and offloaded code within the coprocessor environment. The coprocessor environment can include libraries and these would be available to be called from offloaded code with no need to use special syntax or runtime features.

When the coprocessor is available, offloaded code is loaded when the host version of the executable is loaded, or when the first offload is attempted. At this time the libraries linked with the offloaded code are initialized. The loaded target executable remains in the target memory until the host program terminates. Thus, any global state maintained by the library is maintained across offload instances.

Separate copies of libraries are linked or loaded with the host program and the offloaded coprocessor code so that there are two sets of global states: one on the host and one on the coprocessor. The host code only sees the host state and the offloaded coprocessor code only sees the state of the library on the coprocessor.

To create static coprocessor libraries, first create MIC object code (by compiling offload decorated modules with "-c" or compiling non-decorated modules with the "-c" and "-mmic" options). Once the MIC.o objects (and the CPU .o objects) have been created, use the "xiar" command with the "-qoffload-build" option to build an archive.

login1$ xiar -qoffload-build ar options archive [member...]
login1$ xild -lib -qoffload-build ar options archive [member...]

When supplying the name of the library and the list of its member files to xiar or xild, it is important to only specify the file names associated with the host library and host object files, such as lib.a and file.o. "xiar" and "xild" automatically manipulate the corresponding coprocessor library and member files, libMIC.a and fileMIC.o, respectively. For example:

login1$ xiar -qoffload-build rcs libsample.a fun1.o fun2.o

This will create the libsample.a library for the host and a libsampleMIC.a library for the coprocessor. The libsample.a will contain the object files fun1.o and fun2.o. The libsampleMIC.a library will contain the object files fun1MIC.o and fun2MIC.o.

Linking against the static library follows the same rules as specifying the object files: the host library should be specified and the compiler will automatically incorporate the coprocessor library. The host library can be specified with the linker options -L and -l<library name> (the lib<library name>MIC library will be included automatically by the compiler).
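For example, linking a host object file against the libsample.a library created above might look like this (the "main.o" object file is hypothetical):

login1$ icc main.o -L. -lsample -o a.out

The compiler automatically includes the corresponding libsampleMIC.a for the offloaded code.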

Launching Symmetric Jobs

An MPI application can run tasks on both host CPUs and MICs. This is called symmetric computing because the host CPUs and the MICs act as though they are separate nodes and may each have MPI processes launched on them ("symmetrically" on both systems), unlike offloading, which depends upon the host to distribute work to the MIC. As a reminder, a node contains both host CPUs and a MIC. Below, references to "CPUs" refer to the Sandy Bridge component of the node and references to a "MIC" refer to the Intel Xeon Phi coprocessor card.

There are 4 steps of preparation for a symmetric run:

  1. Build separate executables for the host and the MIC
  2. Determine the appropriate number of CPU and MIC MPI tasks per node, as well as the number of host and MIC OpenMP threads per MPI task.
  3. Create a job script that requests resources and specifies the distribution of tasks across hosts and MICs
  4. Launch the two executables (CPU/MIC binaries) using ibrun.symm.

Presently, CPU and MIC MPI executables for symmetric execution can only be built with impi or mvapich2-mic. Since the mvapich2 module is loaded by default, swap out this module for either the impi or mvapich2-mic module as shown below.

login1$ module swap mvapich2 impi/mvapich2-mic
login1$ mpif90/mpicc/mpic++ -O3 -xhost myprogram.f90/c/cpp -o a.out.cpu
login1$ mpif90/mpicc/mpic++ -O3 -mmic  myprogram.f90/c/cpp -o a.out.mic

Once MIC and CPU executables for an application have been created, named "a.out.mic" and "a.out.cpu", respectively, the two executables may be launched using ibrun.symm in a job script with the following syntax:

ibrun.symm -m ./a.out.mic -c ./a.out.cpu

where the "-m" and "-c" options specify the MIC and CPU executables, respectively. ibrun.symm does not support the "-o" and "-n" options of the regular ibrun command. Use "-2" instead of "-m" to launch an executable on both MICs when using the normal-2mic queue.

If the executables require command-line arguments, combine the executable and its arguments in quotes, as shown here:

c123-456$ ibrun.symm -m "./a.out.mic args" -c "./a.out.cpu args"

where args are the command-line arguments required by the "a.out.mic" and "a.out.cpu" executables. If the executables require redirection from stdin, create a simple executable shell script to run the executable, e.g.,

#!/bin/sh
./a.out.mic args < inputfile

Note: The bash, tcsh, and csh shells are not available on the MIC. Only the sh shell interpreter runs on the MIC, therefore we begin each MIC shell script with "#!/bin/sh" to specify that the sh shell is to be used and remind users that only sh syntax (a subset of Bash) is allowed in the MIC shell script. Any shell can be used for the host script. These scripts are to be used as arguments of the ibrun.symm command:

login1$ ibrun.symm -m ./mic_script.sh -c ./cpu_script.sh

No quoting is required for executable arguments in the shell scripts.

The total number of tasks to be executed on the host and the number of hosts to use should be entered as Slurm resource options ("-n total_tasks" and "-N nodes") in the job script. However, only the total number of tasks needs to be specified if 16 tasks per host are to be used. The following Slurm directive assumes 16 tasks per node:

#SBATCH -n total_tasks

A combination of the total number of nodes and total number of tasks should be specified when fewer than 16 tasks per host (CPUs) are required. The following Slurm directive allocates total_tasks/nodes tasks per node across the specified number of nodes.

#SBATCH -N nodes -n total_tasks

Use the $MIC_PPN and $MIC_OMP_NUM_THREADS environment variables to set, respectively, the number of MPI tasks per MIC and the number of threads per MIC task. The job script snippet below illustrates a job that will execute 8 MPI CPU tasks per node on 4 nodes, using 2 threads per CPU task. With the "a.out.mic" executable it will execute 2 MPI tasks on each MIC, with 60 threads per MIC task.

#SBATCH -N 4 -n 32

export OMP_NUM_THREADS=2
export MIC_PPN=2
export MIC_OMP_NUM_THREADS=60

ibrun.symm -m a.out.mic  -c a.out.cpu

The MPI tasks will be allocated in consecutive order by node (CPU tasks first, then MIC tasks). For example, the task allocation described by the above script snippet will be:

NODE1:  8 host tasks ( 0 - 7) :  2 MIC tasks ( 8 - 9)
NODE2:  8 host tasks (10 -17) :  2 MIC tasks (18 -19)
NODE3:  8 host tasks (20 -27) :  2 MIC tasks (28 -29)
NODE4:  8 host tasks (30 -37) :  2 MIC tasks (38 -39)

Although the total number of host MPI tasks may still be controlled with the MY_NSLOTS environment variable, the number of MIC MPI tasks will be consistently set across MICs with the MIC_PPN environment variable. If your application needs a more advanced configuration for the CPU and MIC task topology, please contact TACC Consulting.

Sample Symmetric Slurm Job Script

# Example Slurm job script to run symmetric 
# applications, (Host + MIC) or (MIC only) on TACC's 
# Stampede system.
#SBATCH -J symmetric_job     # Job name
#SBATCH -o symmetric_job.o%j # Name of stdout output file (%j expands to jobId)
#SBATCH -e symmetric_job.o%j # Name of stderr output file (%j expands to jobId)
#SBATCH -p development       # Submit to the 'normal' or 'development' queue
#SBATCH -N 2                 # Total number of nodes requested (16 cores/node)
#SBATCH -n 16                # Total number of mpi tasks requested
#SBATCH -t 01:30:00          # Run time (hh:mm:ss) - 1.5 hours
# The next line is required if the user has more than one project
# #SBATCH -A A-yourproject   # Allocation name to charge job against

# The number of host tasks is controlled by Slurm
# In this example it is 16.

# Set the number of MPI tasks per MIC (Default=4)
export MIC_PPN=2

# Set the number of host threads per task (Default=1)
export OMP_NUM_THREADS=2

# Set the number of MIC threads per task (Default=30)
export MIC_OMP_NUM_THREADS=60

# This will start 16 host MPI tasks spread across 2 nodes and 
# 2 MPI tasks on 2 MIC cards resulting in 20 total MPI tasks. 
# Each host MPI task will use 2 threads/task and each MIC MPI task 
# will use 60 threads/task.
# The tasks will be allocated in consecutive order on the nodes
# NODE1:  8 host tasks ( 0 - 7) :  2 MIC tasks ( 8 - 9)
# NODE2:  8 host tasks (10 -17) :  2 MIC tasks (18 -19)

# Launch the symmetric application using ibrun.symm
ibrun.symm -c my_host.exe -m my_mic.exe

KNC Coprocessor Programming

Many Fortran or C/C++ applications designed to run on the E5 processor (host) can be modified to automatically execute blocks of code or routines on the Phi coprocessor through directives. The Intel compiler, without requiring any additional options, will interpret the directives and include Phi executable code within the normal executable binary. A binary with Phi executable offload code can be launched on the host in the usual manner (with ibrun for MPI codes, and as a process execution for serial and OpenMP), and the offloaded sections of code will automatically execute on the Phi coprocessor.

There are two points to remember when discussing computations on the host (E5 CPUs) and coprocessor (Phi):

  1. The instruction sets and architectures of the host E5 and Phi coprocessor are quite similar, but are not identical. (Expect differences in performance.)
  2. Host processors and MIC coprocessors have their own memory subsystems. They are effectively separate SMP systems with their own OS and environment.

Programming details for offloading can be found in the respective Intel user guides listed below.

Automatic Offloading

Some of the MKL routines that perform large amounts of floating point operations relative to data accesses (having computational complexity O(n^3) compared to O(n^2) data accesses; e.g., level 3 BLAS) have been configured with automatic offload (AO) capabilities. This capability allows the user to offload work in the library routines automatically, without any coding changes. No special compiler options are required: just compile with the usual flags and the MKL library load options ("-mkl" is the new shortened way to load the MKL libraries). Then set the $MKL_MIC_ENABLE environment variable to request the automatic offload to occur at run time:

login1$ ifort -mkl -xhost -O3 app_has_MKLdgemm.f90
login1$ export MKL_MIC_ENABLE=1
login1$ idev
c123-456$ ./a.out

For more information and tips on how to integrate the Xeon Phi Automatic Offload capability with common programs, please see Intel Xeon Phi Automatic Offload from C, Fortran, Python, R, and MATLAB.

Depending upon the problem size (e.g., n>2048 for dgemm) the library runtime may choose to run all, part or none of the routine on the coprocessor. Offloading and the work division between the CPU and MIC are transparent to the user; but these may be controlled with environment variables and Fortran/C/C++ APIs (application program interfaces), particularly when compiler-assisted offloading is also employed. Also, MPI applications that use multiple tasks per node will need to adjust the workload division for sharing the coprocessor among all of the tasks. For example, setting the $MKL_MIC_WORKDIVISION environment variable or calling the support function mkl_mic_set_workdivision() with a fraction value advises the runtime to give the MIC that fraction of the work. Set the $OFFLOAD_REPORT environment variable value, or the mkl_mic_set_offload_report function argument, to 0-2 to disclose a range of information, as shown below:

login1$ idev
c123-456$ export MKL_MIC_ENABLE=1
c123-456$ export MKL_MIC_WORKDIVISION=0.5
c123-456$ export OFFLOAD_REPORT=2
c123-456$ ./a.out

Details and a list of all the automatic offload controls are available in the MKL User Guide document.

Compiler Assisted Offloading

Developers can explicitly direct a block of code or a routine to be executed on the MIC in the base Fortran and C/C++ languages using directives. The code to be executed on the MIC is called an offload region. No special coding is required in an offloaded region, and Intel-specific and OpenMP threading methods may be used. Code Example 1 illustrates an offload directive for a code block containing an OpenMP loop. The "target(mic:0)" clause specifies that the MIC coprocessor with id=0 should execute the code region.

When the host execution encounters the offload region the runtime performs several offload operations: detection of a target Phi coprocessor, allocation of memory space on the coprocessor, data transfer from the host to the coprocessor, execution of the coprocessor binary on the Phi, transfer of data from the Phi back to the host after the completion of the coprocessor binary, and memory deallocation. The offload model is suitable when the data exchanged between the host and the MIC consists of scalars, arrays, and Fortran derived types and C/C++ structures that can be copied using a simple memcpy. This data characteristic is often described as being flat or bit-wise copyable. The data to be transferred at the offload point need not be declared or allocated in any special way if the data is within scope (as in Code Example 1); although pointer data (arrays pointed to by a pointer) need their size specified (see Advanced Offloading).

Example: Offloaded OpenMP Code Block with Automatic Data Transfer

C code:

  int main(){
    float a[N], b[N], c[N];
    #pragma offload target(mic:0)
    #pragma omp parallel for
    for (int i=0; i<N; i++)
      a[i] = b[i] + c[i];
  }

F90 code:

  program main
    real :: a(N), b(N), c(N)
    !dir$ offload begin target(mic:0)
    !$omp parallel do
    do i=1,N
      a(i) = b(i) + c(i)
    end do
    !$omp end parallel do
    !dir$ end offload
  end program

By default the compiler will recognize any offload directive. During development it is useful to observe the names and sizes of variables tagged for transfer by including the "-opt-report-phase=offload" option as shown here:

login1$ ifort/icc/icpc -openmp -O3 -xhost -opt-report-phase=offload \
            source_file
login1$ export MIC_ENV_PREFIX=MIC
login1$ export OMP_NUM_THREADS=16
login1$ export MIC_OMP_NUM_THREADS=240
login1$ idev
c123-456$ ./a.out

The "-openmp" and "-O3" options apply to both the host (E5 CPU) and offload (MIC) code regions, while "-xhost" is specific to the host code. Environment variables such as $OMP_NUM_THREADS will normally need different values on the host and the MIC. In these cases, variables intended for the MIC should be prefixed with "MIC_" and set on the host as shown above; the "$MIC_ENV_PREFIX" variable must also be set to "MIC". In fact, any prefix may be used, but we strongly recommend using MIC.

Advanced Offloading

A few of the important concepts you will need to develop and optimize offload paradigms are summarized below. The corresponding directives, clauses and qualifiers are explained as well. More details and examples, as well as references to Intel documentation are provided in TACC's Advanced Offloading document.

Data Transfers: in/out/inout: In Code Example 1 the compiler will make sure that the a, b and c arrays are copied over to the MIC before the offloaded region is executed, and are copied back at the end of the execution. Because the a array is only written on the MIC, there is no reason to copy it in to the coprocessor; likewise, because b and c are only read, there is no reason to copy them out. Data intent clauses (in, out, inout) on the offload directive can be used to eliminate these unnecessary transfers.

Persistent Data: alloc_if() and free_if():

The automatic data transfers in Code Example 1 allocate storage on the MIC, transfer the data, and deallocate the storage for each call. If the same data is to be used in different offloads, the data can be made to persist across the offloads by modifying the memory allocation defaults with alloc_if(arg) and free_if(arg) qualifiers within the data intent clauses (in, out, inout). If the argument of alloc_if is false (.false. in Fortran, 0 in C, false in C++) the allocation is not performed; likewise, a false argument to free_if skips the deallocation.

Data Transfer Directive: offload_transfer:

The programmer can transfer data without offloading executable code. The offload_transfer directive fulfills this function. It is a stand-alone directive (requiring no code block), and it accepts all the same data clauses and modifiers as a normal offload directive. One common use case is to load persistent data (asynchronously) onto the MIC at the beginning of a program.

Asynchronous Offloading: signal and wait:

Often a developer may want to continue working on the CPU while data is transferred or offload work is executed. An offload region is executed asynchronously when a signal clause is included on the directive: the host process encountering the offload (offload or offload_transfer) initiates it and then immediately continues executing the program code that follows the offload region. The offload event is identified by a variable argument within the signal clause; a subsequent offload directive, or a stand-alone offload_wait directive, references that variable in a wait clause to block until the event completes.


TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by the TACC Usage Policies, which are documented on the TACC website. Failure to do so may result in suspension or cancellation of the project and its associated allocation, and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities.


Help is available 24/7. Please submit a helpdesk ticket via the TACC User Portal.