Intel Xeon Phi Automatic Offload

from C, Fortran, Python, R, and MATLAB

Overview

This is an introduction to Automatic Offload (AO) basics for programmers and non-programmers alike. Automatic Offload: Basic Syntax covers information that applies in all circumstances. Subsequent sections describe the specifics of exploiting AO on Stampede from Python, R, MATLAB, and other applications (including those you write in C or Fortran). See the TACC Training Page for additional training materials and lab exercises related to AO.

If your applications perform computationally intensive linear algebra (e.g. large matrix-matrix multiplications or LU factorizations), you will be pleased to know that Intel's Math Kernel Library (MKL) supports a feature called Automatic Offload (AO) that offers the possibility of improving your code's performance by exploiting the Xeon Phi with absolutely no programming effort.

If your code calls supported MKL functions and links appropriately to the library, you need only set a single environment variable (MKL_MIC_ENABLE=1) and launch your code: the MKL will determine whether the computations are demanding enough to justify using the Xeon Phi, then distribute the work between the coprocessor and host CPU so that the two devices work together to complete the calculation. Other environment variables (e.g. $MKL_MIC_WORKDIVISION) give you greater control over the calculation (e.g. setting threads on all devices, or specifying the division of work between host and coprocessors).

Even better, many important software packages perform matrix calculations by calling MKL under the hood, presenting additional opportunities for out-of-the-box automatic offloading to the Xeon Phi. These software packages include Python, R, and MATLAB.

Automatic Offload: Basic Syntax

To enable Automatic Offload from an application that supports it, set the appropriate environment variable before running your code:

export MKL_MIC_ENABLE=1    # bash

...or...

setenv MKL_MIC_ENABLE 1    # csh

Set the thread counts to the maximum available values on both the host and the MIC. Note that the default thread count on the host is 1 (a poor choice for MKL!), and the maximum number of threads in a MIC offload region is 240 (only 60 of the 61 cores, each running 4 hardware threads, are available to the application). From bash, the appropriate commands are:

$ export     OMP_NUM_THREADS=16
$ export MIC_OMP_NUM_THREADS=240

To monitor the offload (in particular, to get real-time feedback regarding whether offload is occurring), use...

$ export OFFLOAD_REPORT=2    # 2 is the most detailed level; 1 gives a terser report

Expect to see offload only for supported functions, and only when the inputs are of sufficient size. A good test problem, well above current thresholds baked into MKL, is a dgemm (double precision matrix-matrix multiplication) involving two 8000 x 8000 matrices.
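
One quick way to run such a test (a sketch, not a benchmark: it assumes the MKL-linked NumPy from the python module described below, but any MKL-linked code that calls dgemm will do) is:

$ export MKL_MIC_ENABLE=1
$ export OFFLOAD_REPORT=2
$ python -c "import numpy as np; a = np.random.rand(8000, 8000); b = np.random.rand(8000, 8000); c = a.dot(b)"

If the offload occurs, the report will appear on stdout as the multiplication runs.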

For finer-grained control of your offload, see http://software.intel.com/en-us/articles/intel-mkl-on-the-intel-xeon-phi-coprocessors. $MKL_MIC_WORKDIVISION, which sets the fraction of the work sent to the coprocessor, is particularly useful. In general, however, the default behavior tends to be fairly close to optimal.
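
For example, from bash (0.5 is an illustrative value, not a recommendation):

$ export MKL_MIC_WORKDIVISION=0.5    # send roughly half the work to the coprocessor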

Automatic offload does work out of the box with two MICs; of course there are more factors to consider and a few more settings to master.
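
For example, the following bash settings (illustrative values; see the Intel article above for the authoritative list of multi-card variables) enable both coprocessors and assign each a share of the work:

$ export OFFLOAD_DEVICES=0,1            # allow offload to both coprocessors
$ export MKL_MIC_0_WORKDIVISION=0.3     # fraction of the work for the first card
$ export MKL_MIC_1_WORKDIVISION=0.3     # fraction of the work for the second card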

Note that the first offload will likely take longer than subsequent offloads: this is because the first offload initializes the environment on the coprocessor, and transfers the MIC-side binaries from the host to the MIC.

Automatic Offload from Python

The python/2.7.6 module, built with the Intel 14 compiler, bundles dozens of computational packages, including NumPy and SciPy, all of which are AO-ready.

To get the correct Python module, be sure to load the Intel 14 compiler first:

$ module load intel/14.0.1.106
$ module load python

Enable AO (see above) before calling functions that themselves call MKL under the hood.
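
For example, SciPy's LU factorization calls LAPACK's dgetrf, one of the AO-eligible routines. A minimal sketch (assuming the modules above are loaded and AO is enabled as described earlier):

$ python -c "import numpy as np; from scipy.linalg import lu_factor; a = np.random.rand(8000, 8000); lu, piv = lu_factor(a)"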

Note that the epd version of Python does not support AO.

Automatic Offload from R

Our newest Rstats module, built with Intel 14 (the intel/14.0.1.106 module) and mvapich2/2.2b, includes version 3.0.3 of the popular R statistical software. This module supports AO, scalable distributed computing with Rmpi, and a number of other new tools and packages.

To get the correct R module, be sure to load the Intel 14 compiler first:

$ module load intel/14.0.1.106
$ module load Rstats

Enable AO (see above) before calling functions that themselves call MKL under the hood.
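
For example, R's %*% operator calls MKL's dgemm in this build. A quick check from the shell (a sketch, assuming the modules above are loaded):

$ export MKL_MIC_ENABLE=1
$ export OFFLOAD_REPORT=2
$ Rscript -e "n <- 8000; a <- matrix(rnorm(n * n), n); c <- a %*% a"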

Note that the R/2.15.1 module does not support AO. Rstats/3.0.2, built with Intel 13, does support AO but is less mature than the newer, default Rstats. The R_mkl module is essentially a beta (and now deprecated) version of Rstats; in theory it supports AO, but it is less mature and less stable than Rstats. The Rstudio module is a browser-based interface to R; its presence or absence has no effect on the availability of AO.

Automatic Offload from MATLAB

Stampede's bring-your-own-license MATLAB module is easy to configure for MKL support and AO: just set the environment variable $BLAS_VERSION to the value $TACC_MKL_LIB/libmkl_rt.so. Enable AO (see above) before calling functions that themselves call MKL under the hood.
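
From bash, for example:

$ export BLAS_VERSION=$TACC_MKL_LIB/libmkl_rt.so    # point MATLAB's BLAS at MKL
$ export MKL_MIC_ENABLE=1                           # then enable AO as usual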

Automatic Offload from Other Applications

Any application built with the Intel 13 or Intel 14 compiler that links to the threaded MKL library can take advantage of AO. This includes your own apps written in C and Fortran.

Compiling with "-mkl" (shorthand for "-mkl=parallel", the threaded library) is usually the easiest way to accomplish the appropriate linking.
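
For example (illustrative file names; assumes the Intel compiler module is loaded):

$ icc   -mkl my_code.c   -o my_app    # C
$ ifort -mkl my_code.f90 -o my_app    # Fortran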

Enable AO (see above) before running your code.


Last update: September 18, 2014
