TensorFlow at TACC
Last update: September 11, 2019

Scientists across domains are actively exploring and adopting deep learning as a cutting-edge methodology to make research breakthroughs. At TACC, our mission is to enable discoveries that advance science and society through the application of advanced computing technologies. Thus, we are embracing this new type of application on our high-end computing platforms.

TACC supports the Keras+TensorFlow+Horovod stack. This framework exposes high-level interfaces for deep learning architecture specification, model training, tuning, and validation. Deep learning practitioners and domain scientists who are exploring the deep learning methodology should consider this framework for their research.

This document details how to install TensorFlow, then download and run benchmarks in both single- and multi-node modes. Because TensorFlow and Python versions vary in their compatibility with the Intel compilers and CUDA libraries, the installation instructions are quite specific. Follow them carefully.

Installations at TACC

TensorFlow is installed on TACC's Stampede2 and Maverick2 resources.

  • Parallel Training with Keras, TensorFlow, and Horovod is available on both Stampede2 and Maverick2.
  • TensorFlow v1.13.1 is available on Stampede2.

Before you begin, note that all of the following examples are run on compute, not login, nodes.

Running programs or doing computations on the login nodes may result in account suspension.

Use TACC's idev utility to obtain one or more compute nodes before conducting any TensorFlow activities.
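As a guard against accidentally launching work on a login node, a script can check the hostname before doing anything else. The following is a minimal sketch, assuming login node hostnames begin with "login" (as in the login1$ prompts used in this document). The function name is_login_node is hypothetical and not part of any TACC tool.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) when the given hostname
# looks like a login node.
# Assumption: login node hostnames start with "login", e.g. login1.
is_login_node() {
    case "$1" in
        login*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Check the current host, or a hostname passed as the first argument.
# HOSTNAME is set by bash; strip any domain suffix.
host="${1:-${HOSTNAME%%.*}}"
if is_login_node "$host"; then
    echo "On login node $host: start an idev session before running TensorFlow."
else
    echo "On $host: OK to run."
fi
```

Placing a check like this at the top of a launch script makes the script stop short with a message instead of starting a computation on a login node.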

TensorFlow on Maverick2

These instructions detail installing and running TensorFlow benchmarks on Maverick2. Maverick2 runs TensorFlow 1.13 with Python 3.6.3 and Intel 17.

Installation

Maverick2 supports CUDA/9.0, CUDA/9.2, CUDA/10.0, and CUDA/10.1. Use the CUDA version that corresponds to your TensorFlow installation with Python 3.6. The example below uses CUDA 10.0.

c123-456$ module load intel/17.0.4 python3/3.6.3
c123-456$ module load cuda/10.0 cudnn/7.6.2 nccl/2.4.7
c123-456$ pip3 install --user tensorflow-gpu==1.13.2
c123-456$ pip3 install --user keras
c123-456$ pip3 install --user h5py

We suggest installing Horovod version 0.16.4. If you wish to install other versions of Horovod, please submit a support ticket with the subject "Request for Horovod" and TACC staff will provide special instructions.

c123-456$ CPLUS_INCLUDE_PATH=/opt/apps/intel18/impi18_0/boost/1.66/include \
                CC=gcc HOROVOD_CUDA_HOME=/opt/apps/cuda/10.0 HOROVOD_GPU_ALLREDUCE=NCCL \
                HOROVOD_NCCL_HOME=/opt/apps/cuda10_0/nccl/2.4.7 HOROVOD_WITH_TENSORFLOW=1 \
                HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip3 install \
                --user horovod==0.16.4 --no-cache-dir

Single-Node Mode

Download the TensorFlow benchmarks to your $WORK directory, then check out the branch that matches your TensorFlow version.

c123-456$ cdw; git clone https://github.com/tensorflow/benchmarks.git
c123-456$ cd benchmarks  
c123-456$ git checkout cnn_tf_v1.13_compatible
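The branch name can be derived from the installed TensorFlow version instead of being typed by hand. The sketch below assumes the branches follow the cnn_tf_v<major>.<minor>_compatible pattern seen above. The function name branch_for_tf_version is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: map a TensorFlow version (e.g. 1.13.2) to a
# benchmarks branch name.
# Assumption: branches are named cnn_tf_v<major>.<minor>_compatible.
branch_for_tf_version() {
    local version="$1"
    local major minor
    major="${version%%.*}"   # text before the first dot
    minor="${version#*.}"    # text after the first dot
    minor="${minor%%.*}"     # drop the patch component, if any
    echo "cnn_tf_v${major}.${minor}_compatible"
}

branch_for_tf_version 1.13.2   # prints cnn_tf_v1.13_compatible
```

In a real session, the version string could come from python3 -c 'import tensorflow as tf; print(tf.__version__)' and the result could be passed to git checkout. Not every TensorFlow release necessarily has a matching branch, so it is worth listing the remote branches with git branch -r first.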

Benchmark the performance with a synthetic dataset on 1 GPU:

c123-456$ cd scripts/tf_cnn_benchmarks
c123-456$ module load intel/17.0.4 python3/3.6.3 cuda/10.0 cudnn/7.6.2
c123-456$ python3 tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200

Benchmark the performance with a synthetic dataset on 4 GPUs:

c123-456$ cd scripts/tf_cnn_benchmarks
c123-456$ module load cuda/10.0 cudnn/7.6.2 nccl/2.4.7
c123-456$ ibrun -np 4 python3 tf_cnn_benchmarks.py --variable_update=horovod \
            --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True

Multi-Node Mode

Download the TensorFlow benchmarks to your $WORK directory, then check out the branch that matches your TensorFlow version. This example runs on two nodes in the gtx queue (8 GPUs).

c123-456$ cdw; git clone https://github.com/tensorflow/benchmarks.git
c123-456$ cd benchmarks
c123-456$ git checkout cnn_tf_v1.13_compatible

Benchmark the performance with a synthetic dataset on these two nodes using 8 GPUs:

c123-456$ cd scripts/tf_cnn_benchmarks
c123-456$ module load intel/17.0.4 python3/3.6.3 cuda/10.0 cudnn/7.6.2 nccl/2.4.7
c123-456$ ibrun -np 8 python3 tf_cnn_benchmarks.py --variable_update=horovod \
            --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True
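In these Horovod runs each MPI process drives one GPU (--num_gpus=1) with its own --batch_size, so the number of samples consumed per training step across the whole job is the number of processes times the per-GPU batch size. A minimal sketch of that arithmetic follows. The function name global_batch_size is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: global batch size for data-parallel training in
# which every rank processes `per_gpu_batch` samples per step.
global_batch_size() {
    local ranks="$1" per_gpu_batch="$2"
    echo $(( ranks * per_gpu_batch ))
}

global_batch_size 4 32   # one node, 4 GPUs:  prints 128
global_batch_size 8 32   # two nodes, 8 GPUs: prints 256
```

Keeping this number in mind helps when comparing single-node and multi-node results, because the per-step workload grows with the number of processes even though --batch_size is unchanged.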

TensorFlow on Stampede2

These instructions detail installing and running TensorFlow benchmarks on Stampede2. Stampede2 runs TensorFlow 1.13 with Python 3.7 and Intel 18.

Installation

  • Use TACC's idev utility to grab a single compute node for 1 hour in Stampede2's skx-dev queue:

    login1$ idev -p skx-dev -N 1 -n 1 -m 60
  • Install TensorFlow 1.13 using the default intel/18.0.2 compiler and Python 3.7:

      c123-456$ module load intel/18.0.2 python3/3.7.0
      c123-456$ pip3 install --user /home1/apps/tensorflow/builds/intel-18.0.2/tensorflow-1.13.1-cp37-cp37m-linux_x86_64.whl
  • To install Keras:

    c123-456$ pip3 install --user keras
  • You also need to install h5py:

    c123-456$ pip3 install --user --force-reinstall h5py --no-deps
  • Set the following environment variable before using TensorFlow:

      c123-456$ export PYTHONPATH=$HOME/.local/lib/python3.7/site-packages:/opt/apps/intel18/impi18_0/python3/3.7.0/lib/python3.7/site-packages
  • To install Horovod v0.16.4:

      c123-456$ module load boost/1.68
      c123-456$ CPLUS_INCLUDE_PATH=/opt/apps/intel18/boost/1.68/include HOROVOD_WITH_TENSORFLOW=1 \
                  HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip3 install \
                  --user horovod==0.16.4 --no-cache-dir

Running on Stampede2

Single-Node Mode

  • If you're not already on a compute node, then use TACC's idev utility to grab a single compute node for 1 hour:

      login1$ idev -p skx-dev -N 1 -n 1 -m 60
  • Download the TensorFlow benchmarks to your $SCRATCH directory, then check out the branch that corresponds to your TensorFlow version. This example uses cnn_tf_v1.13_compatible.

      c123-456$ cd $SCRATCH
      c123-456$ git clone https://github.com/tensorflow/benchmarks.git
      c123-456$ cd benchmarks
      c123-456$ git checkout cnn_tf_v1.13_compatible
  • Benchmark the performance with a synthetic dataset:

      c123-456$ cd scripts/tf_cnn_benchmarks
      c123-456$ export KMP_BLOCKTIME=0
      c123-456$ export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
      c123-456$ export OMP_NUM_THREADS=46
      c123-456$ python3 tf_cnn_benchmarks.py --model resnet50 --batch_size 128 \
          --data_format NCHW --num_intra_threads 46 --num_inter_threads 2 \
          --distortions=False --num_batches 100
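The thread settings above add up to 46 + 2 = 48, which matches the 48 physical cores of a Stampede2 SKX node. Assuming that is the intent (the source does not state it), the intra-op thread count can be computed as the core count minus the inter-op thread count. The sketch below illustrates this. The function name intra_op_threads is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: intra-op thread count left over after reserving
# some cores for inter-op threads.
# Assumption: one thread per physical core, no hyperthreads used.
intra_op_threads() {
    local physical_cores="$1" inter_op_threads="$2"
    echo $(( physical_cores - inter_op_threads ))
}

# 48 physical cores per SKX node, 2 inter-op threads as in the command above.
threads="$(intra_op_threads 48 2)"
echo "OMP_NUM_THREADS=$threads --num_intra_threads $threads --num_inter_threads 2"
```

The core count is passed explicitly here because tools such as nproc report hardware threads, which can be twice the number of physical cores when hyperthreading is enabled.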

Multi-Node Mode

  • If you're not already on a compute node, then use TACC's idev utility to grab two compute nodes for 1 hour:

    login1$ idev -p skx-dev -N 2 -n 2 -m 60
  • Download the TensorFlow benchmarks to your $SCRATCH directory, then check out the branch that corresponds to your TensorFlow version. This example uses cnn_tf_v1.13_compatible.

      c123-456$ cd $SCRATCH
      c123-456$ git clone https://github.com/tensorflow/benchmarks.git
      c123-456$ cd benchmarks
      c123-456$ git checkout cnn_tf_v1.13_compatible
  • Benchmark the performance with a synthetic dataset on 2 nodes:

      c123-456$ module load intel/18.0.2 python3/3.7.0
      c123-456$ cd scripts/tf_cnn_benchmarks
      c123-456$ export KMP_BLOCKTIME=0
      c123-456$ export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
      c123-456$ export OMP_NUM_THREADS=46
      c123-456$ ibrun -np 2 python3 tf_cnn_benchmarks.py --model resnet50 \
          --batch_size 128 --variable_update horovod --data_format NCHW \
          --num_intra_threads 46 --num_inter_threads 2 --num_batches 100

    The parameters for this last command are defined as follows:

    • --model specifies the neural network model
    • --batch_size specifies the number of samples in each iteration
    • --variable_update selects Horovod to synchronize gradients across processes
    • --data_format tells TensorFlow that the input data is laid out in the order sample count, channels, height, width (NCHW)
    • --num_intra_threads specifies the number of threads used for computation within a single operation
    • --num_inter_threads specifies the number of threads used for independent operations
    • --num_batches specifies the total number of iterations to run

FAQ

Q: I have missing Python packages when using TensorFlow. What should I do?
A: Deep learning frameworks usually depend on many other packages (see, for example, the Caffe package dependency list). On TACC resources, you can install these packages in user space by running:

$ pip install --user package-name

Q: How can I run my Keras+Horovod program in parallel?
A: Start with a single process on one node:

c123-456$ ibrun -np 1 python app.script

Monitor GPU usage in another terminal with:

c123-456$ watch -n 5 nvidia-smi

Make sure the process allocates only one GPU. Then run multiple processes on one node:

c123-456$ ibrun -np 4 python app.script

If the program crashes, check the standard output file. The crash may be caused by all processes landing on the same GPU. If this is the case, create a run.sh containing the following lines:

#!/bin/bash
export RANK=$(($PMI_RANK%4))
python app.py

Then run:

c123-456$ ibrun -np 16 bash run.sh
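The run.sh above only exports RANK and does not show how the application uses that value. One common way to keep processes on separate GPUs is to restrict each process to a single device through the CUDA_VISIBLE_DEVICES environment variable. The following is a sketch of that approach, not the original author's method. It assumes 4 GPUs per node and that ranks are placed on nodes in contiguous blocks of 4. The function name gpu_for_rank is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: choose a GPU index for an MPI rank so that ranks
# sharing a node land on different GPUs.
# Assumption: ranks fill each node in contiguous blocks of `gpus_per_node`.
gpu_for_rank() {
    local rank="$1" gpus_per_node="$2"
    echo $(( rank % gpus_per_node ))
}

# PMI_RANK is set by the MPI launcher; default to 0 when run outside of it.
rank="${PMI_RANK:-0}"
export CUDA_VISIBLE_DEVICES="$(gpu_for_rank "$rank" 4)"
echo "rank $rank -> GPU $CUDA_VISIBLE_DEVICES"
# python app.py   # launch the training script here
```

With this wrapper each process sees exactly one GPU, which it addresses as device 0. An application that also selects a device by its local rank would need to be adjusted accordingly.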

References