Tensorflow at TACC
Last update: November 1, 2018

 

This document is in progress and information is subject to change.

Scientists across domains are actively exploring and adopting deep learning as a cutting-edge methodology for making research breakthroughs. At TACC, our mission is to enable discoveries that advance science and society through the application of advanced computing technologies. We are therefore embracing this new class of applications on our high-end computing platforms.

TACC supports the Keras+TensorFlow+Horovod stack. This framework exposes high level interfaces for deep learning architecture specification, model training, tuning, and validation. Deep learning practitioners and domain scientists who are exploring the deep learning methodology should consider this framework for their research.

Installation and Versions at TACC

TensorFlow is installed on TACC's Stampede2 and Maverick2 resources.

  • TensorFlow v1.6.0 to v1.10.0 are available on Stampede2.
  • Parallel Training with Keras, TensorFlow, and Horovod is available on both Stampede2 and Maverick2.

Installation on Stampede2

  1. Use TACC's idev utility to grab a single compute node for 1 hour:

    login1$ idev -N 1 -n 1 -m 60
  2. You can install TensorFlow with Python 2.7 or Python 3.6. Select one:

    • install TensorFlow with Python 2.7

        c123-456$ pip install --user /home1/apps/tensorflow/builds/tensorflow-1.10.0-cp27-cp27mu-linux_x86_64.whl
        c123-456$ pip install --user horovod keras h5py
    • or install TensorFlow with Python 3.6

        c123-456$ module load python3
        c123-456$ pip3 install --user /home1/apps/tensorflow/builds/tensorflow-1.10.0-cp36-cp36m-linux_x86_64.whl
        c123-456$ pip3 install --user horovod keras h5py
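To confirm where pip placed the packages, you can print Python's per-user site directory (a quick sanity check, not TACC-specific; the exact path varies with the Python version):

```shell
# Print the directory where "pip install --user" places packages.
# User-installed packages such as tensorflow and horovod live here.
python3 -c "import site; print(site.USER_SITE)"
```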

Single-Node Mode

  1. Use the idev utility to grab a single compute node for 1 hour:

     login1$ idev -N 1 -n 1 -m 60
  2. Download the TensorFlow benchmarks to your $SCRATCH directory. Check out the corresponding branch for your TensorFlow version. In this example we use cnn_tf_v1.10_compatible.

     $ cd $SCRATCH
     $ git clone https://github.com/tensorflow/benchmarks.git
     $ cd benchmarks
     $ git checkout cnn_tf_v1.10_compatible
  3. Benchmark the performance with a synthetic dataset:

    Remember to use your own values for the queue ("-p") and project name ("-A") options. This example runs on KNL nodes. For SKX nodes, set export OMP_NUM_THREADS=48.

     $ cd scripts/tf_cnn_benchmarks
     $ export KMP_BLOCKTIME=0
     $ export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
     $ export OMP_NUM_THREADS=66
     $ ibrun -np 1  taskset -c 0-64,68-132,136-200,204-268 \
         python tf_cnn_benchmarks.py --model resnet50 --batch_size 128 --data_format \
         NCHW --num_intra_threads 64 --num_inter_threads 3 --distortions=False --num_batches 100
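The taskset list above pins the process to four blocks of logical CPUs, 65 CPUs per block, with block starts spaced 68 apart (matching KNL's four hardware threads per core). A small pure-Python sketch, with the stride and span taken from the command above, generates the same list:

```python
def core_list(blocks=4, stride=68, span=65):
    """Build a taskset-style CPU list: one "start-end" range per block."""
    ranges = []
    for b in range(blocks):
        start = b * stride
        ranges.append(f"{start}-{start + span - 1}")
    return ",".join(ranges)

print(core_list())  # 0-64,68-132,136-200,204-268
```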

Multi-Node Mode

  1. Use the idev utility to grab four compute nodes for one hour:

    login1$ idev -N 4 -n 4 -m 60
  2. Install Horovod

    $ pip install --user horovod
  3. Download the TensorFlow benchmarks to your $SCRATCH directory. Check out the corresponding branch for your TensorFlow version. In this example, we use cnn_tf_v1.10_compatible.

     $ cd $SCRATCH
      $ git clone https://github.com/tensorflow/benchmarks.git
      $ cd benchmarks
      $ git checkout cnn_tf_v1.10_compatible
  4. Benchmark the performance with a synthetic dataset on 4 nodes:

    • "-np" specifies the total number of processes;
    • "--model" specifies the neural network model;
    • "--batch_size" specifies the number of samples in each iteration;
    • "--variable_update" specifies using Horovod to synchronize gradients;
    • "--data_format" tells TensorFlow that the input data is ordered as sample count, channel, height, and width (NCHW);
    • "--num_intra_threads" specifies the number of threads used for computation within a compute node;
    • "--num_inter_threads" specifies the number of threads used for communication;
    • "--num_batches" specifies the total number of iterations to run.

    Note: You need to update the queue name ("-p" option) and the project name ("-A" option). The above example runs on KNL nodes. For SKX nodes, set export OMP_NUM_THREADS=48 and --num_intra_threads 48.

     $ cd benchmarks/scripts/tf_cnn_benchmarks
     $ export KMP_BLOCKTIME=0
     $ export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
     $ export OMP_NUM_THREADS=66
     $ ibrun -np 4 \
         python tf_cnn_benchmarks.py --model resnet50 --batch_size 128 --variable_update \
         horovod --data_format NCHW --num_intra_threads 64 --num_inter_threads 3 \
         --num_batches 100
  5. Run the benchmarks in a batch job with the supplied run-4nodes.slurm script.

      $ cd $SCRATCH/benchmarks/scripts/tf_cnn_benchmarks
     $ cp /home1/apps/tensorflow/test/run-4nodes.slurm .
     $ sbatch run-4nodes.slurm
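One point worth checking before scaling up: with --variable_update horovod, each process trains on its own batch, so the effective global batch size is the per-process batch size times the number of processes. Plain arithmetic with the numbers from the 4-node command above:

```python
per_worker_batch = 128   # --batch_size (per process)
num_workers = 4          # ibrun -np 4, one process per node
num_batches = 100        # --num_batches

global_batch = per_worker_batch * num_workers  # samples per training step
total_images = global_batch * num_batches      # images processed in the whole run

print(global_batch, total_images)  # 512 51200
```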

Installation on Maverick2

  1. Use the idev utility to grab a single compute node for one hour:

    login1$ idev -N 1 -n 1 -m 60
  2. You can install TensorFlow with Python 2.7 or Python 3.6. Select one:

    • install TensorFlow with Python 2.7

        c123-456$ module load python
        c123-456$ pip install --user tensorflow-gpu
        c123-456$ pip install --user keras h5py
    • or with Python 3.6

        c123-456$ module load python3
        c123-456$ pip3 install --user tensorflow-gpu
        c123-456$ pip3 install --user keras h5py

Single-Node Mode

  1. Use the idev utility to grab a single compute node for one hour:

     login1$ idev -N 1 -n 1 -m 60
  2. Download the TensorFlow benchmarks to your $WORK directory, then check out the branch that matches your TensorFlow version.

      c123-456$ cdw; git clone https://github.com/tensorflow/benchmarks.git
      c123-456$ cd benchmarks
      c123-456$ git checkout branch_name
  3. Benchmark the performance with a synthetic dataset on 1 GPU:

     c123-456$ cd scripts/tf_cnn_benchmarks
     c123-456$ module load cuda/9.0 cudnn/7.0
     c123-456$ python tf_cnn_benchmarks.py \
         --variable_update=horovod --num_gpus=1 \
         --model resnet50 --batch_size 32 --num_batches 200
  4. Benchmark the performance with a synthetic dataset on 4 GPUs:

     c123-456$ cd scripts/tf_cnn_benchmarks
     c123-456$ module load cuda/9.0 cudnn/7.0
     c123-456$ ibrun -np 4 \
         python tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 \
         --model resnet50 --batch_size 32 --num_batches 200

Multi-Node Mode

  1. Use the idev utility to grab two compute nodes for one hour:

     login1$ idev -N 2 -n 2 -t 01:00:00
  2. Download the TensorFlow benchmarks to your $WORK directory. Check out the branch that matches your TensorFlow version.

      c123-456$ cdw; git clone https://github.com/tensorflow/benchmarks.git
      c123-456$ cd benchmarks
      c123-456$ git checkout branch_name
  3. Benchmark the performance with a synthetic dataset on these two nodes using 8 GPUs:

     c123-456$ cd scripts/tf_cnn_benchmarks
     c123-456$ module load cuda/9.0 cudnn/7.0
      c123-456$ ibrun -np 8 python tf_cnn_benchmarks.py --variable_update=horovod \
          --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200
  4. Benchmark the performance with a batch job:

     $ cd $WORK/benchmarks/scripts/tf_cnn_benchmarks
     $ cp /home1/apps/tensorflow/test/run-2nodes.slurm .
     $ sbatch run-2nodes.slurm
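In the 8-process run above, ibrun places four ranks on each of the two nodes, and Horovod gives each rank one GPU, so a rank's node and local GPU index follow from simple integer arithmetic (a sketch; the 4-GPUs-per-node figure is an assumption based on this example):

```python
def place_rank(rank, gpus_per_node=4):
    """Map a global MPI rank to (node index, local GPU index)."""
    return rank // gpus_per_node, rank % gpus_per_node

# Ranks 0-3 land on node 0 (GPUs 0-3); ranks 4-7 on node 1 (GPUs 0-3).
for r in range(8):
    print(r, place_rank(r))
```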

FAQ

Q: I am missing Python packages when using TensorFlow. What should I do?

A: Deep learning frameworks usually depend on many other packages (see, e.g., the Caffe package dependency list). On TACC resources, you can install missing packages in user space by running:

login1$ pip install --user package-name

Q: How can I run my Keras+Horovod program in parallel?

A: Start on one node, run:

ibrun -np 1 python app.py

Monitor GPU usage in another terminal with:

watch -n 5 nvidia-smi

Make sure the process only allocates one GPU. Then run multiple processes on one node:

ibrun -np 4 python app.py

If the program crashes, check the standard output file; the crash may be caused by all processes landing on the same GPU. If so, create a run.sh with the following lines in it:

#!/bin/bash
# Pin each process to one of the node's 4 GPUs based on its MPI rank
export CUDA_VISIBLE_DEVICES=$(($PMI_RANK % 4))
python app.py

Then make run.sh executable and launch it with ibrun:

chmod +x run.sh
ibrun -np 16 ./run.sh
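Alternatively, the GPU can be selected inside the Python program itself, before TensorFlow is imported, by deriving it from the launcher's rank variable (a sketch assuming ibrun exports PMI_RANK and the node has 4 GPUs):

```python
import os

# Rank assigned by the MPI launcher; falls back to 0 for serial runs.
rank = int(os.environ.get("PMI_RANK", "0"))

# Expose exactly one GPU to this process. This must happen before
# TensorFlow (or Keras/Horovod) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % 4)

# import tensorflow as tf  # now sees a single GPU
```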