Deep Learning at TACC
Last update: February 16, 2018

Scientists across domains are actively exploring and adopting deep learning as a cutting-edge methodology for making research breakthroughs. At TACC, our mission is to enable discoveries that advance science and society through the application of advanced computing technologies, so we are embracing this new class of application on our high-end computing platforms.

TACC supports three deep learning frameworks, namely Caffe, MXNet, and TensorFlow, on the Stampede2 supercomputer and the Maverick GPU cluster. These frameworks expose high-level interfaces for specifying deep learning architectures and for training, tuning, and validating models. Deep learning practitioners and domain scientists exploring the methodology should consider these frameworks for their research. Wikipedia offers a helpful comparison of these three frameworks.

Deep Learning Software

Single-Node Mode on Stampede2

  1. Get on a compute node using idev (https://portal.tacc.utexas.edu/software/idev)
    login1$ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
  2. Copy the model and data (you may copy the test directory to another location)
    $ cp -r /work/00946/zzhang/stampede2/test $SCRATCH/
  3. Enter the directory
    $ cd $SCRATCH/test
  4. Load the Caffe module
    $ module load caffe
  5. Train the model from the command line
    $ caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt
  6. Submit a Slurm job
    $ sbatch run-1node.slurm

    NOTES: You need to update the queue name ("-p" option) and the project name ("-A" option) in the script. A sketch of a comparable script appears after this list.

  7. Use the Python interface
     $ python
     >>> import caffe
     >>> print(caffe.__file__)
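
The exact contents of run-1node.slurm may differ from what is shown here; the following is a minimal sketch of a comparable single-node batch script (the job name, output file, queue, and wall time are illustrative placeholders):

    #!/bin/bash
    #SBATCH -J caffe-1node         # job name (placeholder)
    #SBATCH -o caffe-1node.o%j     # stdout/stderr file (placeholder)
    #SBATCH -N 1                   # one node
    #SBATCH -n 1                   # one task; Caffe uses OpenMP threads within the node
    #SBATCH -p QUEUE               # update the queue name
    #SBATCH -A PROJECT             # update the project name
    #SBATCH -t 01:00:00            # wall time

    module load caffe
    caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt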

Multi-Node Mode on Stampede2

  1. Get on a compute node using idev (https://portal.tacc.utexas.edu/software/idev)
    login1$ idev -A PROJECT -q QUEUE -N 4 -n 4 -t 01:00:00
  2. Copy the model and data (you may copy the test directory to another location)
    $ cp -r /scratch/00946/zzhang/stampede2/test $SCRATCH/
  3. Enter the directory
    $ cd $SCRATCH/test
  4. Load the Caffe module
    $ module load caffe
  5. Train the model:
    $ ibrun -np 4 caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt
  6. Quit the idev session:
    $ exit

    NOTES: The "-np" option value in the ibrun command has to be exactly the same as the "-N" and "-n" option in the idev or SLURM script. The point is to launch one process on each node. The Intel Caffe uses OpenMP for single node parallelism and MPI for multiple-node parallelism. Launching the cifar10 example without changing the batch_size in examples/cifar10/cifar10_full_train_test.prototxt is weak scaling. That is to say, by increasing the number of nodes, we are also increasing the batch size. If you are thinking about strong scaling, please decrease the batch size accordingly. From our experience, weak scaling can achieve ~80% efficiency on 512 KNL nodes while the strong scaling efficiency drops at ~50% on 8 KNL nodes.

  7. Submit a Slurm job

    Edit "examples/cifar10/cifar10_full_train_test.prototxt", replace the data source "examples/cifar10/cifar10_train_lmdb" and "examples/cifar10/cifar10_test_lmdb" with "/dev/shm/cifar10_train_lmdb" and "/dev/shm/cifar10_test_lmdb", respectively.

    Then run:

    $ sbatch run-4nodes.slurm

    NOTES: You need to update the queue name ("-p" option) and the project name ("-A" option) in the script. A sketch of a comparable script appears after this list.
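
For reference, here is a minimal sketch of a comparable 4-node script, assuming the LMDBs are staged to /dev/shm with the broadcast helper described in the FAQ below (whether that helper supports /dev/shm destinations is an assumption; its documented examples use /tmp). The actual run-4nodes.slurm under /work/00946/zzhang/stampede2/test/ may stage the data differently and should be treated as authoritative:

    #!/bin/bash
    #SBATCH -J caffe-4nodes        # job name (placeholder)
    #SBATCH -o caffe-4nodes.o%j    # stdout/stderr file (placeholder)
    #SBATCH -N 4                   # four nodes
    #SBATCH -n 4                   # one task per node; must match the ibrun -np value
    #SBATCH -p QUEUE               # update the queue name
    #SBATCH -A PROJECT             # update the project name
    #SBATCH -t 01:00:00

    module load caffe

    # Stage the CIFAR-10 LMDBs to /dev/shm on every node. The destination must
    # match the data source paths edited into cifar10_full_train_test.prototxt.
    cp -r examples/cifar10/cifar10_train_lmdb /dev/shm/
    cp -r examples/cifar10/cifar10_test_lmdb /dev/shm/
    /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /dev/shm/cifar10_train_lmdb/data.mdb 4
    /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /dev/shm/cifar10_test_lmdb/data.mdb 4

    # One Caffe process per node; OpenMP handles parallelism within each node.
    ibrun -np 4 caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt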

Caffe on Maverick

On Maverick, Caffe is only available in single-node mode.

  1. Get on a compute node using idev
    $ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
  2. Copy the model and data (you may copy the test directory to another location)
    $ cp -r /work/00946/zzhang/maverick/test $WORK/
  3. Enter the directory
    $ cd $WORK/test
  4. Load the Caffe module
    $ module load caffe
  5. Train the model from the command line (a GPU variant is noted after this list)
    $ caffe.bin train --solver=caffe-examples/cifar10/cifar10_full_solver.prototxt
  6. Quit the idev session
    $ exit
  7. Use the Python interface
     $ python
     >>> import caffe
     >>> print(caffe.__file__)
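
If the installed Caffe build on Maverick has GPU support (expected on a GPU cluster, but not confirmed here), Caffe's standard "-gpu" flag can be added in step 5 to select a device:

$ caffe.bin train --solver=caffe-examples/cifar10/cifar10_full_solver.prototxt -gpu 0

Whether the GPU is actually used also depends on the solver_mode setting in the solver prototxt; the -gpu flag normally overrides it.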

Running MXNet on Maverick

MXNet is only available in single-node mode on Maverick.

  1. Get on a compute node using idev
    $ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
  2. Copy the model and data (you may copy the test directory to another location)
    $ cp -r /work/00946/zzhang/maverick/test $WORK/
  3. Enter the directory
    $ cd $WORK/test
  4. Load the MXNet module
    $ module load mxnet
  5. Train the model from the command line (additional options are noted after this list)
    $ python mxnet-examples/image-classification/train_cifar10.py --network resnet --gpus 0
  6. Quit the idev session
    $ exit
  7. Use the Python interface
     $ python
     >>> import mxnet
     >>> print(mxnet.__file__)
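
The example training script accepts further command-line options; the exact flag names depend on the version of the MXNet examples bundled in the test directory, so check the built-in help first:

$ python mxnet-examples/image-classification/train_cifar10.py --help
$ python mxnet-examples/image-classification/train_cifar10.py --network resnet --gpus 0 \
      --batch-size 128 --num-epochs 10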

Running TensorFlow on Maverick

On Maverick, TensorFlow is only available in single-node mode.

  1. Get on a compute node using idev
    $ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
  2. Copy the model and data (you may copy the test directory to another location)
    $ cp -r /work/00946/zzhang/maverick/test $WORK/
  3. Enter the directory
    $ cd $WORK/test
  4. Load TensorFlow Module
    $ module load tensorflow-gpu
  5. Train the model from the command line (additional options are noted after this list)
    $ python tf-examples/tf_cnn_benchmarks.py --model=alexnet
  6. Quit the idev session:
    $ exit
  7. Use the Python interface
     $ python
     >>> import tensorflow as tf
     >>> print(tf.__file__)
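
tf_cnn_benchmarks.py also takes flags for the model, batch size, and number of batches. The names below are typical of the TensorFlow benchmark suite but may differ in the bundled copy, so consult the built-in help:

$ python tf-examples/tf_cnn_benchmarks.py --help
$ python tf-examples/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --num_batches=100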

FAQ

Q: I am missing Python packages when using Caffe, MXNet, or TensorFlow. What should I do?

A: These deep learning frameworks usually depend on many other packages; see, for example, Caffe's dependency list: https://github.com/intel/caffe/blob/master/python/requirements.txt. On TACC resources, you can install missing packages in user space by running

$ pip install --user package-name
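
For example, scikit-image appears on Caffe's requirements list; a quick way to install it in user space and confirm that Python can find it:

$ pip install --user scikit-image
$ python -c "import skimage; print(skimage.__version__)"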

Q: When I run Caffe on multiple nodes, the program hangs after loading the model. What is going on?

A: This may be caused by an LMDB concurrent-reading issue. To work around it, broadcast the LMDB instance to every node's local SSD, which is mounted at /tmp, using lines like those below in your SLURM script. They let one node copy the database files from the shared file system to its /tmp, then broadcast them to /tmp on all four nodes in the allocation. See /work/00946/zzhang/stampede2/test/run-4nodes.slurm for a real usage example.

$ cp -r examples/cifar10/cifar10_*_lmdb /tmp/
$ cp -r examples/cifar10/mean.binaryproto /tmp/

$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/cifar10_train_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/cifar10_test_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/mean.binaryproto 4

Then update the LMDB paths in the corresponding network specification file (e.g., https://github.com/intel/caffe/blob/master/examples/cifar10/cifar10_full_train_test.prototxt). Do not forget to set "shuffle: true" in the data_param; otherwise you may see degraded accuracy, even with a large batch size.
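
As a convenience, the path edits can be scripted; the sed commands below are a sketch that assumes the stock example prototxt paths (adding "shuffle: true" to the data_param is still a manual edit):

$ sed -i -e 's|examples/cifar10/cifar10_train_lmdb|/tmp/cifar10_train_lmdb|' \
         -e 's|examples/cifar10/cifar10_test_lmdb|/tmp/cifar10_test_lmdb|' \
         -e 's|examples/cifar10/mean.binaryproto|/tmp/mean.binaryproto|' \
         examples/cifar10/cifar10_full_train_test.prototxt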

Q: I want to train on the ImageNet 1K-category dataset. How should I proceed?

A: Users must download their own copy of the ImageNet-1K dataset in accordance with ImageNet's terms of access. The following instructions assume that your ImageNet-1K dataset is available on Stampede2, e.g., at /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed.

You must use our file broadcasting tool to distribute this 43 GB database across all nodes in an allocation. Reading the database concurrently from many nodes can degrade the file system's performance, and users who do so may be banned by our administrators.

Please add the following lines (this example broadcasts the compressed database to /tmp on four nodes) to your SLURM script:

$ cp -r /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/ilsvrc12_train_lmdb /tmp/
$ cp -r /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/ilsvrc12_val_lmdb /tmp/
$ cp /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/imagenet_mean.binaryproto /tmp/
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/ilsvrc12_train_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/ilsvrc12_val_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/imagenet_mean.binaryproto 4
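
Putting the pieces together, a 4-node ImageNet training job might look like the sketch below. The solver file my_imagenet_solver.prototxt is a hypothetical placeholder for your own solver and network definition, whose data layer must read from /tmp/ilsvrc12_train_lmdb (and whose mean file, if used, must point at /tmp/imagenet_mean.binaryproto):

    #!/bin/bash
    #SBATCH -J caffe-imagenet      # job name (placeholder)
    #SBATCH -o caffe-imagenet.o%j  # stdout/stderr file (placeholder)
    #SBATCH -N 4                   # four nodes
    #SBATCH -n 4                   # one task per node; must match ibrun -np
    #SBATCH -p QUEUE               # update the queue name
    #SBATCH -A PROJECT             # update the project name
    #SBATCH -t 12:00:00            # illustrative wall time

    module load caffe

    # Stage and broadcast the training database and mean file to /tmp on all four nodes.
    cp -r /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/ilsvrc12_train_lmdb /tmp/
    cp /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/imagenet_mean.binaryproto /tmp/
    /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/ilsvrc12_train_lmdb/data.mdb 4
    /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/imagenet_mean.binaryproto 4

    # my_imagenet_solver.prototxt is a placeholder for your own solver file.
    ibrun -np 4 caffe.bin train -engine "MKL2017" --solver=my_imagenet_solver.prototxt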