Last update: April 03, 2018
Scientists across domains are actively exploring and adopting deep learning as a cutting-edge methodology for making research breakthroughs. At TACC, our mission is to enable discoveries that advance science and society through the application of advanced computing technologies. We are therefore embracing this new class of application on our high-end computing platforms.
TACC supports three deep learning frameworks, namely Caffe, MXNet, and TensorFlow, on the Stampede2 supercomputer and the Maverick GPU cluster. These frameworks expose high-level interfaces for specifying deep learning architectures and for training, tuning, and validating models. Deep learning practitioners and domain scientists exploring deep learning methods should consider these frameworks for their research. Wikipedia offers a useful comparison of these three frameworks.
Deep Learning Software
- Intel distribution of Caffe (v1.0.3) is available as a module on Stampede2.
- The official Caffe (v1.0.0), MXNet (v0.10.0), and TensorFlow (v1.0.0 and v0.11.0) are available as modules on Maverick.
- MXNet (v0.10.0) and TensorFlow (v1.3.0-rc2) are also available on Stampede2 upon request.
Running Caffe
Follow the step-by-step instructions below to run Caffe in single-node or multi-node mode.
Single-Node Mode on Stampede2
- Get on a compute node using idev (https://portal.tacc.utexas.edu/software/idev)
login1$ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
- Copy the model and data; you may copy the test directory to another location
$ cp -r /work/00946/zzhang/stampede2/test $SCRATCH/
- Enter the dir
$ cd $SCRATCH/test
- Load the Caffe module
$ module load caffe
- Train the model with command line
$ caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt
- Submit a Slurm job
$ sbatch run-1node.slurm
NOTES: You need to update the queue name ("-p" option) and the project name ("-A" option) in the script; a sketch of a minimal single-node batch script appears at the end of this list.
- Use the Python interface
$ python
>>> import caffe
>>> print caffe.__file__
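The run-1node.slurm script provided in the test directory already contains these steps. For reference only, a minimal single-node batch script follows the pattern sketched below; the job name, queue, project, and time values are placeholders, and the provided script remains the authoritative version.
#!/bin/bash
#SBATCH -J caffe-1node           # job name (placeholder)
#SBATCH -o caffe-1node.o%j       # stdout/stderr file
#SBATCH -p QUEUE                 # queue name: update
#SBATCH -A PROJECT               # project/allocation name: update
#SBATCH -N 1                     # one node
#SBATCH -n 1                     # one task
#SBATCH -t 01:00:00              # wall-clock limit

module load caffe

# Intel Caffe uses OpenMP within the node, so a single task is enough.
caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt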
Multi-Node Mode on Stampede2
- Get on a compute node using idev (https://portal.tacc.utexas.edu/software/idev)
login1$ idev -A PROJECT -q QUEUE -N 4 -n 4 -t 01:00:00
- Copy the model and data; you may copy the test directory to another location
$ cp -r /scratch/00946/zzhang/stampede2/test $SCRATCH/
- Enter the dir
$ cd $SCRATCH/test
- Load the Caffe module
$ module load caffe
- Train the model:
$ ibrun -np 4 caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt
- Quit the idev session:
$ exit
NOTES: The "-np" option value in the ibrun command has to be exactly the same as the "-N" and "-n" options in the idev command or the SLURM script. The point is to launch one process on each node. Intel Caffe uses OpenMP for single-node parallelism and MPI for multi-node parallelism. Launching the cifar10 example without changing the batch_size in examples/cifar10/cifar10_full_train_test.prototxt is weak scaling; that is, by increasing the number of nodes, we also increase the effective batch size. If you want strong scaling, decrease the per-node batch size accordingly (for example, if the single-node batch_size is 100, setting it to 25 on each of 4 nodes keeps the global batch size at 100). In our experience, weak scaling achieves ~80% efficiency on 512 KNL nodes, while strong scaling efficiency drops to ~50% on 8 KNL nodes.
- Submit a Slurm job
Edit "examples/cifar10/cifar10_full_train_test.prototxt", replacing the data sources "examples/cifar10/cifar10_train_lmdb" and "examples/cifar10/cifar10_test_lmdb" with "/dev/shm/cifar10_train_lmdb" and "/dev/shm/cifar10_test_lmdb", respectively. Then run:
$ sbatch run-4nodes.slurm
NOTES: You need to update the queue name ("-p" option) and the project name ("-A" option). A sketch of a minimal 4-node batch script is shown below.
Caffe on Maverick
On Maverick, Caffe is only available in single-node mode.
- Get on a compute node using idev
$ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
- Copy the model and data; you may copy the test directory to another location
$ cp -r /work/00946/zzhang/maverick/test $WORK/
- Enter the dir
$ cd $WORK/test
- Load the Caffe module
$ module load caffe
- Train the model with command line
$ caffe.bin train --solver=caffe-examples/cifar10/cifar10_full_solver.prototxt
- Quit the idev session
$ exit
- Use the Python interface
$ python
>>> import caffe
>>> print caffe.__file__
Running MXNet on Maverick
MXNet is only available in single-node mode on Maverick.
- Get on a compute node using idev
$ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
- Copy the model and data; you may copy the test directory to another location
$ cp -r /work/00946/zzhang/maverick/test $WORK/
- Enter the dir
$ cd $WORK/test
- Load the MXNet module
$ module load mxnet
- Train the model with command line
$ python mxnet-examples/image-classification/train_cifar10.py --network resnet --gpus 0
- Quit the idev session
$ exit
- Use the Python interface
$ python
>>> import mxnet
>>> print mxnet.__file__
Running TensorFlow on Maverick
On Maverick, TensorFlow is available in single-node mode.
- Get on a compute node using idev
$ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
- Copy the model and data; you may copy the test directory to another location
$ cp -r /work/00946/zzhang/maverick/test $WORK/
- Enter the dir
$ cd $WORK/test
- Load the TensorFlow module
$ module load tensorflow-gpu
- Train the model with command line
$ python tf-examples/tf_cnn_benchmarks.py --model=alexnet
- Quit the idev session:
$ exit
- Use the Python interface
$ python
>>> import tensorflow as tf
>>> print tf.__file__
FAQ
Q: I am missing Python packages when using Caffe, MXNet, or TensorFlow. What shall I do?
A: These deep learning frameworks usually depend on many other packages; see, e.g., the Caffe dependency list: https://github.com/intel/caffe/blob/master/python/requirements.txt. On TACC resources, you can install missing packages in user space by running the command below (a one-step install of the whole list is sketched afterwards):
$ pip install --user package-name
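If several packages are missing at once, you can install a framework's whole dependency list in one step. A sketch, assuming outbound network access from the login node and the Intel Caffe requirements file linked above:
$ wget https://raw.githubusercontent.com/intel/caffe/master/python/requirements.txt
$ pip install --user -r requirements.txt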
Q: I run Caffe on multiple nodes, but the program hangs after loading the model. What is going on?
A: This may be caused by the LMDB concurrent-reading issue. To work around this problem, you may want to broadcast the LMDB instance to every node's local SSD, which is mounted at /tmp. Use the lines below in your SLURM script: they let one of the nodes read from the shared file system into /tmp, then broadcast the database files to /tmp on all four nodes in the allocation. Please refer to /work/00946/zzhang/stampede2/test/run-4nodes.slurm for a real usage example.
$ cp -r examples/cifar10/cifar10_*_lmdb /tmp/
$ cp -r examples/cifar10/mean.binaryproto /tmp/
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/cifar10_train_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/cifar10_test_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/mean.binaryproto 4
Then update the LMDB path in the corresponding specification file (e.g., https://github.com/intel/caffe/blob/master/examples/cifar10/cifar10_full_train_test.prototxt). Do not forget to set "shuffle: true" in the data_param; otherwise, you can get degraded accuracy even with a large batch size.
Q: I want to train on the ImageNet 1K-category dataset. How should I proceed?
A: Users must download their own ImageNet-1K dataset according to ImageNet's terms of access. The following instructions assume that you have your ImageNet-1K dataset available on Stampede2, e.g., /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed.
It is mandatory to use our file broadcasting tool to distribute this 43 GB database across all nodes in an allocation: reading it concurrently from many nodes can degrade the shared filesystem's performance, and users who do so may be banned by our administrators.
Please add the following lines (this example broadcasts the compressed database to /tmp on four nodes) to the SLURM script; a sketch of a complete batch script incorporating these lines follows the commands below:
$ cp -r /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/ilsvrc12_train_lmdb /tmp/
$ cp -r /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/ilsvrc12_val_lmdb /tmp/
$ cp /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/imagenet_mean.binaryproto /tmp/
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/ilsvrc12_train_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/ilsvrc12_val_lmdb/data.mdb 4
$ /work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/imagenet_mean.binaryproto 4
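Putting it together, a complete 4-node ImageNet training job might be structured as in the sketch below. The staging and broadcast lines are the ones shown above; the solver path (models/my_solver.prototxt) is a hypothetical placeholder for your own solver, whose data layers must point at the /tmp copies of the LMDB files.
#!/bin/bash
#SBATCH -J caffe-imagenet        # job name (placeholder)
#SBATCH -o caffe-imagenet.o%j    # stdout/stderr file
#SBATCH -p QUEUE                 # queue name: update
#SBATCH -A PROJECT               # project/allocation name: update
#SBATCH -N 4                     # four nodes
#SBATCH -n 4                     # one task per node
#SBATCH -t 24:00:00              # wall-clock limit: adjust to your run

module load caffe

# Stage the database on one node, then broadcast it to /tmp on all four nodes.
cp -r /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/ilsvrc12_train_lmdb /tmp/
cp -r /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/ilsvrc12_val_lmdb /tmp/
cp /work/00946/zzhang/stampede2/imagenet/ilsvrc_compressed/imagenet_mean.binaryproto /tmp/
/work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/ilsvrc12_train_lmdb/data.mdb 4
/work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/ilsvrc12_val_lmdb/data.mdb 4
/work/00946/zzhang/stampede2/imagenet/broadcast-mpi.sh /tmp/imagenet_mean.binaryproto 4

# One MPI rank per node; OpenMP parallelizes within each node.
# models/my_solver.prototxt is a hypothetical path: substitute your own solver.
ibrun -np 4 caffe.bin train -engine "MKL2017" --solver=models/my_solver.prototxt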