Caffe on TACC's Stampede2
Last update: November 30, 2018

Scientists across domains are actively exploring and adopting deep learning as a cutting-edge methodology to make research breakthroughs. At TACC, our mission is to enable discoveries that advance science and society through the application of advanced computing technologies. Thus, we are embracing this new type of application on our high-end computing platforms.

TACC supports two deep learning stacks, namely, Caffe+Intel MLSL and Keras+TensorFlow+Horovod. These frameworks expose high level interfaces for deep learning architecture specification, model training, tuning, and validation. Deep learning practitioners and domain scientists who are exploring the deep learning methodology should consider these frameworks for their research.

Installation and Versions

Caffe is currently installed only on TACC's Stampede2 resource. Intel's Caffe distribution (v1.1.1) is available as a module on Stampede2.

Training with the CIFAR-10 Dataset

The Caffe installation includes the CIFAR-10 dataset, a collection of 60K 32x32 color images in 10 classes.

Follow the step-by-step instructions below to train your model with the CIFAR-10 dataset:

Copy the model and data into your own workspace

Set up your Caffe environment by copying the model and data portion of the Caffe installation into your own workspace. We recommend creating a directory in your $SCRATCH folder.

login1$ cp -r /home1/apps/caffe/test $SCRATCH/mycaffedir
login1$ cd $SCRATCH/mycaffedir

Train in an idev Session

  1. Use the idev utility to grab a compute node. Update the queue ("-p" option) and the project name ("-A" option).

    login1$ idev -A projectname -p queue -N 1 -n 1 -t 01:00:00
  2. Set up the local environment:

     $ module load hdf5/1.8.16 caffe/1.1.1
     $ export OMP_NUM_THREADS=64
  3. Copy the training dataset to the compute node's local RAM disk:

    $ cp -r examples/cifar10/cifar10_*_lmdb \
         examples/cifar10/mean.binaryproto /dev/shm/
  4. Train the model on the command line:

    $ caffe.bin train -engine "MKL2017" \
         --solver=examples/cifar10/cifar10_full_solver.prototxt
  5. Exit idev and relinquish the compute node:

    $ exit
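
Step 3 above can be wrapped in a small helper that stops before training if any part of the dataset failed to copy. This is a minimal sketch, not part of the Caffe installation; the function name stage_dataset is hypothetical.

```shell
#!/bin/bash
# stage_dataset DEST ITEM...  (hypothetical helper, not shipped with Caffe)
# Copies each ITEM into DEST and fails if any copy is missing afterwards.
stage_dataset() {
    local dest=$1
    shift
    mkdir -p "${dest}" || return 1
    local item
    for item in "$@"; do
        cp -r "${item}" "${dest}/" || return 1
        [ -e "${dest}/$(basename "${item}")" ] || return 1
    done
}

# On a Stampede2 compute node, from inside your workspace copy:
#   stage_dataset /dev/shm examples/cifar10/cifar10_*_lmdb \
#                 examples/cifar10/mean.binaryproto \
#       && caffe.bin train -engine "MKL2017" \
#              --solver=examples/cifar10/cifar10_full_solver.prototxt
```

Chaining the training command with && means Caffe only starts once every file is confirmed present on the RAM disk.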

Train Via a Batch Job

The examples directory contains two sample scripts, run-1node.slurm and run-4nodes.slurm, for single-node and multi-node (4) jobs, respectively. Edit either of these scripts, altering the "-A" and "-p" options as necessary, then submit the job.

#SBATCH -J Caffe-1nodes         # job name
#SBATCH -o Caffe-1nodes-%j.out  # output and error file name (%j expands to jobID)
#SBATCH -N 1                    # total number of nodes
#SBATCH -n 1                    # total number of tasks (1 task per node)
#SBATCH -p development          # queue (partition) -- normal, development, etc.
#SBATCH -t 1:00:00              # run time (hh:mm:ss) - 1 hour
#SBATCH -A ProjectName          # enter project/allocation name

# Set up the Caffe environment
module load hdf5/1.8.16 caffe/1.1.1

# Copy the dataset to the local RAM disk
cp -r examples/cifar10/cifar10_*_lmdb  examples/cifar10/mean.binaryproto /dev/shm/

# Train using Intel's MKL2017 engine.
ibrun -np 1 caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt

NOTES: The value of the "-np" option in the ibrun command must match the "-N" and "-n" values in the idev session or Slurm script, so that exactly one process is launched on each node. Intel Caffe uses OpenMP for single-node parallelism and MPI for multi-node parallelism. Launching the cifar10 example without changing the batch_size in examples/cifar10/cifar10_full_train_test.prototxt results in weak scaling: increasing the number of nodes also increases the overall batch size. If you plan to train with a large batch size, consider the Layer-wise Adaptive Rate Scaling (LARS) algorithm in Intel Caffe.

login1$ sbatch run-1node.slurm
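
The two points in the notes above (one process per node, and a batch size that grows with the node count) can be checked with a little shell arithmetic. The following is a sketch under stated assumptions: it reads the node and task counts from the SLURM_NNODES and SLURM_NTASKS variables that Slurm sets inside a job, and the per-node batch size of 100 is a placeholder rather than a value taken from the prototxt. Both function names are hypothetical.

```shell
#!/bin/bash
# Succeeds only when the task count equals the node count (one process per node).
one_task_per_node() {
    local nodes=$1 tasks=$2
    [ "${nodes}" -eq "${tasks}" ]
}

# Effective batch size under weak scaling: every process keeps the per-node
# batch_size from the prototxt, so the total grows with the node count.
effective_batch_size() {
    local per_node_batch=$1 nodes=$2
    echo $(( per_node_batch * nodes ))
}

# Inside a Slurm job these come from the environment; the fallbacks of 1
# only allow the sketch to run outside a job.
nodes=${SLURM_NNODES:-1}
tasks=${SLURM_NTASKS:-1}
per_node_batch=100   # placeholder; read the real value from the prototxt

if one_task_per_node "${nodes}" "${tasks}"; then
    echo "np=${nodes} effective_batch_size=$(effective_batch_size "${per_node_batch}" "${nodes}")"
else
    echo "mismatch: ${tasks} tasks on ${nodes} nodes" >&2
fi
```

To hold the overall batch size constant as you add nodes, divide batch_size in the prototxt by the node count instead of leaving it unchanged.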

Training on the ImageNet-1K Dataset

If you want to train on the ImageNet-1K dataset, you may download your own copy according to ImageNet's terms of access. The following instructions assume that your ImageNet-1K dataset is available on Stampede2, e.g., at /path/to/ilsvrc12_train_lmdb.

Users must use our file broadcasting tool to broadcast this 43 GB database across all nodes in the job. Reading this database concurrently from many nodes can degrade the file system's performance, and users who do so may be banned by our administrators.

Add the following lines to your job script. This example broadcasts the compressed database to /tmp on four nodes:

/home1/apps/dl-tools/bin/ /path/to/ilsvrc12_train_lmdb/data.mdb /tmp/ilsvrc12_train_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/ /path/to/ilsvrc12_val_lmdb/data.mdb /tmp/ilsvrc12_val_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/ /path/to/imagenet_mean.binaryproto /tmp/imagenet_mean.binaryproto 4


Q: I have missing Python packages when using Caffe or TensorFlow. What should I do?

A: These deep learning frameworks usually depend on many other packages (see, e.g., the Caffe package dependency list). On TACC resources, you can install these packages in user space by running
login1$ pip install --user package-name

Q: When I run Caffe on multiple nodes, the program hangs after loading the model. What is going on?

A: This may be caused by the LMDB concurrent reading issue. To work around this problem, you may want to broadcast the LMDB instance to every node's local SSD, which is mounted at "/tmp". Use the lines below in your Slurm script. The following lines let one of the nodes read from the shared file system to /tmp, then broadcast the database files to /tmp on all four nodes in the allocation. Refer to /home1/apps/caffe/test/run-4nodes.slurm for a real usage example.

Then update the LMDB path in the corresponding specification file. Do not forget to set "shuffle: true" in the data_param; otherwise, accuracy can degrade even with a large batch size.
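
The path update can be scripted with sed. The sketch below is illustrative only: the helper name repoint_lmdb, the file name train_val.prototxt, and the batch_size value are hypothetical, and the fragment is a stand-in for your real specification file. It writes a modified copy and leaves the original untouched.

```shell
#!/bin/bash
# repoint_lmdb IN OLD_DIR NEW_DIR OUT  (hypothetical helper)
# Rewrites LMDB "source:" paths that start with OLD_DIR so they start with
# NEW_DIR, writing the result to OUT. OLD_DIR is used as a sed pattern, so
# characters such as '.' match loosely.
repoint_lmdb() {
    local in_file=$1 old_dir=$2 new_dir=$3 out_file=$4
    # '|' is the sed delimiter because the paths contain '/'.
    sed "s|source: \"${old_dir}/|source: \"${new_dir}/|" "${in_file}" > "${out_file}"
}

# Demo on a stand-in fragment written to a temporary directory.
demo_dir=$(mktemp -d)
cat > "${demo_dir}/train_val.prototxt" <<'EOF'
data_param {
  source: "/path/to/ilsvrc12_train_lmdb"
  batch_size: 32
  backend: LMDB
  shuffle: true
}
EOF

repoint_lmdb "${demo_dir}/train_val.prototxt" /path/to /tmp \
             "${demo_dir}/train_val.local.prototxt"
grep 'source:' "${demo_dir}/train_val.local.prototxt"
```

Point your solver at the rewritten copy so that every node reads the database from its local /tmp rather than from the shared file system.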