Caffe on TACC's Stampede2
Last update: October 12, 2018

Caffe is currently installed only on TACC's Stampede2 resource, where Intel's Caffe distribution (v1.1.1) is available as a module.

Follow the step-by-step instructions below to train a model in single-node or multi-node mode.

Single-Node Mode

You can train your model in single-node mode on the command line or via a batch job.

Command Line

  1. Use the idev utility to grab a compute node. Update the queue ("-p" option) and the project name ("-A" option).

    login1$ idev -A projectname -p queue -N 1 -n 1 -t 01:00:00
  2. Copy the model and data. You may copy the test directory to another location:

    $ cp -r /home1/apps/caffe/test $SCRATCH/
  3. Change to the test directory

    $ cd $SCRATCH/test
  4. Load the Caffe module

    $ module load hdf5/1.8.16 caffe/1.1.1
  5. Copy the dataset to the RAM disk (/dev/shm)

    $ cp -r examples/cifar10/cifar10_*_lmdb \
         examples/cifar10/mean.binaryproto /dev/shm/
  6. Train the model from the command line

    $ caffe.bin train -engine "MKL2017" \
         --solver=examples/cifar10/cifar10_full_solver.prototxt
  7. Exit idev and relinquish the compute node:

    $ exit

Single-Node via Batch Job

You can run the same training as a batch job instead. Submit a Slurm job from the login node:

login1$ sbatch run-1node.slurm
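
The provided run-1node.slurm script performs the same steps as the interactive session above. Its exact contents may differ; the sketch below, with placeholder queue and project names, simply reproduces those steps in batch form:

#!/bin/bash
#SBATCH -J caffe-1node       # job name
#SBATCH -N 1                 # one node
#SBATCH -n 1                 # one task
#SBATCH -p queue             # queue (partition) name
#SBATCH -A projectname       # project (allocation) name
#SBATCH -t 01:00:00          # run time (hh:mm:ss)

module load hdf5/1.8.16 caffe/1.1.1

# Submit this script from $SCRATCH/test so the relative paths below resolve.
# Stage the dataset on the RAM disk, then train (same commands as the idev session).
cp -r examples/cifar10/cifar10_*_lmdb \
    examples/cifar10/mean.binaryproto /dev/shm/

caffe.bin train -engine "MKL2017" \
    --solver=examples/cifar10/cifar10_full_solver.prototxt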

Multi-Node Mode

  1. Copy the model and data. You may copy the test directory to another location. NOTE: Update the queue name ("-p" option) and the project name ("-A" option) in the Slurm script.

    $ cp -r /home1/apps/caffe/test $SCRATCH/
  2. Change to the test directory

    $ cd $SCRATCH/test
  3. Submit a Slurm job using the batch script in the "/examples" subdirectory.

    NOTES: The "-np" value passed to ibrun must be exactly the same as the "-N" and "-n" values in the idev command or Slurm script; the point is to launch one process on each node. Intel Caffe uses OpenMP for single-node parallelism and MPI for multi-node parallelism. Running the cifar10 example without changing the batch_size in examples/cifar10/cifar10_full_train_test.prototxt is weak scaling: by increasing the number of nodes, you also increase the effective batch size. If you plan to train with a large batch size, consult the Layer-wise Adaptive Rate Scaling (LARS) algorithm in Intel Caffe. A sketch of such a multi-node script appears after this list.

    $ sbatch run-4nodes.slurm
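
The sketch below illustrates the one-process-per-node layout described in the notes above. It is not the exact contents of run-4nodes.slurm (refer to that file for the real script); the job name, queue name, and project name are placeholders:

#!/bin/bash
#SBATCH -J caffe-4nodes      # job name
#SBATCH -N 4                 # number of nodes
#SBATCH -n 4                 # total MPI tasks (one per node)
#SBATCH -p queue             # queue (partition) name
#SBATCH -A projectname       # project (allocation) name
#SBATCH -t 01:00:00          # run time (hh:mm:ss)

module load hdf5/1.8.16 caffe/1.1.1

# -np must equal the -N and -n values above: one Caffe process per node,
# with OpenMP threads inside each node and MPI across nodes.
# The real run-4nodes.slurm also broadcasts the LMDB files to /tmp; see the FAQ below.
ibrun -np 4 caffe.bin train -engine "MKL2017" \
    --solver=examples/cifar10/cifar10_full_solver.prototxt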

FAQ

Q: I am missing Python packages when using Caffe or TensorFlow. What should I do?

A: These deep learning frameworks usually depend on many other packages (see, e.g., the Caffe package dependency list). On TACC resources, you can install missing packages in user space by running
login1$ pip install --user package-name

Q: When I run Caffe on multiple nodes, the program hangs after loading the model. What is going on?

A: This may be caused by an LMDB concurrent-reading issue. To work around the problem, broadcast the LMDB database to every node's local SSD, which is mounted at "/tmp", by adding lines such as the ones below to your Slurm script. They let one node read the database files from the shared file system and then broadcast them to /tmp on all four nodes in the allocation. Refer to /home1/apps/caffe/test/run-4nodes.slurm for a real usage example.
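
The exact lines depend on your dataset and node count. Here is a sketch for the cifar10 databases on a four-node allocation, following the broadcast-mpi.sh pattern shown in the ImageNet answer below (the /tmp destination names are placeholders):

# Broadcast the cifar10 LMDB files and mean image from $SCRATCH to /tmp on all 4 nodes.
/home1/apps/dl-tools/bin/broadcast-mpi.sh $SCRATCH/test/examples/cifar10/cifar10_train_lmdb/data.mdb /tmp/cifar10_train_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/broadcast-mpi.sh $SCRATCH/test/examples/cifar10/cifar10_test_lmdb/data.mdb /tmp/cifar10_test_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/broadcast-mpi.sh $SCRATCH/test/examples/cifar10/mean.binaryproto /tmp/mean.binaryproto 4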

Then update the LMDB paths in the corresponding specification file (e.g., https://github.com/intel/caffe/blob/master/examples/cifar10/cifar10_full_train_test.prototxt) to point at the /tmp copies. Do not forget to set "shuffle: true" in the data_param; otherwise you can get degraded accuracy, even with a large batch size.

Q: I want to train on the ImageNet-1K (1000-category) dataset. How should I proceed?

A: Users must download their own copy of the ImageNet-1K dataset according to ImageNet's terms of access. The following instructions assume that you have the dataset available on Stampede2, e.g., at /path/to/ilsvrc12_train_lmdb.

It is mandatory to use our file broadcasting tool to distribute this 43 GB database across all nodes in an allocation. Concurrently reading it from many nodes can degrade the file system's performance, and users who do so may be banned by our administrators.

Add the following lines (an example of broadcasting the database to /tmp on four nodes) to your Slurm script:

/home1/apps/dl-tools/bin/broadcast-mpi.sh /path/to/ilsvrc12_train_lmdb/data.mdb /tmp/ilsvrc12_train_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/broadcast-mpi.sh /path/to/ilsvrc12_val_lmdb/data.mdb /tmp/ilsvrc12_val_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/broadcast-mpi.sh /path/to/imagenet_mean.binaryproto /tmp/imagenet_mean.binaryproto 4
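
After the broadcast completes, update the data layer paths in your train/val prototxt to the /tmp copies and launch training. The lines below are a sketch of the remaining Slurm-script steps, assuming a four-node allocation; the solver path is a placeholder for a solver prototxt you have prepared:

module load hdf5/1.8.16 caffe/1.1.1

# One Caffe process per node; -np must match the Slurm -N and -n settings.
ibrun -np 4 caffe.bin train -engine "MKL2017" \
    --solver=/path/to/imagenet_solver.prototxt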