Last update: November 30, 2018
Scientists across domains are actively exploring and adopting deep learning as a cutting-edge methodology to make research breakthroughs. At TACC, our mission is to enable discoveries that advance science and society through the application of advanced computing technologies. Thus, we are embracing this new class of application on our high-end computing platforms.
TACC supports two deep learning stacks, namely, Caffe+Intel MLSL and Keras+TensorFlow+Horovod. These frameworks expose high level interfaces for deep learning architecture specification, model training, tuning, and validation. Deep learning practitioners and domain scientists who are exploring the deep learning methodology should consider these frameworks for their research.
Installation and Versions
Caffe is currently installed only on TACC's Stampede2 resource. Intel's Caffe distribution (v1.1.1) is available as a module on Stampede2.
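To see which Caffe modules are available and load one, you can query the module system. The following is a minimal sketch using standard Lmod commands; the exact version strings you see may differ:
login1$ module spider caffe              # list the Caffe versions installed on Stampede2
login1$ module load hdf5/1.8.16 caffe/1.1.1   # load the modules used in this guide
login1$ module list                      # confirm what is loaded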
Training with the CIFAR-10 Dataset
The Caffe installation includes the CIFAR-10 dataset, a collection of 60K 32x32 color images in 10 classes.
Follow the step by step instructions below to train your model with the CIFAR-10 dataset:
Copy the model and data into your own workspace
Set up your Caffe environment by copying the model and data part of the Caffe installation into your own workspace. We recommend creating a directory in your $SCRATCH folder.
login1$ cp -r /home1/apps/caffe/test $SCRATCH/mycaffedir
login1$ cd $SCRATCH/mycaffedir
Train the Model
You can train the model in an idev session in single-node mode, or submit batch jobs for single- or multi-node runs.
Train in an idev Session
Use the idev utility to grab a compute node. Update the queue ("-p" option) and the project name ("-A" option).
login1$ idev -A projectname -p queue -N 1 -n 1 -t 01:00:00
Set up the local environment:
login1$ module load hdf5/1.8.16 caffe/1.1.1
login1$ export OMP_NUM_THREADS=64
Copy the training dataset to the compute node's local RAM disk:
$ cp -r examples/cifar10/cifar10_*_lmdb \
    examples/cifar10/mean.binaryproto /dev/shm/
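If you would like to confirm that the dataset actually landed on the RAM disk before training, a quick optional check is:
$ ls -lh /dev/shm/        # the lmdb directories and mean.binaryproto should be listed
$ df -h /dev/shm          # verify the RAM disk has room for the dataset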
Train the model on the command line:
$ caffe.bin train -engine "MKL2017" \
    --solver=examples/cifar10/cifar10_full_solver.prototxt
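Caffe's progress messages (loss and test accuracy) normally appear on stderr. If you want to keep a copy of the training log for later inspection, one illustrative variant (not required) is to tee the output to a file:
$ caffe.bin train -engine "MKL2017" \
    --solver=examples/cifar10/cifar10_full_solver.prototxt 2>&1 | tee cifar10-train.log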
Exit idev and relinquish the compute node:
$ exit
Train Via a Batch Job
The examples directory contains two sample scripts, run-1node.slurm and run-4nodes.slurm, for single-node and multi-node (four-node) jobs respectively. Edit either of these scripts, altering the "-A" and "-p" options as necessary, then submit the job.
#!/bin/bash
#SBATCH -J Caffe-1nodes           # job name
#SBATCH -o Caffe-1nodes-%j.out    # output and error file name (%j expands to jobID)
#SBATCH -N 1                      # total number of nodes
#SBATCH -n 1                      # 1 task per node
#SBATCH -p development            # queue (partition) -- normal, development, etc.
#SBATCH -t 1:00:00                # run time (hh:mm:ss) - 1 hour
#SBATCH -A ProjectName            # enter project/allocation name

# Set up the Caffe environment
module load hdf5/1.8.16 caffe/1.1.1
export OMP_NUM_THREADS=64

# Copy the dataset to the local RAM disk
cp -r examples/cifar10/cifar10_*_lmdb examples/cifar10/mean.binaryproto /dev/shm/

# Train using Intel's MKL2017 engine.
ibrun -np 1 caffe.bin train -engine "MKL2017" --solver=examples/cifar10/cifar10_full_solver.prototxt
NOTES: The "-np
" option value in the ibrun command has to be exactly the same as the "-N
" and "-n
" option in the idev session or Slurm script. The point is to launch one process on each node. The Intel Caffe uses OpenMP for single node parallelism and MPI for multiple-node parallelism. Launching the cifar10 example without changing the batch_size
in examples/cifar10/cifar10_full_train_test.prototxt
is weak scaling. That is to say, by increasing the number of nodes, we are also increasing the batch size. If you are thinking about training with large batch size, consult the Layer-wise Adaptive Rate Scaling (LARS) algorithm in Intel Caffe.
login1$ sbatch run-1node.slurm
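To see (or change) the per-node batch size before scaling out, you can inspect the train/test specification directly. A minimal sketch of such a check:
$ grep -n "batch_size" examples/cifar10/cifar10_full_train_test.prototxt
# With one process per node, the effective global batch size is roughly
# (number of nodes) x (batch_size), so a 4-node run quadruples the batch size.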
Training on the ImageNet-1K Dataset
If you want to train on the ImageNet-1K dataset, you may download your own copy according to ImageNet's terms of access. The following instructions assume that your ImageNet-1K dataset is available on Stampede2, e.g., /path/to/ilsvrc12_train_lmdb.
You must use our file broadcasting tool to broadcast this 43 GB database to all nodes in the job. Having many nodes read the database from the shared filesystem concurrently can degrade filesystem performance, and users who do so may be banned by our administrators.
Add the following lines (this example broadcasts the compressed database to /tmp on four nodes) to your job script:
/home1/apps/dl-tools/bin/broadcast-mpi.sh /path/to/ilsvrc12_train_lmdb/data.mdb /tmp/ilsvrc12_train_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/broadcast-mpi.sh /path/to/ilsvrc12_val_lmdb/data.mdb /tmp/ilsvrc12_val_lmdb/data.mdb 4
/home1/apps/dl-tools/bin/broadcast-mpi.sh /path/to/imagenet_mean.binaryproto /tmp/imagenet_mean.binaryproto 4
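After the broadcast, your model specification must read the data from /tmp rather than from the shared filesystem. One way to rewrite the paths in place is sketched below; your_train_val.prototxt and the /path/to/ prefix are placeholders for your own specification file and paths:
$ sed -i 's|/path/to/|/tmp/|g' your_train_val.prototxt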
FAQ
Q: I am missing Python packages when using Caffe or TensorFlow. What shall I do?
A: These deep learning frameworks usually depend on many other packages, e.g., see the Caffe package dependency list. On TACC resources, you can install these packages in user space by running:
login1$ pip install --user package-name
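For example, if importing a dependency fails, you can install it for your account only. The protobuf package is used here purely as an illustration; substitute whichever package the error message names:
login1$ python -c 'import google.protobuf'   # fails if the protobuf package is missing
login1$ pip install --user protobuf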
Q: I run Caffe on multiple nodes, but the program hangs after loading the model. What is going on?
A: This may be caused by the LMDB concurrent-reading issue. To work around this problem, broadcast the LMDB instance to every node's local SSD, which is mounted at "/tmp". Use the broadcast-mpi.sh commands shown in the ImageNet-1K section above in your Slurm script: they let one of the nodes read the database files from the shared file system, then broadcast them to /tmp on all four nodes in the allocation. Refer to /home1/apps/caffe/test/run-4nodes.slurm for a real usage example.
Then update the LMDB path in the corresponding specification file (e.g., https://github.com/intel/caffe/blob/master/examples/cifar10/cifar10_full_train_test.prototxt). Do not forget to set "shuffle: true" in the data_param; otherwise, you can get degraded accuracy even with a large batch size.