## Install Caffe

On the official documentation page, BVLC already provides detailed installation instructions. However, there are many prerequisites to install. Luckily, I use Arch Linux, which has advanced tools (pacman and AUR helpers) to resolve these dependency problems. On AUR, there are several Caffe packages that install Caffe with a single command, but they differ in configuration. I normally develop a basic model on my laptop and then train larger models on remote GPU servers with Caffe preinstalled, so the package caffe-cpu-git is enough for me. After choosing the package, install Caffe by executing:

yaourt -S caffe-cpu-git


Ideally, yaourt will automatically detect and resolve prerequisites such as BLAS, Boost and OpenCV. After the installation finishes, you’ll have an environment ready with the C++ and Python interfaces of Caffe. Notably, some Caffe tools such as convert_mnist_data and convert_cifar_data are installed under /usr/bin/, so you can use them globally.

## Key Concepts of Caffe

Before we begin to train a model for MNIST, we’d better spend some time on the philosophy and key concepts of Caffe. The philosophy of Caffe is expression, speed, modularity, openness and community. Except for the first one, the other four are easy for a Caffe newbie to understand. The official explanation of expression is that models and optimizations in Caffe are defined as plaintext schemas instead of code. What a great feature it is! Interestingly, Caffe even provides a Python script, python/draw_net.py, that uses GraphViz to render a plaintext schema file as a graph.

Some key concepts:

• Blob: Caffe stores, communicates and manipulates information as blobs. A blob can be accessed either in a mutable way or in a const way.
• Layer: the fundamental unit of computation. A layer has functions for setup, forward and backward.
• Net: defines a function and its gradient by composition and auto-differentiation.
• Forward: the forward pass computes the output given the input, for inference.
• Backward: the backward pass computes the gradient given the loss, for learning.
• Solver: orchestrates model optimization by coordinating the network’s forward inference and backward gradients to form parameter updates that attempt to improve the loss.
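
To make the vocabulary above concrete, here is a toy sketch in plain Python. It is not the Caffe API: the classes `ScaleLayer` and `SquaredLoss` and the function `solver_step` are hypothetical names I made up, and the "net" is a single weight. It only shows how forward, backward and a solver update fit together.

```python
# Toy illustration of Caffe's vocabulary; NOT the Caffe API.
# All class and function names here are hypothetical.

class ScaleLayer:
    """A layer with one learnable weight: top = w * bottom."""

    def __init__(self, w):
        self.w = w          # parameter (Caffe would keep this in a blob)
        self.w_grad = 0.0   # gradient of the loss with respect to w

    def forward(self, bottom):
        self.bottom = bottom            # remember the input for backward()
        return self.w * bottom

    def backward(self, top_grad):
        self.w_grad = top_grad * self.bottom    # dL/dw
        return top_grad * self.w                # dL/dbottom


class SquaredLoss:
    """Loss layer: L = (prediction - target)^2."""

    def forward(self, prediction, target):
        self.diff = prediction - target
        return self.diff ** 2

    def backward(self):
        return 2.0 * self.diff          # dL/dprediction


def solver_step(layer, loss, x, target, lr):
    """One solver iteration: forward pass, backward pass, parameter update."""
    value = loss.forward(layer.forward(x), target)
    layer.backward(loss.backward())
    layer.w -= lr * layer.w_grad
    return value


layer, loss = ScaleLayer(w=0.0), SquaredLoss()
losses = [solver_step(layer, loss, x=2.0, target=6.0, lr=0.05) for _ in range(20)]
print(losses[0], losses[-1], layer.w)   # the loss shrinks and w approaches 3
```

In real Caffe the layers, the net and the solver are all declared in .prototxt files, and the framework wires the forward and backward calls for you.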

Some of these are terms from learning theory. The layer catalogue of Caffe, for example, is grouped by functionality: vision layers, loss layers, activation/neuron layers, data layers, etc.

## Prepare LMDB Dataset for MNIST

After installing caffe-cpu-git, the steps to prepare the LMDB dataset differ a little from the official LeNet guide because the installation directory is different. For simplicity, you can just copy and execute the following commands step by step. First, get the Caffe code and the MNIST dataset.

git clone https://github.com/BVLC/caffe
cd caffe
./data/mnist/get_mnist.sh


Convert MNIST train data to lmdb data.

convert_mnist_data data/mnist/train-images-idx3-ubyte data/mnist/train-labels-idx1-ubyte examples/mnist/mnist_train_lmdb --backend lmdb


Convert MNIST test data to lmdb data.

convert_mnist_data data/mnist/t10k-images-idx3-ubyte data/mnist/t10k-labels-idx1-ubyte examples/mnist/mnist_test_lmdb --backend lmdb


Caffe actually supports many types of data layer. Here we read data from LMDB (Lightning Memory-Mapped Database), a fast and efficient key-value store. Compared with HDF5, another data source that Caffe supports, LMDB uses memory-mapped files and doesn’t need to load the whole dataset into memory, so it’s suitable for large datasets. The tool convert_mnist_data is written in C++. Its basic idea is to read each item from the MNIST byte file format into a Datum object and serialize it to a string. Then, using the item id as the key and the serialized string as the value, it stores the new item in LMDB.
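
The conversion idea can be sketched in Python 3 without Caffe or LMDB installed. This is only an illustration under a few assumptions: a plain dict stands in for the LMDB database, the "serialized" value is simply the label byte followed by the pixel bytes rather than a real Datum protobuf, and the zero-padded key format is my assumption about what the tool does. The function names are hypothetical.

```python
import struct


def parse_idx_images(raw):
    """Parse MNIST image bytes: a big-endian header, then raw pixel bytes."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("not an MNIST image file")
    size = rows * cols
    return [raw[16 + i * size:16 + (i + 1) * size] for i in range(count)]


def parse_idx_labels(raw):
    """Parse MNIST label bytes: a big-endian header, then one byte per label."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError("not an MNIST label file")
    return list(raw[8:8 + count])


def build_store(image_bytes, label_bytes):
    """Stand-in for the LMDB write loop: item id as key, bytes as value."""
    images = parse_idx_images(image_bytes)
    labels = parse_idx_labels(label_bytes)
    if len(images) != len(labels):
        raise ValueError("image and label counts differ")
    store = {}
    for item_id, (pixels, label) in enumerate(zip(images, labels)):
        key = "%08d" % item_id                  # assumed key format
        store[key] = bytes([label]) + pixels    # simplified "serialization"
    return store


# Two fake 2x2 "images" in the MNIST byte layout, so the sketch runs
# without downloading the dataset.
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8))
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])
store = build_store(fake_images, fake_labels)
print(sorted(store))
```

With the real files you would pass the contents of train-images-idx3-ubyte and train-labels-idx1-ubyte instead of the fake byte strings.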

## LeNet

LeNet is one of the popular convolutional networks, and it works well on digit classification tasks. The model is illustrated in the following figure:

Subsampling is called pooling in current terminology. In the Caffe codebase, under the folder examples/mnist, there are many .prototxt files, i.e. plaintext model files. lenet_train_test.prototxt can be visualized with python/draw_net.py.

cd examples/mnist/

python2 /opt/caffe/python/draw_net.py --rankdir TB --phase TRAIN lenet_train_test.prototxt lenet_train_test.png


lenet_train_test.prototxt is an almost direct translation of the LeNet model in the paper, with slight differences: the sigmoid activations are replaced with Rectified Linear Unit (ReLU) activations, and the Gaussian connections of the output layer are replaced with a softmax loss.

Next, let’s review the details of lenet_train_test.prototxt.

### Data Layer

name: "LeNet"
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include: {
phase: TRAIN
}
transform_param: {
scale: 0.00390625
}
data_param: {
source: "examples/mnist/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}


The .prototxt file describes a Caffe model from bottom to top. In the data layer, we need to define two tops, data and label. The type entry defines the layer category; it can be Data, Convolution, Pooling, InnerProduct, ReLU, etc. The include entry states in which phase the layer is used, so that we can set a different data_param for training and testing.

In particular, the transform_param entry sets scale to 1/256 (0.00390625), which makes Caffe map pixel values from the range 0-255 into [0, 1). Why do we need feature scaling? According to Wikipedia, there are two major motivations. In some scenarios, feature scaling is needed for the objective function to work properly. In addition, gradient descent converges faster with feature scaling than without it. In other words, we can train the model more quickly with feature scaling.
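
The arithmetic of the scale entry is easy to check. The snippet below is a plain-Python sketch with a hypothetical helper `scale_pixels`; it does not call Caffe, it only applies the same multiplication to a few sample pixel values.

```python
SCALE = 0.00390625  # value of transform_param.scale in the data layer


def scale_pixels(pixels, scale=SCALE):
    """Multiply raw 0-255 pixel values by the scale factor."""
    return [p * scale for p in pixels]


scaled = scale_pixels([0, 128, 255])
print(SCALE == 1 / 256)   # True
print(scaled)             # [0.0, 0.5, 0.99609375]
```

Note that the largest pixel value, 255, maps to just under 1, because the factor is 1/256 rather than 1/255.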

### Convolution Layer

layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
# learning rate for the filters
param: { lr_mult: 1 }
# learning rate for the biases
param: { lr_mult: 2 }
convolution_param: {
num_output: 20
kernel_size: 5
stride: 1
weight_filler: { type: "xavier" }
bias_filler: { type: "constant" }
}
}


The learning rate decides the trade-off between training speed and accuracy. It is one of the parameters of gradient descent. Each lr_mult is a multiplier applied to the solver’s base learning rate, and setting the multiplier of the biases to twice that of the filters is just a rule of thumb that usually leads to better convergence. We’ll talk more about the other convolution parameters in future posts.
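
As a small worked example, the sketch below computes the learning rates that conv1 ends up with, using base_lr: 0.01 from the LeNet solver, and the spatial size of the conv1 output. The helper names are hypothetical, and the 28x28 input size is the well-known MNIST image size rather than something stated in the prototxt.

```python
def effective_lr(base_lr, lr_mult):
    """Learning rate actually applied to a parameter blob."""
    return base_lr * lr_mult


def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


BASE_LR = 0.01                          # base_lr in the LeNet solver
weight_lr = effective_lr(BASE_LR, 1)    # first param block: filters
bias_lr = effective_lr(BASE_LR, 2)      # second param block: biases
conv1_size = conv_output_size(28, kernel_size=5, stride=1)

print(weight_lr, bias_lr)   # 0.01 0.02
print(conv1_size)           # 24
```

So conv1 turns each 28x28 image into 20 feature maps of 24x24, one per num_output filter.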

### Pooling Layer

layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param: {
pool: MAX
kernel_size: 2
stride: 2
}
}


Pooling is a procedure that takes the input over a certain area and reduces it to a single value. Depending on the operation applied, there are different kinds of pooling, for example max-pooling and average-pooling. Here, we do max-pooling over 2x2 windows with stride 2, which means each non-overlapping 2x2 region is represented by its largest value.
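
Here is a minimal pure-Python sketch of max-pooling on a single small feature map. The function `max_pool` is a hypothetical helper, not Caffe code, and for simplicity it assumes the input size divides evenly by the stride.

```python
def max_pool(image, kernel_size=2, stride=2):
    """Max-pool a 2D list of numbers; partial windows at the edge are ignored."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - kernel_size + 1, stride):
        out_row = []
        for c in range(0, cols - kernel_size + 1, stride):
            window = [image[r + i][c + j]
                      for i in range(kernel_size)
                      for j in range(kernel_size)]
            out_row.append(max(window))     # keep only the largest value
        pooled.append(out_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [2, 6, 7, 3],
]
pooled = max_pool(feature_map)
print(pooled)   # [[4, 5], [6, 9]]
```

A 4x4 map becomes 2x2, so with kernel_size 2 and stride 2 each spatial dimension is halved.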

### Fully Connected Layer

layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param: { lr_mult: 1 }
param: { lr_mult: 2 }
inner_product_param: {
num_output: 500
weight_filler: { type: "xavier" }
bias_filler: { type: "constant" }
}
}


The fully connected layer in Caffe is named InnerProduct. Initialization really matters for non-convex optimization with algorithms like stochastic gradient descent, so here we use the Xavier algorithm to initialize the network. The usual alternative is a Gaussian or uniform distribution with a fairly arbitrarily chosen variance. It was difficult to find good material about the Xavier algorithm until I found the post about Xavier initialization on tumblr. I hope it will help you too.
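
The idea of Xavier initialization can be sketched in a few lines of Python. This is my understanding of what the "xavier" filler does by default, namely drawing weights uniformly from [-limit, limit] with limit = sqrt(3 / fan_in), so that the weight variance is 1 / fan_in. Treat that as an assumption, not as a quote from the Caffe source. The fan_in of 800 is also an assumption: it corresponds to a pool2 output of 50 feature maps of 4x4, which is not shown in this post.

```python
import math
import random


def xavier_fill(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix uniformly from [-limit, limit]."""
    limit = math.sqrt(3.0 / fan_in)
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, limit


rng = random.Random(0)                  # fixed seed for reproducibility
weights, limit = xavier_fill(fan_in=800, fan_out=500, rng=rng)

flat = [w for row in weights for w in row]
variance = sum(w * w for w in flat) / len(flat)
print(limit)                # bound of the uniform distribution
print(variance * 800)       # close to 1, i.e. variance is about 1 / fan_in
```

The point of tying the variance to fan_in is that the scale of the signal stays roughly the same from layer to layer, instead of depending on an arbitrarily chosen constant.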

### Loss Layer & ReLU

In LeNet, we use ReLU as the activation function and the softmax regression loss as the loss function.
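
Both functions are short enough to write out. The following is a plain-Python sketch of the math, not of Caffe’s layer implementations, and the function names are hypothetical.

```python
import math


def relu(values):
    """Rectified Linear Unit: negative inputs become zero."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)     # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(scores)[label])


activated = relu([-1.5, 0.0, 2.0])
probs = softmax([1.0, 2.0, 3.0])
loss = softmax_loss([1.0, 2.0, 3.0], label=2)
print(activated)    # [0.0, 0.0, 2.0]
print(probs, loss)
```

The loss is small when the correct class already has the highest score and grows as the network puts more probability on the wrong classes.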

### LeNet Solver

net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: CPU
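
To see what lr_policy: "inv" does to the learning rate, here is a small Python sketch. It uses the formula base_lr * (1 + gamma * iter) ^ (-power) as given in Caffe’s documentation of the policy. The function name is hypothetical, and the test batch size of 100 is taken from the comment in the solver file.

```python
def inv_learning_rate(iteration, base_lr=0.01, gamma=0.0001, power=0.75):
    """Learning rate under the "inv" policy at a given iteration."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)


TEST_ITER = 100
TEST_BATCH_SIZE = 100       # from the comment in the solver file
test_images_covered = TEST_ITER * TEST_BATCH_SIZE

for iteration in (0, 5000, 10000):
    print(iteration, inv_learning_rate(iteration))
print(test_images_covered)  # 10000
```

The rate starts at base_lr and decays smoothly, ending at roughly 0.0059 when max_iter is reached.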


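In the LeNet solver, I want to highlight two entries, momentum and weight_decay. Momentum helps SGD when the objective has the shape of a long shallow ravine with steep walls: plain SGD tends to oscillate across the ravine, while momentum pushes the parameters along it more quickly. With momentum, we update the parameter vector $\theta$ in the following way: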

\begin{align} v &= \gamma v+ \alpha \nabla_{\theta} J(\theta; x^{(i)},y^{(i)}) \end{align}

\begin{align} \theta &= \theta - v \end{align}

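$\alpha$ is the learning rate and $\gamma \in (0, 1]$ is the momentum, which determines for how many iterations the previous gradients are incorporated into the current update. This $\gamma$ corresponds to the momentum entry of the solver, not to its gamma entry, which belongs to the learning rate policy.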
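
The two update equations translate directly into code. Below is a minimal sketch on a toy objective $J(\theta) = \theta^2$, which I chose only for illustration. The function names are hypothetical, and the values 0.01 and 0.9 mirror base_lr and momentum from the solver.

```python
def momentum_step(theta, v, grad, lr, momentum):
    """One update: v = momentum * v + lr * grad, then theta = theta - v."""
    v = momentum * v + lr * grad
    theta = theta - v
    return theta, v


def gradient(theta):
    """Gradient of the toy objective J(theta) = theta^2."""
    return 2.0 * theta


theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, gradient(theta), lr=0.01, momentum=0.9)
print(theta)    # close to the minimum at 0
```

Because v accumulates past gradients, the steps grow while the gradient keeps pointing in the same direction, which is what speeds up the walk along a shallow ravine.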

## Training and Testing LeNet

caffe train -solver=examples/mnist/lenet_solver.prototxt


The Caffe trainer is powerful. As configured in the LeNet solver, Caffe saves a snapshot every 5000 iterations. You can also stop training with Ctrl-C, and Caffe will save a snapshot of its current state as well.

To resume training from a snapshot, suppose we manually stopped training at iteration 350 and got the snapshot lenet_iter_350.solverstate. Then we can run:

# Run from the Caffe root directory: the paths inside the solver are relative to it.
caffe train -solver=examples/mnist/lenet_solver.prototxt -snapshot=examples/mnist/lenet_iter_350.solverstate