## The Problem Description

Object detection deals with detecting instances of semantic objects of a certain class in digital images and videos. The object recognition task reduces to an object detection task if we know what we are looking for.

If we apply a recognition algorithm to every possible sub-window in a given image, it’s likely to be both slow and error-prone. A more effective approach is to construct a specialized detector that can rapidly find likely regions where the particular objects may occur. The most widely used applications of object detection are face detection and pedestrian detection.

## Conventional Approaches

### Face Detection

At the beginning of the 21st century, some reviews classified face detection techniques as feature-based, template-based and appearance-based. Here, I’d like to talk only about the feature-based approaches, as they are the most closely related to current approaches. The main idea of feature-based techniques at that time was to find the locations of distinctive image features and use a conventional machine learning classifier to verify them. OpenCV uses a Haar cascade classifier for face detection. The paper behind this approach is still inspiring today, for example:

• When your image processing is slow, try a different image representation, e.g. image-to-column for convolution, or the integral image (closely related to the summed-area table) for rectangle features.
• The cascade classifier uses an object-specific focus-of-attention mechanism, i.e. it spends more computation on the more promising regions.
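To make the integral image idea concrete, here is a minimal NumPy sketch. The function names (`integral_image`, `rect_sum`, `two_rect_feature`) are my own for illustration and are not OpenCV's API. Once the table is built, the sum over any rectangle costs four lookups, which is what makes rectangle features cheap to evaluate.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with an extra zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Haar-like two-rectangle feature: left half minus right half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())
```

The zero padding is a design choice that removes the boundary special cases from `rect_sum`.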

### Pedestrian Detection

#### HOG Features

A well-known feature descriptor for pedestrian detection is the histogram of oriented gradients (HOG). The pipeline of pedestrian detection is shown in the following figure. In the discussion part of the paper, the authors argue that using more channels in the pixel representation (RGB, LAB, grayscale, etc.) does improve detection performance, whereas colour normalization has only a modest effect. Performance is sensitive to the way gradients are computed and to how finely orientation is coded (the best-performing number of orientation bins is 9). In the real world, because of local variations in illumination and foreground-background contrast, gradient strengths vary over a wide range, which can strongly influence the weighted orientation vote. Thus, local contrast normalization turns out to be essential as well. The HOG descriptor does not handle nonrigid deformations well; it is essentially a rigid template feature.
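The weighted orientation vote and the contrast normalization described above can be sketched for a single cell as follows. This is a simplified illustration, not the paper's full pipeline: it assumes unsigned orientations in [0, 180) degrees, uses hard assignment to bins instead of interpolated voting, and leaves border gradients at zero. The function names are hypothetical.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = cell.astype(np.float64)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    # Centered [-1, 0, 1] differences; border pixels stay at zero for brevity.
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    magnitude = np.hypot(gx, gy)
    # Fold the gradient direction into [0, 180) degrees (unsigned orientation).
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    # Each pixel votes for its orientation bin with its gradient magnitude.
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def l2_normalize(block, eps=1e-6):
    """Contrast-normalize the concatenated cell histograms of one block."""
    return block / np.sqrt(np.sum(block ** 2) + eps ** 2)

ramp = np.tile(np.arange(8), (8, 1))  # intensity increases left to right
hist = cell_histogram(ramp)
print(hist)
print(l2_normalize(hist))
```

Because the vote is weighted by gradient magnitude, doubling the image contrast doubles every bin; the block normalization step is what removes that dependence.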

#### Deformable Part Model (DPM)

The paper on discriminatively trained part-based models proposes to represent highly variable objects using mixtures of multiscale deformable part models. It aims at detecting objects that vary greatly in appearance and are hard to detect using rigid templates. For the object detection task, variations come from changes in illumination and viewpoint, nonrigid deformations, and intraclass variability in shape and other visual properties. The main idea of DPM is to model different parts separately and introduce a deformation cost that allows some variation in how the parts are arranged. The paper also addresses practical implementation issues; for example, the part filters are placed at twice the spatial resolution of the root filters. The object matching is illustrated in the following figure. To avoid elaborate labeling of multiple parts, DPM treats the part locations as latent variables and uses a latent SVM as the classifier.

Furthermore, DPM can be extended to a mixture model to detect objects, such as bicycles, whose appearance varies significantly across viewpoints.
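The trade-off between part filter response and deformation cost can be sketched with a brute-force search. This is a simplified illustration under my own assumptions: the cost uses only quadratic displacement terms, the search window `radius` and the coefficients in `deform` are made up, and real implementations typically use a much faster distance transform. `best_part_placement` is a hypothetical name.

```python
import numpy as np

def best_part_placement(response, anchor, deform=(0.05, 0.05), radius=4):
    """Maximize filter response minus a quadratic deformation cost around an anchor."""
    ay, ax = anchor
    h, w = response.shape
    best_score, best_pos = -np.inf, None
    for y in range(max(0, ay - radius), min(h, ay + radius + 1)):
        for x in range(max(0, ax - radius), min(w, ax + radius + 1)):
            dy, dx = y - ay, x - ax
            # Moving the part away from its anchor is allowed but penalized.
            cost = deform[0] * dy * dy + deform[1] * dx * dx
            score = response[y, x] - cost
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos

response = np.zeros((10, 10))
response[5, 7] = 1.0  # the part filter fires two pixels right of the anchor
print(best_part_placement(response, anchor=(5, 5)))
print(best_part_placement(response, anchor=(5, 5), deform=(1.0, 1.0)))
```

With a small deformation cost the part snaps to the strong response; with a large cost it stays at the anchor, which is how the model tolerates some, but not arbitrary, variation.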

## Deep Learning Approaches

### R-CNN

R-CNN (Regions with CNN features) was proposed by Girshick et al. and considerably improved mean average precision (mAP) on the PASCAL VOC dataset. The R-CNN object detection system is shown in the following figure from the paper:

R-CNN follows the “recognition using regions” paradigm. In the region proposal module, it uses selective search to extract around 2000 category-independent region proposals. Next, R-CNN uses AlexNet to extract higher-layer features, such as the 4096-dimensional FC-7 feature vector. Finally, a multi-class SVM classifier (one linear SVM per class) scores each region proposal. Because several high-scoring proposals may overlap and bound the same object, greedy non-maximum suppression is applied per class: a region is rejected if its intersection-over-union (IoU) overlap with a higher-scoring selected region exceeds a learned threshold.
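IoU and greedy non-maximum suppression are easy to get subtly wrong, so here is a minimal NumPy sketch. It assumes boxes in `(x1, y1, x2, y2)` format with positive area, and the default `iou_threshold=0.3` is only a placeholder of mine rather than a value taken from the paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))
```

In a multi-class detector this routine would be run once per class, on that class's scores only.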

As mentioned in the paper, one of the challenges is that labeled data is scarce and insufficient for training a large CNN. To overcome this, the authors showed that supervised pre-training on a large auxiliary dataset, followed by domain-specific fine-tuning on a small dataset, is an effective paradigm. In practice, at the fine-tuning stage, the classification layer of AlexNet is replaced with a 21-way layer (20 VOC classes plus background). The learning rate during domain-specific fine-tuning is set to 1/10th of the learning rate of the supervised pre-training stage, which allows fine-tuning to make progress while not clobbering the initialization.

Furthermore, the paper analyzes detection errors using the method proposed by D. Hoiem et al., which is quite useful when optimizing your own algorithm. In brief, the false positives in object detection are grouped into four types: Loc (poor localization), Sim (confusion with a similar category), Oth (confusion with a dissimilar category) and BG (background). In this way, the performance of R-CNN with different feature layers can be compared using the following graph. As we can see, bounding box regression largely reduces localization errors.

### SPP-net

R-CNN does improve detection precision, but it has several notable drawbacks, including a multi-stage training pipeline and feature caching that is expensive in time and space. Worse, detection itself is slow. Spatial pyramid pooling networks (SPP-nets) were proposed to speed up R-CNN by sharing computation. SPP-net is built around a spatial pyramid pooling layer, and its structure is illustrated in the following figure.

The trick of SPP-net is this: conventional CNNs require fixed-size inputs only because the fully connected layers demand fixed-length vectors, so we can first run the convolutional layers to get feature maps and then pool features over arbitrary regions. In other words, spatial pyramid pooling (SPP) shares computation over all region proposals, which makes SPP-net 24-102× faster than R-CNN. However, back-propagation through the SPP layer is highly inefficient when each training sample comes from a different image. As a result, SPP-net cannot update the weights below the SPP layer, so fine-tuning only affects the fully connected layers, which limits the accuracy improvement.
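The fixed-length property is the part worth seeing in code. Below is a minimal sketch of spatial pyramid max-pooling over a `(C, H, W)` feature map. The pyramid levels `(1, 2, 4)` and the way bin edges are rounded are assumptions of this sketch, not necessarily the paper's exact configuration.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a vector of length C * sum(n * n)."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Fractional bin edges; floor/ceil keeps every bin non-empty.
        ys = np.linspace(0, h, n + 1)
        xs = np.linspace(0, w, n + 1)
        for i in range(n):
            y0, y1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            for j in range(n):
                x0, x1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
small = rng.random((3, 6, 9))
large = rng.random((3, 13, 17))
print(spatial_pyramid_pool(small).shape, spatial_pyramid_pool(large).shape)
```

Both inputs produce a vector of the same length, so the fully connected layers can follow regardless of the region size.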

### Fast R-CNN

Fast R-CNN borrows the shared feature map idea from SPP-net and integrates training into a single stage using a multi-task loss. Since training is single-stage, no extra disk space is required for feature caching. The following figure shows the architecture of Fast R-CNN, where $H$ and $W$ are the relatively small, fixed output size of the region of interest (RoI) pooling layer, whereas $h$ and $w$ vary across input images.

In Fast R-CNN, stochastic gradient descent mini-batches are sampled hierarchically, first by sampling $N$ images and then by sampling $R/N$ RoIs from each image. More specifically, 25% of the RoIs are taken from object proposals whose IoU with a ground-truth box is at least 0.5; these are treated as positive proposals with a category label. The remaining RoIs are sampled from proposals with IoU in $[0.1, 0.5)$ and are treated as background. Images are horizontally flipped with probability 0.5.
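The per-image half of this sampling scheme can be sketched as follows. The input `max_ious` (each proposal's maximum IoU with any ground-truth box) and the function name `sample_rois` are hypothetical, and the default of 64 RoIs per image assumes $N = 2$ and $R = 128$, which are not stated in the text above.

```python
import numpy as np

def sample_rois(max_ious, rois_per_image=64, fg_fraction=0.25, rng=None):
    """Pick foreground and background RoI indices for one image."""
    rng = np.random.default_rng() if rng is None else rng
    fg = np.flatnonzero(max_ious >= 0.5)
    bg = np.flatnonzero((max_ious >= 0.1) & (max_ious < 0.5))
    # Cap each group by what is actually available in this image.
    n_fg = min(int(round(fg_fraction * rois_per_image)), fg.size)
    n_bg = min(rois_per_image - n_fg, bg.size)
    fg_idx = rng.permutation(fg)[:n_fg]
    bg_idx = rng.permutation(bg)[:n_bg]
    return fg_idx, bg_idx

max_ious = np.concatenate([np.full(40, 0.7), np.full(200, 0.3), np.full(50, 0.05)])
fg_idx, bg_idx = sample_rois(max_ious, rng=np.random.default_rng(0))
print(len(fg_idx), len(bg_idx))
```

Proposals with IoU below 0.1 are never sampled in this sketch, matching the $[0.1, 0.5)$ background interval.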

A Fast R-CNN network has two sibling outputs: a discrete probability distribution over classes (per RoI) and bounding-box regression offsets for each object class. A joint loss trains the two tasks together. Moreover, because a large share of detection time is spent in the fully connected layers when processing a large number of RoIs, these layers are compressed with truncated SVD, which is implemented as two FC layers without a non-linearity between them.
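The truncated SVD trick amounts to replacing one weight matrix by the product of two thinner ones. Here is a minimal NumPy sketch; the matrix sizes and the rank `t` are arbitrary values chosen for illustration, and `compress_fc` is a hypothetical name.

```python
import numpy as np

def compress_fc(weight, t):
    """Factor a (u x v) FC weight matrix into two thinner layers via truncated SVD."""
    U, s, Vt = np.linalg.svd(weight, full_matrices=False)
    first = s[:t, None] * Vt[:t]  # (t x v): applied to the input first, no bias
    second = U[:, :t]             # (u x t): applied second, keeps the original bias
    return first, second

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 32))  # rank-8 weights
x = rng.standard_normal(32)

first, second = compress_fc(W, t=8)
print(W.size, first.size + second.size)
print(np.allclose(W @ x, second @ (first @ x)))
```

The parameter count drops from $uv$ to $t(u + v)$, so the saving is large when $t$ is much smaller than $\min(u, v)$. The toy matrix here has exact rank 8, so nothing is lost; for real weights the truncation is an approximation.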

### Faster R-CNN

R-CNN provides an effective way to leverage CNNs in object detection. Follow-up approaches such as SPP-net and Fast R-CNN reduce the running time of the detection network, whereas Faster R-CNN speeds up region proposal generation by introducing a Region Proposal Network (RPN), which simultaneously predicts object bounds and objectness scores at each position. The Faster R-CNN framework is illustrated below:

Faster R-CNN uses several significant tricks. First, it uses translation-invariant anchors so that convolutional features computed on a single-scale image can address multiple scales, which benefits running speed. Moreover, the anchors are generated using an “image-centric” sampling strategy, and anchors that cross the image boundaries are ignored during training. As RPN proposals overlap heavily with each other, non-maximum suppression (NMS) is applied to them based on their $cls$ scores. Second, Faster R-CNN uses alternating training to share convolutional features between the RPN and Fast R-CNN. In this scheme, the RPN is trained first, and the proposals it generates are used to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize the RPN, and this process is iterated. In this way, Faster R-CNN avoids computing two different feature maps for the RPN and the detection network. Finally, the detection system runs at a frame rate of 5 fps on a GPU. It’s a significant step towards real-time object detection.
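Anchor generation and the cross-boundary filter can be sketched as below. The stride of 16 and the three scales and three aspect ratios are assumed values for illustration, since the text above does not specify them, and the function names are hypothetical.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) in image pixels."""
    base = []
    for scale in scales:
        for ratio in ratios:
            # Area stays scale**2 while height / width equals ratio.
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)  # (k, 4) boxes centered at the origin
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    # The same k base boxes are shifted to every feature map position.
    return (centers + base).reshape(-1, 4)

def inside_image(anchors, img_h, img_w):
    """Boolean mask of anchors that do not cross the image boundary."""
    return ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
            (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))

anchors = generate_anchors(38, 50)
mask = inside_image(anchors, img_h=600, img_w=800)
print(anchors.shape, int(mask.sum()))
```

Translation invariance shows up in the code as the single `base` array: moving one cell to the right shifts every anchor by exactly one stride and changes nothing else.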