R-CNN was the milestone that brought deep learning to object detection. Let's revisit this pioneering paper and look back at where it all began.

Core Ideas

• Region selection no longer slides a window to extract features; instead, a heuristic region-proposal algorithm (Selective Search) generates candidate regions
• Feature extraction moves from hand-crafted features to features learned automatically by a convolutional neural network, which makes them more robust

Pipeline

1. Generate candidate regions with the Selective Search algorithm
2. Extract features from each candidate region's image with a CNN
3. Feed the features into SVM classifiers to decide the category
4. Refine the candidate boxes with a regressor
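The four steps above can be sketched end to end. Every function passed in below (`selective_search`, `cnn_features`, `svm_scores`, `bbox_regress`) is a hypothetical placeholder for the corresponding stage, not code from the paper:

```python
# Hypothetical sketch of R-CNN inference; each stage is injected as a
# callable so the control flow of the four steps is visible.

def rcnn_detect(image, selective_search, cnn_features, svm_scores, bbox_regress):
    detections = []
    for region in selective_search(image):    # step 1: ~2k region proposals
        feat = cnn_features(image, region)    # step 2: CNN feature vector
        cls, score = svm_scores(feat)         # step 3: per-class linear SVMs
        if cls is not None:                   # skip background proposals
            box = bbox_regress(feat, region)  # step 4: refine the box
            detections.append((cls, score, box))
    return detections
```

In the actual system, step 3 runs one SVM per class and step 4 uses class-specific regressors; the sketch collapses both into single callables for brevity.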

How Selective Search Works

• Use an over-segmentation method to split the image into many small regions
• Among the current regions, merge the two most likely to belong together; repeat until the whole image becomes a single region
• Output every region produced along the way as the candidate regions
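The greedy merging loop above can be sketched as follows, assuming regions are represented as sets of pixel ids and a `similarity(a, b)` function is supplied (both are simplifying assumptions, not the paper's data structures):

```python
# Minimal sketch of the Selective Search merging loop: repeatedly merge
# the most similar pair, and keep every intermediate region as a proposal.

def greedy_merge(regions, similarity):
    proposals = list(regions)                 # initial segments are proposals too
    while len(regions) > 1:
        # find the most similar pair among the current regions
        i, j = max(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda p: similarity(regions[p[0]], regions[p[1]]),
        )
        merged = regions[i] | regions[j]      # union of the two regions
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)              # every merge yields a new proposal
    return proposals
```

Note that the output contains regions at all scales, which is exactly why the method produces multi-scale candidates without sliding windows.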

Merge Rules

• Regions with similar color (color histograms)
• Regions with similar texture (gradient histograms)
• Prefer merges with a small combined area, so merging proceeds at a fairly uniform scale
• Prefer merges whose combined area fills a large fraction of its bounding box, so shapes stay regular
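These four rules can be written as similarity terms that are summed into one score. The sketch below assumes L1-normalized numpy histograms and pixel areas; the exact weighting in Selective Search combines several such strategies:

```python
import numpy as np

# The four Selective Search similarity terms; each is in [0, 1] and
# higher means "merge these two regions sooner".

def s_color(h_a, h_b):
    # histogram intersection of the two color histograms
    return float(np.minimum(h_a, h_b).sum())

def s_texture(t_a, t_b):
    # histogram intersection of the two texture (gradient) histograms
    return float(np.minimum(t_a, t_b).sum())

def s_size(size_a, size_b, size_im):
    # high when both regions are small: keeps the merging scale uniform
    return 1.0 - (size_a + size_b) / size_im

def s_fill(size_a, size_b, bbox_size, size_im):
    # high when the merged pair fills its joint bounding box: regular shapes
    return 1.0 - (bbox_size - size_a - size_b) / size_im
```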

Region-proposal algorithms such as Selective Search rely on low-level image features like texture and color. These features lack abstraction, so the quality of the proposals they produce is limited.

Feature Extraction

Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers.
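The preprocessing step can be sketched as below. The per-channel mean values and the nearest-neighbor warp are illustrative stand-ins; in practice the crop is warped with proper interpolation (e.g. `cv2.resize`):

```python
import numpy as np

# Sketch of R-CNN input preprocessing: warp a region crop to 227x227
# and subtract a per-channel mean (mean values here are illustrative).

def preprocess(crop, mean=(104.0, 117.0, 123.0)):
    """crop: HxWx3 uint8 array -> 227x227x3 float32, mean-subtracted."""
    h, w = crop.shape[:2]
    # nearest-neighbor warp: pick the source row/column for each output pixel
    ys = (np.arange(227) * h // 227).clip(0, h - 1)
    xs = (np.arange(227) * w // 227).clip(0, w - 1)
    warped = crop[ys][:, xs].astype(np.float32)
    return warped - np.asarray(mean, dtype=np.float32)
```

Note that the warp is anisotropic: each proposal is stretched to the fixed input size regardless of its aspect ratio, which is what the paper does as well.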

Neural Networks

Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged.

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [17, 37]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.
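The mining loop can be illustrated with a toy sketch. Here `fit(positives, cache)` is a hypothetical trainer that returns a scoring function, and a negative counts as "hard" when it scores above the SVM margin (score > −1):

```python
# Toy sketch of hard negative mining for one class's linear SVM.
# `fit` is a hypothetical trainer; only negatives the current model
# still confuses with positives are added back into the training cache.

def train_with_hard_negatives(fit, all_negatives, positives, rounds=2):
    cache = list(all_negatives[: len(positives)])  # small initial negative sample
    score = fit(positives, cache)
    for _ in range(rounds):
        # negatives the current model scores above the margin
        hard = [x for x in all_negatives if score(x) > -1.0 and x not in cache]
        if not hard:
            break  # no new hard negatives: mining has converged
        cache += hard
        score = fit(positives, cache)  # retrain on the enlarged cache
    return score
```

This mirrors why the method converges quickly: once no negative scores above the margin, further passes over the data change nothing.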

The overlap threshold, 0.3, was selected by a grid search over {0,0.1,…,0.5} on a validation set.
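As a concrete illustration of how that threshold is used, a minimal IoU check follows (the `(x1, y1, x2, y2)` box format is an assumption for the sketch):

```python
# Intersection-over-union between two axis-aligned boxes (x1, y1, x2, y2).
# Proposals whose IoU with every ground-truth box falls below the 0.3
# threshold are labeled as negatives for SVM training.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def is_negative(proposal, gt_boxes, thresh=0.3):
    return all(iou(proposal, g) < thresh for g in gt_boxes)
```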

Classification

This performance drop likely arises from a combination of several factors, including that the definition of positive examples used in fine-tuning does not emphasize precise localization, and that the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.

Bounding-Box Regression

1. First translate by $(\delta_{x},\delta_{y})$, where $\delta_{x}=P_{w}d_{x}(P)$ and $\delta_{y}=P_{h}d_{y}(P)$:

$$\hat{G}_{x} = P_{w}d_{x}(P)+P_{x}$$

$$\hat{G}_{y} = P_{h}d_{y}(P)+P_{y}$$

2. Then rescale by $(S_{w},S_{h})$, where $S_{w}=P_{w}\exp(d_{w}(P))$ and $S_{h}=P_{h}\exp(d_{h}(P))$:

$$\hat{G}_{w} = P_{w}\exp(d_{w}(P))$$

$$\hat{G}_{h} = P_{h}\exp(d_{h}(P))$$
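The translate-then-rescale steps can be verified with a few lines of code, assuming boxes are given in center/size form `(x, y, w, h)`:

```python
import math

# Apply learned offsets d = (dx, dy, dw, dh) to a proposal
# P = (Px, Py, Pw, Ph): translate the center, then rescale the size.

def apply_deltas(P, d):
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    Gx = Pw * dx + Px        # translate the center, scaled by box size
    Gy = Ph * dy + Py
    Gw = Pw * math.exp(dw)   # exp keeps the predicted width/height positive
    Gh = Ph * math.exp(dh)
    return (Gx, Gy, Gw, Gh)
```

The `exp` parameterization is why the regressor can never predict a box with negative width or height.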

The regression weights $w_{*}$ (one set per $* \in \{x, y, w, h\}$) are learned by ridge regression on the pool$_5$ features $\phi_{5}$:

$$w_{*} = \arg\min_{\hat{w}_{*}} \sum_{i}^{n} \left(t_{*}^{i} - \hat{w}_{*}^{T} \phi_{5}(P^{i})\right)^2 + \lambda \lVert \hat{w}_{*} \rVert^2$$

where the regression targets for a proposal $P$ and its matched ground-truth box $G$ are

$$t_{x}=(G_{x} - P_{x})/P_{w}$$
$$t_{y}=(G_{y} - P_{y})/P_{h}$$
$$t_{w}=\log(G_{w}/P_{w})$$
$$t_{h}=\log(G_{h}/P_{h})$$
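The four target equations translate directly to code, again assuming center/size boxes `(x, y, w, h)`:

```python
import math

# Regression targets t = (tx, ty, tw, th) for a proposal P and its
# matched ground-truth box G, both in center/size form.

def regression_targets(P, G):
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    tx = (Gx - Px) / Pw          # center offsets, normalized by box size
    ty = (Gy - Py) / Ph
    tw = math.log(Gw / Pw)       # log scale ratios for width/height
    th = math.log(Gh / Ph)
    return (tx, ty, tw, th)
```

These targets are exactly the inverse of the translate-and-rescale equations above: predicting $d(P) = t$ and applying it to $P$ recovers $G$.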