Getting Started with DL: Cracking Captchas II

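The training data in the previous post was entry-level: every image had a fixed size. Captchas of variable size are a much bigger headache, so today we take on a harder kind of captcha.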

The code in this post was written against Keras 2.0.4.

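The captchas we need to recognize this time follow these rules:

The image size is not fixed
One region of the image contains the formulas
The image contains two or three lines of formulas
There are two kinds of formulas: assignments and arithmetic expressions. A two-line image has one assignment and one expression; a three-line image has two assignments and one expression. A plus sign (+) is still a plus sign even when it is rotated to look like an x; * is the multiplication sign
In an assignment the variable name is a single Chinese character. The characters come from two verses (commas excluded): 君不见，黄河之水天上来，奔流到海不复回 and 烟锁池塘柳，深圳铁板烧
The arithmetic expressions use addition, subtraction, multiplication, fractions and parentheses. The numbers are multi-digit, and the Chinese characters are variables assigned by the lines above.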
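
To make the rules concrete, here is a small sketch of the alphabet a recognizer for these captchas has to cover. The names `VERSES`, `HANZI`, `SYMBOLS` and `CHARSET` are ours, and the exact symbol list is an assumption derived from the rules above, not something taken from the original notebook.

```python
# Sketch: build the alphabet implied by the rules above.
# The symbol list is an assumption; adjust it to the real labels.
VERSES = ["君不见黄河之水天上来奔流到海不复回", "烟锁池塘柳深圳铁板烧"]

# dict.fromkeys keeps first-seen order and drops duplicates (the repeated 不).
HANZI = "".join(dict.fromkeys("".join(VERSES)))

DIGITS = "0123456789"
SYMBOLS = "+-*/=()"  # assumed: add, subtract, multiply, fraction, assign, parentheses

CHARSET = DIGITS + SYMBOLS + HANZI
N_CLASS = len(CHARSET) + 1  # one extra class for the CTC blank

print(len(HANZI), N_CLASS)
```

If the real labels contain other symbols, only `SYMBOLS` needs to change; the class count follows from it.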

Advanced captcha

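This captcha is hard. Feeding the raw image straight into an end-to-end recognizer is too difficult: first, the images are so large that training takes too long; second, the model does not converge easily.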

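We therefore preprocess the images first (crop, then stitch) and only then feed them to a CNN + CTC network for training. The CNN structure needs some adjustment, which we describe in detail below.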

These captchas are still variable-length, so we simply use the maximum captcha length as the number of outputs, which is 48 here.
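
As an illustration of what "maximum length as the number of outputs" means for the labels, the sketch below pads every label to 48 positions and keeps its true length alongside it. The helper `encode_label`, the toy `charset` and the choice of padding index are assumptions for illustration, not the original notebook's code.

```python
import numpy as np

MAX_LEN = 48  # maximum label length stated in the text

def encode_label(text, charset, max_len=MAX_LEN):
    """Map a label string to a fixed-size index array plus its true length."""
    if len(text) > max_len:
        raise ValueError("label longer than max_len")
    blank = len(charset)  # assumed padding index: one past the last real class
    y = np.full(max_len, blank, dtype=np.int64)
    y[:len(text)] = [charset.index(c) for c in text]
    return y, len(text)

charset = "0123456789+-*/=()"  # hypothetical subset, for illustration only
y, n = encode_label("(12+3)*4", charset)
print(y[:n], n)
```

The true length `n` is what a CTC loss uses to ignore the padded tail of `y`.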

In the end, accuracy on this type of captcha can reach 0.9986 at best; the code in this post reaches up to 0.9908.

Training data download: -> click here to download <-

0x01 Preprocessing the training samples

To make learning easier and to speed up convergence, we have to crop the original images.

First we import the packages we need and define a small plotting helper:

import cv2
import seaborn
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

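def disp(img, i, txt=None):
    # Show img in slot i of a 3x3 grid, with an optional title.
    plt.subplot(3, 3, i)
    if len(img.shape) == 2:
        # Single-channel image: render as grayscale.
        plt.imshow(img, cmap='gray')
    else:
        # OpenCV loads images as BGR; reverse the channel axis to get RGB.
        plt.imshow(img[:,:,::-1])
    if txt:
        plt.title(txt)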

Cropping the image with OpenCV:

plt.figure(figsize=(16, 10))

pic_id = 0
img_raw = cv2.imread('./data_validate/{0}.png'.format(pic_id))
m, n, _ = img_raw.shape
img_gray = cv2.cvtColor(img_raw, cv2.COLOR_BGR2GRAY)

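# Median blur with a 9x9 window to suppress fine noise.
img_median = cv2.medianBlur(img_gray, 9)

# Equalize the histogram, then binarize at a fixed threshold.
img_hist = cv2.equalizeHist(img_median)
_, img_bw = cv2.threshold(img_hist, 127, 255, cv2.THRESH_BINARY)

# Closing with a small kernel, then opening with a large (30x60) kernel,
# so that nearby strokes fuse into large regions.
img_close = cv2.morphologyEx(img_bw, cv2.MORPH_CLOSE, np.ones((4, 3)))
img_open = cv2.morphologyEx(img_close, cv2.MORPH_OPEN, np.ones((30, 60)))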

img_copy = img_raw.copy()
margin = 5

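# findContours returns 2 values in OpenCV 2.x/4.x and 3 values in 3.x;
# indexing with [-2] picks the contour list in all of them.
contours = cv2.findContours(cv2.bitwise_not(img_open), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]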
new_contours = []

# Skip small regions (area < 1000), which are most likely noise.
for contour in contours:
    area = cv2.contourArea(contour)
    if area < 1000:
        continue
    new_contours.append(contour)
    x, y, w, h = cv2.boundingRect(contour)
    cv2.rectangle(img_copy, (x, y), (x+w, y+h), (0, 255, 0), 3)

# Crop every kept region with a small margin, clamped to the image bounds.
for j, contour in enumerate(new_contours):
    x, y, w, h = cv2.boundingRect(contour)
    img_crop = img_raw[max(0, y-margin):min(m, y+h+margin),max(0, x-margin):min(n, x+w+margin)]

# Bounding box around all kept regions together, used for the plots below.
x, y, w, h = cv2.boundingRect(np.vstack(new_contours))
img_crop = img_copy[max(0, y-margin):min(m, y+h+margin),max(0, x-margin):min(n, x+w+margin)]

disp(img_raw, 1, 'raw img')
disp(img_gray, 2, 'img_gray')
disp(img_median, 3, 'img_median')
disp(img_bw, 4, 'bw')
disp(img_close[max(0, y-margin):min(m, y+h+margin),max(0, x-margin):min(n, x+w+margin)], 5, 'close')
disp(img_open[max(0, y-margin):min(m, y+h+margin),max(0, x-margin):min(n, x+w+margin)], 6, 'open')
disp(img_copy, 7, 'rect')
disp(img_crop, 8, 'crop')

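My skills are limited, so the parameters of the cropping algorithm are not optimal; some images still need manual tuning before they are cut correctly.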

Cropping the original captcha image

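OpenCV locates the formula regions in the original image and cuts out three small images, which are then placed side by side for end-to-end recognition.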

Stitching the small images
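
The stitching step can be sketched with NumPy alone. The original notebook's stitching code is not shown in this post, so the helper name `stitch_crops`, the white padding and the top alignment below are all assumptions.

```python
import numpy as np

def stitch_crops(crops, pad_value=255):
    """Pad crops to a common height and place them side by side."""
    height = max(c.shape[0] for c in crops)
    padded = []
    for c in crops:
        # Canvas of the common height, filled with the padding colour.
        canvas = np.full((height,) + c.shape[1:], pad_value, dtype=c.dtype)
        canvas[:c.shape[0]] = c  # top-align each crop
        padded.append(canvas)
    return np.hstack(padded)

# Three fake BGR crops of different sizes stand in for real formula lines.
crops = [np.zeros((h, w, 3), dtype=np.uint8)
         for h, w in [(30, 80), (40, 60), (35, 100)]]
stitched = stitch_crops(crops)
print(stitched.shape)
```

The stitched image would still need resizing to the network's fixed input size before training.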

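My preprocessing still leaves a lot of room for improvement:

One is to stitch the three crops into a single large image up front and save it, which cuts down image loading time
The other is data augmentation: crops that carry the same Chinese character can be swapped for one another, so we can generate more samples than the given dataset contains (100,000)

I was not thinking about optimization when I ran the experiments, so these two points did not occur to me then. I note them here and may try them when I get the chance.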

0x02 Tuning the network structure

Our experiments on the basic captchas showed clearly that the CNN + CTC model works best, so this time we use that model directly.

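When we reused the network structure from the basic captchas, it turned out to be too simple and crude, and the results were poor. To improve it we looked at some models known to perform well.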

We chose on the basis of CIFAR10 accuracy, and along the way we found a gem:

Model	Acc.
VGG16	92.64%
ResNet18	93.02%
ResNet50	93.62%
ResNet101	93.75%
ResNeXt29(32x4d)	94.73%
ResNeXt29(2x64d)	94.82%
DenseNet121	95.04%
ResNet18(pre-act)	95.11%
DPN92	95.16%

We searched for the models above and found that Keras ships a VGG16 implementation, so on a whim we decided to start with VGG16.

Keras - Github

VGG16

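Testing showed, however, that too many pooling layers leave the GRU with too short an input sequence, which hurts recognition, so we reduced the number of pooling layers.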
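
A quick calculation shows why pooling matters here. Each 2x2 pooling halves the width, and the width that remains is the number of time steps the GRU and CTC see; CTC needs at least as many time steps as label characters. The input width of 512 below is a hypothetical value for illustration, not the width used in the original notebook.

```python
def ctc_time_steps(width, n_pools, pool=2):
    """Number of time steps left after n_pools poolings along the width."""
    for _ in range(n_pools):
        width //= pool  # 'valid' pooling floors the result
    return width

MAX_LABEL_LEN = 48   # from the text
WIDTH = 512          # hypothetical input width, for illustration only

# VGG16 has 5 pooling layers; the trimmed network below has 3.
for n_pools in (5, 3):
    steps = ctc_time_steps(WIDTH, n_pools)
    print(n_pools, steps, steps >= MAX_LABEL_LEN)
```

With five poolings the sequence is shorter than the longest label, so the network cannot emit all 48 characters; with three it can.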

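Besides, deep learning is only deep if the network is deep, so while removing pooling layers we also add convolution layers to deepen the network.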

# Block 1
x = Conv2D(32, (3, 3), padding='same', name='block1_conv1')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(32, (3, 3), padding='same', name='block1_conv2')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), name='block1_pool')(x)

# Block 2
x = Conv2D(64, (3, 3), padding='same', name='block2_conv1')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(64, (3, 3), padding='same', name='block2_conv2')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(64, (3, 3), padding='same', name='block2_conv3')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), name='block2_pool')(x)

# Block 3
x = Conv2D(128, (3, 3), padding='same', name='block3_conv1')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(128, (3, 3), padding='same', name='block3_conv2')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(128, (3, 3), padding='same', name='block3_conv3')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(128, (3, 3), padding='same', name='block3_conv4')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(128, (3, 3), padding='same', name='block3_conv5')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(128, (3, 3), padding='same', name='block3_conv6')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), name='block3_pool')(x)

Everything else is the same as for the basic captchas; the main thing left to get right is the training method.

CNN + CTC model
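
For readers new to CTC, the sketch below shows how per-time-step outputs turn into a string with greedy decoding: take the argmax at each step, merge repeated indices, then drop blanks. The helper `ctc_greedy_decode` and the toy `charset` are ours, and we assume the blank is the last class.

```python
import numpy as np

def ctc_greedy_decode(probs, charset):
    """Greedy CTC decoding: argmax per step, merge repeats, drop blanks."""
    blank = len(charset)  # assumed: the blank is the last class
    best = np.argmax(probs, axis=1)
    out = []
    prev = blank
    for k in best:
        # Emit a character only when it differs from the previous step
        # and is not the blank.
        if k != prev and k != blank:
            out.append(charset[k])
        prev = k
    return "".join(out)

charset = "0123456789+"          # hypothetical subset, for illustration only
path = [1, 1, 11, 1, 10, 10, 2]  # 11 is the blank index here
probs = np.eye(len(charset) + 1)[path]  # one-hot "softmax" outputs
decoded = ctc_greedy_decode(probs, charset)
print(decoded)
```

The blank between the two 1s is what lets CTC output a doubled character instead of merging it.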

Training method

We change the lr parameter of adam by hand:

0.001 for epoch [0,25)
0.0001 for epoch [25,50)
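
We changed the learning rate by hand, but the same schedule can also be written as a function and handed to Keras. The sketch below is that alternative, not what the original notebook does; the function name `lr_schedule` is ours.

```python
def lr_schedule(epoch):
    """Learning rate for a zero-based epoch index, per the schedule above."""
    return 1e-3 if epoch < 25 else 1e-4

# With Keras installed, the function can be passed to the callback:
#   from keras.callbacks import LearningRateScheduler
#   model.fit(..., epochs=50, callbacks=[LearningRateScheduler(lr_schedule)])
print([lr_schedule(e) for e in (0, 24, 25, 49)])
```

The callback calls the function at the start of every epoch, so training does not need to be stopped and restarted at epoch 25.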

0x03 Optimization methods

The optimized notebook is available by clicking here.

Deepening the model

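The model here uses 2 / 3 / 6 convolution layers in its three blocks. It can be deepened to 2 / 4 / 6, which improves accuracy by a few thousandths, but the deeper model takes much longer to train.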

Changing the layer initialization

Keras layers come with their initialization already set through kernel_initializer and bias_initializer. A good initialization helps the model converge faster.

All we need to do is add kernel_initializer='he_normal' to every Conv2D.
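
To show what this initializer does, here is a NumPy sketch of the idea. Keras documents he_normal as a truncated normal with stddev = sqrt(2 / fan_in); the sketch below uses a plain (untruncated) normal, so it only approximates the Keras behaviour. The helper name `he_normal_sketch` is ours.

```python
import numpy as np

def he_normal_sketch(shape, seed=0):
    """Draw conv kernel weights with stddev sqrt(2 / fan_in)."""
    kh, kw, in_ch, _ = shape  # Keras Conv2D kernel layout: (rows, cols, in, out)
    fan_in = kh * kw * in_ch
    std = np.sqrt(2.0 / fan_in)
    rng = np.random.RandomState(seed)
    # Keras truncates the normal distribution; this sketch does not.
    return rng.normal(0.0, std, size=shape), std

# Kernel shape of a 3x3 convolution going from 64 to 128 channels.
w, std = he_normal_sketch((3, 3, 64, 128))
print(std, w.std())
```

The more inputs a unit has, the smaller its initial weights, which keeps activation variance roughly stable through stacked ReLU layers.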

Initializers - Keras Documentation

Controlling overfitting

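The other thing to do is control overfitting. When results are very good on the training set but accuracy on the test set will not improve, that is a clear sign of overfitting.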

So we use L2 regularization and add Dropout to control overfitting.

from keras.regularizers import l2

weight_decay = 1e-4

x = Conv2D(128, (3, 3), padding='same', name='block3_conv1', kernel_regularizer=l2(weight_decay))(x)
...
x = Dense(64, kernel_regularizer=l2(weight_decay), bias_regularizer=l2(weight_decay))(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.25)(x)
...
x = concatenate([gru_2, gru_2b])
x = Dropout(0.25)(x)
x = Dense(n_class, kernel_initializer='he_normal', activation='softmax', 
          kernel_regularizer=l2(weight_decay), bias_regularizer=l2(weight_decay))(x)
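
For reference, the term that `l2(weight_decay)` adds to the loss for one weight tensor is `weight_decay * sum(w ** 2)`. The NumPy sketch below computes it for a toy vector; the helper name `l2_penalty` is ours.

```python
import numpy as np

def l2_penalty(weights, weight_decay=1e-4):
    """Penalty added to the loss: weight_decay * sum(w ** 2)."""
    return weight_decay * float(np.sum(np.square(weights)))

w = np.array([1.0, -2.0, 3.0])
print(l2_penalty(w))
```

Large weights are penalized quadratically, which nudges the network toward smaller weights and away from memorizing the training set.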

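With the optimized code, the limit after ensembling several models is about 0.9986, which is good enough to crack this type of captcha in industrial use.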

0x04 Reference

使用深度学习来破解 captcha 验证码 (Using deep learning to crack captchas) - 杨培文

pytorch-cifar - GitHub

Keras - Github