DL Basics: Cracking Captcha I

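When starting out with deep learning, we inevitably run into classic problems such as Dogs vs. Cats and captcha recognition. To consolidate those fundamentals, I wrote this summary of what I learned while cracking a captcha.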

** The code in this post is written against Keras 2.0.4 **

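The captchas we need to recognize follow these rules:

3 operands: three integer digits from 0 to 9;
2 operators: each can be +, - or *, for addition, subtraction and multiplication;
0 or 1 pair of parentheses: there may be no parentheses or exactly one pair.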
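To make the rules above concrete, here is a minimal sketch that produces label strings of this form. The function name `random_expression` is hypothetical, and where the parentheses go (around the first or the second pair of operands) is my assumption; the rules only say there is zero or one pair.

```python
import random

DIGITS = '0123456789'
OPERATORS = '+-*'


def random_expression(rng=random):
    """Return a random label such as '3+4*5' or '(3+4)*5'."""
    a, b, c = (rng.choice(DIGITS) for _ in range(3))
    op1, op2 = (rng.choice(OPERATORS) for _ in range(2))
    # Assumption: a pair of parentheses wraps either the first two
    # or the last two operands, or is absent.
    style = rng.choice(['none', 'left', 'right'])
    if style == 'left':
        return '({}{}{}){}{}'.format(a, op1, b, op2, c)
    if style == 'right':
        return '{}{}({}{}{})'.format(a, op1, b, op2, c)
    return '{}{}{}{}{}'.format(a, op1, b, op2, c)
```

Every label built this way has either 5 characters (no parentheses) or 7 characters (one pair), which is where the maximum length of 7 comes from.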

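Beginner-level captcha

The captcha cracked here is beginner level: feeding the model enough data is sufficient to reach 90% accuracy or better, and with the CTC Loss described below it can reach 99.9% or better.

These captchas have variable length, so we simply size the output to the maximum captcha length, which here is 7.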
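One way to handle the variable length is to pad every label up to 7 positions with an extra "blank" class. The sketch below assumes the character set `'0123456789+-*()'` used by the code in this post, with the blank placed right after the last real character; the helper names `encode` and `decode_indices` are hypothetical.

```python
characters = '0123456789+-*()'
n_len = 7                   # maximum captcha length
blank = len(characters)     # index 15, the extra padding class


def encode(label):
    """Map a label to n_len class indices, padded with the blank index."""
    if len(label) > n_len:
        raise ValueError('label is longer than n_len')
    indices = [characters.index(ch) for ch in label]
    return indices + [blank] * (n_len - len(indices))


def decode_indices(indices):
    """Inverse of encode: drop the blanks and join the characters."""
    return ''.join(characters[i] for i in indices if i != blank)
```

For example, `encode('3+4*5')` gives `[3, 10, 4, 12, 5, 15, 15]`: five real characters followed by two blanks.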

Training data download: -> click here to download <-

0x01 Loading the training samples

The code in this post is excerpted from a Jupyter Notebook.

import random
import numpy as np
import matplotlib.pyplot as plt

from PIL import Image
from keras.models import *
from keras.layers import *
from keras.callbacks import *
from keras import backend as K
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

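# the 15 characters that can appear in a captcha
characters = '0123456789+-*()'

# image width and height, maximum label length, and class count (15 characters + 1 blank)
width, height, n_len, n_class = 180, 60, 7, len(characters) + 1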

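The code above is the initial setup. The + 1 in len(characters) + 1 adds one extra class for the blank symbol.

180 and 60 are the width and height of the training images.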

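def decode(y):
    # y is a list of n_len arrays of shape (batch, n_class); decode the first sample
    y = np.argmax(np.array(y), axis=2)[:, 0]
    # index len(characters) is the blank used for padding, so skip it
    return ''.join([characters[x] for x in y if x < len(characters)])


def get_data(filename):
    # one label per line; the label text is the first space-separated field
    with open(filename) as file_handle:
        file_content = file_handle.read().split('\n')

    # drop the empty entry left by a trailing newline
    if file_content[-1] == '':
        del file_content[-1]

    batch_size = len(file_content)

    X = np.zeros((batch_size, height, width, 3), dtype=np.uint8)
    # one one-hot array per character position
    y = [np.zeros((batch_size, n_class), dtype=np.uint8) for i in range(n_len)]

    for i, line in enumerate(file_content):
        tmp_label = line.split(' ')
        X[i] = Image.open('./data_train/{}.png'.format(i))

        # labels have 5 or 7 characters, so pre-fill the last two positions with the blank class
        y[5][i, n_class - 1] = 1
        y[6][i, n_class - 1] = 1
        for j, ch in enumerate(tmp_label[0]):
            y[j][i, :] = 0
            y[j][i, characters.find(ch)] = 1

    return X, y

X_train, y_train = get_data('./data_train/labels.txt')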

In the code above, decode turns the encoded output back into a string and get_data loads the training data. With that, the preparation is done.

0x02 CNN

input_tensor = Input((height, width, 3))
x = input_tensor

# four conv blocks with 32, 64, 128 and 256 filters; each block halves the spatial size
for i in range(4):
    x = Conv2D(32*2**i, (3, 3), padding='same', kernel_initializer='he_normal')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(32*2**i, (3, 3), padding='same', kernel_initializer='he_normal')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = MaxPooling2D((2, 2))(x)

x = Flatten()(x)
x = Dropout(0.25)(x)
x = [Dense(n_class, activation='softmax', name='c%d'%(i+1))(x) for i in range(n_len)]

model = Model(input_tensor, x)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train, batch_size=128, epochs=25, validation_split=0.25, shuffle=True)

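We simply brute-force the images into the network. After about 25 epochs of training the result is around 0.98x, close to 0.99. The model structure is shown below.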

[Figure: model visualization, CNN]
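When reading the accuracy of this model, keep in mind that it has one softmax output per character position, and as far as I know Keras reports the accuracy of each output separately. A captcha only counts as solved when every position is right, so it can be useful to also measure whole-string accuracy. A minimal sketch, with a hypothetical helper name, working on already decoded strings:

```python
def exact_match_accuracy(predicted, expected):
    """Fraction of captchas whose whole string is predicted correctly."""
    if len(predicted) != len(expected):
        raise ValueError('predicted and expected must have the same length')
    if not expected:
        return 0.0
    # a sample is a hit only if the full string matches
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)
```

Whole-string accuracy can never be higher than the accuracy of the weakest single position, because one wrong character is enough to miss the captcha.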

To improve further we need a different approach. Next we use a recurrent neural network to raise the recognition accuracy.

0x03 CNN + CTC

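CTC (Connectionist Temporal Classification) is a loss function for supervised learning on sequence data.

CTC Loss is a rather magical loss: it lets the model converge when we know only the order of the symbols in the sequence, not their exact positions.

[Figure: CTC]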
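To see why positions do not need to be known, it helps to look at how a CTC output is usually read back: take the most likely class per frame, merge consecutive repeats, then drop the blanks. The sketch below is a plain-Python illustration of that greedy rule, not the decoder used in this post; the function name is hypothetical, and it assumes the blank is the last class index, which is my understanding of the convention used by Keras with the TensorFlow backend.

```python
from itertools import groupby

characters = '0123456789+-*()'
blank = len(characters)  # assumption: the blank is the last class index


def ctc_greedy_collapse(frame_indices):
    """Merge consecutive repeated indices, then remove blanks."""
    collapsed = [key for key, _ in groupby(frame_indices)]
    return ''.join(characters[k] for k in collapsed if k != blank)
```

For instance, the frame sequence `[1, 1, 15, 10, 1, 15, 12, 12, 1]` collapses to `'1+1*1'`. Many different frame alignments collapse to the same string, and CTC Loss sums over all of them.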

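Keras already ships with CTC Loss, so a small function like the one below is all we need to use it.

Also, because we use a recurrent neural network, we drop its first two outputs by default: they are usually meaningless and would hurt the model's output.

labels is the captcha text, 7 characters (digits or symbols);
y_pred is the model output: for each time step, in order, the probabilities of the 16 classes (15 characters plus one blank; CTC needs the blank to separate repeated characters and to fill frames that carry no character);
input_length is the length of y_pred, here 20 (22 - 2);
label_length is the length of labels, here 7.

Strictly speaking, label_length should be the actual, variable length of each label, but these captchas are simple enough that it makes little difference.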

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
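The post fixes label_length at 7 and notes that it should really vary per sample. As a hedged sketch of what that could look like, the helper below builds the three extra inputs from label strings, using each label's true length. The name `ctc_inputs` is hypothetical, it returns nested lists (wrap them with `np.array` before passing them to the model), and the padding value is an assumption: to my understanding, positions beyond label_length are ignored by `K.ctc_batch_cost`.

```python
characters = '0123456789+-*()'
n_len = 7
blank = len(characters)
time_steps = 22 - 2  # frames left after dropping the first two outputs


def ctc_inputs(label_texts):
    """Return (labels, input_length, label_length) as nested lists."""
    labels, input_length, label_length = [], [], []
    for text in label_texts:
        indices = [characters.index(ch) for ch in text]
        # pad to n_len so every row has the same width
        labels.append(indices + [blank] * (n_len - len(indices)))
        input_length.append([time_steps])
        # the true length: 5 without parentheses, 7 with one pair
        label_length.append([len(indices)])
    return labels, input_length, label_length
```

With `['3+4*5', '(3+4)*5']` this yields label lengths `[[5], [7]]` and input lengths `[[20], [20]]`.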

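The model design is roughly this: a convolutional network extracts features, a dense layer reduces their dimensionality, and the result is fed in horizontal order into a special kind of recurrent network called a GRU.

Yang Peiwen's article reports that in engineering practice GRU works better than LSTM, so we use GRU here.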

from keras.models import *
from keras.layers import *

rnn_size = 128

input_tensor = Input((width, height, 3))
x = input_tensor

for i in range(3):
    x = Conv2D(32, (3, 3), padding='same', kernel_initializer='he_normal')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(32, (3, 3), padding='same', kernel_initializer='he_normal')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = MaxPooling2D((2, 2))(x)
    
cnn_model = Model(input_tensor, x, name='CNN')
x = cnn_model(input_tensor)

conv_shape = x.get_shape()
x = Reshape((int(conv_shape[1]), int(conv_shape[3] * conv_shape[2])))(x)

x = Dense(32)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.25)(x)

gru_1 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru1')(x)
gru_1b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru1_b')(x)
gru1_merged = add([gru_1, gru_1b])

gru_2 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru2')(gru1_merged)
gru_2b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru2_b')(gru1_merged)

x = concatenate([gru_2, gru_2b])
x = Dropout(0.25)(x)
x = Dense(n_class, kernel_initializer='he_normal', activation='softmax')(x)

base_model = Model(input_tensor, x)

labels = Input(name='the_labels', shape=[n_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([x, labels, input_length, label_length])

model = Model(inputs=[input_tensor, labels, input_length, label_length], outputs=loss_out)
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adam')

[Figure: model visualization, CNN & CTC]

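Compared with the previous model, this one looks considerably more complex, but really it just has more inputs.

The one thing to watch is the image input. The CNN above took images in numpy's default layout, (height, width, 3), whereas the CTC model takes (width, height, 3).

This is because we want to feed the image in horizontally. After the convolutions, the pooling and the dense layer, the features have shape (22, 32): 22 vertical strips of the image, from left to right, each described by a vector of length 32.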
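The shape arithmetic behind that paragraph can be checked without running Keras. The sketch below follows the layers in the model code, assuming 'same'-padded convolutions keep the spatial size and each 2x2 max pooling halves it with floor division (the Keras default of 'valid' pooling); the function name is hypothetical.

```python
def trace_shapes(width=180, height=60, blocks=3, filters=32, dense_units=32):
    """Follow the tensor shape (batch axis omitted) through the CNN front end."""
    w, h = width, height
    for _ in range(blocks):
        # the convolutions keep the size; the 2x2 max pool halves it, rounding down
        w, h = w // 2, h // 2
    conv_shape = (w, h, filters)      # output of the last pooling layer
    reshaped = (w, h * filters)       # Reshape merges the height and channel axes
    features = (w, dense_units)       # Dense(32) acts on the last axis only
    return conv_shape, reshaped, features
```

With the defaults this gives (22, 7, 32), then (22, 224), then (22, 32): a sequence of 22 steps with 32 features each. Dropping the first two steps leaves the 20 used as input_length.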

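Then we split into two paths: one feeds the sequence into a GRU from left to right, the other from right to left, and their outputs are added together. We split into two paths again, one forward and one backward, but this second time we concatenate the outputs directly and pass them through a dense layer that outputs the probability of each character.

With the CTC Loss & CNN model, the final accuracy reaches 99.989%. It could probably go higher still, but for lack of time I did not keep training.

In practice, though, reaching 99% already counts as fully cracking this captcha; in other words, this kind of human verification no longer works.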

0x04 Reference

Using deep learning to crack captcha (使用深度学习来破解 captcha 验证码) - Yang Peiwen (杨培文)