Introduction

In this project we will reproduce the results of Deep Speech: Scaling up end-to-end speech recognition. The core of the system is a bidirectional recurrent neural network (BRNN) trained to ingest speech spectrograms and generate English text transcriptions.

Let a single utterance $$x$$ and label $$y$$ be sampled from a training set

$S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}.$

Each utterance, $$x^{(i)}$$, is a time series of length $$T^{(i)}$$ in which every time slice is a vector of audio features, $$x^{(i)}_t$$, for $$t=1,\ldots,T^{(i)}$$. We use MFCCs as our features, so $$x^{(i)}_{t,p}$$ denotes the $$p$$-th MFCC feature of the audio frame at time $$t$$. The goal of our BRNN is to convert an input sequence $$x$$ into a sequence of character probabilities for the transcription $$y$$, with $$\hat{y}_t = \mathbb{P}(c_t \mid x)$$, where $$c_t \in \{a, b, c, \ldots, z, space, apostrophe, blank\}$$. (The significance of $$blank$$ will be explained below.)
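For illustration, a feature matrix of this shape can be computed as follows. This is a minimal sketch assuming the librosa library, a 16 kHz mono recording named utterance.wav, and 26 coefficients per frame; the project's actual feature pipeline may differ.

```python
import librosa

# Load a mono 16 kHz waveform (file name and sample rate are assumptions).
audio, sample_rate = librosa.load("utterance.wav", sr=16000)

# 26 MFCCs per frame; librosa returns shape (n_mfcc, T).
mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=26)

# Transpose to (T, n_mfcc) so that x[t, p] is the p-th MFCC feature
# of the frame at time t, matching the notation x^{(i)}_{t,p}.
x = mfcc.T
```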

Our BRNN model is composed of $$5$$ layers of hidden units. For an input $$x$$, the hidden units at layer $$l$$ are denoted $$h^{(l)}$$ with the convention that $$h^{(0)}$$ is the input. The first three layers are not recurrent. For the first layer, at each time $$t$$, the output depends on the MFCC frame $$x_t$$ along with a context of $$C$$ frames on each side. (We typically use $$C \in \{5, 7, 9\}$$ for our experiments.) The remaining non-recurrent layers operate on independent data for each time step. Thus, for each time $$t$$, the first $$3$$ layers are computed by:

$h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})$

where $$g(z) = \min\{\max\{0, z\}, 20\}$$ is a clipped rectified-linear (ReLU) activation function and $$W^{(l)}$$, $$b^{(l)}$$ are the weight matrix and bias parameters for layer $$l$$. The fourth layer is a bidirectional recurrent layer [1]. This layer includes two sets of hidden units: a set with forward recurrence, $$h^{(f)}$$, and a set with backward recurrence, $$h^{(b)}$$:

\begin{align}h^{(f)}_t &= g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})\\h^{(b)}_t &= g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})\end{align}

Note that $$h^{(f)}$$ must be computed sequentially from $$t = 1$$ to $$t = T^{(i)}$$ for the $$i$$-th utterance, while the units $$h^{(b)}$$ must be computed sequentially in reverse from $$t = T^{(i)}$$ to $$t = 1$$.
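To make these computations concrete, here is a minimal NumPy sketch for a single utterance. All weight names and shapes are illustrative assumptions, not the project's actual variables: it shows the context-windowed first-layer input, the clipped ReLU $$g$$, and the fourth-layer recurrences. (Layers $$1$$ through $$3$$ are each just $$g$$ applied to an affine map of the previous layer, per time step.)

```python
import numpy as np

def g(z):
    """Clipped ReLU: g(z) = min(max(0, z), 20)."""
    return np.minimum(np.maximum(0.0, z), 20.0)

def context_window(x, C):
    """First-layer input: each MFCC frame plus C frames of context per side.

    x: (T, p) feature matrix -> (T, (2 * C + 1) * p).
    """
    T, p = x.shape
    padded = np.pad(x, ((C, C), (0, 0)))           # zero-pad the edges
    return np.stack([padded[t:t + 2 * C + 1].ravel() for t in range(T)])

def bidirectional_layer(h3, W4, Wf_r, Wb_r, b4):
    """Fourth layer: forward and backward recurrences over one utterance.

    h3:   (T, n) third-layer activations
    W4:   (n, n) input weights, shared by both directions
    Wf_r: (n, n) forward recurrent weights
    Wb_r: (n, n) backward recurrent weights
    b4:   (n,)   bias, shared by both directions
    """
    T, n = h3.shape
    h_f = np.zeros((T, n))
    h_b = np.zeros((T, n))
    # Forward units run sequentially from the first time slice to the last.
    for t in range(T):
        prev = h_f[t - 1] if t > 0 else np.zeros(n)
        h_f[t] = g(W4 @ h3[t] + Wf_r @ prev + b4)
    # Backward units run in reverse, from the last time slice to the first.
    for t in reversed(range(T)):
        nxt = h_b[t + 1] if t < T - 1 else np.zeros(n)
        h_b[t] = g(W4 @ h3[t] + Wb_r @ nxt + b4)
    return h_f, h_b
```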

The fifth (non-recurrent) layer takes both the forward and backward units as inputs:

$h^{(5)}_t = g(W^{(5)} h^{(4)}_t + b^{(5)})$

where $$h^{(4)}_t = h^{(f)}_t + h^{(b)}_t$$. The output layer consists of standard logits that correspond to the predicted character probabilities for each time slice $$t$$ and character $$k$$ in the alphabet:

$h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k$

Here $$b^{(6)}_k$$ denotes the $$k$$-th bias and $$(W^{(6)} h^{(5)}_t)_k$$ the $$k$$-th element of the matrix-vector product.
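Continuing the NumPy sketch above, with W5, b5, W6, and b6 as assumed arrays of compatible shapes standing in for $$W^{(5)}, b^{(5)}, W^{(6)}, b^{(6)}$$, the last two layers reduce to:

```python
# Fifth layer: sum the forward and backward units, then a dense layer.
h4 = h_f + h_b              # (T, n); h^{(4)}_t = h^{(f)}_t + h^{(b)}_t
h5 = g(h4 @ W5.T + b5)      # (T, n)

# Output layer: one logit per character per time slice.
logits = h5 @ W6.T + b6     # (T, 29): a-z, space, apostrophe, blank
```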

Once we have computed a prediction for $$\hat{y}_{t,k}$$, we compute the CTC loss [2] $$\mathcal{L}(\hat{y}, y)$$ to measure the error in prediction. During training, we can evaluate the gradient $$\nabla \mathcal{L}(\hat{y}, y)$$ with respect to the network outputs given the ground-truth character sequence $$y$$. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use the Adam method for training [3].
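As an illustration of this training step, the sketch below evaluates a CTC loss and one Adam update with PyTorch. This is an assumption-laden toy: the shapes, the blank index, and the random tensors merely stand in for a real model and batch, and the project may use a different framework's CTC implementation.

```python
import torch
import torch.nn as nn

# Toy batch: T time slices, N utterances, K = 29 characters
# (a-z, space, apostrophe, and the CTC blank at index 28).
T, N, K = 100, 4, 29
logits = torch.randn(T, N, K, requires_grad=True)    # stands in for h^{(6)}

targets = torch.randint(0, K - 1, (N, 20))           # character ids, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=K - 1)
optimizer = torch.optim.Adam([logits], lr=1e-4)      # stands in for model params

log_probs = logits.log_softmax(dim=2)                # CTC expects log-probs
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()   # gradient w.r.t. the outputs, then through the network
optimizer.step()  # one Adam update [3]
```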

The complete BRNN model is illustrated in the figure below.