Welcome to DeepSpeech’s documentation!

DeepSpeech.BiRNN(batch_x, seq_length, dropout)
That done, we define the learned variables, the weights and biases, within the method BiRNN(), which also constructs the neural network. The variables named hn, where n is an integer, hold the learned weight variables. The variables named bn, where n is an integer, hold the learned bias variables. In particular, the first variable h1 holds the learned weight matrix that converts an input vector of dimension n_input + 2*n_input*n_context to a vector of dimension n_hidden_1. Similarly, the second variable h2 holds the weight matrix converting an input vector of dimension n_hidden_1 to one of dimension n_hidden_2. The variables h3, h5, and h6 are similar. Likewise, the biases b1, b2, …, hold the biases for the various layers.
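To make the dimensions concrete, the following minimal sketch derives the shapes of the first two weight matrices and biases from the sizes named above. The numeric values are illustrative assumptions, not DeepSpeech defaults, and the helper layer_shapes is hypothetical; it is not part of DeepSpeech.

```python
# Illustrative sizes only; these values are assumptions, not DeepSpeech defaults.
n_input = 26       # features per time step
n_context = 9      # context frames on each side of a time step
n_hidden_1 = 1024
n_hidden_2 = 1024


def layer_shapes(n_in, n_out):
    """Return (weight_shape, bias_shape) for a dense layer mapping n_in -> n_out."""
    return (n_in, n_out), (n_out,)


# h1/b1: input vector (current frame plus left and right context) -> n_hidden_1
h1_shape, b1_shape = layer_shapes(n_input + 2 * n_input * n_context, n_hidden_1)
# h2/b2: n_hidden_1 -> n_hidden_2
h2_shape, b2_shape = layer_shapes(n_hidden_1, n_hidden_2)

print(h1_shape, b1_shape)
print(h2_shape, b2_shape)
```

With these assumed sizes, the input vector has 26 + 2*26*9 = 494 entries, so h1 would be a 494 x 1024 matrix.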

class DeepSpeech.Epoch(index, num_jobs, set_name='train', report=False)
Represents an epoch that should be executed by the Training Coordinator. Creates num_jobs WorkerJob instances in state ‘open’.
 Args:
  index (int): the epoch index of the ‘parent’ epoch
  num_jobs (int): the number of jobs in this epoch
 Kwargs:
  set_name (str): the name of the dataset, one of ‘train’, ‘dev’, ‘test’
  report (bool): whether this job should produce a WER report

done()
Checks whether all jobs of the epoch are in state ‘done’. It also lazily prepares a WER report from the result data of all jobs.
 Returns:
  bool. True if all jobs of the epoch are ‘done’

finish_job(job)
Finishes a running job. Removes it from the running jobs list and adds it to the done jobs list.
 Args:
  job (WorkerJob): the job to put into state ‘done’

get_job(worker)
Gets the next open job from this epoch. The job will be marked as ‘running’.
 Args:
  worker (int): index of the worker that takes the job
 Returns:
  WorkerJob. The job that has been marked as running for this worker
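
The job lifecycle described for Epoch (open, running, done) can be sketched as follows. SketchJob and SketchEpoch are hypothetical stand-ins written for illustration; they are not the DeepSpeech classes, they omit the WER report that done() prepares, and returning None when no open job is left is an assumption.

```python
class SketchJob:
    """Hypothetical stand-in for WorkerJob; only tracks an index and a worker."""

    def __init__(self, index):
        self.index = index
        self.worker = None


class SketchEpoch:
    """Hypothetical sketch of the open -> running -> done job lifecycle."""

    def __init__(self, index, num_jobs):
        self.index = index
        self.jobs_open = [SketchJob(i) for i in range(num_jobs)]
        self.jobs_running = []
        self.jobs_done = []

    def get_job(self, worker):
        """Hand the next open job to `worker` and mark it as running."""
        if not self.jobs_open:
            return None  # assumption: no open job left yields None
        job = self.jobs_open.pop(0)
        job.worker = worker
        self.jobs_running.append(job)
        return job

    def finish_job(self, job):
        """Move a running job to the done list."""
        self.jobs_running.remove(job)
        self.jobs_done.append(job)

    def done(self):
        """True once no job is open or running any more."""
        return not self.jobs_open and not self.jobs_running


epoch = SketchEpoch(index=0, num_jobs=2)
first = epoch.get_job(worker=0)
second = epoch.get_job(worker=1)
epoch.finish_job(first)
still_running = epoch.done()   # one job is still running at this point
epoch.finish_job(second)
print(still_running, epoch.done())
```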

DeepSpeech.average_gradients(tower_gradients)
A routine for computing each variable’s average of the gradients obtained from the GPUs. Note also that this code acts as a synchronization point, as it requires all GPUs to be finished with their mini-batch before it can run to completion.
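
The averaging step can be illustrated without TensorFlow. The sketch below assumes that every tower delivers its (gradient, variable) pairs in the same variable order and represents gradients as plain lists of floats; average_gradients_sketch is a hypothetical name, and the real routine operates on tensors.

```python
def average_gradients_sketch(tower_gradients):
    """Average each variable's gradient across towers.

    tower_gradients: one list per tower of (gradient, variable_name) pairs,
    with the variables in the same order in every tower. Gradients are
    plain lists of floats here (an assumption made for illustration).
    """
    averaged = []
    # zip(*...) groups the pairs belonging to the same variable across towers.
    for pairs in zip(*tower_gradients):
        grads = [grad for grad, _ in pairs]
        name = pairs[0][1]
        # Element-wise mean over the towers.
        mean = [sum(values) / len(values) for values in zip(*grads)]
        averaged.append((mean, name))
    return averaged


towers = [
    [([1.0, 2.0], 'h1'), ([0.5], 'b1')],   # gradients from GPU 0
    [([3.0, 4.0], 'h1'), ([1.5], 'b1')],   # gradients from GPU 1
]
averaged = average_gradients_sketch(towers)
print(averaged)
```

Because the mean needs the gradients of every tower, the computation cannot finish before the slowest tower has delivered its result, which is the synchronization point mentioned above.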

DeepSpeech.calculate_mean_edit_distance_and_loss(model_feeder, tower, dropout)
This routine beam-search decodes a mini-batch and calculates the loss and mean edit distance. Besides the total and average loss, it returns the mean edit distance, the decoded result and the batch’s original Y.

DeepSpeech.calculate_report(results_tuple)
This routine calculates a WER report. It computes the mean WER and creates Sample objects for the report_count lowest-loss items from the provided WER results tuple (only items with WER != 0, ordered by their WER).
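
One possible reading of this selection is sketched below: drop items with a WER of 0, keep the report_count items with the lowest loss, then order those by WER. SampleSketch and calculate_report_sketch are hypothetical names, the per-item WER is assumed to be precomputed, and the order of the steps is an interpretation of the description rather than confirmed behaviour.

```python
from collections import namedtuple

# Hypothetical stand-in for the Sample objects mentioned above.
SampleSketch = namedtuple('SampleSketch', ['label', 'decoding', 'loss', 'wer'])


def calculate_report_sketch(samples, report_count):
    """Return (mean WER, selected samples) for a non-empty list of samples."""
    mean_wer = sum(s.wer for s in samples) / len(samples)
    # Keep only imperfect transcriptions.
    imperfect = [s for s in samples if s.wer != 0]
    # Take the report_count items with the lowest loss ...
    lowest_loss = sorted(imperfect, key=lambda s: s.loss)[:report_count]
    # ... and present them ordered by WER.
    return mean_wer, sorted(lowest_loss, key=lambda s: s.wer)


samples = [
    SampleSketch('a b', 'a b', loss=0.1, wer=0.0),
    SampleSketch('a b', 'a c', loss=0.9, wer=0.5),
    SampleSketch('a b', 'c d', loss=0.4, wer=1.0),
    SampleSketch('a b c d', 'a b c e', loss=2.0, wer=0.25),
]
mean_wer, report = calculate_report_sketch(samples, report_count=2)
print(mean_wer, [s.decoding for s in report])
```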

DeepSpeech.collect_results(results_tuple, returns)
This routine helps collect partial results for the WER reports. The results_tuple is composed of an array of the original labels, an array of the corresponding decodings, an array of the corresponding distances and an array of the corresponding losses. returns is built up in a similar way, containing just the unprocessed results of one session.run call (effectively of one batch). Labels and decodings are converted to text before splicing them into their corresponding results_tuple lists. In the case of decodings, for now we just pick the first available path.

DeepSpeech.export()
Restores the trained variables into a simpler graph that will be exported for serving.

DeepSpeech.format_duration(duration)
Formats the result of an even stopwatch call as hours:minutes:seconds.

DeepSpeech.get_tower_results(model_feeder, optimizer)
With this preliminary step out of the way, we can introduce, for each GPU, a tower for whose batch we calculate
 - the CTC decodings decoded,
 - the (total) loss against the outcome (Y) total_loss,
 - the loss averaged over the whole batch avg_loss,
 - the optimization gradient (computed based on the averaged loss),
 - the Levenshtein distances between the decodings and their transcriptions distance,
 - the mean edit distance of the outcome averaged over the whole batch mean_edit_distance
and retain the original labels (Y). decoded, labels, the optimization gradient, distance, mean_edit_distance, total_loss and avg_loss are collected into the corresponding arrays tower_decodings, tower_labels, tower_gradients, tower_distances, tower_mean_edit_distances, tower_total_losses, tower_avg_losses (dimension 0 being the tower). Finally, this new method get_tower_results() will return those tower arrays. In the case of tower_mean_edit_distances and tower_avg_losses, it will return the averaged values instead of the arrays.

DeepSpeech.log_grads_and_vars(grads_and_vars)
Let’s also introduce a helper function for logging collections of gradient/variable tuples.

DeepSpeech.log_variable(variable, gradient=None)
We introduce a function for logging a tensor variable’s current state. It logs scalar values for the mean, standard deviation, minimum and maximum. Furthermore, it logs a histogram of its state and (if given) of an optimization gradient.

DeepSpeech.new_id()
Returns a new ID that is unique on process level. Not thread-safe.
 Returns:
  int. The new ID

DeepSpeech.stopwatch(start_duration=0)
This function will toggle a stopwatch. The first call starts it, the second call stops it, the third call continues it, etc. So if you want to measure the accumulated time spent in a certain area of the code, you can surround that code by stopwatch calls like this:

    fun_time = 0  # initializes a stopwatch
    [...]
    for i in range(10):
        [...]
        # Starts/continues the stopwatch - fun_time is now a point in time (again)
        fun_time = stopwatch(fun_time)
        fun()
        # Pauses the stopwatch - fun_time is now a duration
        fun_time = stopwatch(fun_time)
    [...]
    # The following line only makes sense after an even call of fun_time = stopwatch(fun_time).
    print('Time spent in fun():', format_duration(fun_time))
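
The toggling behaviour can be reproduced with a single subtraction. The following is a minimal sketch under stated assumptions, not necessarily how DeepSpeech implements it: times are plain floats in seconds, the now parameter is a hypothetical addition that allows a clock reading to be injected, and both function names are invented for this illustration.

```python
import time


def stopwatch_sketch(start_duration=0.0, now=None):
    """Toggle between 'point in time' and 'accumulated duration'.

    `now` is a hypothetical parameter that allows injecting a clock
    reading; by default a monotonic clock is used.
    """
    if now is None:
        now = time.monotonic()
    # now - duration    -> (shifted) start point
    # now - start point -> accumulated duration
    return now - start_duration


def format_duration_sketch(seconds):
    """Format a number of seconds as hours:minutes:seconds."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return '%d:%02d:%02d' % (hours, minutes, secs)


t = 0.0
t = stopwatch_sketch(t, now=100.0)   # start: t is a point in time
t = stopwatch_sketch(t, now=130.0)   # pause: t is a duration of 30 s
t = stopwatch_sketch(t, now=200.0)   # continue: t is a shifted start point
t = stopwatch_sketch(t, now=240.0)   # pause: 30 s + 40 s accumulated
print(format_duration_sketch(t))
```

Subtracting the accumulated duration from the current time yields a start point shifted into the past, which is why only the result of an even number of calls is a duration that can be formatted.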

DeepSpeech.train(server=None)
Trains the network on a given server of a cluster. If no server is provided, it performs single-process training.

DeepSpeech.variable_on_worker_level(name, shape, initializer)
Next we concern ourselves with graph creation. However, before we do so we must introduce a utility function variable_on_worker_level() used to create a variable in CPU memory.

util.audio.audiofile_to_input_vector(audio_filename, numcep, numcontext)
Given a WAV audio file at audio_filename, calculates numcep MFCC features at every 0.01s time step with a window length of 0.025s. Appends numcontext context frames to the left and right of each time step, and returns this data in a numpy array.
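
The context-appending step can be sketched on its own, starting from already computed MFCC frames. The sketch below uses plain Python lists instead of a numpy array and zero-pads frames beyond the start and end of the recording; both choices, as well as the name add_context_sketch, are assumptions made for illustration.

```python
def add_context_sketch(features, numcontext):
    """Append numcontext frames of left and right context to every time step.

    features: list of frames, each a list of numcep floats.
    Frames beyond the start or end are zero-padded (an assumption).
    """
    numcep = len(features[0])
    zero_frame = [0.0] * numcep
    padded = [zero_frame] * numcontext + features + [zero_frame] * numcontext
    windows = []
    for t in range(len(features)):
        # The window covers past frames, the current frame and future frames.
        window = padded[t:t + 2 * numcontext + 1]
        windows.append([value for frame in window for value in frame])
    return windows


mfcc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 time steps, numcep = 2
vectors = add_context_sketch(mfcc, numcontext=1)
print(len(vectors), len(vectors[0]))
print(vectors[0])
```

Each resulting vector has numcep + 2*numcep*numcontext entries, which matches the input dimension n_input + 2*n_input*n_context given for BiRNN().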

util.text.sparse_tensor_value_to_texts(value, alphabet)
Given a tf.SparseTensor value, return an array of Python strings representing its values.

util.text.sparse_tuple_from(sequences, dtype=<class 'numpy.int32'>)
Creates a sparse representation of sequences.
 Args:
  sequences: a list of lists of type dtype where each element is a sequence
 Returns:
  A tuple with (indices, values, shape)
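
To show what such a tuple looks like, here is a minimal sketch in plain Python. It returns lists and tuples rather than numpy arrays and ignores dtype; sparse_tuple_from_sketch is a hypothetical name, and taking the length of the longest sequence as the second dimension is an assumption.

```python
def sparse_tuple_from_sketch(sequences):
    """Return (indices, values, shape) for a list of integer sequences."""
    indices = []
    values = []
    for row, sequence in enumerate(sequences):
        for col, value in enumerate(sequence):
            indices.append((row, col))   # position of the value in a dense matrix
            values.append(value)
    # Dense shape: number of sequences by length of the longest sequence.
    shape = (len(sequences), max((len(s) for s in sequences), default=0))
    return indices, values, shape


indices, values, shape = sparse_tuple_from_sketch([[1, 2, 3], [4]])
print(indices)
print(values)
print(shape)
```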

util.text.text_to_char_array(original, alphabet)
Given a Python string original, remove unsupported characters, map characters to integers and return a numpy array representing the processed string.
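
The mapping can be sketched as below. The alphabet is modelled as a plain string whose character positions serve as labels, which is an assumption; the real alphabet argument may be a different kind of object. The function names are hypothetical, and the sketch returns a list instead of a numpy array.

```python
# Hypothetical alphabet: the index of a character in this string is its label.
ALPHABET_SKETCH = " abcdefghijklmnopqrstuvwxyz'"


def text_to_labels_sketch(original, alphabet=ALPHABET_SKETCH):
    """Drop characters outside the alphabet and map the rest to integers."""
    return [alphabet.index(ch) for ch in original if ch in alphabet]


def labels_to_text_sketch(labels, alphabet=ALPHABET_SKETCH):
    """Inverse mapping, from integer labels back to a string."""
    return ''.join(alphabet[label] for label in labels)


labels = text_to_labels_sketch("hi, bob!")
print(labels)
print(labels_to_text_sketch(labels))
```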

util.text.wer(original, result)
The WER is defined as the editing/Levenshtein distance on word level divided by the number of words in the original text. If the original has more words (N) than the result and both are totally different (all N words resulting in 1 edit operation each), the WER will always be 1 (N / N = 1).
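
The definition above translates into a short sketch: compute the Levenshtein distance over word lists and divide by the number of words in the original. The function names are hypothetical, splitting on whitespace is an assumption, and the original text is assumed to contain at least one word.

```python
def levenshtein_sketch(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,           # deletion
                               current[j - 1] + 1,        # insertion
                               previous[j - 1] + cost))   # substitution
        previous = current
    return previous[-1]


def wer_sketch(original, result):
    """Word-level edit distance divided by the number of words in original."""
    original_words = original.split()
    result_words = result.split()
    return levenshtein_sketch(original_words, result_words) / len(original_words)


print(wer_sketch("the cat sat", "the cat sat"))
print(wer_sketch("the cat sat on the mat", "the cat sit on mat"))
print(wer_sketch("a b c", "x y"))
```

The last call is the case described above: the original has N = 3 words, the result is shorter and entirely different, so three edit operations are needed and the WER is 3 / 3 = 1.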

class util.stm.STMSegment(stm_line)
Representation of an individual segment in an STM file.

util.stm.parse_stm_file(stm_file)
Parses an STM file at stm_file into a list of STMSegment.