Welcome to DeepSpeech’s documentation!

DeepSpeech.BiRNN(batch_x, seq_length, dropout, reuse=False, batch_size=None, n_steps=-1, previous_state=None)[source]

That done, we will define the learned variables, the weights and biases, within the method BiRNN() which also constructs the neural network. The variables named hn, where n is an integer, hold the learned weight variables. The variables named bn, where n is an integer, hold the learned bias variables. In particular, the first variable h1 holds the learned weight matrix that converts an input vector of dimension n_input + 2*n_input*n_context to a vector of dimension n_hidden_1. Similarly, the second variable h2 holds the weight matrix converting an input vector of dimension n_hidden_1 to one of dimension n_hidden_2. The variables h3, h5, and h6 are similar. Likewise, the biases, b1, b2…, hold the biases for the various layers.
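To make the dimensions above concrete, here is a minimal NumPy sketch of the first two weight matrices. The sizes chosen for n_input, n_context, n_hidden_1 and n_hidden_2 are illustrative assumptions, not values taken from this documentation, and the sketch applies only the affine part of each layer (no activation, dropout or recurrence).

```python
import numpy as np

# Illustrative sizes only; these are assumptions, not DeepSpeech defaults.
n_input = 26      # features per time step
n_context = 9     # context frames on each side
n_hidden_1 = 8
n_hidden_2 = 8

rng = np.random.default_rng(0)

# One input vector holds the frame itself plus n_context frames on each side.
input_width = n_input + 2 * n_input * n_context

# h1 maps the input vector to n_hidden_1; h2 maps n_hidden_1 to n_hidden_2.
h1 = rng.normal(size=(input_width, n_hidden_1))
b1 = np.zeros(n_hidden_1)
h2 = rng.normal(size=(n_hidden_1, n_hidden_2))
b2 = np.zeros(n_hidden_2)

x = rng.normal(size=(1, input_width))   # a single (fake) input vector
layer_1 = x @ h1 + b1                   # shape (1, n_hidden_1)
layer_2 = layer_1 @ h2 + b2             # shape (1, n_hidden_2)
```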

class DeepSpeech.Epoch(index, num_jobs, set_name='train', report=False)[source]

Represents an epoch that should be executed by the Training Coordinator. Creates num_jobs WorkerJob instances in state ‘open’.

index (int): the epoch index of the ‘parent’ epoch
num_jobs (int): the number of jobs in this epoch
set_name (str): the name of the data-set - one of ‘train’, ‘dev’, ‘test’
report (bool): if this job should produce a WER report

Checks whether all jobs of the epoch are in state ‘done’. It also lazily prepares a WER report from the result data of all jobs.

bool. if all jobs of the epoch are ‘done’

Finishes a running job. Removes it from the running jobs list and adds it to the done jobs list.

job (WorkerJob): the job to put into state ‘done’

Gets the next open job from this epoch. The job will be marked as ‘running’.

worker (int): index of the worker that takes the job
WorkerJob. job that has been marked as running for this worker

Provides a printable overview of the states of the jobs of this epoch.

str. printable overall job state

Gets a printable name for this epoch.

str. printable name for this epoch
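The job life cycle described for Epoch (open, then running, then done) can be sketched as follows. MiniEpoch, MiniJob and all method names here are hypothetical stand-ins written for illustration; they are not the actual DeepSpeech classes or signatures.

```python
class MiniJob:
    """Hypothetical stand-in for a worker job."""

    def __init__(self, index):
        self.index = index
        self.worker = None


class MiniEpoch:
    """Hypothetical stand-in for an epoch holding jobs in three states."""

    def __init__(self, index, num_jobs):
        self.index = index
        self.jobs_open = [MiniJob(i) for i in range(num_jobs)]
        self.jobs_running = []
        self.jobs_done = []

    def get_job(self, worker):
        # Hand out the next open job and mark it as running for this worker.
        if not self.jobs_open:
            return None
        job = self.jobs_open.pop(0)
        job.worker = worker
        self.jobs_running.append(job)
        return job

    def finish_job(self, job):
        # Move a running job into the done list.
        self.jobs_running.remove(job)
        self.jobs_done.append(job)

    def done(self):
        # The epoch is done once no job is open or running.
        return not self.jobs_open and not self.jobs_running

    def job_status(self):
        return "open: %d, running: %d, done: %d" % (
            len(self.jobs_open), len(self.jobs_running), len(self.jobs_done))
```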
class DeepSpeech.Sample(src, res, loss, mean_edit_distance, sample_wer)[source]

Represents one item of a WER report.

src (str): source text
res (str): resulting text
loss (float): computed loss of this item
mean_edit_distance (float): computed mean edit distance of this item
class DeepSpeech.TrainingCoordinator[source]

Central training coordination class. Used for distributing jobs among workers of a cluster. Instantiated on all workers; calls from non-chief workers are transparently HTTP-forwarded to the chief worker instance.

class TrainingCoordinationHandler(request, client_address, server)[source]

Handles HTTP requests from remote workers to the Training Coordinator.

log_message(format, *args)[source]

Overriding base method to suppress web handler messages on stdout.


Retrieves the first job for a worker.

worker (int): index of the worker to get the first job for
WorkerJob. a job of one of the running epochs that will get
associated with the given worker and put into state ‘running’

Retrieves a new cluster-unique batch index for a given set-name. Prevents applying one batch multiple times per epoch.

set_name (str): name of the data set - one of ‘train’, ‘dev’, ‘test’
int. new data set index
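One way to hand out batch indices so that no index is issued twice for the same data set is a per-set counter guarded by a lock. The sketch below is an assumption about how such a counter could look inside a single process; the class and method names are hypothetical, and it does not cover the HTTP forwarding to the chief worker.

```python
import threading
from collections import defaultdict


class BatchIndexAllocator:
    """Hypothetical helper: issues each batch index once per set name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_index = defaultdict(int)

    def next_index(self, set_name):
        # The lock keeps two concurrent callers from receiving the same index.
        with self._lock:
            index = self._next_index[set_name]
            self._next_index[set_name] = index + 1
            return index
```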

Sends a finished job back to the coordinator and retrieves in exchange the next one.

job (WorkerJob): job that was finished by a worker and whose results are to be
digested by the coordinator
WorkerJob. next job of one of the running epochs that will get
associated with the worker from the finished job and put into state ‘running’

Starts Training Coordinator. If chief, it starts a web server for communication with non-chief instances.

start_coordination(model_feeder, step=0)[source]

Starts to coordinate epochs and jobs among workers on the basis of data-set sizes, the (global) step and FLAGS parameters.

model_feeder (ModelFeeder): data-sets to be used for coordinated training
step (int): global step of a loaded model to determine starting point

Stops Training Coordinator. If chief, it waits for all epochs to be ‘done’ and then shuts down the web server.

class DeepSpeech.WorkerJob(epoch_id, index, set_name, steps, report)[source]

Represents a job that should be executed by a worker.

epoch_id (int): the ID of the ‘parent’ epoch
index (int): the epoch index of the ‘parent’ epoch
set_name (str): the name of the data-set - one of ‘train’, ‘dev’, ‘test’
steps (int): the number of session.run calls
report (bool): if this job should produce a WER report

A routine for computing each variable’s average of the gradients obtained from the GPUs. Note also that this code acts as a synchronization point as it requires all GPUs to be finished with their mini-batch before it can run to completion.
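The averaging step can be illustrated without TensorFlow. The NumPy sketch below assumes each tower supplies a list of (gradient, variable name) pairs in the same variable order; that layout and the function name are assumptions made for the example.

```python
import numpy as np


def average_tower_gradients(tower_gradients):
    """Average each variable's gradient over all towers.

    tower_gradients: one list per tower of (gradient, variable_name) pairs,
    with the variables in the same order for every tower (assumed layout).
    """
    averaged = []
    # zip(*...) groups the pairs belonging to the same variable across towers.
    for grads_and_vars in zip(*tower_gradients):
        stacked = np.stack([grad for grad, _ in grads_and_vars])
        name = grads_and_vars[0][1]
        averaged.append((stacked.mean(axis=0), name))
    return averaged
```

Because the mean needs every tower's gradient for a variable, the result cannot be computed until all towers have finished their mini-batch, which is the synchronization property noted above.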

DeepSpeech.calculate_mean_edit_distance_and_loss(model_feeder, tower, dropout, reuse)[source]

This routine beam-search decodes a mini-batch and calculates the loss and mean edit distance. In addition to the total and average loss, it returns the mean edit distance, the decoded result and the batch’s original Y.


This routine calculates a WER report. It computes the mean WER and creates Sample objects for the report_count lowest-loss items from the provided WER results tuple (only items with WER != 0, ordered by their WER).
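The selection described above can be read in more than one way. The sketch below shows one plausible reading, labeled as an assumption: drop items with a WER of 0, keep the report_count items with the lowest loss, then order those by WER. SampleSketch and report_sketch are hypothetical names, the WER values are supplied precomputed, and the mean is a plain average over per-item WERs.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a report item.
SampleSketch = namedtuple("SampleSketch", "src res loss wer")


def report_sketch(samples, report_count):
    # Mean WER over all items, including the perfectly decoded ones.
    mean_wer = sum(s.wer for s in samples) / len(samples) if samples else 0.0
    # Only items that were not decoded perfectly are reported.
    wrong = [s for s in samples if s.wer != 0]
    # Keep the report_count items with the lowest loss ...
    lowest_loss = sorted(wrong, key=lambda s: s.loss)[:report_count]
    # ... and present them ordered by their WER.
    return mean_wer, sorted(lowest_loss, key=lambda s: s.wer)
```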

DeepSpeech.collect_results(results_tuple, returns)[source]

This routine helps collect partial results for the WER reports. The results_tuple is composed of an array of the original labels, an array of the corresponding decodings, an array of the corresponding distances and an array of the corresponding losses. returns is built up in a similar way, containing just the unprocessed results of one session.run call (effectively of one batch). Labels and decodings are converted to text before splicing them into their corresponding results_tuple lists. In the case of decodings, for now we just pick the first available path.


Restores the trained variables into a simpler graph that will be exported for serving.


Formats the result of an even stopwatch call as hours:minutes:seconds
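A minimal sketch of such a formatter is shown below. It assumes the duration is given as a number of seconds and that minutes and seconds are zero-padded; both the input type and the exact output layout are assumptions, and the function name is hypothetical.

```python
def format_duration_sketch(seconds):
    """Format a duration in seconds as hours:minutes:seconds."""
    seconds = int(seconds)
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)
```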

DeepSpeech.get_tower_results(model_feeder, optimizer)[source]

With this preliminary step out of the way, we can introduce a tower for each GPU, for whose batch we calculate

  • The CTC decodings decoded,
  • The (total) loss against the outcome (Y) total_loss,
  • The loss averaged over the whole batch avg_loss,
  • The optimization gradient (computed based on the averaged loss),
  • The Levenshtein distances between the decodings and their transcriptions distance,
  • The mean edit distance of the outcome averaged over the whole batch mean_edit_distance

and retain the original labels (Y). decoded, labels, the optimization gradient, distance, mean_edit_distance, total_loss and avg_loss are collected into the corresponding arrays tower_decodings, tower_labels, tower_gradients, tower_distances, tower_mean_edit_distances, tower_total_losses, tower_avg_losses (dimension 0 being the tower). Finally this new method get_tower_results() will return those tower arrays. In case of tower_mean_edit_distances and tower_avg_losses, it will return the averaged values instead of the arrays.


Let’s also introduce a helper function for logging collections of gradient/variable tuples.

DeepSpeech.log_variable(variable, gradient=None)[source]

We introduce a function for logging a tensor variable’s current state. It logs scalar values for the mean, standard deviation, minimum and maximum. Furthermore it logs a histogram of its state and (if given) of an optimization gradient.


Returns a new ID that is unique on process level. Not thread-safe.

int. The new ID

This function will toggle a stopwatch. The first call starts it, second call stops it, third call continues it etc. So if you want to measure the accumulated time spent in a certain area of the code, you can surround that code by stopwatch-calls like this:

fun_time = 0 # initializes a stopwatch
for i in range(10):
  # Starts/continues the stopwatch - fun_time is now a point in time (again)
  fun_time = stopwatch(fun_time)
  fun() # the code being timed
  # Pauses the stopwatch - fun_time is now a duration
  fun_time = stopwatch(fun_time)
# The following line only makes sense after an even number of calls of fun_time = stopwatch(fun_time).
print('Time spent in fun(): %s' % format_duration(fun_time))
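The toggling behaviour can be reproduced with a single subtraction: subtracting the stored value from the current time turns a start point into a duration, and a duration back into a (shifted) start point. The sketch below works on plain floats; the function name and the now parameter, which exists only to make the example deterministic, are hypothetical and not part of the documented function.

```python
import time


def stopwatch_sketch(value=0.0, now=None):
    """Toggle between 'point in time' and 'accumulated duration'.

    value: 0.0 for a fresh stopwatch, otherwise the result of the last call.
    now:   hypothetical override for the current time, used for testing.
    """
    if now is None:
        now = time.monotonic()
    # Odd calls:  now - accumulated duration = start point shifted into the past.
    # Even calls: now - start point          = accumulated duration.
    return now - value
```

Starting at t=100, pausing at t=103, continuing at t=110 and pausing at t=112 yields an accumulated duration of 5 seconds (3 + 2).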

Trains the network on a given server of a cluster. If no server provided, it performs single process training.

DeepSpeech.variable_on_worker_level(name, shape, initializer)[source]

Next we concern ourselves with graph creation. However, before we do so we must introduce a utility function variable_on_worker_level() used to create a variable in CPU memory.

util.audio.audiofile_to_input_vector(audio_filename, numcep, numcontext)[source]

Given a WAV audio file at audio_filename, calculates numcep MFCC features at every 0.01s time step with a window length of 0.025s. Appends numcontext context frames to the left and right of each time step, and returns this data in a numpy array.
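The context-window step can be sketched independently of the MFCC computation. The example below assumes the features already exist as a (time steps, numcep) array and that positions before the first and after the last frame are padded with zeros; the zero padding and the function name are assumptions.

```python
import numpy as np


def add_context_sketch(features, numcontext):
    """Append numcontext frames to the left and right of each time step.

    features: array of shape (num_frames, numcep).
    Returns an array of shape (num_frames, numcep + 2 * numcep * numcontext).
    """
    num_frames, numcep = features.shape
    # Zero frames stand in for the missing context at both ends (assumption).
    padding = np.zeros((numcontext, numcep), dtype=features.dtype)
    padded = np.concatenate([padding, features, padding], axis=0)
    window = 2 * numcontext + 1
    out = np.empty((num_frames, window * numcep), dtype=features.dtype)
    for t in range(num_frames):
        # Flatten [left context, current frame, right context] into one row.
        out[t] = padded[t:t + window].reshape(-1)
    return out
```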

util.text.levenshtein(a, b)[source]

Calculates the Levenshtein distance between a and b.
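For reference, the Levenshtein distance can be computed with the standard row-by-row dynamic programme. This is a generic textbook sketch under a hypothetical name, not necessarily how util.text.levenshtein is implemented; it works on any two sequences, including strings and lists of words.

```python
def levenshtein_sketch(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # Keep the shorter sequence in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]
```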

util.text.sparse_tensor_value_to_texts(value, alphabet)[source]

Given a tf.SparseTensor value, return an array of Python strings representing its values.

util.text.sparse_tuple_from(sequences, dtype=<class 'numpy.int32'>)[source]

Creates a sparse representation of sequences. Args:

  • sequences: a list of lists of type dtype where each element is a sequence

Returns a tuple with (indices, values, shape)
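The (indices, values, shape) layout can be sketched as follows. The choice of (row, column) index pairs and of the longest sequence as the second dimension of shape is an assumption about the intended layout, and the function name is hypothetical.

```python
import numpy as np


def sparse_tuple_sketch(sequences, dtype=np.int32):
    """Build (indices, values, shape) from a list of sequences."""
    indices = []
    values = []
    for row, seq in enumerate(sequences):
        # One (row, column) pair per element of each sequence.
        indices.extend((row, col) for col in range(len(seq)))
        values.extend(seq)
    max_len = max((len(seq) for seq in sequences), default=0)
    indices = np.asarray(indices, dtype=np.int64).reshape(-1, 2)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), max_len], dtype=np.int64)
    return indices, values, shape
```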

util.text.text_to_char_array(original, alphabet)[source]

Given a Python string original, remove unsupported characters, map characters to integers and return a numpy array representing the processed string.
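A minimal sketch of the character mapping is given below. It assumes the alphabet is a plain string whose character positions serve as the integer labels; the real alphabet argument may be a different kind of object, and the function name is hypothetical.

```python
import numpy as np


def text_to_char_array_sketch(original, alphabet):
    """Drop characters not in alphabet and map the rest to integer labels."""
    # Position in the alphabet string is used as the label (assumption).
    index_of = {ch: i for i, ch in enumerate(alphabet)}
    labels = [index_of[ch] for ch in original if ch in index_of]
    return np.asarray(labels, dtype=np.int32)
```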

util.text.wer(original, result)[source]

The WER is defined as the editing/Levenshtein distance on word level divided by the number of words in the original text. In case of the original having more words (N) than the result and both being totally different (all N words resulting in 1 edit operation each), the WER will always be 1 (N / N = 1).
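The definition above translates directly into code: split both texts into words, compute the edit distance between the word lists, and divide by the number of words in the original. The sketch below uses hypothetical function names, splits on whitespace, and rejects an empty original, all of which are assumptions.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def wer_sketch(original, result):
    """Word-level edit distance divided by the word count of the original."""
    ref_words = original.split()
    hyp_words = result.split()
    if not ref_words:
        raise ValueError("original must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)
```

With an original of three words and a completely different two-word result, two substitutions and one deletion are needed, so the WER is 3 / 3 = 1, matching the case described above.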


Returns the number of GPUs available on this system.

class util.stm.STMSegment(stm_line)[source]

Representation of an individual segment in an STM file.


Parses an STM file at stm_file into a list of STMSegment.
