Python

Model

class Model(*args, **kwargs)

Class holding a DeepSpeech model

Parameters
  • aModelPath (str) – Path to model file to load

  • aBeamWidth (int) – Decoder beam width
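
A minimal construction sketch; the model path and beam width below are illustrative placeholders, not values shipped with the library:

    from deepspeech import Model

    MODEL_PATH = "output_graph.pbmm"  # placeholder: path to a trained model file
    BEAM_WIDTH = 500                  # placeholder: decoder beam width

    ds = Model(MODEL_PATH, BEAM_WIDTH)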

createStream()

Create a new streaming inference state. The streaming state returned by this function can then be passed to feedAudioContent() and finishStream().

Returns

Object holding the stream

Throws

RuntimeError on error

enableDecoderWithLM(*args, **kwargs)

Enable decoding using beam scoring with a KenLM language model.

Parameters
  • aLMPath (str) – The path to the language model binary file.

  • aTriePath (str) – The path to the trie file built from the same vocabulary as the language model binary.

  • aLMAlpha (float) – The alpha hyperparameter of the CTC decoder. Language Model weight.

  • aLMBeta (float) – The beta hyperparameter of the CTC decoder. Word insertion weight.

Returns

Zero on success, non-zero on failure (invalid arguments).

Type

int
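
Continuing the construction sketch above, a hedged example of enabling the language model; the file names and hyperparameter values are illustrative placeholders:

    # Placeholders: adjust paths and weights for your own model and corpus.
    LM_PATH = "lm.binary"
    TRIE_PATH = "trie"
    LM_ALPHA = 0.75  # language model weight (illustrative value)
    LM_BETA = 1.85   # word insertion weight (illustrative value)

    # Per the documented return semantics: zero on success, non-zero on failure.
    if ds.enableDecoderWithLM(LM_PATH, TRIE_PATH, LM_ALPHA, LM_BETA) != 0:
        raise RuntimeError("failed to enable the language model decoder")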

feedAudioContent(*args, **kwargs)

Feed audio samples to an ongoing streaming inference.

Parameters
  • aSctx (object) – A streaming state pointer returned by createStream().

  • aBuffer (int array) – An array of 16-bit, mono raw audio samples at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize (int) – The number of samples in aBuffer.

finishStream(*args, **kwargs)

Signal the end of an audio signal to an ongoing streaming inference and return the STT result over the whole audio signal.

Parameters

aSctx (object) – A streaming state pointer returned by createStream().

Returns

The STT result.

Type

str
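
A minimal sketch tying the three streaming calls together. The chunk size is arbitrary, `audio` is assumed to be a mono 16-bit numpy array at the model's sample rate, and the explicit sample-count argument follows the signatures documented above (some binding releases infer the length and take only the buffer):

    # Assumption: `audio` is a numpy int16 array at ds.sampleRate().
    stream = ds.createStream()
    CHUNK = 1024  # arbitrary chunk size for this sketch
    for start in range(0, len(audio), CHUNK):
        piece = audio[start:start + CHUNK]
        ds.feedAudioContent(stream, piece, len(piece))
    print(ds.finishStream(stream))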

finishStreamWithMetadata(*args, **kwargs)

Signal the end of an audio signal to an ongoing streaming inference and return per-letter metadata.

Parameters

aSctx (object) – A streaming state pointer returned by createStream().

Returns

A struct of individual letters along with their timing information.

Type

Metadata()
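
This drops into the same flow as the streaming sketch above; to obtain per-letter timing instead of plain text, end the stream with this call instead:

    metadata = ds.finishStreamWithMetadata(stream)  # replaces finishStream()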

intermediateDecode(*args, **kwargs)

Compute the intermediate decoding of an ongoing streaming inference. This is an expensive process as the decoder implementation isn’t currently capable of streaming, so it always starts from the beginning of the audio.

Parameters

aSctx (object) – A streaming state pointer returned by createStream().

Returns

The STT intermediate result.

Type

str
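
Because each call re-decodes from the beginning of the audio, intermediate results are usually polled sparingly. A sketch reusing `audio` and CHUNK from the streaming example above; the every-16-chunks cadence is an arbitrary choice:

    stream = ds.createStream()
    for i, start in enumerate(range(0, len(audio), CHUNK)):
        piece = audio[start:start + CHUNK]
        ds.feedAudioContent(stream, piece, len(piece))
        if i % 16 == 0:  # re-decoding is expensive, so poll occasionally
            print("partial:", ds.intermediateDecode(stream))
    print("final:", ds.finishStream(stream))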

sampleRate()

Return the sample rate expected by the model.

Returns

Sample rate.

Type

int
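
A common use is validating input audio before decoding; the wave/numpy loading below is our own illustration, not part of the API:

    import wave
    import numpy as np

    with wave.open("audio.wav", "rb") as w:  # placeholder input file
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        if w.getframerate() != ds.sampleRate():
            raise ValueError("resample the audio to the model's rate first")
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)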

stt(*args, **kwargs)

Use the DeepSpeech model to perform Speech-To-Text.

Parameters
  • aBuffer (int array) – A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize (int) – The number of samples in the audio signal.

Returns

The STT result.

Type

str
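
With a buffer prepared as in the sampleRate() sketch, batch transcription is a single call; the explicit length argument again follows the documented signature:

    text = ds.stt(audio, len(audio))  # some releases take only the buffer
    print(text)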

sttWithMetadata(*args, **kwargs)

Use the DeepSpeech model to perform Speech-To-Text and output metadata about the results.

Parameters
  • aBuffer (int array) – A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize (int) – The number of samples in the audio signal.

Returns

A struct of individual letters along with their timing information.

Type

Metadata()
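
A sketch of walking the returned metadata; the property-style access to items, character, and start_time matches common binding releases but is an assumption here, since the entries below render them as callables:

    metadata = ds.sttWithMetadata(audio, len(audio))
    print("confidence:", metadata.confidence)
    for item in metadata.items:  # property-style access assumed
        print(item.character, item.start_time)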

Metadata

class Metadata

Stores the entire CTC output as an array of character metadata objects

confidence()

Approximated confidence value for this transcription. This is roughly the sum of the acoustic model logit values for each timestep/character that contributed to the creation of this transcription.

items()

List of items

Returns

A list of MetadataItem() elements

Type

list

num_items()

Size of the list of items

Returns

Size of the list of items

Type

int

MetadataItem

class MetadataItem

Stores each individual character, along with its timing information

character()

The character generated for transcription

start_time()

Position of the character in seconds

timestep()

Position of the character in units of 20ms
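
As a worked use of these fields, a sketch that groups characters into (word, start_time) pairs; treating a space character as the word delimiter is our assumption:

    def words_with_times(metadata):
        # Group MetadataItem characters into (word, start_time) pairs.
        words, current, start = [], "", None
        for item in metadata.items:  # property-style access assumed
            if item.character == " ":
                if current:
                    words.append((current, start))
                current, start = "", None
            else:
                if not current:
                    start = item.start_time
                current += item.character
        if current:
            words.append((current, start))
        return words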