.NET Framework

DeepSpeech Interface

interface IDeepSpeech

Client interface for Mozilla's DeepSpeech implementation.

Subclassed by DeepSpeechClient.DeepSpeech

Public Functions

void DeepSpeechClient.Interfaces.IDeepSpeech.PrintVersions()

Prints the versions of TensorFlow and DeepSpeech.

unsafe void DeepSpeechClient.Interfaces.IDeepSpeech.CreateModel(string aModelPath, uint aBeamWidth)

Create an object providing an interface to a trained DeepSpeech model.

Parameters
  • aModelPath: The path to the frozen model graph.

  • aBeamWidth: The beam width used by the decoder. A larger beam width generates better results at the cost of decoding time.

Exceptions
  • ArgumentException: Thrown when the native binary failed to create the model.

unsafe int DeepSpeechClient.Interfaces.IDeepSpeech.GetModelSampleRate()

Return the sample rate expected by the model.

Return

Sample rate.

unsafe void DeepSpeechClient.Interfaces.IDeepSpeech.EnableDecoderWithLM(string aLMPath, string aTriePath, float aLMAlpha, float aLMBeta)

Enable decoding using beam scoring with a KenLM language model.

Parameters
  • aLMPath: The path to the language model binary file.

  • aTriePath: The path to the trie file built from the same vocabulary as the language model binary.

  • aLMAlpha: The alpha hyperparameter of the CTC decoder. Language Model weight.

  • aLMBeta: The beta hyperparameter of the CTC decoder. Word insertion weight.

Exceptions
  • ArgumentException: Thrown when the native binary failed to enable decoding with a language model.
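Taken together, model setup typically looks like the sketch below, using the concrete DeepSpeechClient.DeepSpeech class named above. The file paths and hyperparameter values are illustrative placeholders, not recommendations:

```csharp
using DeepSpeechClient;

// Instantiate the client and load the frozen model graph.
// Both calls throw ArgumentException if the native binary fails.
var sttClient = new DeepSpeech();
sttClient.CreateModel("output_graph.pbmm", aBeamWidth: 500);

// Optionally enable beam scoring with a KenLM language model.
sttClient.EnableDecoderWithLM(
    "lm.binary",      // language model binary
    "trie",           // trie built from the same vocabulary
    aLMAlpha: 0.75f,  // language model weight
    aLMBeta: 1.85f);  // word insertion weight

// The model reports the sample rate it expects from audio buffers.
int sampleRate = sttClient.GetModelSampleRate();
```

Audio fed to the model must match the reported sample rate, so querying GetModelSampleRate() up front avoids silently degraded transcriptions from resampled or mismatched input.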

unsafe string DeepSpeechClient.Interfaces.IDeepSpeech.SpeechToText(short[] aBuffer, uint aBufferSize)

Use the DeepSpeech model to perform Speech-To-Text.

Return

The STT result. The user is responsible for freeing the string. Returns NULL on error.

Parameters
  • aBuffer: A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize: The number of samples in the audio signal.

unsafe Metadata DeepSpeechClient.Interfaces.IDeepSpeech.SpeechToTextWithMetadata(short[] aBuffer, uint aBufferSize)

Use the DeepSpeech model to perform Speech-To-Text.

Return

The extended metadata result. The user is responsible for freeing the struct. Returns NULL on error.

Parameters
  • aBuffer: A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize: The number of samples in the audio signal.

unsafe void DeepSpeechClient.Interfaces.IDeepSpeech.FreeStream()

Destroy a streaming state without decoding the computed logits. This can be used if you no longer need the result of an ongoing streaming inference and don’t want to perform a costly decode operation.

unsafe void DeepSpeechClient.Interfaces.IDeepSpeech.FreeString(IntPtr intPtr)

Frees a DeepSpeech-allocated string.

unsafe void DeepSpeechClient.Interfaces.IDeepSpeech.FreeMetadata(IntPtr intPtr)

Frees a DeepSpeech-allocated Metadata struct.

unsafe void DeepSpeechClient.Interfaces.IDeepSpeech.CreateStream()

Creates a new streaming inference state.

Exceptions
  • ArgumentException: Thrown when the native binary failed to initialize the streaming mode.

unsafe void DeepSpeechClient.Interfaces.IDeepSpeech.FeedAudioContent(short[] aBuffer, uint aBufferSize)

Feeds audio samples to an ongoing streaming inference.

Parameters
  • aBuffer: An array of 16-bit, mono raw audio samples at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize: The number of samples in the audio signal.

unsafe string DeepSpeechClient.Interfaces.IDeepSpeech.IntermediateDecode()

Computes the intermediate decoding of an ongoing streaming inference.

Return

The STT intermediate result. The user is responsible for freeing the string.

unsafe string DeepSpeechClient.Interfaces.IDeepSpeech.FinishStream()

Closes the ongoing streaming inference, returns the STT result over the whole audio signal.

Return

The STT result. The user is responsible for freeing the string.

unsafe Metadata DeepSpeechClient.Interfaces.IDeepSpeech.FinishStreamWithMetadata()

Closes the ongoing streaming inference, returns the STT result over the whole audio signal.

Return

The extended metadata result. The user is responsible for freeing the struct.
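The streaming methods above combine into the following hypothetical workflow. GetAudioChunks() is a placeholder for whatever source delivers 16-bit mono PCM buffers (a microphone callback, a file reader, etc.):

```csharp
using DeepSpeechClient;

var sttClient = new DeepSpeech();
sttClient.CreateModel("output_graph.pbmm", aBeamWidth: 500);

// Begin a streaming inference; throws ArgumentException on failure.
sttClient.CreateStream();

// Feed audio in chunks as it arrives. GetAudioChunks() is a
// hypothetical source of short[] buffers at the model's sample rate.
foreach (short[] chunk in GetAudioChunks())
{
    sttClient.FeedAudioContent(chunk, (uint)chunk.Length);

    // Optionally inspect a partial transcript mid-stream.
    string partial = sttClient.IntermediateDecode();
}

// Either finish and decode the whole signal...
string transcript = sttClient.FinishStream();
// ...or abandon the stream without the costly decode:
// sttClient.FreeStream();
```

Note that FinishStream() and FreeStream() are alternatives: the first consumes the streaming state and produces a transcript, the second discards the state without decoding.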

DeepSpeech Class

class DeepSpeechClient.DeepSpeech

Client of Mozilla's DeepSpeech implementation. Implements the IDeepSpeech interface.

Public Functions

unsafe void DeepSpeechClient.DeepSpeech.CreateModel(string aModelPath, uint aBeamWidth)

Create an object providing an interface to a trained DeepSpeech model.

Parameters
  • aModelPath: The path to the frozen model graph.

  • aBeamWidth: The beam width used by the decoder. A larger beam width generates better results at the cost of decoding time.

Exceptions
  • ArgumentException: Thrown when the native binary failed to create the model.

unsafe int DeepSpeechClient.DeepSpeech.GetModelSampleRate()

Return the sample rate expected by the model.

Return

Sample rate.

unsafe void DeepSpeechClient.DeepSpeech.Dispose()

Frees associated resources and destroys the model objects.

unsafe void DeepSpeechClient.DeepSpeech.EnableDecoderWithLM(string aLMPath, string aTriePath, float aLMAlpha, float aLMBeta)

Enable decoding using beam scoring with a KenLM language model.

Parameters
  • aLMPath: The path to the language model binary file.

  • aTriePath: The path to the trie file built from the same vocabulary as the language model binary.

  • aLMAlpha: The alpha hyperparameter of the CTC decoder. Language Model weight.

  • aLMBeta: The beta hyperparameter of the CTC decoder. Word insertion weight.

Exceptions
  • ArgumentException: Thrown when the native binary failed to enable decoding with a language model.

unsafe void DeepSpeechClient.DeepSpeech.FeedAudioContent(short[] aBuffer, uint aBufferSize)

Feeds audio samples to an ongoing streaming inference.

Parameters
  • aBuffer: An array of 16-bit, mono raw audio samples at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize: The number of samples in the audio signal.

unsafe string DeepSpeechClient.DeepSpeech.FinishStream()

Closes the ongoing streaming inference, returns the STT result over the whole audio signal.

Return

The STT result. The user is responsible for freeing the string.

unsafe Models.Metadata DeepSpeechClient.DeepSpeech.FinishStreamWithMetadata()

Closes the ongoing streaming inference, returns the STT result over the whole audio signal.

Return

The extended metadata. The user is responsible for freeing the struct.

unsafe string DeepSpeechClient.DeepSpeech.IntermediateDecode()

Computes the intermediate decoding of an ongoing streaming inference.

Return

The STT intermediate result. The user is responsible for freeing the string.

unsafe void DeepSpeechClient.DeepSpeech.PrintVersions()

Prints the versions of TensorFlow and DeepSpeech.

unsafe void DeepSpeechClient.DeepSpeech.CreateStream()

Creates a new streaming inference state.

Exceptions
  • ArgumentException: Thrown when the native binary failed to initialize the streaming mode.

unsafe void DeepSpeechClient.DeepSpeech.FreeStream()

Destroy a streaming state without decoding the computed logits. This can be used if you no longer need the result of an ongoing streaming inference and don’t want to perform a costly decode operation.

unsafe void DeepSpeechClient.DeepSpeech.FreeString(IntPtr intPtr)

Frees a DeepSpeech-allocated string.

unsafe void DeepSpeechClient.DeepSpeech.FreeMetadata(IntPtr intPtr)

Frees a DeepSpeech-allocated Metadata struct.

unsafe string DeepSpeechClient.DeepSpeech.SpeechToText(short[] aBuffer, uint aBufferSize)

Use the DeepSpeech model to perform Speech-To-Text.

Return

The STT result. The user is responsible for freeing the string. Returns NULL on error.

Parameters
  • aBuffer: A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize: The number of samples in the audio signal.

unsafe Models.Metadata DeepSpeechClient.DeepSpeech.SpeechToTextWithMetadata(short[] aBuffer, uint aBufferSize)

Use the DeepSpeech model to perform Speech-To-Text.

Return

The extended metadata. The user is responsible for freeing the struct. Returns NULL on error.

Parameters
  • aBuffer: A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).

  • aBufferSize: The number of samples in the audio signal.
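For non-streaming use, a whole buffer can be transcribed in one call. A minimal sketch, where LoadPcm16 is a hypothetical helper that decodes a WAV file into a short[] at the model's sample rate; the using statement assumes the Dispose() method documented above backs an IDisposable implementation:

```csharp
using DeepSpeechClient;

using (var sttClient = new DeepSpeech())
{
    sttClient.CreateModel("output_graph.pbmm", aBeamWidth: 500);

    // LoadPcm16 is a hypothetical loader producing 16-bit mono samples.
    short[] audioBuffer = LoadPcm16("audio.wav");

    // Plain transcript...
    string text = sttClient.SpeechToText(audioBuffer, (uint)audioBuffer.Length);

    // ...or per-character timing and an overall confidence value.
    var meta = sttClient.SpeechToTextWithMetadata(audioBuffer, (uint)audioBuffer.Length);
}
```

SpeechToText blocks until decoding completes over the entire buffer, so the streaming API is the better fit when audio arrives incrementally or partial results are needed.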

ErrorCodes

enum DeepSpeechClient.Enums.ErrorCodes

Error codes from the native DeepSpeech binary.

Values:

DS_ERR_OK = 0x0000
DS_ERR_NO_MODEL = 0x1000
DS_ERR_INVALID_ALPHABET = 0x2000
DS_ERR_INVALID_SHAPE = 0x2001
DS_ERR_INVALID_LM = 0x2002
DS_ERR_MODEL_INCOMPATIBLE = 0x2003
DS_ERR_FAIL_INIT_MMAP = 0x3000
DS_ERR_FAIL_INIT_SESS = 0x3001
DS_ERR_FAIL_INTERPRETER = 0x3002
DS_ERR_FAIL_RUN_SESS = 0x3003
DS_ERR_FAIL_CREATE_STREAM = 0x3004
DS_ERR_FAIL_READ_PROTOBUF = 0x3005
DS_ERR_FAIL_CREATE_SESS = 0x3006

Metadata

struct Metadata

Package Attributes

unsafe IntPtr DeepSpeechClient.Structs.Metadata.items

Native list of items.

unsafe int DeepSpeechClient.Structs.Metadata.num_items

Count of items from the native side.

unsafe double DeepSpeechClient.Structs.Metadata.confidence

Approximate confidence value for this transcription.

MetadataItem

struct MetadataItem

Package Attributes

unsafe IntPtr DeepSpeechClient.Structs.MetadataItem.character

Native character.

unsafe int DeepSpeechClient.Structs.MetadataItem.timestep

Position of the character in units of 20ms.

unsafe float DeepSpeechClient.Structs.MetadataItem.start_time

Position of the character in seconds.
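Metadata.items is a raw pointer into a native array of num_items MetadataItem structs. Reading them from managed code would go roughly like the sketch below, which assumes the struct layouts shown above are sequential and blittable, and that the caller has access to the package-private fields (e.g. from within the DeepSpeechClient assembly):

```csharp
using System;
using System.Runtime.InteropServices;
using DeepSpeechClient.Structs;

static void PrintItems(Metadata meta)
{
    int itemSize = Marshal.SizeOf<MetadataItem>();
    for (int i = 0; i < meta.num_items; i++)
    {
        // Walk the native array element by element.
        IntPtr itemPtr = IntPtr.Add(meta.items, i * itemSize);
        var item = Marshal.PtrToStructure<MetadataItem>(itemPtr);

        // 'character' is a native string; 'timestep' counts 20 ms frames,
        // so timestep * 0.02 should agree with start_time.
        string ch = Marshal.PtrToStringAnsi(item.character);
        Console.WriteLine($"{ch} @ {item.start_time:F2}s (timestep {item.timestep})");
    }
}
```

Because the items live in native memory, the enclosing Metadata struct must still be released with FreeMetadata once iteration is done.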