External scorer scripts
DeepSpeech pre-trained models include an external scorer. This document explains how to reproduce our external scorer, as well as adapt the scripts to create your own.
The scorer is composed of two sub-components: a KenLM language model and a trie data structure containing all words in the vocabulary. To create the scorer package, we must first create a KenLM language model (using data/lm/generate_lm.py), and then use data/lm/generate_package.py to create the final package file, which includes the trie data structure.
Reproducing our external scorer
You can download the LibriSpeech LM training text with the following commands:
cd data/lm
wget http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
Then use the generate_lm.py script to generate lm.binary and vocab-500000.txt. As input you can use a plain text file (e.g. file.txt) or a gzipped text file (e.g. file.txt.gz) with one sentence per line.
If you are using a container created from Dockerfile.build, you can point --kenlm_bins at the KenLM binaries already built inside the container. Otherwise, you have to build KenLM first and then pass its build directory to the script.
cd data/lm
python3 generate_lm.py --input_txt librispeech-lm-norm.txt.gz --output_dir . \
  --top_k 500000 --kenlm_bins path/to/kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
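To make the role of --top_k more concrete, here is a minimal Python sketch of vocabulary selection: it counts word frequencies over a plain or gzipped corpus and keeps the most frequent words. It is an illustration under stated assumptions, not the code of generate_lm.py. The helper names open_corpus and top_k_words are hypothetical, and lowercasing plus whitespace tokenization are assumed.

```python
import gzip
from collections import Counter


def open_corpus(path):
    """Open a plain-text or gzipped corpus for reading as UTF-8 text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def top_k_words(path, top_k):
    """Return the top_k most frequent words, most frequent first."""
    counter = Counter()
    with open_corpus(path) as corpus:
        for line in corpus:
            # Assumption: lowercase each line and split on whitespace.
            counter.update(line.lower().split())
    return [word for word, _ in counter.most_common(top_k)]
```

Calling top_k_words("librispeech-lm-norm.txt.gz", 500000) would give a word list comparable in spirit to vocab-500000.txt, though the exact contents depend on the normalization the real script applies.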
Afterwards you can use generate_package.py to generate the scorer package using the lm.binary and vocab-500000.txt files:
cd data/lm
python3 generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-500000.txt \
  --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
Building your own scorer
Building your own scorer can be useful if you’re using models in a narrow usage context, with a more limited vocabulary, for example. Building a scorer requires text data matching your intended use case, which must be formatted in a text file with one sentence per line.
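As a rough illustration of the one-sentence-per-line format, the following Python sketch turns a block of raw text into such a file. The function names to_sentence_lines and write_corpus are hypothetical, and the sentence splitting is deliberately naive (it assumes sentences end with '.', '!' or '?'). Real corpora will likely need more careful cleaning.

```python
import re


def to_sentence_lines(raw_text):
    """Split raw text into a list of sentences, one per output line."""
    # Collapse all runs of whitespace (including newlines) into single spaces.
    flat = " ".join(raw_text.split())
    # Naive assumption: a sentence ends at '.', '!' or '?' followed by a space.
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [part.strip() for part in parts if part.strip()]


def write_corpus(raw_text, path):
    """Write one sentence per line to path and return the sentence count."""
    sentences = to_sentence_lines(raw_text)
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(sentence + "\n")
    return len(sentences)
```

The resulting file can then be passed as the --input_txt argument, either as is or gzipped.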
The LibriSpeech LM training text used by our scorer is around 4GB uncompressed, which should give an idea of the size of a corpus needed for a reasonable language model for general speech recognition. For more constrained use cases with smaller vocabularies, you don’t need as much data, but you should still try to gather as much as you can.
With a text corpus in hand, you can then re-use the generate_lm.py and generate_package.py scripts to create your own scorer that is compatible with DeepSpeech clients and language bindings. Before building the language model, you must first familiarize yourself with the KenLM toolkit. Most of the options exposed by the generate_lm.py script are simply forwarded to KenLM options of the same name, so you must read the KenLM documentation in order to fully understand their behavior.
After using generate_lm.py to create a KenLM language model binary file, you can use generate_package.py to create a scorer package as described in the previous section. Note that we have an lm_optimizer.py script which can be used to find good default values for alpha and beta. To use it, you must first generate a package with any values set for the default alpha and beta flags. For this step, it doesn’t matter what values you use, as they’ll be overridden by lm_optimizer.py. Then, use lm_optimizer.py with this scorer file to find good alpha and beta values. Finally, use generate_package.py again, this time with the new values.
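The two packaging passes can be sketched in Python as follows. This only builds the generate_package.py command lines, using the flags and file names from the command in the previous section. It does not invoke lm_optimizer.py, whose options are not covered here. The helper package_command and the placeholder values 1.0 are assumptions made for illustration.

```python
def package_command(alpha, beta, lm="lm.binary", vocab="vocab-500000.txt",
                    alphabet="../alphabet.txt", package="kenlm.scorer"):
    """Build the argument list for generate_package.py with given alpha/beta."""
    return [
        "python3", "generate_package.py",
        "--alphabet", alphabet,
        "--lm", lm,
        "--vocab", vocab,
        "--package", package,
        "--default_alpha", str(alpha),
        "--default_beta", str(beta),
    ]


# First pass: arbitrary placeholder values, later overridden by the optimizer.
first_pass = package_command(1.0, 1.0)

# Second pass: substitute the values reported by lm_optimizer.py. The numbers
# below are simply the defaults from the previous section, used as an example.
final_pass = package_command(0.931289039105002, 1.1834137581510284)

# To actually run a pass from data/lm:
#   import subprocess; subprocess.run(final_pass, check=True)
```

Keeping the command in one helper makes it harder for the two passes to drift apart in anything other than the alpha and beta values.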