
wav2letter++: Facebook's Fast Open-Source Speech Recognition System



Vitaliy Liptchinsky introduces wav2letter++, an open-source deep learning speech recognition framework, explaining its architecture and design, and comparing it to other speech recognition systems.


Vitaliy Liptchinsky earned his PhD at Vienna Technical University (TU Wien), Distributed Systems Group. In his professional career, Vitaliy worked on solving a vast variety of engineering problems, ranging from prehistoric mobile applications and enterprise systems to highly optimized storage engines and large-scale deep learning systems.

About the conference: a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.


Liptchinsky: My name is Vitaliy Liptchinsky, and I'm a research engineering manager at Facebook AI Research. Today we are going to talk about automatic speech recognition with wav2letter++, a new speech toolkit from Facebook AI Research. Some of the slides in this presentation were borrowed from Ronan Collobert, who was not hurt in the process. I don't have evidence, but you have to trust me on this one.

First we're going to talk about automatic speech recognition and how it works in general. We will discuss some of the neural network architecture ideas used in wav2letter, along with sequential losses, or, as we call them, criteria. Afterwards we will discuss language models and decoding. The second part will be a discussion of the toolkit, with an overview of its design and key components, the key libraries we rely on, as well as Flashlight, which is the neural network library that was developed as part of wav2letter.

In the end we'll talk about benchmarks. The main standard benchmark for speech recognition is word error rate, and we will also talk about training and decoding speed, which is often important for real-time speech recognition. First we will focus on the research agenda for wav2letter and the kind of research directions we pursue at Facebook AI Research.

Automatic Speech Recognition

Automatic speech recognition typically involves the following components. As input you have a raw audio waveform, to which some handcrafted feature transformations are applied. The features are fed into an acoustic model that produces phonemes. Phonemes are basically the speech units produced by humans, and the phonemes are then converted to actual spellings using a phonetic dictionary.

End-to-end Speech Recognition

End-to-end speech recognition comprises the following components. The acoustic model takes the input features and produces characters, or separate units. Because there is typically not enough training data for acoustic models, language models are trained on large text corpora, and the decoder combines the outputs of the acoustic model and the language model to produce the resulting transcriptions.

The overall mission for speech recognition at FAIR is to understand and advance end-to-end training for all those components. Starting with the acoustic model, can we make the acoustic model simple and efficient? Can we improve language models and make them scalable? Can we train feature transformations (I will go into a bit more detail later), and can we make the decoder differentiable?

Features – Train Them

We will start with the features. Here you see two figures. On the left you see the conventional input feature transformations employed in speech recognition. Unlike computer vision models, which benefit from learning directly from pixels, speech recognition still relies on fixed handcrafted features, which include windowing of speech frames, fast Fourier transforms, applying triangular filters, and so on. I don't want to go into details on the speech transformations; they are too domain-specific and not interesting for other domains.

The conventional input feature transformations used in speech recognition can be approximated with convolutional layers that can be trained together with the acoustic model and actually produce better accuracy.

Acoustic – How It Works

Let's talk about the acoustic model and how it works in general. As the input we have some audio wave, and a neural network is applied essentially to segments of it. This is a convolutional neural network: it's applied to a segment of the input audio wave and it outputs scores for characters, or graphemes, as we call them. In the simplest case, essentially when the acoustic model is perfect, we can just consider the maximum score the neural network outputs for every speech window it is applied to.

In this case the top symbol denotes silence, and let's say that the maximum score is predicted for silence. Then we slide the network essentially by its stride, so if the network is comprised of convolutional layers that have no stride, then we basically slide by one frame. Let's assume the output is silence again. We slide it further, and it outputs the highest score for the letter "t," and then again "t," "h," "e," and then silence, then "c," "a," "a" again, "t," and then again silence.

The model follows a duration model, so letter repetitions are mapped to the same character. The frame-by-frame output for "The cat sat" in this example, with its repeated letters, would be mapped to "The cat sat" without repetitions. Let's remember that the vertical line stands for silence, because we will keep referring to it in later slides.
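The mapping from per-frame labels to text can be sketched in a few lines of Python. This is only an illustration of the duration model as described in the talk, not wav2letter code: the function name `collapse` and the sample frame string are made up, and genuine double letters (which would need special handling) are ignored.

```python
from itertools import groupby

SILENCE = "|"  # the vertical line used for silence in the talk


def collapse(frame_labels: str) -> str:
    """Merge runs of identical frame labels, then turn silences into word breaks."""
    merged = "".join(label for label, _ in groupby(frame_labels))
    return " ".join(word for word in merged.split(SILENCE) if word)


# Hypothetical best-scoring label for each frame, as in the sliding-window example.
transcript = collapse("||tthe|caat|")
print(transcript)
```

Each run of repeated labels counts as one character, so the number of frames a letter spans does not change the text.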

Acoustic Model – Make It Simple

Let's make the neural network as simple as possible. We have this input audio waveform, and we apply input feature transformations, which were handcrafted by scientists to represent the human vocal cords. The first layer of the acoustic model is a one-dimensional convolution. Why a one-dimensional convolution? Convolutions are highly parallelizable and they typically have very efficient implementations on modern hardware. We hope that in the future there will be even more advancements in hardware design for [inaudible 00:08:46] one-dimensional convolutions.

Another building block of the acoustic models we employ is the gated linear unit. What do gated linear units do? It's a non-linearity function: it takes the output of a convolution, splits it into two tensors of equal size, and uses one tensor as an output gate. If you are familiar with how an LSTM works, this is exactly how its output gate works. For the output gates the sigmoid function is applied, so it returns values in the range from 0 to 1. Those are then applied to the other part as gates, by taking an element-wise product between the matrices or tensors.

Gated linear units were successfully applied to language models, and they address the vanishing gradient problem because, when all the gates are 1.0, they let the input flow right through to the output of the network. After every gated linear unit we apply dropout. These three layers, the one-dimensional convolution, the gated linear unit, and dropout, I call the gated convnet block.
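As a rough illustration of the gated linear unit described above, here is a plain-Python sketch on a flat list of numbers. The function name `glu` and the sample values are assumptions for illustration; a real implementation would operate on tensors and split along the channel dimension.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def glu(conv_output: list) -> list:
    """Split the input in two halves and gate the first half with the second."""
    half = len(conv_output) // 2
    values, gates = conv_output[:half], conv_output[half:]
    # Element-wise product of the values with gates squashed into (0, 1).
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# Gates of 0.0 give sigmoid(0.0) = 0.5, so each value is halved.
gated = glu([1.0, 2.0, 0.0, 0.0])
print(gated)
```

The output has half as many elements as the input. As a gate grows large its sigmoid approaches 1.0, and the corresponding value passes through almost unchanged.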

Acoustic Model – Architecture and Few Tricks

What does the overall architecture look like? We have the input audio wave, we apply input feature transformations, and then come the gated convnet blocks, each of which consists of the convolution, the gated linear unit, and the dropout. In the first block of our state-of-the-art architecture, the kernel width of the convolution is 13, the number of channels increases from 40 to 200, and dropout is 0.2. In the next one the kernel width is 14, so we are increasing the kernel width, and we are also increasing the output channels and the dropout.

Here are 13 dots, which correspond to exactly 13 layers that follow the same pattern of increasing kernel widths, output channels, and dropout. The final convolution layer has kernel width 29, it has 900 output channels, and the dropout is very high, almost 0.6. The convolution layers are followed by linear layers with high dropout, and the final linear layer is essentially a classifier for the characters: 26 English characters plus 4 special symbols, one for silence, another for the apostrophe, and two others.

For each consecutive convolution layer we increase kernel widths, increase channels, and increase dropout. The overall network receptive field, essentially how much of the audio wave the network sees, is around 2 seconds. In other words, 2 seconds of audio correspond to one character. The motivation is to put more modeling capacity and more regularization towards the output layers; I will cover why in the next slides.

Language Model – How It Works

Let's talk about the language model. Acoustic models are not perfect, because there is typically a limited amount of training data for the acoustic model, where the training data is the mapping between audio waves and the output characters, or the text. So we need language models that are trained on large corpora of text. I will classify them into two categories. The first is statistical n-gram language models, which generate probabilities for a sequence of words. N-gram language models are widespread and used in many different domains. The other is feed-forward neural network models, which output the conditional probability of the next word based on the context of the preceding words.
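To make the n-gram idea concrete, here is a minimal sketch of a bigram model estimated by counting, using only the Python standard library. The one-sentence corpus and the function name `bigram_model` are invented for illustration; production n-gram models are trained on large corpora and use smoothing, which is omitted here.

```python
from collections import Counter


def bigram_model(corpus: str):
    """Return a function prob(word, prev) estimated by simple counting."""
    tokens = corpus.split()
    context_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))

    def prob(word: str, prev: str) -> float:
        if context_counts[prev] == 0:
            return 0.0  # unseen context; a real model would back off or smooth
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob


prob = bigram_model("the cat sat on the mat")
# "the" is followed once by "cat" and once by "mat".
p_cat_given_the = prob("cat", "the")
print(p_cat_given_the)
```

The probability of a whole word sequence is then the product of these conditional probabilities along the sequence.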

It's also possible to train not only word language models but character language models. In this example the character language model would output the probability of "s" given the input characters "the cat." Do acoustic models learn language modeling? The short answer is yes. Anecdotally, what we have observed is that for noisy audio segments, the acoustic model can output a sequence of characters that would correspond to an article [inaudible 00:14:32] when in the target you actually have a sequence of eight characters.

The idea behind putting more weight, more capacity, and more regularization towards the output layers of the acoustic model is to give more modeling capacity to the language modeling part of the acoustic model, and more variety.

Language Model – Architecture

Language models work with word embeddings. It's standard practice today, so I won't go into details, but essentially every word corresponds to a vector of floats representing its embedding. We apply the same gated linear units for language models as well, and at the end the language model outputs probabilities for the entire dictionary of words. If the dictionary is small, a softmax is OK to use, but for a large dictionary a hierarchical softmax is typically applied.

Acoustic Model – Training

Let's go back to the acoustic models and discuss how they are trained. The loss that we use is called the ASG loss, which stands for auto segmentation. It basically has to solve two problems. One is a classification problem: again, output 26 English characters plus special characters, one of which stands for silence. The other is segmentation, essentially when to output which character. To illustrate the segmentation problem, let's say we have the word "cab," and the letter dictionary has only three labels: a, b, and c. Over four frames, there are essentially three or more different ways to spell the word "cab." Remember that since we employ a duration model, repeating characters are mapped to one character. What the ASG loss additionally employs is transition scores: it has a matrix of transition scores between every pair of characters, and this matrix is trained jointly with the acoustic model.
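The segmentation problem in the "cab" example can be checked by brute force. The sketch below enumerates every four-frame label sequence over the alphabet a, b, c and keeps those that collapse to "cab" under the duration model. The function name `alignments` is made up, and exhaustive enumeration is only practical for toy sizes.

```python
from itertools import groupby, product


def alignments(target: str, num_frames: int, alphabet: str) -> list:
    """List every frame labeling that collapses to target when repeats are merged."""
    found = []
    for labels in product(alphabet, repeat=num_frames):
        collapsed = "".join(label for label, _ in groupby(labels))
        if collapsed == target:
            found.append("".join(labels))
    return found


ways = alignments("cab", 4, "abc")
print(ways)
```

With exactly these three labels and four frames, the enumeration finds three alignments; each one stretches a different letter over two frames.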

The score for c-a-a-b, which is one of the possible ways of transcribing the word "cab," is the acoustic score for "c," plus the score for the transition from "c" to "a," plus the acoustic score for "a," plus the score for the transition from "a" to "a," and so on, up to the acoustic score for "b." Since the neural network acoustic model outputs scores for all characters, we can build a graph of all possible transcriptions.
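The path score just described can be written down directly. In this sketch the emission and transition numbers are invented for illustration, and `path_score` is a hypothetical helper name: `emissions[t][c]` is the acoustic score of character `c` at frame `t`, and `transitions[(p, c)]` is the score for moving from `p` to `c`.

```python
def path_score(path: str, emissions: list, transitions: dict) -> float:
    """Sum the acoustic scores along the path plus the transition scores between frames."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[(path[t - 1], path[t])] + emissions[t][path[t]]
    return score


alphabet = "abc"
# Made-up acoustic scores for four frames.
emissions = [
    {"a": 0.0, "b": 0.0, "c": 2.0},
    {"a": 3.0, "b": 0.0, "c": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.0},
    {"a": 0.0, "b": 4.0, "c": 0.0},
]
# Made-up transition scores; every pair defaults to 0.0.
transitions = {(p, c): 0.0 for p in alphabet for c in alphabet}
transitions.update({("c", "a"): 1.0, ("a", "a"): 0.5, ("a", "b"): 1.0})

# c (2.0) + c->a (1.0) + a (3.0) + a->a (0.5) + a (1.0) + a->b (1.0) + b (4.0)
score_caab = path_score("caab", emissions, transitions)
print(score_caab)
```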

The first part of the ASG loss, on the left, is the unconstrained graph, and the second part, on the right, is the constrained graph. The constrained graph models all correct transcriptions of the target sequence. This is a duration model, so from "c" the model can go to another "c" or to "a," and from "a" to "b" or to another "a." The score for the unconstrained graph will always be higher than or equal to the score for the constrained graph. In the best case, the difference between the two graphs would be 0, and this is basically the ASG loss: the difference between the unconstrained paths and the constrained paths.

Another popular loss in speech recognition is CTC, connectionist temporal classification. It's not used only in speech recognition; it is also employed in optical character recognition, for example. Instead of a duration model it has a blank label, which handles character repetitions and garbage frames. Here is the difference between the CTC and ASG graphs. ASG is simpler. If we unfold those graphs, for CTC, from a character one can go to a garbage frame or to another character. For ASG, from a character one can go only to another character.

In practice, when comparing the two criteria, we observe a significant delay between CTC and ASG: CTC had around 500 milliseconds of delay compared to ASG. In this picture you see that CTC typically produces spikes, so a number of garbage labels and then a character label, whilst ASG basically employs the duration model, so it would output "MMMM" at the beginning and then "IAI," so it models the duration of each character.
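A toy version of this loss can be computed by enumerating paths. The sketch assumes that the score of a graph is the log-sum-exp of its path scores, which is how the ASG criterion is usually formulated; the talk itself only says "score". The function names and all numbers are invented, and real implementations use dynamic programming rather than enumeration.

```python
import math
from itertools import groupby, product


def logsumexp(values: list) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def asg_loss(target: str, emissions: list, transitions: dict, alphabet: str) -> float:
    """Unconstrained graph score minus constrained graph score (never negative)."""
    all_scores, valid_scores = [], []
    for path in product(alphabet, repeat=len(emissions)):
        score = emissions[0][path[0]]
        for t in range(1, len(path)):
            score += transitions[(path[t - 1], path[t])] + emissions[t][path[t]]
        all_scores.append(score)
        # Constrained graph: only paths that collapse to the target.
        if "".join(label for label, _ in groupby(path)) == target:
            valid_scores.append(score)
    return logsumexp(all_scores) - logsumexp(valid_scores)


alphabet = "abc"
emissions = [
    {"a": 0.0, "b": 0.0, "c": 2.0},
    {"a": 3.0, "b": 0.0, "c": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.0},
    {"a": 0.0, "b": 4.0, "c": 0.0},
]
transitions = {(p, c): 0.0 for p in alphabet for c in alphabet}
loss = asg_loss("cab", emissions, transitions, alphabet)
print(loss)
```

Because the constrained paths are a subset of all paths, the difference cannot be negative, and it shrinks toward 0 as the model concentrates its score on correct alignments.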
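For comparison with the duration model, the CTC collapsing rule can be sketched the same way: merge repeated labels first, then drop the blanks. The blank symbol `-` and the function name `ctc_collapse` are choices made for this illustration only.

```python
from itertools import groupby

BLANK = "-"  # stand-in for the CTC blank (garbage) label


def ctc_collapse(frame_labels: str) -> str:
    """Merge runs of identical labels, then remove blank labels."""
    merged = (label for label, _ in groupby(frame_labels))
    return "".join(label for label in merged if label != BLANK)


word = ctc_collapse("--c-aa--b")
double = ctc_collapse("aa-a")
print(word, double)
```

The second call shows how the blank handles character repetitions: a blank between two runs of "a" keeps them as two separate letters.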

Decoder – How It Works

So far we've covered acoustic models and language models, and the decoder puts it all together. We have the acoustic model, the lexicon, and the language model. We have not covered the lexicon so far, but it is simply a trie of words, which also contains the scores for unigrams. What does the decoder do? It simply runs a beam search, constrained to a fixed beam size. It keeps track of the position in the trie and the language model state, and at each step, for each previous hypothesis in the beam, it tries to add new hypotheses that are also consistent with the lexicon, so it tries to produce valid words, not just any words. If a word is emitted, it adds the score from the language model. Then, for hypotheses with equal lexicon and language model states, it tries to merge them.
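A lexicon trie of the kind mentioned here can be sketched with nested dictionaries. Everything in this example is illustrative: the four-word lexicon, the `$` end-of-word marker, and the helper names are assumptions, and unigram scores are left out for brevity.

```python
END = "$"  # hypothetical marker stored at nodes where a word ends


def build_trie(words: list) -> dict:
    """Build a nested-dict trie with one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = word
    return root


def next_letters(trie: dict, prefix: str) -> set:
    """Return the letters that can follow prefix while staying inside the lexicon."""
    node = trie
    for letter in prefix:
        if letter not in node:
            return set()
        node = node[letter]
    return {key for key in node if key != END}


trie = build_trie(["the", "cat", "sat", "sad"])
options = next_letters(trie, "sa")
print(sorted(options))
```

Restricting each hypothesis to the letters returned by such a lookup is what keeps the beam search on valid words.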

It's also possible to make the decoder differentiable, so fully trainable, but I won't go into details in this presentation, and I encourage you to read the paper. Let's consider an example of how the decoder works. Let's say we have the hypothesis "the cat sa." "The cat" is the language model state: the decoder has already produced two words, "the" and "cat," and it has a prefix "sa," which is a position in the trie. From this state, following the paths in the trie, we can try to append the letter "t," the letter "a," or the letter "d." If we append the letter "t" we create a new hypothesis, and its score is increased by the acoustic score of "t" generated at the current acoustic window, plus the transition score from "a" to "t," plus the score from the lexicon, which again is the score of unigrams.

For "a," we would generate a hypothesis with the score increased by the acoustic score for "a" and the transition score from "a" to "a." There is no lexicon score because we essentially stay at the same node in the lexicon. When transitioning to the letter "d," the score is incremented by the acoustic score for "d," that is essentially the score of the [inaudible 00:24:16] transition from "a" to "d," and the lexicon unigram score for the word "sad."

Let's continue with the next example. Let's say in the beam we have a hypothesis, "the cat sat." "The" and "cat" are words that were already produced, and "sat" is a prefix in the trie. What we can do in this case is consider appending the letter "t," which again doesn't change the position in the lexicon, or we can append a silence, which is the vertical line at the bottom. In the latter case, we add the acoustic score for silence, the score for transitioning from "t" to silence, and the language model score for "the cat sat," which is a [inaudible 00:25:10] type sequence.
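The two decoder examples can be combined into one small sketch of a single expansion step. It is a simplification under several assumptions: the lexicon is a plain dictionary scanned linearly instead of a trie, the lexicon score of a prefix is taken as the best unigram score among words that start with it, `lm_score` is a stand-in callable, and all numbers are invented. It is not the wav2letter++ decoder.

```python
from dataclasses import dataclass

SILENCE = "|"


@dataclass(frozen=True)
class Hypothesis:
    words: tuple   # words already emitted (the language model state)
    prefix: str    # letters of the word in progress (the lexicon position)
    last: str      # label emitted at the previous frame
    score: float


def expand(hyp, acoustic, transition, unigram, lm_score):
    """Return the one-frame extensions of hyp that stay inside the lexicon."""
    def step(label):
        # Acoustic score of the label plus the transition score into it.
        return hyp.score + acoustic[label] + transition.get((hyp.last, label), 0.0)

    # Repeating the previous label keeps the lexicon position: no lexicon score.
    out = [Hypothesis(hyp.words, hyp.prefix, hyp.last, step(hyp.last))]
    for label in acoustic:
        if label in (hyp.last, SILENCE):
            continue
        new_prefix = hyp.prefix + label
        reachable = [s for word, s in unigram.items() if word.startswith(new_prefix)]
        if reachable:  # only follow prefixes that exist in the lexicon
            out.append(Hypothesis(hyp.words, new_prefix, label,
                                  step(label) + max(reachable)))
    # Silence after a complete word emits it and adds the language model score.
    if hyp.last != SILENCE and hyp.prefix in unigram:
        words = hyp.words + (hyp.prefix,)
        out.append(Hypothesis(words, "", SILENCE, step(SILENCE) + lm_score(words)))
    return out


acoustic = {"t": 2.0, "a": 1.0, "d": 0.5, SILENCE: 0.0}
transition = {("a", "t"): 0.5, ("a", "a"): 0.25}
unigram = {"the": -0.5, "cat": -1.5, "sat": -1.0, "sad": -2.0}


def lm_score(words):
    return -0.5 * len(words)  # toy language model


start = Hypothesis(("the", "cat"), "sa", "a", 10.0)
candidates = {h.prefix: h.score for h in expand(start, acoustic, transition, unigram, lm_score)}
print(candidates)
```

A full beam search would collect the expansions of every hypothesis in the beam, merge those with equal lexicon and language model states, and keep only the best few for the next frame.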

wav2letter++ Design

wav2letter++ is built entirely in C++. It's fast, and it has type safety and static typing, which helps when scaling in terms of the amount of code. One key library that wav2letter employs is ArrayFire. It's an open-source tensor library, and it features just-in-time compilation, which I will cover in more detail later. It supports multiple backends: CUDA, CPU, and OpenCL.

Our neural network library, which we call Flashlight, is built on top of ArrayFire. Flashlight has autograd, different neural network modules, including convolutions and recurrent neural network modules, serialization, training, and so on. In addition, we use NCCL and MPI, which are GPU and CPU communication libraries respectively, and we use cuDNN and NNPACK as accelerator packages. The criteria, essentially the losses that we discussed, were CTC and ASG; they can have a specific implementation in cuDNN, for example, or a more generic implementation in ArrayFire. We also support CTC, which I have not covered in this presentation.

As executables, we have train, test, and decode. Train is for training, test is for running the acoustic model and outputting the scores, and decode actually uses the decoder along with a language model to produce the best possible transcriptions. We currently support two data sets, "The Wall Street Journal" and LibriSpeech, and we have scripts for each of those. All of this is what actually constitutes wav2letter++.

PReLU Implementation

Here are examples of the reasons why we chose ArrayFire. The first one is a PReLU implementation. PReLU is parametric ReLU; the formula is at the bottom. It's pretty simple: if x is negative then you scale it, and if x is positive you do not touch it. This is one of the possible implementations with the Keras API: first we need to evaluate the positives, and then we need to add the negatives. We need to store the pos variable in memory along with all the [inaudible 00:28:56].

In contrast, with the ArrayFire JIT, the mask that we apply to compute the positives, so all the x that are greater than or equal to 0, doesn't have to be evaluated in place and stored in memory. Just-in-time compilation avoids intermediate copies, and it works seamlessly on both CPU and GPU.
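The slide code is not part of the transcript, so the following is only a plain-Python sketch of the arithmetic being discussed, not the Keras or ArrayFire code from the talk. `prelu_two_step` materializes the intermediate lists in the way the speaker describes, while `prelu_fused` computes each output in a single pass, which is the effect a JIT-fused kernel aims for. The function names and the value of `alpha` are made up.

```python
def prelu_two_step(xs: list, alpha: float) -> list:
    """Compute the positives and the scaled negatives separately, then add them."""
    pos = [x if x >= 0 else 0.0 for x in xs]          # intermediate copy
    neg = [alpha * x if x < 0 else 0.0 for x in xs]   # second intermediate copy
    return [p + n for p, n in zip(pos, neg)]


def prelu_fused(xs: list, alpha: float) -> list:
    """Compute each output element in one pass, with no intermediate lists."""
    return [x if x >= 0 else alpha * x for x in xs]


inputs = [-2.0, 0.0, 3.0]
two_step = prelu_two_step(inputs, 0.25)
fused = prelu_fused(inputs, 0.25)
print(two_step, fused)
```

Both versions return the same values; the difference is only in how much temporary memory is touched along the way.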

gfor and batchFunc

Another interesting piece of functionality that ArrayFire supports is gfor, which is a simple parallel loop, and batchFunc, which batches the inputs. In this example coefs is a static constant array, and we have a batched input, so we can apply it to whatever our batch size of inputs is and multiply. It's very easy to express with ArrayFire, as in this example. With batchFunc you just provide the multiply operator and it will be applied to every input; it figures out the dimensions by itself.
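As a loose analogy in plain Python (this is not the ArrayFire API, and the helper name `batch_func` is invented), the behavior described is that one coefficient array is combined with every input in a batch using a supplied operator:

```python
import operator


def batch_func(op, coefs: list, batch: list) -> list:
    """Apply op between coefs and each input of the batch, element by element."""
    return [[op(c, x) for c, x in zip(coefs, row)] for row in batch]


coefs = [1.0, 2.0, 3.0]
batch = [[1.0, 1.0, 1.0], [2.0, 0.5, 0.0]]
scaled = batch_func(operator.mul, coefs, batch)
print(scaled)
```

In a tensor library the same operation would run as one batched kernel instead of a Python loop over the rows.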

Flashlight – Neural Network

We designed Flashlight on top of ArrayFire. It was designed with the best essence from Torch. It's written entirely in C++, and on the right I have sample code for [inaudible 00:30:49]. Because ArrayFire supports just-in-time compilation, Flashlight also supports just-in-time compilation, and it supports both CPU and GPU backends. It's open source, so you can download it online.

Word Error Rate (WER)

There are benchmarks that we use to evaluate our work, and the first one is the word error rate. It's a pretty standard benchmark in speech recognition. Word error rate is the Levenshtein distance, at the word level, between the transcription produced by the ASR system and the reference. What this means is that word error rate is the number of all substitution, deletion, and insertion errors divided by the total number of words in the reference. Here are some examples.

Let's say the reference is "the cat sat on the mat," and the hypothesis is "the cat sat mat." Here we are missing two words, so that is two deletions. With the same reference, the hypothesis is "the bat sat on the mat," so "cat" was substituted with "bat," and where we have output an additional word "at," that's counted as one insertion.

In terms of word error rate, at the time the paper was published we were state-of-the-art, but this changes all the time. In all domains there are newer and newer results and lower and lower word error rates. These results are a couple of months old, but nevertheless were state-of-the-art everywhere. There are two data sets, "The Wall Street Journal" and LibriSpeech; LibriSpeech is a data set of audiobooks.

We also compare performance characteristics with other speech recognition toolkits available in open source. Kaldi has been available for quite a long time. It's mostly written in C++, but the recipes are in Bash. ESPNet is a fairly recent framework from academia, mostly written in Python, and its backend is PyTorch. OpenSeq2Seq is the framework from Nvidia, and it's built on TensorFlow. wav2letter is written entirely in C++ and built on top of ArrayFire.
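Word error rate as defined here is straightforward to compute with the standard dynamic-programming edit distance over words. The function name `word_error_rate` is a choice for this sketch, and whitespace splitting stands in for whatever text normalization a real evaluation would apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the distance between the processed reference words and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "the cat sat on the mat"
wer_deletions = word_error_rate(reference, "the cat sat mat")
wer_substitution = word_error_rate(reference, "the bat sat on the mat")
print(wer_deletions, wer_substitution)
```

The first hypothesis has two deletions against six reference words and the second has one substitution, so the rates are 2/6 and 1/6. Because insertions are counted too, the rate can exceed 1.0 when the hypothesis is much longer than the reference.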

Benchmark: Training Epoch Time

In our setting we used our in-house cluster, in which every machine features 8 GPUs, Tesla V100s, and there's 100Gbps InfiniBand between nodes. We limited it to CTC training because it's available in all frameworks except Kaldi, for which we use LF-MMI. We compare two main configurations: one is a smaller network with only 30 million parameters, and the other is a larger network with 100 million parameters. The smaller network features two convolutions and five recurrent layers, so bidirectional LSTMs, and the larger one is comprised of 19 convolutions.

Even for the smaller network, we have seen wav2letter being around 15% faster than the next-best system. For the larger network, OpenSeq2Seq with mixed precision, so float16 training, outperforms wav2letter on 2 and 4 GPUs. Mixed precision is something that we can consider adding in the future. Note that the charts are in log scale.

For the training epoch time, here is essentially a breakdown of what happens. We show that on "Wall Street Journal" wav2letter is faster.

Benchmark: Decoding

For the decoding speed: if you remember, the decoder is the component that takes as input the outputs of both the acoustic model and the language model. We compare the decoding speed, and wav2letter has much higher throughput. Looking at the time per sample, where a sample is an audio sequence, and at the memory, it is both faster and uses less memory.

As a side note, ESPNet does not support n-gram LMs. N-gram LMs are really fast, so if it supported them, its time per sample in milliseconds would be lower.

Questions and Answers

Participant 1: I've got a question regarding your approach of just restricting your results to lexicographical possibilities. Have you seen your model having a strong prior on the first letters and words, and just filling the gaps itself? So if you would just start with two or three letters, would it fill in the gaps, even if the person stops talking?

Liptchinsky: We have transition scores which help fill in the gaps, but in practice we've seen that the model is quite resilient to no speech or noisy speech. It hesitates to produce anything if there is no speech, or if there's any noise going on in the audio.

Participant 1: What I mean is can it complete words which a person didn't complete in speech? Because it's still trying to fill in the words for a lexicon.

Liptchinsky: It's anecdotal. There was no scientific study performed to study this well, but in practice, as in the examples where it was replacing the article "a" with "the," it can basically feature some part of language modeling, and it can output characters where it doesn't hear clearly or there's no output.

Participant 2: I have two questions. The first one, what was the major motivation for this work: was it speed, optimization, or performance? The second one is, are you guys using this in production at Facebook?

Liptchinsky: For the second question, the answer is yes. For the first question, what was the motivation? The first one is simplicity: we need a simple toolkit for research. The standard toolkit right now is Kaldi, which is not that simple.

Participant 3: Adding to that, what's the use case in Facebook where speech recognition is used?

Liptchinsky: For this talk I'm focusing on the public papers that we have published. I'm not sure I can talk about internal use cases.

Participant 4: What was the reasoning to go with another library apart from speed? Because if you would want to go for speed and just have a toolkit, Facebook is heavily involved in developing PyTorch. What was the reasoning to go for an entirely different toolkit and not use PyTorch, and then export the models using [inaudible 00:39:48] to super-fast C++?

Liptchinsky: The first reason is that at the time when we designed and developed this library there was no C++ interface for PyTorch available. Generally, at the time the landscape of C++ machine learning looked bad. That's the first one. The second one is that we really wanted to explore alternatives and see what the performance benefits are. We measured end-to-end versus TensorFlow and PyTorch-based toolkits, that's here, and that's an interesting study. You can assume that most of the time is spent computing convolutions, or matrix multiplications, but we wanted to study what the overhead of scripting languages is nevertheless, what is extra, and this is the result that we got.




Recorded at:

Jun 12, 2019