OpenAI Introduces Sparse Transformers for Deep Learning of Longer Sequences

OpenAI has developed the Sparse Transformer, a deep neural-network architecture for learning sequences of data, including text, sound, and images. The networks can achieve state-of-the-art performance on several deep-learning tasks with faster training times.

Several common AI applications, such as image captioning or language translation, can be modeled as sequence learning; that is, predicting the next item in a sequence of data. Sequence-learning networks typically consist of two sub-networks: an encoder and a decoder. Once the two are trained, often the decoder can be used by itself to generate completely new outputs; for example, artificial human speech or fake Shakespeare.

Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, have been particularly effective in solving these problems. In recent years, however, a simpler architecture called the Transformer has gained popularity, since the Transformer reduced training costs by an order of magnitude or more compared to other architectures.

Instead of processing each element of the input sequence in order, as an RNN does, the Transformer processes the full sequence in parallel. The key idea is the use of attention. Briefly, attention is a matrix of weights that encodes the contribution of each input element to each output element. The number of attention weights therefore grows as the square of the length of the input sequence; further, there is a separate attention matrix for each layer of the network. Since the total number of weights in the network is limited, this imposes a trade-off between the network's depth and the maximum sequence length it can handle.
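To make the quadratic growth concrete, the following is a minimal NumPy sketch of a dense attention weight matrix using standard scaled dot-product attention. It is an illustration rather than OpenAI's code: the function name `dense_attention_weights` and the random projection matrices `w_q` and `w_k` are hypothetical stand-ins for learned parameters.

```python
import numpy as np


def dense_attention_weights(x, seed=0):
    """Return the (N, N) attention weight matrix for an input of shape (N, d)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # Hypothetical random projections standing in for learned parameters.
    w_q = rng.standard_normal((d, d))
    w_k = rng.standard_normal((d, d))
    q, k = x @ w_q, x @ w_k
    # One score for every (output position, input position) pair: N * N entries.
    scores = (q @ k.T) / np.sqrt(d)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


x = np.random.default_rng(1).standard_normal((8, 4))
weights = dense_attention_weights(x)
print(weights.shape)  # (8, 8)
```

Doubling the sequence length from 8 to 16 quadruples the number of entries in `weights`, and a network holds one such matrix per layer.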

OpenAI's innovation is a sparse factorization of the attention matrices that reduces the complexity of the attention computation from \(O(N^2)\) to \(O(N\sqrt{N})\), where \(N\) is the sequence length. This allows OpenAI to "model sequences with tens of thousands of elements using hundreds of layers," compared to other networks that can only handle sequences of "a few thousand elements."
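The scaling difference can be sketched by counting the entries that a sparse attention mask keeps. The example below is loosely modeled on a strided pattern, in which each position attends to a short window of recent positions plus every stride-th earlier position, with a stride of about \(\sqrt{N}\). It is an assumption-laden illustration, not the released kernels: the function name `strided_sparse_mask` is hypothetical, and the mask merges the local and strided connections into a single pattern for simplicity.

```python
import math

import numpy as np


def strided_sparse_mask(n, stride=None):
    """Boolean (n, n) mask; True where position i may attend to position j."""
    if stride is None:
        stride = max(1, math.isqrt(n))  # assumed stride of roughly sqrt(n)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i                      # never attend to future positions
    local = (i - j) < stride             # the most recent `stride` positions
    strided = ((i - j) % stride) == 0    # every stride-th earlier position
    return causal & (local | strided)


for n in (64, 256, 1024):
    kept = int(strided_sparse_mask(n).sum())
    dense = n * (n + 1) // 2             # entries in a dense causal mask
    print(f"N={n}: sparse entries={kept}, dense causal entries={dense}")
```

Each row of the mask keeps at most about \(2\sqrt{N}\) entries, so the total is bounded by roughly \(2N\sqrt{N}\) rather than growing with \(N^2\).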

One example of a large-scale Transformer-based model from OpenAI is MuseNet, a system that "can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles." Much better known is GPT-2, which famously generated a news article about unicorns in the Andes mountains. OpenAI has not released the full GPT-2 model due to "concerns about malicious applications of the technology." However, smaller versions of the model are available, and are powering sites such as Talk to Transformer, where users can type in a custom prompt which the model uses to generate new stories.

Reaction from the community has been mixed. On Hacker News, one commenter stated: "That's really impressive! However, I'm a bit disappointed with the code release. I was expecting the full source code and setup." On Twitter, Ethereum developer Iuri Matias asked, "Why were only small code snippets released and not the full code and trained models? Will this be the norm from now on?"

OpenAI's paper, Generating Long Sequences with Sparse Transformers, is available on arXiv.org. The sparse attention code is available on GitHub.
