Facebook Releases AI Model for Protein Sequence Processing

A team of scientists at Facebook AI Research has released a deep-learning model for processing protein data from DNA sequences. The model contains approximately 700M parameters, was trained on 250 million protein sequences, and learned representations of biological properties that can be used to improve on the current state of the art in several genomics prediction tasks.

The team described the model and several experiments in a paper published on bioRxiv. Using techniques similar to those used in natural language processing (NLP), the researchers trained a Transformer deep-learning model using unsupervised learning on sequences of amino acids, which represent the genetic encoding of proteins. The Transformer learned a representation, or embedding, of the sequences that the researchers showed encodes several properties of the proteins, such as 3D structure and evolutionary relationships. The team also showed that the embedding, when used as an input feature, can improve the performance of other machine-learning tasks on genetic sequence data, such as predicting the evolutionary fitness of genetic mutations.

Deep-learning models for NLP typically use an embedding (a transformation of high-dimensional vectors into a lower-dimensional space) as the first layer in their networks. These embeddings often encode relationships in the original data in interesting ways; for example, in Google's famous word2vec embedding, performing vector arithmetic in the embedding space can produce results such as "Paris - France + Poland = Warsaw."
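The vector arithmetic described above can be illustrated with a minimal sketch. The three-dimensional vectors below are hand-made toy values, not real word2vec output, and the helper names `cosine` and `analogy` are hypothetical; the vectors are chosen so that each capital equals its country plus a shared offset.

```python
import math

# Toy, hand-made vectors (NOT real word2vec values). Each capital is its
# country plus a shared "capital-of" offset along the third axis.
toy_vectors = {
    "France": [1.0, 0.0, 0.0],
    "Paris": [1.0, 0.0, 1.0],
    "Poland": [0.0, 1.0, 0.0],
    "Warsaw": [0.0, 1.0, 1.0],
    "Germany": [0.7, 0.7, 0.0],
    "Berlin": [0.7, 0.7, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("Paris", "France", "Poland", toy_vectors)
print(result)
```

With a real embedding, the nearest neighbour would be searched over the whole vocabulary, and the match is only approximate rather than exact as in this toy setup.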

To learn embeddings for protein sequences, the team built a Transformer neural network with 669.2M parameters, based on the BERT model used for NLP, and used self-supervised learning to train the model on 250 million sequences from the UniParc database. The training data consists of sequences of amino acids; similar to masked language modeling in NLP training, each input sequence was "corrupted" by replacing random parts of the sequence with a special mask token, and the network was trained to correctly identify the removed amino acids.
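The corruption step can be sketched roughly as follows. This is an illustration only, not the team's code: the 15% mask rate is the usual BERT convention and is an assumption here (the article does not state the rate), the `<mask>` token string and the function name `mask_sequence` are hypothetical, and the example sequence is arbitrary.

```python
import random

MASK = "<mask>"  # hypothetical spelling of the special mask token


def mask_sequence(sequence, mask_rate=0.15, rng=None):
    """Replace a random subset of residues with MASK.

    Returns the corrupted token list and a dict mapping each masked
    position to the original amino acid the model would have to predict.
    mask_rate=0.15 is an assumed value borrowed from BERT.
    """
    rng = rng or random.Random()
    tokens = list(sequence)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


# Arbitrary example sequence in one-letter amino-acid codes.
original = "MKTAYIAKQRQISFVKSHFSRQ"
corrupted, targets = mask_sequence(original, rng=random.Random(0))
print(corrupted)
print(targets)
```

During training, the network would see `corrupted` as input and be scored on how well it recovers the residues stored in `targets`.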

After training, the team investigated the properties of the network's learned embedding. The embedding maps each amino acid into a point in embedding space; the researchers noted that the space had a "distinct clustering of hydrophobic and polar residues, aromatic amino acids, and organization by molecular weight and charge." A protein or gene can also be mapped into the space by averaging the points of its constituent amino acids. Using principal component analysis (PCA) on the embedding representation of orthologous genes from different species, the scientists noted that "linear dimensionality reduction recovers species and orthology as primary axes of variation."
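Averaging the per-residue points into a single protein-level vector might look like the sketch below. The two-dimensional lookup table is made up for illustration, and a fixed table is itself a simplification; the names `mean_pool` and `embed_protein` are hypothetical.

```python
def mean_pool(residue_vectors):
    """Average a list of equal-length vectors, one dimension at a time."""
    if not residue_vectors:
        raise ValueError("need at least one residue vector")
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(v[d] for v in residue_vectors) / n for d in range(dim)]


def embed_protein(sequence, table):
    """Map a sequence to one vector by averaging its residues' vectors."""
    return mean_pool([table[aa] for aa in sequence])


# Made-up 2-d residue embeddings (real ones would come from the trained model).
toy_residue_embedding = {
    "A": [0.0, 1.0],
    "K": [1.0, 0.0],
    "W": [1.0, 1.0],
}

protein_vector = embed_protein("AKWA", toy_residue_embedding)
print(protein_vector)
```

Because the result has the same size for a protein of any length, vectors for many proteins can be stacked into one matrix and passed to a dimensionality-reduction method such as PCA.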

Besides encoding chemical and genetic relationships, the embedding was also useful as input to further machine-learning tasks. One such task is secondary structure prediction, in which a machine-learning model tries to predict the local three-dimensional form of portions of a protein chain. By including the embedding representation of the input sequence, the team improved the state-of-the-art result by 2.5 percentage points. The embedding also improved the prediction of tertiary protein structure and of the effects of mutations.
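One common way to use an embedding as an extra input feature is to concatenate it with a conventional per-residue encoding before passing it to a downstream model. The sketch below shows only that feature-building step; it is an assumption about how such a pipeline could be wired, not a description of the team's architecture, and the embedding values and the name `build_features` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def one_hot(aa):
    """Classic 20-dimensional one-hot encoding of a single residue."""
    return [1.0 if aa == code else 0.0 for code in AMINO_ACIDS]


def build_features(sequence, residue_embeddings):
    """Concatenate one-hot features with a learned vector for each residue."""
    if len(sequence) != len(residue_embeddings):
        raise ValueError("need exactly one embedding vector per residue")
    return [one_hot(aa) + list(vec) for aa, vec in zip(sequence, residue_embeddings)]


# Hypothetical 2-d embeddings for a 3-residue sequence; in practice these
# would be produced by the pretrained model, one vector per position.
features = build_features("AKW", [[0.1, 0.9], [0.8, 0.2], [0.7, 0.7]])
print(len(features), len(features[0]))
```

Each row of `features` could then be fed to a per-residue classifier, for example one that predicts a secondary-structure label for that position.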

The team's lead author, Alex Rives, highlighted several of the results on Twitter. When asked by deep-learning researcher Gwern Branwen why the team used only 700M parameters in their model, Rives noted that this was the most that could fit on a single GPU. Branwen replied:

You could probably fit more, I didn't see any mention of reversible layers or reduced precision. Reducing context window length is also an option; it's unlikely you saturated the full 1024 window (eg predict the 1024th token as accurately as the 2nd).

Facebook isn't the only major tech company applying its NLP expertise to genomics problems. Google recently announced that its BigBird NLP model also achieved new state-of-the-art performance on two genomics tasks. While Google has not released its BigBird code, Facebook has open-sourced its model, which is available on GitHub.