Photo by National Cancer Institute on Unsplash

Deep Learning and its Contributions to the Interpretability of Protein Sequences and Tertiary Structures

Miko Planas
6 min read · May 23, 2021


Introduction

Deep Learning, the subfield of Machine Learning that deals with different neural network architectures, has been automating previously manual tasks across multiple fields, including logistics, e-commerce, chemistry and biology. Companies and individuals alike have felt its positive effects, from traffic optimization and product recommendations to environmental solutions. Despite most models being data-hungry and computationally expensive, Deep Learning has been steadily on the rise as data and compute grow in volume, availability and speed. At the intersection of biology and chemistry in particular, deep learning’s contributions mostly dwell in the space of systems biology, drug discovery and various sequence-related and classification tasks. Although Deep Learning is typically criticized for its black-box nature, it has, ironically, been providing state-of-the-art interpretability with regard to protein sequences. In this paper, we discuss the attention-based interpretations offered by protein language models, a model called AlphaFold2 that predicts the 3D structure of a protein, and the limitations of deep learning when it comes to applications in biology.

Brief Background

There is more to a protein than just its sequence of amino acids, and deep learning is helping unravel some of these characteristics. Before diving in, here is a brief summary of terms that will be used throughout the paper:

Neural Networks. These are the basic components of the models to be discussed. Neural networks are biologically inspired networks of layers, essentially weighted graphs of linear units/nodes whose outputs typically pass through nonlinear activations, that can be used for tasks including prediction, generation and classification. Most of the time, neural networks take on complex problems where a solution does not need to be 100% accurate to be beneficial.
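To make this concrete, here is a minimal sketch of a two-layer network’s forward pass in NumPy. The layer sizes, random weights and ReLU activation are arbitrary choices for illustration, not taken from any of the models discussed below.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Toy two-layer network: 8 inputs -> 16 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

x = rng.normal(size=(1, 8))     # one example with 8 features
hidden = relu(x @ W1 + b1)      # weighted sum followed by a nonlinear activation
output = hidden @ W2 + b2       # final prediction (e.g., a score)
print(output.shape)             # (1, 1)
```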

Transformers. A type of neural network architecture built around self-attention blocks (discussed next). Transformer networks, or simply transformers, are normally used for natural language processing problems, but lately there have been numerous papers using transformers for computer vision tasks as well.

Attention. Attention is the component of transformers that mimics human cognitive attention, allowing the model to focus on the most important parts of an input matrix or embedding. Attention blocks are typically built from skip connections and dot products computed on queries, keys and values (all three derived from the input embedding). The key innovation is that the model can look for relevant information globally, that is, across the whole input rather than only within a local window.
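To make the queries, keys and values concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. Skip connections, multiple heads, masking and layer normalization are omitted, and all dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values from the same input
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each position attends to every other
    weights = softmax(scores, axis=-1)        # attention weights sum to 1 per query
    return weights @ V, weights               # weighted combination of the values

rng = np.random.default_rng(0)
seq_len, d = 5, 8                             # e.g., 5 amino-acid tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.shape)                             # (5, 5): one weight per token pair
```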

BERT. BERT, or Bidirectional Encoder Representations from Transformers, is a transformer-based deep learning technique that has grown into one of the most prominent and useful language models to date. Multiple models have grown out of BERT, such as ALBERT and RoBERTa, and many of them power commercial applications such as text summarization. BERT-based models typically take as input a sequence of tokens (characters, words, sentences, etc.) and output an embedding representation of those tokens.
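As a rough illustration of this tokens-in, embeddings-out flow, the sketch below uses the Hugging Face transformers library with the generic bert-base-uncased checkpoint; any BERT-style checkpoint would work the same way, and this particular one is just a common default, not a protein-specific model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style checkpoint works here; bert-base-uncased is just a common default.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Deep learning meets biology", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding vector per input token.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```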

BERTology

Given that a protein, as a sequence of amino acids, can be viewed as a sequence of tokens, proteins can be modeled with BERT much like a natural language processing (NLP) task (Vig et al. 2). In the paper BERTology Meets Biology: Interpreting Attention in Protein Language Models, five transformer-based models trained on protein sequence datasets were analyzed to see how attention makes parts of the protein more understandable. The attention heads were found to capture the folding structure of proteins, essentially exposing relationships between amino acids that are spatially close in the tertiary, or three-dimensional, structure.

Aside from this, a separate experiment shows that attention also targets binding sites, shedding light on a core component of drug discovery.

It was also found that this behavior is consistent across the transformer-based models tested on two separate protein datasets (Vig et al. 2). Clearly, these black-box models have been very useful for interpreting not-so-obvious properties of proteins, such as their tertiary-structure contacts and binding sites.
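As a rough sketch of how such an analysis might be set up, the snippet below pulls per-layer attention weights from a protein language model. The Rostlab/prot_bert checkpoint, the toy sequence and the idea of comparing the weights against a contact map are illustrative assumptions, not the exact procedure of Vig et al.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A publicly available protein BERT checkpoint (assumed here for illustration).
name = "Rostlab/prot_bert"
tokenizer = AutoTokenizer.from_pretrained(name, do_lower_case=False)
model = AutoModel.from_pretrained(name, output_attentions=True)

sequence = "M K T A Y I A K Q R"   # amino acids are space-separated for this tokenizer
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# attentions: one tensor per layer, each of shape (batch, heads, tokens, tokens).
# High attention between residue pairs could then be compared against a contact
# map derived from the known 3D structure (not included in this toy example).
last_layer = outputs.attentions[-1]
print(last_layer.shape)
```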

AlphaFold2

Aside from models that expose certain characteristics and relationships of proteins for interpretability, there are models that generate macromolecules and/or predict the tertiary structure itself far faster than conventional protein-folding algorithms. One model, which has been called a major breakthrough in this space, is AlphaFold2. This deep learning model predicts a protein’s tertiary structure based solely on its amino acid sequence (Senior et al. 1). It lets computational biologists use these fairly accurate 3D structures for further research and drug discovery at lower cost. In fact, AlphaFold2 predicted the previously unknown structures of six understudied proteins of the SARS-CoV-2 virus, including the protein ORF8 (The AlphaFold Team). The model stood out by massively outperforming the other submissions to CASP14, a protein structure prediction competition, achieving a median score of 87 on the Global Distance Test for the most challenging targets. For context, the Global Distance Test (GDT) is the main metric of CASP; it can be thought of as the percentage of amino-acid residues whose predicted positions fall within a threshold distance of their correct positions, averaged over several thresholds.
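To make the metric concrete, here is a rough GDT_TS-style calculation on toy coordinates. The 1, 2, 4 and 8 Å thresholds follow the standard definition, but the coordinates are random and the optimal structural superposition that normally precedes the calculation is omitted.

```python
import numpy as np

def gdt_ts(predicted, reference, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-style score: for each distance threshold (in angstroms), take the
    fraction of residues whose predicted position lies within that threshold of
    the reference position, then average across thresholds (out of 100)."""
    distances = np.linalg.norm(predicted - reference, axis=1)
    fractions = [(distances <= t).mean() for t in thresholds]
    return 100 * np.mean(fractions)

rng = np.random.default_rng(0)
reference = rng.normal(size=(100, 3)) * 10      # toy C-alpha coordinates (angstroms)
predicted = reference + rng.normal(scale=1.5, size=reference.shape)
print(round(gdt_ts(predicted, reference), 1))   # a score out of 100
```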

In addition, AlphaFold2 matched some of the results from the highest-quality experimental methods, including X-ray crystallography and cryo-electron microscopy, both of which are known to be extremely laborious and expensive (The AlphaFold Team). As for how the model works, it makes use of a 3D equivariant transformer architecture to update the protein backbone (Senior et al.).

Deep Learning’s Limitations

Although Deep Learning applied to biology is undoubtedly promising, these problem-specific models also inherit the disadvantages of deep learning in general. As previously mentioned, such models are typically data-hungry, so they can be hampered by the small and imbalanced datasets commonly found in fields such as biology, which is detrimental to performance on key metrics such as accuracy (Ching et al.). Along with this, the black-box nature of neural networks makes it challenging to find the causes and explanations of particular outputs; there is much work to be done on the interpretability of the models themselves. Beyond that, tuning and managing the bias-variance tradeoff can often be problematic and time-consuming (Ching et al.). Such deep learning techniques are far from perfect. Even though the models described above are breakthroughs, they cannot replace laboratories at the moment. Instead, deep learning models will aid chemists and computational biologists in searching through solution spaces far more efficiently.

Nonetheless, deep learning has been very useful to the whole scientific community. The use of transformer-based implementations for drug discovery, protein language interpretability, tertiary structure prediction and the like should be embraced as our society becomes more and more interconnected. With confidence, I say that innovations that cross verticals and disciplines, binding artificial intelligence and biology, will be the impactful and creative trend that governs our future in the decades to come.

Bibliography

The AlphaFold Team. “AlphaFold: a solution to a 50-year-old grand challenge in biology.” 2020, https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology.

Ching, Travers, et al. “Opportunities and obstacles for deep learning in biology and medicine.” 2018, https://www.biorxiv.org/content/10.1101/142760v2.

Senior, Andrew W., et al. “Improved protein structure prediction using potentials from deep learning.” 2020, https://www.nature.com/articles/s41586-019-1923-7.

Vig, Jesse, et al. “BERTology Meets Biology: Interpreting Attention in Protein Language Models.” 2021, https://www.biorxiv.org/content/10.1101/2020.06.26.174417v1.
