New machine learning model draws a map of proteins with special properties
The biotech industry is constantly searching for the perfect mutation, where the properties of different proteins are synthetically combined to achieve a desired effect. The aim may be to develop new drugs, or enzymes that extend the shelf life of yogurt, break down plastics in nature, or make laundry detergent effective at low water temperatures.
New research from DTU Compute and the Department of Computer Science at the University of Copenhagen (DIKU) may in the long term help the industry speed up this process. In the journal Nature Communications, the researchers explain how a new way of using machine learning (ML) draws a map of proteins, which helps produce a shortlist of candidate proteins that deserve a closer look.
“In recent years, we have started using machine learning to build a picture of the allowed mutations in proteins. The problem, though, is that you get different pictures depending on which method you use, and even if you train the same model multiple times, it can give different answers about how the biology is related.
“In our work, we seek to make this process more robust, and we show that you can extract much more biological information than was previously possible. This is an important step forward in being able to explore the landscape of mutations in the search for proteins with particular properties,” says postdoc Nicki Skafte Detlefsen from the Cognitive Systems section at DTU Compute.
The protein map
A protein is a chain of amino acids, and a mutation occurs when just one of the amino acids in the chain is replaced by another. Since there are 20 natural amino acids, the number of possible combinations of mutations grows so rapidly with the length of the chain that it is impossible to study them all. There are more possible variants than there are atoms in the universe, even for simple proteins. It is not possible to test everything experimentally, so you have to be selective about which proteins you try to produce synthetically.
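The arithmetic behind this claim can be sketched in a few lines. The snippet below is only an illustration: it counts every possible chain of a given length over 20 amino acids and compares it with 10^80, a commonly quoted rough estimate of the number of atoms in the observable universe (that figure is an assumption here, not a number from the researchers).

```python
# Illustrative arithmetic only; ATOMS_IN_UNIVERSE is a rough,
# commonly quoted order-of-magnitude estimate (an assumption).
NUM_AMINO_ACIDS = 20
ATOMS_IN_UNIVERSE = 10**80


def num_sequences(length: int) -> int:
    """Number of distinct chains of the given length."""
    return NUM_AMINO_ACIDS ** length


def num_single_mutants(length: int) -> int:
    """Number of variants that differ from one chain at exactly one position."""
    return (NUM_AMINO_ACIDS - 1) * length


def shortest_length_exceeding(limit: int) -> int:
    """Smallest chain length whose sequence count is larger than `limit`."""
    length = 1
    while num_sequences(length) <= limit:
        length += 1
    return length


print(num_single_mutants(100))                       # single substitutions stay manageable
print(shortest_length_exceeding(ATOMS_IN_UNIVERSE))  # combinations do not
```

Single substitutions grow only linearly with chain length; it is the combinations of substitutions that explode, which is why a map that narrows down the candidates is useful.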
The researchers from DIKU and DTU Compute used their ML model to generate a picture of how proteins are related. By being shown many examples of protein sequences, the model learns to draw a map with a dot for each protein, so that closely related proteins are placed close to each other while distantly related proteins are placed far apart.
The ML model builds on the mathematics and geometry developed for drawing maps. Imagine you need to make a map of the globe. If you zoom in on Denmark, you can easily draw a map on a piece of paper that preserves the geography. But if you have to draw the whole Earth, errors arise because you have to stretch the globe, so the Arctic becomes one long country instead of a pole. On the map, the Earth is distorted. For this reason, cartographic research has developed a lot of mathematics that describes these distortions and compensates for them.
This is exactly the theory that the researchers from DIKU and DTU Compute have extended to cover their machine learning (deep learning) model for proteins. Because they can describe the distortion in the map, they can also compensate for it.
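The cartographic analogy can be made concrete with a small sketch. It is not the researchers' protein model. It only compares a distance read naively off a flat latitude/longitude map with the true distance over the globe, computed with the standard haversine formula. The function names and the Earth radius constant are choices made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used here for illustration


def flat_map_distance_km(lat1, lon1, lat2, lon2):
    """Distance read naively off a flat map where degrees form a regular grid."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)


def great_circle_distance_km(lat1, lon1, lat2, lon2):
    """True shortest distance over the sphere (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two pairs of points, each 90 degrees of longitude apart:
# one pair on the equator, one pair far north at 80 degrees latitude.
for lat in (0, 80):
    flat = flat_map_distance_km(lat, 0, lat, 90)
    true = great_circle_distance_km(lat, 0, lat, 90)
    print(f"latitude {lat}: flat map {flat:.0f} km, globe {true:.0f} km")
```

On the equator the two measurements agree, while near the pole the flat map overstates the distance several times over. Knowing how the map stretches is what makes it possible to correct for it.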
“It allows us to talk about what a sensible distance measure between closely related proteins is, and then we can suddenly measure it. That way we can draw a path through the protein map that tells us how we expect one protein to develop from another, i.e. mutate, because they are all related through evolution. In this way the ML model can measure a distance between proteins and plot optimal paths between promising proteins,” says Wouter Boomsma, Associate Professor in the Section for Machine Learning at DIKU.
The researchers tested the model on data from many naturally occurring proteins whose structure is known, and they can see that the distance between proteins begins to match the proteins' evolutionary development, so that proteins that are close to each other in evolutionary terms are placed close to each other on the map.
“We are now able to put two proteins on the map and draw the curve between them. Along the path between the two proteins lie possible proteins with closely related properties. This is not a guarantee, but it is an opportunity to form a hypothesis about which proteins the biotech industry should test when designing new proteins,” says Søren Hauberg, professor in the Cognitive Systems section of DTU Compute.
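One simple way to picture such a path is sketched below. It is a stand-in technique, not the method from the paper: the dots on a toy two-dimensional map are joined to their nearest neighbours, and Dijkstra's algorithm finds the shortest route that hops from dot to dot. The toy points, the choice of `k` and the function names are all assumptions made for illustration.

```python
import heapq
import math


def knn_graph(points, k):
    """Connect every point to its k nearest neighbours (edges are symmetric)."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in nearest[:k]:
            graph[i].append((j, d))
            graph[j].append((i, d))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm: returns (list of node indices, total length)."""
    best = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            break
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, weight in graph[node]:
            cand = dist + weight
            if cand < best.get(nxt, math.inf):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(queue, (cand, nxt))
    if goal not in best:
        raise ValueError("goal is not reachable from start")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], best[goal]


# Toy "map": 11 dots along a half circle (hypothetical data, not real proteins).
points = [(math.cos(i * math.pi / 10), math.sin(i * math.pi / 10)) for i in range(11)]
graph = knn_graph(points, k=2)
path, length = shortest_path(graph, start=0, goal=10)
print(path, round(length, 3))
print(round(math.dist(points[0], points[10]), 3))  # straight line through empty space
```

The straight line between the two end dots cuts through a region with no data, while the graph path stays close to the other dots and is therefore longer. The intermediate dots along the path play the role of the candidate proteins mentioned in the quote.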
The unique collaboration between DTU Compute and DIKU has been established through a new Center for Machine Learning in Life Sciences (MLLS), which began last year with support from the Novo Nordisk Foundation. At the center, artificial intelligence researchers from both universities work together to solve fundamental machine learning problems raised by important questions in the field of biology.
The developed protein maps are part of a large-scale project that extends from basic research to industrial applications, for example in collaboration with Novozymes and Novo Nordisk.
FACT BOX: Artificial Intelligence, Machine Learning and Deep Learning
When computer programs are able to do something “intelligent”, it is called artificial intelligence, or simply AI. Artificial intelligence is therefore an umbrella term that covers several methods.
One of the methods is machine learning, and the newest and most advanced use of machine learning is called deep learning.
Deep learning is based on neural networks, a type of mathematical model that can learn to find patterns in a given set of data without being directly programmed to do so. Because it relies on data, it is called a data-driven model.
In unsupervised learning, the goal is to train a neural network to uncover underlying patterns in data. This is usually done by trying to compress the data: compression discards the less frequent patterns, while the more important patterns are retained, so the underlying structure becomes visible.
Through many repetitions, the network learns which data patterns can be used to compress the data.
Once the model is trained, it is tested on unseen data, which can likewise be compressed into a compact representation that can be interpreted to form scientific hypotheses or serve as the basis for other machine learning models.
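The idea of learning by compression can be sketched on a toy scale. The example below is a linear stand-in (principal component analysis by power iteration), not a neural network and not the researchers' model. It learns to store each two-dimensional point as a single number, then checks how well unseen points survive the round trip. The synthetic data and all names are assumptions made for illustration.

```python
import math
import random


def fit_direction(points, iterations=100):
    """Learn the mean and the main direction of variation by repeated updates."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    dx, dy = 1.0, 0.0
    for _ in range(iterations):  # power iteration on the 2x2 covariance matrix
        dx, dy = cxx * dx + cxy * dy, cxy * dx + cyy * dy
        norm = math.hypot(dx, dy)
        dx, dy = dx / norm, dy / norm
    return (mx, my), (dx, dy)


def compress(point, mean, direction):
    """Encode a 2D point as one number: its position along the learned direction."""
    return (point[0] - mean[0]) * direction[0] + (point[1] - mean[1]) * direction[1]


def decompress(code, mean, direction):
    """Rebuild an approximate 2D point from its one-number code."""
    return (mean[0] + code * direction[0], mean[1] + code * direction[1])


# Synthetic data with a hidden pattern: y is roughly 2 * x, plus a little noise.
random.seed(0)
def sample():
    x = random.uniform(-1, 1)
    return (x, 2 * x + random.gauss(0, 0.05))

train_points = [sample() for _ in range(200)]
test_points = [sample() for _ in range(50)]  # unseen during training

mean, direction = fit_direction(train_points)
errors = [math.dist(p, decompress(compress(p, mean, direction), mean, direction))
          for p in test_points]
mean_error = sum(errors) / len(errors)
print(direction, mean_error)
```

The learned direction captures the dominant pattern in the data, and the small reconstruction error on the unseen points shows that one number per point is enough to keep it. The noise is what gets discarded.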
Technical University of Denmark
Detlefsen, NS, et al. (2022) Learning Meaningful Representations of Protein Sequences. Nature Communications. doi.org/10.1038/s41467-022-29443-w.