
Drug Design & Disease Transparency: Behind the Science of Protein Stability with AI


by Erin Tallman

January 15, 2026
Reading time: 11 mins

Jes Frellsen explains inverse folding models and zero-shot predictors of protein thermodynamic stability, and offers insights into medical applications and the problem of missing data.

Why are inverse folding models good zero-shot predictors of protein thermodynamic stability? Inverse folding models are trained to recover sequences from structures, yet they have emerged as highly effective zero-shot predictors of protein stability. How can we understand this connection?

These are the questions Jes Frellsen discussed in a talk at SophIA Summit 2025. Frellsen, Associate Professor in Machine Learning & Signal Processing at the Technical University of Denmark (DTU), argues that a set of theoretical assumptions connects the amino acid preferences of an inverse folding model to the free-energy considerations that govern thermodynamic stability. Drawing on concepts from probability theory and statistical physics, he explains how commonly used heuristics can be interpreted as simplistic approximations, and shows that more principled alternatives empirically yield considerable performance gains.

Frellsen’s research spans deep generative models, Bayesian inference, uncertainty quantification, missing data, and out-of-distribution (OOD) detection. These topics are all highly relevant to reliable AI applications in biology. Frellsen received a PhD from the University of Copenhagen and completed a postdoc in the Machine Learning Group at Cambridge, a background that gives him a strong foundation in bioinformatics. He collaborated with the pharmaceutical company Novo Nordisk on the scientific paper he presented at the summit, and has worked with Pierre-Alexandre Mattei, chair of the summit’s scientific committee and researcher at INRIA.

NOTE: Zero-shot learning (ZSL) is a machine learning scenario in which an AI model is trained to recognize and categorize objects or concepts without having seen any examples of those categories or concepts beforehand.

The Central Scientific Claim: Why Inverse Folding Models Predict Stability

At first glance, inverse folding models should not be particularly good at predicting protein stability. These models are trained to recover amino acid sequences from a given protein structure, not to measure thermodynamic quantities. Yet over the past few years, researchers have repeatedly observed that inverse folding models perform remarkably well as zero-shot predictors of stability. This means they can estimate whether a mutation stabilizes or destabilizes a protein without being explicitly trained on stability data.

This paradox is the starting point of Jes Frellsen’s work. Speaking at the SophIA Summit, he explained that the field had been relying on a heuristic: comparing the probability that an inverse folding model assigns to a mutant sequence versus the original, wild-type sequence.

“People saw in the literature that these models worked really well,” Frellsen said, “but we couldn’t really see why that should be related to a physical quantity like protein stability.”

Protein stability, after all, is governed by free energy — a balance between folded and unfolded states — while inverse folding models typically evaluate a single folded structure. The surprise was not just that the models worked, but that they worked consistently, with correlations around 0.6–0.7 against experimental measurements.

Frellsen’s contribution was to show that this apparent shortcut has a deeper explanation. By reframing inverse folding likelihoods through the lens of probability theory and statistical physics, his team demonstrated that commonly used likelihood ratios are rough approximations of underlying free-energy differences. More importantly, they showed how to correct these approximations using principled assumptions, leading to measurable performance gains.
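The likelihood-ratio heuristic described above can be illustrated with a minimal Python sketch. The per-position probabilities below are invented for illustration and stand in for the output of an inverse folding model; the names `site_probs` and `log_likelihood_ratio` are hypothetical and do not come from any particular library or from Frellsen’s paper.

```python
import math

# Hypothetical per-position amino-acid probabilities p(a | structure).
# These numbers are made up; a real inverse folding model would supply them.
site_probs = {
    10: {"A": 0.60, "G": 0.25, "V": 0.10, "P": 0.05},
    42: {"L": 0.70, "I": 0.20, "D": 0.10},
}

def log_likelihood_ratio(site_probs, position, wild_type, mutant):
    """Return log p(mutant | structure) - log p(wild type | structure) at one site."""
    probs = site_probs[position]
    return math.log(probs[mutant]) - math.log(probs[wild_type])

# Score two single-point mutations: A10P and L42I.
score_a10p = log_likelihood_ratio(site_probs, 10, "A", "P")
score_l42i = log_likelihood_ratio(site_probs, 42, "L", "I")

print(f"A10P log-likelihood ratio: {score_a10p:.3f}")
print(f"L42I log-likelihood ratio: {score_l42i:.3f}")
```

Under the heuristic, a more negative score means the model prefers the wild-type residue more strongly, which is read as a more destabilizing mutation. The score is a ranking signal only; it is not a free energy in physical units.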

“What we did,” Frellsen summarized, “was to link physics and machine learning — to understand what these models are actually predicting, not just observe that they work.”

Protein Stability, Mutations, and Why Clinicians Should Care

Protein thermodynamic stability is not an abstract concept confined to physics textbooks. It plays a direct role in disease mechanisms, particularly when genetic mutations destabilize key proteins and disrupt their function. From inherited metabolic disorders to cancer-associated variants, changes in protein folding can have profound biological consequences.

“In the medical setting,” Frellsen explained, “you might see a mutation in the genome that destabilizes one protein, and that interruption of function can help explain the disease.”

This is where AI-based stability prediction becomes relevant: it offers a way to prioritize which mutations are likely to matter biologically.

Thermodynamic stability is defined by the Gibbs free-energy difference (ΔG) between folded and unfolded states. A stable protein spends most of its time folded; a destabilized one does not. Traditionally, measuring these quantities requires labor-intensive experiments. Zero-shot predictors offer a complementary approach — one that can rapidly screen large numbers of variants.
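To make the link between ΔG and "time spent folded" concrete, here is a small Python sketch based on a textbook two-state model. It assumes the sign convention ΔG = G(folded) − G(unfolded), so negative values mean a stable fold; the function name `fraction_folded` and the example values are illustrative choices, not taken from the talk.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def fraction_folded(delta_g_fold, temperature_k=298.15):
    """Folded population in a two-state model.

    Assumed convention: delta_g_fold = G_folded - G_unfolded in kcal/mol,
    so negative values correspond to a stable protein.
    """
    rt = R_KCAL * temperature_k
    return 1.0 / (1.0 + math.exp(delta_g_fold / rt))

# Compare a stable protein, a marginal one, and a destabilized one.
for dg in (-5.0, -1.0, 0.0, 1.0):
    print(f"dG = {dg:+.1f} kcal/mol -> fraction folded = {fraction_folded(dg):.4f}")
```

Because the relationship is exponential, a shift of only a few kcal/mol can move a protein from almost fully folded to mostly unfolded, which is why even single mutations can matter.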

What Frellsen’s work clarifies is why inverse folding models capture this signal at all. These models implicitly learn amino-acid “preferences” conditioned on structure. Under reasonable assumptions — namely that experimentally observed protein structures approximate samples from a Boltzmann distribution — these preferences align with free-energy considerations.
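One way to sketch this argument in equations is shown below. It is an illustrative reading under stated assumptions, not necessarily the derivation in Frellsen’s paper. It assumes a two-state (folded F, unfolded U) picture, treats sequence and state as jointly distributed so that Bayes’ rule applies, and uses the convention ΔG = G_F − G_U. The symbols s (wild-type sequence), s' (mutant), x (folded structure), and p_θ (the inverse folding model) are notation introduced here.

```latex
% Sketch under stated assumptions; k_B T is the thermal energy.
\begin{align}
\Delta G(s) &= G_F(s) - G_U(s)
  = -k_B T \,\ln \frac{p(F \mid s)}{p(U \mid s)} \\
\Delta\Delta G &= \Delta G(s') - \Delta G(s)
  = -k_B T \left[ \ln \frac{p(s' \mid F)}{p(s \mid F)}
                - \ln \frac{p(s' \mid U)}{p(s \mid U)} \right] \\
\Delta\Delta G &\approx -k_B T \,\ln \frac{p_\theta(s' \mid x)}{p_\theta(s \mid x)}
  \quad \text{(unfolded-state term neglected, } p(s \mid F) \approx p_\theta(s \mid x)\text{)}
\end{align}
```

The second line follows from Bayes’ rule, with the prior ratio p(F)/p(U) cancelling in the difference. The third line is the familiar likelihood-ratio heuristic; on this reading it is a rough approximation because it drops the unfolded-state term.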

“This is a physical quantity,” Frellsen emphasized. “Protein stability is something you can measure and quantify. What we showed is how the model outputs can be interpreted in those terms.”

For clinicians and biomedical researchers, this matters because interpretability builds trust. Knowing that an AI model’s predictions map back to physical principles makes them more credible as tools for variant interpretation and disease mechanism research.


Missing Data, Uncertainty, and the Safe Use of AI in Medicine

Beyond protein stability, Frellsen is widely recognized for his work on missing data and uncertainty — two issues that loom large in medical AI. Unlike benchmark datasets, real-world clinical data are incomplete by design. Tests are ordered selectively, measurements are skipped intentionally, and the absence of data often carries meaning.

“In the medical domain, we’ve got missing data all the time,” Frellsen said. “And sometimes the fact that data is missing tells you something.”

Statisticians distinguish between data missing at random and data missing not at random; the latter is both the harder case and the more realistic one in healthcare. A chest X-ray not being ordered, for example, reflects a clinician’s judgment, not a technical failure. Ignoring this distinction can bias AI models in subtle but dangerous ways.
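A small simulation can show how this bias arises. The Python sketch below uses purely synthetic "lab values" and two hypothetical missingness mechanisms: one where a value is recorded independently of what it is, and one where high values are more likely to be recorded, mimicking a test ordered mainly when a problem is suspected. All names and numbers are illustrative assumptions, and this is not one of Frellsen’s models.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic lab values for 20,000 hypothetical patients (arbitrary units).
values = [random.gauss(100.0, 15.0) for _ in range(20_000)]

def observe_independent(values, p_observed=0.5):
    """Missingness unrelated to the value itself."""
    return [v for v in values if random.random() < p_observed]

def observe_value_dependent(values, threshold=100.0, p_high=0.8, p_low=0.2):
    """Missing not at random: high values are recorded more often."""
    return [
        v for v in values
        if random.random() < (p_high if v > threshold else p_low)
    ]

true_mean = statistics.fmean(values)
mean_independent = statistics.fmean(observe_independent(values))
mean_mnar = statistics.fmean(observe_value_dependent(values))

print(f"true mean:                     {true_mean:.2f}")
print(f"mean, value-independent gaps:  {mean_independent:.2f}")
print(f"mean, value-dependent gaps:    {mean_mnar:.2f}")
```

In the value-independent case, the mean of the recorded data stays close to the true mean. In the value-dependent case it is pulled upward, because the recorded patients are no longer representative of the whole population.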

Frellsen has developed generative approaches, such as models designed specifically for non-random missingness, to address this problem. But he is also cautious about complexity.

“Especially in medicine,” he noted, “doctors prefer interpretable models. They want to understand why a model made a recommendation.”

This creates a tension: the most powerful AI methods for handling missing data are often the least transparent. Frellsen sees this as an open research challenge, one that sits at the intersection of AI safety, explainability, and clinical usability.

“You should never be a thousand percent sure,” he added. “There has to be a margin of error.”

The mindset of acknowledging uncertainty rather than hiding it is central to the responsible deployment of AI in sensitive domains.

From Basic Research to Drug Design and Precision Medicine

Frellsen is careful to describe his work as basic research, not a ready-to-use clinical tool. Still, its implications for drug discovery and precision medicine are clear. Inverse folding models already sit at the heart of modern protein design pipelines, where they help identify sequences that fold into desired structures.

The next step, Frellsen believes, is extending these ideas to molecular interactions.

“Proteins don’t just exist on their own,” he said. “They bind to other proteins, to small molecules — that’s how drugs work.”

Applying the same physics-informed framework to binding affinity and complex stability could improve early-stage drug design.

This matters because generative AI offers something rare in biomedicine: scale without labels. Zero-shot methods can screen vast mutation spaces or design candidates without requiring massive experimental datasets. But power without understanding is risky.

“What we’re really doing,” Frellsen said, “is providing an explanation behind the prediction.” That explanatory layer — linking AI outputs to physical reality — is what makes these tools suitable for high-stakes biomedical research.

His recent research prize for work on incomplete data and safe AI underscores that point. Progress in medical AI will not come from accuracy alone, but from models that are transparent, grounded in theory, and honest about their limitations.

As Frellsen’s work shows, trust in AI is built not just by better predictions, but by understanding what they mean.