Proteína: Scaling Flow-based Protein Structure Generative Models

1 NVIDIA    2 Mila - Québec AI Institute    3 Université de Montréal    4 Massachusetts Institute of Technology
* Core contributor.
Work done during internship at NVIDIA.
International Conference on Learning Representations (ICLR) 2025
(Oral Presentation)


Abstract. Recently, diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteína, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteína achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high-level secondary-structure guidance as well as low-level fold-specific generation.




Visualizations of Generated Proteins

Long Chain Generation. Proteína can generate diverse and designable protein backbones from 50 all the way up to 800 residues. Below, we show unconditional model samples, all of them designable. For quantitative evaluations, see the Performance section.

50 Residues

300 Residues

400 Residues

500 Residues

600 Residues

700 Residues

800 Residues

800 Residues


Fold Class-Conditional Generation. Proteína also offers fold class conditioning, which allows us to generate protein structures of different fold types. Fold classes are specified in terms of their CATH protein structure classifications, and we can guide generation with respect to either high-level secondary structure content or low-level specific fold classes; see the (designable) examples below.

Residues: 150
Fold Class: Beta Barrel

Residues: 175
Fold Class: Rossmann Fold

Residues: 225
Fold Class: 3-Layer (aba) Sandwich

Residues: 250
Fold Class: Immunoglobulin-like

Residues: 300
Fold Class: 5 Propeller

Residues: 400
Fold Class: Mainly Beta

Residues: 500
Fold Class: TIM Barrel

Residues: 700
Fold Class: Mixed Alpha Beta


Motif-Scaffolding. Proteína can also perform motif-scaffolding, where a functionally relevant motif is given and the model is tasked with generating a viable supporting scaffold structure. Below, we show successful motif-scaffolding examples for tasks from the RFDiffusion benchmark (task ID in captions). The given motif residues are visualized in red. See quantitative evaluations below.

1BCF

3IXT

4ZYP

5TRV-long

5WN9

5YUI

6EXZ-med

7MRX-128




Model Overview

Proteína is a novel flow-based protein backbone generative model. It is trained with flow matching (see Figure 1 and Figure 2), leverages a scalable and efficient transformer architecture, and offers hierarchical fold class conditioning for enhanced controllability, utilizing a tailored classifier-free guidance scheme.

Figure 1. Proteína uses flow matching and learns a flow to transform a Gaussian distribution over initial protein backbone coordinates (residues' alpha carbon atoms) into realistic protein structures. We rely on a scalable transformer-based architecture and can condition the model on hierarchical fold class labels for improved controllability and complex protein structure design tasks.
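
To make the idea of learning a flow from Gaussian noise to backbone coordinates more concrete, here is a minimal sketch of one commonly used flow matching setup: a straight-line path between a noise sample and a data sample, with the path's constant velocity as the regression target. This page does not spell out the exact objective, so the linear path, the unweighted squared-error loss, and all function names below are illustrative assumptions rather than Proteína's implementation.

```python
import random

def sample_noise(num_residues, rng):
    """Draw x0 ~ N(0, I): one 3D Gaussian point per residue (alpha carbon)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(num_residues)]

def interpolate(x0, x1, t):
    """Assumed linear path x_t = (1 - t) * x0 + t * x1 between noise and data."""
    return [[(1.0 - t) * a + t * b for a, b in zip(p0, p1)]
            for p0, p1 in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the linear path, d x_t / dt = x1 - x0 (the regression target)."""
    return [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(x0, x1)]

def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    squared = [(p - q) ** 2
               for pv, tv in zip(predicted_v, target)
               for p, q in zip(pv, tv)]
    return sum(squared) / len(squared)

rng = random.Random(0)
# Toy 3-residue "backbone" (hypothetical coordinates, not real data).
x1 = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]]
x0 = sample_noise(len(x1), rng)
t = rng.random()
x_t = interpolate(x0, x1, t)  # the noisy input a network would see at time t
# A perfect predictor outputs exactly the target, so its loss is zero.
loss = flow_matching_loss(target_velocity(x0, x1), x0, x1)
```

In an actual training loop, `predicted_v` would come from the network evaluated at `x_t` and `t` (plus any fold class labels), and the loss would be averaged over many structures and time samples.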

Proteína is trained on datasets comprising up to 21 million backbone structures. Furthermore, we introduce new metrics to better analyze the learnt protein structure generative models, we explore LoRA-based fine-tuning of Proteína on small high-quality datasets, and we demonstrate how autoguidance can boost designability.
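
Classifier-free guidance and autoguidance both extrapolate a conditional prediction away from a reference prediction. The sketch below shows that generic form for velocity fields. It assumes the textbook combination rule; Proteína's tailored guidance scheme may combine the terms differently, and the function name and toy values are hypothetical.

```python
def guided_velocity(v_cond, v_ref, weight):
    """Return v_ref + weight * (v_cond - v_ref), applied per coordinate.

    Classifier-free guidance: v_ref is the model's unconditional prediction.
    Autoguidance: v_ref is the prediction of a weaker version of the model.
    weight = 1 recovers v_cond; weight > 1 pushes further away from v_ref.
    """
    return [[r + weight * (c - r) for c, r in zip(pc, pr)]
            for pc, pr in zip(v_cond, v_ref)]

# Toy velocities for a single residue (hypothetical values).
v_cond = [[1.0, 2.0, 3.0]]    # prediction given a fold class label
v_uncond = [[0.0, 0.0, 0.0]]  # prediction with the label dropped
v_guided = guided_velocity(v_cond, v_uncond, weight=2.0)
```

The guided velocity then replaces the plain model output at each sampling step.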

Figure 2. Proteína starts from Gaussian random noise and iteratively denoises it into novel protein structures, solving a stochastic differential equation that uses the learnt flow vector field and models the generation process.
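
The iterative generation process can be sketched as a fixed-step numerical integration from t = 0 to t = 1. Note the simplifications: the caption describes a stochastic differential equation, whereas this sketch drops the noise term and uses a plain deterministic Euler step, and `toy_velocity` is a hypothetical stand-in for the learnt network that simply transports every point to a fixed target.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for the learnt field: the exact velocity that moves x to `target`."""
    return [[(b - a) / (1.0 - t) for a, b in zip(p, q)]
            for p, q in zip(x, target)]

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = [list(p) for p in x0]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity_fn(x, t)
        x = [[a + dt * dv for a, dv in zip(p, pv)] for p, pv in zip(x, v)]
    return x

rng = random.Random(0)
# Toy 3-residue target (hypothetical coordinates, not real data).
target = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]]
x0 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in target]  # Gaussian start
sample = euler_sample(x0, lambda x, t: toy_velocity(x, t, target), num_steps=100)
```

With a trained model, `velocity_fn` would be the network (optionally with guidance applied), and a stochastic sampler would add a noise term at every step.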




Contributions

See our paper for details.

Figure 3. Dataset statistics: (left) Dataset size comparisons of the Protein Data Bank (PDB) and our training datasets DFS and D21M. (right) Sunburst plot of the hierarchical fold class labels in our largest dataset D21M, depicting the hierarchical label structure and the relative sizes of the three hierarchical fold class categories C, A, and T (we do not use the Homologous superfamily level of the original CATH hierarchy).




Performance

We extensively evaluated Proteína and compared it to a variety of baseline methods. We found that our models can generate highly diverse, novel and designable protein backbones, while remaining fast to sample, overall achieving state-of-the-art unconditional and fold class-conditional protein backbone generation performance. Please see our paper for all benchmarks, evaluations and analyses. Here, we would like to highlight two quantitative results that we consider particularly interesting.

Long Chain Generation. In Figure 4, we present Proteína's performance when scaling generation to long chains, producing backbones of up to 800 residues. Proteína can successfully synthesize diverse and designable proteins at such chain lengths, while other models fail at this scale. This is thanks to Proteína's efficient and scalable transformer network architecture, which can be successfully trained at these length scales without experiencing prohibitive memory overhead. Unconditional and fold class-conditional samples across the entire range of chain lengths are visualized at the top of the webpage. The broad variety of structures indicates that the model has learnt a highly diverse distribution over protein backbones.

Figure 4. Proteína can generate diverse and designable protein backbones of up to 800 residues, significantly outperforming previous methods, which cannot reliably produce backbones at such chain lengths.

Motif-Scaffolding. We also evaluated Proteína on motif-scaffolding, a practically important conditional protein generation task where a functionally relevant motif is given and the model is tasked with completing the structure with a scaffold that supports the motif. See the animation in Figure 5.

Figure 5. Motif-scaffolding animation: Starting from Gaussian random noise, Proteína can generate novel scaffolds supporting a given conditioning motif. The motif is shown in red.

We quantified Proteína's motif-scaffolding performance on the benchmark introduced by RFDiffusion (see Figure 6). We observe that Proteína significantly outperforms previous methods and generates more diverse scaffolds. This means that Proteína not only achieves state-of-the-art performance in unconditional and fold class-conditional protein backbone generation but also performs well on structure completion tasks with structure conditioning.

Figure 6. We evaluate Proteína on the motif-scaffolding benchmark introduced by RFDiffusion. (top) Performance in terms of number of unique successes out of 1000 sampled structures for all tasks for Proteína and baselines. (bottom) Total unique successes and number of tasks where Proteína and the baselines are ahead. The task success definition and clustering follows Genie2.

Future Applications. We hope that Proteína can serve as a versatile protein structure foundation model due to its broad diversity and high designability. Its advanced controllability through fold class conditioning and its scalability to long backbones have the potential to unlock new protein design tasks. Moreover, Proteína's strong performance on motif-scaffolding suggests that it could also be promising for binder generation, a related and important conditional generation task.




Paper

Proteína:
Scaling Flow-based Protein Structure Generative Models

Tomas Geffner*, Kieran Didi*, Zuobai Zhang*, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, Karsten Kreis*

* Core contributor.

International Conference on Learning Representations (ICLR), 2025 (Oral Presentation)




Citation
@inproceedings{geffner2025proteina,
    title={Proteina: Scaling Flow-based Protein Structure Generative Models},
    author={Geffner, Tomas and Didi, Kieran and Zhang, Zuobai and Reidenbach, Danny and Cao, Zhonglin and Yim, Jason and Geiger, Mario and Dallago, Christian and Kucukbenli, Emine and Vahdat, Arash and Kreis, Karsten},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2025}
}