Proteína: Scaling Flow-based Protein Structure Generative Models

Abstract. Recently, diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteína, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteína achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high-level secondary-structure guidance as well as low-level fold-specific generation.

Visualizations of Generated Proteins

Long Chain Generation. Proteína can generate diverse and designable protein backbones from 50 up to 800 residues. Below, we show unconditional model samples, all of them designable. For quantitative evaluations, see the Performance section.

50 Residues

300 Residues

400 Residues

500 Residues

600 Residues

700 Residues

800 Residues

Fold Class-Conditional Generation. Proteína also offers fold class conditioning, which allows us to generate protein structures of different fold types. Fold classes are specified in terms of their C.A.T.H protein structure classifications, and we can guide generation with respect to either high-level secondary structure content or low-level specific fold classes; see the (designable) examples below.

Residues: 150
Fold Class: Beta Barrel

Residues: 175
Fold Class: Rossmann Fold

Residues: 225
Fold Class: 3-Layer (aba) Sandwich

Residues: 250
Fold Class: Immunoglobulin-like

Residues: 300
Fold Class: 5 Propeller

Residues: 400
Fold Class: Mainly Beta

Residues: 500
Fold Class: TIM Barrel

Residues: 700
Fold Class: Mixed Alpha Beta

Motif-Scaffolding. Proteína can also perform motif-scaffolding, where a functionally relevant motif is given and the model is tasked with generating a viable supporting scaffold structure. Below, we show successful motif-scaffolding examples for tasks from the RFDiffusion benchmark (task ID in captions). The given motif residues are visualized in red. See quantitative evaluations below.

1BCF

3IXT

4ZYP

5TRV-long

5WN9

5YUI

6EXZ-med

7MRX-128

Model Overview

Proteína is a novel flow-based protein backbone generative model. It is trained with flow matching (see Figure 1 and Figure 2), leverages a scalable and efficient transformer architecture, and offers hierarchical fold class conditioning for enhanced controllability, utilizing a tailored classifier-free guidance scheme.

Figure 1. Proteína uses flow-matching and learns a flow to transform a Gaussian distribution over initial protein backbone coordinates (residues' alpha carbon atoms) into realistic protein structures. We rely on a scalable transformer-based architecture and can condition the model on hierarchical fold class labels for improved controllability and complex protein structure design tasks.
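The flow-matching setup described in Figure 1 can be sketched in a few lines. This is a minimal NumPy illustration, assuming a linear path between Gaussian noise (t = 0) and data (t = 1) with a mean-squared-error regression target. The function names, the time convention, and the uniform sampling of t are assumptions made for illustration and are not taken from Proteína's implementation.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Linear path from noise x0 (t=0) to data x1 (t=1); assumed convention."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x1, rng):
    """One-sample flow-matching loss for a backbone x1 of shape (L, 3).

    velocity_fn(x_t, t) is a hypothetical stand-in for the transformer.
    """
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise over alpha-carbon coordinates
    t = rng.uniform(0.0, 1.0)            # assumed: uniform time sampling
    x_t = interpolate(x0, x1, t)
    target = x1 - x0                     # velocity of the linear path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
backbone = rng.standard_normal((50, 3))  # toy stand-in for a 50-residue structure
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), backbone, rng)
```

In practice the loss would be averaged over a batch and minimized with respect to the network parameters; the zero-velocity lambda above only exercises the code path.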

Proteína is trained on datasets comprising up to 21 million backbone structures. Furthermore, we introduce new metrics to better analyze the learnt protein structure generative models, we explore LoRA-based fine-tuning of Proteína on small high-quality datasets, and we demonstrate how autoguidance can boost designability.

Figure 2. Proteína starts from Gaussian random noise and iteratively denoises it into novel protein structures, solving a stochastic differential equation that uses the learnt flow vector field and models the generation process.
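The iterative generation process in Figure 2 can be illustrated with a simplified sampler. Note the swap: the caption describes a stochastic differential equation, whereas the sketch below integrates the deterministic probability-flow ODE with plain Euler steps, which is the simplest variant to show. The name `sample_euler`, the step count, and the toy oracle velocity field are assumptions for illustration only.

```python
import numpy as np

def sample_euler(velocity_fn, num_residues, num_steps, rng):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = rng.standard_normal((num_residues, 3))  # Gaussian initial coordinates
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                               # t stays strictly below 1
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: if all data mass sits on one structure, the ideal velocity
# field is (target - x) / (1 - t), and Euler lands on the target.
target = np.linspace(-1.0, 1.0, 30).reshape(10, 3)
oracle = lambda x, t: (target - x) / (1.0 - t)
sample = sample_euler(oracle, num_residues=10, num_steps=20,
                      rng=np.random.default_rng(0))
```

A trained network would replace `oracle`; a stochastic sampler would additionally inject noise at each step and require a score estimate, which is omitted here.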

Contributions

New Model with Scalable and Efficient Architecture. Proteína is a flow-based generative protein structure foundation model using a new scalable non-equivariant transformer architecture, which we scale to more than 400M parameters. Our architecture does not critically rely on compute- and memory-expensive components such as triangle attention, thereby remaining efficient and scalable.

Hierarchical Fold Class Conditioning. We incorporate hierarchical fold class conditioning into Proteína and develop tailored training algorithms and guidance schemes, leading to unprecedented semantic controllability over protein structure generation. Our methodology offers both low-level fold-specific synthesis and high-level guidance controlling secondary structure content (see Figure 3, right). For instance, we can use our conditioning to directly control the alpha helix and beta sheet content of generated proteins. Previous models often overrepresent alpha helices in their samples; with this conditioning, Proteína can directly counteract that bias.
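As a rough illustration of how label conditioning and guidance interact, the sketch below shows generic classifier-free guidance, not Proteína's tailored scheme: labels are randomly dropped during training so a single network learns both conditional and unconditional predictions, and at sampling time the two predictions are combined with a guidance weight. The function names, the weight `w`, and the drop probability are hypothetical.

```python
import numpy as np

def maybe_drop_label(label, p_drop, rng):
    """Training-time label dropout: return None with probability p_drop."""
    return None if rng.uniform() < p_drop else label

def cfg_velocity(v_cond, v_uncond, w):
    """Combine predictions: w=0 unconditional, w=1 conditional, w>1 extrapolates."""
    return v_uncond + w * (v_cond - v_uncond)

rng = np.random.default_rng(0)
v_cond = np.ones((4, 3))        # toy conditional velocity prediction
v_uncond = np.zeros((4, 3))     # toy unconditional velocity prediction
guided = cfg_velocity(v_cond, v_uncond, w=2.0)
```

With a hierarchy of labels, dropout could plausibly be applied per level so that the model also learns to condition on coarse labels alone; that detail is an assumption and is not shown.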

New Evaluation Metrics. We introduce several new protein structure generation metrics to complement existing metrics and to better analyze and compare existing models, providing new insights. In particular, we take inspiration from the image synthesis literature and introduce the Fréchet Protein Structure Distance (FPSD), the Fold Jensen Shannon Divergence (fJSD) and the Fold Score (fS), which measure different distributional aspects of the learnt generative models. The FPSD compares distributions of generated and reference structures in the feature space of a fold class classifier. The fJSD directly compares the categorical fold class distributions of generated and reference structures. The fS is inspired by the inception score and measures generated samples' fold class diversity and quality.
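The three metrics can be sketched with their generic textbook forms: a Fréchet distance between Gaussians fitted to classifier features, a Jensen-Shannon divergence between categorical distributions, and an inception-score-style quantity computed from per-sample class probabilities. The paper's exact definitions (for example, which classifier layer supplies the features, or how hierarchy levels are aggregated) may differ, and all function names here are assumptions.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def _kl(p, q):
    """KL(p || q) in nats, skipping entries where p is zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two categorical distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def fold_score(probs):
    """exp of the mean KL between per-sample class probabilities and their marginal."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)
    return float(np.exp(np.mean([_kl(row, marginal) for row in probs])))

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 4))                 # toy classifier features
shifted = feats + np.array([2.0, 0.0, 0.0, 0.0])      # same spread, shifted mean
```

For the Fréchet distance and the divergence, lower values mean the generated set is closer to the reference set; for the score, higher values indicate confident per-sample predictions spread across many classes.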

Exploring Training Data Scaling. We scale training data to almost 21M high-quality synthetic protein structures, and show successful training of models with very high designability on such large data. See dataset statistics in Figure 3 below.

State-Of-The-Art Performance. We achieve state-of-the-art designable and diverse protein backbone generation performance for unconditional and fold class-conditional generation as well as motif-scaffolding. Thanks to our efficient transformer architecture, we scale to an unprecedented length of 800 residues while still producing diverse and designable proteins, vastly outperforming previous works, and our models remain fast to sample.

LoRA Fine-Tuning and Autoguidance. For the first time, we demonstrate LoRA-based fine-tuning and autoguidance for flow-based protein structure generative models. We fine-tune Proteína using LoRA on a small but high-quality subset of natural PDB structures. Separately, we also showcase how autoguidance can boost designability.
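Both ideas can be illustrated generically. The sketch below shows a LoRA-adapted linear layer, where a frozen weight matrix is augmented with a trainable low-rank update, and an autoguidance-style combination, where the main model's prediction is extrapolated away from that of a weaker model. The class and function names, the rank, the scaling `alpha / rank`, and the initialization are common conventions assumed for illustration; how Proteína applies them to its layers is not specified here.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        self.weight = weight                              # frozen, shape (d_out, d_in)
        d_out, d_in = weight.shape
        self.A = rng.standard_normal((rank, d_in)) * 0.01 # trainable
        self.B = np.zeros((d_out, rank))                  # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        return self.A.size + self.B.size

def autoguide(v_main, v_weak, w):
    """Extrapolate from a weaker model's prediction toward the main model's (w > 1)."""
    return v_weak + w * (v_main - v_weak)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
layer = LoRALinear(W, rank=4, alpha=8.0, rng=rng)
x = rng.standard_normal((10, 64))
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only 512 of the 4,096 + 512 parameters in this toy layer would be updated during fine-tuning.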

See our paper for details.

Figure 3. Dataset statistics: (left) Dataset size comparisons of the protein data bank (PDB) and our training datasets D_FS and D_21M. (right) Sunburst plot of the hierarchical fold class labels in our largest dataset D_21M, depicting the hierarchical label structure and the relative sizes of the three hierarchical fold class categories C, A, and T (we do not use the Homologous superfamily level of the original C.A.T.H hierarchy).
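To make the three-level label structure concrete, the following sketch splits a dotted C.A.T.H code into the C, A, and T labels and discards the H level, mirroring the caption. The helper name `cath_levels` and the dictionary output format are assumptions for illustration; partial codes are accepted so that coarse labels can be represented on their own.

```python
def cath_levels(code):
    """Split a dotted C.A.T.H code into cumulative C, A, T labels (H is dropped)."""
    parts = code.split(".")[:3]          # keep at most Class, Architecture, Topology
    names = ("C", "A", "T")
    return {name: ".".join(parts[: i + 1])
            for i, name in enumerate(names[: len(parts)])}

# In CATH, 3.40.50 is the Rossmann fold topology within the
# 3-Layer (aba) Sandwich architecture (3.40) of the Alpha Beta class (3).
full = cath_levels("3.40.50.720")
coarse = cath_levels("2")                # class-level label only (Mainly Beta)
```

Cumulative labels such as "3.40" rather than "40" keep each level unambiguous, since architecture numbers are only meaningful within their class.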

Performance

We extensively evaluated Proteína and compared it to a variety of baseline methods. We found that our models can generate highly diverse, novel and designable protein backbones, while remaining fast to sample, overall achieving state-of-the-art unconditional and fold class-conditional protein backbone generation performance. Please see our paper for all benchmarks, evaluations and analyses. Here, we would like to highlight two quantitative results that we consider particularly interesting.

Long Chain Generation. In Figure 4, we present Proteína's performance when scaling generation to long chains, producing backbones of up to 800 residues. Proteína can successfully synthesize diverse and designable proteins at such chain lengths, while other models fail at this scale. This is thanks to Proteína's efficient and scalable transformer network architecture, which can be successfully trained at these length scales without experiencing prohibitive memory overhead. Unconditional and fold class-conditional samples across the entire range of chain lengths are visualized at the top of the webpage. The broad variety of structures indicates that the model has learnt a highly diverse distribution over protein backbones.

Figure 4. Proteína can generate diverse and designable protein backbones of up to 800 residues, significantly outperforming previous methods, which cannot reliably produce backbones at such chain lengths.

Motif-Scaffolding. We also evaluated Proteína on motif-scaffolding, a practically important conditional protein generation task where a functionally relevant motif is given and the model is tasked with completing the structure with a scaffold that supports the motif. See the animation in Figure 5.

Figure 5. Motif-scaffolding animation: Starting from Gaussian random noise, Proteína can generate novel scaffolds supporting a given conditioning motif. The motif is shown in red.

We quantified Proteína's motif-scaffolding performance on the benchmark introduced by RFDiffusion (see Figure 6). We observe that Proteína significantly outperforms previous methods and generates more diverse scaffolds. This means that Proteína not only achieves state-of-the-art performance in unconditional and fold class-conditional protein backbone generation but also performs well on structure completion tasks with structure conditioning.

Figure 6. We evaluate Proteína on the motif-scaffolding benchmark introduced by RFDiffusion. (top) Performance in terms of number of unique successes out of 1000 sampled structures for all tasks for Proteína and baselines. (bottom) Total unique successes and number of tasks where Proteína and the baselines are ahead. The task success definition and clustering follow Genie2.
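The "unique successes" count in Figure 6 can be read as the number of distinct structural clusters among the successful samples for a task. The sketch below assumes that per-sample success flags and cluster assignments have already been computed by an external pipeline; the function name and input format are hypothetical, and the actual success criteria and clustering procedure are those of Genie2, which are not reproduced here.

```python
def unique_successes(is_success, cluster_ids):
    """Count distinct clusters that contain at least one successful sample."""
    return len({c for ok, c in zip(is_success, cluster_ids) if ok})

# Toy example: four samples, three of them successful, two sharing a cluster.
flags = [True, True, False, True]
clusters = [0, 0, 1, 2]
count = unique_successes(flags, clusters)
```

Counting clusters rather than raw successes rewards a method for producing structurally different scaffolds instead of many near-identical ones.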

Future Applications. We hope that Proteína can serve as a versatile protein structure foundation model due to its broad diversity and high designability. Its advanced controllability through fold class conditioning and its scalability to long backbones have the potential to unlock new protein design tasks. Moreover, Proteína's strong performance on motif-scaffolding suggests that it could also be promising for binder generation, a related and important conditional generation task.
