La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching

Abstract. Recently, many generative models for de novo protein structure design have emerged. Yet only a few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.

Contact: tgeffner@nvidia.com, kdidi@nvidia.com, kkreis@nvidia.com, avahdat@nvidia.com

Visualizations of Generated Protein Structures

Unconditional Generation. La-Proteina can generate diverse and co-designable fully atomistic protein structures of up to 800 residues. Below, we show unconditional model samples, all of them co-designable. For quantitative evaluations, see the Performance and Benchmarking section.

100 Residues

200 Residues

300 Residues

400 Residues

500 Residues

600 Residues

700 Residues

800 Residues

All-Atom Motif Scaffolding. La-Proteina can also perform all-atom motif scaffolding, where a functionally relevant motif is given and the model is tasked with generating a viable fully atomistic supporting scaffold structure. In this setting, all atoms of the motif residues are given to La-Proteina as conditioning. Below, we show successful motif scaffolding examples for two tasks from the Protpardelle benchmark. For each task, we show two different solutions generated by our model (task IDs in captions; generated and conditioning structures overlaid; motif residues and side chains visualized in red). See quantitative evaluations below.

Task 5IUS

Task 6E6R

Tip-Atom Motif Scaffolding. Furthermore, La-Proteina can also be used for tip-atom motif scaffolding, where the functionally relevant group of motif atoms only consists of the tip atoms of important side chains, a critical task for instance in enzyme design. In Figure 1 and Figure 2, we show two successful tip-atom motif scaffolding examples, with details in the captions. Side chains that involve conditioning atoms are visualized as thick sticks; all other side chains are shown as thin sticks.

Figure 1. Task 1QJG (Delta(5)-3-Ketosteroid isomerase). The active site consists of an ASP that acts as a general base, a TYR that stabilises the oxyanion in the transition state and another ASP that also stabilises the transition state by forming a hydrogen bond with the oxyanion. La-Proteina successfully generates a valid atomistic scaffold and accurately reproduces the red conditioning atoms that form the tips of partially given side chains (see zoom-ins (a)-(c)).

Figure 2. Task 5YUI (carbonic anhydrase). The active site here combines a metal coordination site (HIS residues) with a hydrophobic substrate channel (VAL and TRP residues). La-Proteina successfully generates a valid atomistic scaffold and accurately reproduces the red conditioning atoms that form the tips of partially given side chains (see zoom-ins (b)-(d)). A small inconsistency can be observed in (a), where the model generates an incorrectly rotated ring (we found such inconsistencies to be extremely rare).

La-Proteina is able to generate diverse valid atomistic scaffolds for a given motif also in the tip-atom motif scaffolding setting, as visualized in Figure 3, below. For clarity, here we are only showing side chains of residues that involve conditioning atoms; all other side chains are generated, too, but not shown.

Figure 3. Task 5AOU (retro-aldolase). La-Proteina successfully generates diverse valid atomistic scaffolds and accurately reproduces the red conditioning atoms that form the tips of partially given side chains (see zoom-ins (a)-(d)). The atomistic motif, shown in (e), consists of a catalytic tetrad that emerged during directed evolution in the laboratory: the LYS acts as the catalytic nucleophile, the two TYR stabilize the transition state and participate in proton transfer, and the ASN maintains the hydrogen bond network that connects and spatially arranges all tetrad residues. Note that each protein structure is visualized from a different angle for the best view of the active site.

La-Proteina Overview

La-Proteina (Latent Proteina) is a novel method for fully atomistic protein design based on partially latent flow matching, combining the strengths of explicit and latent modeling. La-Proteina models the protein's alpha-carbon coordinates explicitly, while capturing sequence and coordinates of all remaining non-alpha-carbon residue atoms within a continuous, fixed-size latent representation associated with each residue. We first train a Variational Autoencoder, encoding sequence and side chain details in latent space, followed by a flow matching model that jointly generates alpha-carbon coordinates and latent variables. New proteins are generated by sampling the flow model and decoding the alpha-carbons and latent variables into sequences and fully atomistic structures (see Figure 4).

Figure 4. La-Proteina consists of an encoder, a decoder, and a joint denoiser. The encoder featurizes the input protein and predicts per-residue latent variables of constant dimensionality. Together with the underlying alpha-carbon backbone, the decoder outputs the amino acid sequence as well as all other atoms and reconstructs the atomistic protein. To facilitate generation of de novo proteins, a partially latent flow model jointly generates novel alpha-carbon backbone structures and latents. The model is trained in two stages and all networks are implemented leveraging the same transformer architecture; see details in our paper.
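
To make the second training stage more concrete, the following is a minimal sketch of how a flow matching training example over the joint state (alpha-carbon coordinates plus per-residue latents) could be built. It assumes a straight-line interpolation path between Gaussian noise and data and independent interpolation times for the two modalities; both are common flow matching choices rather than details confirmed here. All function names and the latent dimension of 8 are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1, per residue and channel."""
    return [[(1.0 - t) * a + t * b for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]

def velocity_target(x0, x1):
    """Regression target for the straight path: d/dt x_t = x1 - x0."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]

def gaussian_noise(n_res, n_channels, rng):
    """Standard normal noise with shape [n_res][n_channels]."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(n_res)]

def make_training_example(ca_coords, latents, rng):
    """Build noisy inputs and velocity targets for both modalities.

    ca_coords: [L][3] alpha-carbon coordinates (modeled explicitly).
    latents:   [L][d] per-residue latents from the pretrained VAE encoder.
    """
    n_res = len(ca_coords)
    latent_dim = len(latents[0])
    # Assumption: each modality gets its own interpolation time.
    t_ca, t_z = rng.random(), rng.random()
    noise_ca = gaussian_noise(n_res, 3, rng)
    noise_z = gaussian_noise(n_res, latent_dim, rng)
    return {
        "t_ca": t_ca,
        "t_z": t_z,
        "ca_t": interpolate(noise_ca, ca_coords, t_ca),
        "z_t": interpolate(noise_z, latents, t_z),
        "v_ca": velocity_target(noise_ca, ca_coords),
        "v_z": velocity_target(noise_z, latents),
    }

# Toy two-residue input with a hypothetical latent dimension of 8.
rng = random.Random(0)
ca = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]]
z = [[0.1] * 8, [-0.2] * 8]
example = make_training_example(ca, z, rng)
```

In such a setup, a denoiser network would take `ca_t`, `z_t`, and the two times as input and be trained to regress `v_ca` and `v_z`. The actual architecture and loss are described in the paper.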

Motivation. While prior works have been able to successfully tackle high-quality protein backbone design, fully atomistic structure generation comes with additional challenges. The model needs to jointly reason over large-scale backbone structure, amino acid types, and side-chains, whose dimensionality depends on the amino acid—this represents a complex continuous-categorical generative modeling problem. How can we best build on top of successful backbone generation frameworks, while addressing the additional fully atomistic modeling challenges? Our partially latent approach has several key advantages:

By encoding atomistic details, including varying-length side chains, together with their categorical residue type into a fixed-length, fully-continuous per-residue latent space, we avoid mixed continuous-categorical modeling challenges in the model's main generative component. Together with the continuous backbone coordinates, the per-residue latent variables can be generated using efficient, fully-continuous flow matching methods, while mixed modality modeling complexities are handled by encoder and decoder.
It is critical to maintain the explicit alpha-carbon-based backbone representation in La-Proteina's hybrid, partially latent framework. That way, we can build on top of advances in high-performance backbone modeling. Our ablations show that also encoding alpha-carbons in latent space leads to worse results.
Maintaining explicit backbone modeling capabilities also allows us to use different generation schedules for global alpha-carbon backbone structure and per-residue atomistic (latent) details, which we found to be important for producing high-quality outputs.
Our partially latent framework also increases scalability. Explicit modeling of all atoms in large proteins can require complex and memory-consuming neural networks. In contrast, La-Proteina's per-residue latent variables simply become additional channels on top of the alpha-carbon coordinates, thereby enabling the application of established, high-performance backbone-processing architectures without increasing the length of internal sequence representations. Hence, we can keep the model's memory footprint manageable and scale the model to large protein generation tasks of up to 800 residues.
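
Two of the points above, latents as extra per-residue channels and separate generation schedules for backbone and latents, can be illustrated with a small sampling sketch. It assumes Euler integration of a straight-path flow, and uses a toy stand-in for the learned denoiser that simply returns a known target. The power-law schedules and all names are hypothetical and do not reflect the schedules used in the paper.

```python
def residue_tokens(ca, z):
    """Concatenate coordinates and latents channel-wise: still one token per residue."""
    return [list(c) + list(l) for c, l in zip(ca, z)]

def schedule(num_steps, power):
    """Time grid from 0 to 1. With power > 1 the modality advances slowly at
    first and catches up at the end (hypothetical schedule family)."""
    return [(k / num_steps) ** power for k in range(num_steps + 1)]

def toy_denoiser(ca_t, z_t, t_ca, t_z, target):
    """Stand-in for the learned joint denoiser: returns clean-sample estimates."""
    return target["ca"], target["z"]

def euler_update(x_t, x1_hat, t, t_next):
    """One Euler step along the velocity v = (x1_hat - x_t) / (1 - t)."""
    w = (t_next - t) / (1.0 - t)
    return [[a + w * (b - a) for a, b in zip(r, rh)] for r, rh in zip(x_t, x1_hat)]

def sample(noise_ca, noise_z, target, num_steps=50, power_ca=1.0, power_z=2.0):
    """Integrate both modalities jointly, each on its own time grid."""
    ts_ca = schedule(num_steps, power_ca)
    ts_z = schedule(num_steps, power_z)
    ca_t, z_t = noise_ca, noise_z
    for k in range(num_steps):
        ca_hat, z_hat = toy_denoiser(ca_t, z_t, ts_ca[k], ts_z[k], target)
        ca_t = euler_update(ca_t, ca_hat, ts_ca[k], ts_ca[k + 1])
        z_t = euler_update(z_t, z_hat, ts_z[k], ts_z[k + 1])
    return ca_t, z_t

# Toy run: two residues, hypothetical latent dimension of 4.
target = {"ca": [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]], "z": [[0.5] * 4, [-0.5] * 4]}
start_ca = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.7]]
start_z = [[0.0] * 4, [1.0] * 4]
final_ca, final_z = sample(start_ca, start_z, target)
tokens = residue_tokens(final_ca, final_z)  # shape [2][3 + 4]
```

Note that `residue_tokens` keeps the token count equal to the number of residues regardless of how many atoms each side chain has, which is the property the scalability argument relies on.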

Contributions

New Partially Latent Flow Matching Framework. We propose La-Proteina, a partially latent flow matching method designed for the joint generation of protein sequence and fully atomistic structure, effectively combining explicit backbone modeling with fixed-size per-residue latent representations to capture sequence and atomistic side chains.

State-Of-The-Art Performance. In extensive benchmark experiments, La-Proteina achieves state-of-the-art performance in unconditional fully atomistic protein generation in terms of (co-)designability and diversity.

Strong Scalability and Structural Integrity. We verify La-Proteina's scalability, train our models on up to 46M protein structures and demonstrate that our model can generate structurally and geometrically valid fully atomistic proteins of up to 800 residues with accurate side chain conformations, outperforming previous methods.

Versatile Atomistic Motif Scaffolding. We successfully apply La-Proteina to both all-atom and tip-atom motif scaffolding, in both the indexed and the unindexed setting, again outperforming prior work in our benchmarks. This highlights La-Proteina's broad applicability to important conditional protein design tasks.

Novel Insights and In-Depth Analyses. We provide extensive further insights through ablation studies, latent space analyses, and rigorous biophysical assessments of La-Proteina's generated atomistic protein structures.

See our paper for details.

Performance and Benchmarking

We extensively evaluated and benchmarked La-Proteina and compared it to a variety of baseline methods. Our models produce highly diverse, novel and (co-)designable proteins and achieve state-of-the-art unconditional fully atomistic protein structure generation performance. Further, La-Proteina is scalable to long chain generation and its samples feature superior structural validity compared to prior works. Moreover, we also successfully applied La-Proteina to atomistic motif scaffolding. Please see our paper for all benchmarks, evaluations and analyses. Below, we highlight key quantitative results.

Fully Atomistic Generation of Long Protein Structures. In Figure 5, we present La-Proteina's performance when scaling its generation to proteins of up to 800 residues. La-Proteina can successfully generate diverse and designable proteins at such chain lengths, outperforming all prior methods. While co-designable generation degrades at longer lengths, La-Proteina is the first method that can produce diverse co-designable proteins at such lengths at all; all prior methods entirely collapse beyond 500 residues under this metric. This is thanks to La-Proteina's highly efficient partially latent flow matching framework, which allows us to successfully train highly performant models at these length scales. Fully atomistic samples across the entire range of protein lengths are visualized at the top of the webpage.

Figure 5. La-Proteina can generate diverse and (co-)designable proteins of up to 800 residues, significantly outperforming previous methods. Co-designability measures whether generated sequences fold into co-generated structures. Regular designability instead discards the model-generated sequences and uses ProteinMPNN to predict sequences for the generated structures; it then quantifies whether those predicted sequences fold back into the generated structures (this metric can also be calculated for backbone-only generators). The diversity metrics correspond to the number of structural clusters of designable or co-designable proteins.
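
As a rough illustration of how the two metrics in the caption differ, the sketch below computes both from self-consistency RMSD values, i.e., the RMSD between each generated structure and the structure predicted from a sequence. The 2.0 Å success threshold and the "best of several redesigned sequences" rule are conventions commonly used in the protein design literature and are assumptions here; the exact protocol is defined in the paper. Function names are hypothetical.

```python
def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    if not values:
        raise ValueError("need at least one sample")
    return sum(v < threshold for v in values) / len(values)

def codesignability(sc_rmsds, threshold=2.0):
    """One RMSD per sample: the model's own sequence, refolded, vs. its own structure."""
    return fraction_below(sc_rmsds, threshold)

def designability(sc_rmsds_per_sample, threshold=2.0):
    """Several RMSDs per sample, one per redesigned sequence (e.g. from ProteinMPNN).
    Assumption: a sample counts as designable if its best sequence refolds."""
    best = [min(rmsds) for rmsds in sc_rmsds_per_sample]
    return fraction_below(best, threshold)

# Toy values in Angstrom, for illustration only.
print(codesignability([0.8, 1.5, 3.2, 2.5]))      # 0.5
print(designability([[3.0, 1.2], [4.0, 5.0]]))    # 0.5
```

Co-designability is the stricter test for a joint sequence-structure generator, since the model's own sequence must work without help from an external inverse-folding model.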

Geometric Integrity and Biophysical Validity of Atomistic Structures. Next, we examined the biophysical quality of La-Proteina's generated structures. We used MolProbity to assess the structural validity in terms of bond angles, clashes and other physical quantities. The results in Figure 6 demonstrate that La-Proteina produces higher-quality structures, scoring significantly better than all baselines and yielding the most physically realistic structures among the compared methods.

Figure 6. La-Proteina produces structures with higher structural validity than existing all-atom generation baselines. MolProbity metrics assessing structural quality: overall MolProbity score, clash score, Ramachandran angle outliers, and covalent bond outliers. See paper for details.

Most side chain torsion angles do not vary freely, but cluster due to steric repulsions into so-called rotamers. Therefore, to judge the coverage of conformational space, we also visualize side-chain dihedral angle distributions and compare their rotamer populations to Protein Data Bank (PDB) and AlphaFold Database (AFDB) references. La-Proteina models these distributions accurately, as shown in Figure 7 for the tryptophan χ1 angle. La-Proteina's samples accurately recover all major rotameric states as well as their respective frequencies with respect to the reference PDB/AFDB. In contrast, baselines often deviate from these references, missing modes or populating unrealistic angular regions.

Figure 7. Distribution of the tryptophan χ1 angle for La-Proteina, prior fully atomistic protein structure generators, as well as PDB and AFDB references. La-Proteina more accurately reproduces the distribution from the PDB and AFDB datasets than the baseline methods.
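
For readers unfamiliar with how such angle distributions are obtained, here is a minimal sketch of a dihedral angle computation. The χ1 angle of a residue is the dihedral defined by its N, CA, CB, and CG atoms (for tryptophan and most other residues with a gamma carbon). The coarse binning into three wells around 60°, 180°, and 300° is a simplification for illustration, not the analysis used in the paper, and the function names are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees, in (-180, 180], for four 3D points.
    For chi1, pass the coordinates of N, CA, CB, CG in that order."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

def rotamer_bin(chi_deg):
    """Assign an angle to the 120-degree-wide window centered at 60, 180, or 300."""
    angle = chi_deg % 360.0
    if angle < 120.0:
        return 60
    if angle < 240.0:
        return 180
    return 300

# Geometric toy input (not real protein coordinates): a 90-degree twist.
chi = dihedral_deg((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

Collecting `rotamer_bin` counts over many generated residues and comparing them with counts from reference structures gives a simple view of whether a model recovers the rotamer populations discussed above.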

Fully Atomistic Motif Scaffolding. Next, we evaluated La-Proteina on fully atomistic motif scaffolding, where a functionally relevant motif is given and the model is tasked with completing the structure with a fully atomistic scaffold that supports the motif. We tackle both all-atom motif scaffolding, where all atoms of a motif residue are given to the model for conditioning, and tip-atom motif scaffolding, where only the tip atoms after the side chains' final rotatable bonds are given, but not the motif residues' other atoms, so that the model must decide the backbone and rotamer placements of the motif residues itself. Moreover, we consider both the indexed and unindexed generation scenarios. In indexed generation, the motif residue sequence indices are specified, whereas in unindexed generation, this information is not given and the model itself determines where in the sequence to place the motif. We adopt Protpardelle's atomistic motif scaffolding benchmark, using strict success criteria (note that Protpardelle is the only comparable baseline tackling fully atomistic motif scaffolding). The results in Figure 8 show that La-Proteina can solve most tasks and often generates many diverse unique successes, whereas Protpardelle fails on most tasks. We find that La-Proteina succeeds both in all-atom and tip-atom scaffolding, and can successfully perform both indexed and unindexed generation. Notably, we observe that unindexed generation leads to more unique successes; the additional freedom with regard to the motif placement within the sequence potentially allows the model to generate more diverse solutions. This observation is particularly pronounced when scaffolding motifs consisting of multiple residue segments, i.e., the distinct, contiguous residue blocks forming the motif. See our paper for details.

Figure 8. We benchmark La-Proteina on fully atomistic motif scaffolding and compare to Protpardelle. (top) Performance in terms of number of unique successes out of 1000 sampled structures for all tasks for Protpardelle and La-Proteina (task codes on the x-axes). We test both all-atom and tip-atom motif scaffolding and evaluate both the indexed and unindexed generation settings for La-Proteina. (bottom) Total unique successes summed over all tasks. See our paper for task success definition and further details.

Future Applications. La-Proteina's scalability demonstrates its potential to unlock large-scale fully atomistic protein structure generation tasks, and its strong performance in atomistic motif scaffolding highlights its relevance for important conditional atomistic protein generation tasks, as they emerge for instance in enzyme design. Future work could also apply La-Proteina to challenging binder design problems, which we expect to similarly benefit from a scalable yet expressive fully atomistic protein structure generative model like La-Proteina.
