Continuous Diffusion Scales Competitively
with Discrete Diffusion for Language

1NVIDIA    2Cornell    3Georgia Tech    4UW–Madison    5MBZUAI
*Equal advising; the senior authors are listed alphabetically.
Compute-optimal scaling of RePlaid

Likelihood scaling law. The compute-optimal loss of RePlaid exhibits power-law scaling, decreasing at a rate comparable to AR, MDLM (SOTA masked diffusion), and Duo (SOTA uniform-state diffusion). MDLM needs 14× the FLOPs to match AR; Duo needs 22×; RePlaid needs 20× with self-conditioning (s.c.) and 27× without it.

RePlaid vs MDLM at matched compute

Parameter scaling law. The compute-optimal RePlaid (s.c.) uses 1.8× fewer parameters than MDLM and Duo, while outperforming Duo's loss (see the likelihood scaling law on the left), and 3.4× fewer parameters than the AR baseline. For a given model size, MDLM and RePlaid (s.c.) match loss at the green line.

Abstract

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only 20× compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of 22.1 among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs.

Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.

Key Contributions

We introduce RePlaid, a modernized continuous DLM that carefully aligns Plaid with the architectures used by contemporary discrete DLMs MDLM and Duo — enabling a fair, head-to-head comparison.

Strong empirical performance:

  • The first scaling law for continuous DLMs on SlimPajama-627B that rivals discrete DLMs: RePlaid scales at the same rate as AR, MDLM, and Duo, with a compute gap of only 20× to AR.
  • A new state-of-the-art PPL bound of 22.1 on OpenWebText among continuous DLMs, alongside superior generation quality (GenPPL, MAUVE) versus discrete and continuous baselines.

New theoretical insights:

  1. Optimizing the noise schedule to minimize ELBO variance naturally produces linear information loss over time — eliminating the need for case-specific time reparameterizations.
  2. Learning the embeddings drives the largest likelihood gain. Likelihood-based training induces structured embedding geometries that CE-trained baselines fail to learn.

Revisiting Scaling Laws of Continuous DLMs


A Unified Scaling Benchmark

We train and evaluate on the SlimPajama-627B corpus, using a Llama-2 tokenizer and a sequence length of \(L = 2048\). We run an IsoFLOP analysis across 5 compute budgets \(C \in \{6 \times 10^{18}, 1 \times 10^{19}, \ldots, 1 \times 10^{20}\}\), with \(\geq 7\) model sizes per budget. For each target FLOP budget \(C\), we fit a quadratic in \(\log N\) to the validation loss, where \(N\) is the parameter count, to identify the compute-optimal parameter count \(N_C^\ast\) and the corresponding optimal loss \(\mathcal{L}_C^\ast\).
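The quadratic fit described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fit_isoflop` and the data are hypothetical, and the losses are synthetic values constructed to have a known minimum.

```python
import numpy as np

def fit_isoflop(param_counts, val_losses):
    """Fit loss ~ a*x^2 + b*x + c with x = log N (centered), return (N_opt, loss_opt)."""
    log_n = np.log(np.asarray(param_counts, dtype=float))
    center = log_n.mean()                      # centering improves conditioning
    x = log_n - center
    a, b, c = np.polyfit(x, np.asarray(val_losses, dtype=float), deg=2)
    if a <= 0:
        raise ValueError("fitted parabola is not convex; no interior minimum")
    x_opt = -b / (2.0 * a)                     # vertex of the parabola
    loss_opt = c - b * b / (4.0 * a)           # loss value at the vertex
    return float(np.exp(x_opt + center)), float(loss_opt)

# Synthetic example (made-up numbers): 7 model sizes at one compute budget,
# with losses that are exactly quadratic in log N around N = 4e8.
sizes = np.array([5e7, 1e8, 2e8, 4e8, 8e8, 1.6e9, 3.2e9])
losses = 3.0 + 0.05 * (np.log(sizes) - np.log(4e8)) ** 2
n_opt, loss_opt = fit_isoflop(sizes, losses)
print(f"N* = {n_opt:.3e}, L* = {loss_opt:.4f}")
```

Repeating this fit once per budget gives one \((C, N_C^\ast, \mathcal{L}_C^\ast)\) triple per IsoFLOP curve.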

(a) IsoFLOP curves for MDLM (low var. training).

(b) IsoFLOP curves for RePlaid with self-conditioning.

(c) IsoFLOP analysis example: RePlaid (s.c.) beats MDLM at over-training.

Scaling Laws

Fitting power laws to the compute-optimal points across budgets reveals four findings:

  • Compute-rate parity. RePlaid scales at a comparable rate to MDLM, Duo, and AR.
  • Competitive offset. Closing the compute gap to AR requires only 20× FLOPs for RePlaid (s.c.) and 27× for RePlaid (no s.c.) — competitive with MDLM's 14× and Duo's 22×.
  • Strong parameter efficiency. The compute-optimal RePlaid (s.c.) uses 1.8× fewer parameters than MDLM and Duo — while outperforming Duo's loss — and 3.4× fewer parameters than AR.
  • Beats MDLM at over-training. For a given model size, a RePlaid (s.c.) trained for 3.1× its optimal compute matches the loss of an MDLM trained for 6.9× its optimal compute.
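To make the notion of a "compute gap" concrete, here is a small sketch that fits a pure power law \(\mathcal{L}^\ast(C) = A\,C^{-\alpha}\) in log-log space and converts the offset between two curves with the same exponent into a FLOP multiplier. The functional form (no irreducible-loss term), the function names, the intermediate budgets, and all numbers are assumptions for illustration; the data are synthetic and built to have a 10× gap, and are not the paper's measurements.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(A) - alpha * log(C). Returns (A, alpha)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

def compute_multiplier(a_model, a_ref, alpha):
    """FLOP factor k such that a_model * (k*C)^(-alpha) == a_ref * C^(-alpha)."""
    return (a_model / a_ref) ** (1.0 / alpha)

# Synthetic budgets: only the endpoints follow the text; the spacing is assumed.
budgets = np.geomspace(6e18, 1e20, 5)
alpha_true, a_ref_true = 0.05, 20.0
a_model_true = a_ref_true * 10.0 ** alpha_true      # constructed 10x compute gap

ref_loss = a_ref_true * budgets ** (-alpha_true)     # stand-in for a reference curve
model_loss = a_model_true * budgets ** (-alpha_true) # stand-in for a diffusion curve

a_ref, alpha_ref = fit_power_law(budgets, ref_loss)
a_model, alpha_model = fit_power_law(budgets, model_loss)
gap = compute_multiplier(a_model, a_ref, alpha_ref)
print(f"alpha = {alpha_ref:.3f}, compute gap = {gap:.2f}x")
```

When the two fitted exponents differ, the multiplier depends on the budget at which it is evaluated, which is why rate parity is what makes a single number such as 20× meaningful.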

For visualization, please refer to the likelihood and parameter scaling laws at the top of this page.

Benchmarking for Diffusion Language Models


Likelihood Evaluation

Beyond the scaling-law study, we benchmark RePlaid at the 0.1B (100M) parameter scale, the canonical testbed on which the recent wave of continuous DLMs is exclusively evaluated and therefore the natural setting for a head-to-head comparison. Having established that RePlaid scales on par with discrete DLMs, we compare at this standard scale against the strongest discrete (MDLM, Duo) and continuous (FLM, LangFlow) baselines along both likelihood and sample quality. Both axes matter: methods such as FLM do not provide likelihood bounds, and prior work has shown that a better likelihood does not always translate into better sample quality. Every method shares the same transformer architecture and a comparable optimization setup, and is trained for 1M steps on OpenWebText.

On OWT, RePlaid (s.c.) attains the best perplexity bound of 22.1 among all considered DLMs — improving over the strongest discrete baseline MDLM (low var., 23.1), and outperforming Duo (25.2), the original Plaid (24.4), and the concurrent LangFlow (32.2). Even without self-conditioning, RePlaid reaches 23.6 — already below Duo's 25.2 and far ahead of LangFlow's 38.4. The gap to MDLM shrinks further at 250K steps, consistent with the over-training crossover seen in the scaling study.

Model               Category                      OWT PPL (↓)
LangFlow (no s.c.)  Continuous diffusion          38.4
LangFlow (s.c.)     Continuous diffusion          32.2
Plaid (no s.c.)     Continuous diffusion          25.7
Duo                 Discrete diffusion (uniform)  25.2
Plaid (s.c.)        Continuous diffusion          24.4
RePlaid (no s.c.)   Continuous diffusion          23.6
MDLM                Discrete diffusion (mask)     23.2
MDLM (low var.)     Discrete diffusion (mask)     23.1
RePlaid (s.c.)      Continuous diffusion          22.1
AR Transformer      Autoregressive                17.5

Test perplexity bounds on OWT (L = 1024, 1M steps), computed from the NELBO as in prior work. RePlaid (s.c.) sets a new state of the art among continuous DLMs and beats all considered discrete baselines.
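For readers unfamiliar with how a perplexity bound follows from the NELBO, the conversion is a one-liner: exponentiate the per-token NELBO measured in nats. Because the NELBO upper-bounds the negative log-likelihood, the result upper-bounds the true perplexity. The helper name `ppl_bound` and the example inputs are illustrative assumptions.

```python
import math

def ppl_bound(total_nelbo_nats, num_tokens):
    """Upper bound on perplexity from a summed NELBO (in nats) over num_tokens tokens."""
    return math.exp(total_nelbo_nats / num_tokens)

def nats_to_bits(nats):
    """Convert nats to bits, for reporting bits per token instead of perplexity."""
    return nats / math.log(2.0)

# Illustrative input: a per-token NELBO of ln(22.1) nats over a 1024-token
# sequence corresponds to a perplexity bound of 22.1.
per_token_nats = math.log(22.1)
bound = ppl_bound(per_token_nats * 1024, 1024)
print(f"PPL bound = {bound:.1f}, bits/token = {nats_to_bits(per_token_nats):.3f}")
```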

Generation Quality

We compare unconditional generation quality on OWT using GenPPL and MAUVE.

With a vanilla DDPM sampler and a uniform time discretization, RePlaid achieves stronger sample quality than all considered discrete and continuous baselines across sampling-step budgets. The advantage is largest at high numbers of function evaluations (NFEs), where RePlaid approaches data-level generative perplexity (albeit at lower entropy) and data-level MAUVE.

(a) GenPPL vs. sampling steps.

(b) MAUVE vs. sampling steps.

(c) GenPPL–entropy frontier.

Likelihood Training Shapes Structured Embeddings

RePlaid is trained with a plain VDM NELBO — no cross-entropy (CE) term and no hand-engineered time reparameterization. Why does this work so well? Because the NELBO is a true upper bound on the negative log-likelihood, optimizing it places strong, well-calibrated pressure on the token embeddings \(\mathbf{E}\). Recent embedding-based continuous DLMs (CDCD, FLM, LangFlow) instead substitute a CE loss; the difference shows up directly in the learned embedding geometry.

  • RePlaid yields low-rank, linguistically meaningful geometry. At a tiny embedding dimension \(d_e = 16\), a t-SNE projection of \(\mathbf{E}\) shows clusters of subwords by part-of-speech, and a PCA scree plot concentrates 90% of the variance in ~6 principal components.
  • CE acts as a separative force. Adding an auxiliary CE term to the NELBO disperses the embeddings, flattening the spectrum and degrading the PPL bound from 22.1 to 26.1.
  • Holds at scale. At \(d_e = 768\), RePlaid learns a much lower-rank geometry than CE-based LangFlow (while attaining a better PPL), whereas MDLM learns a much higher-rank geometry than both.

Embedding geometry of RePlaid (s.c.) on OWT (PPL 22.1). (a) t-SNE of \(\mathbf{E}\), each subword colored by its most-frequent POS tag. (b) PCA scree plot of \(\mathbf{E}\) — low-rank, ~6 components for 90% of the variance. (c) PCA scree plot when an auxiliary CE loss is added — the spectrum disperses and the PPL bound worsens to 26.1.
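The "components for 90% of the variance" statistic behind the scree plots can be computed as in the sketch below. It is an assumed, generic PCA recipe rather than the authors' analysis script; the function name and the toy embedding matrix (a rank-3 signal plus small noise, with \(d_e = 16\)) are hypothetical.

```python
import numpy as np

def components_for_variance(embeddings, threshold=0.90):
    """Return (k, explained): k is the smallest number of principal components
    whose cumulative explained-variance ratio reaches `threshold`."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)  # scree values
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, explained.size), explained

# Toy stand-in for an embedding matrix E: 1000 "tokens", 16 dimensions,
# generated from 3 latent directions plus small isotropic noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 16))
toy_embeddings = latent @ mixing + 0.01 * rng.normal(size=(1000, 16))

k, explained = components_for_variance(toy_embeddings)
print(f"{k} component(s) reach 90% of the variance")
```

A flat `explained` spectrum (large `k`) corresponds to the dispersed geometry described for the CE variant, while a sharply decaying one corresponds to the low-rank geometry.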

Why a Learned Noise Schedule Works

We prove that likelihood-based training learns a noise schedule that evenly distributes denoising difficulty across time without relying on hand-crafted time reparameterizations. More formally:

Minimizing the variance of the ELBO over time yields a noise schedule under which per-timestep information loss is linear in \(t\) and the per-timestep cross-entropy is near linear in \(t\).

Empirically, the noise schedule learned by RePlaid matches the theoretical prediction, while freezing the noise schedule during training produces highly non-linear per-timestep losses.
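The intuition can be illustrated numerically under simplifying assumptions. Suppose the total diffusion loss is an integral of a fixed positive density over a noise-level coordinate \(\gamma\), so that the schedule \(\gamma(t)\) only decides how that total is spread over \(t \in [0, 1]\). Spreading it evenly (cumulative loss linear in \(t\)) removes the variation across timesteps. The sketch below is not the paper's proof: the density is made up, the function names are hypothetical, and randomness over data and noise is ignored.

```python
import numpy as np

def cumulative_loss(gamma_grid, loss_density):
    """Normalized cumulative loss G(gamma) in [0, 1], via the trapezoid rule."""
    increments = 0.5 * (loss_density[1:] + loss_density[:-1]) * np.diff(gamma_grid)
    cumulative = np.concatenate(([0.0], np.cumsum(increments)))
    return cumulative / cumulative[-1]

def equalizing_schedule(gamma_grid, loss_density, t):
    """Schedule gamma(t) = G^{-1}(t), so that cumulative loss is linear in t."""
    return np.interp(t, cumulative_loss(gamma_grid, loss_density), gamma_grid)

def per_step_loss(gamma_t, gamma_grid, loss_density):
    """Share of the total loss that falls in each interval [t_i, t_{i+1}]."""
    cum = np.interp(gamma_t, gamma_grid, cumulative_loss(gamma_grid, loss_density))
    return np.diff(cum)

# Hypothetical loss density: most of the loss sits at intermediate noise levels.
gamma_grid = np.linspace(-10.0, 10.0, 2001)
loss_density = np.exp(-0.5 * gamma_grid**2) + 0.01   # strictly positive

t = np.linspace(0.0, 1.0, 101)
linear_gamma = gamma_grid[0] + t * (gamma_grid[-1] - gamma_grid[0])  # fixed schedule
equal_gamma = equalizing_schedule(gamma_grid, loss_density, t)

linear_steps = per_step_loss(linear_gamma, gamma_grid, loss_density)
equal_steps = per_step_loss(equal_gamma, gamma_grid, loss_density)
print(f"std of per-step loss: fixed = {linear_steps.std():.2e}, "
      f"equalized = {equal_steps.std():.2e}")
```

In this toy setting the fixed schedule concentrates the loss in a narrow band of timesteps, while the equalized schedule assigns each step the same share.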

Figure: per-timestep diffusion loss, CE loss, and decoding error.

BibTeX

@misc{yang2026replaid,
      title={Continuous Diffusion Scales Competitively with Discrete Diffusion for Language},
      author={Zhihan Yang and Wei Guo and Shuibai Zhang and Subham Sekhar Sahoo and Yongxin Chen and Arash Vahdat and Morteza Mardani and John Thickstun},
      year={2026},
      eprint={2605.18530},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.18530},
}