We establish the first scaling law for continuous diffusion language models that rivals discrete DLMs — with only a 20× compute gap to autoregressive models and a new SOTA PPL bound of 22.1 on OpenWebText among continuous DLMs.
While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief, we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only 20× compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We also benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of 22.1 among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs.
Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
We introduce RePlaid, a modernized continuous DLM that carefully aligns Plaid with the architectures used by the contemporary discrete DLMs MDLM and Duo, enabling a fair, head-to-head comparison.
Strong empirical performance:
New theoretical insights:
We train and evaluate on the SlimPajama-627B corpus, using a Llama-2 tokenizer and \(L = 2048\). We run an IsoFLOP analysis across 5 compute budgets \(C \in \{6 \times 10^{18}, 1 \times 10^{19}, \ldots, 1 \times 10^{20}\}\) and \(\geq 7\) model sizes per budget. For each target FLOP budget \(C\), we fit a quadratic in \(\log N\) to the validation loss to identify the compute-optimal parameter count \(N_C^\ast\) and the corresponding optimal loss \(\mathcal{L}_C^\ast\).
(a) MDLM (low var. training).
(b) RePlaid with self-conditioning.
(c) RePlaid (s.c.) beats MDLM at over-training.
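The per-budget IsoFLOP fit described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `isoflop_optimum` and the synthetic parameter counts and losses are assumptions made up for the example. It fits a quadratic in \(\log N\) to validation losses at a fixed budget and reads off the vertex as \(N_C^\ast\) and \(\mathcal{L}_C^\ast\).

```python
import numpy as np

def isoflop_optimum(param_counts, val_losses):
    """Fit loss ~ a*(log N)^2 + b*log N + c and return the vertex (N_opt, loss_opt)."""
    log_n = np.log(np.asarray(param_counts, dtype=float))
    losses = np.asarray(val_losses, dtype=float)
    a, b, c = np.polyfit(log_n, losses, deg=2)
    if a <= 0:
        # A concave or flat fit has no interior minimum to report.
        raise ValueError("fitted parabola has no minimum")
    log_n_opt = -b / (2.0 * a)
    loss_opt = c - b ** 2 / (4.0 * a)
    return float(np.exp(log_n_opt)), float(loss_opt)

# Synthetic example: seven hypothetical model sizes at one compute budget,
# with losses generated from an exact parabola centered at 3e8 parameters.
sizes = np.array([5e7, 1e8, 2e8, 3e8, 5e8, 8e8, 1.2e9])
losses = 0.05 * (np.log(sizes) - np.log(3e8)) ** 2 + 3.2

n_opt, l_opt = isoflop_optimum(sizes, losses)
print(f"N* ~ {n_opt:.3e}, L* ~ {l_opt:.4f}")
```

Fitting in \(\log N\) rather than \(N\) keeps the model sizes roughly evenly spaced on the fitting axis, which is why the vertex is converted back with an exponential.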
Fitting power laws to the compute-optimal points across budgets reveals four findings:
For visualization, please refer to the likelihood and parameter scaling laws at the top of this page.
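To make the notion of a compute gap concrete, here is a hedged sketch of fitting power laws to compute-optimal points and measuring how much more compute one model family needs to match another. The functional form \(\mathcal{L}^\ast(C) \approx A\,C^{-\alpha}\) (with no irreducible-loss term), the helper names, and all numbers are assumptions for illustration; the synthetic data is constructed so that the gap is 8×, which is not a result from the paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Least-squares fit of loss ~ A * compute**(-alpha) in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

def compute_gap(fit_ref, fit_other, c_ref):
    """Factor by which `other` must scale compute to match `ref`'s loss at c_ref."""
    a_ref, alpha_ref = fit_ref
    a_other, alpha_other = fit_other
    target_loss = a_ref * c_ref ** (-alpha_ref)
    c_needed = (a_other / target_loss) ** (1.0 / alpha_other)
    return c_needed / c_ref

# Synthetic compute budgets and losses (hypothetical values, not from the paper).
budgets = np.logspace(18, 20, 5)
alpha_true = 0.05
loss_ref = 10.0 * budgets ** (-alpha_true)
loss_other = 10.0 * 8.0 ** alpha_true * budgets ** (-alpha_true)

fit_ref = fit_power_law(budgets, loss_ref)
fit_other = fit_power_law(budgets, loss_other)
gap = compute_gap(fit_ref, fit_other, c_ref=1e19)
print(f"alpha_ref={fit_ref[1]:.3f}, alpha_other={fit_other[1]:.3f}, gap={gap:.2f}x")
```

When the two fitted exponents are equal, the gap is the same at every budget; when they differ, it depends on the reference budget `c_ref`.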
Beyond the scaling-law study, we benchmark RePlaid at the 0.1B (100M) parameter scale, the canonical testbed at which the recent wave of continuous DLMs is exclusively evaluated and therefore the natural setting for a head-to-head comparison. Having established that RePlaid scales on par with discrete DLMs, we compare at this standard scale against the strongest discrete (MDLM, Duo) and continuous (FLM, LangFlow) baselines along both likelihood and sample quality. Both axes matter: methods like FLM do not provide likelihood bounds, and prior work has shown that a better likelihood does not always translate into better sample quality. Every method shares the same transformer architecture and a comparable optimization setup, and is trained for 1M steps on OpenWebText.
On OWT, RePlaid (s.c.) attains the best perplexity bound of 22.1 among all considered DLMs — improving over the strongest discrete baseline MDLM (low var., 23.1), and outperforming Duo (25.2), the original Plaid (24.4), and the concurrent LangFlow (32.2). Even without self-conditioning, RePlaid reaches 23.6 — already below Duo's 25.2 and far ahead of LangFlow's 38.4. The gap to MDLM shrinks further at 250K steps, consistent with the over-training crossover seen in the scaling study.
| Model | Category | OWT PPL (↓) |
|---|---|---|
| LangFlow (no s.c.) | Continuous diffusion | 38.4 |
| LangFlow (s.c.) | Continuous diffusion | 32.2 |
| Plaid (no s.c.) | Continuous diffusion | 25.7 |
| Duo | Discrete diffusion (uniform) | 25.2 |
| Plaid (s.c.) | Continuous diffusion | 24.4 |
| RePlaid (no s.c.) | Continuous diffusion | 23.6 |
| MDLM | Discrete diffusion (mask) | 23.2 |
| MDLM (low var.) | Discrete diffusion (mask) | 23.1 |
| RePlaid (s.c.) | Continuous diffusion | 22.1 |
| AR Transformer | Autoregressive | 17.5 |
Test perplexity bounds on OWT (L = 1024, 1M steps), computed from the NELBO as in prior work. RePlaid (s.c.) sets a new state of the art among continuous DLMs and beats all considered discrete baselines.
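For readers unfamiliar with how a perplexity bound follows from the NELBO, the conversion is a per-token average followed by an exponential. The sketch below assumes the NELBO is summed over an evaluation set and measured in nats; the function names and token count are hypothetical.

```python
import math

def ppl_bound_from_nelbo(total_nelbo_nats, num_tokens):
    """Upper bound on perplexity: exp of the average per-token NELBO (in nats)."""
    return math.exp(total_nelbo_nats / num_tokens)

def bits_per_token(total_nelbo_nats, num_tokens):
    """Same quantity expressed in bits per token instead of perplexity."""
    return total_nelbo_nats / num_tokens / math.log(2.0)

# Round-trip check: a per-token NELBO of ln(22.1) nats corresponds to PPL 22.1.
num_tokens = 1_000_000  # hypothetical evaluation-set size
total_nelbo = math.log(22.1) * num_tokens
ppl = ppl_bound_from_nelbo(total_nelbo, num_tokens)
bpt = bits_per_token(total_nelbo, num_tokens)
print(f"PPL bound = {ppl:.1f}, bits/token = {bpt:.3f}")
```

Because the NELBO upper-bounds the negative log-likelihood, the resulting number upper-bounds the true perplexity, which is why the table reports bounds for diffusion models and an exact value only for the autoregressive baseline.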
We compare unconditional generation quality on OWT using GenPPL and MAUVE.
With a vanilla DDPM sampler and a uniform time discretization, RePlaid achieves stronger sample quality than all considered discrete and continuous baselines across sampling-step budgets. The advantage is largest at high NFEs, where RePlaid approaches data-level generative perplexity and MAUVE, albeit with lower entropy.
(a) GenPPL vs. sampling steps.
(b) MAUVE vs. sampling steps.
(c) GenPPL–entropy frontier.
RePlaid is trained with a plain VDM NELBO — no cross-entropy (CE) term and no hand-engineered time reparameterization. Why does this work so well? Because the NELBO is a true upper bound on the negative log-likelihood, optimizing it places strong, well-calibrated pressure on the token embeddings \(\mathbf{E}\). Recent embedding-based continuous DLMs (CDCD, FLM, LangFlow) instead substitute a CE loss; the difference shows up directly in the learned embedding geometry.
Embedding geometry of RePlaid (s.c.) on OWT (PPL 22.1). (a) t-SNE of \(\mathbf{E}\), each subword colored by its most-frequent POS tag. (b) PCA scree plot of \(\mathbf{E}\) — low-rank, ~6 components for 90% of the variance. (c) PCA scree plot when an auxiliary CE loss is added — the spectrum disperses and the PPL bound worsens to 26.1.
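A scree analysis like the one in panels (b) and (c) can be reproduced on any embedding matrix with a few lines of NumPy. The sketch below is an assumption-laden illustration: `components_for_variance` is a hypothetical helper, and the matrix is random synthetic data with a planted rank of 4, not RePlaid's learned embeddings.

```python
import numpy as np

def components_for_variance(embeddings, threshold=0.90):
    """Return (k, ratios): the smallest k whose leading principal components
    explain at least `threshold` of the variance, plus per-component ratios."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return k, ratios

# Synthetic stand-in for E: 2000 "tokens", 64 dims, approximately rank 4.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 64))
E_synthetic = low_rank + 0.01 * rng.normal(size=(2000, 64))

k90, ratios = components_for_variance(E_synthetic)
print(f"components for 90% variance: {k90}")
print("leading explained-variance ratios:", np.round(ratios[:6], 3))
```

A sharply decaying `ratios` vector corresponds to the low-rank spectrum in panel (b); a flatter one corresponds to the dispersed spectrum in panel (c).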
We prove that likelihood-based training learns a noise schedule that evenly distributes denoising difficulty across time without relying on hand-crafted time reparameterizations. More formally:
Minimizing the variance of the ELBO over time yields a noise schedule under which per-timestep information loss is linear in \(t\) and the per-timestep cross-entropy is near linear in \(t\).
Empirically, the noise schedule learned by RePlaid matches the theoretical prediction, while freezing the noise schedule during training produces highly non-linear per-timestep losses.
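The intuition behind the result can be illustrated numerically. In VDM-style objectives the loss can be written as an integral over the noise level, so the schedule changes only how that integral is spread over \(t\), and hence the variance of a Monte Carlo estimate with uniformly sampled \(t\). The toy sketch below is not the paper's proof or its learned schedule: it assumes a made-up loss density over a noise coordinate \(\gamma\) (`toy_loss_density`) and builds the schedule that makes the cumulative loss linear in \(t\) by inverting the normalized cumulative integral.

```python
import numpy as np

def toy_loss_density(gamma):
    # Hypothetical loss per unit of noise coordinate gamma (a smooth bump).
    return np.exp(-0.5 * (gamma / 2.0) ** 2)

GAMMA_MIN, GAMMA_MAX, N_GRID = -4.0, 4.0, 4001
t = np.linspace(0.0, 1.0, N_GRID)

# Baseline: a fixed schedule that is linear in t.
gamma_linear = GAMMA_MIN + (GAMMA_MAX - GAMMA_MIN) * t

# Equalizing schedule: gamma(t) = F^{-1}(t), where F is the normalized
# cumulative loss, so each slice of t carries the same amount of loss.
gamma_grid = np.linspace(GAMMA_MIN, GAMMA_MAX, N_GRID)
f = toy_loss_density(gamma_grid)
cumulative = np.concatenate(
    [[0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * np.diff(gamma_grid))]
)
total_loss = cumulative[-1]
gamma_equalized = np.interp(t, cumulative / total_loss, gamma_grid)

def per_time_loss(gamma_of_t):
    # Change of variables: loss density in t is f(gamma(t)) * d(gamma)/dt.
    return toy_loss_density(gamma_of_t) * np.gradient(gamma_of_t, t)

g_linear = per_time_loss(gamma_linear)
g_equalized = per_time_loss(gamma_equalized)
print(f"total loss            : {total_loss:.4f}")
print(f"std over t (linear)   : {g_linear.std():.4f}")
print(f"std over t (equalized): {g_equalized.std():.4f}")
```

Both schedules integrate to the same total loss, but under the equalizing schedule the per-timestep loss is nearly constant, so its spread across \(t\) collapses; this mirrors, in a simplified setting, the contrast between a learned and a frozen schedule described above.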
@misc{yang2026replaid,
title={Continuous Diffusion Scales Competitively with Discrete Diffusion for Language},
author={Zhihan Yang and Wei Guo and Shuibai Zhang and Subham Sekhar Sahoo and Yongxin Chen and Arash Vahdat and Morteza Mardani and John Thickstun},
year={2026},
eprint={2605.18530},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.18530},
}