🧗🏻 CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

NVIDIA, Georgia Institute of Technology

Teaser figure: document clusters and cluster distribution.

Abstract

Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. This strategy enables effective domain adaptation without relying solely on curated data. When continuously trained on 400B tokens with this mixture, our 950M model exceeds the state-of-the-art Llama-3.2-1B by 2.0% averaged across 12 general reasoning tasks. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal mix.

Overview

Figure 1: Overview of the CLIMB data clustering, filtering, and mixture-optimization process.

CLIMB first preprocesses raw data by embedding it and clustering the embeddings into groups. These clusters form the search space, where a mixture is a set of weights for combining the clusters. In iteration k, CLIMB samples n_k candidate mixtures, trains proxy models on a subset of them, and updates a predictor that estimates mixture performance. The predictor prunes mixtures that are likely to perform poorly, so only the most promising candidates proceed to full proxy training in subsequent iterations. By progressively refining the search space and eliminating suboptimal candidates, CLIMB converges toward an optimized data mixture that balances general and domain-specific performance without exhaustive manual curation.
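For concreteness, the loop above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the Dirichlet sampling of candidate mixtures, the gradient-boosted predictor, and the `train_and_eval_proxy` stub (which in CLIMB would train a small proxy LM on data drawn with the candidate mixture and score it on the target benchmarks) are all choices made for the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def sample_mixtures(n, n_clusters):
    # Each mixture is a weight vector over clusters (rows sum to 1).
    return rng.dirichlet(np.ones(n_clusters), size=n)

def train_and_eval_proxy(mixture):
    # Placeholder: in CLIMB this would train a small proxy model on data drawn
    # with `mixture` and evaluate it on the target benchmarks. A synthetic
    # score stands in here so the sketch runs end to end.
    return float(-np.square(mixture - mixture.mean()).sum())

def climb_search(n_clusters=20, iterations=3, n_candidates=64,
                 n_proxy=16, keep_frac=0.25):
    predictor = GradientBoostingRegressor()
    seen_mixtures, seen_scores = [], []
    candidates = sample_mixtures(n_candidates, n_clusters)

    for it in range(iterations):
        if seen_mixtures:
            # The predictor prunes candidates that are likely to perform poorly.
            predictor.fit(np.array(seen_mixtures), np.array(seen_scores))
            preds = predictor.predict(candidates)
            keep = np.argsort(preds)[-max(1, int(len(candidates) * keep_frac)):]
            candidates = candidates[keep]

        # Full proxy training only for the surviving (most promising) mixtures.
        for mixture in candidates[:n_proxy]:
            seen_mixtures.append(mixture)
            seen_scores.append(train_and_eval_proxy(mixture))

        # Refine the search space around the best mixtures found so far.
        best = np.array(seen_mixtures)[np.argsort(seen_scores)[-4:]]
        candidates = np.vstack([rng.dirichlet(50 * b + 1e-3, size=n_candidates // 4)
                                for b in best])

    return np.array(seen_mixtures)[int(np.argmax(seen_scores))]

best_mixture = climb_search()
```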

Mid-training

Figure 2: Continually training a 1B model on the CLIMB data mixture yields a 2.0% improvement over Llama-3.2-1B, demonstrating a more efficient scaling trend than prior models.
| Size | Model  | Wikitext (ppl ↓) | LAMBADA (ppl ↓) | PIQA  | ARC-C | ARC-E | HellaSwag | WinoGrande | SIQA  | Avg.  |
|------|--------|------------------|-----------------|-------|-------|-------|-----------|------------|-------|-------|
| 350M | Base   | 22.70            | 8.87            | 70.03 | 28.11 | 56.12 | 51.16     | 54.48      | 40.75 | 50.11 |
| 350M | Random | 20.92            | 9.85            | 71.16 | 30.54 | 62.50 | 52.14     | 55.40      | 41.29 | 52.17 |
| 350M | DoReMi | 19.41            | 10.39           | 70.29 | 33.53 | 66.41 | 52.25     | 55.95      | 41.86 | 53.38 |
| 350M | RegMix | 20.93            | 10.32           | 71.92 | 33.42 | 66.12 | 53.69     | 55.27      | 42.23 | 53.78 |
| 350M | CLIMB  | 19.67            | 9.29            | 72.21 | 34.87 | 67.25 | 55.32     | 56.79      | 42.54 | 54.83 |
| 1B   | Base   | 17.79            | 6.65            | 73.89 | 34.92 | 66.77 | 62.12     | 59.82      | 41.26 | 56.46 |
| 1B   | Random | 17.82            | 6.53            | 74.05 | 37.12 | 70.24 | 62.90     | 60.77      | 42.48 | 57.93 |
| 1B   | DoReMi | 15.78            | 6.33            | 74.91 | 40.01 | 72.34 | 63.53     | 61.08      | 43.09 | 59.16 |
| 1B   | RegMix | 16.19            | 6.62            | 75.22 | 40.42 | 71.32 | 64.73     | 62.33      | 42.22 | 59.37 |
| 1B   | CLIMB  | 15.96            | 6.44            | 75.78 | 40.98 | 72.97 | 66.01     | 63.32      | 43.37 | 60.41 |

Table 1: Comparison of different data mixture methods. CLIMB consistently outperforms other methods on 350M and 1B models trained on 40B tokens. The Wikitext and LAMBADA columns report perplexity (lower is better); the remaining columns report accuracy (%), and Avg. is the mean of the six accuracy columns.

We first perform phase-1 pre-training on 10T tokens to establish a solid foundation. We use the warmup-stable-decay (WSD) learning rate schedule, which lets training resume from the stable stage and confines the data-mixing study to the decay stage. As shown in Figure 2, CLIMB achieves the best performance among all sub-500M and sub-1.2B models. When comparing models of similar scale (around 1B parameters), CLIMB consistently outperforms baselines such as Llama-3.2 and AMD-OLMo, achieving the highest overall average score and surpassing the next-best model (Llama-3.2) by a noticeable margin of 2.0%.

Table 1 compares CLIMB with other data-mixture methods, where it again comes out ahead. With the 350M target model, CLIMB reaches an average accuracy of 54.83%, outperforming Random (52.17%) and the best-performing baseline, RegMix (53.78%). Similarly, for the 1B model, CLIMB achieves an average accuracy of 60.41%, higher than all baselines.

Although the optimization objective is confined to the validation sets of PIQA, ARC-Easy, and HellaSwag, the resulting gains carry over to all benchmark tasks. This demonstrates the robust generalization of our approach: optimizing on a small set of core tasks can capture and transfer essential reasoning capabilities to a broader range of problems. Using the optimal data mixture identified by our method, we further investigate the effect of scaling up: we train on 400B tokens with the same mixture and compare the resulting model against state-of-the-art baselines.
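The WSD schedule used for phase-1 pre-training can be written as a simple piecewise function of the training step. The sketch below is a generic WSD curve; the peak and minimum learning rates, the phase fractions, and the linear shape of the decay are illustrative assumptions, not the paper's settings.

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
           warmup_frac=0.01, decay_frac=0.2):
    """Warmup-stable-decay learning rate: linear warmup, a long constant
    ("stable") phase that checkpoints can resume from, and a final decay
    phase in which the data-mixture experiments are run."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                  # linear warmup to the peak
        return peak_lr * step / warmup_steps
    if step < decay_start:                   # stable phase at the peak
        return peak_lr
    # decay phase: linear anneal from peak_lr down to min_lr (shape assumed)
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

# Example: learning rate at a few points of a 100k-step run.
schedule = [wsd_lr(s, 100_000) for s in (0, 500, 50_000, 90_000, 100_000)]
```

Because the stable phase holds the learning rate constant, any checkpoint taken there can be branched into multiple decay runs, one per candidate data mixture, without re-doing the expensive early training.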

Pre-training from Scratch

Figure 3: Training a 1B model from scratch on ClimbMix shows a better scaling trend than training on other datasets.

Based on the insights from our explorations, we apply CLIMB to two existing datasets, Nemotron-CC and smollm-corpus, to construct a powerful new pre-training dataset. We first combine the two corpora and then employ our CLIMB-clustering method to semantically reorganize and filter the combined data into 20 distinct clusters, yielding a 1.2-trillion-token high-quality corpus named ClimbLab. We then use CLIMB-search to identify an optimal data mixture over these clusters and, using that mixture, extract a 400-billion-token high-quality dataset named ClimbMix. We publicly release both datasets: ClimbLab, the filtered 1.2-trillion-token corpus organized into 20 semantic clusters, as a research playground for further data-mixture studies, and ClimbMix, the optimized 400-billion-token dataset, for efficient pre-training.
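A minimal sketch of the embedding-and-clustering step described above: documents are embedded in a semantic space and grouped into 20 clusters that later serve as the mixing units for CLIMB-search. The specific encoder (`all-MiniLM-L6-v2`) and the use of scikit-learn's MiniBatchKMeans are assumptions made for illustration; the actual pipeline also performs quality filtering and operates at trillion-token scale, which requires batched or streamed processing.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

# Illustrative encoder choice, not necessarily the one used in the paper.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_documents(documents, n_clusters=20):
    """Embed documents and group them into semantic clusters; the clusters
    then become the units whose mixing weights CLIMB-search optimizes."""
    embeddings = encoder.encode(documents, batch_size=256,
                                normalize_embeddings=True)
    kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096,
                             random_state=0)
    labels = kmeans.fit_predict(embeddings)
    return labels, kmeans.cluster_centers_

# Usage: labels, centers = cluster_documents(list_of_text_documents)
```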

We train a 1B model from scratch with ClimbMix and evaluate its performance relative to models pretrained on other datasets under the same token budget. The results in Figure 3 indicate that models trained on ClimbMix significantly outperform those trained on existing datasets.

ClimbLab Clusters

| Cluster ID | # of Tokens (B) | Weight (%) | Topics |
|------------|-----------------|------------|--------|
| 1  | 17.79  | 0.81  | Mathematics, Algorithms, Programming, Software Development, Data Analysis |
| 2  | 109.73 | 1.11  | Books, Education, Writing, Literature, AI Ethics, History, Philosophy |
| 3  | 80.62  | 1.26  | Environmental Education, History, Architecture, Engineering, Classical Music |
| 4  | 64.70  | 3.05  | Education, Teaching, Science, Engineering, Psychology, Special Education |
| 5  | 92.97  | 1.65  | International Trade, Business, Economics, AI Consulting, Ethical Decision Making |
| 6  | 70.95  | 20.46 | Genetics, Biotechnology, AI, Robotics, Aging, Healthcare, Industrial Automation |
| 7  | 64.04  | 16.08 | Chemistry, Insects, Taxonomy, Agriculture, Gardening, Veterinary Science |
| 8  | 24.68  | 0.91  | Gaming, Role-Playing, Board Games, Video Games, Strategy, Fantasy, Virtual Reality |
| 9  | 12.75  | 0.78  | Astronomy, Cosmology, Astrophysics, Space Exploration, Urban Planning |
| 10 | 135.45 | 6.60  | Health, Sleep, Clinical Technology, Healthcare, Fitness, Addiction, Early Childhood Education |
| 11 | 37.11  | 1.20  | Software Development, Programming, Web Development, JavaScript, Databases |
| 12 | 78.31  | 28.04 | Technology, Mathematics, Legal Content, Human Rights, Energy Efficiency, Industrial Equipment |
| 13 | 10.95  | 0.63  | Sports, Cricket, Soccer, Tennis, Basketball, Cultural Heritage, Competition |
| 14 | 15.64  | 0.21  | Music, Instrumental Practice, Guitar, Jazz, Singing, Composition, Music Theory |
| 15 | 35.24  | 0.21  | Film, Cinema, Horror, Sci-Fi, Comics, Literature, Criticism, Philosophy |
| 16 | 52.24  | 7.45  | Sustainability, Climate Change, Renewable Energy, Environmental Conservation |
| 17 | 82.23  | 6.35  | Cardiovascular Health, Medical Research, Immunology, Cancer Prevention, Drug Therapy |
| 18 | 54.02  | 1.79  | Technology, Cybersecurity, Social Media, Privacy, Artificial Intelligence, Cloud Computing |
| 19 | 50.32  | 0.91  | Social Media, Digital Communication, Internet Culture, Misinformation, Psychology |
| 20 | 79.47  | 0.49  | Public Safety, Law Enforcement, Political History, Social Justice, Government |
| Total | 1,170.30 | 100.0 | - |
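As a worked example of how the table above translates into a concrete dataset, the sketch below splits a 400B-token budget across the clusters according to the searched weights and reports how many passes ("epochs") over each cluster that implies. The bookkeeping is our own illustration; how clusters whose weighted demand exceeds their available tokens are handled (repetition vs. renormalization) is an assumption left open here.

```python
# (cluster_id, available tokens in billions, searched mixture weight in %),
# copied from the table above.
CLUSTERS = [
    (1, 17.79, 0.81),  (2, 109.73, 1.11),  (3, 80.62, 1.26),  (4, 64.70, 3.05),
    (5, 92.97, 1.65),  (6, 70.95, 20.46),  (7, 64.04, 16.08), (8, 24.68, 0.91),
    (9, 12.75, 0.78),  (10, 135.45, 6.60), (11, 37.11, 1.20), (12, 78.31, 28.04),
    (13, 10.95, 0.63), (14, 15.64, 0.21),  (15, 35.24, 0.21), (16, 52.24, 7.45),
    (17, 82.23, 6.35), (18, 54.02, 1.79),  (19, 50.32, 0.91), (20, 79.47, 0.49),
]

def token_plan(budget_b=400.0):
    """For each cluster, compute the token share implied by its mixture weight
    and the number of passes over its available tokens that share requires."""
    plan = {}
    for cid, available_b, weight_pct in CLUSTERS:
        target_b = budget_b * weight_pct / 100.0
        plan[cid] = {
            "target_tokens_b": round(target_b, 2),
            # epochs > 1 means the cluster must be repeated (or weights renormalized)
            "epochs": round(target_b / available_b, 2),
        }
    return plan

for cid, row in token_plan().items():
    print(f"cluster {cid:2d}: {row['target_tokens_b']:7.2f}B tokens, "
          f"{row['epochs']:.2f} epochs")
```

Under a 400B budget, the heavily weighted clusters (e.g., clusters 6 and 12) would require more than one pass over their available tokens, which is one reason the searched weights differ so sharply from the raw cluster sizes.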

BibTeX


@article{diao2025climb,
  author        = {Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Celine Lin and Jan Kautz and Pavlo Molchanov},
  title         = {CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training},
  journal       = {arXiv preprint},
  year          = {2025},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2504.13161}
}