The transformative capabilities of language models (LMs) have intensified the demand for their deployment on everyday devices, necessitating efficient processing for on-device language tasks. To address this, we propose Hymba, a new family of small language models featuring a hybrid-head architecture that strategically integrates attention mechanisms with state space models (SSMs). This architecture leverages the strengths of both mechanisms: attention heads provide high-resolution recall, akin to snapshot memories in the human brain, while SSM heads offer efficient context summarization, similar to fading memories. To further enhance Hymba’s performance, we introduce learnable meta tokens that are prepended to input sequences and jointly trained with the model weights during pretraining. These meta tokens act as a learned cache initialization at inference time, modulating all subsequent tokens within the hybrid heads and boosting the model’s focus on salient information, similar to metamemory. Extensive experiments and ablation studies demonstrate that Hymba achieves new state-of-the-art results for small LMs across a wide range of benchmarks and advances their accuracy-efficiency trade-off. For instance, Hymba-1.5B matches the commonsense reasoning accuracy of Llama 3.2 3B while being 3.49x faster and offering a 14.72x reduction in cache size.
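To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of a hybrid-head block, not the released Hymba implementation: learnable meta tokens are prepended to the input sequence, an attention branch and a simplified SSM-style branch (a toy diagonal recurrence standing in for a real SSM such as Mamba) process the extended sequence, and their outputs are fused. All module names, dimensions, and the averaging fusion are illustrative assumptions.

```python
# Conceptual sketch of a hybrid-head block with prepended meta tokens.
# NOT the official Hymba code; the SSM branch is a toy diagonal recurrence.
import torch
import torch.nn as nn


class HybridHeadBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_meta: int = 8):
        super().__init__()
        # Learnable meta tokens, prepended to every sequence and trained jointly.
        self.meta_tokens = nn.Parameter(torch.randn(n_meta, d_model) * 0.02)
        self.n_meta = n_meta

        # Attention heads: high-resolution recall over the (meta + input) sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        # Toy SSM heads: per-channel recurrence h_t = a * h_{t-1} + (1 - a) * x_t.
        self.in_proj = nn.Linear(d_model, d_model)
        self.decay_logit = nn.Parameter(torch.zeros(d_model))  # fading-memory decay
        self.out_proj = nn.Linear(d_model, d_model)

        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        meta = self.meta_tokens.unsqueeze(0).expand(b, -1, -1)
        h = torch.cat([meta, x], dim=1)  # prepend meta tokens
        h_norm = self.norm(h)
        length = t + self.n_meta

        # Attention branch with a causal mask over the extended sequence.
        causal = torch.triu(
            torch.ones(length, length, dtype=torch.bool, device=x.device), 1
        )
        attn_out, _ = self.attn(h_norm, h_norm, h_norm, attn_mask=causal)

        # SSM branch: sequential scan with exponential decay (context summary).
        u = self.in_proj(h_norm)
        a = torch.sigmoid(self.decay_logit)  # decay in (0, 1)
        state = torch.zeros(b, d, device=x.device, dtype=x.dtype)
        ssm_states = []
        for step in range(length):
            state = a * state + (1 - a) * u[:, step]
            ssm_states.append(state)
        ssm_out = self.out_proj(torch.stack(ssm_states, dim=1))

        # Fuse the two branches (simple average here) and keep only input positions.
        fused = h + 0.5 * (attn_out + ssm_out)
        return fused[:, self.n_meta:]


# Usage: process a batch of token embeddings.
block = HybridHeadBlock()
out = block(torch.randn(2, 32, 256))
print(out.shape)  # torch.Size([2, 32, 256])
```

In this sketch, the decay parameter plays the role of the SSM heads’ fading memory while the attention branch retains exact token-level recall, and the meta tokens sit at the start of the sequence so every subsequent token can attend to them, loosely mimicking the learned cache initialization described above.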