Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts

Abstract

Mixture-of-Experts (MoE) models can scale beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single ``expert'' deep learning model. This strategy allows the number of parameters defining the MoE to grow while maintaining sparse activation, i.e., for each forward pass only a small, input-dependent fraction of the MoE's total parameters is loaded into GPU VRAM. In this paper, we provide an approximation- and learning-theoretic analysis of mixtures of expert MLPs with (P)ReLU activation functions.
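To make the routing idea concrete, here is a minimal sketch (assuming PyTorch; all class and parameter names are illustrative, and this is not the paper's specific construction) of top-1 routing over (P)ReLU expert MLPs: a gate selects one expert per input, so only that expert's parameters participate in the forward pass.

```python
# Illustrative sketch of a top-1-routed mixture of (P)ReLU expert MLPs.
# Not the paper's construction; names and sizes are arbitrary.
import torch
import torch.nn as nn

class ExpertMLP(nn.Module):
    """A small MLP expert with a parametric ReLU activation."""
    def __init__(self, dim_in, dim_hidden, dim_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_hidden),
            nn.PReLU(),                      # (P)ReLU activation
            nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, x):
        return self.net(x)

class Top1MoE(nn.Module):
    """Routes each input to exactly one expert (sparse activation)."""
    def __init__(self, dim_in, dim_hidden, dim_out, num_experts):
        super().__init__()
        self.gate = nn.Linear(dim_in, num_experts)   # routing scores
        self.experts = nn.ModuleList(
            [ExpertMLP(dim_in, dim_hidden, dim_out) for _ in range(num_experts)]
        )
        self.dim_out = dim_out

    def forward(self, x):                            # x: (batch, dim_in)
        expert_idx = self.gate(x).argmax(dim=-1)     # one expert per input
        out = x.new_zeros(x.shape[0], self.dim_out)
        for k, expert in enumerate(self.experts):
            mask = expert_idx == k
            if mask.any():                           # only selected experts run
                out[mask] = expert(x[mask])
        return out

# Example: 8 experts in total, but each input activates only one of them.
moe = Top1MoE(dim_in=16, dim_hidden=64, dim_out=1, num_experts=8)
y = moe(torch.randn(32, 16))
```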

Publication
TMLR 2025
