Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | Paper Notes
8/22/2025
https://arxiv.org/abs/1701.06538
Conditional computation is a technique that lets neural networks selectively activate only parts of the model for each input, rather than using the entire network every time. The challenge is learning how to intelligently route inputs to the right parts of the network; the paper addresses this with a trainable gating mechanism.
A "sparsely-gated mixture-of-experts layer" has a number of experts (each a FFN) and a trainable gating network which selects a parse combination of the experts to process each input.
Let $G(x)$ be the output of the gating network and $E_i(x)$ the output of the $i$-th expert, so the layer's output is $y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$.
We save computation based on the sparsity of the output of $G(x)$: wherever $G(x)_i = 0$, we need not compute $E_i(x)$.
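A minimal PyTorch sketch of how this could look, assuming top-$k$ gating with a plain softmax over the selected scores; `SparseMoE`, the layer sizes, and the routing loop are my own illustration, and the paper's noisy gating and load-balancing losses are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative sparsely-gated MoE layer (assumed names/shapes, not the paper's code)."""
    def __init__(self, d_model, d_hidden, n_experts, k=2):
        super().__init__()
        self.k = k
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # Trainable gating network: one score per expert.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                       # x: (batch, d_model)
        scores = self.gate(x)                   # (batch, n_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        # Softmax only over the k selected experts; every other gate value is exactly 0.
        gates = F.softmax(topk_vals, dim=-1)    # (batch, k)
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e
                if mask.any():                  # only run an expert on the inputs routed to it
                    y[mask] += gates[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return y
```

Since only $k$ gate values per input are nonzero, the sum $y = \sum_i G(x)_i E_i(x)$ touches just $k$ of the $n$ experts, which is where the compute savings come from.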
They also implement a hierarchical MoE to address a scalability problem with the gating mechanism: with a flat gate and a very large number of experts, the gating network must compute a score for every single expert for every input. In the hierarchical version, a primary gating network chooses among groups of experts, and each group has its own secondary gating network over the experts inside it (see the sketch below).
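A hedged sketch of two-level gating under the same assumptions; `HierarchicalGate`, the hard argmax over groups, and the shapes are illustrative simplifications rather than the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalGate(nn.Module):
    """Illustrative two-level gate: pick a group, then gate over experts within it."""
    def __init__(self, d_model, n_groups, experts_per_group):
        super().__init__()
        self.primary = nn.Linear(d_model, n_groups)            # scores over groups
        self.secondary = nn.ModuleList(
            [nn.Linear(d_model, experts_per_group) for _ in range(n_groups)]
        )

    def forward(self, x):                                      # x: (batch, d_model)
        group_probs = F.softmax(self.primary(x), dim=-1)       # (batch, n_groups)
        group = group_probs.argmax(dim=-1)                     # one group per input (simplification)
        gates = []
        for i in range(x.size(0)):
            g = group[i].item()
            within = F.softmax(self.secondary[g](x[i]), dim=-1)
            # Effective gate = p(group) * p(expert | group); only this group's experts
            # receive nonzero gates, so the rest are never evaluated.
            gates.append(group_probs[i, g] * within)
        return group, torch.stack(gates)
```

The design point is that gating cost no longer scales with the total expert count: with roughly $\sqrt{n}$ groups of $\sqrt{n}$ experts each, an input needs scores for about $2\sqrt{n}$ candidates instead of all $n$.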