Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with batch size of one, delivering inference latency of 0.50 ms.
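The throughput and latency figures quoted for ResNet-50 are consistent with each other, which a short calculation can confirm. The sketch below is illustrative only and is not taken from the paper: the function name `latency_ms_from_throughput` is hypothetical, and it assumes that batches are processed strictly one after another, so that per-batch latency is simply the batch size divided by the throughput.

```python
def latency_ms_from_throughput(images_per_second: float, batch_size: int = 1) -> float:
    """Return per-batch latency in milliseconds.

    Assumption (not stated in the abstract): batches run back to back with
    no overlap between consecutive batches, so latency = batch_size / throughput.
    """
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    seconds_per_batch = batch_size / images_per_second
    return seconds_per_batch * 1000.0


# Figures from the abstract: 1988 images/s at batch size one.
latency = latency_ms_from_throughput(1988.0, batch_size=1)
print(f"{latency:.2f} ms")  # prints "0.50 ms"
```

Under that assumption, 1988 images/s at batch size one corresponds to roughly 0.50 ms per image, matching the latency reported above.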

Authors

Sophia Shao (NVIDIA)

Jason Clemons

Rangharajan Venkatesan

Alicia Klinefelter (NVIDIA)

Nathaniel Pinckney

Priyanka Raina (Stanford)

Publication Date

Saturday, October 12, 2019

Published in

International Symposium on Microarchitecture (MICRO)

Research Area

Artificial Intelligence and Machine Learning

Circuits and VLSI Design

Computer Architecture

External Links

ACM Digital Library

Uploaded Files

Published manuscript (10.83 MB)

Awards

Best Paper award

IEEE Micro Top Picks in Computer Architecture (Honorable Mention)

Copyright

Copyright by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org. The definitive version of this paper can be found at ACM's Digital Library http://www.acm.org/dl/.