 
 # Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

  ![](/sites/default/files/styles/wide/public/publications/RC18_photo_1.jpg?itok=153dgwTL)

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with batch size of one, delivering inference latency of 0.50 ms.
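The throughput and latency figures quoted for ResNet-50 can be cross-checked with a short calculation. The sketch below assumes that, at batch size one, images are processed strictly one at a time, so per-image latency is roughly the reciprocal of throughput. That assumption and the names `THROUGHPUT_IMAGES_PER_S`, `REPORTED_LATENCY_MS`, and `latency_ms_from_throughput` are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract above.
THROUGHPUT_IMAGES_PER_S = 1988  # ResNet-50, batch size of one
REPORTED_LATENCY_MS = 0.50


def latency_ms_from_throughput(images_per_s: float) -> float:
    """Per-image latency in milliseconds.

    Assumption (not stated in the source): only one image is in
    flight at a time, so latency is the reciprocal of throughput.
    """
    return 1000.0 / images_per_s


latency_ms = latency_ms_from_throughput(THROUGHPUT_IMAGES_PER_S)
print(f"{latency_ms:.2f} ms")  # prints "0.50 ms"
```

Under that assumption the two numbers agree to two decimal places. With larger batches or overlapping requests, throughput and latency would no longer be simple reciprocals of each other.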


 ## Authors


Sophia Shao (NVIDIA)

[Jason Clemons](/person/jason-clemons)

[Rangharajan Venkatesan](/person/rangharajan-venkatesan)

[Brian Zimmer](/person/brian-zimmer)

[Matt Fojtik](/person/matt-fojtik)

[Ted Jiang](/person/ted-jiang)

[Ben Keller](/person/ben-keller)

Alicia Klinefelter (NVIDIA)

[Nathaniel Pinckney](/person/nathaniel-pinckney)

Priyanka Raina (Stanford)

[Stephen Tell](/person/stephen-tell)

[Yanqing Zhang](/person/yanqing-zhang)

[William Dally](/person/william-dally)

[Joel Emer](/person/joel-emer)

[Tom Gray](/person/tom-gray)

[Brucek Khailany](/person/brucek-khailany)

[Steve Keckler](/person/stephen-keckler)

 
 ## Publication Date


Saturday, October 12, 2019

 
 ## Published in


[International Symposium on Microarchitecture (MICRO)](https://dl.acm.org/doi/10.1145/3352460.3358302)

 
 ## Research Area


[Artificial Intelligence and Machine Learning](/research-area/machine-learning-artificial-intelligence)

[Circuits and VLSI Design](/research-area/circuits)

[Computer Architecture](/research-area/computer-architecture)

 
 ## External Links


[ACM Digital Library](https://dl.acm.org/doi/10.1145/3352460.3358302)

 
 ## Uploaded Files


[Published manuscript](https://research.nvidia.com/sites/default/files/pubs/2019-10_Simba%3A-Scaling-Deep-Learning//shao2019-micro.pdf "Open file in new window") (10.83 MB)

 
 ## Awards


Best Paper award

IEEE Micro Top Picks in Computer Architecture (Honorable Mention)

 
 ## Copyright


Copyright by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or <permissions@acm.org>. The definitive version of this paper can be found at ACM's Digital Library <http://www.acm.org/dl/>.