 # A 17–95.6 TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization for Transformers in 5nm

  ![](/sites/default/files/styles/wide/public/publications/Figure5.png?itok=fQAjdhIz)

 We present a deep neural network (DNN) accelerator designed for efficient execution of transformer-based DNNs, which have become ubiquitous for natural language processing tasks. DNN inference accelerators often employ specialized hardware techniques such as reduced precision to improve energy efficiency, but many of these techniques result in catastrophic accuracy loss on transformers. The proposed accelerator supports per-vector scaled quantization and approximate softmax to enable the use of 4-bit arithmetic with little accuracy loss. The 5nm prototype achieves 95.6 TOPS/W in benchmarking and 1711 inferences/s/W with only 0.7% accuracy loss on BERT, demonstrating a practical accelerator design for energy-efficient inference with transformers.
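To illustrate why per-vector scaling helps at 4-bit precision, the following is a minimal sketch, not the accelerator's actual implementation. It compares a single scale factor shared by a whole tensor against one scale factor per short vector. The vector size of 16, the floating-point scale factors, the symmetric integer range of -7 to 7, and all function names are assumptions chosen for illustration; the abstract does not specify these details.

```python
import random

QMAX = 7  # assumed symmetric signed 4-bit range: integers in [-7, 7]


def quantize(values, scale):
    """Round values to clamped 4-bit integer codes using one shared scale."""
    if scale == 0.0:
        return [0 for _ in values]
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [c * scale for c in codes]


def scale_for(values):
    """Choose the scale so the largest magnitude maps to QMAX."""
    return max(abs(v) for v in values) / QMAX


def per_tensor(values):
    """Quantize and dequantize with one scale for the whole tensor."""
    s = scale_for(values)
    return dequantize(quantize(values, s), s)


def per_vector(values, vector_size):
    """Quantize and dequantize with a separate scale per short vector."""
    out = []
    for start in range(0, len(values), vector_size):
        chunk = values[start:start + vector_size]
        s = scale_for(chunk)
        out.extend(dequantize(quantize(chunk, s), s))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def demo(seed=0, n=256, vector_size=16):
    """Return (per-tensor MSE, per-vector MSE) on synthetic data."""
    rng = random.Random(seed)
    # Mostly small values with a few large outliers (synthetic test data).
    values = [rng.gauss(0.0, 0.1) for _ in range(n)]
    for i in range(0, n, 64):
        values[i] = 4.0
    return (mean_squared_error(values, per_tensor(values)),
            mean_squared_error(values, per_vector(values, vector_size)))


if __name__ == "__main__":
    mse_tensor, mse_vector = demo()
    print("per-tensor MSE:", mse_tensor)
    print("per-vector MSE:", mse_vector)
```

With a single per-tensor scale, the few outliers set the step size for every element, so most small values collapse to zero. With per-vector scales, only the vectors that contain an outlier pay that cost, which is why the reconstruction error in this sketch is lower.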



 ## Authors



[Ben Keller](/person/ben-keller)

[Rangharajan Venkatesan](/person/rangharajan-venkatesan)

[Steve Dai](/person/steve-dai)

[Stephen Tell](/person/stephen-tell)

[Brian Zimmer](/person/brian-zimmer)

[William Dally](/person/william-dally)

[Tom Gray](/person/tom-gray)

[Brucek Khailany](/person/brucek-khailany)

 

 

 ## Publication Date



Tuesday, June 14, 2022

 

 ## Published in



[2022 Symposium on VLSI Technology & Circuits Digest of Technical Papers](https://www.vlsisymposium.org)

 

 ## Research Area



[Artificial Intelligence and Machine Learning](/research-area/machine-learning-artificial-intelligence)

[Circuits and VLSI Design](/research-area/circuits)

 

 

 ## External Links



[\[IEEEXplore\] A 17–95.6 TOPS/W Deep Learning Inference Accelerator with Per-Vect…](https://ieeexplore.ieee.org/document/9830277)

 

 

 ## Uploaded Files



[C02-1.PDF](https://d1qx31qr3h6wln.cloudfront.net/publications/C02-1.PDF) (998.15 KB)

 

 

 ## Copyright



This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to <pubs-permissions@ieee.org>.