 # VS-QUANT: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

  ![](/sites/default/files/styles/wide/public/publications/VSQuant_Fig_0.JPG?itok=hOh5iaL8)

 Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Quantizing too aggressively, however, degrades accuracy. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector (≈16-64 elements) within a single dimension of a tensor. To achieve an efficient hardware implementation, the per-vector scale factors can be implemented with low-bitwidth integers when calibrated using a two-level quantization scheme. We find that, for popular neural networks, per-vector scaling consistently achieves better inference accuracy at low precision than conventional scaling techniques, without requiring retraining. We also modify a deep learning accelerator hardware design to study the area and energy overheads of per-vector scaling support. Our evaluation demonstrates that per-vector scaled quantization with 4-bit weights and activations achieves 37% area saving and 24% energy saving while maintaining over 75% accuracy for ResNet50 on ImageNet. 4-bit weights and 8-bit activations achieve near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to an 8-bit baseline.
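 The per-vector, two-level idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names (`per_vector_quantize`, `dequantize`), the max-based calibration, the choice of one floating-point scale per row, and the use of a ceiling when rounding the integer scales (so that no element clips) are all assumptions made for illustration.

```python
import numpy as np

def per_vector_quantize(w, vec_size=16, bits=4, scale_bits=4):
    """Illustrative two-level per-vector quantization of a 2-D array.

    Assumptions (not taken from the paper): symmetric max calibration,
    one floating-point scale per row, ceiling applied to integer scales.
    The number of columns must be divisible by vec_size.
    """
    rows, cols = w.shape
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit values
    smax = 2 ** scale_bits - 1          # e.g. 15 for unsigned 4-bit scales

    # Split each row into short vectors along a single dimension.
    v = w.reshape(rows, cols // vec_size, vec_size)

    # Level 1: an ideal floating-point scale for every vector.
    s_fp = np.abs(v).max(axis=2, keepdims=True) / qmax

    # Level 2: store each per-vector scale as a low-bitwidth integer,
    # relative to one coarse floating-point scale (gamma) per row.
    gamma = s_fp.max(axis=1, keepdims=True) / smax
    gamma = np.where(gamma == 0, 1.0, gamma)   # guard all-zero rows
    s_int = np.clip(np.ceil(s_fp / gamma), 1, smax)

    # Quantize the elements with the effective scale s_int * gamma.
    q = np.clip(np.round(v / (s_int * gamma)), -qmax, qmax)
    return q.astype(np.int8), s_int.astype(np.uint8), gamma

def dequantize(q, s_int, gamma, shape):
    """Reconstruct a floating-point approximation of the original array."""
    return (q * s_int * gamma).reshape(shape)

# Example: 4-bit weights with 4-bit per-vector scales, 16 elements per vector.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64))
q, s_int, gamma = per_vector_quantize(w)
w_hat = dequantize(q, s_int, gamma, w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

 In this sketch only `gamma` is kept in floating point, so the per-vector storage overhead is one small integer per `vec_size` elements. Because each vector gets its own scale, a vector of small values is not forced onto the coarse grid dictated by the largest value in the row.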



 ## Authors



[Steve Dai](/person/steve-dai)

[Rangharajan Venkatesan](/person/rangharajan-venkatesan)

Mark Haoxing Ren (NVIDIA)

[Brian Zimmer](/person/brian-zimmer)

[William Dally](/person/william-dally)

[Brucek Khailany](/person/brucek-khailany)

 

 

 ## Publication Date



Monday, April 5, 2021

 

 ## Published in



[MLSys 2021](https://mlsys.org/)

 

 ## Research Area



[Artificial Intelligence and Machine Learning](/research-area/machine-learning-artificial-intelligence)

[Circuits and VLSI Design](/research-area/circuits)

[Computer Architecture](/research-area/computer-architecture)

 

 

 ## External Links



[VS-QUANT: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Netw…](https://arxiv.org/abs/2102.04503)