David Nellans  

 
  ![](/sites/default/files/person/IMG_2133_Y5eqTgG.JPG)

  
 Dave Nellans joined NVIDIA in 2013 and leads the Architecture Research Group.


   Research Area(s)

[Artificial Intelligence and Machine Learning](/index.php/research-area/machine-learning-artificial-intelligence)

[Computer Architecture](/index.php/research-area/computer-architecture)

[High Performance Computing](/index.php/research-area/high-performance-computing)

[Hyperscale Graphics](/index.php/research-area/hyperscale-graphics)

[Programming Languages, Systems and Tools](/index.php/research-area/programming-languages-systems)

[Storage and Systems](/index.php/research-area/storage-and-systems)

 
 Main Field of Interest

[Computer Architecture](/index.php/research-area/computer-architecture)

 
 Google Scholar

[https://scholar.google.com/citations?user=mjvx1GIAAAAJ&hl=en](https://scholar.google.com/citations?user=mjvx1GIAAAAJ&hl=en)

 
 ### Publications

 
### 2023 

[Parsimony: Enabling SIMD/Vector Programming in Standard Compiler Flows](/publication/2023-02_parsimony-enabling-simdvector-programming-standard-compiler-flows)

Vijay Kandiah, [Daniel Lustig](/person/daniel-lustig), Oreste Villa, [David Nellans](/person/david-nellans), Nikos Hardavellas


[International Symposium on Code Generation and Optimization](https://dl.acm.org/doi/10.1145/3579990.3580019)


### 2022 

[The Implications of Page Size Management on Graph Analytics](/publication/2022-11_implications-page-size-management-graph-analytics)

Aninda Manocha, [Zi Yan](/person/zi-yan), Esin Tureci, Juan Luis Aragón, [David Nellans](/person/david-nellans), Margaret Martonosi


[International Symposium on Workload Characterization (IISWC)](https://ieeexplore.ieee.org/document/9975438)


### 2021 

[GPU Domain Specialization via Composable On-Package Architecture](/index.php/publication/2021-12_gpu-domain-specialization-composable-package-architecture)

[Yaosheng Fu](/index.php/person/yaosheng-fu), Evgeny Bolotin, [Niladrish Chatterjee](/index.php/person/niladrish-chatterjee), [David Nellans](/index.php/person/david-nellans), [Steve Keckler](/index.php/person/stephen-keckler)


[ACM Transactions on Architecture and Code Optimization (TACO)](https://dl.acm.org/doi/full/10.1145/3484505)


[GPS: A Global Publish-Subscribe Model for Multi-GPU Memory Management](/publication/2021-10_gps-global-publish-subscribe-model-multi-gpu-memory-management)

[Harini Muthukrishnan](/person/harini-muthukrishnan), [Daniel Lustig](/person/daniel-lustig), [David Nellans](/person/david-nellans), Thomas Wenisch


[International Symposium on Microarchitecture (MICRO)](https://dl.acm.org/doi/10.1145/3466752.3480088)


Best Paper nominee, IEEE Micro Top Picks in Computer Architecture (Honorable Mention)


[Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers](/publication/2021-06_efficient-multi-gpu-shared-memory-automatic-optimization-fine-grained-transfers)

[Harini Muthukrishnan](/person/harini-muthukrishnan), [David Nellans](/person/david-nellans), [Daniel Lustig](/person/daniel-lustig), Jeffrey Fessler, Thomas Wenisch


[International Symposium on Computer Architecture (ISCA)](https://ieeexplore.ieee.org/document/9499752)


[GPU Domain Specialization via Composable On-Package Architecture](/publication/2021-04_gpu-domain-specialization-composable-package-architecture)

[Yaosheng Fu](/person/yaosheng-fu), Evgeny Bolotin, [Niladrish Chatterjee](/person/niladrish-chatterjee), [David Nellans](/person/david-nellans), [Steve Keckler](/person/stephen-keckler)


[arXiv](https://arxiv.org/abs/2104.02188)


[Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator](/publication/2021-02_need-speed-experiences-building-trustworthy-system-level-gpu-simulator)

Oreste Villa, [Daniel Lustig](/person/daniel-lustig), [Zi Yan](/person/zi-yan), Evgeny Bolotin, [Yaosheng Fu](/person/yaosheng-fu), [Niladrish Chatterjee](/person/niladrish-chatterjee), [Ted Jiang](/person/ted-jiang), [David Nellans](/person/david-nellans)


[International Symposium on High Performance Computer Architecture (HPCA)](https://doi.org/10.1109/HPCA51647.2021.00077)


### 2020 

[The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems](/publication/2020-12_architectural-implications-distributed-reinforcement-learning-cpu-gpu-systems)

Ahmet Inci, Evgeny Bolotin, [Yaosheng Fu](/person/yaosheng-fu), [Gal Dalal](/person/gal-dalal), [Shie Mannor](/person/shie-mannor), [David Nellans](/person/david-nellans), Diana Marculescu


[Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2)](https://www.emc2-ai.org/virtual-20)


[Locality-Centric Data and Threadblock Management for Massive GPUs](/publication/2020-10_locality-centric-data-and-threadblock-management-massive-gpus)

Mahmoud Khairy, Vadim Nikiforov, [David Nellans](/person/david-nellans), Timothy G. Rogers


[International Symposium on Microarchitecture (MICRO)](https://ieeexplore.ieee.org/document/9251964)


[Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs](/publication/2020-06_buddy-compression-enabling-larger-memory-deep-learning-and-hpc-workloads-gpus)

Esha Choukse, [Michael B. Sullivan](/person/mike-sullivan), [Mike O'Connor](/person/mike-o-connor), Mattan Erez, Jeff Pool, [David Nellans](/person/david-nellans), [Steve Keckler](/person/stephen-keckler)


[International Symposium on Computer Architecture (ISCA)](https://ieeexplore.ieee.org/document/9138915)


[HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems](/index.php/publication/2020-02_hmg-extending-cache-coherence-protocols-across-modern-hierarchical-multi-gpu)

Xiaowei Ren, [Daniel Lustig](/index.php/person/daniel-lustig), Evgeny Bolotin, [Aamer Jaleel](/index.php/person/aamer-jaleel), Oreste Villa, [David Nellans](/index.php/person/david-nellans)


[International Symposium on High Performance Computer Architecture (HPCA)](https://ieeexplore.ieee.org/document/9065597)


### 2019 

[NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs](/publication/2019-10_nvbit-dynamic-binary-instrumentation-framework-nvidia-gpus)

Oreste Villa, [Mark Stephenson](/person/mark-stephenson), [David Nellans](/person/david-nellans), [Steve Keckler](/person/stephen-keckler)


[International Symposium on Microarchitecture (MICRO)](https://doi.org/10.1145/3352460.3358307)


[Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training](/publication/2019-08_optimizing-multi-gpu-parallelization-strategies-deep-learning-training)

Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, [Yaosheng Fu](/person/yaosheng-fu), Victor Zhang, Szymon Migacz, [David Nellans](/person/david-nellans), Puneet Gupta


[IEEE MICRO: Special Edition on Machine Learning Acceleration](https://ieeexplore.ieee.org/document/8805338)


[Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training](/publication/2019-07_optimizing-multi-gpu-parallelization-strategies-deep-learning-training)

Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, [Yaosheng Fu](/person/yaosheng-fu), Victor Zhang, Szymon Migacz, [David Nellans](/person/david-nellans), Puneet Gupta 


[arXiv](https://arxiv.org/abs/1907.13257)


[Translation Ranger: Operating System Support for Contiguity-Aware TLBs](/index.php/publication/2019-06_translation-ranger-operating-system-support-contiguity-aware-tlbs)

[Zi Yan](/index.php/person/zi-yan), [Daniel Lustig](/index.php/person/daniel-lustig), [David Nellans](/index.php/person/david-nellans), Abhishek Bhattacharjee


[International Symposium on Computer Architecture (ISCA)](https://dl.acm.org/doi/10.1145/3307650.3322223)


[Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs](/publication/2019-04_buddy-compression-enabling-larger-memory-deep-learning-and-hpc-workloads-gpus)

Esha Choukse, [Michael B. Sullivan](/person/mike-sullivan), [Mike O'Connor](/person/mike-o-connor), Mattan Erez, Jeff Pool, [David Nellans](/person/david-nellans), Stephen W. Keckler


[arXiv](https://arxiv.org/abs/1903.02596)


[Nimble Page Management for Tiered Memory Systems](/index.php/publication/2019-04_nimble-page-management-tiered-memory-systems)

[Zi Yan](/index.php/person/zi-yan), [Daniel Lustig](/index.php/person/daniel-lustig), [David Nellans](/index.php/person/david-nellans), Abhishek Bhattacharjee


[International Conference on Architectural Support for Programming Languages and…](https://dl.acm.org/doi/10.1145/3297858.3304024)


[Understanding the Future of Energy Efficiency in Multi-Module GPUs](/publication/2019-02_understanding-future-energy-efficiency-multi-module-gpus)

Akhil Arunkumar, Evgeny Bolotin, [David Nellans](/person/david-nellans), Carole-Jean Wu


[International Symposium on High Performance Computer Architecture (HPCA)](https://ieeexplore.ieee.org/document/8675192)


### 2018 

[Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems](/index.php/publication/2018-10_combining-hwsw-mechanisms-improve-numa-performance-multi-gpu-systems)

Vinson Young, [Aamer Jaleel](/index.php/person/aamer-jaleel), Evgeny Bolotin, Eiman Ebrahimi, [David Nellans](/index.php/person/david-nellans), Oreste Villa


[International Symposium on Microarchitecture (MICRO)](https://dl.acm.org/doi/10.1109/MICRO.2018.00035)


### 2017 

[Beyond the Socket: NUMA-Aware GPUs](/index.php/publication/2017-10_beyond-socket-numa-aware-gpus)

Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, [Aamer Jaleel](/index.php/person/aamer-jaleel), Alex Ramirez, [David Nellans](/index.php/person/david-nellans)


[International Symposium on Microarchitecture (MICRO)](https://dl.acm.org/citation.cfm?id=3124534)


[MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability](/index.php/publication/2017-06_mcm-gpu-multi-chip-module-gpus-continued-performance-scalability)

Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, [Aamer Jaleel](/index.php/person/aamer-jaleel), Carole-Jean Wu, [David Nellans](/index.php/person/david-nellans)


[International Symposium on Computer Architecture (ISCA)](https://doi.org/10.1145/3079856.3080231)


### 2016 

[Towards High Performance Paged Memory for GPUs](/publication/2016-03_towards-high-performance-paged-memory-gpus)

Tianhao Zheng, [David Nellans](/person/david-nellans), Arslan Zulfiqar, [Mark Stephenson](/person/mark-stephenson), [Steve Keckler](/person/stephen-keckler)


[International Symposium on High Performance Computer Architecture (HPCA)](https://ieeexplore.ieee.org/document/7446077)


[Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence](/publication/2016-03_selective-gpu-caches-eliminate-cpu-gpu-hw-cache-coherence)

Neha Agarwal, [David Nellans](/person/david-nellans), Eiman Ebrahimi, Thomas F. Wenisch, John Danskin, [Steve Keckler](/person/stephen-keckler)


[International Symposium on High Performance Computer Architecture (HPCA)](https://ieeexplore.ieee.org/document/7446089)


### 2015 

[Designing Efficient Heterogeneous Memory Architectures](/index.php/publication/2015-08_designing-efficient-heterogeneous-memory-architectures)

Evgeny Bolotin, [David Nellans](/index.php/person/david-nellans), Oreste Villa, [Mike O'Connor](/index.php/person/mike-o-connor), Alex Ramirez, [Steve Keckler](/index.php/person/stephen-keckler)


[IEEE Micro](https://ieeexplore.ieee.org/document/7155441)


[Flexible Software Profiling of GPU Architectures](/publication/2015-06_flexible-software-profiling-gpu-architectures)

[Mark Stephenson](/person/mark-stephenson), [Siva Hari](/person/siva-hari), Yunsup Lee, Eiman Ebrahimi, Daniel Johnson, [David Nellans](/person/david-nellans), [Mike O'Connor](/person/mike-o-connor), [Steve Keckler](/person/stephen-keckler)


[International Symposium on Computer Architecture (ISCA)](https://dl.acm.org/doi/10.1145/2749469.2750375)


[Page Placement Strategies for GPUs within Heterogeneous Memory Systems](/publication/2015-03_page-placement-strategies-gpus-within-heterogeneous-memory-systems)

Neha Agarwal, [David Nellans](/person/david-nellans), [Mark Stephenson](/person/mark-stephenson), [Mike O'Connor](/person/mike-o-connor), [Steve Keckler](/person/stephen-keckler)


[International Conference on Architectural Support for Programming Languages and…](http://dl.acm.org/citation.cfm?id=2694381)


[Unlocking Bandwidth for GPUs in CC-NUMA systems](/publication/2015-02_unlocking-bandwidth-gpus-cc-numa-systems)

Neha Agarwal, [David Nellans](/person/david-nellans), [Mike O'Connor](/person/mike-o-connor), [Steve Keckler](/person/stephen-keckler), Thomas Wenisch


[International Symposium on High Performance Computer Architecture (HPCA)](http://ieeexplore.ieee.org/document/7056046/)


### 2014 

[Scaling the Power Wall: A Path to Exascale](/publication/2014-11_scaling-power-wall-path-exascale)

Oreste Villa, Daniel Johnson, [Mike O'Connor](/person/mike-o-connor), Evgeny Bolotin, [David Nellans](/person/david-nellans), Justin Luitjens, Nikolai Sakharnykh, Peng Wang, Paulius Micikevicius, Anthony Scudiero, [Steve Keckler](/person/stephen-keckler), [William Dally](/person/william-dally)


[SC '14](http://ieeexplore.ieee.org/abstract/document/7013055/)