Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs

GPUs offer orders-of-magnitude higher memory bandwidth than traditional CPU-only systems. However, GPU device memory tends to be relatively small and the memory capacity can not be increased by the user. This paper describes Buddy Compression, a scheme to increase both the effective GPU memory capacity and bandwidth while avoiding the downsides of conventional memory-expanding strategies. Buddy Compression compresses GPU memory, splitting each compressed memory entry between high-speed device memory and a slower-but-larger disaggregated memory pool (or system memory).

Exascale Deep Learning for Climate Analytics

We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision.

Highly-scalable, Physics-informed GANs for Learning Solutions of Stochastic PDEs

Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales of the Hanford Site require training a computationally intensive GAN model to thousands of dimensions.

Exascale Deep Learning for Scientific Inverse Problems

We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit Supercomputer.

Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance

We present Task Bench, a parameterized benchmark designed to explore the performance of parallel and distributed programming systems under a variety of application scenarios. Task Bench lowers the barrier to benchmarking multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench's parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications.

Song Han

Song Han's research interest is efficient deep learning computing. He received his PhD degree from Stanford University advised by Prof. Bill Dally. He is also an Associate Professor at MIT (songhan.mit.edu). Song proposed the “Deep Compression” technique that is widely used for efficient AI, and “Efficient Inference Engine” that first brought weight sparsity to modern AI accelerator design.

LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches

Emerging non-volatile memory (NVM) technologies, such as spin-transfer torque RAM (STT-RAM), are attractive options for replacing or augmenting SRAM in implementing last-level caches (LLCs). However, the asymmetric read/write energy and latency associated with NVM introduces new challenges in designing caches where, in contrast to SRAM, dynamic energy from write operations can be responsible for a larger fraction of total cache energy than leakage.

The Bunker Cache for Spatio-Value Approximation

The cost of moving and storing data is still a fun- damental concern for computer architects. Inefficient handling of data can be attributed to conventional architectures being oblivious to the nature of the values that these data bits carry. We observe the phenomenon of spatio-value similarity, where data elements that are approximately similar in value exhibit spatial regularity in memory. This is inherent to 1) the data values of real-world applications, and 2) the way we store data structures in memory.