Nimble Page Management for Tiered Memory Systems

Software-controlled heterogeneous memory systems have the potential to increase the performance and cost efficiency of computing systems. However they can only deliver on this promise if supported by efficient page management policies and mechanisms within the operating system (OS). Current OS implementations do not support efficient tiering of data between heterogeneous memories. Instead, they rely on expensive offlining of memory or swapping data to disk as a means of profiling and migrating hot or cold data between memory nodes.

Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence

Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor.

Exascale Deep Learning for Climate Analytics

We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision.

Dynamic Tracing: Memoization of Task Graphs for Dynamic Task-based Runtimes

Many recent programming systems for both supercomputing and data center workloads generate task graphs to express computations that run on parallel and distributed machines. Due to the overhead associated with constructing these graphs the dependence analysis that generates them is often statically computed and memoized, and the resulting graph executed repeatedly at runtime. However, many applications require a dynamic dependence analysis due to data dependent behavior, but there are new challenges in capturing and re-executing task graphs at runtime.

Isometry: A Path-Based Distributed Data Transfer System

Data transfers in parallel systems have a significant impact on the performance of applications. Most existing systems generally support only data transfers between memories with a direct hardware connection and have limited facilities for handling transformations to the data’s layout in memory. As a result, to move data between memories that are not directly connected, higher levels of the software stack must explicitly divide a multi-hop transfer into a sequence of single-hop transfers and decide how and where to perform data layout conversions if needed.

BabelFlow: An Embedded Domain Specific Language for Parallel Analysis and Visualization

The rapid growth in simulation data requires large-scale parallel implementations of scientific analysis and visualization algorithms, both to produce results within an acceptable timeframe and to enable in situ deployment. However, efficient and scalable implementations, especially of more complex analysis approaches, require not only advanced algorithms, but also an in-depth knowledge of the underlying runtime. Furthermore, different machine configurations and different applications may favor different runtimes, i.e., MPI vs Charm++ vs Legion, etc., and different hardware architectures.

Scalable Collectives for Distributed Asynchronous Many-Task Runtimes

Global collectives (reductions/aggregations) are ubiquitous and feature in nearly every application of distributed high-performance computing (HPC). While it is advisable to devise algorithms by placing collectives off the critical path of execution, they are sometimes unavoidable for correctness, numerical convergence and analyses purposes.

Integrating External Resources with a Task-Based Programming Model

Accessing external resources (e.g., loading input data, checkpointing snapshots, and out-of-core processing) can have a significant impact on the performance of applications. However, no existing programming systems for high- performance computing directly manage and optimize external accesses. As a result, users must explicitly manage external accesses alongside their computation at the application level, which can result in both correctness and performance issues.

Control Replication: Compiling Implicit Parallelism to Efficient SPMD with Logical Regions

We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and naturally avoid the pitfalls of explicitly parallel code. However, without optimizations to distribute control overhead, scalability is often poor.

A Novel Shard-Based Approach for Asynchronous Many-Task Models for In Situ Analysis

We present the current status of our work towards a scalable, asynchronous many-task, in situ statistical analysis engine using the Legion runtime system, expanding upon earlier work, that was limited to a prototype implementation with a proxy mini-application as a surrogate for a full-scale scientific simulation code. In contrast, we have more recently integrated our in situ analysis engines with S3D, a full-size scientific application, and conducted numerical tests therewith on the largest computational platform currently available for DOE science applications.