BabelFlow: An Embedded Domain Specific Language for Parallel Analysis and Visualization

The rapid growth in simulation data requires large-scale parallel implementations of scientific analysis and visualization algorithms, both to produce results within an acceptable timeframe and to enable in situ deployment. However, efficient and scalable implementations, especially of more complex analysis approaches, require not only advanced algorithms but also in-depth knowledge of the underlying runtime. Furthermore, different machine configurations and different applications may favor different runtimes (e.g., MPI, Charm++, or Legion) and different hardware architectures.
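
As a rough illustration of the embedded-DSL idea, the sketch below (hypothetical Python, not BabelFlow's actual API) declares an analysis pipeline once as a runtime-agnostic task graph and hands it to a pluggable backend; an MPI, Charm++, or Legion backend would interpret the same graph with its own scheduling primitives:

class TaskGraph:
    """Runtime-agnostic dataflow description."""
    def __init__(self):
        self.tasks = {}   # task id -> callable
        self.edges = []   # (producer id, consumer id)

    def add_task(self, tid, fn):
        self.tasks[tid] = fn

    def connect(self, src, dst):
        self.edges.append((src, dst))

def run_serial(graph, inputs):
    # Trivial single-process backend; relies on insertion order for brevity,
    # so tasks must be added in a valid topological order.
    data = dict(inputs)
    for tid, fn in graph.tasks.items():
        preds = [data[s] for (s, d) in graph.edges if d == tid]
        data[tid] = fn(*preds) if preds else fn()
    return data

g = TaskGraph()
g.add_task("load", lambda: [1.0, 2.0, 3.0])
g.add_task("reduce", lambda xs: sum(xs))
g.connect("load", "reduce")
print(run_serial(g, {})["reduce"])   # 6.0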

Scalable Collectives for Distributed Asynchronous Many-Task Runtimes

Global collectives (reductions/aggregations) feature in nearly every application of distributed high-performance computing (HPC). While it is advisable to design algorithms that keep collectives off the critical path of execution, they are sometimes unavoidable for correctness, numerical convergence, and analysis purposes.
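
A standard building block for such collectives is the k-ary reduction tree, which keeps the depth at O(log_k n) instead of serializing n contributions. The plain-Python sketch below shows the shape only; in an asynchronous many-task runtime each level would be a wave of independent tasks:

def tree_reduce(values, op, k=4):
    level = list(values)
    while len(level) > 1:
        # Each group of up to k children reduces into one parent.
        level = [fold(level[i:i + k], op) for i in range(0, len(level), k)]
    return level[0]

def fold(group, op):
    acc = group[0]
    for v in group[1:]:
        acc = op(acc, v)
    return acc

print(tree_reduce(range(16), lambda a, b: a + b))   # 120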

Integrating External Resources with a Task-Based Programming Model

Accessing external resources (e.g., loading input data, checkpointing snapshots, and out-of-core processing) can have a significant impact on the performance of applications. However, no existing programming systems for high-performance computing directly manage and optimize external accesses. As a result, users must explicitly manage external accesses alongside their computation at the application level, which can result in both correctness and performance issues.
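
The underlying idea can be sketched in a few lines of Python (a thread pool and a made-up file name stand in for a real task runtime and dataset): once an external access is expressed as a task with explicit dependencies, the scheduler can overlap it with computation instead of blocking the whole application on it.

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def load_task(path):
    # Stand-in for reading input data from an external resource.
    return pool.submit(lambda: f"contents of {path}")

def compute_task(data_future):
    # Chained on the load task; runs once its input is ready.
    return pool.submit(lambda: len(data_future.result()))

data = load_task("snapshot.h5")   # hypothetical file; I/O issued asynchronously
result = compute_task(data)       # compute overlaps with any other pending work
print(result.result())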

Control Replication: Compiling Implicit Parallelism to Efficient SPMD with Logical Regions

We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and naturally avoid the pitfalls of explicitly parallel code. However, without optimizations to distribute control overhead, scalability is often poor.
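
The transformation can be pictured in miniature (an illustrative Python sketch only; the real optimization operates on Legion's logical regions inside the compiler):

def task(x):
    return x * x

def implicitly_parallel(data):
    # Sequential semantics: a single control thread launches every task,
    # so launch overhead grows with the problem size.
    return [task(x) for x in data]

def spmd_shard(data, rank, num_shards):
    # After control replication: each of num_shards control threads
    # launches only its own slice, distributing the control overhead.
    return [task(x) for x in data[rank::num_shards]]

data = list(range(8))
shards = [spmd_shard(data, r, 4) for r in range(4)]
# All shards together cover exactly the work of the sequential version.
assert sorted(sum(shards, [])) == sorted(implicitly_parallel(data))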

A Novel Shard-Based Approach for Asynchronous Many-Task Models for In Situ Analysis

We present the current status of our work towards a scalable, asynchronous many-task, in situ statistical analysis engine using the Legion runtime system. This expands upon earlier work that was limited to a prototype implementation with a proxy mini-application as a surrogate for a full-scale scientific simulation code. In contrast, we have more recently integrated our in situ analysis engines with S3D, a full-size scientific application, and conducted numerical tests on the largest computational platform currently available for DOE science applications.

DAGguise: Mitigating Memory Timing Side Channels

This paper studies the mitigation of memory timing side channels, where attackers utilize contention within DRAM controllers to infer a victim’s secrets. Already practical, this class of channels poses an important challenge to secure computing in shared memory environments.

Existing state-of-the-art memory timing side channel mitigations have several key performance and security limitations. Prior schemes require onerous static bandwidth partitioning or extensive profiling phases, or they simply fail to protect against attacks that exploit fine-grained timing and bank information.
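
At a high level, DAGguise shapes the victim's memory traffic to a fixed, secret-independent request pattern. The toy Python sketch below illustrates that general shaping idea only (the actual scheme schedules requests via a directed acyclic request graph, not a flat slot list): two different demand patterns produce identically timed traces.

import collections

def shaped_issue(demand, schedule):
    # Emit exactly one request per public slot, padding with dummies,
    # so the observable timing is independent of the real demand.
    pending = collections.deque(demand)
    return [(slot, pending.popleft() if pending else "DUMMY")
            for slot in schedule]

print(shaped_issue(["A", "B"], schedule=[0, 4, 8, 12]))
print(shaped_issue(["A", "B", "C", "D"], schedule=[0, 4, 8, 12]))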

GAMMA: Exploiting Gustavson’s Algorithm to Accelerate Sparse Matrix Multiplication

Sparse matrix-sparse matrix multiplication (spMspM) is at the heart of a wide range of scientific and machine learning applications. spMspM is inefficient on general-purpose architectures, making accelerators attractive. However, prior spMspM accelerators use inner- or outer-product dataflows that suffer poor input or output reuse, leading to high traffic and poor performance.
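
Gustavson's dataflow, which GAMMA builds on, forms each output row by scaling and merging the rows of B selected by the nonzeros of the corresponding row of A, giving good reuse of both inputs and output. A minimal Python sketch over dict-of-dicts sparse matrices:

def gustavson_spmspm(A, B):
    """A, B: {row: {col: value}} sparse matrices; returns A @ B."""
    C = {}
    for i, a_row in A.items():
        acc = {}                                   # sparse accumulator for C's row i
        for k, a_ik in a_row.items():              # nonzeros of A's row i
            for j, b_kj in B.get(k, {}).items():   # row k of B
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {2: 6}}
print(gustavson_spmspm(A, B))   # {0: {1: 4, 2: 12}, 1: {0: 15}}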

CaSA: End-to-end Quantitative Security Analysis of Randomly Mapped Caches

It is well known that there are micro-architectural vulnerabilities that enable an attacker to use caches to exfiltrate secrets from a victim. These vulnerabilities exploit the fact that the attacker can detect cache lines that were accessed by the victim. Therefore, architects have looked at different forms of randomization to thwart the attacker’s ability to communicate using the cache.
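
The class of design under analysis replaces the fixed address-bit set index with a keyed mapping, so an attacker cannot directly construct addresses that collide with the victim's cache sets. Illustrative Python only (real designs use lightweight hardware ciphers, not blake2b):

import hashlib

NUM_SETS = 1024

def set_index(addr, key):
    digest = hashlib.blake2b(addr.to_bytes(8, "little"), key=key).digest()
    return int.from_bytes(digest[:4], "little") % NUM_SETS

key = b"per-boot-secret!"
# Under classic indexing, addr and addr + NUM_SETS*64 share a set;
# under a keyed mapping they usually do not.
print(set_index(0x1000, key), set_index(0x1000 + NUM_SETS * 64, key))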

How to Evaluate Deep Neural Network Processors: TOPS/W (Alone) Considered Harmful

A significant amount of specialized hardware has been developed for processing deep neural networks (DNNs) in both academia and industry. This article aims to highlight the key concepts required to evaluate and compare these DNN processors. We discuss existing challenges, such as the flexibility and scalability needed to support a wide range of neural networks, as well as design considerations for both the DNN processors and the DNN models themselves.
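
A worked example of the article's core point: two processors with identical peak TOPS/W can deliver very different efficiency once utilization on a real network is accounted for (the numbers below are illustrative, not drawn from the article):

def delivered_tops_per_w(peak_tops, watts, utilization):
    # Delivered efficiency scales the peak rate by how busy the MACs stay.
    return peak_tops * utilization / watts

# Both processors advertise 2.0 peak TOPS/W...
a = delivered_tops_per_w(peak_tops=10.0, watts=5.0, utilization=0.90)
b = delivered_tops_per_w(peak_tops=20.0, watts=10.0, utilization=0.30)
print(f"A delivers {a:.2f} TOPS/W, B delivers {b:.2f} TOPS/W")
# A delivers 1.80, B delivers 0.60: same headline metric, 3x gap in practice.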