Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors

Accelerators such as GPUs have proven highly successful in reducing the execution time and power consumption of compute-intensive applications. Although they are already used pervasively, they are typically supervised by general-purpose CPUs, which results in frequent control flow switches and data transfers because the CPUs handle all communication tasks. However, we observe that accelerators have recently been augmented with peer-to-peer communication capabilities that allow them to source and sink traffic autonomously. While appropriate hardware support is becoming available, it seems that the right communication semantics are yet to be identified. Maintaining the semantics of existing communication models, such as the Message Passing Interface (MPI), seems problematic because they were designed for the CPU’s execution model, which differs inherently from that of such specialized processors. In this paper, we analyze the compatibility of traditional message passing with massively parallel Single Instruction Multiple Thread (SIMT) architectures, as represented by GPUs, and focus on the message matching problem. We begin with a fully MPI-compliant set of guarantees, including tag and source wildcards and message ordering. Based on an analysis of exascale proxy applications, we then relax these guarantees to adapt message passing to the GPU’s execution model. We present suitable algorithms for message matching on GPUs that can yield matching rates of 60M and 500M matches/s, depending on the constraints that are being relaxed. We discuss our experiments and develop an understanding of the mismatch between current message passing protocols and the architecture and execution model of SIMT processors.
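To make the message matching problem concrete, the following is a minimal sequential sketch of matching with source and tag wildcards and in-order search. It is not the paper's GPU algorithm; the names `ANY_SOURCE`, `ANY_TAG`, `matches`, and `match_message` are hypothetical and chosen only for illustration, loosely mirroring MPI's `MPI_ANY_SOURCE` and `MPI_ANY_TAG`.

```python
# Hypothetical wildcard constants for illustration only,
# analogous in spirit to MPI_ANY_SOURCE and MPI_ANY_TAG.
ANY_SOURCE = -1
ANY_TAG = -1


def matches(recv, msg):
    """Return True if a posted receive (source, tag) accepts a message (source, tag)."""
    recv_src, recv_tag = recv
    msg_src, msg_tag = msg
    return recv_src in (ANY_SOURCE, msg_src) and recv_tag in (ANY_TAG, msg_tag)


def match_message(posted_receives, msg):
    """Remove and return the earliest posted receive that matches msg.

    Scanning the list front to back preserves posting order, which is the
    ordering guarantee that makes matching inherently sequential.
    Returns None if no receive matches (an "unexpected" message).
    """
    for i, recv in enumerate(posted_receives):
        if matches(recv, msg):
            del posted_receives[i]
            return recv
    return None


# Receives are listed in the order they were posted.
posted = [(0, 7), (ANY_SOURCE, 7), (1, ANY_TAG)]

first = match_message(posted, (1, 7))   # (0, 7) is skipped: wrong source
second = match_message(posted, (1, 7))  # only the tag wildcard remains eligible
third = match_message(posted, (2, 9))   # no posted receive accepts this message
```

Under these assumptions, every incoming message may have to scan the whole list in order, and wildcards prevent a simple lookup keyed on (source, tag). Relaxing wildcards or ordering removes exactly these constraints.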

Authors

Benjamin Klenk (ZITI, Institute for Computer Engineering, Heidelberg University, Mannheim, Germany)

Holger Fröning (ZITI, Institute for Computer Engineering, Heidelberg University, Mannheim, Germany)

Hans Eberle

Larry Dennison

Publication Date

Thursday, June 1, 2017

Published in

32nd IEEE International Parallel and Distributed Processing

Research Area

High Performance Computing

Networking

Programming Languages, Systems and Tools

Uploaded Files

2017-IPDPS-camera-ready.pdf (936.01 KB)

Award

Best Paper Award

Copyright

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org.