Global collectives (reductions/aggregations) are ubiquitous and feature in nearly every application of distributed high-performance computing (HPC). While it is advisable to devise algorithms by placing collectives off the critical path of execution, they are sometimes unavoidable for correctness, numerical convergence and analyses purposes. Scalable algorithms for distributed collectives are well studied and have become an integral part of MPI, but new and emerging distributed computing frameworks and paradigms such as Asynchronous Many-Task (AMT) models lack the same sophistication for distributed collectives. Since the central promise of AMT runtimes is that they automatically discover, and expose, task dependencies in the underlying program and can schedule work optimally to minimize idle time and hide data movement, a naively designed collectives protocol can completely offset any gains made from asynchronous execution. In this study we demonstrate that scalable distributed collectives are indispensable for performance in AMT models. We design, implement and test the performance of a scalable collective algorithm in Legion, an exemplar data-centric AMT programming model. Our results show that AMT systems contain the necessary primitives that allow for fully scalable collectives without breaking the transparent data movement abstractions. Scalability tests of an integrated Legion 1D stencil mini-application show the clear benefit of implementing scalable collectives and the performance degradation when a naive collectives alternative is used instead.
This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to firstname.lastname@example.org.