 # Task-Based Tensor Computations on Modern GPUs


Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA’s Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures required: programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. A Cypress program is a set of designated functions, called tasks, that operate on tensors and are free of communication and synchronization. Cypress programs are bound to the target machine through a mapping specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM and 0.80x-0.98x the performance of the currently best-known Flash Attention implementation, while eliminating all aspects of explicit data movement and asynchronous computation from application code.
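The separation the abstract describes, between a sequential task program and a mapping specification, can be illustrated with a small sketch. This is not Cypress syntax and not the authors' implementation: the names `Mapping`, `gemm`, `gemm_leaf`, and `run` are hypothetical, the `processor` and `memory` fields are only descriptive labels, and the code runs as plain Python on the CPU. It is meant only to show that a task written with sequential semantics can yield the same result under different mappings.

```python
# Hypothetical illustration only: none of these names come from Cypress.
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Assumed stand-in for a mapping specification."""
    processor: str  # descriptive label only, e.g. "block" or "warpgroup"
    memory: str     # descriptive label only, e.g. "global" or "shared"
    tile: int       # tile edge length used to split the task


def gemm_leaf(A, B, C, rows, cols, ks):
    """Leaf task: C[rows, cols] += A[rows, ks] @ B[ks, cols]."""
    for i in rows:
        for j in cols:
            acc = 0
            for k in ks:
                acc += A[i][k] * B[k][j]
            C[i][j] += acc


def gemm(A, B, C, mapping):
    """Logical task: sequential code with no communication or synchronization.

    The mapping only decides how the work is tiled; it does not change
    what is computed.
    """
    m, n, kdim = len(A), len(B[0]), len(B)
    t = mapping.tile
    for i0 in range(0, m, t):
        for j0 in range(0, n, t):
            for k0 in range(0, kdim, t):
                gemm_leaf(
                    A, B, C,
                    range(i0, min(i0 + t, m)),
                    range(j0, min(j0 + t, n)),
                    range(k0, min(k0 + t, kdim)),
                )
    return C


def run(mapping):
    """Multiply a fixed 2x3 matrix by a fixed 3x2 matrix under a mapping."""
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    C = [[0, 0], [0, 0]]
    return gemm(A, B, C, mapping)
```

Calling `run(Mapping("block", "shared", 1))` and `run(Mapping("warpgroup", "shared", 2))` produces the same matrix, since only the tiling differs. In the system the abstract describes, the compiler would use the mapping to generate the data movement and asynchronous pipelines; this sketch models none of that.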



 ## Authors



Rohan Yadav (Stanford University)

[Michael Garland](/index.php/person/michael-garland)

Alex Aiken (Stanford University)

[Michael Bauer](/index.php/person/mike-bauer)

 

 

 ## Publication Date



Monday, June 16, 2025

 

 ## Published in



[PLDI](https://pldi25.sigplan.org/)

 

 ## Research Area



[High Performance Computing](/index.php/research-area/high-performance-computing)

[Programming Languages, Systems and Tools](/index.php/research-area/programming-languages-systems)

 

 

 ## Uploaded Files



[Cypress\_PLDI\_25.pdf](https://d1qx31qr3h6wln.cloudfront.net/publications/Cypress_PLDI_25.pdf) (813.68 KB)