 # Fearless Concurrency on the GPU

 Rust has made safe systems programming practical on the CPU, but writing custom GPU kernels in Rust still forces programmers outside the language's ownership guarantees. We present cuTile Rust, a tile-based system for safe, idiomatic GPU kernel authoring in Rust. cuTile Rust extends Rust's ownership discipline to tile-based GPU kernels: mutable outputs are split into disjoint pieces, kernel launches preserve the host-side ownership contract, and programmers can opt out locally when they need lower-level control. The system also provides a composable host execution model spanning synchronous launches, asynchronous pipelines, and CUDA graph replay.
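The idea of splitting a mutable output into disjoint pieces can be illustrated on the CPU with standard Rust alone. The sketch below is an analogue, not cuTile Rust code: it uses only `std` (`chunks_mut` and scoped threads), and the function name `tiled_add` and the `tile` parameter are hypothetical names chosen for illustration. Each worker receives an exclusive `&mut` view of one output tile, so the borrow checker rejects overlapping writes at compile time, and the caller regains full ownership of the output once every worker has finished.

```rust
use std::thread;

/// CPU analogue of a tiled element-wise kernel: out[i] = a[i] + b[i].
/// `tile` is a hypothetical tile size, not a cuTile Rust parameter.
pub fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be positive");
    assert!(a.len() == out.len() && b.len() == out.len(), "length mismatch");

    // Scoped threads may borrow from the caller; every worker is joined
    // before `scope` returns, so the caller regains exclusive access to `out`.
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` tiles, so no two
        // workers can ever hold a mutable reference to the same element.
        for (idx, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = idx * tile;
            let end = start + out_tile.len();
            // Inputs are shared immutably; only the output is split.
            let a_tile = &a[start..end];
            let b_tile = &b[start..end];
            s.spawn(move || {
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = *x + *y;
                }
            });
        }
    });
}
```

Calling `tiled_add(&a, &b, &mut out, 2)` on five-element inputs spawns three workers over tiles of sizes 2, 2, and 1. Spawning one OS thread per tile is only for clarity here; it stands in for the per-tile parallelism a GPU launch would provide.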

Our evaluation shows that these abstractions can preserve performance on high-end GPUs. On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise. Grout, a cuTile-Rust-based inference engine, exercises cuTile Rust across an end-to-end Qwen3 inference path. In batch-1 decode, Grout reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200, competitive with vLLM and SGLang and consistent with an HBM roofline sanity check.
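A bandwidth roofline check of the kind mentioned above can be sketched as simple arithmetic. The sketch assumes the common simplification that batch-1 decode reads every model weight once per generated token, and it ignores KV-cache and activation traffic; it is not necessarily the methodology used in the paper. The function names and all input values in the usage note are hypothetical and are not taken from the paper.

```rust
/// Upper bound on batch-1 decode throughput (tokens/s) under a
/// memory-bandwidth roofline, assuming each generated token reads
/// every weight exactly once. Ignores KV-cache and activation traffic.
pub fn roofline_tokens_per_sec(
    bandwidth_bytes_per_sec: f64,
    param_count: f64,
    bytes_per_param: f64,
) -> f64 {
    // Total bytes streamed from device memory per generated token.
    let weight_bytes = param_count * bytes_per_param;
    bandwidth_bytes_per_sec / weight_bytes
}

/// Fraction of the roofline bound that a measured throughput reaches.
pub fn roofline_fraction(measured_tokens_per_sec: f64, bound_tokens_per_sec: f64) -> f64 {
    measured_tokens_per_sec / bound_tokens_per_sec
}
```

With illustrative inputs of 1.0e12 bytes/s of bandwidth, 4.0e9 parameters, and 2 bytes per parameter, the bound is 125 tokens/s, and a hypothetical measurement of 100 tokens/s would sit at 80% of it. A measured figure that exceeded the bound would indicate a mistake in either the measurement or the assumptions.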



 ## Authors



[Melih Elibol](/person/melih-elibol)

Jared Roesch (NVIDIA)

[Isaac Gelado](/person/isaac-gelado)

Eric Buehler (Hugging Face)

[Michael Garland](/person/michael-garland)

 

 

 ## Publication Date



Tuesday, June 16, 2026

 

 ## Published in



[arXiv:2606.15991](https://arxiv.org/abs/2606.15991)

 

 ## Research Area



[Artificial Intelligence and Machine Learning](/research-area/machine-learning-artificial-intelligence)

[High Performance Computing](/research-area/high-performance-computing)

[Programming Languages, Systems and Tools](/research-area/programming-languages-systems)

 

 

 ## External Links



[Source code (GitHub)](https://github.com/nvlabs/cutile-rs)

 

 

 ## Uploaded Files



[fearless\_concurrency\_on\_the\_gpu.pdf](https://d1qx31qr3h6wln.cloudfront.net/publications/fearless_concurrency_on_the_gpu.pdf?VersionId=nzsksnj1Wwkajyh1GbJUF9vkZhLeBmki) (721.59 KB)