NOVA: A Functional Language for Data Parallelism
"NOVA: A Functional Language for Data Parallelism"
Alex Collins (University of Edinburgh), Dominik Grewe (University of Edinburgh), Vinod Grover (NVIDIA), Sean Lee (NVIDIA), Adriana Susnea (NVIDIA), July 2013
|Research Area:||Tools & Libraries|
|Author(s):||Alex Collins (University of Edinburgh), Dominik Grewe (University of Edinburgh), Vinod Grover (NVIDIA), Sean Lee (NVIDIA), Adriana Susnea (NVIDIA)|
Functional languages provide a solid foundation on which complex optimization passes can be designed to exploit available parallelism in the underlying system. Their mathematical foundations enable high-level optimizations that would be impossible in traditional imperative languages. This makes them uniquely suited for generation of efficient target code for parallel systems, such as multiple Central Processing Units (CPUs) or highly data-parallel Graphics Processing Units (GPUs). Such systems are becoming the mainstream for scientific and ‘desktop’ computing.
Writing performance portable code for such systems using lowlevel languages requires significant effort from a human expert. This paper presents NOVA, a functional language and compiler for multi-core CPUs and GPUs. The NOVA language is a polymorphic, statically-typed functional language with a suite of higher-order functions which are used to express parallelism. These include map, reduce and scan. The NOVA compiler is a light-weight, yet powerful, optimizing compiler. It generates code for a variety of target platforms that achieve performance comparable to competing languages and tools, including hand-optimized code. The NOVA compiler is stand-alone and can be easily used as a target for higher-level or domain specific languages or embedded in other applications.
We evaluate NOVA against two competing approaches: the Thrust library and hand-written CUDA C. NOVA achieves comparable performance to these approaches across a range of benchmarks. NOVA-generated code also scales linearly with the number of processor cores across all compute-bound benchmarks.