Metering for Exposure Stacks

When creating a High-Dynamic-Range (HDR) image from a sequence of differently exposed Low-Dynamic-Range (LDR) images, the set of LDR images is usually generated by sampling the space of exposure times with a geometric progression and without explicitly accounting for the distribution of irradiance values of the scene. We argue that this choice can produce sub-optimal results in terms of both the number of acquired pictures and the quality of the resulting HDR image.
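
The conventional practice this abstract questions can be made concrete with a short sketch (not from the paper; the kernel name, the hat weighting, and the fixed power-of-two ratio are illustrative): exposure times are spaced geometrically, e.g. t_i = t0 * 2^i, and the LDR shots are merged into relative irradiance with a weighted average that discounts clipped pixels.

```cuda
#include <cuda_runtime.h>

// Illustrative merge of a conventionally metered exposure stack (not the paper's method).
// Host side would typically fill times[] geometrically, e.g. times[i] = t0 * powf(2.0f, i).

__device__ float hatWeight(float v)            // de-emphasize clipped and noisy pixels
{
    return 1.0f - fabsf(2.0f * v - 1.0f);      // peaks at mid-gray, 0 at 0 and 1
}

__global__ void mergeStack(const float* ldr,   // numShots x numPixels, linear values in [0,1]
                           const float* times, // exposure time per shot
                           float* hdr,         // numPixels, recovered relative irradiance
                           int numShots, int numPixels)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPixels) return;

    float sum = 0.0f, wsum = 0.0f;
    for (int s = 0; s < numShots; ++s)
    {
        float v = ldr[s * numPixels + p];
        float w = hatWeight(v);
        sum  += w * v / times[s];              // irradiance estimate from shot s
        wsum += w;
    }
    hdr[p] = (wsum > 0.0f) ? sum / wsum : 0.0f;
}
```

A metering-aware scheme, as the abstract argues, would instead choose the entries of times[] from the scene's irradiance distribution rather than from a fixed ratio.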

Improved Dual-Space Bounds for Simultaneous Motion and Defocus Blur

Our previous paper on stochastic rasterization [Laine et al. 2011] presented a method for constructing time and lens bounds to accelerate stochastic rasterization by skipping the costly 5D coverage test. Although the method works for the combined case of simultaneous motion and defocus blur, its efficiency drops when significant amounts of both effects are present. In this paper, we describe a bound computation method that treats time and lens domains in a unified fashion and yields tight bounds for the combined case as well.

GPUs and the Future of Parallel Computing

This article discusses the capabilities of state-of-the-art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. NVIDIA Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

Optical Image Processing Using Light Modulation Displays

We propose to enhance the capabilities of the human visual system by performing optical image processing directly on an observed scene. Unlike previous work, which additively superimposes imagery on a scene or completely replaces scene imagery with a manipulated version, we perform all manipulation with a light modulation display that spatially filters incoming light.

Efficient Triangle Coverage Tests for Stochastic Rasterization

In our previous paper on stochastic rasterization [Laine et al. 2011], we stated that a 5D triangle coverage test consumes approximately 25 FMA (fused multiply-add) operations. This technical report details the operation of our coverage test. We also provide variants specialized for defocus-only and motion-only cases.

NVIDIA Tesla: A Unified Graphics and Computing Architecture

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

A User-Programmable Vertex Engine

In this paper we describe the design, programming interface, and implementation of a very efficient user-programmable vertex engine. The vertex engine of NVIDIA's GeForce3 GPU evolved from a highly tuned fixed-function pipeline requiring considerable knowledge to program. Programs operate only on a stream of independent vertices traversing the pipeline. Embedded in the broader fixed-function pipeline, our approach preserves the parallelism sacrificed by previous approaches.

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

As the computational power of GPUs continues to scale with Moore's Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms.
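
The division of labor described here can be sketched in plain CUDA. This is not the CudaDMA library's API: the warp roles, tile size, kernel name, and the block-wide __syncthreads() standing in for the library's finer-grained inter-warp producer-consumer synchronization are all simplifications.

```cuda
#include <cuda_runtime.h>

#define TILE 256   // elements staged per iteration (illustrative)

// Launch with blockDim.x a multiple of 32 and at least 64, and *out zero-initialized.
__global__ void warpSpecializedSum(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];

    int  warpId   = threadIdx.x / 32;
    int  lane     = threadIdx.x % 32;
    int  numWarps = blockDim.x / 32;
    bool isDma    = (warpId == 0);         // first warp is dedicated to data movement

    float acc = 0.0f;

    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE)
    {
        if (isDma)
        {
            // DMA warp: issue all global-memory loads for this tile.
            for (int i = lane; i < TILE; i += 32)
                tile[i] = (base + i < n) ? in[base + i] : 0.0f;
        }
        __syncthreads();                   // stands in for producer-to-consumer handoff

        if (!isDma)
        {
            // Compute warps: consume the staged tile from shared memory.
            int computeTid   = threadIdx.x - 32;
            int computeCount = (numWarps - 1) * 32;
            for (int i = computeTid; i < TILE; i += computeCount)
                acc += tile[i];
        }
        __syncthreads();                   // stands in for consumer-to-producer handoff
    }

    if (!isDma)
        atomicAdd(out, acc);               // crude final reduction; enough for a sketch
}
```

Dedicating a warp to loads lets its memory requests be issued back to back, which is the memory-level-parallelism benefit the abstract refers to; the actual library avoids stalling the compute warps at each handoff.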

Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors

Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we examine a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency.

High Performance and Scalable GPU Graph Traversal

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.
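
To ground the contrast the abstract draws, one level of work-efficient, frontier-based expansion over a CSR graph can be sketched as below. This is illustrative, not the paper's implementation; the one-thread-per-frontier-vertex mapping and all names are assumptions, and real implementations balance the work of high-degree vertices far more carefully.

```cuda
#include <cuda_runtime.h>

// One level of frontier-based BFS over a CSR graph (illustrative sketch).
// rowPtr/colIdx: CSR adjacency; labels: BFS depth per vertex (-1 = unvisited).
// frontier / nextFrontier: vertex queues for the current and next level.
__global__ void bfsExpandLevel(const int* rowPtr, const int* colIdx,
                               int* labels, int level,
                               const int* frontier, int frontierSize,
                               int* nextFrontier, int* nextFrontierSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;

    int v = frontier[i];
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
    {
        int u = colIdx[e];
        // Claim unvisited neighbors exactly once with an atomic compare-and-swap.
        if (atomicCAS(&labels[u], -1, level + 1) == -1)
        {
            int slot = atomicAdd(nextFrontierSize, 1);
            nextFrontier[slot] = u;
        }
    }
}

// Host loop (sketch): launch bfsExpandLevel once per level, swapping the frontier
// buffers and resetting *nextFrontierSize to 0 between launches, until the frontier
// is empty.
```

Because each level touches only the edges incident to the current frontier, total work stays proportional to the number of vertices plus edges; the quadratic, level-scan approaches the abstract criticizes instead rescan every vertex at every level, which is why they degrade on graphs with non-trivial diameter.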