Sparse Matrix-Vector Multiplication on Multicore and Accelerators

This chapter summarizes recent work on the development of highperformance multicore and accelerator-based implementations of sparse matrix-vector multiplication (SpMV). As an object of study, SpMV is an interesting computation for at least two reasons. First, it appears widely in applications in scienti c and engineering computing, nancial and economic modeling, and information retrieval, among others, and is therefore of great practical interest.

Allocation-oriented Algorithm Design with Application to GPU Computing, Ph.D. Dissertation

The wide data-parallelism of GPU processor design facilitates the execution of many concurrent, fine-grained tasks with unprecedented performance and efficiency. However, contemporary opinion regards GPU architecture as being poorly suited for many parallelizations that impose dynamic dependences between processing elements. This dissertation specifically attends to problems whose efficient solutions require cooperative allocation, i.e., their dependences stem from dynamic data placement within shared structures.

Scalable GPU Graph Traversal

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.

Gaussian Process Regression Flow for Analysis of Motion Trajectories

Recognition of motions and activities of objects in videos requires effective representations for analysis and matching of motion trajectories. In this paper, we introduce a new representation specifically aimed at matching motion trajectories. We model a trajectory as a continuous dense flow field from a sparse set of vector sequences using Gaussian Process Regression. Furthermore, we introduce a random sampling strategy for learning stable classes of motions from limited data.

Metering for Exposure Stacks

When creating a High-Dynamic-Range (HDR) image from a sequence of differently exposed Low-Dynamic-Range (LDR) images, the set of LDR images is usually generated by sampling the space of exposure times with a geometric progression and without explicitly accounting for the distribution of irradiance values of the scene. We argue that this choice can produce sub-optimal results both in terms of the number of acquired pictures and the quality of the resulting HDR image.

Improved Dual-Space Bounds for Simultaneous Motion and Defocus Blur

Our previous paper on stochastic rasterization [Laine et al. 2011] presented a method for constructing time and lens bounds to accelerate stochastic rasterization by skipping the costly 5D coverage test. Although the method works for the combined case of simultaneous motion and defocus blur, its efficiency drops when significant amounts of both effects are present. In this paper, we describe a bound computation method that treats time and lens domains in a unified fashion, and yields tight bounds also for the combined case.

GPUs and the Future of Parallel Computing

This article discusses the capabilities of state-of-the art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. NVIDIA Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

 

 

Optical Image Processing Using Light Modulation Displays

We propose to enhance the capabilities of the human visual system by performing optical image processing directly on an observed scene. Unlike previous work which additively superimposes imagery on a scene, or completely replaces scene imagery with a manipulated version, we perform all manipulation through the use of a light modulation display to spatially filter incoming light.

Efficient Triangle Coverage Tests for Stochastic Rasterization

In our previous paper on stochastic rasterization [Laine et al. 2011], we stated that a 5D triangle coverage test consumes approximately 25 FMA (fused multiply-add) operations. This technical report details the operation of our coverage test. We also provide variants specialized for defocus-only and motion-only cases.