High-speed Low-power On-chip Global Signaling Design Overview

On-chip global signaling in modern SoCs faces significant challenges due to wire pitch scaling and increasing die size. Conventional on-chip synchronous CMOS links have already hit a performance wall in power and latency. Although custom low-swing equalized serial-link techniques can yield improvements, the strict power and silicon budgets and the non-ideal in-situ conditions of large SoCs make their design much more challenging than simply moving off-chip signaling technologies on-chip. Therefore, a holistic approach to the on-chip global signaling problem is required.

A 6.5-to-23.3fJ/b/mm Balanced Charge-Recycling Bus in 16nm FinFET CMOS at 1.7-to-2.6Gb/s/wire with Clock Forwarding and Low-Crosstalk Contraflow Wiring

Signaling over chip-scale global interconnect consumes a growing fraction of total power in large processor chips as processes continue to shrink. Solving this growing crisis requires simple, low-energy, and area-efficient signaling for high-bandwidth data buses. This paper describes a balanced charge-recycling bus (BCRB) that achieves quadratic power savings relative to signaling with full-swing CMOS repeaters. The scheme stacks two CMOS repeated wire links: one operates in the Vtop domain, between Vdd and Vmid=Vdd/2, and the other in the Vbot domain, between Vmid and GND.
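The quadratic saving can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes an idealized wire whose switching energy is C*V^2 for a swing V, perfectly balanced activity on the two stacked links, and no overhead for holding Vmid. The wire capacitance and supply voltage are hypothetical values chosen only for illustration.

```python
# Idealized switching-energy comparison (assumption: E = C * V^2 per
# charging transition, no leakage, no Vmid regulation overhead).

def full_swing_energy(c_wire, vdd):
    """Energy drawn from the supply to charge a wire from GND to Vdd."""
    return c_wire * vdd ** 2


def stacked_energy_per_link(c_wire, vdd):
    """Energy per charging transition for a link confined to a Vdd/2 domain."""
    v_domain = vdd / 2.0
    return c_wire * v_domain ** 2


C_WIRE = 200e-15  # farads; hypothetical wire capacitance
VDD = 0.8         # volts; hypothetical supply voltage

ratio = stacked_energy_per_link(C_WIRE, VDD) / full_swing_energy(C_WIRE, VDD)
print(f"stacked / full-swing energy ratio: {ratio:.2f}")
```

Under these assumptions, halving the swing quarters the energy per transition, which is the sense in which the saving is quadratic. A real design also has to pay for keeping the midpoint balanced, which this sketch ignores.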

Deterministic Consistent Density Estimation for Light Transport Simulation

Quasi-Monte Carlo methods are often more efficient than Monte Carlo methods, mainly because deterministic low-discrepancy sequences are more uniformly distributed than independent random numbers can ever be. So far, tensor product quasi-Monte Carlo techniques have been the only deterministic approach to consistent density estimation. By avoiding the repeated computation of identical information, which is intrinsic to the tensor product approach, a more efficient quasi-Monte Carlo method is derived.
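The uniformity argument can be made concrete with a one-dimensional toy integral. The following sketch is only an illustration of low-discrepancy sampling in general, not of the density estimation method summarized above: it assumes the base-2 radical inverse (van der Corput sequence) as the deterministic point set, and the integrand, sample count, and random seed are arbitrary choices.

```python
import random


def radical_inverse(index, base=2):
    """Mirror the digits of `index` in `base` around the radix point."""
    inverse, digit_weight = 0.0, 1.0 / base
    while index > 0:
        inverse += (index % base) * digit_weight
        index //= base
        digit_weight /= base
    return inverse


def integrand(x):
    return x * x  # the integral over [0, 1) is exactly 1/3


def estimate(points):
    """Average the integrand over the given sample points."""
    return sum(integrand(x) for x in points) / len(points)


N = 1024  # arbitrary sample count
EXACT = 1.0 / 3.0

qmc_points = [radical_inverse(i) for i in range(N)]
rng = random.Random(0)  # fixed seed so the comparison is repeatable
mc_points = [rng.random() for _ in range(N)]

qmc_error = abs(estimate(qmc_points) - EXACT)
mc_error = abs(estimate(mc_points) - EXACT)
print(f"quasi-Monte Carlo error: {qmc_error:.2e}")
print(f"Monte Carlo error:       {mc_error:.2e}")
```

With 1024 points, the radical inverse visits every multiple of 1/1024 exactly once, so the samples cover the interval evenly, whereas independent random samples cluster and leave gaps.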

Path space similarity determined by Fourier histogram descriptors

We propose a simple technique for the efficient estimation of the similarity of light transport paths. Considering descriptors of the incident radiance, we improve both filtering [Keller et al. 2014] and caching-based [Ward et al. 1988] variance reduction techniques for image synthesis. So far, these techniques could not measure variations of material and lighting, as they only included geometric measures of similarity, such as the divergence of normals, irradiance gradients, and the distance between vertices storing information.

GI next: global illumination for production rendering on GPUs

The sheer size of texture data and the complexity of custom shaders in production rendering were the two major hurdles in the way of GPU acceleration. Requiring only tiny modifications of an existing production renderer, we are able to accelerate the computation of global illumination by more than an order of magnitude.

Path space filtering

Light transport simulation consists of summing up the contributions of light transport paths that connect sensors and light sources. Such light transport paths may be sampled by following photon trajectories from the lights, tracing paths from the camera, and connecting such path segments by proximity (photon mapping) or shadow rays (both dashed in black). Smoothing the contribution of light transport paths before reconstructing the image can efficiently reduce the noise inherent to sampling.

Efficient stackless hierarchy traversal on GPUs with backtracking in constant time

The fastest acceleration schemes for ray tracing rely on traversing a bounding volume hierarchy (BVH) for efficient culling and use backtracking, which in the worst case may incur a cost proportional to the depth of the hierarchy in either time or state memory. We show that the next node in such a traversal can actually be determined in constant time and state memory. In fact, our newly proposed parallel software implementation requires only a few modifications of existing traversal methods and outperforms the fastest stack-based algorithms on GPUs.
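One way to picture backtracking without a stack is a bit trail kept next to the index of the current node. The sketch below is a simplified illustration under several assumptions that the abstract does not state: the hierarchy is a complete binary tree stored in implicit heap order (root at index 1, children of k at 2k and 2k+1), the bounding volumes are 1D intervals, and the query is an interval rather than a ray. All function and variable names are hypothetical.

```python
def overlaps(bounds, q_lo, q_hi):
    """True if the interval `bounds` intersects the query [q_lo, q_hi]."""
    return bounds[0] <= q_hi and q_lo <= bounds[1]


def build(leaves):
    """Build heap-ordered bounds; len(leaves) must be a power of two."""
    n = len(leaves)
    nodes = [None] * (2 * n)
    nodes[n:] = leaves
    for k in range(n - 1, 0, -1):
        left, right = nodes[2 * k], nodes[2 * k + 1]
        nodes[k] = (min(left[0], right[0]), max(left[1], right[1]))
    return nodes


def traverse(nodes, q_lo, q_hi):
    """Collect indices of overlapping leaves using only `key` and `trail`."""
    n = len(nodes) // 2
    hits = []
    if not overlaps(nodes[1], q_lo, q_hi):
        return hits
    key, trail = 1, 0  # trail bit i: sibling pending i levels above `key`
    while True:
        if key >= n:
            hits.append(key - n)  # reached a leaf
        else:
            hit_left = overlaps(nodes[2 * key], q_lo, q_hi)
            hit_right = overlaps(nodes[2 * key + 1], q_lo, q_hi)
            if hit_left or hit_right:
                # Descend; record a pending sibling only if both children hit.
                trail = (trail << 1) | (1 if hit_left and hit_right else 0)
                key = 2 * key + (0 if hit_left else 1)
                continue
        if trail == 0:
            return hits  # nothing left to visit
        # Backtrack: jump to the nearest level with a pending sibling.
        shift = (trail & -trail).bit_length() - 1  # count trailing zeros
        trail = (trail >> shift) ^ 1  # clear the pending bit
        key = (key >> shift) ^ 1      # move to the sibling on that level


leaves = [(float(i), i + 0.5) for i in range(8)]
nodes = build(leaves)
print(traverse(nodes, 2.2, 5.1))
```

The backtracking step consists of a trailing-zero count, two shifts, and two bit flips, so its cost does not depend on the depth of the tree. In this toy version the heap index addresses the node directly; a general BVH would need some other mapping from the key to a node, which is outside the scope of this sketch.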

Stackless ray tracing of patches from feature-adaptive subdivision on GPUs

OpenSubdiv [Pixar 2012] is the de facto industry standard for the representation of subdivision surfaces. Its feature-adaptive subdivision [Nießner 2013] allows for efficient display using rasterization hardware. Based on this feature-adaptive refinement of creases, semi-sharp edges, and irregular patches, we introduce an efficient algorithm for ray tracing the resulting patches up to almost floating point precision.