Priority-Based Cache Allocation in Throughput Processors

GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, in turn, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth.

Scaling the Power Wall: A Path to Exascale

Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020, and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide more than a 20× improvement in energy efficiency.

High-speed Low-power On-chip Global Signaling Design Overview

On-chip global signaling in modern SoCs faces significant challenges due to wire pitch scaling and increasing die size. Conventional on-chip synchronous CMOS links have already hit a performance wall in power and latency. Although approaches based on custom low-swing equalized serial-link techniques can yield improvements, strict power/silicon budgets and non-ideal in-situ conditions of large SoCs make their design much more challenging than simply transitioning off-chip signaling technologies to on-chip. Therefore, a holistic approach to the on-chip global signaling problem is required.

A 6.5-to-23.3fJ/b/mm Balanced Charge-Recycling Bus in 16nm FinFET CMOS at 1.7-to-2.6Gb/s/wire with Clock Forwarding and Low-Crosstalk Contraflow Wiring

Signaling over chip-scale global interconnect is consuming a larger fraction of total power in large processor chips as processes continue to shrink. Solving this growing crisis requires simple, low-energy, and area-efficient signaling for high-bandwidth data buses. This paper describes a balanced charge-recycling bus (BCRB) that achieves quadratic power savings relative to signaling with full-swing CMOS repeaters. The scheme stacks two CMOS repeated wire links: one operating in the Vtop domain, between Vdd and Vmid = Vdd/2, and the other in the Vbot domain, between Vmid and GND.
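The "quadratic" saving can be illustrated with the textbook first-order model of wire switching energy, E = C·V², where V is the voltage swing. The sketch below is only that model, not the paper's measured results: the function names and the capacitance and supply values are hypothetical, and it assumes ideal conditions (perfectly balanced activity between the two stacked links, no losses in holding Vmid).

```python
def full_swing_energy(c_wire, vdd):
    """First-order energy to charge a wire of capacitance c_wire through a full Vdd swing."""
    return c_wire * vdd ** 2


def half_swing_energy(c_wire, vdd):
    """Same model for one stacked link, whose swing is Vdd/2 (Vdd..Vmid or Vmid..GND)."""
    swing = vdd / 2.0
    return c_wire * swing ** 2


# Hypothetical illustration values, not taken from the paper.
c_wire = 200e-15  # 200 fF of wire capacitance
vdd = 0.8         # supply voltage in volts

e_full = full_swing_energy(c_wire, vdd)
e_half = half_swing_energy(c_wire, vdd)
ratio = e_full / e_half  # halving the swing cuts the modeled energy by 4x
print(f"full swing: {e_full:.3e} J, half swing: {e_half:.3e} J, ratio: {ratio:.1f}")
```

In this idealized model, halving the swing reduces switching energy by a factor of four, which is the sense in which the saving is quadratic in the swing.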

Deterministic Consistent Density Estimation for Light Transport Simulation

Quasi-Monte Carlo methods are often more efficient than Monte Carlo methods, mainly because deterministic low-discrepancy sequences are more uniformly distributed than independent random numbers can ever be. So far, tensor product quasi-Monte Carlo techniques have been the only deterministic approach to consistent density estimation. By avoiding the repeated computation of identical information, which is intrinsic to the tensor product approach, a more efficient quasi-Monte Carlo method is derived.
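The uniformity claim can be illustrated with a minimal sketch that is unrelated to the paper's density estimation method itself. It compares a van der Corput sequence (a standard low-discrepancy sequence, chosen here as an assumption) against pseudo-random samples when estimating the integral of x² over [0, 1], whose exact value is 1/3. The integrand, sample count, and seed are arbitrary illustration choices.

```python
import random


def van_der_corput(i, base=2):
    """Radical inverse of the integer i in the given base, a point in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while i > 0:
        result += (i % base) * fraction
        i //= base
        fraction /= base
    return result


def estimate(points):
    """Average of f(x) = x^2 over the sample points."""
    return sum(x * x for x in points) / len(points)


n = 1024
exact = 1.0 / 3.0

# Deterministic low-discrepancy points.
qmc_points = [van_der_corput(i) for i in range(n)]
# Independent pseudo-random points (seed fixed only for repeatability).
rng = random.Random(0)
mc_points = [rng.random() for _ in range(n)]

qmc_error = abs(estimate(qmc_points) - exact)
mc_error = abs(estimate(mc_points) - exact)
print(f"QMC error: {qmc_error:.6f}, MC error: {mc_error:.6f}")
```

With a power-of-two sample count the van der Corput points form a regular grid on [0, 1), so the deterministic estimate has an error of roughly 1/(2n), while the random estimate typically has an error on the order of 1/√n.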

Path space similarity determined by Fourier histogram descriptors

We propose a simple technique for the efficient estimation of the similarity of light transport paths. Considering descriptors of the incident radiance, we improve both filtering [Keller et al. 2014] and caching-based [Ward et al. 1988] variance reduction techniques for image synthesis. So far, these techniques could not measure variations of material and lighting, as they only included geometric measures of similarity, such as the divergence of normals, irradiance gradients, and the distance between vertices storing information.

GI next: global illumination for production rendering on GPUs

The sheer size of texture data and the complexity of custom shaders in production rendering were the two major hurdles in the way of GPU acceleration. Requiring only tiny modifications of an existing production renderer, we are able to accelerate the computation of global illumination by more than an order of magnitude.

