MemcachedGPU: Scaling-up Scale-out Key-value Stores

This paper tackles the challenges of obtaining more efficient data center computing while maintaining low latency, low cost, programmability, and the potential for workload consolidation. We introduce GNoM, a software framework enabling energy-efficient, latency- and bandwidth-optimized UDP network and application processing on GPUs. GNoM handles the data movement and task management to facilitate the development of high-throughput UDP network services on GPUs.

Exploiting Asymmetry in Booth-Encoded Multipliers for Reduced Energy Multiplication

Booth encoding is a common technique used in the design of high-speed multipliers. These multipliers typically encode just one operand of the multiplier, and this asymmetry results in different power characteristics as each input transitions to its next value in a pipelined design. Relative to the non-encoded input, changes on the Booth-encoded input induce more signal transitions, requiring ~73% more multiplier-array energy.
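To make the encoded/non-encoded asymmetry concrete, the sketch below is an illustrative Python model of radix-4 (modified) Booth encoding, not code from the paper; the function names are our own. Only one operand is recoded: its overlapping 3-bit groups are mapped to digits in {-2, -1, 0, 1, 2}, halving the number of partial products, and it is this recoding logic that the encoded input's transitions exercise.

```python
def booth_radix4_digits(x, bits=16):
    """Radix-4 (modified) Booth encoding of a two's-complement operand.

    Overlapping 3-bit groups (b[i+1], b[i], b[i-1]) are scanned from the
    LSB, with an implicit 0 below bit 0, and recoded as
    d = -2*b[i+1] + b[i] + b[i-1], giving digits in {-2, -1, 0, 1, 2}.
    """
    x &= (1 << bits) - 1              # two's-complement bit pattern
    padded = x << 1                   # implicit 0 below the LSB
    digits = []
    for i in range(0, bits, 2):
        g = (padded >> i) & 0b111     # the 3-bit group at position i
        digits.append(-2 * ((g >> 2) & 1) + ((g >> 1) & 1) + (g & 1))
    return digits                     # bits/2 digits, digit k has weight 4**k


def booth_multiply(a, b, bits=16):
    """Multiply via the Booth digits of b: one partial product per digit."""
    return sum(d * a * 4**k
               for k, d in enumerate(booth_radix4_digits(b, bits)))
```

For example, `booth_radix4_digits(13, 8)` yields `[1, -1, 1, 0]` (13 = 1·1 - 1·4 + 1·16), and `booth_multiply(13, -5, 8)` returns `-65`, with only four partial products instead of eight.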

Priority-Based Cache Allocation in Throughput Processors

GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can increase contention for various system resources, however, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of thread-level parallelism throttling to reduce cache contention and improve performance. Such throttling can, however, under-utilize thread contexts, the on-chip interconnect, and off-chip memory bandwidth.

Scaling the Power Wall: A Path to Exascale

Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020, and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency.

High-speed Low-power On-chip Global Signaling Design Overview

On-chip global signaling in modern SoCs faces significant challenges due to wire pitch scaling and increasing die size. Conventional on-chip synchronous CMOS links have already hit a performance wall in power and latency. Although approaches based on custom low-swing equalized serial-link techniques can yield improvements, strict power/silicon budgets and the non-ideal in-situ conditions of large SoCs make their design much more challenging than simply transitioning off-chip signaling technologies to on-chip use. Therefore, a holistic approach to the on-chip global signaling problem is required.

A 6.5-to-23.3fJ/b/mm Balanced Charge-Recycling Bus in 16nm FinFET CMOS at 1.7-to-2.6Gb/s/wire with Clock Forwarding and Low-Crosstalk Contraflow Wiring

Signaling over chip-scale global interconnect consumes a growing fraction of total power in large processor chips as process technology continues to shrink. Addressing this requires simple, low-energy, and area-efficient signaling for high-bandwidth data buses. This paper describes a balanced charge-recycling bus (BCRB) that achieves quadratic power savings relative to signaling with full-swing CMOS repeaters. The scheme stacks two CMOS repeated wire links: one operates in the Vtop domain, between Vdd and Vmid = Vdd/2; the other in the Vbot domain, between Vmid and GND.
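The quadratic saving can be seen from first-order dynamic-energy arithmetic; the sketch below is our own back-of-envelope model under a simple CV assumption, not analysis from the paper. Energy drawn from the supply per wire transition is roughly C·Vdd·Vswing, and stacking two half-swing links lets the charge delivered to the top link be recycled to drive the bottom one.

```python
# First-order model (our assumption): supply energy per transition of a
# repeated wire ~ C * Vdd * Vswing, with C the wire capacitance.
C, Vdd = 1.0, 1.0                 # normalized wire capacitance and supply

# Two conventional full-swing wires each draw C * Vdd * Vdd from the supply.
e_full = 2 * C * Vdd * Vdd

# Stacked half-swing links: the supply only sees the top domain, and the
# charge it delivers (swing Vdd/2) is recycled to drive the bottom wire,
# so the total supply energy for both wires is C * Vdd * (Vdd / 2).
e_recycled = C * Vdd * (Vdd / 2)

print(e_full / e_recycled)        # 4.0: quadratic savings from halving swing
```

Halving the swing alone would save 2×; recycling the same charge through a second link doubles that again, hence the "quadratic" 4× figure in this idealized model.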

Deterministic Consistent Density Estimation for Light Transport Simulation

Quasi-Monte Carlo methods are often more efficient than Monte Carlo methods, mainly because deterministic low-discrepancy sequences are more uniformly distributed than independent random numbers can ever be. So far, tensor-product quasi-Monte Carlo techniques have been the only deterministic approach to consistent density estimation. By avoiding the repeated computation of identical information, which is intrinsic to the tensor-product approach, a more efficient quasi-Monte Carlo method is derived.
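As a concrete illustration of the low-discrepancy sequences the abstract refers to, the following is a generic Python sketch of the radical inverse and the Halton sequence (standard constructions, not the paper's method; function names are ours):

```python
def radical_inverse(n, base=2):
    """Van der Corput radical inverse: mirror the base-b digits of n
    about the radix point, yielding a point in [0, 1)."""
    inv, f = 0.0, 1.0 / base
    while n > 0:
        n, d = divmod(n, base)    # peel off the next base-b digit
        inv += d * f
        f /= base
    return inv


def halton(n, bases=(2, 3)):
    """First n points of the multi-dimensional Halton sequence, using one
    coprime base per dimension; a classic deterministic low-discrepancy set."""
    return [tuple(radical_inverse(i, b) for b in bases) for i in range(n)]
```

For instance, the first base-2 radical inverses are 0, 1/2, 1/4, 3/4, 1/8, ...: each new point falls in the largest remaining gap, so any prefix of the sequence stratifies the unit interval far more evenly than the same number of independent random draws.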