William Dally

William Dally, Ph.D.
Chief Scientist and SVP of Research
Bill Dally joined NVIDIA in January 2009 as chief scientist, after spending 12 years at Stanford University, where he was chairman of the computer science department. Dally and his Stanford team developed the system architecture, network architecture, signaling, routing, and synchronization technology that is found in most large parallel computers today. Dally was previously at the Massachusetts Institute of Technology from 1986 to 1997, where he and his team built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanism from programming models and demonstrated very low overhead synchronization and communication mechanisms. From 1983 to 1986, he was at the California Institute of Technology (Caltech), where he designed the MOSSIM Simulation Engine and the Torus Routing Chip, which pioneered “wormhole” routing and virtual-channel flow control. He is a member of the National Academy of Engineering, a Fellow of the American Academy of Arts & Sciences, a Fellow of the IEEE and the ACM, and has received the IEEE Seymour Cray Award and the ACM Maurice Wilkes Award. He has published over 200 papers, holds over 50 issued patents, and is an author of two textbooks. Dally received a bachelor’s degree in Electrical Engineering from Virginia Tech, a master’s in Electrical Engineering from Stanford University, and a Ph.D. in Computer Science from Caltech. He is a cofounder of Velio Communications and Stream Processors.
Research Interests:
Computer Architecture, Parallel Programming Systems, Interconnection Networks, High-Performance Circuit Design
Publications:
A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications
A 0.54pJ/b 20Gb/s Ground-Referenced Single-Ended Short-Haul Serial Link in 28nm CMOS for Advanced Packaging Applications
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors
A Compile-Time Managed Multi-Level Register File Hierarchy
GPUs and the Future of Parallel Computing
Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors
A Programmable 512 GOPS Stream Processor for Signal, Image, and Video Processing
A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS
Imagine: Media Processing with Streams