In this paper we describe the design, programming interface, and implementation of a very efficient user-programmable vertex engine. The vertex engine of NVIDIA's GeForce3 GPU evolved from a highly tuned fixed-function pipeline requiring considerable knowledge to program. Programs operate only on a stream of independent vertices traversing the pipe. Embedded in the broader fixed function pipeline, our approach preserves parallelism sacrificed by previous approaches. The programmer is presented with a straightforward programming model, which is supported by transparent multi-threading and bypassing to preserve parallelism and performance. In the remainder of the paper we discuss the motivation behind our design and contrast it with previous work. We present the programming model, the instruction set selection process, and details of the hardware implementation. Finally, we discuss important API design issues encountered when creating an interface to such a device. We close with thoughts about the future of programmable graphics devices.