Yue’s research focuses on building video-centric foundation models, resting on three pillars: modeling, data, and system design. For modeling, he proposes to learn video representations from free-form narratives, inspired by the recent success of large language models (LLMs), which lets us view all kinds of videos through the lens of narratives. For data, he is distilling image-based vision-language models onto videos so that narrating videos becomes as fast as annotating images, scaling the video data size to match or even exceed its image counterpart. For system design, he examines the training pipeline of a modern video Transformer architecture and mitigates the video loading bottleneck.
Yue is a fourth-year PhD student at the University of Texas at Austin, supervised by Prof. Philipp Krähenbühl. He obtained his MPhil degree from the Multimedia Laboratory at the Chinese University of Hong Kong, supervised by Prof. Dahua Lin. Before that, he obtained his Bachelor's degrees from Tsinghua University. His research interests are in computer vision and machine learning. In particular, he has been focusing on developing computer vision models for video understanding and generation. He is also interested in building efficient systems for video compression and analysis.