Qianli Ma

Qianli Ma is a research scientist at NVIDIA Research. He received his PhD from ETH Zürich and the Max Planck Institute for Intelligent Systems (Tübingen), advised by Professor Michael Black and Professor Siyu Tang. He has also interned at Meta Reality Labs in Pittsburgh. His work develops new representations and methods for reconstructing, generating, and modeling digital humans. His research interests span generative models, 3D computer vision, and graphics, with a current focus on dynamic 3D content generation.

Global Context Vision Transformers

We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
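The interplay between local and global context attention described above can be illustrated with a minimal NumPy sketch. It is not the GC ViT implementation: the single attention head, the function names, and in particular the average-pooling step used to produce the shared global query (`pool_to_window`) are assumptions made for illustration. The point it shows is that local attention draws queries from each window, while global context attention reuses one set of queries, derived from the whole feature map, against every window's keys and values, so no masks or window shifts are needed.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(x, k):
    """Split an (H, W, C) map into non-overlapping k x k windows: (num_windows, k*k, C)."""
    H, W, C = x.shape
    x = x.reshape(H // k, k, W // k, k, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, k * k, C)


def pool_to_window(x, k):
    """Assumed stand-in for a global token generator: average-pool the map to k x k."""
    H, W, C = x.shape
    return x.reshape(k, H // k, k, W // k, C).mean(axis=(1, 3)).reshape(k * k, C)


def local_attention(windows, Wq, Wk, Wv):
    # Queries, keys and values all come from the same window.
    q, k, v = windows @ Wq, windows @ Wk, windows @ Wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return attn @ v


def global_context_attention(windows, q_global, Wk, Wv):
    # One global query set is broadcast against every window's keys and values.
    k, v = windows @ Wk, windows @ Wv
    attn = softmax(q_global[None] @ k.transpose(0, 2, 1) / np.sqrt(q_global.shape[-1]))
    return attn @ v


rng = np.random.default_rng(0)
H, W, C, win = 8, 8, 16, 4
feature_map = rng.standard_normal((H, W, C))
Wq, Wk, Wv, Wq_g = (rng.standard_normal((C, C)) * C ** -0.5 for _ in range(4))

windows = window_partition(feature_map, win)           # (4, 16, 16)
q_global = pool_to_window(feature_map, win) @ Wq_g     # (16, 16), shared by all windows

local_out = local_attention(windows, Wq, Wk, Wv)
global_out = global_context_attention(windows, q_global, Wk, Wv)
print(local_out.shape, global_out.shape)
```

In a full model, blocks of the two kinds would alternate, so each window's tokens are refined first by their neighbors and then by image-level context.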

FasterViT: Fast Vision Transformers with Hierarchical Attention

We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and the global modeling properties of ViTs. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention, which has quadratic complexity, into multi-level attention with reduced computational cost. We benefit from efficient window-based self-attention.
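To make the cost argument concrete, the following back-of-the-envelope sketch counts query-key pairs for full global attention, plain window attention, and a hypothetical two-level scheme. The two-level scheme here is an assumption for illustration only, not the exact HAT formulation: each window is given `c` extra summary tokens that take part in its local attention, and all summary tokens then attend to one another globally.

```python
def full_attention_pairs(h, w):
    """Query-key pairs when every token attends to every token."""
    n = h * w
    return n * n


def windowed_attention_pairs(h, w, k):
    """Query-key pairs when attention is restricted to k x k windows."""
    num_windows = (h // k) * (w // k)
    return num_windows * (k * k) ** 2


def two_level_pairs(h, w, k, c):
    """Hypothetical two-level scheme (an assumption, not the paper's exact method).

    Level 1: attention inside each window over its k*k tokens plus c summary tokens.
    Level 2: attention among the summary tokens of all windows.
    """
    num_windows = (h // k) * (w // k)
    local = num_windows * (k * k + c) ** 2
    summary = (num_windows * c) ** 2
    return local + summary


# Example: a 56 x 56 token grid, 7 x 7 windows, one summary token per window.
full = full_attention_pairs(56, 56)
windowed = windowed_attention_pairs(56, 56, 7)
two_level = two_level_pairs(56, 56, 7, 1)
print(full, windowed, two_level)
```

Under these assumed settings the two-level count stays close to the plain windowed count while still giving every window a path to global information, which is the kind of trade-off the multi-level decomposition targets.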