The spatio-temporal relationships among the pixels of a video carry critical information for low-level, fine-grained 4D perception. A single model that reasons about these relationships should be able to solve several such tasks well and serve many downstream applications. Yet most state-of-the-art methods rely on architectures specialized for the task at hand.
We present L4P (pronounced "lap"), a feedforward, general-purpose architecture that solves multiple low-level 4D perception tasks in a unified framework. Given an unposed monocular video, L4P solves dense tasks (those that produce pixel-wise estimates), such as depth estimation, optical-flow estimation, and motion-based segmentation, as well as sparse tasks, such as 2D and 3D point tracking.
L4P combines an MAE-pretrained, ViT-based video encoder with lightweight per-task heads. We also introduce a memory mechanism that enables the long-range estimation needed for tracking tasks. Despite its general formulation and feedforward nature, our model matches or surpasses the performance of existing state-of-the-art specialized methods on each task we support. Moreover, it solves all of these tasks at once in a time comparable to that of individual single-task methods.
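To make the shared-encoder, per-task-head design concrete, the following minimal PyTorch-style sketch shows one shared video encoder feeding several lightweight task heads. All module names, dimensions, and head designs here are illustrative assumptions, not L4P's actual implementation, and the memory mechanism is omitted.

import torch
import torch.nn as nn

class SharedEncoderMultiHeadSketch(nn.Module):
    """Illustrative sketch only: one shared video encoder with lightweight per-task heads.
    Module names, sizes, and head designs are hypothetical placeholders."""

    def __init__(self, embed_dim=768):
        super().__init__()
        # Stand-in for an MAE-pretrained, ViT-based video encoder over spatio-temporal tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
            num_layers=12,
        )
        # Lightweight per-task heads reading the same shared features.
        self.heads = nn.ModuleDict({
            "depth": nn.Linear(embed_dim, 1),         # per-token depth value
            "optical_flow": nn.Linear(embed_dim, 2),  # per-token 2D flow vector
            "motion_seg": nn.Linear(embed_dim, 1),    # per-token motion logit
            "track_2d": nn.Linear(embed_dim, 2),      # per-query 2D track position
            "track_3d": nn.Linear(embed_dim, 3),      # per-query 3D track position
        })

    def forward(self, video_tokens):
        # video_tokens: (batch, num_tokens, embed_dim) spatio-temporal patch tokens.
        features = self.encoder(video_tokens)
        # Every task is decoded from the same shared features; only the heads differ.
        return {name: head(features) for name, head in self.heads.items()}

if __name__ == "__main__":
    tokens = torch.randn(1, 8 * 196, 768)  # e.g., 8 frames of 14x14 patch tokens
    outputs = SharedEncoderMultiHeadSketch()(tokens)
    for task, out in outputs.items():
        print(task, tuple(out.shape))

The point of the sketch is only the structural idea stated above: a single heavy encoder is amortized across tasks, so adding a task costs one small head rather than a new specialized network.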