L4P: Towards Unified Low-Level 4D Vision Perception

(* indicates equal contribution)

L4P (pronounced "lap") is a general-purpose architecture that solves several low-level dense and sparse 4D perception tasks in a unified way.

Abstract

The spatio-temporal relationship between the pixels of a video carries critical information for low-level fine-grained 4D perception. A single model that reasons about it should be able to solve several such tasks well and help many downstream applications. Yet, most state-of-the-art methods rely on architectures specialized for the task at hand.

We present L4P (pronounced "lap"), a feedforward, general-purpose architecture that solves multiple low-level 4D perception tasks in a unified framework. Given an unposed monocular video, L4P solves dense tasks (tasks that produce pixel-wise estimates), such as depth estimation, optical-flow estimation, camera-ray estimation, and motion-based segmentation, as well as sparse tasks such as 2D and 3D point tracking.

L4P combines an MAE-pretrained ViT-based video encoder with lightweight per-task heads. We also introduce a memory mechanism that enables the long-range estimation needed for tracking tasks. Despite its general formulation and feedforward nature, our model matches or surpasses the performance of existing state-of-the-art specialized methods on each task we support. Moreover, it solves all of those tasks at once in a time comparable to that of individual single-task methods.
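The sketch below (PyTorch) illustrates the shared-encoder, per-task-head pattern described above. It is not the authors' implementation: all module names, token sizes, the tubelet patch embedding, and the head design are illustrative assumptions, and the memory mechanism is omitted.

```python
# Minimal sketch of a shared video encoder with lightweight per-task heads.
# Hypothetical sizes and modules; see the paper for the actual architecture.
import torch
import torch.nn as nn

class SharedVideoEncoder(nn.Module):
    """Stand-in for an MAE-pretrained ViT video encoder (hypothetical sizes)."""
    def __init__(self, embed_dim=768, depth=4, num_heads=12, patch=16, tube=2):
        super().__init__()
        # Tubelet embedding: tokens span `tube` frames and `patch`x`patch` pixels.
        self.to_tokens = nn.Conv3d(3, embed_dim,
                                   kernel_size=(tube, patch, patch),
                                   stride=(tube, patch, patch))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, video):                       # video: (B, 3, T, H, W)
        tokens = self.to_tokens(video)              # (B, C, T', H', W')
        b, c, t, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, T'*H'*W', C)
        return self.blocks(tokens), (t, h, w)

class DenseHead(nn.Module):
    """Lightweight per-task head that decodes shared tokens into a per-pixel map."""
    def __init__(self, embed_dim, out_channels, patch=16, tube=2):
        super().__init__()
        self.out_channels, self.patch, self.tube = out_channels, patch, tube
        self.proj = nn.Linear(embed_dim, out_channels * tube * patch * patch)

    def forward(self, tokens, grid):                # grid = (T', H', W')
        t, h, w = grid
        x = self.proj(tokens)                       # (B, N, C_out*tube*p*p)
        x = x.view(-1, t, h, w, self.out_channels, self.tube, self.patch, self.patch)
        x = x.permute(0, 4, 1, 5, 2, 6, 3, 7)       # (B, C_out, T', tube, H', p, W', p)
        return x.reshape(x.shape[0], self.out_channels,
                         t * self.tube, h * self.patch, w * self.patch)

encoder = SharedVideoEncoder()
heads = nn.ModuleDict({
    "depth": DenseHead(768, 1),      # per-pixel depth
    "flow": DenseHead(768, 2),       # per-pixel 2D flow
    "dyn_mask": DenseHead(768, 1),   # motion-based segmentation logits
})

video = torch.randn(1, 3, 8, 224, 224)  # (B, 3, T, H, W)
tokens, grid = encoder(video)
outputs = {name: head(tokens, grid) for name, head in heads.items()}
print({k: tuple(v.shape) for k, v in outputs.items()})
```

Because every head reads the same shared video tokens, adding a task costs only one small head rather than a separate full model, which is why all tasks can run in roughly the time of a single-task method.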


Video

Results

Results for multiple 4D perception tasks on in-the-wild videos using a single unified model.



L4P estimates camera poses from an unposed monocular video, which can then be used to visualize depth, cameras, and 3D point tracks in a consistent reference frame as shown below.
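The sketch below (PyTorch) shows one way such a visualization can be assembled: unprojecting a predicted depth map into world coordinates using estimated camera poses, so depth, cameras, and 3D tracks share one reference frame. It is not the authors' code; the pinhole intrinsics K and camera-to-world pose convention are assumptions.

```python
# Unproject a depth map to world-frame 3D points (illustrative sketch).
import torch

def depth_to_world_points(depth, K, T_cw):
    """depth: (H, W); K: (3, 3) intrinsics; T_cw: (4, 4) camera-to-world pose."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T                                    # camera-frame rays
    pts_cam = rays * depth.reshape(-1, 1)                                 # scale rays by depth
    pts_hom = torch.cat([pts_cam, torch.ones_like(pts_cam[:, :1])], dim=-1)
    return (pts_hom @ T_cw.T)[:, :3]                                      # world-frame points

# Example with dummy values:
depth = torch.rand(224, 224) * 5.0
K = torch.tensor([[200.0, 0.0, 112.0], [0.0, 200.0, 112.0], [0.0, 0.0, 1.0]])
T_cw = torch.eye(4)
points_world = depth_to_world_points(depth, K, T_cw)
print(points_world.shape)  # torch.Size([50176, 3])
```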


The original videos shown here are from:
[1] Perazzi et al., A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] Mehl et al., Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Here are some related works that also explore the use of video foundation models for low-level 4D perception:
[1] Carreira et al., Scaling 4D Representations, arXiv, 2024.
[2] Hu et al., DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

BibTeX

@article{badki2025l4p,
  author    = {Badki, Abhishek and Su, Hang and Wen, Bowen and Gallo, Orazio},
  title     = {{L4P}: Towards Unified {L}ow-Level {4D} Vision Perception},
  journal   = {arXiv},
  year      = {2025},
}