L4P: Towards Unified Low-Level 4D Vision Perception

(* indicates equal contribution)

L4P (pronounced "lap") is a general-purpose architecture that solves several low-level dense and sparse 4D perception tasks in a unified way.

Abstract

The spatio-temporal relationship between the pixels of a video carries critical information for low-level fine-grained 4D perception. A single model that reasons about it should be able to solve several such tasks well and help many downstream applications. Yet, most state-of-the-art methods rely on architectures specialized for the task at hand.

We present L4P (pronounced "lap"), a feedforward, general-purpose architecture that solves multiple low-level 4D perception tasks in a unified framework. Given an unposed monocular video, L4P solves dense tasks (tasks that produce pixel-wise estimates), such as depth estimation, optical-flow estimation, camera-ray estimation, and motion-based segmentation, as well as sparse tasks such as 2D and 3D point tracking.

L4P combines an MAE-pretrained ViT-based video encoder with lightweight per-task heads. We also introduce a memory mechanism that enables the long-range estimation needed for tracking tasks. Despite its general formulation and feedforward nature, our model matches or surpasses the performance of existing state-of-the-art specialized methods on each task we support. Moreover, it solves all of those tasks at once in a time comparable to that of individual single-task methods.
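The sketch below (PyTorch) illustrates the shared-encoder, per-task-head pattern described above. It is not the authors' implementation: all module names, token sizes, the tubelet patch embedding, and the head design are illustrative assumptions, and the memory mechanism is omitted.

```python
# Minimal sketch of a shared video encoder with lightweight per-task heads.
# Hypothetical sizes and modules; see the paper for the actual architecture.
import torch
import torch.nn as nn

class SharedVideoEncoder(nn.Module):
    """Stand-in for an MAE-pretrained ViT video encoder (hypothetical sizes)."""
    def __init__(self, embed_dim=768, depth=4, num_heads=12, patch=16, tube=2):
        super().__init__()
        # Tubelet embedding: tokens span `tube` frames and `patch`x`patch` pixels.
        self.to_tokens = nn.Conv3d(3, embed_dim,
                                   kernel_size=(tube, patch, patch),
                                   stride=(tube, patch, patch))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, video):                       # video: (B, 3, T, H, W)
        tokens = self.to_tokens(video)              # (B, C, T', H', W')
        b, c, t, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, T'*H'*W', C)
        return self.blocks(tokens), (t, h, w)

class DenseHead(nn.Module):
    """Lightweight per-task head that decodes shared tokens into a per-pixel map."""
    def __init__(self, embed_dim, out_channels, patch=16, tube=2):
        super().__init__()
        self.out_channels, self.patch, self.tube = out_channels, patch, tube
        self.proj = nn.Linear(embed_dim, out_channels * tube * patch * patch)

    def forward(self, tokens, grid):                # grid = (T', H', W')
        t, h, w = grid
        x = self.proj(tokens)                       # (B, N, C_out*tube*p*p)
        x = x.view(-1, t, h, w, self.out_channels, self.tube, self.patch, self.patch)
        x = x.permute(0, 4, 1, 5, 2, 6, 3, 7)       # (B, C_out, T', tube, H', p, W', p)
        return x.reshape(x.shape[0], self.out_channels,
                         t * self.tube, h * self.patch, w * self.patch)

encoder = SharedVideoEncoder()
heads = nn.ModuleDict({
    "depth": DenseHead(768, 1),      # per-pixel depth
    "flow": DenseHead(768, 2),       # per-pixel 2D flow
    "dyn_mask": DenseHead(768, 1),   # motion-based segmentation logits
})

video = torch.randn(1, 3, 8, 224, 224)  # (B, 3, T, H, W)
tokens, grid = encoder(video)
outputs = {name: head(tokens, grid) for name, head in heads.items()}
print({k: tuple(v.shape) for k, v in outputs.items()})
```

Because every head reads the same shared video tokens, adding a task costs only one small head rather than a separate full model, which is why all tasks can run in roughly the time of a single-task method.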


Video

Results

Results for multiple 4D perception tasks on in-the-wild videos using a single unified model.



L4P estimates camera poses from an unposed monocular video, which can then be used to visualize depth, cameras, and 3D point tracks in a consistent reference frame as shown below.
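The sketch below (PyTorch) shows one way such a visualization can be assembled: unprojecting a predicted depth map into world coordinates using estimated camera poses, so depth, cameras, and 3D tracks share one reference frame. It is not the authors' code; the pinhole intrinsics K and camera-to-world pose convention are assumptions.

```python
# Unproject a depth map to world-frame 3D points (illustrative sketch).
import torch

def depth_to_world_points(depth, K, T_cw):
    """depth: (H, W); K: (3, 3) intrinsics; T_cw: (4, 4) camera-to-world pose."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T                                    # camera-frame rays
    pts_cam = rays * depth.reshape(-1, 1)                                 # scale rays by depth
    pts_hom = torch.cat([pts_cam, torch.ones_like(pts_cam[:, :1])], dim=-1)
    return (pts_hom @ T_cw.T)[:, :3]                                      # world-frame points

# Example with dummy values:
depth = torch.rand(224, 224) * 5.0
K = torch.tensor([[200.0, 0.0, 112.0], [0.0, 200.0, 112.0], [0.0, 0.0, 1.0]])
T_cw = torch.eye(4)
points_world = depth_to_world_points(depth, K, T_cw)
print(points_world.shape)  # torch.Size([50176, 3])
```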


The original videos shown here are from:
[1] Perazzi et al., A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] Mehl et al., Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Here are some related works that also explore the use of video foundation models for low-level 4D perception:
[1] Carreira et al., Scaling 4D Representations, arXiv, 2024.
[2] Hu et al., DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

BibTeX

@article{badki2025l4p,
  author    = {Badki, Abhishek and Su, Hang and Wen, Bowen and Gallo, Orazio},
  title     = {{L4P}: Towards Unified {L}ow-Level {4D} Vision Perception},
  journal   = {arXiv},
  year      = {2025},
}