We present the first human mesh recovery algorithm to fully depart from the orthographic camera model and recover a full perspective projection model without applying heuristics. On several benchmark datasets captured at diverse ranges, including close-range images, we outperform all existing state-of-the-art methods at estimating subject depth, focal length, 3D pose, and 2D alignment.
Single-image human mesh recovery is a challenging task due to the ill-posed nature of simultaneous body shape, pose, and camera estimation. Existing estimators work well on images taken from afar, but they break down as the person moves close to the camera. Moreover, current methods fail to achieve accurate 3D pose and 2D alignment at the same time. The error is mainly introduced by inaccurate perspective projection heuristically derived from orthographic parameters. To resolve this long-standing challenge, we present our method BLADE, which accurately recovers perspective parameters from a single image without heuristic assumptions. We start from the inverse relationship between perspective distortion and the person's Z-translation \(T_z\), and we show that \(T_z\) can be reliably estimated from the image. We then discuss the important role of \(T_z\) in accurate human mesh recovery from close-range images. Finally, we show that, once \(T_z\) and the 3D human mesh are estimated, one can accurately recover the focal length and full 3D translation. Extensive experiments on standard benchmarks and real-world close-range images show that our method is the first to accurately recover projection parameters from a single image, and that it consequently attains state-of-the-art accuracy in 3D pose estimation and 2D alignment across a wide range of images.
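To make the geometric reasoning concrete, consider a standard pinhole camera; the notation below is an illustrative sketch rather than the paper's exact formulation. A body point at \((X, Y, Z)\) expressed relative to the pelvis and translated by \((T_x, T_y, T_z)\) projects to
\[ u = f\,\frac{X + T_x}{Z + T_z} + c_x, \qquad v = f\,\frac{Y + T_y}{Z + T_z} + c_y, \]
with focal length \(f\) and principal point \((c_x, c_y)\). Because the denominator is \(Z + T_z\), the apparent scale of each body part varies like \(1/(Z + T_z)\), so foreshortening between near and far parts grows rapidly as \(T_z\) shrinks. Conversely, once \(T_z\) and the mesh geometry \((X, Y, Z)\) are fixed, each 2D observation yields a constraint that is linear in \(f\), \(T_x\), and \(T_y\), which can be solved by least squares.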
Our method BLADE not only estimates 3D shape and pose precisely but also accurately retrieves perspective projection parameters, enabling the predicted 3D human mesh to align seamlessly with the input image.
Starting from a bounding-box image crop \(I_{crop}\) of the person, the Pelvis Depth Estimator \(F^{T_z}\) (green box) estimates the Z-translation of the person’s pelvis, \(T_z\). Then, the Pose Estimator \(F^{pose}\) (blue box) estimates the SMPL-X shape and pose \( (\beta, \theta) \) from the full input image while accounting for the image distortion induced by \(T_z\). Finally, through differentiable rasterization, the Camera Solver (brown box) recovers the optimal focal length and 3D translation that best align the rasterized SMPL-X mesh with the segmentation mask of the person. We are thus able to solve for the full perspective projection model without heuristic assumptions.
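As a rough illustration of the last stage, the snippet below sketches how such a mask-alignment camera solver could be set up in PyTorch: the SMPL-X mesh and \(T_z\) are held fixed while the focal length and in-plane translation are optimized against the person's segmentation mask. The render_silhouette function is a hypothetical stand-in for a differentiable rasterizer (e.g., nvdiffrast or PyTorch3D); this is a sketch under those assumptions, not BLADE's actual implementation.

```python
import torch

def render_silhouette(verts, faces, f, T_xy, T_z, image_size):
    """Hypothetical stand-in for a differentiable rasterizer
    (e.g., PyTorch3D or nvdiffrast): returns a soft [H, W] silhouette
    of the mesh projected with focal length f and translation
    (T_x, T_y, T_z). Plug in a real renderer here."""
    raise NotImplementedError

def solve_camera(verts, faces, person_mask, T_z, image_size,
                 num_iters=200, lr=1e-2):
    """Sketch of a mask-alignment camera solver: with the SMPL-X mesh and
    pelvis depth T_z fixed, optimize focal length and in-plane translation
    so the rendered silhouette overlaps the observed person mask."""
    H, W = image_size
    # Optimize the focal length in log-space so it stays positive.
    log_f = torch.tensor(float(max(H, W))).log().requires_grad_(True)
    T_xy = torch.zeros(2, requires_grad=True)

    optimizer = torch.optim.Adam([log_f, T_xy], lr=lr)
    for _ in range(num_iters):
        optimizer.zero_grad()
        sil = render_silhouette(verts, faces, log_f.exp(), T_xy, T_z, (H, W))
        # Soft-IoU loss between the rendered and observed silhouettes.
        inter = (sil * person_mask).sum()
        union = (sil + person_mask - sil * person_mask).sum()
        loss = 1.0 - inter / union.clamp(min=1e-6)
        loss.backward()
        optimizer.step()

    return log_f.exp().detach(), T_xy.detach()
```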
While perspective distortion becomes more severe at depths below 1.2 m, existing HMR datasets do not contain enough data in this range. We therefore create a new large synthetic dataset, BEDLAM-CC (“close camera”), using assets provided with the BEDLAM dataset. Our new dataset contains 2 million synthetically rendered images that augment existing training data for depth estimation, and it intentionally includes strong variation in lighting and camera angles as well as severe close-up distortion.
@inproceedings{
wang2024blade,
author = {Wang, Shengze and Li, Jiefeng and Li, Tianye and Yuan, Ye and Fuchs, Henry and De Mello, Shalini and Nagano, Koki and Stengel, Michael},
title = {{BLADE}: {S}ingle-view {B}ody {M}esh {L}earning through {A}ccurate {D}epth {E}stimation},
booktitle = {arXiv},
year = {2024}
}
We thank Ariel Brown for the video voice-over. We based this website on the EG3D website template and the WYSIWYG website template.