The MvDeCor pipeline:
(a) Dense 2D representations are learned using pixel-level correspondences guided by 3D shapes.
(b) The 2D representations can be fine-tuned using a few labels for 3D shape segmentation tasks in a multi-view setting.
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks. This is inspired by the observation that view-based surface representations are more effective at modeling high-resolution surface details and texture than their 3D counterparts based on point clouds or voxel occupancy. Specifically, given a 3D shape, we render it from multiple views, and set up a dense correspondence learning task within the contrastive learning framework. As a result, the learned 2D representations are view-invariant and geometrically consistent, leading to better generalization when trained on a limited number of labeled shapes compared to alternatives that utilize self-supervision in 2D or 3D alone. Experiments on textured (RenderPeople) and untextured (PartNet) 3D datasets show that our method outperforms state-of-the-art alternatives in fine-grained part segmentation. The improvements over baselines are greater when only a sparse set of views is available for training or when shapes are textured, indicating that MvDeCor benefits from both 2D processing and 3D geometric reasoning.
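The dense correspondence objective described above can be sketched as a pixel-level contrastive loss. This is a minimal illustration only, not the authors' implementation. It assumes an InfoNCE-style loss and a temperature of 0.07, neither of which is specified in the text. The function name `dense_info_nce` and its inputs are hypothetical. It also assumes the pixel correspondences between the two rendered views have already been computed from the 3D shape.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dense_info_nce(feats_a, feats_b, temperature=0.07):
    """InfoNCE-style loss over corresponding pixels of two rendered views.

    feats_a[i] and feats_b[i] are assumed to be embeddings of two pixels that
    project from the same 3D surface point (a positive pair). Every other
    pixel in feats_b acts as a negative for feats_a[i].
    The temperature value is an assumption, not taken from the source.
    """
    if not feats_a or len(feats_a) != len(feats_b):
        raise ValueError("expected two non-empty, equally sized feature lists")

    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]

    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor pixel to every pixel in the other view.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching pixel as the correct class.
        total += log_denom - logits[i]
    return total / len(a)
```

Minimizing a loss of this form pulls together the embeddings of pixels that see the same surface point from different views and pushes apart the embeddings of all other pixels, which is one way to obtain view-invariant features.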
Few-shot segmentation on the PartNet dataset with k = 30 labeled shapes. Left: 30 fully labeled shapes are used for training. Right: 30 shapes are used for training, each containing v = 5 random labeled views. Evaluation is done on the test set of PartNet with the mean part-IoU metric (%). Results are reported by averaging over 5 random runs.
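For readers unfamiliar with the metric, the sketch below shows one common way to compute a mean part-IoU from predicted and ground-truth part labels. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers. The function name `mean_part_iou` is hypothetical. Skipping parts absent from both prediction and ground truth is an assumption, as is pooling all points into one computation; the exact PartNet protocol may differ.

```python
def mean_part_iou(pred, gt, num_parts):
    """Mean intersection-over-union across part labels, as a percentage.

    pred and gt are equal-length sequences of integer part labels, one per
    point (or face, or pixel). Parts that appear in neither sequence are
    skipped, which is an assumption of this sketch.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    ious = []
    for part in range(num_parts):
        intersection = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
        union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
        if union > 0:
            ious.append(intersection / union)

    return 100.0 * sum(ious) / len(ious) if ious else 0.0
```

As a worked example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, part 0 has IoU 1/2 and part 1 has IoU 2/3, so the mean is 7/12, or about 58.3%.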
MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation