Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
(ICLR 2025)
Finetuning for multiview feature equivariance on a synthetic object significantly enhances a Vision Transformer's ability to produce consistent 3D feature correspondences across diverse objects. This improvement translates into superior performance on 3D tasks such as pose estimation, video tracking, and semantic correspondence.
You may also be interested in our previous work SparseDFF (ICLR 2024), which employs DINO features for one-shot dexterous manipulation.
- Change Logs
- Environment Setup
- Quick Start
- Finetuning on Objaverse
- Evaluation
- Acknowledgements
- BibTeX
- 2025/2/19: Uploaded other ViT-family models.
- 2025/2/18: Uploaded two missing files for PF-PASCAL evaluation.
- 2025/2/4: Uploaded more DINOv2 variants (DINOv2-Small/Large/Giant). Provided the environment requirements.
- 2025/1/26: Uploaded pretrained models (DINOv2-Base) along with training/evaluation recipes.
We provide a Huggingface demo at https://huggingface.co/spaces/qq456cvb/3DCorrEnhance.
Our environment information can be found in requirements.txt. You can install the dependencies with:
pip install -r requirements.txt
Our finetuned DINOv2-Small/Base/Large/Giant and other ViT-family models are available at Huggingface. To load DINOv2-Base, run:
from finetune import FinetuneDINO
model = FinetuneDINO.load_from_checkpoint('https://huggingface.co/qq456cvb/3DCorrEnhance/resolve/main/dinov2_base.ckpt', r=4, backbone_size='base').eval().cuda()
To load other ViT models (e.g., CLIP), run:
from finetune_timm import FinetuneTIMM
model = FinetuneTIMM.load_from_checkpoint('https://huggingface.co/qq456cvb/3DCorrEnhance/resolve/main/clip.ckpt', r=4, vit='clip').eval().cuda()
To extract descriptors for specific keypoints (an Nx2 numpy array), use:
import torch
from PIL import Image
import numpy as np

rgb = np.array(Image.open('/path/to/rgb.png'))
kps = ...  # N x 2 numpy array
rgb_input = torch.from_numpy(np.moveaxis((rgb / 255.).astype(np.float32), -1, 0)).cuda()
with torch.no_grad():
    kp_feats = model.get_feature(rgb_input[None], torch.from_numpy(kps).cuda()[None], normalize=True)[0]  # N x F torch tensor
To extract the entire feature map:
import torch
from PIL import Image
import numpy as np

rgb = np.array(Image.open('/path/to/rgb.png'))
rgb_input = torch.from_numpy(np.moveaxis((rgb / 255.).astype(np.float32), -1, 0)).cuda()
with torch.no_grad():
    feat_img = model.get_feature_wo_kp(rgb_input[None], normalize=True)[0]  # H x W x F torch tensor
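As a quick sanity check, the two snippets above can be combined to match keypoints across a pair of images. The sketch below is not part of the repository: it assumes rgb_a_input, kps_a, and rgb_b_input have been prepared exactly as shown above, and matches each keypoint descriptor from image A to its most similar location in image B's feature map. Because normalize=True returns L2-normalized features, a plain dot product equals cosine similarity.

with torch.no_grad():
    # N x F descriptors at the query keypoints in image A
    feats_a = model.get_feature(rgb_a_input[None], torch.from_numpy(kps_a).cuda()[None], normalize=True)[0]
    # H x W x F dense feature map of image B
    feat_map_b = model.get_feature_wo_kp(rgb_b_input[None], normalize=True)[0]

H, W, F = feat_map_b.shape
sim = feats_a @ feat_map_b.reshape(-1, F).T            # N x (H*W) cosine similarities
best = sim.argmax(dim=-1)                              # flat index of the best match per keypoint
matches = torch.stack([best % W, best // W], dim=-1)   # N x 2 (x, y) coordinates in image B's feature map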
To prepare the multi-view training data on Objaverse, first download the Objaverse glbs (only a 10k subset is required, as defined in data/10k.txt). Then run data_utils/render_objects.py to render the 10K randomly sampled Objaverse objects with blenderproc.
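If you do not already have the glbs locally, one way to fetch them is with the objaverse Python package. The snippet below is only a sketch and assumes data/10k.txt lists one Objaverse UID per line; if the file stores relative glb paths instead, adjust the parsing accordingly.

import objaverse

# Read the 10k-object subset (assumed: one UID per line).
with open('data/10k.txt') as f:
    uids = [line.strip() for line in f if line.strip()]

# Downloads the .glb files (by default under ~/.objaverse/hf-objaverse-v1/glbs/)
# and returns a {uid: local_path} dict; move or symlink them under data/objaverse/.
objects = objaverse.load_objects(uids=uids, download_processes=8)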
Your directory structure should look like this:
3DCorrEnhance/
└── data/
    ├── 10k.txt
    ├── obj_poses.npy
    ├── objaverse/
    │   └── hf-objaverse-v1/
    │       └── glbs/
    │           ├── 000-000/
    │           ├── ...
    │           └── 000-159/
    └── objaverse_renderings/
To finetune the DINOv2 base network, run:
python finetune.py backbone=base
This will finetune DINOv2 Base and save checkpoints in the checkpoints/ folder. For other DINOv2 variants, change the backbone type:
python finetune.py backbone=large
For DINOv2 with registers, use:
python finetune.py backbone=base reg=True
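After finetuning, a local checkpoint can be loaded the same way as the released ones. The path below is only illustrative; use the checkpoint actually written by your run.

from finetune import FinetuneDINO

# Hypothetical local path; point this at the checkpoint saved under checkpoints/.
model = FinetuneDINO.load_from_checkpoint('checkpoints/last.ckpt', r=4, backbone_size='base').eval().cuda()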
For pose estimation, download the test data from OnePose++ and place it under data/. Your directory should look like this:
3DCorrEnhance/
└── data/
    ├── sfm_output/
    │   └── outputs_softmax_loftr_loftr/
    └── lowtexture_test_data/
For video tracking evaluation, download the data from TAP-Vid-DAVIS and place it under data/:
3DCorrEnhance/
└── data/
    ├── tapvid_davis_data_strided.pkl
    └── lowtexture_test_data/
For semantic transfer, download the PF-PASCAL dataset and place it under data/:
3DCorrEnhance/
└── data/
    └── PF-dataset-PASCAL/
        ├── Annotations/
        ├── JPEGImages/
        ├── test_pairs_pf_different_views.txt
        └── test_pairs_pf_same_views.txt
To evaluate a checkpoint on all three tasks, run:
python evaluate.py --ckpt /path/to/ckpt --pose --tracking --transfer
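The task flags can also be passed individually; for example, to run only the pose estimation benchmark (the checkpoint path is illustrative):

python evaluate.py --ckpt checkpoints/last.ckpt --pose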
Some code is adapted from DINO-Tracker, FiT3D, and Objaverse-XL. We thank these projects for their open-source contributions.
If you find our work helpful, please consider citing:
@misc{you2024multiviewequivarianceimproves3d,
title={Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning},
author={Yang You and Yixin Li and Congyue Deng and Yue Wang and Leonidas Guibas},
year={2024},
eprint={2411.19458},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.19458},
}