The official source code for the project RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo, CVPR 2023.


RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo (CVPR 2023)

Project Page | Paper | arXiv | Video | Poster | Slide

This repository contains the official PyTorch implementation for training and testing the MVS depth estimation method proposed in the paper:

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

Changjiang Cai * , Pan Ji, Qingan Yan, Yi Xu

OPPO US Research Center
* Corresponding author

[Figure: overview of the architecture]

🆕 Updates

  • 11/10/2024: Official code initially released per institutional approval.
  • 06/01/2023: RIAV-MVS paper released, see arXiv paper.

📋 Table of Contents

  1. Overview
  2. Setup
  3. Datasets
  4. Training
  5. Testing and Evaluation
  6. License
  7. Acknowledgements
  8. Citations
  9. Troubleshooting

🌐 Overview

We present a learning-based approach for multi-view stereo (MVS), i.e., estimating the depth map of a reference frame from posed multi-view images. Our core idea is to leverage a "learning-to-optimize" paradigm that iteratively indexes a plane-sweeping cost volume and regresses the depth map via a convolutional Gated Recurrent Unit (GRU). In addition, a pose module refines the relative poses among the multi-view frames, and a self-attention block is applied only to the reference frame to construct an asymmetric matching volume for improved prediction.
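To make the iterative indexing idea concrete, here is a minimal pure-Python sketch for a single pixel. It is an illustration under stated assumptions, not the repository's implementation: `lookup` and `refine_index` are hypothetical names, the values are treated as matching scores where higher means a better match, and a hand-written rule stands in for the learned GRU update.

```python
import math

def lookup(scores, index, radius=2):
    """Sample scores at index-radius..index+radius using linear interpolation."""
    last = len(scores) - 1
    samples = []
    for offset in range(-radius, radius + 1):
        pos = min(max(index + offset, 0.0), float(last))  # clamp to valid bins
        lo = int(math.floor(pos))
        hi = min(lo + 1, last)
        w = pos - lo
        samples.append((1.0 - w) * scores[lo] + w * scores[hi])
    return samples

def refine_index(scores, index, num_iters=8, radius=2):
    """Toy stand-in for the GRU update (assumption): step to the best sampled offset."""
    for _ in range(num_iters):
        samples = lookup(scores, index, radius)
        best_offset = samples.index(max(samples)) - radius
        index = min(max(index + best_offset, 0.0), float(len(scores) - 1))
    return index

# Matching scores of one pixel over 8 depth hypotheses; the best match is bin 4.
scores = [0.1, 0.2, 0.3, 0.5, 0.9, 0.4, 0.2, 0.1]
refined = refine_index(scores, index=1.0)
```

The point of the local lookup is that each iteration only reads a small window of the cost volume around the current estimate, so the update can stay lightweight while the index moves toward the best-matching depth plane.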

See the comparison between our method and other SOTA baselines.

[Figure: comparison between our method and other SOTA baselines]

βš™οΈ Setup

  • The code has been tested with Python 3.10 and PyTorch 2.2.0 with CUDA 12.1. We assume the project is located at ~/riav-mvs. We provide a Dockerfile at docker/Dockerfile_oppo and the library requirements at docker/dev_requirements.txt, which are installed when you build the Docker image (see below).

Docker Environment:

  • Build and run the docker:
cd ~/riav-mvs/docker
sh # build docker container
# it will generate a container with tag "changjiang_cai/mvs-raft:1.0".
# You can change the tag name in this script.

sh # run the container just generated above;

## (Optional) [Useful Tips ✍]: To exit Docker container without stopping it, press Ctrl+P followed by Ctrl+Q; 
# If you want to exit the container's interactive shell session, 
# but do not want to interrupt the processes running in it,
# press Ctrl+P followed by Ctrl+Q. 
# This operation detaches the container and 
# allows you to return to your system's shell.

## (Optional) [Useful Tips ✍]: Re-enter the Container
# get the container id
docker ps
#  Run as root, e.g., if you want to install some libraries via `pip`;
#  Here `d89c34efb04a` is the container id;
docker exec -it -u 0 d89c34efb04a bash
#  Run as regular user
docker exec -it d89c34efb04a bash

## (Optional) [Useful Tips ✍]: To save the container to a new docker image;
# After the pip installation, save the container to an image. 
# You can run the following from the outside of the docker container.
docker commit -m "some notes you specified" d89c34efb04a xyz/riavmvs:1.1

💾 Datasets

RGB-D Datasets

To train/evaluate RIAV-MVS, you will need to download the required datasets.

The following datasets are only used for evaluation to show cross-domain generalization performance.

MVS Datasets

📊 Testing and Evaluation

Pretrained Models on ScanNet

You can download the model checkpoints for our method and the baseline methods that we trained from scratch at this link.

Here we provide three variants of the pipeline for ablation:

  • V1: Base model: our proposed paradigm that iteratively indexes a plane-sweeping cost volume and regresses the depth map via a convolutional Gated Recurrent Unit (GRU).
  • V2: +Pose model: with a residual pose module to correct the relative poses, helping the cost volume construction at frame levels.
  • V3: +Pose,Atten model: this is the full model. Besides the modules seen in V1 and V2, this variant includes a transformer block applied to the reference image (but not to the source images). It breaks the symmetry of the Siamese network (which is typically used in MVS to extract image features) to construct the so-called Asymmetric Volume in our paper. It embeds both the pixel-wise local (high-frequency) features via high-pass CNNs and the long-range global (low-frequency) context via self-attention, to store more accurate matching similarity cues.

Our pretrained models, trained on the ScanNet training set, can be downloaded below.

Model Variants                | V1 (Base)                          | V2 (+Pose)                         | V3 (+Pose,Atten)
YAML Config                   | [1] riavmvs_base_test.yaml         | [2] riavmvs_pose_test.yaml         | [3] riavmvs_full_test.yaml
Checkpoint trained on ScanNet | [4] riavmvs_base_epoch_002.pth.tar | [5] riavmvs_pose_epoch_003.pth.tar | [6] riavmvs_full_epoch_007.pth.tar

Finetuned Models on DTU

Our models pretrained on ScanNet are further finetuned on the DTU training set and can be downloaded below. Here we skip the V2 (+Pose) model on DTU since DTU has accurate poses.

Model Variants              | V1 (Base)                             | V3 (+Pose,Atten)
YAML Config                 | [7] riavmvs_base_dtu_test.yaml        | see [3] above
Checkpoint finetuned on DTU | [8] riavmvs_base_dtu_epoch_04.pth.tar | [9] riavmvs_full_dtu_epoch_03.pth.tar

Key Hyper-Parameters

  1. Base model on ScanNet: Download our pretrained checkpoint shown in the table above and save it to a directory, e.g., the model [4]: checkpoints_nfs/saved/released/riavmvs_base_epoch_002.pth.tar trained on ScanNet. You can find the config YAML file at [1]: config/riavmvs_base_test.yaml. Pay attention to these parameters:
raft_mvs_type: 'raft_mvs' # mvs depth module;
pose_net_type: "none"
raft_depth_init_type: 'none'
  2. Base model on DTU: Download our checkpoint shown in the table above and save it to a directory, e.g., the model [8]: checkpoints_nfs/saved/released/riavmvs_base_dtu_epoch_04.pth.tar finetuned on DTU. You can find the config YAML file at [7]: riavmvs_base_dtu_test.yaml. Pay attention to these parameters:
fusion_pairnet_feats: False # no feature fusion layers;
# -- mvs plane sweeping setup -- #
num_depth_bins: 96 # here we use 96 depth hypotheses planes;
  3. +Pose model on ScanNet: Download our pretrained checkpoint and save it to a directory, e.g., the model [5]: checkpoints_nfs/saved/released/riavmvs_pose_epoch_003.pth.tar trained on ScanNet. You can find the config YAML file at [2]: config/riavmvs_pose_test.yaml. Pay attention to these parameters:
raft_mvs_type: 'raft_mvs' # mvs depth module;
pose_net_type: "resnet_pose"
raft_depth_init_type: 'none'
  4. Full model on ScanNet or DTU: Download our pretrained checkpoint and save it to a directory, e.g., the model [6]: checkpoints_nfs/saved/released/riavmvs_full_epoch_007.pth.tar trained on ScanNet and the model [9]: checkpoints_nfs/saved/released/riavmvs_full_dtu_epoch_03.pth.tar finetuned on DTU. You can find the config YAML file at [3]: config/riavmvs_full_test.yaml. Pay attention to these parameters:
raft_mvs_type: 'raft_mvs_asyatt_f1_att' # attention to frame f1;
pose_net_type: "resnet_pose"
raft_depth_init_type: 'soft-argmin'
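The `num_depth_bins` and `raft_depth_init_type: 'soft-argmin'` settings above can be illustrated with a small sketch. It assumes uniformly spaced depth hypotheses and the standard soft-argmin (a softmax-weighted average of the hypotheses). The helper names, the depth range, and the cost convention (lower cost means a better match) are assumptions for illustration; the repository may sample depth planes or initialize the depth differently.

```python
import math

def depth_hypotheses(d_min, d_max, num_bins=96):
    """Uniformly spaced depth planes between d_min and d_max (assumed spacing)."""
    step = (d_max - d_min) / (num_bins - 1)
    return [d_min + i * step for i in range(num_bins)]

def soft_argmin(costs, depths):
    """Expected depth under softmax(-cost), computed for a single pixel."""
    m = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - m)) for c in costs]
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, depths)) / total

# Hypothetical depth range in meters, chosen only for this example.
depths = depth_hypotheses(0.5, 10.0, num_bins=96)

# One pixel whose matching cost is low only at plane 40.
costs = [50.0] * 96
costs[40] = 0.0
init_depth = soft_argmin(costs, depths)
```

Unlike a hard argmin, the weighted average is differentiable and can land between two depth planes, which is why it is a common choice for producing an initial depth from a cost volume.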

Evaluation on ScanNet Test Set

We perform same-domain evaluation, i.e., models are trained on the training set of ScanNet (or DTU) and then evaluated on the test set of the same dataset.

We evaluate two sampling strategies to generate the evaluation frames: 1) the simple view selection strategy (i.e., sampling every 10 frames, resulting in 20,668 samples) as in ESTDepth (CVPR'21), and 2) the heuristic keyframe selection as in Deep-VideoMVS (CVPR'21), resulting in 25,481 samples.

Set this hyperparameter in the YAML configuration files accordingly:

scannet_eval_sampling_type: 'e-s10n3' # use this for simple sampling 
scannet_eval_sampling_type: 'd-kyn3' # use this for keyframe sampling 
Evaluation on ScanNet Test-Set (unit: meter):
Model             | Config  | Checkpoint | Simple Sampling                 | Keyframe Sampling
                  |         |            | Abs-rel ↓ | Abs ↓  | δ < 1.25 ↑ | Abs-rel ↓ | Abs ↓  | δ < 1.25 ↑
Our (base)        | see [1] | see [4]    | 0.0885    | 0.1605 | 0.9211     | 0.0843    | 0.1603 | 0.9280
Our (+pose)       | see [2] | see [5]    | 0.0827    | 0.1523 | 0.9277     | 0.0790    | 0.1525 | 0.9344
Our (+pose,atten) | see [3] | see [6]    | 0.0734    | 0.1381 | 0.9395     | 0.0692    | 0.1362 | 0.9470
  • Here we provide three variants of our model for ablation. The results show that both the proposed residual pose module (see Our (base) vs. Our (+pose)) and the so-called asymmetric attention module, which is applied to the reference view only (see Our (+pose,atten) vs. Our (+pose)), help boost the performance.
  • Our (+pose,atten) is the full model and the default one used in most of the experiments.
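The reported metrics follow common depth-evaluation definitions. The sketch below shows how they are typically computed on flattened depth values; `depth_metrics` is a hypothetical helper, and the masking rule (only pixels with ground truth greater than zero are counted) is an assumption rather than a description of the evaluation script in this repository.

```python
import math

def depth_metrics(pred, gt, threshold=1.25):
    """Compute Abs-rel, Abs, RMSE and the delta < threshold inlier ratio."""
    # Assumption: pixels with gt <= 0 have no valid ground truth and are skipped.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    n = len(pairs)
    abs_err = sum(abs(p - g) for p, g in pairs) / n          # mean absolute error
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n      # error relative to gt
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # A pixel is an inlier when max(pred/gt, gt/pred) is below the threshold.
    inliers = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return {"abs_rel": abs_rel, "abs": abs_err, "rmse": rmse, "delta": inliers / n}

# Four pixels; the last one has no ground truth and is ignored.
metrics = depth_metrics(pred=[1.0, 2.2, 3.0, 5.0], gt=[1.0, 2.0, 4.0, 0.0])
```

Abs-rel is unitless, while Abs and RMSE carry the unit of the depth maps, which is why the same metric names can appear with values in meters for one dataset and millimeters for another.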

Evaluation on DTU Test Set

Evaluation on DTU Test-Set (unit: mm):
Model             | Config  | Checkpoint | Abs-rel ↓ | Abs ↓  | rmse ↓
Our (base)        | see [7] | see [8]    | 0.0102    | 7.3564 | 19.6125
Our (+pose,atten) | see [3] | see [9]    | 0.0091    | 6.7214 | 18.5950
  • Here we provide two variants of our model for ablation. The results show that the so-called asymmetric attention module, which is applied to the reference view only, helps boost the performance.

Evaluation Script to Replicate Paper Results

Running the script to load the checkpoints mentioned above will reproduce the results in Table 1 of our paper.


The default parameters are


Baseline Models: We also provide the checkpoints for the baseline methods MVSNet, PairNet, and IterMVS. We trained those baseline models from scratch using the same training dataset and data augmentation as our proposed method for a fair comparison. You can find the YAML configuration files at config/*. Checkpoints are specified for each baseline in the bash script.

See the bash script for more details; it provides flexible arguments for running experiments easily and portably.

⏳ Training

For network training, run the following script


The default parameters are


You can adjust the other arguments as needed. More details can be found in the bash script, which is designed for running experiments easily and portably.

βš–οΈ License

The code in this repository is licensed under the MIT License; see LICENSE.

🙏 Acknowledgements

Our work partially adopts code from RAFT (BSD 3-Clause License), RAFT-Stereo (MIT License), DeepVideoMVS (MIT License) and GMA (WTFPL License). We also compare our method with the baselines IterMVS (MIT License) and ESTDepth (MIT License) in most of the experiments. We sincerely thank the authors for making these repos available.

📑 Citations

If you find our work useful, please consider citing our paper:

    author    = {Cai, Changjiang and Ji, Pan and Yan, Qingan and Xu, Yi},
    title     = {RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo},
    booktitle = {CVPR},
    month     = {June},
    year      = {2023},
    pages     = {919-928}

Please also consider citing our other MVS paper if you find it useful:

    author    = {Liu, Jiachen and Ji, Pan and Bansal, Nitin and Cai, Changjiang and Yan, Qingan and Huang, Xiaolei and Xu, Yi},
    title     = {PlaneMVS: 3D Plane Reconstruction From Multi-View Stereo},
    booktitle = {CVPR},
    month     = {June},
    year      = {2022},
    pages     = {8665-8675}

🛠️ Troubleshooting

We will keep updating this section as issues arise.

  • [1] torchvision/models/ UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
    • Solution: change line 26 self.encoder = resnets[num_layers](pretrained) to self.encoder = resnets[num_layers](weights="IMAGENET1K_V2" if pretrained else None), e.g., in the file third_parties/ESTDepth/hybrid_models/
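If the goal is to keep the same ImageNet weights that the deprecated `pretrained=True` flag used to load, a small helper like the one below may be convenient. This is a sketch under assumptions: `legacy_weights` is a hypothetical name, and it relies on torchvision (0.13 or newer) accepting weight names as strings and on `pretrained=True` having corresponded to `IMAGENET1K_V1`. Note also that torchvision publishes `IMAGENET1K_V2` weights only for some ResNet depths (e.g., ResNet-50), so a V1 name is the safer choice when the encoder depth is configurable.

```python
def legacy_weights(pretrained):
    """Map the deprecated boolean `pretrained` flag to a `weights` value."""
    # "IMAGENET1K_V1" names the weights that `pretrained=True` used to load;
    # None requests random initialization.
    return "IMAGENET1K_V1" if pretrained else None

# Hypothetical usage inside the encoder constructor (requires torchvision):
#   self.encoder = resnets[num_layers](weights=legacy_weights(pretrained))
```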

