This repository provides the implementation of the following paper:
MMSNet: Multi-Modal scene recognition using multi-scale encoded features
Ali Caglayan *, Nevrez Imamoglu *, Ryosuke Nakamura
[Paper]
Before starting, the following libraries need to be installed. Note that the package versions may need to be changed depending on the system:
conda create -n mmsnet python=3.7
conda activate mmsnet
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install -U scikit-learn
pip install opencv-python
pip install psutil
pip install h5py
Also, the source code path may need to be added to `PYTHONPATH` (e.g. `export PYTHONPATH=$PYTHONPATH:/path_to_project/MMSNet/src/utils`).
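As an optional sanity check (not part of the original instructions), the installed packages and CUDA availability can be verified from within the environment:

```python
import cv2
import h5py
import sklearn
import torch
import torchvision

# Print library versions and confirm that the GPU is visible to PyTorch.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__, "| opencv:", cv2.__version__)
print("scikit-learn:", sklearn.__version__, "| h5py:", h5py.__version__)
```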
SUN RGB-D Scene dataset is the largest real-world RGB-D indoor dataset to date. Download the dataset from here and keep the file structure as is after extracting the files. In addition, the `allsplit.mat` and `SUNRGBDMeta.mat` files need to be downloaded from the SUN RGB-D toolbox: `allsplit.mat` is under `SUNRGBDtoolbox/traintestSUNRGBD` and `SUNRGBDMeta.mat` is under `SUNRGBDtoolbox/Metadata`. Both files should be placed under the root folder of the SUN RGB-D dataset, e.g.:
```
SUNRGBD ROOT PATH
├── SUNRGBD
│   ├── kv1 ...
│   ├── kv2 ...
│   ├── realsense ...
│   ├── xtion ...
├── allsplit.mat
├── SUNRGBDMeta.mat
```
The dataset is distributed in a complex hierarchy. Therefore, it is reorganized on the local system as follows:
python utils/organize_sunrgb_scene.py --dataset-path <SUNRGBD ROOT PATH>
This creates the train/eval splits and copies the RGB and depth files, together with the camera calibration parameter files for the depth data, under the corresponding split structure. Then, depth colorization is applied as below, which takes a couple of hours.
python utils/depth_colorize.py --dataset "sunrgbd" --dataset-path <SUNRGBD ROOT PATH>
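For reference, the snippet below is a minimal sketch of what depth colorization does: it maps a single-channel depth image to a three-channel color image so that an RGB-pretrained CNN can consume it. The JET colormap here is only an illustrative assumption; the actual encoding used in this work is implemented in `utils/depth_colorize.py`.

```python
import cv2
import numpy as np

def colorize_depth(depth, colormap=cv2.COLORMAP_JET):
    """Illustrative depth colorization: normalize a single-channel depth
    map to 8 bits and apply an OpenCV colormap. The repository's
    utils/depth_colorize.py implements the encoding actually used."""
    depth = depth.astype(np.float32)
    valid = depth > 0  # treat zero values as missing depth
    if not valid.any():
        return np.zeros((*depth.shape, 3), dtype=np.uint8)
    d_min, d_max = depth[valid].min(), depth[valid].max()
    norm = np.zeros_like(depth)
    norm[valid] = (depth[valid] - d_min) / max(d_max - d_min, 1e-6)
    depth_8u = (norm * 255).astype(np.uint8)
    return cv2.applyColorMap(depth_8u, colormap)  # H x W x 3 (BGR)
```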
NYUV2 RGB-D Scene dataset is available here. In addition, the `splits.mat` file needs to be downloaded from here, together with `sceneTypes.txt` from here. The dataset structure should be something like below:
```
NYUV2 ROOT PATH
├── nyu_depth_v2_labeled.mat
├── splits.mat
├── sceneTypes.txt
```
Unlike the other datasets, the NYUV2 dataset is provided as a MATLAB .mat file, `nyu_depth_v2_labeled.mat`. This work uses the provided in-painted depth maps and RGB images. In order to prepare the depth data offline, depth colorization can be applied as follows:
python utils/depth_colorize.py --dataset "nyuv2" --dataset-path <NYUV2 ROOT PATH>
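Since `nyu_depth_v2_labeled.mat` is stored in MATLAB v7.3 (HDF5) format, it can be inspected directly with `h5py` (installed above). The snippet below is only a hypothetical illustration: the dataset keys (`images`, `depths`) and the axis order follow the standard NYU Depth V2 labeled release and may need adjusting, and `utils/depth_colorize.py` takes care of the actual loading.

```python
import h5py
import numpy as np

# Hypothetical inspection of the labeled NYUV2 .mat file (HDF5 under the hood).
# MATLAB stores arrays with reversed axis order, hence the transposes below.
with h5py.File("nyu_depth_v2_labeled.mat", "r") as f:
    print(list(f.keys()))                              # e.g. 'images', 'depths', ...
    rgb0 = np.transpose(f["images"][0], (2, 1, 0))     # -> H x W x 3 (uint8)
    depth0 = np.transpose(f["depths"][0], (1, 0))      # -> H x W (in-painted depth)
    print(rgb0.shape, depth0.shape)
```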
Fukuoka RGB-D Indoor Scene dataset is used for benchmarking for the first time in the literature in this work. There are 6 categories: corridor, kitchen, lab, office, study room, and toilet (see the download links below). The files should be extracted into a parent folder (e.g. `fukuoka`). The dataset structure should be something like below:
```
Fukuoka ROOT PATH
├── fukuoka
│   ├── corridors ...
│   ├── kitchens ...
│   ├── labs ...
│   ├── offices ...
│   ├── studyrooms ...
│   ├── toilets ...
```
The dataset is organized using the following command, which creates `eval-set` under the root path:
python utils/organize_fukuoka_scene.py --dataset-path <Fukuoka ROOT PATH>
Then, depth colorization is applied in the same way as for the other datasets:
python utils/depth_colorize.py --dataset "fukuoka" --dataset-path <Fukuoka ROOT PATH>
The trained models that produce the results in the paper are provided in the tree hierarchy below. Download the models to run the evaluation code. Note that we share the random weights used in our experiments here. However, it is possible to generate new random weights using the parameter `--reuse-randoms 0` (default: 1). The results might then change slightly (could be higher or lower). We discuss the effect of randomness in our previous paper here. Note that this random modeling should be done during the training process, not only for the evaluation, as a new random set naturally creates a new distribution.
```
ROOT PATH TO MODELS
├── models
│   ├── resnet101_sun_rgb_best_checkpoint.pth
│   ├── resnet101_sun_depth_best_checkpoint.pth
│   ├── sunrgbd_mms_best_checkpoint.pth
│   ├── nyuv2_mms_best_checkpoint.pth
│   ├── fukuoka_mms_best_checkpoint.pth
├── random_weights
│   ├── resnet101_reduction_random_weights.pkl
│   ├── resnet101_rnn_random_weights.pkl
```
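If you want to inspect a downloaded checkpoint before running the evaluation, it can be loaded with standard PyTorch. This is only a generic illustration; the internal structure of the checkpoints is defined by the training code, so print the keys to see what each file contains.

```python
import torch

# Load a checkpoint on the CPU and list its top-level entries.
ckpt = torch.load("models/sunrgbd_mms_best_checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```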
After preparing the data and downloading the models, run the following commands to evaluate the models on SUN RGB-D, NYUV2, and Fukuoka RGB-D:
python eval_models.py --dataset "sunrgbd" --dataset-path <SUNRGBD ROOT PATH> --models-path <ROOT PATH TO MODELS>
python eval_models.py --dataset "nyuv2" --dataset-path <NYUV2 ROOT PATH> --models-path <ROOT PATH TO MODELS>
python eval_models.py --dataset "fukuoka" --dataset-path <Fukuoka ROOT PATH> --models-path <ROOT PATH TO MODELS>
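For example, assuming `eval_models.py` exposes the `--reuse-randoms` parameter described above, the SUN RGB-D evaluation can be run with freshly generated random weights as follows (as noted above, for a faithful comparison the new random weights should also be used during training):
python eval_models.py --dataset "sunrgbd" --dataset-path <SUNRGBD ROOT PATH> --models-path <ROOT PATH TO MODELS> --reuse-randoms 0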
Multi-modal performance comparison of this work (MMSNet) with the related methods on SUN RGB-D, NYUV2 RGB-D, and Fukuoka RGB-D Scene datasets in terms of accuracy (%). * indicates additional use of large-scale data with multi-task training.
Method | Paper | SUN RGB-D | NYUV2 RGB-D | Fukuoka RGB-D |
---|---|---|---|---|
Places CNN-RBF SVM | NeurIPS’14 | 39.0 | - | - |
SS-CNN-R6 | ICRA’16 | 41.3 | - | - |
DMFF | CVPR’16 | 41.5 | - | - |
Places CNN-RCNN | CVPR’16 | 48.1 | 63.9 | - |
MSMM | IJCAI’17 | 52.3 | 66.7 | - |
RGB-D-CNN | AAAI’17 | 52.4 | 65.8 | - |
D-BCNN | RAS’17 | 55.5 | 64.1 | - |
MDSI-CNN | TPAMI’18 | 45.2 | 50.1 | - |
DF2Net | AAAI’18 | 54.6 | 65.4 | - |
HP-CNN-T | Auton.’19 | 42.2 | - | - |
LM-CNN | Cogn. Comput.’19 | 48.7 | - | - |
RGB-D-OB | TIP’19 | 53.8 | 67.5 | - |
Cross-Modal Graph | AAAI’19 | 55.1 | 67.4 | - |
RAGC | ICCVW’19 | 42.1 | - | - |
MAPNet | PR’19 | 56.2 | 67.7 | - |
TRecgNet Aug | CVPR’19 | 56.7 | 69.2 | - |
G-L-SOOR | TIP’20 | 55.5 | 67.4 | - |
MSN | Neurocomp.’20 | 56.2 | 68.1 | - |
CBCL | BMVC’20 | 59.5 | 70.9 | - |
ASK | TIP’21 | 57.3 | 69.3 | - |
2D-3D FusionNet | Inf. Fusion’21 | 58.6 | 75.1 | - |
TRecgNet Aug | IJCV’21 | 59.8 | 71.8 | - |
CNN-randRNN | CVIU’22 | 60.7 | 69.1 | 78.3 |
MMSNet | This work | 62.0 | 72.2 | 81.7 |
Omnivore * | CVPR’22 | 67.1 | 79.8 | - |
We also share our LaTeX comparison tables together with the BibTeX file for SUN RGB-D and NYUV2 benchmarking (see the `LaTeX` directory). Feel free to use them.
If you find this work useful in your research, please cite the following papers:
@article{Caglayan2022MMSNet,
title={MMSNet: Multi-Modal Scene Recognition Using Multi-Scale Encoded Features},
journal = {Image and Vision Computing},
volume = {122},
pages = {104453},
author={Ali Caglayan and Nevrez Imamoglu and Ryosuke Nakamura},
doi = {https://doi.org/10.1016/j.imavis.2022.104453},
year={2022}
}
@article{Caglayan2022CNNrandRNN,
title={When CNNs meet random RNNs: Towards multi-level analysis for RGB-D object and scene recognition},
journal = {Computer Vision and Image Understanding},
author={Ali Caglayan and Nevrez Imamoglu and Ahmet Burak Can and Ryosuke Nakamura},
volume = {217},
pages = {103373},
issn = {1077-3142},
doi = {https://doi.org/10.1016/j.cviu.2022.103373},
year={2022}
}
This project is released under the MIT License (see the LICENSE file for details).
This paper is based on the results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).