DYDevelop/2023-Samsung-AI-Challenge

INTERN-2.5: Multimodal Multitask General Large Model


InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

2023 Samsung AI Challenge: Camera-Invariant Domain Adaptation

GOAL

Autonomous driving relies on a variety of sensors to perceive the surrounding environment and controls the vehicle based on that perception.

For camera sensors, a domain gap arises between images depending on the mounting position, the type of sensor, the driving environment, and so on. Prior work has widely applied unsupervised domain adaptation to overcome the drop in recognition performance caused by gaps in photometry and texture.

However, most existing studies do not consider the domain gap caused by the optical characteristics of the camera, in particular geometric distortion. This competition therefore calls for an AI algorithm that uses undistorted images and their labels (source domain) to perform high-quality semantic segmentation on distorted images (target domain).

A Look into the Data

  • Train Data and Target Data
     The training set consists of undistorted, labeled images, while the target set contains the
     distorted images we actually need to segment. The approach is therefore to distort the undistorted
     training images so that they resemble the target images as closely as possible. In addition, because
     a background mask is overlaid on each image, a Background class is added to the original 12 classes,
     giving 13 classes in total (a dataset-registration sketch follows this list).

     # Prepare the data with 13 classes in total
     CLASSES = ('Road', 'Sidewalk', 'Construction', 'Fence', 'Pole',
                'Traffic_Light', 'Traffic_sign', 'Nature', 'Sky', 'Person',
                'Rider', 'Car', 'Background')
  • Train Data and Augmented Data
     The background surrounding the images in the target dataset is extracted as a mask and overlaid
     on the fisheye-preprocessed images (a sketch of this pipeline follows the list).

     1. Apply a fisheye-effect augmentation to the original images.
     2. Label the background region of the target dataset images and turn it into a mask.
     3. Composite the mask with the preprocessed image to produce an image that closely resembles
        the target dataset.
     4. Because this is a segmentation dataset, the annotations are transformed in exactly the same way.
     -> For details, see segmentation/augment.py
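
The exact implementation lives in segmentation/augment.py; the snippet below is only a minimal sketch of the idea, assuming OpenCV/NumPy images and single-channel label maps. The function names, the radial distortion formula, the strength parameter, and the black background fill are illustrative assumptions, not the repo's actual code.

    # Sketch only (not the repo's augment.py): fisheye-style warp plus background-mask overlay,
    # applied identically to the image and its annotation so pixels and labels stay aligned.
    import cv2
    import numpy as np

    BACKGROUND_ID = 12  # assumption: Background is the 13th class (index 12)

    def fisheye_warp(img, strength=0.4, interp=cv2.INTER_LINEAR):
        """Apply a simple radial (fisheye-style) distortion around the image center."""
        h, w = img.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
        nx = (xs - w / 2) / (w / 2)                    # normalized x in [-1, 1]
        ny = (ys - h / 2) / (h / 2)                    # normalized y in [-1, 1]
        factor = 1 + strength * (nx ** 2 + ny ** 2)    # illustrative radial model
        map_x = (nx * factor * (w / 2) + w / 2).astype(np.float32)
        map_y = (ny * factor * (h / 2) + h / 2).astype(np.float32)
        return cv2.remap(img, map_x, map_y, interpolation=interp,
                         borderMode=cv2.BORDER_CONSTANT, borderValue=0)

    def augment_pair(img, ann, bg_mask, strength=0.4):
        """Warp image and annotation identically, then paste the target-style background mask."""
        img_w = fisheye_warp(img, strength)
        # Nearest-neighbor keeps label ids intact (no blending between class ids)
        ann_w = fisheye_warp(ann, strength, interp=cv2.INTER_NEAREST)
        img_w[bg_mask > 0] = 0                 # fill the background region (black fill is an assumption)
        ann_w[bg_mask > 0] = BACKGROUND_ID     # label the same pixels as Background
        return img_w, ann_w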
    
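Since the InternImage segmentation configs build on mmsegmentation, the 13 classes above also need to be exposed as a dataset (with num_classes set to 13 in the decode head). Below is a minimal registration sketch under the mmsegmentation 0.x CustomDataset API; the class name CamDADataset, the palette colors, and the file suffixes are assumptions rather than the repo's actual definition.

    # Sketch only: register the 13-class dataset with mmsegmentation 0.x.
    from mmseg.datasets.builder import DATASETS
    from mmseg.datasets.custom import CustomDataset

    @DATASETS.register_module()
    class CamDADataset(CustomDataset):
        """12 Cityscapes-style classes plus Background for the camera-invariant DA task."""
        CLASSES = ('Road', 'Sidewalk', 'Construction', 'Fence', 'Pole',
                   'Traffic_Light', 'Traffic_sign', 'Nature', 'Sky', 'Person',
                   'Rider', 'Car', 'Background')
        # One RGB color per class, used only for visualization
        PALETTE = [[128, 64, 128], [244, 35, 232], [70, 70, 70], [190, 153, 153],
                   [153, 153, 153], [250, 170, 30], [220, 220, 0], [107, 142, 35],
                   [70, 130, 180], [220, 20, 60], [255, 0, 0], [0, 0, 142],
                   [0, 0, 0]]

        def __init__(self, **kwargs):
            # File suffixes are assumptions; match them to the actual data layout
            super().__init__(img_suffix='.png', seg_map_suffix='.png', **kwargs)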

After Data Preparation

  • Training Code (2 × RTX 3090 = 48 GB)

     # Training from the first iteration
     bash dist_train.sh configs/cityscapes/upernet_internimage_b_512x1024_160k_cityscapes.py 2
    
     # Resume Training
     bash dist_train.sh configs/cityscapes/upernet_internimage_b_512x1024_160k_cityscapes.py 2 --resume-from work_dirs/upernet_internimage_b_512x1024_160k_cityscapes/latest.pth
  • Inference Code (1 × RTX 3090 = 24 GB)

     # Inference on Test Dataset with visualization and saving pred masks
     python test.py work_dirs/upernet_internimage_b_512x1024_160k_cityscapes/upernet_internimage_b_512x1024_160k_cityscapes.py \
     work_dirs/upernet_internimage_b_512x1024_160k_cityscapes/latest.pth --show-dir visualization

    The original test.py could only produce visualizations and a pickle-format result file, but the competition submission requires the predicted masks, so the code was modified to also save the predicted masks to the work_dirs/Pred_masks folder (see the sketch after this list).

  • Submission (CPU)

     # Create a CSV file for submission
     python submit.py
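
As a rough illustration of the last two steps, the sketch below reads the masks saved in work_dirs/Pred_masks and writes them into a CSV. The run-length encoding, the per-class rows, and the column names (id, mask_rle) are assumptions about the submission format rather than the repo's actual submit.py, so check submit.py for the real logic.

    # Sketch only: convert the saved prediction masks into a submission CSV.
    # RLE scheme, class range, and column names are assumptions, not the actual submit.py.
    import os
    import cv2
    import numpy as np
    import pandas as pd

    def rle_encode(mask):
        """Run-length encode a binary mask in row-major order ('' if the class is absent)."""
        pixels = np.concatenate([[0], mask.flatten(), [0]])
        runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
        runs[1::2] -= runs[::2]
        return ' '.join(map(str, runs))

    rows = []
    mask_dir = 'work_dirs/Pred_masks'
    for fname in sorted(os.listdir(mask_dir)):
        pred = cv2.imread(os.path.join(mask_dir, fname), cv2.IMREAD_GRAYSCALE)
        for class_id in range(12):  # assumption: Background (id 12) is not part of the submission
            rows.append({'id': f'{os.path.splitext(fname)[0]}_{class_id}',
                         'mask_rle': rle_encode((pred == class_id).astype(np.uint8))})

    pd.DataFrame(rows).to_csv('submission.csv', index=False)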

Highlights

  • 👍 The strongest open-source visual universal backbone model, with up to 3 billion parameters
  • 🏆 90.1% Top-1 accuracy on ImageNet, the highest among open-source models
  • 🏆 65.5 mAP on the COCO object detection benchmark, the only model to exceed 65.0 mAP

Introduction

"INTERN-2.5" is a powerful multimodal multitask general model jointly released by SenseTime and Shanghai AI Laboratory. It consists of large-scale vision foundation model "InternImage", pre-training method "M3I-Pretraining", generic decoder "Uni-Perceiver" series, and generic encoder for autonomous driving perception "BEVFormer" series.

Applications

🌅 Image Modality Tasks

"INTERN-2.5" achieved an impressive Top-1 accuracy of 90.1% on the ImageNet benchmark dataset using only publicly available data for image classification. Apart from two undisclosed models trained with additional datasets by Google and Microsoft, "INTERN-2.5" is the only open-source model that achieves a Top-1 accuracy of over 90.0%, and it is also the largest model in scale worldwide.

"INTERN-2.5" outperformed all other models worldwide on the COCO object detection benchmark dataset with a remarkable mAP of 65.5, making it the only model that surpasses 65 mAP in the world.

"INTERN-2.5" also demonstrated world's best performance on 16 other important visual benchmark datasets, covering a wide range of tasks such as classification, detection, and segmentation, making it the top-performing model across multiple domains.

Performance

  • Classification
     Task                      Benchmark         Score
     Image Classification      ImageNet          90.1
     Scene Classification      Places365         61.2
     Scene Classification      Places 205        71.7
     Long-Tail Classification  iNaturalist 2018  92.3
  • Detection
     Task                                 Benchmark     Score
     Conventional Object Detection        COCO          65.5
     Conventional Object Detection        VOC 2007      94.0
     Conventional Object Detection        VOC 2012      97.2
     Conventional Object Detection        OpenImage     74.1
     Long-Tail Object Detection           LVIS minival  65.8
     Long-Tail Object Detection           LVIS val      63.2
     Autonomous Driving Object Detection  BDD100K       38.8
     Autonomous Driving Object Detection  nuScenes      64.8
     Dense Object Detection               CrowdHuman    97.2
  • Segmentation
     Task                   Benchmark       Score
     Semantic Segmentation  ADE20K          62.9
     Semantic Segmentation  COCO Stuff-10K  59.6
     Semantic Segmentation  Pascal Context  70.3
     Street Segmentation    CityScapes      86.1
     RGBD Segmentation      NYU Depth V2    69.7

🌁 📖 Image and Text Cross-Modal Tasks

Image-Text Retrieval: "INTERN-2.5" can quickly locate and retrieve the most semantically relevant images based on textual content requirements. This capability can be applied to both videos and image collections and can be further combined with object detection boxes to enable a variety of applications, helping users quickly and easily find the required image resources. For example, it can return the relevant images specified by the text in the album.

Image-To-Text: "INTERN-2.5" has a strong understanding capability in various aspects of visual-to-text tasks such as image captioning, visual question answering, visual reasoning, and optical character recognition. For example, in the context of autonomous driving, it can enhance the scene perception and understanding capabilities, assist the vehicle in judging traffic signal status, road signs, and other information, and provide effective perception information support for vehicle decision-making and planning.

Performance

Task                              Benchmark     Score
Image Captioning                  COCO Caption  148.2
Fine-tuning Image-Text Retrieval  COCO Caption  76.4
Fine-tuning Image-Text Retrieval  Flickr30k     94.8
Zero-shot Image-Text Retrieval    Flickr30k     89.1

Released Models

Open-source Visual Pretrained Models
name pretrain resolution #param download
InternImage-L ImageNet-22K 384x384 223M ckpt
InternImage-XL ImageNet-22K 384x384 335M ckpt
InternImage-H Joint 427M 384x384 1.08B ckpt
InternImage-G - 384x384 3B ckpt
ImageNet-1K Image Classification
name pretrain resolution acc@1 #param FLOPs download
InternImage-T ImageNet-1K 224x224 83.5 30M 5G ckpt | cfg
InternImage-S ImageNet-1K 224x224 84.2 50M 8G ckpt | cfg
InternImage-B ImageNet-1K 224x224 84.9 97M 16G ckpt | cfg
InternImage-L ImageNet-22K 384x384 87.7 223M 108G ckpt | cfg
InternImage-XL ImageNet-22K 384x384 88.0 335M 163G ckpt | cfg
InternImage-H Joint 427M 640x640 89.6 1.08B 1478G ckpt | cfg
InternImage-G - 512x512 90.1 3B 2700G ckpt | cfg
COCO Object Detection and Instance Segmentation
backbone method schedule box mAP mask mAP #param FLOPs download
InternImage-T Mask R-CNN 1x 47.2 42.5 49M 270G ckpt | cfg
InternImage-T Mask R-CNN 3x 49.1 43.7 49M 270G ckpt | cfg
InternImage-S Mask R-CNN 1x 47.8 43.3 69M 340G ckpt | cfg
InternImage-S Mask R-CNN 3x 49.7 44.5 69M 340G ckpt | cfg
InternImage-B Mask R-CNN 1x 48.8 44.0 115M 501G ckpt | cfg
InternImage-B Mask R-CNN 3x 50.3 44.8 115M 501G ckpt | cfg
InternImage-L Cascade Mask R-CNN 1x 54.9 47.7 277M 1399G ckpt | cfg
InternImage-L Cascade Mask R-CNN 3x 56.1 48.5 277M 1399G ckpt | cfg
InternImage-XL Cascade Mask R-CNN 1x 55.3 48.1 387M 1782G ckpt | cfg
InternImage-XL Cascade Mask R-CNN 3x 56.2 48.8 387M 1782G ckpt | cfg
backbone method box mAP (val/test) #param FLOPs download
InternImage-H DINO (TTA) 65.0 / 65.4 2.18B TODO TODO
InternImage-G DINO (TTA) 65.3 / 65.5 3B TODO TODO
ADE20K Semantic Segmentation
backbone method resolution mIoU (ss/ms) #param FLOPs download
InternImage-T UperNet 512x512 47.9 / 48.1 59M 944G ckpt | cfg
InternImage-S UperNet 512x512 50.1 / 50.9 80M 1017G ckpt | cfg
InternImage-B UperNet 512x512 50.8 / 51.3 128M 1185G ckpt | cfg
InternImage-L UperNet 640x640 53.9 / 54.1 256M 2526G ckpt | cfg
InternImage-XL UperNet 640x640 55.0 / 55.3 368M 3142G ckpt | cfg
InternImage-H UperNet 896x896 59.9 / 60.3 1.12B 3566G ckpt | cfg
InternImage-H Mask2Former 896x896 62.5 / 62.9 1.31B 4635G ckpt | cfg
Main Results of FPS

  • Export classification model from PyTorch to TensorRT
  • Export detection model from PyTorch to TensorRT
  • Export segmentation model from PyTorch to TensorRT

name resolution #param FLOPs batch 1 FPS (TensorRT)
InternImage-T 224x224 30M 5G 156
InternImage-S 224x224 50M 8G 129
InternImage-B 224x224 97M 16G 116
InternImage-L 384x384 223M 108G 56
InternImage-XL 384x384 335M 163G 47

Before using mmdeploy to convert our PyTorch models to TensorRT, please make sure that the DCNv3 custom operator has been built correctly. You can build it with the following commands:

export MMDEPLOY_DIR=/the/root/path/of/MMDeploy

# prepare our custom ops; they can be found at InternImage/tensorrt/modulated_deform_conv_v3
cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt

# build custom ops
cd ${MMDEPLOY_DIR}
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# install the mmdeploy after building custom ops
cd ${MMDEPLOY_DIR}
pip install -e .

For more details on building custom ops, please refer to this document.

Citations

If this work is helpful for your research, please consider citing the following BibTeX entries.

@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}

@inproceedings{zhu2022uni,
  title={Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks},
  author={Zhu, Xizhou and Zhu, Jinguo and Li, Hao and Wu, Xiaoshi and Li, Hongsheng and Wang, Xiaohua and Dai, Jifeng},
  booktitle={CVPR},
  pages={16804--16815},
  year={2022}
}

@article{zhu2022uni,
  title={Uni-perceiver-moe: Learning sparse generalist models with conditional moes},
  author={Zhu, Jinguo and Zhu, Xizhou and Wang, Wenhai and Wang, Xiaohua and Li, Hongsheng and Wang, Xiaogang and Dai, Jifeng},
  journal={arXiv preprint arXiv:2206.04674},
  year={2022}
}

@article{li2022uni,
  title={Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks},
  author={Li, Hao and Zhu, Jinguo and Jiang, Xiaohu and Zhu, Xizhou and Li, Hongsheng and Yuan, Chun and Wang, Xiaohua and Qiao, Yu and Wang, Xiaogang and Wang, Wenhai and others},
  journal={arXiv preprint arXiv:2211.09808},
  year={2022}
}

@article{yang2022bevformer,
  title={BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision},
  author={Yang, Chenyu and Chen, Yuntao and Tian, Hao and Tao, Chenxin and Zhu, Xizhou and Zhang, Zhaoxiang and Huang, Gao and Li, Hongyang and Qiao, Yu and Lu, Lewei and others},
  journal={arXiv preprint arXiv:2211.10439},
  year={2022}
}

@article{su2022towards,
  title={Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information},
  author={Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng},
  journal={arXiv preprint arXiv:2211.09807},
  year={2022}
}

@inproceedings{li2022bevformer,
  title={Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers},
  author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  booktitle={ECCV},
  pages={1--18},
  year={2022},
}
