BEVDet implemented by TensorRT, C++

This project is a TensorRT implementation for BEVDet inference, written in C++. It can be tested on the nuScenes dataset and also provides a single test sample. BEVDet is a multi-camera 3D object detection model in bird's-eye view. For more details about BEVDet, please refer to the following link BEVDet. The script to export the ONNX model is in this repository.

NEWS : A new branch "one" has been released. This branch implements TensorRT-plugins. bevdet-tensorrt-cpp

This project implements the following:

Long-term model
Depth model
On the NVIDIA A4000, the BEVDet-r50-lt-depth model shows a 2.38x faster inference speed for TRT FP32 compared to PyTorch FP32, and a 5.21x faster inference speed for TRT FP16 compared to PyTorch FP32
On the Jetson AGX Orin, the FP16 model inference time is around 29 ms, achieving real-time performance
A Dataloader for the nuScenes dataset and can be used to test on the dataset
Fine-tuned the model to solve the problem that the model is sensitive to input resize sampling, which leads to the decline of mAP and NDS
An Attempt at Int8 Quantization

The features of this project are as follows:

A CUDA Kernel that combines Resize, Crop, and Normalization for preprocessing
The Preprocess CUDA kernel includes two interpolation methods: Nearest Neighbor Interpolation and Bicubic Interpolation
Alignment of adjacent frame BEV features using C++ and CUDA kernel implementation
Multi-threading and multi-stream NvJPEG
Sacle-NMS
Remove the preprocess module in BEV encoder

The following parts need to be implemented:

Quantization to int8.
Integrate the bevpool and adjacent frame BEV feature alignment components into the engine as plugins
Exception handling

Results && Speed

Inference Speed

All time units are in milliseconds (ms), and Nearest interpolation is used by default.

	Preprocess	Image stage	BEV pool	Align Feature	BEV stage	Postprocess	mean Total
NVIDIA A4000 FP32	0.478	16.559	0.151	0.899	6.848	0.558	25.534
NVIDIA A4000 FP16	0.512	8.627	0.168	0.925	2.966	0.619	13.817
NVIDIA A4000 Int8	0.467	3.929	0.143	0.885	1.771	0.631	7.847
Jetson AGX Orin FP32	2.800	38.09	0.620	2.018	11.893	1.065	55.104
Jetson AGX Orin FP16	2.816	17.025	0.571	2.111	5.747	0.919	29.189
Jetson AGX Orin Int8	2.924	10.340	0.596	1.861	4.004	0.982	20.959

Note: The inference time of the module refers to the time of a frame, while the total time is calculated as the average time of 200 frames.

Results

Model	Description	mAP	NDS	Infer time
Pytorch		0.3972	0.5074	96.052
Pytorch	LSS accelerate¹	0.3787	0.4941	86.236
Trt FP32	Python Preprocess²	0.3776	0.4936	25.534
Trt FP32	Bicubic sampler³	0.3723	0.3895	33.960
Trt FP32	Nearest sampler⁴	0.3703	0.4884	25.534
Trt FP16	Nearest sampler	0.3702	0.4883	13.817
Pytorch	Nearest sampler ⁵	0.3989	0.5169	——
Pytorch	LSS accelerate ⁵	0.3800	0.4997	——
Trt FP16	⁵	0.3785	0.5013	12.738

Note: The PyTorch model does not include preprocessing time, and all models were tested on an NVIDIA A4000 GPU

LSS accelerate refers to the process of pre-computing and storing the data used for BEVPool mapping during the View Transformer stage to improve inference speed. The pre-stored data is calculated based on the camera's intrinsic and extrinsic parameters. Due to slight differences in the intrinsic and extrinsic parameters of certain scenes in nuScenes, enabling the LSS accelerate can result in a decrease in precision. However, if the camera's intrinsic parameters remain unchanged and the extrinsic parameters between the camera coordinate system and the Ego coordinate system also remain unchanged, using LSS Accelerate will not result in a decrease in precision.
Some networks are very sensitive to input, and the Pytorch models use PIL's resize function with default Bicubic interpolation for preprocessing. During inference, neither OpenCV's Bicubic interpolation nor our own implementation of Bicubic interpolation can achieve the accuracy of Pytorch. We speculate that the network may be slightly overfitting or learning certain features of the sampler, which leads to a decrease in accuracy when the interpolation method is changed. Here, using Python preprocessing as input, it can be seen that the accuracy of the TRT model does not decrease in some cases
If as stated in point 2, we use our own implementation of Bicubic interpolation, it cannot achieve the performance of Python preprocessing.
Nearest is 20 times slower than Bicubic and will be used as the default sampling method.
Fine-tune the network, and the preprocess uses the resize based on Nearest sampling implemented in C++. After fine-tuning, the network is adapted to Nearest sampling

DataSet

The Project provides a test sample that can also be used for inference on the nuScenes dataset. When testing on the nuScenes dataset, you need to use the data_infos folder provided by this project. The data folder should have the following structure:

└── data
    ├── nuscenes
        ├── data_infos
            ├── samples_infos
                ├── sample0000.yaml
                ├── sample0001.yaml
                ├── ...
            ├── samples_info.yaml
            ├── time_sequence.yaml
        ├── samples
        ├── sweeps
        ├── ...

the data_infos folder can be downloaded from Google drive or Baidu Netdisk

Environment

For desktop or server：

CUDA 11.8
cuDNN 8.6.0
TensorRT 8.5.2.2
yaml-cpp
Eigen3
libjpeg

For Jetson AGX Orin

Jetpack 5.1.1
CUDA 11.4.315
cuDNN 8.6.0
TensorRT 8.5.2.2
yaml-cpp
Eigen3
libjpeg

Compile && Run

Please use the ONNX file provided by this project to generate the TRT engine based on the script:

python tools/export_engine.py cfgs/bevdet_lt_depth.yaml model/img_stage_lt_d.onnx model/bev_stage_lt_d.engine --postfix="_lt_d_fp16" --fp16=True

ONNX files, cound be downloaded from Baidu Netdisk or Google Drive

mkdir build && cd build
cmake .. && make
./bevdemo ../configure.yaml

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
cfgs		cfgs
doc		doc
include		include
sample0		sample0
src		src
tools		tools
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
configure.yaml		configure.yaml
demo_bevdet.cpp		demo_bevdet.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BEVDet implemented by TensorRT, C++

Results && Speed

Inference Speed

Results

DataSet

Environment

Compile && Run

References

About

Releases

Packages

Languages

LCH1238/bevdet-tensorrt-cpp

Folders and files

Latest commit

History

Repository files navigation

BEVDet implemented by TensorRT, C++

Results && Speed

Inference Speed

Results

DataSet

Environment

Compile && Run

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages