Anchor Free Object Detection

Introduction

This project develops an anchor-free 2D object detection model in PyTorch and aims to serve as a comprehensive guide for enthusiasts, researchers, and practitioners. The detection model is trained from scratch on top of an ImageNet pre-trained backbone from PyTorch. Training uses a modest system configuration (NVIDIA RTX A2000 4 GB Laptop GPU), so users with limited computational resources can train object detection models that perform reasonably well. The codebase is designed to be easy to understand and extend. The key highlights are:

- Training a 2D object detection model in PyTorch from scratch, using an ImageNet pre-trained backbone from PyTorch.
- Development of an easy-to-understand and well-documented codebase.
- Implementation of a method for tuning the detection threshold parameters.
- Use of training samples from two publicly available datasets, KITTI and BDD, which demonstrates a technique for merging samples from multiple training datasets so that users can draw on a diverse range of training data for model generalization.

Anchor Free Network Architecture.

Detected Bounding Boxes (BDD).

Detections in video (KITTI).

About The Project

Requirements

opencv_python>=4.8.0.74
imageio>=2.34.0
matplotlib>=3.7.2
numpy>=1.25.0
torch>=2.0.1
torchvision>=0.15.2
tqdm>=4.66.1

How to run the project

git clone https://github.com/UditBhaskar19/ANCHOR_FREE_OBJECT_DETECTOR_FOR_CAMERA
cd ANCHOR_FREE_OBJECT_DETECTOR_FOR_CAMERA/AnchorFree2DObjectDetection

# to run inference on bdd video frames
python video_inference_bdd.py

# to run inference on kitti video frames
python video_inference_kitti.py

# to create the labels file
python script1_create_datasets.py

# to train the model use script3_train_model.ipynb 

# to write detections to video
cd write_detections_to_video
python write_detection_to_video_bdd.py

Project Folder Structure

AnchorFree2DObjectDetection
│───doc                          # Project documents
│───hyperparam                   # Statistical data of the Bounding Box offsets
│───labels                       # aggregated GT labels data of KITTI and BDD dataset
│───mAP                          # module to compute mAP ( https://github.com/Cartucho/mAP.git )
│───model_weights                # model weights data after training
│───tensorboard                  # data folder for loss visualization in tensorboard.
│───modules                      # main modules 
      │───augmentation           # scripts for image augmentation functions            
      │───dataset_utils          # scripts for data analysis and dataset generation
      │───evaluation             # scripts for detector evaluation and threshold determination   
      │───first_stage            # scripts for defining the model and ground truth generation function for dense object detection
      │───hyperparam             # scripts for computing the bounding box offsets statistics from training data    
      │───loss                   # loss functions
      │───neural_net             # scripts for defining various neural net blocks             
            │───backbone               # model backbone blocks
            │───bifpn                  # BIFPN blocks for model neck            
            │───fpn                    # FPN blocks for model neck
            │───head                   # blocks for model head            
            │   common.py              # common model building blocks
            │   constants.py           # constants for model construction  
      │───plot                   # contains plotting functions
      │───pretrained             # scripts for loading the pre-trained backbone from pytorch            
      │───proposal               # scripts for proposal generation
      │───second-stage           # (work in progress) scripts for defining the model and ground truth generation function for second stage object detection
│───tests                                    # folder for testing and validation scripts
│───video_inference                          # detection results saved as video
│───write_detections_to_video                # scripts to save detections as video, results are saved in 'video_inference' folder
│   config_dataset.py                        # parameters and constants for dataset 
│   config_neuralnet_stage1.py               # model design parameters
│   script1_create_datasets.py               # aggregate gt labels and save it inside the 'labels' folder
│   script2_gen_hyperparam.py                # aggregate and save the box offsets and its statistics inside the 'hyperparam' folder
│   script3_train_model.ipynb                # notebook to train the model 
│   script4_inference_bdd.ipynb              # run inference on the bdd dataset images
│   script4_inference_kitti.ipynb            # run inference on the kitti dataset images      
│   script5_compute_mAP_bdd.ipynb            # compute mean average precision (mAP) on the bdd dataset
│   script5_compute_mAP_kitti.ipynb          # compute mean average precision (mAP) on the kitti dataset
│   video_inference_bdd.py                   # run inference on the bdd dataset video
│   video_inference_kitti.py                 # run inference on the kitti dataset frame sequence video

Exploratory Data Analysis

For a trained object detection model to perform well, the training dataset needs to be large, diverse, and balanced, and the annotations have to be correct. The BDD dataset is large enough to train a reasonably well-performing model. The analyses below give insight into the quality of the dataset, where good quality means that the training data is diverse and balanced.

Scene and Label Instance

Number of instances of different classes and scenes.

Observations

There is a large intra-class as well as inter-class imbalance in the dataset (depending on how intra-class and inter-class are defined).
Intra-class imbalance is present in the traffic light instances: there are far fewer yellow traffic lights, while the red and green instances are reasonably balanced.
Intra-class imbalance is also observed among road vehicles, where the car class has far more instances than other classes such as truck and bus.
Inter-class imbalance can be seen between vehicles and non-vehicles, where the car class has far more instances than classes such as person, rider, and train.

Bounding box distribution

Annotated bounding box dimension scatter plot.

Observations

From the plot we can observe that some boxes are probably incorrect annotations. These either have an extreme aspect ratio or a very small area.

Wrong annotations

Selecting the boxes from the previous scatter plot that have an extreme aspect ratio or a very small area helps to identify annotation errors. Some of them can be categorized as follows.

Box area too small
Extreme Box Aspect Ratio
Incorrect Class

Dataset Modification

Based on the above analysis, the training samples and the dataset annotations are modified to:

Simplify the development of the object detection model in version 1 by reducing the number of classes and removing the highly imbalanced and irrelevant classes.
Reduce the number of wrong and low-quality annotations.

The modifications are as follows:

Car, bus, and truck are merged as vehicle; person and rider are merged as person. The remaining classes are treated as the negative class.

Select boxes that satisfy the following conditions:
- Box width ≥ 5 pixels
- Box height ≥ 5 pixels
- 0.1 ≤ Box aspect ratio ≤ 10
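
The class merging and box selection rules above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the annotation layout (a dict with `label`, `x1`, `y1`, `x2`, `y2`), the function name `remap_and_filter`, and the choice of width / height as the aspect ratio are all assumptions. Because the bounds 0.1 and 10 are reciprocal, using height / width instead would select the same boxes.

```python
# Hypothetical class remapping that follows the merge rule described above.
CLASS_REMAP = {
    "car": "vehicle", "bus": "vehicle", "truck": "vehicle",
    "person": "person", "rider": "person",
}

MIN_SIDE_PX = 5.0    # minimum box width and height in pixels
MIN_ASPECT = 0.1     # lower bound on the aspect ratio
MAX_ASPECT = 10.0    # upper bound on the aspect ratio


def remap_and_filter(annotations):
    """Return the boxes that pass the size and aspect-ratio checks, with merged labels.

    Each annotation is assumed to be a dict with keys 'label', 'x1', 'y1', 'x2', 'y2'.
    """
    kept = []
    for anno in annotations:
        label = CLASS_REMAP.get(anno["label"])
        if label is None:
            continue  # remaining classes are treated as the negative class
        width = anno["x2"] - anno["x1"]
        height = anno["y2"] - anno["y1"]
        if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
            continue  # box too small
        aspect = width / height
        if not (MIN_ASPECT <= aspect <= MAX_ASPECT):
            continue  # extreme aspect ratio
        kept.append({**anno, "label": label})
    return kept


sample = [
    {"label": "car", "x1": 10, "y1": 20, "x2": 110, "y2": 80},        # kept, becomes vehicle
    {"label": "rider", "x1": 0, "y1": 0, "x2": 3, "y2": 40},          # dropped, width below 5
    {"label": "traffic sign", "x1": 0, "y1": 0, "x2": 30, "y2": 30},  # dropped, negative class
    {"label": "truck", "x1": 0, "y1": 0, "x2": 300, "y2": 10},        # dropped, aspect ratio 30
]
filtered = remap_and_filter(sample)
print(filtered)
```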

Relevant Scripts (BDD)

SCRIPT	LINK
1_1_eda_vis_anno_data.ipynb	Link
1_2_eda_plot_label_count_distrib.ipynb	Link
1_3_eda_bbox_distrib.ipynb	Link
1_4_eda_vis_different_obj_categories.ipynb	Link
1_5_eda_identifying_anno_errors.ipynb	Link
2_1_eda_vis_remapped_anno_data.ipynb	Link
2_2_eda_plot_remapped_label_count_distrib.ipynb	Link
2_3_eda_remapped_bbox_distrib.ipynb	Link
2_4_eda_vis_remapped_obj_categories.ipynb	Link
2_5_eda_identifying_outliers.ipynb	Link

Relevant Scripts (KITTI)

SCRIPT	LINK
eda_identifying_outliers.ipynb	Link
eda_plot_remapped_label_count_distrib.ipynb	Link
eda_remapped_bbox_distrib.ipynb	Link

Model Architecture

Concept Level Architecture

Backbone for Feature Computation

Neck for Feature Aggregation

Head for Dense Object Detection

Architecture Summary

Ground Truth Generation

Each anchor corresponds to an object hypothesis for which the network learns to predict four values from the image: box offsets, centerness score, objectness score, and classification score. The ground truth for training is computed as follows.
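
As a rough illustration of the box offset and centerness targets, the sketch below assumes the formulation from the FCOS paper listed in the references: offsets are the distances from a feature-map location to the four box edges, and centerness is the square root of the product of the min/max ratios along each axis. The function names are hypothetical, and any offset normalization the project applies (for example using the statistics stored in the `hyperparam` folder) is not shown.

```python
import math


def box_offsets(px, py, box):
    """Distances from location (px, py) to the left, top, right and bottom box edges."""
    x1, y1, x2, y2 = box
    return px - x1, py - y1, x2 - px, y2 - py


def centerness(left, top, right, bottom):
    """FCOS-style centerness in [0, 1]; equals 1 only at the box center."""
    if min(left, top, right, bottom) <= 0:
        return 0.0  # location lies outside the box or on its boundary
    ratio_x = min(left, right) / max(left, right)
    ratio_y = min(top, bottom) / max(top, bottom)
    return math.sqrt(ratio_x * ratio_y)


gt_box = (0.0, 0.0, 100.0, 50.0)           # (x1, y1, x2, y2) in pixels
center_offsets = box_offsets(50.0, 25.0, gt_box)
shifted_offsets = box_offsets(25.0, 25.0, gt_box)
print(center_offsets, centerness(*center_offsets))
print(shifted_offsets, centerness(*shifted_offsets))
```

Locations near the box center receive a centerness close to 1, while locations near an edge are down-weighted, which is how FCOS suppresses low-quality boxes predicted far from object centers.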

Bounding Box Offsets

Centerness Score

Objectness and Object Class

Training

Augmentation

Augmentation is performed during training. The augmentation process is depicted as follows

Loss Functions

TASK	LOSS FUNCTION
Class Prediction	Class Weighted Cross Entropy Loss
Objectness Prediction	Focal Loss
Box Offset Regression	Smooth L1 Loss
Centerness Score Regression	Binary Cross Entropy Loss
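
For reference, the four losses in the table can be written out per element as in the framework-free sketch below. It uses the standard textbook definitions, not the project's code: the function names are hypothetical, and the defaults `alpha=0.25`, `gamma=2.0` (from the focal loss paper) and `beta=1.0` are assumptions rather than values taken from this repository. In PyTorch, batched counterparts exist as `torch.nn.functional.cross_entropy` (with its `weight` argument), `torchvision.ops.sigmoid_focal_loss`, `torch.nn.functional.smooth_l1_loss`, and `torch.nn.functional.binary_cross_entropy`.

```python
import math

EPS = 1e-12  # guards against log(0)


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def weighted_cross_entropy(probs, target_idx, class_weights):
    """Cross entropy for one sample, scaled by the weight of its true class."""
    return -class_weights[target_idx] * math.log(_clamp(probs[target_idx]))


def binary_cross_entropy(p, target):
    """BCE between a predicted probability p and a target in [0, 1]."""
    p = _clamp(p)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights examples that are already well classified."""
    p = _clamp(p)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


print(weighted_cross_entropy([0.2, 0.8], 1, [1.0, 3.0]))
print(focal_loss(0.9, 1), binary_cross_entropy(0.9, 1))
print(smooth_l1(0.5, 0.0), smooth_l1(3.0, 0.0))
```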

Optimization Method

Either SGD with momentum or the AdamW optimization method can be used. Refer to these scripts for more details:

SCRIPT	LINK
set_parameters_for_training.py	Link
script3_train_model.ipynb	Link

Performance Evaluation

BDD Dataset

Detection Rate vs False Positives per image (ROC Curve)

Recall vs Precision (PR Curve)
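
A single point on such a curve could be computed as in the sketch below. It assumes that each detection has already been matched against the ground truth (for example at IoU 0.5) and is represented as a hypothetical `(score, is_true_positive)` pair; the matching step and the function name are not taken from the repository.

```python
def precision_recall(detections, num_gt, threshold):
    """Precision and recall when only detections with score >= threshold are kept.

    detections: list of (score, is_true_positive) pairs, already matched to ground truth.
    num_gt:     total number of ground-truth boxes.
    """
    kept = [is_tp for score, is_tp in detections if score >= threshold]
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 1.0  # convention when nothing is detected
    recall = true_pos / num_gt if num_gt else 0.0
    return precision, recall


detections = [(0.9, True), (0.8, True), (0.6, False), (0.55, True), (0.3, False)]
for thr in (0.5, 0.7):
    print(thr, precision_recall(detections, num_gt=4, threshold=thr))
```

Sweeping the threshold over a range of values and plotting the resulting pairs gives the curve; raising the threshold typically trades recall for precision.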

Comparing performance for Vehicle and Person class

Result

Visualization

mAP at different detection thresholds ( computed using Link )

Vehicle Detection Threshold	Precision (%)	Recall (%)	mAP@0.5 (%)
0.4	62.74%	79.77%	76.50%
0.5	80.15%	75.06%	73.11%
0.6	90%	69.13%	68.06%
0.7	95.58%	61.21%	60.70%

Person Detection Threshold	Precision (%)	Recall (%)	mAP@0.5 (%)
0.3	44.7%	65.42%	56.41%
0.4	63.48%	59.52%	53.18%
0.5	77.08%	50.68%	46.92%
0.6	86.46%	40.49%	38.54%

Relevant Scripts

SCRIPT	LINK
bdd_score_tuning.ipynb	Link
bdd_nms_tuning.ipynb	Link
script5_compute_mAP_bdd.ipynb	Link

KITTI Dataset

Detection Rate vs False Positives per image (ROC Curve)

Recall vs Precision (PR Curve)

Comparing performance for Vehicle and Person class

Result

Visualization

mAP at different detection thresholds ( computed using Link )

Vehicle Detection Threshold	Precision (%)	Recall (%)	mAP@0.5 (%)
0.5	79.24%	89.71%	88.03%
0.6	85.77%	87.92%	86.60%
0.7	91.15%	85.62%	84.56%
0.8	95.18%	80.20%	79.50%

Person Detection Threshold	Precision (%)	Recall (%)	mAP@0.5 (%)
0.4	45.69%	79.73%	70.60%
0.5	57.61%	75.63%	68.62%
0.6	69.73%	70.44%	65.38%
0.7	81.84%	62.53%	59.50%

Relevant Scripts

SCRIPT	LINK
kitti_score_tuning.ipynb	Link
kitti_nms_tuning.ipynb	Link
script5_compute_mAP_kitti.ipynb	Link

Conclusion

The person class suffers from low recall because it has far fewer training samples.
The basic building block of the model is a weight-standardized conv2d followed by group norm and a non-linear activation. This makes it possible to use a small batch size (6 in this case) that fits in GPU memory, and it also helps keep the training stable (no NaNs).
There are several ways to improve the performance, such as fine-tuning the backbone, utilizing additional open-source datasets, adding a second stage to improve recall, and training the model end to end for different tasks such as segmentation and tracking. These shall be part of future releases.
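
To make the weight standardization mentioned above concrete, the sketch below standardizes the weights of a single convolution filter (one output channel, flattened) to zero mean and unit variance. It follows the usual definition of weight standardization; the function name and the `eps` value are assumptions, and the repository's own layer is not reproduced here. In a PyTorch layer the same operation would be applied to each output channel of the convolution weight before the convolution, with a `torch.nn.GroupNorm` layer and the activation following it.

```python
import math


def standardize_filter(weights, eps=1e-5):
    """Return the filter weights shifted to zero mean and scaled to unit variance."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    std = math.sqrt(var + eps)  # eps avoids division by zero for constant filters
    return [(w - mean) / std for w in weights]


raw_filter = [1.0, 2.0, 3.0, 4.0]   # e.g. a flattened 1x2x2 kernel
ws_filter = standardize_filter(raw_filter)
ws_mean = sum(ws_filter) / len(ws_filter)
ws_var = sum((w - ws_mean) ** 2 for w in ws_filter) / len(ws_filter)
print(ws_filter, ws_mean, ws_var)
```

Because neither weight standardization nor group norm relies on batch statistics, the combination is commonly used when the batch size is too small for batch normalization to work well.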

Reference

BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

FCOS: A simple and strong anchor-free object detector

HybridNets: End-to-End Perception Network

https://www.cvlibs.net/datasets/kitti/
