
Anchor free object detection from camera images using a segmentation like architecture


UditBhaskar19/ANCHOR_FREE_OBJECT_DETECTOR_FOR_CAMERA


Anchor Free Object Detection


Code
Project Documents
Output Videos

Introduction

This project is about the development of an anchor-free 2D object detection model using PyTorch, and aims to provide a comprehensive guide for enthusiasts, researchers, and practitioners. The object detection model is trained from scratch, except for an ImageNet pre-trained backbone from PyTorch. The model is trained on a modest system configuration (NVIDIA RTX A2000 4 GB Laptop GPU), which shows that users with limited computational resources can train object detection models that give reasonably good performance. The codebase is designed to be easy to understand and extend. The key highlights are:

  • Training a 2D object detection model in PyTorch from scratch, using a backbone pre-trained on the ImageNet dataset from PyTorch.
  • Development of an easy-to-understand and well-documented codebase.
  • Implementation of a method for tuning the detection threshold parameters.
  • Utilizing training samples from two publicly available datasets, KITTI and BDD, to demonstrate a technique for merging samples from multiple training datasets, so that users can draw on a diverse range of training data for better model generalization.

Anchor Free Network Architecture.


Detected Bounding Boxes (BDD).


Detections in video (KITTI).





About The Project

Requirements

opencv_python>=4.8.0.74
imageio>=2.34.0
matplotlib>=3.7.2
numpy>=1.25.0
torch>=2.0.1
torchvision>=0.15.2
tqdm>=4.66.1

How to run the project

git clone https://github.com/UditBhaskar19/ANCHOR_FREE_OBJECT_DETECTOR_FOR_CAMERA
cd ANCHOR_FREE_OBJECT_DETECTOR_FOR_CAMERA/AnchorFree2DObjectDetection

# to run inference on bdd video frames
python video_inference_bdd.py

# to run inference on kitti video frames
python video_inference_kitti.py

# to create the labels file
python script1_create_dataset.py

# to train the model use script3_train_model.ipynb 

# to write detections to video
cd write_detections_to_video
python write_detection_to_video_bdd.py

Project Folder Structure

AnchorFree2DObjectDetection
│───doc                          # Project documents
│───hyperparam                   # statistics of the bounding box offsets
│───labels                       # aggregated GT labels of the KITTI and BDD datasets
│───mAP                          # module to compute mAP ( https://github.com/Cartucho/mAP.git )
│───model_weights                # model weights data after training
│───tensorboard                  # data folder for loss visualization in tensorboard.
│───modules                      # main modules 
      │───augmentation           # scripts for image augmentation functions            
      │───dataset_utils          # scripts for data analysis and dataset generation
      │───evaluation             # scripts for detector evaluation and threshold determination   
      │───first_stage            # scripts for defining the model and ground truth generation function for dense object detection
      │───hyperparam             # scripts for computing the bounding box offsets statistics from training data    
      │───loss                   # loss functions
      │───neural_net             # scripts for defining various neural net blocks             
            │───backbone               # model backbone blocks
            │───bifpn                  # BIFPN blocks for model neck            
            │───fpn                    # FPN blocks for model neck
            │───head                   # blocks for model head            
            │   common.py              # common model building blocks
            │   constants.py           # constants for model construction  
      │───plot                   # contains plotting functions
      │───pretrained             # scripts for loading the pre-trained backbone from pytorch            
      │───proposal               # scripts for proposal generation
      │───second-stage           # <work in progress> scripts for defining the model and ground truth generation function for second stage object detection
│───tests                                    # folder for testing and validation scripts
│───video_inference                          # detection results saved as video
│───write_detections_to_video                # scripts to save detections as video, results are saved in 'video_inference' folder
│   config_dataset.py                        # parameters and constants for dataset 
│   config_neuralnet_stage1.py               # model design parameters
│   script1_create_datasets.py               # aggregate GT labels and save them inside the 'labels' folder
│   script2_gen_hyperparam.py                # aggregate and save the box offsets and their statistics inside the 'hyperparam' folder
│   script3_train_model.ipynb                # notebook to train the model 
│   script4_inference_bdd.ipynb              # run inference on the bdd dataset images
│   script4_inference_kitti.ipynb            # run inference on the kitti dataset images      
│   script5_compute_mAP_bdd.ipynb            # compute mean average precision (mAP) on the bdd dataset
│   script5_compute_mAP_kitti.ipynb          # compute mean average precision (mAP) on the kitti dataset
│   video_inference_bdd.py                   # run inference on the bdd dataset video
│   video_inference_kitti.py                 # run inference on the kitti dataset frame sequence video               



Exploratory Data Analysis

To get good performance from a trained object detection model, the training dataset needs to be large, diverse, and balanced, and the annotations have to be correct. The BDD dataset is large enough to train a reasonably well-performing model. The analyses below were conducted to assess the quality of the dataset, where good quality means that the training data is diverse and balanced.

Scene and Label Instance

Number of instances of different classes and scenes.


Observations

  • There is a large intra-class as well as inter-class imbalance in the dataset (depending on how intra-class and inter-class are defined).
  • Intra-class imbalance is present in the traffic light instances: there are far fewer yellow traffic lights, while the red and green instances are reasonably balanced.
  • Intra-class imbalance is also observed among road vehicles: the car class has far more instances than other classes such as truck and bus.
  • Inter-class imbalance can be seen between vehicles and non-vehicles: the car class has far more instances than other classes such as person, rider, and train.
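One common way to counter this kind of imbalance is to weight the classification loss per class. The class-weighted cross-entropy loss listed under Loss Functions is an example of this. The README does not state how those weights are derived, so the sketch below assumes inverse-frequency weighting. The weights are rescaled so that they average to 1. The label counts in the example are hypothetical placeholders, not BDD statistics.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weight proportional to 1 / (number of instances),
    rescaled so the mean weight over classes is 1.0.
    Assumed weighting scheme, not taken from the repository."""
    counts = Counter(labels)
    raw = {cls: 1.0 / n for cls, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {cls: w * scale for cls, w in raw.items()}

# Hypothetical labels: 8 vehicle instances, 2 person instances
weights = inverse_frequency_weights(["vehicle"] * 8 + ["person"] * 2)
print(weights)
```

The rarer class receives the larger weight (here 4x the vehicle weight). As a result, misclassified person instances contribute more to the loss.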


Bounding box distribution

Annotated bounding box dimension scatter plot.


Observations

  • From the plot, we can observe that some boxes are probably incorrect annotations: they either have an extreme aspect ratio or a very small area.


Wrong annotations

If we select the boxes from the previous scatter plot that have an extreme aspect ratio or a very small area, we can identify annotation errors. Some of them can be categorized as follows.

  • Box area too small

  • Extreme Box Aspect Ratio

  • Incorrect Class


Dataset Modification

Based on the above analysis, the training samples and the dataset annotations are modified to:

  • Simplify the development of the object detection model in version 1 by reducing the number of classes and removing highly imbalanced and irrelevant classes.
  • Reduce the number of wrong and low-quality annotations.

The modifications are as follows:

  • Car, bus, and truck are merged into vehicle; person and rider are merged into person. The remaining classes are treated as part of the negative class.

  • Only boxes that satisfy the following conditions are kept:
    • Box width ≥ 5 pixels
    • Box height ≥ 5 pixels
    • 0.1 ≤ box aspect ratio ≤ 10
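The class merging and box filtering rules above can be sketched in plain Python as follows. Several details are assumptions rather than the repository's code:

- The BDD category strings are taken from the README's wording.
- The aspect ratio is assumed to be width / height.
- Boxes of unmapped (negative) classes are simply dropped, so they only contribute background.

The function names are hypothetical.

```python
# Hypothetical remapping following the merging rules above.
CLASS_REMAP = {
    "car": "vehicle", "bus": "vehicle", "truck": "vehicle",
    "person": "person", "rider": "person",
}

MIN_SIZE_PX = 5.0
MIN_ASPECT, MAX_ASPECT = 0.1, 10.0

def keep_box(x1, y1, x2, y2):
    """Size and aspect-ratio filter; aspect ratio is assumed to be width / height."""
    w, h = x2 - x1, y2 - y1
    if w < MIN_SIZE_PX or h < MIN_SIZE_PX:
        return False  # area too small (also guarantees h > 0 below)
    return MIN_ASPECT <= w / h <= MAX_ASPECT

def remap_and_filter(annotations):
    """annotations: list of (category, (x1, y1, x2, y2)) in pixels.
    Returns (merged_category, box) for boxes of mapped classes that pass the
    filter; boxes of the remaining (negative) classes are dropped."""
    out = []
    for category, box in annotations:
        merged = CLASS_REMAP.get(category)
        if merged is not None and keep_box(*box):
            out.append((merged, box))
    return out

anns = [
    ("car", (0, 0, 50, 30)),            # kept as vehicle
    ("rider", (10, 10, 12, 40)),        # width 2 px -> removed
    ("traffic light", (0, 0, 10, 20)),  # negative class -> removed
    ("bus", (0, 0, 200, 10)),           # aspect ratio 20 -> removed
]
print(remap_and_filter(anns))
```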


Relevant Scripts (BDD)

SCRIPT                                            LINK
1_1_eda_vis_anno_data.ipynb                       Link
1_2_eda_plot_label_count_distrib.ipynb            Link
1_3_eda_bbox_distrib.ipynb                        Link
1_4_eda_vis_different_obj_categories.ipynb        Link
1_5_eda_identifying_anno_errors.ipynb             Link
2_1_eda_vis_remapped_anno_data.ipynb              Link
2_2_eda_plot_remapped_label_count_distrib.ipynb   Link
2_3_eda_remapped_bbox_distrib.ipynb               Link
2_4_eda_vis_remapped_obj_categories.ipynb         Link
2_5_eda_identifying_outliers.ipynb                Link

Relevant Scripts (KITTI)

SCRIPT                                        LINK
eda_identifying_outliers.ipynb                Link
eda_plot_remapped_label_count_distrib.ipynb   Link
eda_remapped_bbox_distrib.ipynb               Link


Model Architecture

Concept Level Architecture


Backbone for Feature Computation


Neck for Feature Aggregation



Head for Dense Object Detection

Architecture Summary




Ground Truth Generation

Each anchor point (a location on the feature map) corresponds to an object hypothesis. For each one, the network learns to predict four quantities from the image: box offsets, a centerness score, an objectness score, and a classification score. The ground truth for training is computed as follows.
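The exact target definitions are given in the figures below. As a rough reference, here is a sketch of the widely used FCOS-style formulation for a location (px, py) inside a ground-truth box:

- The box offsets are the distances to the left, top, right and bottom edges.
- The centerness is sqrt(min(l, r) / max(l, r) · min(t, b) / max(t, b)).

Whether this repository uses exactly these definitions is an assumption. The `normalize` helper is also hypothetical: it illustrates how the offset statistics saved in the 'hyperparam' folder might be used to standardize the regression targets.

```python
import math

def ltrb_offsets(px, py, box):
    """Distances from (px, py) to the left, top, right and bottom edges of
    box = (x1, y1, x2, y2). All four are positive only if the point lies
    strictly inside the box."""
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)

def centerness(l, t, r, b):
    """FCOS-style centerness target in [0, 1]: 1.0 at the box centre,
    decreasing towards the edges. Assumes l, t, r, b > 0."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def normalize(offsets, mean, std):
    """Hypothetical standardization of offsets with dataset statistics."""
    return tuple((o - m) / s for o, m, s in zip(offsets, mean, std))

box = (0, 0, 100, 50)
print(centerness(*ltrb_offsets(50, 25, box)))  # location at the box centre
print(centerness(*ltrb_offsets(25, 25, box)))  # location left of centre
```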

Bounding Box Offsets


Centerness Score

Objectness and Object Class



Training

Augmentation

Augmentation is performed during training. The augmentation process is depicted as follows.



Loss Functions

TASK                          LOSS FUNCTION
Class Prediction              Class-Weighted Cross-Entropy Loss
Objectness Prediction         Focal Loss
Box Offset Regression         Smooth L1 Loss
Centerness Score Regression   Binary Cross-Entropy Loss
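The table names the losses but not their hyperparameters. The scalar sketch below writes out the standard definitions for the objectness, box-offset and centerness terms, with these assumed values:

- α = 0.25 and γ = 2, the defaults from the focal loss paper.
- β = 1, PyTorch's SmoothL1Loss default.

These values are not taken from the repository. In real training code these would be vectorized, for example with `torchvision.ops.sigmoid_focal_loss` and `torch.nn.functional.smooth_l1_loss`, which also handle numerical stability better than this plain `math.log` version.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one objectness logit; target is 0 or 1.
    (1 - p_t) ** gamma down-weights easy, well-classified examples."""
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one box-offset component: quadratic near zero,
    linear for large errors."""
    diff = abs(pred - target)
    return 0.5 * diff ** 2 / beta if diff < beta else diff - 0.5 * beta

def bce(logit, target):
    """Binary cross-entropy with a soft target in [0, 1], e.g. centerness."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

print(focal_loss(0.0, 1), smooth_l1(0.5, 0.0), bce(0.0, 0.5))
```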

Optimization Method

Either SGD with momentum or the AdamW optimization method can be used. Refer to these scripts for more details:

SCRIPT                           LINK
set_parameters_for_training.py   Link
script3_train_model.ipynb        Link
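A minimal sketch of switching between the two optimizers in PyTorch. The helper name `build_optimizer` is hypothetical. The learning rates and weight decay values are placeholders, not the settings in set_parameters_for_training.py.

```python
import torch
from torch import nn

def build_optimizer(model: nn.Module, name: str = "adamw") -> torch.optim.Optimizer:
    """Hypothetical helper; hyperparameter values are placeholders."""
    # Only optimize trainable parameters, so a frozen backbone is skipped.
    params = [p for p in model.parameters() if p.requires_grad]
    if name == "sgd":
        return torch.optim.SGD(params, lr=1e-2, momentum=0.9, weight_decay=1e-4)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-2)
    raise ValueError(f"unknown optimizer: {name!r}")

# Toy model: freeze the first layer to mimic a frozen pre-trained backbone.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.Conv2d(8, 4, kernel_size=1))
for p in model[0].parameters():
    p.requires_grad_(False)

optimizer = build_optimizer(model, "sgd")
```

Filtering on `requires_grad` keeps frozen weights out of the optimizer. This matters if the backbone is not fine-tuned, which the Conclusion's mention of fine-tuning the backbone as a future improvement suggests.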



Performance Evaluation

BDD Dataset


Detection Rate vs False Positives per Image (ROC Curve)


Recall vs Precision (PR Curve)


Comparing performance for the Vehicle and Person classes


Result Visualization

Vehicle
Detection Threshold   Precision (%)   Recall (%)   mAP@0.5 (%)
0.4                   62.74           79.77        76.50
0.5                   80.15           75.06        73.11
0.6                   90.00           69.13        68.06
0.7                   95.58           61.21        60.70

Person
Detection Threshold   Precision (%)   Recall (%)   mAP@0.5 (%)
0.3                   44.70           65.42        56.41
0.4                   63.48           59.52        53.18
0.5                   77.08           50.68        46.92
0.6                   86.46           40.49        38.54

mAP at different detection thresholds (computed using Link)
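The tables show the precision/recall trade-off at each detection threshold, but they do not state how the operating threshold is chosen. As an illustration only, and not necessarily what bdd_score_tuning.ipynb does, the sketch below picks the threshold that maximizes F1 = 2PR / (P + R). It uses the BDD precision and recall values from the tables above.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * precision * recall / (precision + recall)

def best_threshold(rows):
    """rows: list of (threshold, precision_pct, recall_pct).
    Returns the threshold with the highest F1 score."""
    return max(rows, key=lambda row: f1(row[1], row[2]))[0]

# Values copied from the BDD tables above
bdd_vehicle = [(0.4, 62.74, 79.77), (0.5, 80.15, 75.06), (0.6, 90.00, 69.13), (0.7, 95.58, 61.21)]
bdd_person = [(0.3, 44.70, 65.42), (0.4, 63.48, 59.52), (0.5, 77.08, 50.68), (0.6, 86.46, 40.49)]

for name, rows in [("vehicle", bdd_vehicle), ("person", bdd_person)]:
    print(name, best_threshold(rows), [round(f1(p, r), 2) for _, p, r in rows])
```

With these numbers, the F1-optimal threshold is 0.6 for vehicle and 0.4 for person. An application that values recall over precision, or the reverse, would weight the two terms differently.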


SCRIPT                          LINK
bdd_score_tuning.ipynb          Link
bdd_nms_tuning.ipynb            Link
script5_compute_mAP_bdd.ipynb   Link

Relevant Scripts
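The NMS step tuned in bdd_nms_tuning.ipynb suppresses duplicate detections of the same object. For reference, here is a plain-Python greedy NMS showing what its IoU threshold controls. The README does not say whether the repository applies NMS per class or uses a different variant. In PyTorch code, `torchvision.ops.nms(boxes, scores, iou_threshold)` performs the same greedy suppression on tensors.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring remaining box, discard boxes that
    overlap it by more than iou_threshold, repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, 0.5), nms(boxes, scores, 0.7))
```

The first two boxes overlap with IoU of about 0.68. A 0.5 threshold therefore merges them, while a 0.7 threshold keeps both. This is the trade-off between duplicate detections and missed nearby objects that the tuning explores.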



KITTI Dataset


Detection Rate vs False Positives per Image (ROC Curve)


Recall vs Precision (PR Curve)


Comparing performance for the Vehicle and Person classes


Result Visualization

Vehicle
Detection Threshold   Precision (%)   Recall (%)   mAP@0.5 (%)
0.5                   79.24           89.71        88.03
0.6                   85.77           87.92        86.60
0.7                   91.15           85.62        84.56
0.8                   95.18           80.20        79.50

Person
Detection Threshold   Precision (%)   Recall (%)   mAP@0.5 (%)
0.4                   45.69           79.73        70.60
0.5                   57.61           75.63        68.62
0.6                   69.73           70.44        65.38
0.7                   81.84           62.53        59.50

mAP at different detection thresholds (computed using Link)
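The mAP@0.5 values come from the evaluation module listed in the folder structure (https://github.com/Cartucho/mAP.git). To show what that number summarizes, here is a sketch of a common per-class AP computation: all-point interpolated AP in the PASCAL VOC style. That it matches the tool's implementation line for line is an assumption. mAP@0.5 is the mean of this AP over classes, where a detection counts as a true positive if it matches a not-yet-matched ground-truth box with IoU ≥ 0.5.

```python
def average_precision(is_tp, num_gt):
    """All-point interpolated AP for one class.

    is_tp:  booleans for the detections sorted by descending score,
            True if the detection is a true positive at IoU >= 0.5.
    num_gt: number of ground-truth boxes of this class (must be > 0).
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise PR curve, summed where recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

print(average_precision([True, False, True], num_gt=2))
```

Detections scoring below the detection threshold are discarded before evaluation, which cuts off the tail of the PR curve. This is consistent with mAP@0.5 decreasing as the threshold increases in the tables above.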


SCRIPT                            LINK
kitti_score_tuning.ipynb          Link
kitti_nms_tuning.ipynb            Link
script5_compute_mAP_kitti.ipynb   Link

Relevant Scripts



Conclusion

  • The person class suffers from low recall due to its much smaller number of training samples.
  • The basic building block of the model is a weight-standardized Conv2d followed by GroupNorm and a non-linear activation. Because GroupNorm does not rely on batch statistics, this made it possible to use a small batch size (6 in this case) that fits in GPU memory. It also helps keep the training stable (no NaNs).
  • There are several ways to improve the performance, for example: fine-tuning the backbone, utilizing other open-source datasets, adding a second stage to improve recall, and training the model end to end for other tasks such as segmentation and tracking. These will be part of future releases.
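A minimal PyTorch sketch of the building block described above. GroupNorm normalizes each sample over groups of channels, so its output does not depend on the batch size. Weight standardization rescales every convolution filter to zero mean and unit variance, and it is commonly paired with GroupNorm for small-batch training. The class and function names, the group count (8) and the SiLU activation are assumptions; the README does not name them.

```python
import torch
import torch.nn.functional as F
from torch import nn

class WSConv2d(nn.Conv2d):
    """Conv2d with weight standardization: each output filter is shifted to
    zero mean and scaled to unit variance before the convolution."""

    def __init__(self, *args, eps: float = 1e-5, **kwargs):
        super().__init__(*args, **kwargs)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = ((w - mean) ** 2).mean(dim=(1, 2, 3), keepdim=True)
        w = (w - mean) / torch.sqrt(var + self.eps)
        return F.conv2d(x, w, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)

def ws_gn_act(in_ch: int, out_ch: int, num_groups: int = 8) -> nn.Sequential:
    """Hypothetical block: WS-Conv2d -> GroupNorm -> SiLU.
    num_groups must divide out_ch."""
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(num_groups, out_ch),
        nn.SiLU(),
    )

block = ws_gn_act(3, 16)
y = block(torch.randn(6, 3, 32, 32))  # batch of 6 images, as in the README
```

The block behaves the same for a batch of 1 as for a batch of 6. BatchNorm-based blocks, by contrast, become noisy when the batch is this small.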
