Implementation of algorithms like CNN, Vision Transformers, VAE, GAN, Diffusion .... for image data

khetansarvesh/CV

********** Digital Image Preprocessing **********

Here the aim is to enhance the quality of the images in the dataset we are working with. For a comprehensive list of datasets, refer to this LinkedIn post by me. The following are a few approaches that can be used :

  1. Normalizing Image : Convert pixel values from 0-255 to 0-1 range
  2. Illumination Correction : Various methods available, choose based on project requirements and previous research
  3. Handling Noise
  4. Duplicate Image Removal : Check whether there are duplicate images (both exact and near duplicates) and remove any that are found
  5. Remove blurry images
  6. Remove images with unusual aspect ratios
  7. Resize images with odd dimensions : Check whether any images have odd sizes and, if so, resize them
  8. Remove low-information images
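
As a rough illustration of several checks in the list above (normalization, exact-duplicate removal, aspect-ratio filtering, low-information filtering and blur filtering), here is a minimal sketch in plain Python. It assumes grayscale images stored as nested lists of 8-bit integers and at least 3x3 in size. The function names and all thresholds are hypothetical and would need tuning per dataset. Near-duplicate detection is not covered.

```python
import hashlib
from statistics import pvariance


def normalize(image):
    """Scale 8-bit pixel values (0-255) to the 0-1 range."""
    return [[px / 255.0 for px in row] for row in image]


def content_hash(image):
    """Hash the shape and raw pixels; equal hashes flag exact duplicates only."""
    header = f"{len(image)}x{len(image[0])}:".encode()
    flat = bytes(px for row in image for px in row)
    return hashlib.sha256(header + flat).hexdigest()


def aspect_ratio(image):
    """Width divided by height."""
    return len(image[0]) / len(image)


def pixel_variance(image):
    """Variance of raw intensities; a value near zero means low information."""
    return pvariance([px for row in image for px in row])


def laplacian_variance(image):
    """Variance of a 4-neighbour Laplacian; low values suggest a blurry image."""
    h, w = len(image), len(image[0])
    responses = [
        image[y - 1][x] + image[y + 1][x] + image[y][x - 1] + image[y][x + 1]
        - 4 * image[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def filter_dataset(images, max_ratio=3.0, info_thresh=1.0, blur_thresh=10.0):
    """Return indices of images that pass every check (thresholds are assumptions)."""
    seen, kept = set(), []
    for idx, img in enumerate(images):
        digest = content_hash(img)
        ratio = aspect_ratio(img)
        if digest in seen:
            continue  # exact duplicate of an image we already kept
        if ratio > max_ratio or ratio < 1 / max_ratio:
            continue  # unusual aspect ratio
        if pixel_variance(img) < info_thresh:
            continue  # almost uniform image, little information
        if laplacian_variance(img) < blur_thresh:
            continue  # few sharp edges, likely blurry
        seen.add(digest)
        kept.append(idx)
    return kept


checker = [[0, 255, 0, 255], [255, 0, 255, 0], [0, 255, 0, 255], [255, 0, 255, 0]]
flat = [[128] * 4 for _ in range(4)]
strip = [[0, 255, 0, 255, 0, 255, 0, 255]]
kept = filter_dataset([checker, [row[:] for row in checker], flat, strip])
```

In this toy run only the first checkerboard survives: the copy is an exact duplicate, the flat image carries no information, and the strip has an extreme aspect ratio.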

****** Representation Learning / Tokenization ******

********** Non-Generative Downstream CV Tasks (Supervised CV) **********

Note : In NLP we already saw when to use freeze learning / partial finetuning / complete finetuning. The same logic applies here.

  1. Object Classification

    • Say we have a small dataset of images on which we need to perform object classification task.
    • Since the majority of foundation models are trained on this task, we don't need to finetune them. Instead, we just extract the last-layer representation from the foundation model.
    • We then pass this representation to any classification model such as KNN / Logistic Regression / Naive Bayes / FLDA / SVM / Decision Tree / … / Neural Network to perform the classification.
    • Here is a code implementation using Neural Network!!
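
The idea of classifying frozen foundation-model features can be sketched with a tiny KNN classifier, one of the options listed above. This is not the linked implementation: the two-dimensional feature vectors below are made-up stand-ins for real last-layer embeddings, and `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter


def knn_predict(train_feats, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest feature vectors."""
    dists = sorted(
        (math.dist(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-D features standing in for foundation-model embeddings.
train_feats = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

pred_a = knn_predict(train_feats, train_labels, [0.1, 0.1])
pred_b = knn_predict(train_feats, train_labels, [5.0, 5.0])
```

No foundation-model weights are updated here. Only the small classifier on top looks at the labels.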
  2. Object Localization / Image Classification + Localization Task

  3. Object Detection / Recognition Task

  4. Semantic Segmentation Task

  5. Instance Segmentation / Simultaneous Detection and Segmentation (SDS) Task :

    • This is an advanced form of the object detection task: in object detection you only draw bounding boxes, but here you trace the exact outline of each object
    • Check out this image here
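
To make the contrast between a bounding box and an exact outline concrete, here is a small sketch that derives a box from a binary instance mask and compares the two areas. The mask is a made-up example and `mask_to_box` is a hypothetical helper, not part of any segmentation model.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the tightest box around a binary mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up instance mask: 1 marks pixels that belong to the object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

x0, y0, x1, y1 = mask_to_box(mask)
box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
mask_area = sum(sum(row) for row in mask)
```

The box covers 9 pixels while the object itself covers only 5. That gap is the extra detail an instance mask provides over a detection box.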
  6. Single Human Key-Point Detection / Pose Estimation Task :

    • Problem Statement : Represent a person by K points. A common choice is K = 14, covering the major body joints. Refer this image here.
    • Solution Architecture :
      1. This is very similar to the Image Classification architecture, except that the output layer has 14*2 neurons representing the (x, y) coordinates of the 14 joints. Refer architecture diagram here
      2. Since this is a regression problem rather than a classification problem, the loss function is a regression loss, namely the L2 loss
    • If you change the dataset, the same approach can also be used for similar tasks like Face Mesh Detection / Hand Detection
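
The output layer and loss described above can be sketched as follows. The interleaved layout of the 28 outputs (x0, y0, x1, y1, ...) and the averaging over joints are assumptions made for illustration. A real network would produce `outputs` from an image.

```python
K = 14  # number of key points per person


def to_keypoints(outputs):
    """Reshape a flat vector of 2*K network outputs into K (x, y) pairs."""
    if len(outputs) != 2 * K:
        raise ValueError(f"expected {2 * K} values, got {len(outputs)}")
    return [(outputs[2 * i], outputs[2 * i + 1]) for i in range(K)]


def l2_loss(pred, target):
    """Mean squared Euclidean distance between predicted and true joints."""
    total = sum(
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred, target)
    )
    return total / len(target)


# Stand-in for the 28 raw outputs of the regression head.
outputs = [float(i) for i in range(2 * K)]
pred = to_keypoints(outputs)

# Made-up ground truth: every joint is off by (3, 4) pixels.
target = [(x + 3.0, y + 4.0) for x, y in pred]
loss = l2_loss(pred, target)
```

Each joint is off by a distance of 5 pixels, so the mean squared error is 25.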
  7. Multiple Human Key Point Detection Task :

    • Same as the single human key point detection problem above, except that here the image contains multiple persons and key points must be detected for each one.
    • At test time you do not know how many people will be in the image; there could be 5 / 10 / 1000. We saw a similar issue in object detection tasks.
    • Hence, use the same strategy as in the Object Detection task to solve this issue here.
  8. Panoptic Segmentation Task

  9. Optical Character Recognition (OCR) Task

    • Problem Statement : Identify the text present in the image, as shown in image here

    • Solution :

      a. Step 1 : Text Bounding Box Detection => Identify the regions of the image that contain text using an object detection model. Refer image here

      b. Step 2 : Character Segmentation => For each text region, segment out the individual characters in that text. Refer image here

      c. Step 3 : Character Classification => For each segmented character, run a character classification model. Refer image here
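
As one possible way to picture Step 2, here is a sketch that splits a binarized text line into characters wherever a column contains no ink (a vertical projection profile). This technique is an assumption chosen for illustration, since the steps above do not say which segmentation method is used. It only works when characters are separated by blank columns.

```python
def segment_characters(binary_line):
    """Split a binarized text line into half-open column spans, one per character.

    binary_line is a list of rows of 0/1 values, where 1 marks ink.
    """
    width = len(binary_line[0])
    has_ink = [any(row[x] for row in binary_line) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x  # a character begins at the first inked column
        elif not ink and start is not None:
            spans.append((start, x))  # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def crop(binary_line, span):
    """Cut one character out of the line so it can go to the classifier."""
    start, end = span
    return [row[start:end] for row in binary_line]


# Made-up text region with three "characters" separated by blank columns.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
]
spans = segment_characters(line)
first_char = crop(line, spans[0])
```

Each cropped character would then be passed to the character classification model in Step 3.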

********** Generative Downstream CV Tasks (UnSupervised CV) **********

  1. Style Transfer [Image2Image Translation]
  2. Depth Estimation
    • It is a difficult task because it requires the model to understand 3D structure using only 2D images. There are two ways to solve this.
    • Non-Learning Based Method: This is the older approach and is used less often now. It requires a two-camera setup, called a stereo pair, and the method is called stereo vision.
    • Learning-Based Method: This method is SOTA and requires only a single camera, hence it is called monocular depth estimation (MDE). Many models have been proposed, but the few best models are

********** Reinforcement Learning (RL) for Images **********

Task : Playing the Atari Game Pong

Method 1 : Deep Reinforcement Learning

There are essentially 3 ways to use Neural Networks for Reinforcement Learning :

  • Q Learning / Value Learning by Google DeepMind : Seeing the RL problem as a Regression Problem
  • [Better Approach] Policy Learning : Seeing the RL problem as a Classification Problem
  • Actor-Critic : Combining both Q-Learning and Policy Learning
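
The regression and classification views above can be sketched through their loss functions. This is a minimal illustration with hand-written numbers, assuming a discount factor `gamma` and a two-action game (for example UP and DOWN in Pong). In practice the Q-values and logits would come from a neural network.

```python
import math


def q_learning_loss(q_sa, reward, next_q_values, gamma=0.99, done=False):
    """Regression view: squared error between Q(s, a) and the TD target."""
    target = reward if done else reward + gamma * max(next_q_values)
    return (q_sa - target) ** 2


def softmax(logits):
    """Turn raw action scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_loss(logits, action, ret):
    """Classification view: cross-entropy on the taken action, weighted by return."""
    probs = softmax(logits)
    return -ret * math.log(probs[action])


# Q-learning: predicted Q(s, a) = 1.0, reward 1.0, best next Q-value 2.0.
q_loss = q_learning_loss(1.0, 1.0, [0.0, 2.0], gamma=0.5)

# Policy learning: two equally likely actions, action 0 was taken, return 1.0.
pg_loss = policy_gradient_loss([0.0, 0.0], action=0, ret=1.0)
```

An actor-critic method would use both pieces: a value estimate like the one in `q_learning_loss` replaces the raw return `ret` when weighting the policy loss.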

Method 2 : Traditional Reinforcement Learning

Above we saw implementations using Neural Networks, but earlier people modeled these problems as MDPs (Markov Decision Processes) and solved them without Neural Networks. Since those methods did not scale, Neural Networks became prominent. You can understand this here in this video very clearly.
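
For intuition about the traditional approach, here is a sketch of value iteration, one classic way to solve a small MDP without Neural Networks. The three-state MDP, the `transitions` format and the function name are all made up for illustration. The table stores one value per state, which hints at why this does not scale to games whose states are raw screen pixels.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Compute optimal state values for a small, fully known MDP.

    transitions[state][action] is a list of (prob, next_state, reward) tuples.
    A state with no actions is treated as terminal.
    """
    values = {state: 0.0 for state in transitions}
    while True:
        delta = 0.0
        for state, actions in transitions.items():
            if not actions:
                continue  # terminal state keeps value 0
            best = max(
                sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


# Made-up MDP: reaching "goal" from "mid" pays a reward of 1.
transitions = {
    "start": {"go": [(1.0, "mid", 0.0)], "stay": [(1.0, "start", 0.0)]},
    "mid": {"go": [(1.0, "goal", 1.0)]},
    "goal": {},
}
values = value_iteration(transitions)
```

Here "mid" is worth 1.0 and "start" is worth 0.9, because the reward is one step further away and is discounted by `gamma`.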

But if you still want to learn more about how to use RL with MDPs, I would recommend the following courses (watch in the given sequence)
