Few-Shot Learning Summarization

This repository summarizes few-shot learning in computer vision, focusing on image classification, object detection, and object segmentation.

The main purpose of this list is to review and recap several exemplary models/methods/approaches to give an overview of Few-shot Learning in Computer Vision, focusing on the main approaches, learning methods, and model pipelines of these models. I understand that it is really difficult to recap a whole paper, but I will try my best to make it as easy to read as possible. For better intuition, please read the original papers and review the implementation code, which are linked in each review/recap section. If there are any mistakes or inaccurate information in the reviews, please feel free to inform me.

Currently, my priority is summarizing papers about few-shot image classification first, then few-shot learning papers in the areas of object detection and semantic/instance object segmentation later.

This repository will be updated frequently. You can also check my previous paper summarization list, Transformers4Vision, if you are interested.

Preface, Abbreviations, and Notations

Many equations/formulas in this review repository might differ from the original papers. This is not only because GitHub Markdown cannot render LaTeX math directly (there are several workarounds, but they are complex) but also because I wish to keep these formulas as simple as possible so that beginners (like me) can understand these papers.

Papers often use different notations and abbreviations, which can confuse the reader (some use K for the number of classes and N for the number of support examples, others the reverse). To keep the summaries uniform and easy to write and compare, I will use the following list of abbreviations and notations throughout my paper summaries. Symbols/notations specific to a particular paper will be defined in that paper's own recap section.

  • Abbreviations:
    • FSL: Few-shot learning
    • FLC: Few-shot classification
    • FLOD: Few-shot Object Detection
    • FLSS: Few-shot Semantic Segmentation
    • FLIS: Few-shot Instance Segmentation
    • N-way K-shot: few-shot learning on N classes with K examples for each. Described in detail in the following section.
  • Notations:
    • Be: training episodic batch
    • C: usually stands for Classifier
    • D_base: Base class data
    • D_novel: Novel unseen class data
    • F: usually stands for Feature extractor
    • G: weight generator
    • K: number of few-shot examples on novel classes/size of support set S per novel class
    • N, n: number of classes/list of classes
    • N_base : number/list of base classes
    • N_novel: number/list of novel classes
    • Q, q: query set for few-shot testing
    • S, s: support set for few-shot learning
    • Te: testing episodic batch
    • W, w: learnable weight for specific task
    • W_base: learnable weight for base classes
    • W_novel: learnable weight for novel classes
    • θ: usually the learnable parameters of C
    • φ: usually the learnable parameters of G

Basic concepts

Definition

  • Few-shot Learning is an example of meta-learning, where a learner is trained on several related tasks during the meta-training phase, so that it can generalize well to unseen (but related) tasks with just a few examples during the meta-testing phase.
    • In other words, Few-shot Learning aims to develop models that can learn to identify unseen (query) objects with just a few (support) examples.
    • This is why Few-Shot Learning (and meta-learning) is also known as a learning-to-learn method.
  • An effective approach to the Few-Shot Learning problem is to learn a global representation shared across tasks and train a task-specific classifier/detector on top of this representation for each task.

N-way K-shot setting

  • Suppose we have a dataset D that is split into two subsets D_base and D_novel with two disjoint class label sets N_base and N_novel. In few-shot learning, a model is trained to learn some prior or shared knowledge from D_base, then adapted to tasks on D_novel.
    • In other words, D_base is used for training the model in the meta-training phase, and D_novel is used for testing the model in the meta-testing phase.
  • For D_novel, we can split the data into 2 sets:
    • A support set S for learning: a small set that contains only K labeled samples for each of the N_novel classes
    • A query set Q for predicting: a small unlabeled set that shares the same N_novel classes
    • Our purpose is to classify the query images in Q into these N_novel classes based on the support images in S.
  • A Few-shot Learning setting whose support set S includes N classes with K samples each is called N-way K-shot
    • One-shot Learning is the setting with K = 1
    • For example, a 2-way 4-shot setting has a support set of 2 classes with 4 labeled samples each (8 support samples in total).
    • Typically, the prediction accuracy decreases as N increases, and increases as K increases
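
The split described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the dataset is a dict mapping each class label to a list of samples, and the helper names `split_base_novel` and `support_query_split` are hypothetical, not taken from any of the linked repositories.

```python
import random

def split_base_novel(data_by_class, n_novel, seed=0):
    """Split a {label: samples} dict into D_base and D_novel with disjoint classes."""
    rng = random.Random(seed)
    novel_labels = set(rng.sample(sorted(data_by_class), n_novel))
    d_base = {c: xs for c, xs in data_by_class.items() if c not in novel_labels}
    d_novel = {c: xs for c, xs in data_by_class.items() if c in novel_labels}
    return d_base, d_novel

def support_query_split(d_novel, k_shot):
    """Use the first K samples of each novel class as S and the rest as Q."""
    support = [(x, c) for c, xs in d_novel.items() for x in xs[:k_shot]]
    query = [(x, c) for c, xs in d_novel.items() for x in xs[k_shot:]]
    return support, query

# Toy data: 6 classes with 5 samples each (strings stand in for images).
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(5)] for c in range(6)}
d_base, d_novel = split_base_novel(toy, n_novel=2)
support, query = support_query_split(d_novel, k_shot=1)   # 2-way 1-shot
print(len(d_base), len(d_novel), len(support), len(query))  # 4 2 2 8
```

In practice the query labels are kept only for evaluation; the model sees the query samples without labels.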

Episodic learning

  • In episodic few-shot training, a training iteration is known as an episode. An episode is a single step that trains the network once: it computes the loss and backpropagates the error for one gradient descent step. An episode can also be called an episodic batch (Be).
  • An episode is defined by 3 variables {Nc, Ns, Nq} where Nc = N (the way) is the number of classes per episode, Ns = K (the shot) is the number of support images per class, and Nq is the number of query examples per class. Nc and Ns define the N-way K-shot setting.
    • During the training phase, {Nc, Ns, Nq} can be seen as a set of hyperparameters that control the batch/episode creation, and during the testing phase, {Nc, Ns, Nq} defines the problem setup.
  • The following example illustrates the episodic setting in Few-shot learning (in this case, 3-way 2-shot), with the data split into 2 sets for training and testing:
    • In the meta-training phase, each epoch (one pass of the model through the entire training set) consists of 2 episodic batches. Each episodic batch Be is defined as {Nc=3, Ns=2, Nq=3}: Nc = 3 training classes, Ns = 2 images per class for the Support set, and Nq = 3 images for the Query set.
    • In the meta-testing phase, the testing setup Te is {Nc=3, Ns=2, Nq=4}.
    • Note that the classes in meta-training and meta-testing are disjoint from each other.
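
As a rough illustration of how one episode {Nc, Ns, Nq} could be drawn, the sketch below samples Nc classes and then Ns support plus Nq query samples per class. It assumes the data is a plain dict from class label to a list of samples. The function name `sample_episode` and the toy data are hypothetical.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng=random):
    """Sample one N-way K-shot episode as (support, query) lists of (sample, label)."""
    classes = rng.sample(sorted(data_by_class), n_way)  # Nc classes for this episode
    support, query = [], []
    for label in classes:
        # Draw Ns + Nq distinct samples so support and query never overlap.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy dataset: 5 classes with 10 samples each (strings stand in for images).
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(5)}

# One 3-way 2-shot episode with 3 query samples per class.
support, query = sample_episode(toy_data, n_way=3, k_shot=2, n_query=3,
                                rng=random.Random(0))
print(len(support), len(query))  # 6 9
```

During meta-training the classes would be drawn from D_base, and during meta-testing from D_novel, so the two phases never share classes.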

Common approaches

  • Preface:
    • Rather than training to recognize specific objects in the training phase, we train the model to learn similarity and to classify/recognize the N_base classes with D_base(Q, S).
    • Then the model can use this knowledge to learn to recognize the unseen N_novel classes in D_novel(Q) based on the information provided by D_novel(S).
    • In few-shot learning, a meta-learner is trained to learn prior knowledge from D_base, and its parameters are then adapted to a specific task on D_novel using a base-learner.
  • There are four common approaches to tackle a few-shot learning problem, including:
    • Similarity-based approach: Which focuses on building an efficient similarity measure to identify the Query set Q given the labeled Support set S. The predicted probability for Q is a weighted sum of the labels of the S samples, where the weight is the similarity between S and Q. This approach is quite similar to nearest-neighbor-style algorithms (e.g., k-NN, k-means clustering).
    • Model-based approach: Which designs a model specifically for fast learning, with internal or external memory that helps store and update its parameters rapidly within a few training steps.
    • Initialization-based approach: Which focuses on finding a shared initialization learned from D_base that can adapt quickly to unseen tasks from D_novel, as the task-specific parameters are close to this global initialization. Then, the parameters θ learned from D_base can be fine-tuned into ϕ on the D_novel dataset for the specific task.
    • Optimization-based approach: Which focuses on optimizing the meta-learner parameters and the optimization algorithm so that the deep model becomes good at learning from only a few examples. The parameter θo is continuously refined and adjusted by the meta-learner based on the base-learner performance via a few effective gradient descent steps. Finally, a task-specific parameter ϕ is obtained.
      • Or in some methods, instead of using gradient descent, an optimizer can be learned to output the update of θo directly.

Basic Few-shot Learning Algorithms

Siamese Network

  • Paper: https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf
  • Siamese Network:
    • Siamese Network is a twin neural network architecture with two inputs and a distance output for few-shot/one-shot learning
    • The overall approach is to feed two images x1, x2 into the Siamese network, where each image is passed through the same CNN and encoded into a feature vector via the function f0
    • Then, L1-distance (or other metric such as Cosine) is calculated by |f0(x1) - f0(x2)|, then the L1-distance is converted into probability p by a linear Feedforward and a sigmoid activation:
      • p(x1, x2) = sigmoid(W.|f0(x1) - f0(x2)|)
    • Finally, the cross-entropy loss is calculated to learn if two images belong to the same class or not.
    • Siamese Network can be trained by using Triplet Loss or Contrastive loss to augment its performance.
  • Triplet loss:
    • Triplet loss works on a set of three images [Positive, Anchor, Negative], where the Anchor (A) and Positive (P) images belong to the same class, and the Negative (N) image belongs to a different class. Thus, the dissimilarity of (A, P) must be low and that of (A, N) must be high.
    • The Triplet Loss can be calculated by taking the A and comparing it with both P and N by the formula:
      • L = max(d[f(A), f(P)] - d[f(A), f(N)] + margin, 0)
        • Where d[f(A), f(N)] is the distance between the embedding features of A and N (and likewise for A and P), which can be L1, L2, Cosine, etc;
        • margin is a small number, similar to a bias value, which is used to "stretch" the distance difference between the similar and dissimilar pairs of the triplet.
    • We must minimize the loss, which pushes:
      • d[f(A), f(P)] −> 0
      • d[f(A), f(N)] >= d[f(A), f(P)] + margin.
      • This means the positive example will be close to the anchor and the negative examples must be far from it.
  • Contrastive loss:
    • The idea of contrastive loss is quite similar to Triplet Loss. However, it only uses a pair of (image) data, either in the same class or different classes.
    • Given a pair of images [A, B], contrastive loss takes the extracted features of A and calculates their distance to those of B. If A and B are in the same class, the loss minimizes the distance; otherwise, it pushes the distance up to at least the margin.
    • During training, A and B are fed to the model with their ground truth relationship Y. Y equals 1 if A and B are in the same class and 0 otherwise. Mathematically, the contrastive loss is calculated by the formula:
      • L = Y*D^2 + (1-Y)*max(margin - D, 0)^2
        Where D = d[f(A), f(B)] and the margin is used to "tighten" the constraint: if A and B are dissimilar, then their distance should be at least margin, or a contrastive loss will be incurred
  • Code: https://github.com/fangpin/siamese-pytorch
  • Reference: https://jdhao.github.io/2017/03/13/some_loss_and_explanations/
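
The three formulas in this section can be checked numerically with a small plain-Python sketch. It makes several simplifying assumptions: the embeddings f0(x) are given directly as lists of floats (the CNN is omitted), W is a plain weight vector, and the L2 distance is used as d. The function names are hypothetical and not taken from the linked code.

```python
import math

def siamese_probability(u, v, w):
    """p(x1, x2) = sigmoid(W . |f0(x1) - f0(x2)|) for precomputed embeddings u, v."""
    score = sum(wi * abs(a - b) for wi, a, b in zip(w, u, v))
    return 1.0 / (1.0 + math.exp(-score))

def l2_distance(u, v):
    """Euclidean distance, used here as the distance function d."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """L = max(d[f(A), f(P)] - d[f(A), f(N)] + margin, 0)."""
    return max(l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin, 0.0)

def contrastive_loss(a, b, same_class, margin=1.0):
    """L = Y*D^2 + (1-Y)*max(margin - D, 0)^2, with Y = 1 for a same-class pair."""
    d = l2_distance(a, b)
    y = 1.0 if same_class else 0.0
    return y * d ** 2 + (1.0 - y) * max(margin - d, 0.0) ** 2

anchor, positive = [0.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, positive, [3.0, 4.0]))  # 0.0 (negative is far enough)
print(triplet_loss(anchor, positive, [0.0, 1.5]))  # 0.5 (negative violates the margin)
print(contrastive_loss(anchor, [3.0, 4.0], same_class=True))   # 25.0
print(contrastive_loss(anchor, [0.0, 0.5], same_class=False))  # 0.25
print(siamese_probability(anchor, anchor, [-1.0, -1.0]))       # 0.5
```

With negative weights in W, a larger L1 difference lowers the predicted same-class probability, which is the behavior the cross-entropy training is expected to encourage.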

Matching Network

  • Paper: https://arxiv.org/pdf/1606.04080.pdf
  • Approach:
    • The idea of Matching Network is that, given a Support set S(x_s, y_s) and a Query set Q(x_q), an attention kernel a(x_s, x_q) is calculated to determine the similarity between x_s and x_q. Then, the Query label y_q can be calculated from the probability distribution:
      • y_q = P(y_q|x_q, S) = Sum[a(x_s, x_q)*y_s]
    • The attention kernel a(x_s, x_q) between two images is the cosine similarity between their embedding vectors, normalized by a softmax over the support samples:
      • a(x_s, x_q) = exp(cosine[f(x_q), g(x_s)])/Sum[exp(cosine[f(x_q), g(x_s)])]
        Where g and f are the embedding functions of the Support and Query set, respectively.
    • Simple version: In the simple version of Matching Network, the embedding function is a neural network (e.g., a CNN) that takes a single data sample as input (for the one-shot learning setting), and g == f
    • Full Context version: For the few-shot learning setting, the Matching Network must take the full Support set S with K samples as input to match with the Query set Q. Therefore, the embedding functions g and f are as follows:
      • g(x_s, S) uses a bidirectional LSTM to encode x_s in the context of the entire support set S.
      • f(x_q, S) encodes the test sample x_q via an LSTM with read-attention over the whole set S, with the formula:
        • f(x_q, S) = attLSTM(f'(x_q), g(x_s, S), K)
          Where f'(x_q) is the embedding feature of x_q and K is a fixed number of unrolling steps of the LSTM (here K is not the number of shots).
  • Code: https://github.com/BoyuanJiang/matching-networks-pytorch
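
The simple version (g == f) of the attention-based prediction can be sketched as follows. This is a minimal illustration, assuming the embeddings are already computed and given as lists of floats, and that labels are integer class indices (so multiplying by y_s becomes adding the weight to the class of each support sample). The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attention(query_emb, support_embs):
    """Softmax over the cosine similarities between the query and each support sample."""
    sims = [cosine(query_emb, s) for s in support_embs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def predict(query_emb, support_embs, support_labels, n_classes):
    """y_q = Sum[a(x_s, x_q) * y_s], with y_s treated as a one-hot label."""
    probs = [0.0] * n_classes
    for weight, label in zip(attention(query_emb, support_embs), support_labels):
        probs[label] += weight
    return probs

# 2-way 2-shot toy example with 2-dimensional embeddings.
support_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
probs = predict([0.8, 0.2], support_embs, support_labels, n_classes=2)
print(probs)  # the probability of class 0 is the larger one
```

The full context version would replace the fixed embeddings with the LSTM-based g(x_s, S) and f(x_q, S), which is not shown here.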

Prototypical Networks

  • Paper: https://arxiv.org/pdf/1703.05175.pdf
  • Idea:
    • The objective of Prototypical Networks is to learn a prototype representation for each of the C classes by calculating the mean of that class's Support Set samples in an embedding space (which can be produced by a simple CNN). These prototype representations are then compared with the Query Set for the Few-shot learning classification task.
    • The basic idea of Prototypical Networks is quite similar to the k-means clustering algorithm.
  • Approach:
    • At first, Prototypical Networks use an embedding function f (via a CNN) to encode the input support set S into an M-dimensional feature space. Then, a prototype feature vector V_c of class c is calculated by:
      • V_c = (1/|S_c|)*Sum[f(x_k)] where x_k ∈ S_c
        Where S_c is the subset of the Support set belonging to class c, with K samples per class
    • Then, the distribution over classes for a given Query input Q is the softmax over the negative distances between the query embedding f(Q) and the prototype vectors V_c, which is used as the basis for classification:
      • P(y=c|Q) = softmax(-d[f(Q), V_c])
      • Therefore, the closer f(Q) is to a prototype V_c, the more likely Q is to be in class c.
    • To optimize the training process, Prototypical Networks minimize the negative log-likelihood L = -logP(y = c|Q) of the true class c. The loss computation for one training episode is given in detail as Algorithm 1 of the paper.
  • Code: https://github.com/jakesnell/prototypical-networks
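
A small plain-Python sketch of the prototype, softmax, and loss computations is shown below. It assumes the embeddings f(x) are already computed and given as lists of floats, uses the squared Euclidean distance as d (one common choice), and uses integer class indices as labels. The function names are hypothetical.

```python
import math

def prototypes(support_embs, support_labels, n_classes):
    """V_c = mean of the support embeddings that belong to class c."""
    dim = len(support_embs[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for emb, label in zip(support_embs, support_labels):
        counts[label] += 1
        for j, value in enumerate(emb):
            sums[label][j] += value
    return [[s / counts[c] for s in sums[c]] for c in range(n_classes)]

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def class_probabilities(query_emb, protos):
    """P(y=c|Q) = softmax(-d[f(Q), V_c]) over all classes c."""
    logits = [-squared_euclidean(query_emb, v) for v in protos]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(query_emb, true_class, protos):
    """L = -log P(y = c|Q) for the true class c."""
    return -math.log(class_probabilities(query_emb, protos)[true_class])

# 2-way 2-shot toy example with 2-dimensional embeddings.
support_embs = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = [0, 0, 1, 1]
protos = prototypes(support_embs, support_labels, n_classes=2)
print(protos)  # [[1.0, 0.0], [5.0, 4.0]]

query = [1.0, 1.0]  # much closer to the prototype of class 0
print(class_probabilities(query, protos))
print(nll_loss(query, 0, protos))
```

In a real training episode this loss would be averaged over all query samples and backpropagated through the embedding network f.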

Model-Agnostic Meta Learning (MAML)

  • Paper: https://arxiv.org/pdf/1703.03400.pdf
  • Model-Agnostic Meta-Learning is an optimization-based approach that refines the global parameters θ of the meta-learner into task-specific parameters θ'i, for tasks ti sampled from a task distribution p(R), through a small number of gradient steps on only a small amount of support data.
  • Approach:
    • Suppose we have a meta-learner model f(θ) with parameters θ. Given a task ti and its associated dataset (Di_train, Di_test), the task-specific parameters θ'i are obtained by one (or more) gradient descent steps on task ti:
      • θ'i = θ - α*∇θ L_ti[f(θ)]
        Where L_ti is the loss of task ti computed with model f(θ), and the step size α may be fixed as a hyperparameter
    • The model parameters are then trained by optimizing the performance of f(θ'i) with respect to θ across tasks sampled from p(R). The meta-objective is:
      • θ* = argmin_θ Sum_ti(L_ti[f(θ'i)])
    • Note that the meta-optimization is performed over the meta-model parameters θ, whereas the objective is computed using the updated model parameters θ'i. In the paper, the meta-optimization across tasks is performed via SGD:
      • θ <− θ - β*∇θ Sum_ti(L_ti[f(θ'i)])
        Where β is the meta step size
    • The full MAML procedure is given as Algorithm 1 in the paper.
  • Code: https://github.com/dragen1860/MAML-Pytorch
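
To make the two update rules concrete, here is a toy sketch of one-step MAML on a scalar linear model f(x) = θ*x with a mean-squared-error loss, where all gradients can be written by hand. Everything in it is an assumption for illustration (the tasks y = slope*x, the values of α and β, and the function names); it is not the implementation from the linked repository. Because θ'i depends on θ, the meta-gradient includes the chain-rule factor dθ'i/dθ = 1 - α*L''_train.

```python
def loss(theta, data):
    """Mean squared error of the model f(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    """dL/dtheta for the mean squared error above."""
    return sum(2 * (theta * x - y) * x for x, y in data) / len(data)

def second_derivative(data):
    """d2L/dtheta2, which is constant for this quadratic loss."""
    return sum(2 * x * x for x, _ in data) / len(data)

def inner_update(theta, train, alpha):
    """theta'_i = theta - alpha * grad of the task loss on the task's training data."""
    return theta - alpha * grad(theta, train)

def meta_gradient(theta, train, test, alpha):
    """d L_test(theta'_i) / d theta, using the chain rule through the inner step."""
    theta_i = inner_update(theta, train, alpha)
    return grad(theta_i, test) * (1.0 - alpha * second_derivative(train))

def maml_step(theta, tasks, alpha, beta):
    """theta <- theta - beta * sum over tasks of the meta-gradient."""
    total = sum(meta_gradient(theta, tr, te, alpha) for tr, te in tasks)
    return theta - beta * total

def make_task(slope):
    """Hypothetical regression task y = slope * x with tiny train/test sets."""
    train = [(x, slope * x) for x in (-1.0, 1.0)]
    test = [(x, slope * x) for x in (-2.0, 2.0)]
    return train, test

def meta_loss(theta, tasks, alpha):
    return sum(loss(inner_update(theta, tr, alpha), te) for tr, te in tasks)

tasks = [make_task(1.0), make_task(3.0)]
alpha, beta, theta = 0.1, 0.01, 0.0
for _ in range(200):
    theta = maml_step(theta, tasks, alpha, beta)
print(round(theta, 4))  # 2.0, halfway between the two task slopes
```

For this symmetric toy problem the learned initialization sits between the two task optima, so a single inner gradient step moves it equally well toward either task.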

These notes were created by quanghuy0497@2022