- The study of computer algorithms that improve automatically through experience
- Improve at task T
- With respect to performance measure P
- Based on experience E
- Given some data, we want to figure out how to group the samples together.
- Given some data, we want to predict an outcome value.
- Given data, can we predict which category something belongs to?
- After training, performing the desired task (classification, regression, clustering, etc.) is based directly on the training data samples themselves
- Training involves learning a model such that performing the desired task is done by running the data through model (which does not directly reference training data samples)
- The model we’re learning is based on blending information via parameters. These parameters must be learned during the training process.
- These algorithms/models are not defined by determining blending parameters.
- However, there may still be user-specified parameters
A large part of machine learning is attempting to divide the hypothesis space, e.g., separating it into clusters or into classes. These “dividers” can be either Linear or Non-Linear.
- Examples: Car Model, School
- Ordering still matters, but there are only so many possible values
- Examples: Blood Pressure, Height
- Unfortunately there’s no single machine learning algorithm to rule them all
- Typically we must consider this information when deciding which set of algorithms to train and test, and then we select the best one.
- Standardized data has zero mean and unit standard deviation.
- Before we start using data we typically want to standardize non-categorical features (this is sometimes referred to as normalizing them).
- Treat each feature independently
- Center it (subtract the mean from all samples)
- Make them all have the same span (divide by standard deviation)
- If we used the data as-is, then one feature might have more influence than the others; see the sketch below.
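A minimal sketch of per-feature standardization with NumPy; the array `X`, its values, and its shape (samples as rows, features as columns) are illustrative assumptions:

```python
import numpy as np

# Hypothetical data matrix: N samples (rows) x D non-categorical features (columns)
X = np.array([[170.0, 120.0],
              [160.0, 135.0],
              [180.0, 110.0]])

# Treat each feature (column) independently
mean = X.mean(axis=0)              # per-feature mean
std = X.std(axis=0, ddof=1)        # per-feature standard deviation

# Center, then scale so every feature has zero mean and unit deviation
X_std = (X - mean) / std
print(X_std.mean(axis=0))          # ~0 for each feature
print(X_std.std(axis=0, ddof=1))   # ~1 for each feature
```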
If data sample Xi has D values associated with it (which we call features), then we can write this as Xi = (xi1, xi2, ..., xiD) and call D the dimensionality of Xi.
- Computation Cost
- Both time and space efficiency
- Statistical Cost
- Too specific? Doesn’t generalize enough?
- Too susceptible to noise?
- Visualization
- How can we look at the data to understand its structure?
- In higher-dimensional spaces we need exponentially more samples for equivalent coverage.
- Therefore, if our data lives in a high-dimensional space, it will likely cover that space sparsely and instances will be far apart from one another
- Represent instances with fewer variables
- Try to preserve as much structure in the data as possible
- If there is class information, increase class separability (Discriminative)
- Need less data to cover the feature space
- Easier learning – fewer parameters to learn
- Easier visualization – hard to visualize more than 3D or 4D
- Feature selection
- Pick a subset of the original dimensions
- If there is class information, pick good class predictors (Discriminative)
- Feature construction
- Construct a new set of features from existing
- In particular we’ll look at feature projection
- Given a set of features, some features are more important than others. We want to select the subset of features to be used by learning algorithms.
- Score each feature, then select the set of features with the best scores
- Given the probabilities of events v1, ..., vn as P(v1), ..., P(vn), we can compute the entropy H(v1, ..., vn) = −Σi P(vi) log2 P(vi)
- Entropy measures the randomness of the data
- Let p = #{1} be the number of samples with label one over the entire dataset.
- Let n = #{0} be the number of samples with label zero over the entire dataset.
- Let pi, ni be the number of samples with label one and label zero, respectively, in subset Ei.
- Define the remainder with respect to attribute A as Remainder(A) = Σi (pi + ni)/(p + n) · H(pi/(pi + ni), ni/(pi + ni)); a small scoring sketch follows below.
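A small sketch of scoring a binary attribute by information gain (entropy minus remainder), following the definitions above; the toy labels and the two subsets E1, E2 are made-up illustrations:

```python
import numpy as np

def entropy(probs):
    """H = -sum P(v) log2 P(v); terms with zero probability are skipped."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    return -np.sum(nz * np.log2(nz))

def remainder(labels, splits):
    """Weighted entropy of the subsets E_i produced by attribute A."""
    total = len(labels)
    rem = 0.0
    for subset in splits:
        p_i = np.sum(subset == 1)
        n_i = np.sum(subset == 0)
        size = p_i + n_i
        rem += (size / total) * entropy([p_i / size, n_i / size])
    return rem

# Toy binary labels and a split induced by some attribute A (illustrative values)
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
splits = [labels[:4], labels[4:]]          # E_1, E_2

p, n = np.sum(labels == 1), np.sum(labels == 0)
gain = entropy([p / len(labels), n / len(labels)]) - remainder(labels, splits)
print(gain)   # information gain of attribute A
```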
- Class 1: {(1,1),(1,3),(2,2)}
- Class 2: {(1,2),(3,2),(2,2)}
- Projecting points onto a projection matrix
- Defines a set of principal components (basis)
- 1st: direction of the greatest variability in the data
- 2nd: perpendicular to 1st, greatest variability of what's left
- Etc... until D, the original dimensionality
- Choose the number of dimensions we want, k < D and project the original data onto the principal components.
- Each projection will result in a point on that axis, resulting in a new k-dimensional feature vector.
- Standardize data
- Compute the mean
- Compute the Standard deviation
- data = (data - mean) / std
- Find covariance matrix
- N = Number of observations
- cov = dataᵀ data / (N - 1)
- Find eigenvalues
- Find eigenvector for the largest eigenvalue
- value = largest eigenvalue
- vector = the eigenvector for given eigenvalue
- find vector, where (cov - value·I) · vector = 0
- Project the data
- data * vector
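A NumPy sketch of these PCA steps on a toy data matrix; the `data` values and the choice of keeping a single principal component are assumptions for illustration:

```python
import numpy as np

# Toy data: N observations x D features (illustrative values)
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0],
                 [2.3, 2.7]])
N = data.shape[0]

# 1. Standardize
data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# 2. Covariance matrix: cov = data^T data / (N - 1)
cov = data.T @ data / (N - 1)

# 3. Eigen-decomposition; keep the eigenvector of the largest eigenvalue
values, vectors = np.linalg.eigh(cov)      # eigh: cov is symmetric
vector = vectors[:, np.argmax(values)]     # 1st principal component

# 4. Project the data onto the principal component
projected = data @ vector                  # one value per observation
print(projected)
```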
- Standardize data for each class
- Compute the mean for each class
- Compute scatter matrices for each class
- cov(C) = Cᵀ C / (N - 1), where C is the class data centered on its class mean
- 𝜎² = (N - 1) · cov(C), the scatter matrix for that class
- Within-class scatter matrix
- SW = 𝜎₁² + 𝜎₂²
- SW⁻¹ = inverse of SW
- Perform eigen-decomposition
- SB = (𝜇₁ - 𝜇₂)ᵀ(𝜇₁ - 𝜇₂), the between-class scatter matrix
- Compute SW⁻¹ SB
- Find eigenvalues
- Find the eigenvector for the only non-zero eigenvalue
- W, the eigenvector will be used to project data.
- Project the data
- class1 * W
- class2 * W
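A NumPy sketch of these LDA steps using the two example classes listed earlier in the notes; for simplicity it centers each class on its own mean rather than fully standardizing, so treat it as an illustrative assumption rather than the exact recipe:

```python
import numpy as np

# The two example classes from earlier in the notes
class1 = np.array([[1, 1], [1, 3], [2, 2]], dtype=float)
class2 = np.array([[1, 2], [3, 2], [2, 2]], dtype=float)

# Means for each class
mu1, mu2 = class1.mean(axis=0), class2.mean(axis=0)

# Scatter matrix for each class: S_i = C^T C, where C is the class data centered on its mean
C1, C2 = class1 - mu1, class2 - mu2
S1, S2 = C1.T @ C1, C2.T @ C2

# Within-class scatter and its inverse
SW = S1 + S2
SW_inv = np.linalg.inv(SW)

# Between-class scatter (mu1 - mu2 treated as a row vector)
diff = (mu1 - mu2).reshape(1, -1)
SB = diff.T @ diff

# Eigen-decomposition of SW^-1 SB; the eigenvector of the only non-zero eigenvalue is W
values, vectors = np.linalg.eig(SW_inv @ SB)
W = vectors[:, np.argmax(values.real)].real

# Project each class onto W
print(class1 @ W)
print(class2 @ W)
```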
- The process of grouping a set of objects into classes of similar objects
- Items within cluster should be similar
- Observations from different clusters should be dissimilar
- Clustering is the most common form of unsupervised learning
- Hard vs Soft Clustering
- Hard clustering: Each observation belongs to exactly one cluster
- Soft clustering: An observation can belong to more than one cluster
- Assumes observations are real-valued vectors
- Each observation is associated with a single reference vector
- The initial reference vectors are often chosen at random
- At each iteration
- Each observation is reassigned to the reference vector closest to it
- A new reference vector is computed based on the observations that are associated with it
- This continues until some termination criterion is met
- Number of iterations
- Threshold on change of reference vectors.
- Convergence: the algorithm has converged once the intra-cluster least-squares error (LSE) no longer changes; a minimal sketch follows below.
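A minimal NumPy sketch of k-means as described above; the toy points, the random initialization, and the lack of empty-cluster handling are simplifying assumptions:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (N, D) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initial reference vectors: k observations chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each observation goes to its closest reference vector
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each reference vector as the mean of its observations
        # (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Terminate when the reference vectors stop changing
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy usage (illustrative data)
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
centers, labels = kmeans(X, k=2)
print(centers, labels)
```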
- These types of algorithms try to find a solution by alternating between two steps:
- Expectation – Here we predict our outcome.
- For k-means, this is assigning each instance to a cluster.
- Maximization – Here we update our model to maximize/minimize something.
- For k-means this is moving the reference vector to the mean of the cluster in order to minimize the distance.
- Conversely, we sometimes maximize the log-likelihood instead
- Start with two random starting vectors μ1 and μ2 and compute sample standard deviations σ1 and σ2 to initialize Gaussians N1(μ1, σ1) and N2(μ2, σ2)
- Initialize the probabilities of each Gaussian. We will need this for the E-Step computations. Typically this is done uniformly unless there is prior information: P(Ni) = 1/k, where k is the number of Gaussians.
- [E-Step] For each observation compute the likelihood (expectation) of each Gaussian given the observation.
- [M-Step] Adjust N1(μ1, σ1) and N2(μ2, σ2), P(N1) and P(N2) to better fit the estimates (maximize the likelihood).
- Iterate (repeat the E-step and M-step) until convergence; a minimal sketch follows below.
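A compact NumPy sketch of EM for two 1-D Gaussians following the steps above; the observations, the starting means, and the number of iterations are illustrative assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Toy 1-D observations drawn around two modes (illustrative values)
x = np.array([1.0, 1.2, 0.8, 0.9, 5.0, 5.3, 4.8, 5.1])

# Initialization: two starting means, the sample std for both, and uniform priors
mu = np.array([0.5, 4.0])
sigma = np.array([x.std(), x.std()])
prior = np.array([0.5, 0.5])            # P(N1), P(N2)

for _ in range(50):
    # E-step: responsibility of each Gaussian for each observation
    lik = np.stack([prior[k] * gaussian_pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = lik / lik.sum(axis=0)        # shape (2, N)

    # M-step: re-estimate the parameters to better fit (maximize) the likelihood
    Nk = resp.sum(axis=1)
    mu = (resp * x).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    prior = Nk / len(x)

print(mu, sigma, prior)
```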
- With hierarchical agglomerative clustering we’re building a clustering binary tree
- At each iteration we have a new set of clusters
- [Top-down Approach]
- At first everything is part of a single cluster.
- Then split this into two clusters based on some criterion
- Now choose one of these two clusters, and split it into two
- Etc.. until each cluster only has one observation in it (called a singleton)
- [Bottom-Up Approach]
- At first everything is its own cluster.
- Choose two of these clusters to merge.
- From these N − 1 clusters, choose two to merge.
- Etc.. until there is only one cluster.
- Intra-Cluster Distance
- Distances between nodes in the same cluster
- Inter-Cluster Distance
- Distances between nodes in different clusters
- Single link
- Complete link
- Average link
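A short sketch of bottom-up clustering with single, complete, and average links, using SciPy's hierarchical clustering routines; the toy points and the choice of cutting the tree into three clusters are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy observations (illustrative values)
X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Bottom-up (agglomerative) clustering with different inter-cluster distances:
#   'single'   - distance between the two closest members of the clusters
#   'complete' - distance between the two farthest members
#   'average'  - average pairwise distance between members
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # (N-1) x 4 merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)
```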
- Let Nij be the number of instances of (supervised) label j within cluster Ci
- Define the average purity of this clustering as Purity = (1/N) Σi maxj Nij, where N is the total number of instances; see the sketch below.
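A quick sketch of this purity computation; the cluster assignments and labels are toy values:

```python
import numpy as np

def purity(cluster_ids, labels):
    """Average purity: for each cluster take the count of its majority label,
    sum over clusters, and divide by the total number of instances."""
    cluster_ids, labels = np.asarray(cluster_ids), np.asarray(labels)
    total = 0
    for c in np.unique(cluster_ids):
        in_cluster = labels[cluster_ids == c]
        # N_ij for this cluster, maximized over labels j
        total += np.bincount(in_cluster).max()
    return total / len(labels)

# Toy example: 3 clusters over 6 instances with labels 0/1
print(purity([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))  # -> 5/6
```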
- Training set
- build/train system using the training data
- Testing set
- test system using the testing data
- Validation set
- for model selection
- How to do cross validation
- [S-Folds] Here we’ll divide our data up into 𝑆 parts, train on 𝑆 − 1 of them and test the remaining part. Do this 𝑆 times.
- [Leave-one-out] If our data set is really small we may want to build our system on 𝑁 − 1 samples and test on just one sample. And do this 𝑁 times (so it’s basically N-folds)
- Why cross validation
- When we do not have much data; a folding sketch follows below.
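A minimal sketch of generating S folds and looping over train/test splits; the fold count, sample count, and the commented-out `train_model` call are placeholders, not part of the notes:

```python
import numpy as np

def s_fold_indices(n_samples, s, seed=0):
    """Split sample indices into S roughly equal folds (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, s)

# Hypothetical usage: train on S-1 folds, test on the held-out fold, S times
n, S = 20, 5
folds = s_fold_indices(n, S)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train_model(X[train_idx], y[train_idx]); evaluate on X[test_idx] ...
    print(f"fold {i}: train {len(train_idx)}, test {len(test_idx)}")
```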
- Identifying (detect)
- [under-fitting] If we don’t do well on either the training or the testing set
- [over-fitting] If we do well on the training set but poorly on the testing set
- Solving
- under-fitting
- Make a more complex model (may require more features)
- Try a different algorithm
- over-fitting
- Use a less complex model (may involve using fewer features)
- Try a different algorithm
- Get more data
- Use a third set to choose between hypotheses (called a validation set)
- Add a penalization (or regularization) term to the equations to penalize model complexity
- Bias vs. Variance
- If our trained model is very dependent on the training data set (that is, it fits it very well), then there will be a large variance in the models produced by different training sets.
- If our model barely changes at all when it sees different training data, then we say it is biased.
- Relation
- variance and overfitting are related
- bias and underfitting are related
|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Positive Examples | True Positive | False Negative |
| Negative Examples | False Positive | True Negative |
- True positive = Hit
- True negative = Correct rejection
- False positive = False Alarm (Type 1 error)
- False negative = Miss (Type 2 error)
- Precision and recall are measures used to evaluate the quality of a classifier: precision = TP / (TP + FP), recall = TP / (TP + FN)
- Precision-Recall graph
- Impact of varying the threshold
- Plotting precision and recall as a function of the threshold creates what is called a precision-recall (PR) curve; see the sketch below.
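A small sketch of sweeping a decision threshold to trace out a PR curve; the classifier scores and true labels are made-up values:

```python
import numpy as np

# Hypothetical classifier scores and true labels (illustrative values)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])

# Sweep the decision threshold and compute precision/recall at each point
for t in np.unique(scores):
    y_pred = (scores >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
# Plotting the (recall, precision) pairs gives the PR curve
```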
- bias -> the model barely changes across training sets; variance -> the model changes too quickly with the training data
- Linear regression with a regularization term: J(𝜃) = ‖𝑌 − 𝑋𝜃‖² + 𝜆𝜃ᵀ𝜃, with 𝜆 typically ranging from 0 to 1. A larger 𝜆 pushes the model toward higher bias; see the sketch below.
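A closed-form sketch of minimizing this regularized objective (ridge regression); the synthetic data and the λ values tried are assumptions for illustration:

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form minimizer of J(theta) = ||Y - X theta||^2 + lambda * theta^T theta."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# Toy data (illustrative): a larger lam shrinks theta toward zero (more bias, less variance)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
for lam in (0.0, 0.1, 1.0):
    print(lam, ridge_fit(X, Y, lam))
```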