Wednesday March 22nd, 08:00 ~ 10:00 a.m. @Randel 327
- Closed book; one sheet of paper with notes is allowed.
- No calculator (compute by hand or leave the answer in formula form).
- A sheet with the necessary rules of calculus will be provided.
- Find underlying pattern
- Predict value
- Predict class
- Do you need / want class probabilities?
- How many instances / observations?
- How many classes?
- How many features?
- Do you have labels?
- Enough data?
- Do you need to do cross validation?
- Do you need / want to "normalize" them?
- Do you need all of them?
- How correlated are they?
- Would you benefit from PCA, LDA or Greedy Selection?
- Is the data linearly separable?
- Do you need to work in a higher dimension (kernel)?
- Likelihood
- AUC
- Precision & Recall
- Accuracy
- If necessary, can you initialize the parameters in an "intelligent" way?
- Use a large learning rate for a few steps, then use that result as the initial parameters.
- For problems we are familiar with, manually estimate the initial parameters.
- Supervised or un-supervised
- Classification, Clustering, or Regression
- Model-based or instance-based algorithm
- Linear vs Non-Linear
- How to deal with categorical data
- How to deal with continuous valued data
- Training complexity (memory and computation)
- Testing complexity (memory and computation)
- How to deal with overfitting
- How to use for multi-class classification
- Un-supervised
- Regression
- Instance-based algorithm
- Linear
- Projecting points onto a lower-dimensional subspace using a projection matrix
- PCA (Principal Component Analysis)
- Choose the number of dimensions we want, k < D, and project the original data onto the top k principal components (sketched below).
- LDA (Linear Discriminant Analysis)
- Find a projection onto a line such that samples from different classes are well separated
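A minimal Octave-style sketch of the PCA projection step described above; the matrix X, the target dimensionality k, and all variable names are illustrative, not from the notes:

```matlab
% X: N x D data matrix (rows are observations), k: target dimensionality, k < D
Xc = X - mean(X);                    % center each feature
[V, E] = eig(cov(Xc));               % eigenvectors/eigenvalues of the covariance
[~, idx] = sort(diag(E), 'descend'); % order components by explained variance
W = V(:, idx(1:k));                  % top-k principal components (D x k)
Z = Xc * W;                          % projected data (N x k)
```

For LDA, the projection direction is usually found by maximizing the standard two-class Fisher criterion (stated here for reference; the notes may use a different but equivalent form):

```latex
J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
S_B = (m_1 - m_2)(m_1 - m_2)^{\top},
\qquad
S_W = \sum_{c} \sum_{x \in c} (x - m_c)(x - m_c)^{\top}
```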
- entropy
- Information gain (IG)
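For reference, the standard definitions of these two quantities, for a set S with class proportions p_i and an attribute A that splits S into subsets S_v:

```latex
H(S) = -\sum_{i} p_i \log_2 p_i,
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```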
- Categorical
- Examples: Car Model, School
- Finite Discrete Valued
- Ordering still matters, but there are only finitely many possible values
- Continuous Valued
- Examples: Blood Pressure, Height
- Standardized data has zero mean and unit standard deviation.
- When do we standardize data?
- Before we start using data we typically want to standardize non-categorical features (this is sometimes referred to as normalizing them).
- How do we standardize data?
- Treat each feature independently
- Center it (subtract the mean from all samples)
- Make them all have the same span (divide by standard deviation)
- Why do we standardize data?
- If we used the data as-is, one feature could have more influence than the others.
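A minimal Octave-style sketch of the steps above, assuming X is an N x D matrix of non-categorical features (variable names are illustrative):

```matlab
mu    = mean(X);            % per-feature mean (1 x D)
sigma = std(X);             % per-feature standard deviation (1 x D)
Xstd  = (X - mu) ./ sigma;  % center each feature, then scale to unit deviation
```

In practice the mean and standard deviation are computed on the training set and reused to standardize the test set.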
- Identifying (detecting)
- [under-fitting] Don’t do well on either the training or the testing set
- [over-fitting] Do well on the training set but poorly on the testing set
- Solving
- under-fitting
- Make a more complex model (may require more features)
- Try a different algorithm
- over-fitting
- Use a less complex model (may involve using fewer features)
- Try a different algorithm
- Get more data
- Use a third set to choose between hypotheses (called a validation set)
- Add a penalization (or regularization) term to the equations to penalize model complexity
- Training set
- build/train system using the training data
- Testing set
- test system using the testing data
- Validation set
- for model selection
- In Bayes’ Rule we call
- 𝑃(𝑦=𝑖|𝑓=𝑥) the posterior (what we want)
- 𝑃(𝑦=𝑖) the prior (probability that 𝑦=𝑖)
- 𝑃(𝑓=𝑥|𝑦=𝑖) the likelihood (likelihood of generating 𝑥 given 𝑦)
- 𝑃(𝑓=𝑥) the evidence
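Putting these pieces together, Bayes' rule reads:

```latex
P(y = i \mid f = x) = \frac{P(f = x \mid y = i)\, P(y = i)}{P(f = x)}
```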
- RMSE
- Accuracy
- Precision, Recall, F-Measure
- PR Graph, ROC Graph
- Area Under Curve (AUC)
- Thresholding
- Anything below that threshold is class 0, anything above it is class 1
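A small Octave-style sketch of thresholding scores and computing these metrics for a binary problem; the variable names and the 0.5 threshold are illustrative:

```matlab
% scores: N x 1 predicted probabilities; labels: N x 1 true classes in {0, 1}
pred = scores >= 0.5;                 % below the threshold -> class 0, above -> class 1
tp = sum(pred == 1 & labels == 1);
fp = sum(pred == 1 & labels == 0);
fn = sum(pred == 0 & labels == 1);
tn = sum(pred == 0 & labels == 0);
accuracy  = (tp + tn) / numel(labels);
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
f_measure = 2 * precision * recall / (precision + recall);
```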
- Linear / Cosine: k(x, y) = x·y (cosine: x·y / (‖x‖ ‖y‖))
- Polynomial kernel: k(x, y) = (x·y + c)^d
- Gaussian Radial Basis kernel (RBF): k(x, y) = exp(−‖x − y‖² / (2σ²))
- Histogram intersection: k(x, y) = Σ_i min(x_i, y_i)
- Hellinger kernel: k(x, y) = Σ_i √(x_i · y_i)
- Sometimes we may want to go to a higher-dimensional feature space, because we have a linear classifier and the data is not directly linearly separable.
- One solution is to map our current space to another space that is separable: project the data to a higher dimension using a mapping function.
- Using the polynomial kernel of degree two on observations with a single feature is equivalent to computing the cosine similarity of the observations mapped into a 3D space.
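As a sanity check on that claim, the degree-two polynomial kernel on scalar inputs expands into an explicit three-dimensional feature map (a standard derivation, not copied from the notes):

```latex
k(x, y) = (xy + c)^2 = x^2 y^2 + 2c\,xy + c^2 = \phi(x) \cdot \phi(y),
\qquad
\phi(x) = \bigl(x^2,\ \sqrt{2c}\, x,\ c\bigr)
```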
- Maximize distance to closest example (of each type)
- We want hyperplane as far as possible from any sample
- New samples close to old samples will then be classified correctly
- Our goal is to maximize the margin
- The margin is twice the absolute value of the distance b from the closest example to the hyperplane
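In the usual formulation this objective is written as follows (stated for reference; note that here b denotes the hyperplane bias, not the distance mentioned above):

```latex
\max_{w, b} \ \frac{2}{\lVert w \rVert}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \ge 1 \quad \forall i
```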
Apply the logistic function once.
- Ideally we’d like to take the derivative of this with respect to 𝜃, set it equal to zero, and solve for 𝜃 to find the maxima
- The closed form approach
- But this isn’t easy
- So what's our other approach?
- Take partial derivatives with respect to the parameters and use gradient descent! (actually, in this case, gradient ascent, since we're trying to maximize)
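A minimal Octave-style sketch of one such gradient-ascent update for logistic regression, assuming data is an N x D design matrix, labels is an N x 1 vector of 0/1 targets, and eta is the learning rate (the names are illustrative and mirror the ANN code below):

```matlab
N = numel(labels);                     % number of training samples
p = 1 ./ (1 + exp(-data * theta));     % predicted probabilities (N x 1)
grad = data' * (labels - p);           % gradient of the log-likelihood w.r.t. theta
theta = theta + (eta / N) .* grad;     % gradient-ascent step
```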
% Forward pass: sigmoid activations for the hidden and output layers
hidden = 1 ./ ( 1 + exp(-1 .* data * beta) );
output = 1 ./ ( 1 + exp(-1 .* hidden * theta) );
% Backward pass: output error, then error propagated back to the hidden layer
% (delta_hid is computed before theta is updated, so the old weights are used)
delta_out = correctValue - output;
delta_hid = (delta_out * theta') .* hidden .* (1 - hidden);
% Gradient-ascent weight updates, averaged over the N samples, learning rate eta
theta = theta + (eta/N) .* (hidden' * delta_out);
beta = beta + (eta/N) .* (data' * delta_hid);
- Same idea as regular ANNs but with additional hidden layers.
- Output layer – Here predicting a supervised target
- Hidden layers – These learn more abstract representations as you head up
- Input layer – Raw sensory inputs (roughly)
- First layer learns 1st order features (e.g. edges, etc..)
- 2nd layer learns higher-order features (combinations of first-layer features, combinations of edges, etc.)
- Then final layer features are fed into supervised layer(s)
- Attempts to find a hidden layer that can reproduce the input
- Basic process to get a hidden layer from one auto-encoder is:
- Take the input, add some noise to it, and add a bias node
- Choose the hidden layer size to be less than the input size
- The output layer should be the same size as the input (minus the bias node)
- Train this auto-encoder using the uncorrupted data as the desired output values.
- After training, remove the output layer (and its weights). Now you have your hidden layer to act as the input to the next layer!
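A compact Octave-style sketch of one training step for a single denoising auto-encoder layer, following the steps above; the noise level, weight names, and the omission of the bias node are my own simplifications:

```matlab
% data: N x D uncorrupted inputs; W1: D x H encoder weights (H < D); W2: H x D decoder weights
noisy  = data + 0.1 * randn(size(data));          % corrupt the input with Gaussian noise
hidden = 1 ./ (1 + exp(-noisy * W1));             % hidden layer, smaller than the input
recon  = 1 ./ (1 + exp(-hidden * W2));            % output layer, same size as the input
delta_out = data - recon;                         % target is the *uncorrupted* input
delta_hid = (delta_out * W2') .* hidden .* (1 - hidden);
W2 = W2 + (eta/N) .* (hidden' * delta_out);       % update decoder, then encoder weights
W1 = W1 + (eta/N) .* (noisy' * delta_hid);
```

After training, W2 is discarded and the hidden activations become the input to the next layer.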
- Stacked auto-encoders
- Do supervised training on last layer
- Then do supervised training on whole network to fine tune the weights
Basic idea: Build different “experts” and let them collaborate to come up with a final decision.
![](Ensemble Learning.png)
- Advantages:
- Improve predictive performance
- Different types of classifiers can be directly included
- Easy to implement
- Not too much parameter tuning (other than that of the individual classifiers themselves)
- Disadvantages
- Not compact
- Combined classifier is not intuitive to interpret
- Classification: Given an unseen sample x
- Each classifier c_j returns the probability P_ji(x) that x belongs to class i = 1, ..., C
- Or, if it cannot return probabilities, it returns P_ji(x) ∈ {0, 1}
- Decide how to combine these "votes" to get a value (probability) for each class y_i, and make the final decision.
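A small Octave-style sketch of one simple way to combine such votes, by averaging the per-classifier probabilities (the matrix layout and averaging rule are illustrative choices, not prescribed by the notes):

```matlab
% P: M x C matrix; row j holds classifier c_j's probabilities P_ji(x) for the C classes
% (for voters that only output 0/1, the rows are one-hot votes instead).
combined = mean(P, 1);        % average the "votes" for each class (1 x C)
[~, yhat] = max(combined);    % final decision: the class with the highest combined value
```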