Wednesday March 22nd, 08:00 ~ 10:00 a.m. @Randel 327
- Closed book; one sheet of paper with notes is allowed.
- No calculator (compute by hand or leave the answer in formula form).
- A sheet with the necessary rules of calculus will be provided.
- Find underlying pattern
- Predict value
- Predict class
- Do you need / want class probabilities?
- How many instances / observations?
- How many classes?
- How many features?
- Do you have labels?
- Enough data?
- Do you need to do cross validation?
- Do you need / want to "normalize" them?
- Do you need all of them?
- How correlated are they?
- Would you benefit from PCA, LDA or Greedy Selection?
- Is the data linearly separable?
- Do you need to work in a higher dimension (kernel)?
- Likelihood
- AUC
- Precision & Recall
- Accuracy
- If necessary, can you initialize the parameters in an "intelligent" way?
- Use a large learning rate for a few steps, then use that result as the initial parameters.
- For problems we are familiar with, manually estimate the initial parameters.
- Supervised or un-supervised
- Classification, Clustering, or Regression
- Model-based or instance-based algorithm
- Linear vs Non-Linear
- How to deal with categorical data
- How to deal with continuous valued data
- Training complexity (memory and computation)
- Testing complexity (memory and computation)
- How to deal with overfitting
- How to use for multi-class classification
- Un-supervised
- Regression
- Instance-based algorithm
- Linear
- Projecting points onto a lower-dimensional subspace using a projection matrix
- PCA (Principal Component Analysis)
- Choose the number of dimensions we want, k < D, and project the original data onto the top k principal components (sketched below).
- LDA (Linear Discriminant Analysis)
- Find a projection onto a line such that samples from different classes are well separated
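A minimal Octave-style sketch of the PCA projection step described above; the matrix X, the target dimensionality k, and all variable names are illustrative, not from the notes:

```matlab
% X: N x D data matrix (rows are observations), k: target dimensionality, k < D
Xc = X - mean(X);                    % center each feature
[V, E] = eig(cov(Xc));               % eigenvectors/eigenvalues of the covariance
[~, idx] = sort(diag(E), 'descend'); % order components by explained variance
W = V(:, idx(1:k));                  % top-k principal components (D x k)
Z = Xc * W;                          % projected data (N x k)
```

For LDA, the projection direction is usually found by maximizing the standard two-class Fisher criterion (stated here for reference; the notes may use a different but equivalent form):

```latex
J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
S_B = (m_1 - m_2)(m_1 - m_2)^{\top},
\qquad
S_W = \sum_{c} \sum_{x \in c} (x - m_c)(x - m_c)^{\top}
```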
- entropy
- Information gain (IG)
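For reference, the standard definitions of these two quantities, for a set S with class proportions p_i and an attribute A that splits S into subsets S_v:

```latex
H(S) = -\sum_{i} p_i \log_2 p_i,
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```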
- Categorical
- Examples: Car Model, School
- Finite Discrete Valued
- Ordering still matters, but there are only finitely many possible values
- Continuous Valued
- Examples: Blood Pressure, Height
- Standardized data has zero mean and unit standard deviation.
- When do we standardize data?
- Before we start using data we typically want to standardize non-categorical features (this is sometimes referred to as normalizing them).
- How do we standardize data?
- Treat each feature independently
- Center it (subtract the mean from all samples)
- Make them all have the same span (divide by standard deviation)
- Why do we standardize data?
- If we used the data as-is, one feature could have more influence than the others.
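A minimal Octave-style sketch of the steps above, assuming X is an N x D matrix of non-categorical features (variable names are illustrative):

```matlab
mu    = mean(X);            % per-feature mean (1 x D)
sigma = std(X);             % per-feature standard deviation (1 x D)
Xstd  = (X - mu) ./ sigma;  % center each feature, then scale to unit deviation
```

In practice the mean and standard deviation are computed on the training set and reused to standardize the test set.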
- Identifying (detecting)
- [under-fitting] Don’t do well on either the training or the testing set
- [over-fitting] Do well on the training set but poorly on the testing set
- Solving
- under-fitting
- Make a more complex model (may require more features)
- Try a different algorithm
- over-fitting
- Use a less complex model (may involve using fewer features)
- Try a different algorithm
- Get more data
- Use a third set to choose between hypotheses (called a validation set)
- Add a penalization (or regularization) term to the equations to penalize model complexity
- Training set
- build/train system using the training data
- Testing set
- test system using the testing data
- Validation set
- for model selection
- In Bayes’ Rule we call
- 𝑃(𝑦=𝑖|𝑓=𝑥) the posterior (what we want)
- 𝑃(𝑦=𝑖) the prior (probability that 𝑦=𝑖)
- 𝑃(𝑓=𝑥|𝑦=𝑖) the likelihood (likelihood of generating 𝑥 given 𝑦)
- 𝑃(𝑓=𝑥) the evidence
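Putting these pieces together, Bayes' rule reads:

```latex
P(y = i \mid f = x) = \frac{P(f = x \mid y = i)\, P(y = i)}{P(f = x)}
```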
- RMSE
- Accuracy
- Precision, Recall, F-Measure
- PR Graph, ROC Graph
- Area Under Curve (AUC)
- Thresholding
- Anything below that threshold is class 0, anything above it is class 1
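A small Octave-style sketch of thresholding scores and computing these metrics for a binary problem; the variable names and the 0.5 threshold are illustrative:

```matlab
% scores: N x 1 predicted probabilities; labels: N x 1 true classes in {0, 1}
pred = scores >= 0.5;                 % below the threshold -> class 0, above -> class 1
tp = sum(pred == 1 & labels == 1);
fp = sum(pred == 1 & labels == 0);
fn = sum(pred == 0 & labels == 1);
tn = sum(pred == 0 & labels == 0);
accuracy  = (tp + tn) / numel(labels);
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
f_measure = 2 * precision * recall / (precision + recall);
```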
- Linear / Cosine: k(x, y) = x·y (cosine: x·y / (‖x‖ ‖y‖))
- Polynomial kernel: k(x, y) = (x·y + c)^d
- Gaussian Radial Basis kernel (RBF): k(x, y) = exp(−‖x − y‖² / (2σ²))
- Histogram intersection: k(x, y) = Σ_i min(x_i, y_i)
- Hellinger kernel: k(x, y) = Σ_i √(x_i · y_i)
- Sometimes we may want to go to a higher-dimensional feature space, because we have a linear classifier and the data is not directly linearly separable.
- One solution is to map our current space to another space that is separable: project the data to a higher dimension using a mapping function.
- Using the polynomial kernel of degree two on observations with a single feature is equivalent to computing the cosine similarity of the observations mapped into a 3D space.
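As a sanity check on that claim, the degree-two polynomial kernel on scalar inputs expands into an explicit three-dimensional feature map (a standard derivation, not copied from the notes):

```latex
k(x, y) = (xy + c)^2 = x^2 y^2 + 2c\,xy + c^2 = \phi(x) \cdot \phi(y),
\qquad
\phi(x) = \bigl(x^2,\ \sqrt{2c}\, x,\ c\bigr)
```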
- Maximize distance to closest example (of each type)
- We want hyperplane as far as possible from any sample
- New samples close to old samples will then be classified correctly
- Our goal is to maximize the margin
- The margin is twice the absolute value of the distance b from the closest example to the hyperplane
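In the usual formulation this objective is written as follows (stated for reference; note that here b denotes the hyperplane bias, not the distance mentioned above):

```latex
\max_{w, b} \ \frac{2}{\lVert w \rVert}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \ge 1 \quad \forall i
```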
Apply the logistic function once.
- Ideally we’d like to take the derivative of this with respect to 𝜃, set it equal to zero, and solve for 𝜃 to find the maxima
- The closed form approach
- But this isn’t easy
- So what's our other approach?
- Take partial derivatives with respect to the parameters and use gradient descent! (actually, in this case, gradient ascent, since we're trying to maximize)
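A minimal Octave-style sketch of one such gradient-ascent update for logistic regression, assuming data is an N x D design matrix, labels is an N x 1 vector of 0/1 targets, and eta is the learning rate (the names are illustrative and mirror the ANN code below):

```matlab
N = numel(labels);                     % number of training samples
p = 1 ./ (1 + exp(-data * theta));     % predicted probabilities (N x 1)
grad = data' * (labels - p);           % gradient of the log-likelihood w.r.t. theta
theta = theta + (eta / N) .* grad;     % gradient-ascent step
```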
% Forward pass: sigmoid activations for the hidden and output layers
hidden = 1 ./ ( 1 + exp(-1 .* data * beta) );
output = 1 ./ ( 1 + exp(-1 .* hidden * theta) );
% Backward pass: output error, then error propagated back to the hidden layer
% (delta_hid is computed before theta is updated, so the old weights are used)
delta_out = correctValue - output;
delta_hid = (delta_out * theta') .* hidden .* (1 - hidden);
% Gradient-ascent weight updates, averaged over the N samples, learning rate eta
theta = theta + (eta/N) .* (hidden' * delta_out);
beta = beta + (eta/N) .* (data' * delta_hid);
- Same idea as regular ANNs but with additional hidden layers.
- Output layer – Here predicting a supervised target
- Hidden layers – These learn more abstract representations as you head up
- Input layer – Raw sensory inputs (roughly)
- First layer learns 1st order features (e.g. edges, etc..)
- 2nd layer learns higher-order features (combinations of first-layer features, combinations of edges, etc.)
- Then final layer features are fed into supervised layer(s)
- Attempts to find a hidden layer that can reproduce the input
- Basic process to get a hidden layer from one auto-encoder is:
- Take the input, add some noise to it, and add a bias node
- Choose the hidden layer size to be less than the input size
- The output layer should be the same size as the input (minus the bias node)
- Train this auto-encoder using the uncorrupted data as the desired output values.
- After training, remove the output layer (and its weights). Now you have your hidden layer to act as the input to the next layer!
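A compact Octave-style sketch of one training step for a single denoising auto-encoder layer, following the steps above; the noise level, weight names, and the omission of the bias node are my own simplifications:

```matlab
% data: N x D uncorrupted inputs; W1: D x H encoder weights (H < D); W2: H x D decoder weights
noisy  = data + 0.1 * randn(size(data));          % corrupt the input with Gaussian noise
hidden = 1 ./ (1 + exp(-noisy * W1));             % hidden layer, smaller than the input
recon  = 1 ./ (1 + exp(-hidden * W2));            % output layer, same size as the input
delta_out = data - recon;                         % target is the *uncorrupted* input
delta_hid = (delta_out * W2') .* hidden .* (1 - hidden);
W2 = W2 + (eta/N) .* (hidden' * delta_out);       % update decoder, then encoder weights
W1 = W1 + (eta/N) .* (noisy' * delta_hid);
```

After training, W2 is discarded and the hidden activations become the input to the next layer.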
- Stacked auto-encoders
- Do supervised training on last layer
- Then do supervised training on whole network to fine tune the weights
Basic idea: Build different “experts” and let them collaborate to come up with a final decision.
![](Ensemble Learning.png)
- Advantages:
- Improve predictive performance
- Different types of classifiers can be directly included
- Easy to implement
- Not too much parameter tuning (other than that of the individual classifiers themselves)
- Disadvantages
- Not compact
- Combined classifier is not intuitive to interpret
- Classification: Given an unseen sample x
- Each classifier c_j returns the probability P_ji(x) that x belongs to class i = 1, ..., C
- Or, if it cannot return probabilities, it returns P_ji(x) ∈ {0, 1}
- Decide how to combine these "votes" to get a value (probability) for each class y_i, and make the final decision.
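A small Octave-style sketch of one simple way to combine such votes, by averaging the per-classifier probabilities (the matrix layout and averaging rule are illustrative choices, not prescribed by the notes):

```matlab
% P: M x C matrix; row j holds classifier c_j's probabilities P_ji(x) for the C classes
% (for voters that only output 0/1, the rows are one-hot votes instead).
combined = mean(P, 1);        % average the "votes" for each class (1 x C)
[~, yhat] = max(combined);    % final decision: the class with the highest combined value
```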