APMAE4990 - Introduction to Data Science in Industry

Instructor: Dorian Goldman

Term: Spring 2019

Location: R 5:30pm-8:00pm, 413 Kent

Objectives:

This course is designed for graduate and advanced undergraduate students who wish to learn the fundamentals of data science and machine learning in the context of real-world applications. An emphasis will be placed on problems encountered by companies such as Amazon, Booking.com, Netflix, Uber/Lyft, The New York Times and others. Although the focus is on applications, the course will be mathematically rigorous; the goal is to motivate each tool with a concrete problem arising in industry. The course will follow an online Jupyter (IPython) notebook so that students can try out the various algorithms in real time as we go through the course.

There will be no midterms or exams; instead, assignments will be handed in periodically throughout the term.

Update: While in previous years students were free to select their own projects, for various reasons I have decided to have everyone work with the same dataset this year. Given the growing size of the class, this will allow me to answer questions more efficiently and to focus on the relevant data science concepts. The project will be announced during the first few lectures of the class.

Prerequisites:

Exposure to undergraduate-level probability, statistics, calculus, programming, and linear algebra.

Grading:

  • 50% Assignments
  • 50% Final Project

Tentative Course Outline:

Introduction

  • Problems involving data that arise in industry.
  • Introduction to regression, classification, clustering. Model training and evaluation (see the sketch below).
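
A minimal sketch of the train/evaluate loop, assuming scikit-learn and a synthetic dataset (not course code; all parameters are illustrative):

```python
# Minimal sketch of model training and evaluation on synthetic data.
# Everything here (dataset, model, metric) is illustrative only.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)          # train
mse = mean_squared_error(y_test, model.predict(X_test))   # evaluate on held-out data
print(f"Held-out MSE: {mse:.2f}")
```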

Supervised Learning

  • Regression: Linear Regression, Random Forest, Gradient Boosting. Examples: ETA prediction for taxis, real estate price prediction, newspaper demand forecasting.
  • Classification: Logistic Regression, Random Forest, Gradient Boosting. Examples: User Churn, Acquisition and Conversion (see the sketch below).
  • Model selection and feature selection. Regularization. Real world performance evaluation and monitoring.
  • Examples from publishing, ride sharing, online commerce and more.
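
As a hedged illustration of the classification topics above: a gradient-boosted classifier scored with cross-validated AUC on synthetic, churn-like (imbalanced) data. The data and hyperparameters are made up and assume scikit-learn.

```python
# Illustrative sketch: gradient-boosted classification on synthetic,
# imbalanced ("churn-like") data, scored with cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)  # ~10% positives

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("Cross-validated AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```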

Unsupervised Learning

  • Clustering: K-means, DBSCAN, Gaussian Mixture Models and Expectation-Maximization.
  • Correlation of features. Principal Component Analysis. The curse of dimensionality (see the sketch below).
  • LDA and topic modeling.
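
A small sketch combining two of the topics above, PCA for dimensionality reduction followed by K-means clustering, on synthetic blobs (assumes scikit-learn; all parameters chosen arbitrarily):

```python
# Illustrative sketch: PCA for dimensionality reduction, then K-means
# clustering, on synthetic blobs (parameters chosen arbitrarily).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, n_features=20, centers=4, random_state=0)

X_2d = PCA(n_components=2).fit_transform(X)   # project onto top 2 principal components
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```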

A/B tests and Causal Inference

  • A/B experiments. Introduction to causal inference (see the sketch below).
  • Offline and Online policy discovery.
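
One simple way to read out an A/B experiment on conversion rates is a two-proportion z-test; a sketch assuming statsmodels, with made-up counts:

```python
# Illustrative sketch: two-proportion z-test for an A/B experiment on
# conversion rates (the counts below are made up).
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 155]    # conversions in control (A) and treatment (B)
exposures = [2400, 2500]    # users exposed to each variant

z_stat, p_value = proportions_ztest(conversions, exposures)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```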

Intro to Data Engineering

  • Map Reduce. SQL.
  • Feature engineering: Testing out new features and verifying their predictive power (see the SQL sketch below).
  • The basics of API building.
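
A hedged sketch of computing a per-user feature with SQL, using an in-memory SQLite database (the table, columns and values are made up for illustration):

```python
# Illustrative sketch: computing a simple per-user feature with SQL
# (in-memory SQLite; table and column names are made up).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (user_id INTEGER, fare REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [(1, 12.5), (1, 7.0), (2, 30.0), (2, 22.0), (3, 9.5)])

# Candidate feature: average fare per user, which could feed a churn model.
for user_id, avg_fare in conn.execute(
        "SELECT user_id, AVG(fare) AS avg_fare FROM rides GROUP BY user_id"):
    print(user_id, round(avg_fare, 2))
```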

Recommendation Engines and Personalization

  • Collaborative Filtering: Matrix Factorization, Neighborhood Models and Graph Diffusion (see the sketch below).
  • Content Filtering: Topic Modeling, Regression, Classification.
  • Cold Starts. Continuous Cold Starts. Warm Starts. Performance Comparison and Analysis.
  • Introduction to Bayesian statistics. Bayesian vs. Frequentist approach.
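
A toy sketch of matrix factorization for collaborative filtering: a low-rank SVD reconstruction of a tiny user-item ratings matrix (all numbers are made up; assumes NumPy only):

```python
# Illustrative sketch: low-rank matrix factorization (via SVD) of a tiny
# user-item ratings matrix; all numbers are made up.
import numpy as np

R = np.array([[5, 4, 0, 1],    # rows = users, columns = items, 0 = unrated
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                           # number of latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]      # rank-k reconstruction = predicted affinities
print(np.round(R_hat, 1))
```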

Reinforcement Learning

  • Multi-armed Bandits. Thompson Sampling. LinUCB (see the sketch below).
  • Markov Decision Processes.
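
A minimal sketch of Thompson Sampling for Bernoulli bandits with Beta(1, 1) priors; the "true" click-through rates are invented for the simulation (assumes NumPy only):

```python
# Illustrative sketch: Thompson Sampling for Bernoulli bandits with
# Beta(1, 1) priors; the "true" click-through rates below are made up.
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.04, 0.06, 0.05])   # unknown to the algorithm
alpha = np.ones(3)                        # Beta posterior parameters (successes + 1)
beta = np.ones(3)                         # Beta posterior parameters (failures + 1)

for _ in range(10_000):
    samples = rng.beta(alpha, beta)       # sample a plausible CTR for each arm
    arm = int(np.argmax(samples))         # play the arm with the highest sample
    reward = 1.0 if rng.random() < true_ctr[arm] else 0.0
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print("Plays per arm:", (alpha + beta - 2).astype(int))
```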

Deep Learning

  • When and why? The problem of hype surrounding deep learning.
  • Image and sound signal processing.
  • Embeddings (see the sketch below).
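
A tiny sketch of what an embedding is mechanically: a lookup table mapping discrete ids to dense vectors. Here the table is random; in practice it is learned as part of a model (assumes NumPy only):

```python
# Illustrative sketch: an embedding is a lookup table mapping discrete ids
# (words, items, users) to dense vectors; here the table is random.
import numpy as np

vocab_size, dim = 10_000, 64
embedding = np.random.default_rng(0).normal(size=(vocab_size, dim))

token_ids = np.array([12, 7, 993])   # e.g. three word ids from a headline
vectors = embedding[token_ids]       # lookup: shape (3, 64)
print(vectors.shape)
```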

References

These are references to deepen your understanding of material presented in lecture. The list is by no means exhaustive.

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013.

Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, Springer, 2013.

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

Cameron Davidson-Pilon, Bayesian Methods for Hackers, https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers