APMAE4990 - Introduction to Data Science in Industry

Instructor: Dorian Goldman

Term: Spring 2019

Location: R 5:30pm-8:00pm, 413 Kent

Objectives:

This course is designed for graduate and advanced undergraduate students who wish to learn the fundamentals of data science and machine learning in the context of real-world applications. An emphasis will be placed on problems encountered by companies such as Amazon, Booking.com, Netflix, Uber/Lyft, The New York Times and others. Although the focus is on applications, the course will be mathematically rigorous; the goal is to motivate each tool with a concrete problem arising in industry. The course will follow an online Jupyter (IPython) notebook so that students can try out the various algorithms in real time as we go through the course.

There will be no midterms or exams; instead, assignments will be handed in periodically throughout the term.

Update: While in previous years students were free to select their own projects, for various reasons I have decided to have everyone work with the same dataset this year. Given the growing size of the class, this will allow me to answer questions more efficiently and to focus on the relevant data science concepts. The project will be announced during the first few lectures of the class.

Prerequisites:

Exposure to undergraduate-level probability, statistics, calculus, programming, and linear algebra.

Grading:

  • 50% Assignments
  • 50% Final Project

Tentative Course Outline:

Introduction

  • Problems involving data that arise in industry.
  • Introduction to regression, classification, clustering. Model training and evaluation (see the sketch below).
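
A minimal sketch of the train/evaluate loop, assuming scikit-learn and a synthetic dataset (not course code; all parameters are illustrative):

```python
# Minimal sketch of model training and evaluation on synthetic data.
# Everything here (dataset, model, metric) is illustrative only.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)          # train
mse = mean_squared_error(y_test, model.predict(X_test))   # evaluate on held-out data
print(f"Held-out MSE: {mse:.2f}")
```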

Supervised Learning

  • Regression: Linear Regression, Random Forest, Gradient Boosting. Examples: ETA prediction for taxis, real estate price prediction, newspaper demand forecasting.
  • Classification: Logistic Regression, Random Forest, Gradient Boosting. Examples: User Churn, Acquisition and Conversion (see the sketch below).
  • Model selection and feature selection. Regularization. Real world performance evaluation and monitoring.
  • Examples from publishing, ride sharing, online commerce and more.
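
As a hedged illustration of the classification topics above: a gradient-boosted classifier scored with cross-validated AUC on synthetic, churn-like (imbalanced) data. The data and hyperparameters are made up and assume scikit-learn.

```python
# Illustrative sketch: gradient-boosted classification on synthetic,
# imbalanced ("churn-like") data, scored with cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)  # ~10% positives

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("Cross-validated AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```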

Unsupervised Learning

  • Clustering: K-means, DBSCAN, Gaussian Mixture Models and Expectation-Maximization.
  • Correlation of features. Principal Component Analysis. The curse of dimensionality (see the sketch below).
  • LDA and topic modeling.
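
A small sketch combining two of the topics above, PCA for dimensionality reduction followed by K-means clustering, on synthetic blobs (assumes scikit-learn; all parameters chosen arbitrarily):

```python
# Illustrative sketch: PCA for dimensionality reduction, then K-means
# clustering, on synthetic blobs (parameters chosen arbitrarily).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, n_features=20, centers=4, random_state=0)

X_2d = PCA(n_components=2).fit_transform(X)   # project onto top 2 principal components
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```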

A/B tests and Causal Inference

  • A/B experiments. Introduction to causal inference (see the sketch below).
  • Offline and Online policy discovery.
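
One simple way to read out an A/B experiment on conversion rates is a two-proportion z-test; a sketch assuming statsmodels, with made-up counts:

```python
# Illustrative sketch: two-proportion z-test for an A/B experiment on
# conversion rates (the counts below are made up).
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 155]    # conversions in control (A) and treatment (B)
exposures = [2400, 2500]    # users exposed to each variant

z_stat, p_value = proportions_ztest(conversions, exposures)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```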

Intro to Data Engineering

  • Map Reduce. SQL.
  • Feature engineering: Testing out new features and verifying their predictive power (see the SQL sketch below).
  • The basics of API building.
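
A hedged sketch of computing a per-user feature with SQL, using an in-memory SQLite database (the table, columns and values are made up for illustration):

```python
# Illustrative sketch: computing a simple per-user feature with SQL
# (in-memory SQLite; table and column names are made up).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (user_id INTEGER, fare REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [(1, 12.5), (1, 7.0), (2, 30.0), (2, 22.0), (3, 9.5)])

# Candidate feature: average fare per user, which could feed a churn model.
for user_id, avg_fare in conn.execute(
        "SELECT user_id, AVG(fare) AS avg_fare FROM rides GROUP BY user_id"):
    print(user_id, round(avg_fare, 2))
```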

Recommendation Engines and Personalization

  • Collaborative Filtering: Matrix Factorization, Neighborhood Models and Graph Diffusion (see the sketch below).
  • Content Filtering: Topic Modeling, Regression, Classification.
  • Cold Starts. Continuous Cold Starts. Warm Starts. Performance Comparison and Analysis.
  • Introduction to Bayesian statistics. Bayesian vs. Frequentist approach.
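
A toy sketch of matrix factorization for collaborative filtering: a low-rank SVD reconstruction of a tiny user-item ratings matrix (all numbers are made up; assumes NumPy only):

```python
# Illustrative sketch: low-rank matrix factorization (via SVD) of a tiny
# user-item ratings matrix; all numbers are made up.
import numpy as np

R = np.array([[5, 4, 0, 1],    # rows = users, columns = items, 0 = unrated
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                           # number of latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]      # rank-k reconstruction = predicted affinities
print(np.round(R_hat, 1))
```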

Reinforcement Learning

  • Multi-armed Bandits. Thompson Sampling. LinUCB (see the sketch below).
  • Markov Decision Processes.
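
A minimal sketch of Thompson Sampling for Bernoulli bandits with Beta(1, 1) priors; the "true" click-through rates are invented for the simulation (assumes NumPy only):

```python
# Illustrative sketch: Thompson Sampling for Bernoulli bandits with
# Beta(1, 1) priors; the "true" click-through rates below are made up.
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.04, 0.06, 0.05])   # unknown to the algorithm
alpha = np.ones(3)                        # Beta posterior parameters (successes + 1)
beta = np.ones(3)                         # Beta posterior parameters (failures + 1)

for _ in range(10_000):
    samples = rng.beta(alpha, beta)       # sample a plausible CTR for each arm
    arm = int(np.argmax(samples))         # play the arm with the highest sample
    reward = 1.0 if rng.random() < true_ctr[arm] else 0.0
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print("Plays per arm:", (alpha + beta - 2).astype(int))
```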

Deep Learning

  • When and why? The problem of hype surrounding deep learning.
  • Image and sound signal processing.
  • Embeddings (see the sketch below).
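
A tiny sketch of what an embedding is mechanically: a lookup table mapping discrete ids to dense vectors. Here the table is random; in practice it is learned as part of a model (assumes NumPy only):

```python
# Illustrative sketch: an embedding is a lookup table mapping discrete ids
# (words, items, users) to dense vectors; here the table is random.
import numpy as np

vocab_size, dim = 10_000, 64
embedding = np.random.default_rng(0).normal(size=(vocab_size, dim))

token_ids = np.array([12, 7, 993])   # e.g. three word ids from a headline
vectors = embedding[token_ids]       # lookup: shape (3, 64)
print(vectors.shape)
```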

References

These are references to deepen your understanding of material presented in lecture. The list is by no means exhaustive.

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013.

Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, Springer, 2013.

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

Cameron Davidson-Pilon, Bayesian Methods for Hackers, https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers