---
layout: default
---

# Machine Learning Systems (Spring 2022)

* When: Mondays from 1:00 to 4:00
* Where: Soda 405 (and on Zoom, with the link posted on Slack)
* Instructor: Joseph E. Gonzalez
* Co-Instructor: Amir Gholami
* Office Hours: Arrange by email.
* Announcements: Slack (please send us an email if you have not been added yet)
* Sign-up to Present: Every student should sign up to present in at least three rows, taking a different role each time. Note that the Backup/Scribe presenter may be asked to fill in for one of the other roles on short notice.
* If you have reading suggestions, please send a pull request to this course website on GitHub by modifying the index.md file.

## Course Description

The recent success of AI has been due in large part to advances in hardware and software systems. These systems have enabled training increasingly complex models on ever larger datasets. In the process, they have also simplified model development, enabling the rapid growth of the machine learning community. These new hardware and software systems include a new generation of GPUs and hardware accelerators (e.g., TPUs), open-source frameworks such as Theano, TensorFlow, PyTorch, MXNet, Apache Spark, Clipper, Horovod, and Ray, and a myriad of systems deployed internally at companies, to name just a few. At the same time, we are witnessing a flurry of ML/RL applications to improve hardware and system designs, job scheduling, program synthesis, and circuit layouts.

In this course, we will describe the latest trends in systems design to better support the next generation of AI applications, as well as applications of AI to optimize the architecture and performance of systems. The format of this course will be a mix of lectures, seminar-style discussions, and student presentations. Students will be responsible for paper readings and for completing a hands-on project. For projects, we will strongly encourage teams that contain both AI and systems students.

## New Course Format

Two previous versions of this course were offered in Spring 2019 and Fall 2019. The format of this third offering is slightly different. Each week will cover a different research area in AI-Systems. The lecture will be organized around a mini program committee meeting for the week's readings. Students will be required to submit detailed reviews for a subset of the papers and to lead the paper review discussions. For some of the topics, we have also invited prominent researchers in each area, who will present an overview of the field followed by a discussion of the questions raised during the "committee meeting". The goal of this new format is both to build mastery of the material and to develop a deeper understanding of how to evaluate and review research, and hopefully to provide insight into how to write better papers.

## Course Syllabus

{% capture dates %} 1/24/22 1/31/22 2/07/22 2/14/22 2/21/22 2/28/22 03/07/22 03/14/22 03/21/22 03/28/22 04/04/22 04/11/22 04/18/22 04/25/22 05/02/22 05/09/22 {% endcapture %} {% assign dates = dates | split: " " %}

This is a tentative schedule. Specific readings are subject to change as new material is published.

[Jump to Today](#today)

{% include syllabus_entry %} [//]: <> (lecture 1)

## Introduction and Course Overview

This lecture will give an overview of the class and its requirements, along with an introduction to the history of machine learning and systems research.

* [SysML: The New Frontier of Machine Learning Systems](https://arxiv.org/abs/1904.03257)
* Read Chapter 1 of [_Principles of Computer System Design_](https://www.sciencedirect.com/book/9780123749574/principles-of-computer-system-design). You will need to be on campus or use the Library VPN to obtain a free PDF.
* [A Few Useful Things to Know About Machine Learning](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)
* [How to read a paper](https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf) provides some pretty good advice on how to read papers effectively.
* Timothy Roscoe's [writing reviews for systems conferences](https://people.inf.ethz.ch/troscoe/pubs/review-writing.pdf) will also help you in the reviewing process.

{% include syllabus_entry %} [//]: <> (lecture 2)

## Big Data Systems

* [Towards a Unified Architecture for in-RDBMS Analytics](https://www.cs.stanford.edu/people/chrismre/papers/bismarck.pdf)
* [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf)
* [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)
* [Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores](https://databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf)
* [Spark SQL: Relational Data Processing in Spark](https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf)
* [The MADlib Analytics Library or MAD Skills, the SQL](https://arxiv.org/pdf/1208.4165.pdf)

{% include syllabus_entry %} [//]: <> (lecture 3)

## Hardware for Machine Learning

* [Mixed precision training [ICLR'18]](https://openreview.net/pdf?id=r1gs9JgRZ)
* [Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks [ISCA'16]](https://dspace.mit.edu/handle/1721.1/102369)
* [Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators (formerly: DNN Dataflow Choice Is Overrated)](https://arxiv.org/pdf/1809.04070.pdf)
* [Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration [Best Paper Award, DAC'21]](https://people.eecs.berkeley.edu/~ysshao/assets/papers/genc2021-dac.pdf)
* [A New Golden Age for Computer Architecture](https://cacm.acm.org/magazines/2019/2/234352-a-new-golden-age-for-computer-architecture/fulltext)
* [Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures [CACM'08]](https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf) (a minimal sketch of the model follows this list)
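
For quick intuition on the roofline reading above, here is a minimal sketch of the model, assuming its standard formulation: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. The function name and device numbers below are illustrative placeholders, not measurements of any real accelerator.

```javascript
// Minimal sketch of the Roofline model (Williams et al., CACM'08):
// attainable FLOP/s = min(peak FLOP/s, memory bandwidth * arithmetic intensity).
// All names and numbers here are illustrative, not real hardware specs.
function rooflineGflops(peakGflops, bandwidthGBps, flopsPerByte) {
  return Math.min(peakGflops, bandwidthGBps * flopsPerByte);
}

// Example: a kernel doing 0.25 FLOPs per byte on a 1000 GFLOP/s, 100 GB/s
// device is memory-bound at 100 * 0.25 = 25 GFLOP/s.
console.log(rooflineGflops(1000, 100, 0.25)); // 25
```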

{% include syllabus_entry %} [//]: <> (lecture 4)

## Distributed Deep Learning, Part I: Systems

* [Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines [SC'21, Best Paper finalist]](https://arxiv.org/pdf/2107.06925.pdf)
* [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM [SC'21, Best Student Paper]](https://arxiv.org/pdf/2104.04473.pdf)
* [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [SC'21]](https://arxiv.org/abs/2104.07857)
* [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale [Blog post]](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
* [Large Scale Distributed Deep Networks [NeurIPS'12]](https://papers.nips.cc/paper/2012/hash/6aca97005c68f1206823815f66102863-Abstract.html)
* [Gpipe: Efficient training of giant neural networks using pipeline parallelism [NeurIPS'19]](https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf) (see the bubble-fraction note after this list)
* [PipeDream: Fast and Efficient Pipeline Parallel DNN Training [SOSP'19]](https://arxiv.org/pdf/1806.03377.pdf)
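
As a hedged aside on the pipeline-parallel readings above: in the simplest synchronous schedule with $p$ pipeline stages and $m$ micro-batches, each stage idles during the ramp-up/ramp-down "bubble", so the idle fraction is approximately

$$
\text{bubble fraction} \;\approx\; \frac{p - 1}{m + p - 1},
$$

since the last micro-batch exits after $m + p - 1$ unit slots while only $m$ slots per stage do useful work. This is why these systems increase the number of in-flight micro-batches (or, as in Chimera, schedule bidirectional pipelines) to amortize the bubble.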

{% include syllabus_entry %}

## Holiday (Presidents' Day)

{% include syllabus_entry %} [//]: <> (lecture 6)

## Distributed Deep Learning, Part II: Scaling Constraints

* [Measuring the Effects of Data Parallelism on Neural Network Training](https://arxiv.org/pdf/1811.03600.pdf)
* [Scaling Laws for Neural Language Models [OpenAI, 2020]](https://arxiv.org/pdf/2001.08361.pdf) (see the note after this list)
* [Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems](https://arxiv.org/abs/2003.09518)
* [On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima [ICLR'17]](https://arxiv.org/pdf/1609.04836.pdf)
* [Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [ICML'20]](https://arxiv.org/pdf/2002.11794.pdf)
* [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes [ICLR'20]](https://arxiv.org/pdf/1904.00962.pdf)
* [Scaling Vision Transformers](https://arxiv.org/pdf/2106.04560.pdf)
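
For orientation on the scaling-laws reading above: Kaplan et al. fit test loss as roughly a power law in model size $N$ and dataset size $D$; the constants $N_c$, $D_c$ and exponents $\alpha_N$, $\alpha_D$ are empirical fits reported in the paper and omitted here:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}.
$$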

{% include syllabus_entry %} [//]: <> (lecture 7)

## Project Proposals

{% include syllabus_entry %} [//]: <> (lecture 8)

## Machine Learning Applied to Systems

* [The Case for Learned Index Structures [SIGMOD'18]](https://arxiv.org/abs/1712.01208)
* [Device Placement Optimization with Reinforcement Learning [ICML'17]](https://arxiv.org/pdf/1706.04972.pdf)
* [Neural Adaptive Video Streaming with Pensieve [SIGCOMM'17]](https://people.csail.mit.edu/hongzi/content/publications/Pensieve-Sigcomm17.pdf)

{% include syllabus_entry %}

## Spring Break

{% include syllabus_entry %} [//]: <> (lecture 10)

## Machine Learning Frameworks and Automatic Differentiation

* [Automatic differentiation in ML: Where we are and where we should be going](https://papers.nips.cc/paper/8092-automatic-differentiation-in-ml-where-we-are-and-where-we-should-be-going) (a minimal reverse-mode sketch follows this list)
* [TensorFlow: A System for Large-Scale Machine Learning](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
* [TVM: An Automated End-to-End Optimizing Compiler for Deep Learning](https://arxiv.org/abs/1802.04799)
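
As background for the automatic-differentiation reading above, here is a minimal reverse-mode sketch over a scalar expression graph. This is an illustrative toy, not the API of any framework in the readings; all names below are made up.

```javascript
// Toy reverse-mode automatic differentiation: every operation records itself
// on a tape, and backpropagation replays the tape in reverse.
var tape = [];

function node(value) {
  return { value: value, grad: 0, backward: function() {} };
}

function add(a, b) {
  var out = node(a.value + b.value);
  out.backward = function() { a.grad += out.grad; b.grad += out.grad; };
  tape.push(out);
  return out;
}

function mul(a, b) {
  var out = node(a.value * b.value);
  out.backward = function() {
    a.grad += b.value * out.grad;
    b.grad += a.value * out.grad;
  };
  tape.push(out);
  return out;
}

// y = x*x + x at x = 3, so dy/dx = 2x + 1 = 7.
var x = node(3);
var y = add(mul(x, x), x);
y.grad = 1; // seed dy/dy = 1
for (var i = tape.length - 1; i >= 0; i--) {
  tape[i].backward();
}
console.log(x.grad); // 7
```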

{% include syllabus_entry %} [//]: <> (lecture 11)

## Efficient Machine Learning

* [Linear Mode Connectivity and the Lottery Ticket Hypothesis](https://arxiv.org/pdf/1912.05671.pdf)
* [Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference](https://arxiv.org/pdf/1712.05877.pdf) (a minimal sketch of uniform quantization follows this list)
* [FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search](https://arxiv.org/abs/1812.03443)
* [Hessian Aware trace-Weighted Quantization of Neural Networks](https://proceedings.neurips.cc/paper/2020/file/d77c703536718b95308130ff2e5cf9ee-Paper.pdf)
* [The State of Sparsity in Deep Neural Networks](https://arxiv.org/pdf/1902.09574.pdf)
* [Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks](https://arxiv.org/pdf/2102.00554.pdf)
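
To ground the quantization readings above, here is a minimal sketch of uniform affine quantization to 8-bit integers, the basic scheme that integer-arithmetic-only inference builds on. The function names and the example scale and zero point are illustrative.

```javascript
// Minimal sketch of uniform affine quantization to uint8.
// Names and constants here are illustrative, not any library's API.
function quantize(x, scale, zeroPoint) {
  // q = round(x / scale) + zeroPoint, clamped to the uint8 range [0, 255].
  var q = Math.round(x / scale) + zeroPoint;
  return Math.min(255, Math.max(0, q));
}

function dequantize(q, scale, zeroPoint) {
  // Recover an approximation of the real value: x ≈ scale * (q - zeroPoint).
  return scale * (q - zeroPoint);
}

// Example: with scale = 0.05 and zeroPoint = 128,
// quantize(1.0, ...) == 148 and dequantize(148, ...) == 1.0.
console.log(quantize(1.0, 0.05, 128));    // 148
console.log(dequantize(148, 0.05, 128));  // 1.0
```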

{% include syllabus_entry %} [//]: <> (lecture 12)

## Fundamentals of Machine Learning in the Cloud, the Modern Data Stack

* [The Sky Above The Clouds](https://drive.google.com/file/d/16xs_-3XRym34z60-Ji4dlDwW487vOmpz/view?usp=sharing)
* [FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply](https://arxiv.org/abs/2006.07512)
* [Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning](https://www.usenix.org/system/files/osdi21-qiao.pdf)
* [RubberBand: cloud-based hyperparameter tuning](https://dl.acm.org/doi/10.1145/3447786.3456245)

{% include syllabus_entry %} [//]: <> (lecture 13)

## Benchmarking Machine Learning Workloads

* [MLPerf Training Benchmark](https://proceedings.mlsys.org/paper/2020/hash/02522a2b2726fb0a03bb19f2d8d9524d-Abstract.html)
* [MLPerf Inference Benchmark](https://arxiv.org/pdf/1911.02549.pdf)
* [Benchmark Analysis of Representative Deep Neural Network Architectures](https://arxiv.org/pdf/1810.00736.pdf)

{% include syllabus_entry %} [//]: <> (lecture 14)

## Machine Learning and Security

{% include syllabus_entry %}

## RRR Week

{% include syllabus_entry %}

## Project Presentations


## Projects

Detailed candidate project descriptions will be posted shortly. However, students are encouraged to find projects that relate to their ongoing research.

## Grading

Grades will be largely based on class participation and projects. In addition, we will require weekly paper summaries submitted before class.

* Projects: 60%
* Weekly Summaries: 20%
* Class Participation: 20%
<script type="text/javascript">
  // Highlight the next upcoming lecture in the syllabus table and drop a
  // #today anchor on it so the "Jump to Today" link works.
  var current_date = new Date();
  var rows = document.getElementsByTagName("th");
  var finished = false;
  for (var i = 1; i < rows.length && !finished; i++) {
    var r = rows[i];
    if (r.id.startsWith("counter_")) {
      // Row ids have the form "counter_<date>_<week>".
      var fields = r.id.split("_");
      var week_div_id = "week_" + fields[2];
      var lecture_date = new Date(fields[1] + " 23:59:00");
      if (current_date <= lecture_date) {
        finished = true;
        r.style.background = "orange";
        r.style.color = "black";
        var week_td = document.getElementById(week_div_id);
        week_td.style.background = "#043361";
        week_td.style.color = "white";
        var anchor = document.createElement("div");
        anchor.setAttribute("id", "today");
        week_td.prepend(anchor);
      }
    }
  }

  // Wrap each optional-reading list in a Bootstrap collapse with a toggle button.
  $(".reading").each(function(ind, elem) {
    var optional_reading = $(elem).find(".optional_reading");
    if (optional_reading.length == 1) {
      optional_reading = optional_reading[0];
      optional_reading.setAttribute("id", "optional_reading_" + ind);
      var button = document.createElement("button");
      button.setAttribute("class", "btn btn-primary btn-sm");
      button.setAttribute("type", "button");
      button.setAttribute("data-toggle", "collapse");
      button.setAttribute("data-target", "#optional_reading_" + ind);
      button.setAttribute("aria-expanded", "false");
      button.setAttribute("aria-controls", "#optional_reading_" + ind);
      optional_reading.setAttribute("class", "optional_reading_no_heading collapse");
      button.innerHTML = "Additional Optional Reading";
      optional_reading.before(button);
    }
  });

  // Same treatment for collapsible "Detailed Description" blocks.
  $(".details").each(function(ind, elem) {
    elem.setAttribute("id", "details_" + ind);
    var button = document.createElement("button");
    button.setAttribute("class", "btn btn-primary btn-sm");
    button.setAttribute("type", "button");
    button.setAttribute("data-toggle", "collapse");
    button.setAttribute("data-target", "#details_" + ind);
    button.setAttribute("aria-expanded", "false");
    button.setAttribute("aria-controls", "#details_" + ind);
    elem.setAttribute("class", "details_no_heading collapse");
    button.innerHTML = "Detailed Description";
    elem.before(button);
  });
</script>