From 33353a2ba0c0c56dc7e7c5406aff43ab96e02f20 Mon Sep 17 00:00:00 2001
From: Maggie Liu
Date: Mon, 11 Sep 2023 11:32:08 -0700
Subject: [PATCH] update Maggie OH

---
 README.md | 111 +++++++++++++++++++++++++++---------------------------
 1 file changed, 55 insertions(+), 56 deletions(-)

diff --git a/README.md b/README.md
index 9a6f6c3..6625562 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 ## Time and Place

 | Section | Days | Time | Room | Instructor |
-|:--------------------|:--------|:--------------|:----------|:-------------------|
+| :------------------ | :------ | :------------ | :-------- | :----------------- |
 | DSCI 100 - 003 (R) | Tue/Thu | 15:30 - 17:00 | HENN 200 | Maggie Liu |
 | DSCI 100 - 004 (R) | Tue/Thu | 11:00 - 12:30 | SWING 222 | Anthony Christidis |
 | DSCI 100 - 008 (R) | Tue/Thu | 08:00 - 09:30 | HENN 200 | Vivian Meng |
@@ -38,10 +38,10 @@ All other required software will be provided by the instructors. Students will l

 ## Prerequisite Knowledge

-- distance between points on a graph
-- percentages, average
-- powers, roots, basic operations, logarithm, exponential
-- equation of a line / plane
+- distance between points on a graph
+- percentages, average
+- powers, roots, basic operations, logarithm, exponential
+- equation of a line / plane

 As an example, British Columbia's Math 12 or Pre-Calculus 12 courses would satisfy the prerequisite.

@@ -49,33 +49,33 @@ As an example, British Columbia's Math 12 or Pre-Calculus 12 courses would satis

 By the end of the course, students will be able to:

-- Read data using computation from various sources (local and remote plain text files, spreadsheets and databases)
-- Wrangle data from their original format into a fit-for-purpose format.
-- Identify the most common types of research/statistical questions and map them to the appropriate type of data analysis.
-- Create, and interpret, meaningful tables from wrangled data.
-- Create, and interpret, impactful figures from wrangled data.
-- Collaborate with others using version control.
-- Apply, and interpret the output of simple classifier and regression models.
-- Make and evaluate predictions using a simple classifier and a regression model.
-- Apply, and interpret the output of, a simple clustering algorithm.
-- Distinguish between in-sample prediction, out-of-sample prediction, and cross-validation.
-- Calculate a point estimate in the context of statistical inference and explain how that relates to the population quantity being estimated.
-- Accomplish all of the above using workflows and communication strategies that are sensible, clear, reproducible, and shareable.
+- Read data using computation from various sources (local and remote plain text files, spreadsheets and databases)
+- Wrangle data from their original format into a fit-for-purpose format.
+- Identify the most common types of research/statistical questions and map them to the appropriate type of data analysis.
+- Create, and interpret, meaningful tables from wrangled data.
+- Create, and interpret, impactful figures from wrangled data.
+- Collaborate with others using version control.
+- Apply, and interpret the output of simple classifier and regression models.
+- Make and evaluate predictions using a simple classifier and a regression model.
+- Apply, and interpret the output of, a simple clustering algorithm.
+- Distinguish between in-sample prediction, out-of-sample prediction, and cross-validation.
+- Calculate a point estimate in the context of statistical inference and explain how that relates to the population quantity being estimated.
+- Accomplish all of the above using workflows and communication strategies that are sensible, clear, reproducible, and shareable.

 ## Teaching Team

-*Note that your TAs are students too; they may have class right before their office hours, and they may run a few minutes late.
Please be patient!*
+_Note that your TAs are students too; they may have class right before their office hours, and they may run a few minutes late. Please be patient!_

 | Section | Position | Name | Email | Office Hours | Office Hour Location |
-|---------|:-------------------|:---------------------|:------------------------------------|:----------------|:---------------------|
+| ------- | :----------------- | :------------------- | :---------------------------------- | :-------------- | :------------------- |
 | All | Course coordinator | Julia Peng | courses[-at-]stat.ubc.ca | n/a | n/a |
 | 002 | Instructor | Daniel Chen | daniel.chen[-at-]stat.ubc.ca | Mon 5.30-6.30pm | HENN 200 |
-| 003 | Instructor | Maggie Liu | yitong.liu[-at-]stat.ubc.ca | | |
+| 003 | Instructor | Maggie Liu | yitong.liu[-at-]stat.ubc.ca | Thu 1:30-2:30pm | ESB 1043 |
 | 004 | Instructor | Anthony Christidis | anthony.christidis[-at-]stat.ubc.ca | | |
 | 008 | Instructor | Vivian Meng | vivian.meng[-at-]stat.ubc.ca | | |
 | 101 | Instructor | Joel Ostblom | joel.ostblom[-at-]ubc.ca | Fri 4-5pm | Zoom |
 | 003 | TA | Andal Abro Khan | n/a | | |
-| 003 | TA | Jossie Jiang | n/a | Thu 11-12pm | Zoom |
+| 003 | TA | Jossie Jiang | n/a | Thu 11-12pm | Zoom |
 | 003 | TA | Ali Mehrabian | n/a | | |
 | 003 | TA | Eros Rojas | n/a | Wed 4-5pm | ESB 1041 |
 | 004 | TA | Lily Xie | n/a | | |
@@ -86,41 +86,40 @@ By the end of the course, students will be able to:
 | 008 | TA | Angelique Clara | n/a | Tue 10-11am | Zoom |
 | 008 | TA | Richard Yang | n/a | | |
 | 008 | TA | Nelson Li | n/a | | |
-| 008 | TA | Kevin Wang | n/a | Thu 1-2pm | ESB 1041 |
+| 008 | TA | Kevin Wang | n/a | Thu 1-2pm | ESB 1041 |
 | 009 | TA | Moira Renata | n/a | | |
 | 009 | TA | Samuel Leung | n/a | | |
 | 009 | TA | Kaylee (Min-Er) Li | n/a | | |
 | 009 | TA | Eric Li | n/a | Fri 10-11am | Zoom |
 | 101 | TA | Atabak Eghbal | n/a | | |
 | 101 | TA | Feifei Yang | n/a | Wed 11-12pm | Zoom |
-| 101 | TA | Indu Kant Deo | n/a | Wed 4-5pm |
Zoom |
+| 101 | TA | Indu Kant Deo | n/a | Wed 4-5pm | Zoom |

 **Please contact the course coordinator about any administrative questions. Please read the course policy (e.g., late registration, missing exam/assignment due to sickness) below before contacting**.

-
 When sending emails, please include DSCI 100 in the subject line.

 ## Assessment

 - One midterm covering ~6 weeks of material
- - Same time & location of week 7's tutorial. Invigilated in-person.
+ - Same time & location of week 7's tutorial. Invigilated in-person.
 - One final covering all the material in the course
- - To be scheduled by Classroom Services. Invigilated in-person.
+ - To be scheduled by Classroom Services. Invigilated in-person.

 **Note**: Since DSCI 100 is a large course with multiple sections (hence, multiple versions of exams), the instructors reserve the rights to scale grades in order to maintain equity among sections according the [UBC campus wide policies and regulations](https://www.calendar.ubc.ca/Vancouver/index.cfm?tree=3,42,96,0).

 In each class (lecture and tutorial) there will be an assignment:

-- Lecture and tutorial worksheet **due dates are posted on Canvas.**
-- To open the assignment, click the link (e.g. `worksheet_intro`) from Canvas.
-- To submit your assignment, just make sure your work is saved **on our server** (`File -> Save Notebook` to be sure).
-- At the deadline, our server will automatically snapshot your work.
-- You **must access the lecture and tutorial worksheets through our Canvas course page** (as opposed to the worksheets publicly available via Github). Otherwise your worksheets may not be marked!
+- Lecture and tutorial worksheet **due dates are posted on Canvas.**
+- To open the assignment, click the link (e.g. `worksheet_intro`) from Canvas.
+- To submit your assignment, just make sure your work is saved **on our server** (`File -> Save Notebook` to be sure).
+- At the deadline, our server will automatically snapshot your work.
+- You **must access the lecture and tutorial worksheets through our Canvas course page** (as opposed to the worksheets publicly available via Github). Otherwise your worksheets may not be marked! ### Course breakdown | Deliverable | Percent Grade | -|-----------------------|---------------| +| --------------------- | ------------- | | Worksheets | 5 | | Tutorials | 8 | | Group project | 11 | @@ -131,7 +130,7 @@ In each class (lecture and tutorial) there will be an assignment: ### Group project breakdown | Deliverable | Percent Grade | -|----------------|---------------| +| -------------- | ------------- | | Proposal | 2 | | Final report | 6 | | Team work | 2 | @@ -141,25 +140,25 @@ In each class (lecture and tutorial) there will be an assignment: -| Week | Topic | Description | -|------|--------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| 1 | Introduction | Learn to use a programming language and Jupyter notebooks as you walk through a real world data Science application that includes downloading data from the web, wrangling the data into a useable format and creating an effective data visualization. | -| 2 | Reading in data locally and from the web | Learn to read in various cases of data sets locally and from the web. 
Once read in, these data sets will be used to walk through a real world data Science application that includes wrangling the data into a useable format and creating an effective data visualization. |
-| 3 | Cleaning and wrangling data | This week will be centered around tools for cleaning and wrangling data. Again, this will be in the context of a real world data science application and we will continue to practice working through a whole case study that includes downloading data from the web, wrangling the data into a useable format and creating an effective data visualization. |
-| 4 | Effective data visualization | Expand your data visualization knowledge and tool set beyond what we have seen and practiced so far. We will move beyond scatter plots and learn other effective ways to visualize data, as well as some general rules of thumb to follow when creating visualations. All visualization tasks this week will be applied to real world data sets. Again, this will be in the context of a real world data science application and we will continue to practice working through a whole case study that includes downloading data from the web, wrangling the data into a useable format and creating an effective data visualization. |
-| 5 | Version control | This chapter will introduce the concept of using version control systems to track changes to a project over its lifespan, to share and edit code in a collaborative team, and to distribute the finished project to its intended audience. This chapter will also introduce how to use the two most common version control tools: Git for local version control, and GitHub for remote version control. We will focus on the most common version control operations used day-to-day in a standard data science project. There are many user interfaces for Git; in this chapter we will cover the Jupyter Git interface.
|
-| 5 | Group contract due | |
-| 6 | Classification | This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on classification, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. |
-| 7 | Classification, continued | This chapter continues the introduction to predictive modeling through classification. While the previous chapter covered training and data preprocessing, this chapter focuses on how to evaluate the performance of a classifier, as well as how to improve the classifier (where possible) to maximize its accuracy. |
-| 7 | Midterm | Covers week 1-6 concepts |
-| 8 | Regression | This chapter continues our foray into answering predictive questions. Here we will focus on predicting numerical variables and will use regression to perform this task. This is unlike the past two chapters, which focused on predicting categorical variables via classification. However, regression does have many similarities to classification: for example, just as in the case of classification, we will split our data into training, validation, and test sets, we will use scikit-learn workflows, we will use a K-nearest neighbors (KNN) approach to make predictions, and we will use cross-validation to choose K. We will focus on prediction in cases where there is a response variable of interest and a single explanatory variable.
|
-| 8 | Group proposal due | |
-| 9 | Regression, continued | Up to this point, we have solved all of our predictive problems—both classification and regression—using K-nearest neighbors (KNN)-based approaches. In the context of regression, there is another commonly used method known as linear regression. This chapter provides an introduction to the basic concept of linear regression, shows how to use scikit-learn to perform linear regression in Python, and characterizes its strengths and weaknesses compared to KNN regression. The focus is, as usual, on the case where there is a single predictor and single response variable of interest; but the chapter concludes with an example using multivariable linear regression when there is more than one predictor. |
-| 10 | Clustering | As part of exploratory data analysis, it is often helpful to see if there are meaningful subgroups (or clusters) in the data. This grouping can be used for many purposes, such as generating new questions or improving predictive analyses. This chapter provides an introduction to clustering using the K-means algorithm, including techniques to choose the number of clusters. |
-| 11 | Introduction to statistical inference | A typical data analysis task in practice is to draw conclusions about some unknown aspect of a population of interest based on observed data sampled from that population; we typically do not get data on the entire population. Data analysis questions regarding how summaries, patterns, trends, or relationships in a data set extend to the wider population are called inferential questions. This chapter will start with the fundamental ideas of sampling from populations and then introduce two common techniques in statistical inference: point estimation and interval estimation. |
-| 12 | Introduction to statistical inference, continued | Unfortunately, we cannot construct the exact sampling distribution without full access to the population.
However, if we could somehow approximate what the sampling distribution would look like for a sample, we could use that approximation to then report how uncertain our sample point estimate is (as we did above with the exact sampling distribution). There are several methods to accomplish this; in this course, we will use the bootstrap. We will discuss interval estimation and construct confidence intervals using just a single sample from a population. A confidence interval is a range of plausible values for our population parameter. | -| 12 | Group report due | | -| 13 | Final | Covers all the material. To be Scheduled by To be scheduled by Classroom Services | +| Week | Topic | Description | +| ---- | ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 1 | Introduction | Learn to use a programming language and Jupyter notebooks as you walk through a real world data Science application that includes downloading data from the web, wrangling the data into a useable format and creating an effective data visualization. | +| 2 | Reading in data locally and from the web | Learn to read in various cases of data sets locally and from the web. 
Once read in, these data sets will be used to walk through a real world data Science application that includes wrangling the data into a useable format and creating an effective data visualization. |
+| 3 | Cleaning and wrangling data | This week will be centered around tools for cleaning and wrangling data. Again, this will be in the context of a real world data science application and we will continue to practice working through a whole case study that includes downloading data from the web, wrangling the data into a useable format and creating an effective data visualization. |
+| 4 | Effective data visualization | Expand your data visualization knowledge and tool set beyond what we have seen and practiced so far. We will move beyond scatter plots and learn other effective ways to visualize data, as well as some general rules of thumb to follow when creating visualations. All visualization tasks this week will be applied to real world data sets. Again, this will be in the context of a real world data science application and we will continue to practice working through a whole case study that includes downloading data from the web, wrangling the data into a useable format and creating an effective data visualization. |
+| 5 | Version control | This chapter will introduce the concept of using version control systems to track changes to a project over its lifespan, to share and edit code in a collaborative team, and to distribute the finished project to its intended audience. This chapter will also introduce how to use the two most common version control tools: Git for local version control, and GitHub for remote version control. We will focus on the most common version control operations used day-to-day in a standard data science project. There are many user interfaces for Git; in this chapter we will cover the Jupyter Git interface.
|
+| 5 | Group contract due | |
+| 6 | Classification | This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on classification, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. |
+| 7 | Classification, continued | This chapter continues the introduction to predictive modeling through classification. While the previous chapter covered training and data preprocessing, this chapter focuses on how to evaluate the performance of a classifier, as well as how to improve the classifier (where possible) to maximize its accuracy. |
+| 7 | Midterm | Covers week 1-6 concepts |
+| 8 | Regression | This chapter continues our foray into answering predictive questions. Here we will focus on predicting numerical variables and will use regression to perform this task. This is unlike the past two chapters, which focused on predicting categorical variables via classification. However, regression does have many similarities to classification: for example, just as in the case of classification, we will split our data into training, validation, and test sets, we will use scikit-learn workflows, we will use a K-nearest neighbors (KNN) approach to make predictions, and we will use cross-validation to choose K. We will focus on prediction in cases where there is a response variable of interest and a single explanatory variable.
|
+| 8 | Group proposal due | |
+| 9 | Regression, continued | Up to this point, we have solved all of our predictive problems—both classification and regression—using K-nearest neighbors (KNN)-based approaches. In the context of regression, there is another commonly used method known as linear regression. This chapter provides an introduction to the basic concept of linear regression, shows how to use scikit-learn to perform linear regression in Python, and characterizes its strengths and weaknesses compared to KNN regression. The focus is, as usual, on the case where there is a single predictor and single response variable of interest; but the chapter concludes with an example using multivariable linear regression when there is more than one predictor. |
+| 10 | Clustering | As part of exploratory data analysis, it is often helpful to see if there are meaningful subgroups (or clusters) in the data. This grouping can be used for many purposes, such as generating new questions or improving predictive analyses. This chapter provides an introduction to clustering using the K-means algorithm, including techniques to choose the number of clusters. |
+| 11 | Introduction to statistical inference | A typical data analysis task in practice is to draw conclusions about some unknown aspect of a population of interest based on observed data sampled from that population; we typically do not get data on the entire population. Data analysis questions regarding how summaries, patterns, trends, or relationships in a data set extend to the wider population are called inferential questions. This chapter will start with the fundamental ideas of sampling from populations and then introduce two common techniques in statistical inference: point estimation and interval estimation. |
+| 12 | Introduction to statistical inference, continued | Unfortunately, we cannot construct the exact sampling distribution without full access to the population.
However, if we could somehow approximate what the sampling distribution would look like for a sample, we could use that approximation to then report how uncertain our sample point estimate is (as we did above with the exact sampling distribution). There are several methods to accomplish this; in this course, we will use the bootstrap. We will discuss interval estimation and construct confidence intervals using just a single sample from a population. A confidence interval is a range of plausible values for our population parameter. |
+| 12 | Group report due | |
+| 13 | Final | Covers all the material. To be Scheduled by To be scheduled by Classroom Services |

 ## Policies

@@ -189,7 +188,7 @@ For all other assignments and the course project, a **late submission will recei

 Many of the questions in assignments are graded automatically by software. The grading computer has exactly the same hardware setup as the server that students work on. No assignment, when completed, should take longer than 5 minutes to run on the server. The autograder will automatically stop (time out) for each student assignment after a maximum of 5 minutes; **any ungraded questions at that point will receive a score of 0.**

-Students are responsible for making sure their assignments are *reproducible*, and run from beginning to end on the autograding computer. In particular, **please ensure that any data that needs to be downloaded is done so by the assignment notebook with the correct filename to the correct folder.** A common mistake is to manually download data when working on the assignment, making the autograder unable to find the data and often resulting in an assignment grade of 0.
+Students are responsible for making sure their assignments are _reproducible_, and run from beginning to end on the autograding computer.
In particular, **please ensure that any data that needs to be downloaded is done so by the assignment notebook with the correct filename to the correct folder.** A common mistake is to manually download data when working on the assignment, making the autograder unable to find the data and often resulting in an assignment grade of 0.

 In short: whatever grade the autograder returns after 5 minutes (assuming the teaching team did not make an error) is the grade that will be assigned.

@@ -225,9 +224,9 @@ A more detailed description of academic integrity, including the University's po

 Students must correctly cite any code or text that has been authored by someone else or by the student themselves for other assignments. Cases of plagiarism may include, but are not limited to:

-- the reproduction (copying and pasting) of code or text with none or minimal reformatting (e.g., changing the name of the variables)
-- the translation of an algorithm or a script from a language to another
-- the generation of code and/or text by automatic code-generation software or large language model
+- the reproduction (copying and pasting) of code or text with none or minimal reformatting (e.g., changing the name of the variables)
+- the translation of an algorithm or a script from a language to another
+- the generation of code and/or text by automatic code-generation software or large language model

 An "adequate acknowledgement" requires a detailed identification of the (parts of the) code or text reused and a full citation of the original source code that has been reused.