
DS 2001 Semester Project

Project Specifications:

• The instructor will place you into a group of 2-3 students
• Pick a data set that you and your group find interesting. (Example sources are listed below; feel free to select your data
  from any other appropriate source.)
• Form a research question
• Perform data pre-processing, data cleaning, outlier removal, and so on to sanitize your data as necessary.
• Save your data in a .csv file (or other format as appropriate for your data set and project scenario).
• Explore your data to reveal interesting/useful information based on your project scenario. 
• Create at least 2 visualizations that you find interesting/useful.  
• Do at least one of the following, depending on your interests and background:
  - perform a statistical test on the data (e.g., t-test)  
  - compute meaningful statistical quantities (e.g., means, correlations)  
  - fit a model to the data (e.g., regression)  
• Write at least two unit tests. For example, these might be short tests to show that two different functions work as intended.
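
As a rough illustration of the cleaning and unit-test items above, here is a minimal sketch of one possible outlier filter and two short tests for it. The function name `remove_outliers_iqr`, the 1.5 × IQR rule, and the toy numbers are all assumptions made for this example; the guidelines do not prescribe a cleaning method or a test framework. The tests are plain `assert`-based functions, which a runner such as pytest could collect if your group chooses to use one.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Return the values lying within k * IQR of the first and third quartiles.

    Hypothetical helper for illustration; k=1.5 is a common convention,
    not a requirement of the project.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the data.
    return [v for v in values if low <= v <= high]


def test_extreme_value_is_removed():
    # Toy data: one value far from the rest should be dropped.
    data = [10, 12, 11, 13, 12, 11, 100]
    assert 100 not in remove_outliers_iqr(data)


def test_clean_data_is_unchanged():
    # Toy data with no extreme values should pass through untouched.
    data = [10, 12, 11, 13, 12, 11]
    assert remove_outliers_iqr(data) == data
```

Each test checks one behavior of one function, which is roughly the scale of test the last bullet describes.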

Some Data Source Suggestions:

Kaggle
UCI Machine Learning Repository
Carnegie Mellon StatLib Datasets Archive
Federal Reserve Economic Data
Datasets for Data Science and Data Mining
Physionet Physiological signals including ECGs
NYC OpenData

Deliverables:

1. WRITTEN REPORT (no more than 10 pages) containing:

  • Abstract: A one-paragraph overview describing your question, what you did, and what you learned
  • Introduction: Describe your project scenario. Starting out, what did you hope to accomplish/learn?
  • Data description: Describe your data set and its significance. Where did you obtain this data set?
    Why did you choose the data set that you did?
    Indicate if you carried out any preprocessing/data cleaning/outlier removal, and so on to sanitize your data.
  • Data processing methodology: Briefly describe your process, from obtaining your data through to producing your results/output.
  • Results:
    • Show at least two visualizations
    • Display and discuss the results. Describe what you have learned and mention the relevance/significance of the results you have obtained.
    • Testing: Describe what testing you did. Describe the unit tests that you wrote. Show a sample run of 1 or 2 of your tests (screen captures or copy-and-paste is fine).
  • Conclusions: Summarize your findings, explain how these results could be used by others (if applicable), and describe ways you could improve your program. You could describe ways you might like to expand the functionality of your program if given more time.

2. PRESENTATION

  • Each group will give a presentation not to exceed 10 minutes

  • The presentation should briefly include:

    • research question
    • data summary
    • data processing methodology
    • visualizations
    • results
    • conclusions
  • The presentation file format should be PowerPoint or PDF

  • The file name should begin with GroupName[n]_ where [n] is your group number

  • Be sure to practice beforehand, and time yourselves.

3. CODE

  • Clearly document, organize, and name your code file or files
  • The files can be in Jupyter Notebooks or Python scripts
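
For a sense of what a clearly documented and organized Python script might look like, here is a small sketch under stated assumptions: the module layout, the function name `pearson_correlation`, and the toy `hours`/`scores` values are hypothetical and chosen only for illustration. It computes a mean and a correlation, two of the statistical quantities mentioned in the project specifications.

```python
"""Compute summary statistics for two numeric columns (illustrative layout only)."""
import math
import statistics


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


def main():
    # Hypothetical toy data; a real project would load its own .csv file here.
    hours = [1, 2, 3, 4, 5]
    scores = [52, 55, 61, 64, 70]
    print(f"mean score: {statistics.fmean(scores):.1f}")
    print(f"correlation: {pearson_correlation(hours, scores):.3f}")


if __name__ == "__main__":
    main()
```

Keeping the computation in a named, documented function and the script entry point in `main()` makes the function easy to import from a notebook or a unit test.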

SUBMISSION

  • Submit one zip file through Collab containing: (1) the written report and (2) the code files

RUBRIC
Total Points = 100

Description                                              Possible Points
Paper includes abstract                                  10
Paper includes introduction                              10
Paper discusses data source and provides data summary    10
Paper discusses data preprocessing                       10
Paper includes at least two visualizations               10
Paper includes results, clearly shown                    10
Code presents/discusses unit tests                       10
Code is clear and well-documented                        10
Presentation skills                                      20

where Presentation skills comprises:

  • All group members presented
  • The presentation was of good quality, clear and easy to understand