# BIOS 611

Doing reliable, reproducible, and traceable research is a real challenge. While computers have drastically amplified our ability to perform modeling, visualization, and analysis, they have also multiplied the number of technical choices and dependencies involved. Scientists are naturally focused on the content of their research, and so this technology has produced large amounts of unportable, difficult-to-reproduce results, with far-reaching implications for both the pursuit of science and its public reputation.

This course is meant to provide strategies to attack these challenges, primarily by adapting methods and tools from software engineering to the task of ingesting, visualizing, modeling and reporting on results in such a way that:

  1. The environment the scientist uses throughout the process is documented and reproducible (via Docker; sketched below).
  2. Each step of an analysis is clearly separated, and the dependencies between steps are documented in such a way that results are repeatable from scratch, every time (via build systems like Make; sketched below).
  3. The entire history of the project is recorded and explorable, so that any previous stage of the project can be called up at will and executed with its appropriate data and context (via git; sketched below).
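
To make the first point concrete, here is a minimal Dockerfile sketch; the base image and pinned package versions are illustrative assumptions, not the course's required setup:

```dockerfile
# Hypothetical sketch: pin an exact base image so the environment
# can be rebuilt identically later.
FROM python:3.11-slim

# Pin the analysis's dependencies to exact versions.
RUN pip install pandas==2.1.0 matplotlib==3.8.0

# The project's code and data live here inside the container.
WORKDIR /project
```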
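
For the second point, a Makefile records which outputs depend on which inputs, so the whole analysis can be rebuilt from scratch in the right order. A minimal sketch with made-up file names:

```makefile
# Hypothetical pipeline: the report needs a figure, the figure needs
# cleaned data, and the cleaned data comes from the raw data.
report.pdf: report.md figure.png
	pandoc report.md -o report.pdf

figure.png: clean_data.csv make_figure.py
	python make_figure.py

clean_data.csv: raw_data.csv clean.py
	python clean.py
```

Running `make report.pdf` rebuilds only the steps whose inputs have changed.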
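
And for the third point, standard git commands record and recover the project's history:

```bash
# Record a snapshot of the project as it stands now.
git add clean.py
git commit -m "Add data cleaning step"

# Later, list past snapshots and restore the project to any of them.
git log --oneline
git checkout <commit-hash>
```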

To get to this stage we will also develop familiarity with the world of Unix, upon which most of our tools naturally depend. And, so that we have something to do with these tools, we will also review basic methods of data cleansing, visualization, and modeling in common scientific programming languages like R and Python. Some attention will naturally be devoted to understanding how these tools work in a generalizable way.
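
As a small taste of that kind of work, here is a hypothetical Python sketch of a cleaning-and-plotting step (the file name and column are made up for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input; in a real pipeline this path would be a
# documented dependency (e.g., a Makefile prerequisite).
df = pd.read_csv("raw_data.csv")

# A basic cleaning step: drop rows with missing values.
df = df.dropna()

# A basic visualization: histogram of one hypothetical column.
df["measurement"].hist()
plt.savefig("figure.png")
```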

A student who has completed this course will have a git repository containing a portfolio data science project that adheres closely to these principles.