Pymaceuticals.ipynb Completed Assignment
You've just joined Pymaceuticals, Inc., a new pharmaceutical company that specializes in anti-cancer medications. Recently, it began screening for potential treatments for squamous cell carcinoma (SCC), a commonly occurring form of skin cancer.
As a senior data analyst at the company, you've been given access to the complete data from their most recent animal study. In this study, 249 mice who were identified with SCC tumors received treatment with a range of drug regimens. Over the course of 45 days, tumor development was observed and measured. The purpose of this study was to compare the performance of Pymaceuticals’ drug of interest, Capomulin, against the other treatment regimens.
The executive team has tasked you with generating all of the tables and figures needed for the technical report of the clinical study. They have also asked you for a top-level summary of the study results.
- Merge the datasets into a single DataFrame
- Display the total number of mice from the merged DataFrame
- Identify each duplicate mouse based on Mouse ID and Timepoint
- Create a clean DataFrame by dropping duplicate mice
- Display the total number of mice from the clean DataFrame
- Calculate the mean of the tumor volume for each regimen using
groupby
- Calculate the median of the tumor volume for each regimen using
groupby
- Calculate the variance of the tumor volume for each regimen using
groupby
- Calculate the standard deviation of the tumor volume for each regimen using
groupby
- Calculate the SEM of the tumor volume for each regimen using
groupby
- Create a new DataFrame to store the summary statistics
- Generate a bar plot using Pandas to show the total number of timepoints for all mice tested for each drug regimen
- Generate a bar plot using
pyplot
to show the total number of timepoints for all mice tested for each drug regimen - Generate a pie chart using Pandas to show the distribution of unique female vs. male mice
- Generate a pie chart using
pyplot
to show the distribution of unique female vs. male mice
- Create a DataFrame with the last timepoint for each mouse ID using
groupby
- Reset the index of the DataFrame
- Retrieve the maximum timepoint for each mouse
- Define a list of the four treatment groups: Capomulin, Ramicane, Infubinol, and Ceftamin
- Create an empty list to store tumor volume data
- Use a
for
loop to display the interquartile range (IQR) and outliers for each treatment group - Generate a box plot showing the distribution of final tumor volume for all mice in each treatment group
- Generate a line plot showing tumor volume vs. timepoint for one mouse treated with Capomulin
- Generate a scatter plot showing average tumor volume vs. mouse weight for the Capomulin regimen
- Calculate the correlation coefficient and linear regression model for mouse weight and average tumor volume for the Capomulin regimen
##CODING_PROCESS
Overall Approach
-
I reviewed my notes, PowerPoint slides, and the past PyCitySchools challenge to help guide me through the assignment.
-
I looked up how to identify duplicate mice by Mouse ID and Timepoint, then reviewed the code to understand the methods used: duplicate_mice = complete_data_df[complete_data_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)]["Mouse ID"].unique()
Summary Statistics
- I refreshed my understanding of variance and standard deviation to better interpret these statistics. Standard deviation measures how far apart values are in a dataset and variance quantifies how much the values in the dataset vary from the mean. When the data points are close to the mean it suggests a high level of consistnacy within the data set.
Bar and Pie Charts Section
- I asked the Xpert Learning Assistant and referenced activity materials heavily on this section to understand the differences between Pandas and Pyplot, as well as the methods for creating charts from a Series vs. a DataFrame. To illustrate my understanding, I practiced creating bar and pie charts from both a Series and a DataFrame.