H2O2_incubation_experiment.Rmd

---
title: "H2O2 incubation experiment analysis"
output: html_notebook
---
This notebook is for analysis of the data from H2O2 production and decay bottle experiments performed on whole water, 100 um filtered water, and 0.22 um filtered water collected from western Lake Erie.

Analysis was run on the smitdere laptop unless otherwise indicated.

Regression analyses with environmental parameters and H2O2 production
----
This first section of the document is for dealing solely with regression plots of H2O2 production and decay with respiration, biomass, light vs. dark, and 100 um filtered water vs whole water. The 16S rRNA data will be covered in a following section.

Load the needed libraries:
```{r}
library(ggplot2)
library(patchwork)
library(dplyr)
library(tidyr)
library(vegan)
library(lubridate)
library(plotly)
library(reticulate)
library(modelr)
library(purrr)
library(broom)
library(reshape2)
library(sjPlot)
setwd("E:/Research/2019_Erie_Bloom/Prod_Decay_Experiments")
```

First, load the H2O2 and field data into an R dataframe:  
```{r}
#Load the text-delimited data tables into R:
Prod_Decay_df <- read.table("Prod_Decay_Data.txt", header=TRUE, sep="\t")
Environ_df <- read.table("Prod_Decay_Environ_Data.txt", header=TRUE, sep="\t")
```
In the Prod_Decay_df is the modeled gross H2O2 production, the absolute Kloss_H2O2, net production rate of H2O2 in control bottles, and net decay rate of H2O2 in spiked bottles. It also has the summed error of squares between observed vs. modeled H2O2 concentrations. The replicate bottles are listed individually.  

Samples from 2017 only had whole water and 0.22 um filtered treatments exposed to light.
Samples from 2018 and 2019 had the same as above, but also included dark bottles and 105 um filtered bottles (on certain dates, usually alternating).  

The Environ_df has the other associated data that is paired with each set of H2O2 measurements. This includes Chlorophyll a, pH, CDOM (a305), DIC, DOC, respiration, primary production, and nutrient concentrations.  

DIC, DOC, respiration, primary production, and UV data only exist for 2018 and 2019.  

The goal is to determine how much of the total H2O2 production in water column is attributed to biological sources and how the biological production changes with growth rates, algal density, nutrient availability, and bacterial community composition.  

Gross H2O2 production was estimated using a model described in Vermilyea et al. 2010 and Marsico et al. 2015. Environ. Sci. Technol. and Mar Sci. This model relies on paired 2L incubations of unamended water and water spiked with an H2O2 standard. The decay in the spiked incubation is used to correct the H2O2 production rates measured in the control bottles (the net observed production is lower than the gross production because of simultaneous decay, which changes as a function of H2O2 concentration as described in the above references).

The models above assume that gross H2O2 production remains constant, which is likely invalid because light-dependent processes change with solar zenith angle. This is particularly evident on a few dates, where the model cannot accurately fit the H2O2 dynamics observed over the 9 hour experiments. Dixon et al. 2013 attempted to allow gross H2O2 production to change nonlinearly using curve-fitting parameters, but their experimental setup is not applicable to our data (they don't model the decreasing portion of the diel peak). Their model parameters also hold no real meaning, as the terms behind any changing biological production over time are unknown. To avoid arbitrary curve fitting, I am going to calculate gross H2O2 production assuming that H2O2 production rates and Kloss do not change, using only those experiments where the model fits the data. Observed net production and decay will also be investigated so that data from all the experiments are included.  
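
To make the relationship between gross and net rates concrete, here is a minimal sketch of a bottle model with constant gross production and first-order loss. The first-order form is an assumption based on the hr-1 units of Kloss, and the exact formulation in the cited papers may differ. The helper `h2o2_conc_demo()` and all parameter values are hypothetical and are not the fitting code or data used for this notebook.

```{r}
# Illustrative sketch only (not the fitting code used for this notebook).
# Assumes constant gross production P (nM/hr) and first-order loss k (1/hr):
#   dC/dt = P - k * C
# h2o2_conc_demo() is a hypothetical helper returning the analytical solution.
h2o2_conc_demo <- function(t_hr, C0, P, k) {
  P / k + (C0 - P / k) * exp(-k * t_hr)
}

# Made-up parameter values for a control and a spiked bottle:
P_demo <- 100            # gross production, nM/hr
k_demo <- 0.4            # loss rate constant, 1/hr
C0_control_demo <- 50    # starting H2O2 in the unamended bottle, nM
C0_spiked_demo  <- 1000  # starting H2O2 in the spiked bottle, nM

t_demo <- seq(0, 9, by = 1.5)  # sampling times over a 9 hour incubation
control_demo <- h2o2_conc_demo(t_demo, C0_control_demo, P_demo, k_demo)
spiked_demo  <- h2o2_conc_demo(t_demo, C0_spiked_demo,  P_demo, k_demo)

# The initial net rate in the control bottle is gross production minus loss,
# so it is lower than P_demo whenever k * C0 > 0:
net_rate_control_demo <- P_demo - k_demo * C0_control_demo
net_rate_control_demo

# Both bottles head toward the same steady-state concentration, P / k:
P_demo / k_demo
```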

First, some formatting: calculate average H2O2 production and decay rates (both gross and net) along with 95% confidence intervals for each date, then combine the averages and ranges with the other environmental data into one dataframe:  
```{r}
Avg_Prod_Decay_df <- Prod_Decay_df %>%
  group_by(Experiment_Date, Site, Condition, Model_Fit) %>%
  #calculates number of observations, averages, and standard deviation of each column along the grouping specified above
  summarise(n=n(), Sum_Error_Squares_avg=mean(Sum_Error_Squares, na.rm = TRUE), 
            Sum_Error_Squares_sd=sd(Sum_Error_Squares, na.rm = TRUE),
            PH2O2_avg=mean(PH2O2, na.rm = TRUE), PH2O2_sd=sd(PH2O2, na.rm = TRUE),
            Kloss_avg=mean(Kloss, na.rm = TRUE), Kloss_sd=sd(Kloss, na.rm = TRUE),
            Net_production_avg=mean(Net_production, na.rm = TRUE),
            Net_production_sd=sd(Net_production, na.rm = TRUE),
            Net_decay_avg=mean(Net_decay, na.rm = TRUE), Net_decay_sd=sd(Net_decay, na.rm = TRUE),
            Max_H2O2_avg=mean(Max_H2O2, na.rm = TRUE), Max_H2O2_sd=sd(Max_H2O2, na.rm = TRUE),
            FC_Net_production_avg=mean(FC_Net_production, na.rm = TRUE),
            FC_Net_production_sd=sd(FC_Net_production, na.rm = TRUE),
            FC_Net_decay_avg=mean(FC_Net_decay, na.rm = TRUE), FC_Net_decay_sd=sd(FC_Net_decay, na.rm = TRUE),
            FC_Max_H2O2_avg=mean(FC_Max_H2O2, na.rm = TRUE),
            FC_Max_H2O2_sd=sd(FC_Max_H2O2, na.rm = TRUE)) %>%
  #This part calculates 95% confidence intervals from the standard deviation
  mutate(Sum_Error_Squares_CI=Sum_Error_Squares_sd/sqrt(n)*1.96) %>%
  mutate(PH2O2_CI=PH2O2_sd/sqrt(n)*1.96) %>%
  mutate(Kloss_CI=Kloss_sd/sqrt(n)*1.96) %>%
  mutate(Net_production_CI=Net_production_sd/sqrt(n)*1.96) %>%
  mutate(Net_decay_CI=Net_decay_sd/sqrt(n)*1.96) %>%
  mutate(Max_H2O2_CI=Max_H2O2_sd/sqrt(n)*1.96) %>%
  mutate(FC_Net_production_CI=FC_Net_production_sd/sqrt(n)*1.96) %>%
  mutate(FC_Net_decay_CI=FC_Net_decay_sd/sqrt(n)*1.96)
```
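
The 1.96 multiplier used above is the large-sample (normal) value. With only two or three replicate bottles per date, a t-based multiplier gives a noticeably wider interval. A minimal sketch comparing the two, using a hypothetical helper `ci95_demo()` and made-up replicate values (not data from these experiments):

```{r}
# Hypothetical helper: half-width of a 95% CI of the mean.
# use_t = FALSE reproduces the sd/sqrt(n)*1.96 calculation used above;
# use_t = TRUE swaps in the t-distribution quantile for n - 1 degrees of freedom.
ci95_demo <- function(x, use_t = FALSE) {
  x <- x[!is.na(x)]
  n <- length(x)
  multiplier <- if (use_t) qt(0.975, df = n - 1) else 1.96
  sd(x) / sqrt(n) * multiplier
}

# Made-up triplicate production rates (nM/hr), for illustration only:
replicates_demo <- c(95, 100, 105)
ci95_demo(replicates_demo)                # normal approximation
ci95_demo(replicates_demo, use_t = TRUE)  # t-based, wider for small n
```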

How much of the total H2O2 produced was attributed to biology, using the model published in Vermilyea et al. and Marsico et al.?

I want to calculate the biotic production as the gross PH2O2 in unfiltered water - the net production in 0.22 um filtered water. This assumes that the net production observed in 0.22 um filtered water is gross abiotic production. While some H2O2 could be lost to abiotic processes, decay in spiked 0.22 um filtered bottles was only significantly nonzero on one date -- so it should be negligible here.

Calculate the biotic production in each experiment:  
```{r}
#Calculate gross biotic H2O2 production rates:
Avg_Prod_Decay_df$Biotic_PH2O2 <- Avg_Prod_Decay_df$PH2O2_avg - Avg_Prod_Decay_df$FC_Net_production_avg
#Calculate the 95% CI of gross biotic H2O2 production rate using the propagation of error formula:  
#I need to first set NAs in the confidence interval for filtered net production to zero, so that the error from PH2O2 is applied to biotic H2O2 when possible:
Avg_Prod_Decay_df$FC_Net_production_CI[is.na(Avg_Prod_Decay_df$FC_Net_production_CI)] <- 0
#Now calculate the 95% CI
Avg_Prod_Decay_df$Biotic_PH2O2_CI <- sqrt((Avg_Prod_Decay_df$PH2O2_CI^2) + (Avg_Prod_Decay_df$FC_Net_production_CI^2))

#What percentage of gross H2O2 production was attributed to biotic production?  
Avg_Prod_Decay_df$Perc_Biotic_PH2O2 <- Avg_Prod_Decay_df$Biotic_PH2O2 / Avg_Prod_Decay_df$PH2O2_avg * 100

#Merge the H2O2 dataframe with the environmental dataframe so we can relate this information to the other measured parameters later:
Merged_Prod_Decay_df <- merge(Avg_Prod_Decay_df, Environ_df, by=c("Experiment_Date", "Site", "Condition"), all=TRUE)

#What is the mean and range of biotic PH2O2 for all experiments in which it could be calculated?
#We only want to consider whole water production in the light for now, so create a data frame of just those samples:
Merged_Prod_Decay_WL_only <- Merged_Prod_Decay_df[Merged_Prod_Decay_df$Condition == "WL", ]
#Calculate the number of complete observations
num_obs <- length(Merged_Prod_Decay_WL_only$Biotic_PH2O2[!(is.na(Merged_Prod_Decay_WL_only$Biotic_PH2O2))])
#Calculate the stats:
mean(Merged_Prod_Decay_WL_only$Biotic_PH2O2, na.rm=TRUE)
(sd(Merged_Prod_Decay_WL_only$Biotic_PH2O2, na.rm=TRUE)/sqrt(num_obs))*1.96
min(Merged_Prod_Decay_WL_only$Biotic_PH2O2, na.rm=TRUE)
max(Merged_Prod_Decay_WL_only$Biotic_PH2O2, na.rm=TRUE)
```
Biotic production ranged from 9 - 244 nM/hr, with a mean of 73 +/- 24 nM/hr.  
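
The error propagation in the chunk above treats the two confidence intervals as independent and adds them in quadrature. A small worked example with made-up numbers (none of these values come from the experiments):

```{r}
# Made-up values for illustration (nM/hr):
gross_avg_demo <- 120; gross_CI_demo <- 12  # gross production in whole water
filt_avg_demo  <- 40;  filt_CI_demo  <- 5   # net production in 0.22 um filtered water

# Biotic production is the difference of the two rates:
biotic_demo <- gross_avg_demo - filt_avg_demo

# For a difference of independent quantities, the uncertainties add in quadrature:
biotic_CI_demo <- sqrt(gross_CI_demo^2 + filt_CI_demo^2)

# Percent of gross production attributed to biology:
perc_biotic_demo <- biotic_demo / gross_avg_demo * 100

biotic_demo
biotic_CI_demo
perc_biotic_demo
```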

What does total gross H2O2 look like in comparison?  
```{r}
mean(Merged_Prod_Decay_WL_only$PH2O2_avg, na.rm=TRUE)
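#95% CI, using the number of non-missing gross production values (num_obs counts biotic PH2O2 values):
(sd(Merged_Prod_Decay_WL_only$PH2O2_avg, na.rm=TRUE)/sqrt(sum(!is.na(Merged_Prod_Decay_WL_only$PH2O2_avg))))*1.96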
min(Merged_Prod_Decay_WL_only$PH2O2_avg, na.rm=TRUE)
max(Merged_Prod_Decay_WL_only$PH2O2_avg, na.rm=TRUE)
```

What percentage of the total gross production can be attributed to biotic production?  
```{r}
#Get the mean, 95% CI, and range for the percent of gross H2O2 production attributed to biotic production in the experiments:  
mean(Merged_Prod_Decay_WL_only$Perc_Biotic_PH2O2, na.rm = TRUE)
(sd(Merged_Prod_Decay_WL_only$Perc_Biotic_PH2O2, na.rm=TRUE)/sqrt(num_obs))*1.96
min(Merged_Prod_Decay_WL_only$Perc_Biotic_PH2O2, na.rm = TRUE)
max(Merged_Prod_Decay_WL_only$Perc_Biotic_PH2O2, na.rm = TRUE)
```
The percentage of total production attributed to biotic sources was on average 66 +/- 5%. Biotic production ranged from 44 - 94 %.  

What do the numbers for Kloss look like?  
```{r}
mean(Merged_Prod_Decay_WL_only$Kloss_avg, na.rm = TRUE)
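#95% CI, using the number of non-missing Kloss values (num_obs counts biotic PH2O2 values):
(sd(Merged_Prod_Decay_WL_only$Kloss_avg, na.rm=TRUE)/sqrt(sum(!is.na(Merged_Prod_Decay_WL_only$Kloss_avg))))*1.96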
min(Merged_Prod_Decay_WL_only$Kloss_avg, na.rm = TRUE)
max(Merged_Prod_Decay_WL_only$Kloss_avg, na.rm = TRUE)
```

Let's compare net production in whole water and 0.22 um filtered water to check that this is working.  
```{r}
#To make a grouped barplot, we need to rearrange the dataframe a little bit.
#Let's subset the dataframe to only include Experiment date and net production rates:
Net_prod_df <- subset(Merged_Prod_Decay_WL_only, select=c(Experiment_Date, Net_production_avg,
                                                          FC_Net_production_avg))
Net_prod_CI <- subset(Merged_Prod_Decay_WL_only, select=c(Experiment_Date, Net_production_CI,
                                                          FC_Net_production_CI))

#Need to convert to long format with a key for filtered and whole water, which will make plotting easier:
Net_prod_long <- gather(Net_prod_df, Prod_Type, Net_rate, c(Net_production_avg,FC_Net_production_avg))

#Change the names in the prod type so that they merge better.
Net_prod_long$Prod_Type <- gsub("FC_Net_production_avg", "Filt", Net_prod_long$Prod_Type)
Net_prod_long$Prod_Type <- gsub("Net_production_avg", "WW", Net_prod_long$Prod_Type)

#Make sure that the CI prod_type columns match those in the net_prod_long dataframe so that the two dataframes merge by this column
Net_prod_CI_long <- gather(Net_prod_CI, Prod_Type, CI, c(Net_production_CI,FC_Net_production_CI))
Net_prod_CI_long$Prod_Type <- gsub("FC_Net_production_CI", "Filt", Net_prod_CI_long$Prod_Type)
Net_prod_CI_long$Prod_Type <- gsub("Net_production_CI", "WW", Net_prod_CI_long$Prod_Type)
Net_prod_df <- merge(Net_prod_long, Net_prod_CI_long, by=c("Experiment_Date", "Prod_Type"), all = TRUE)
rm(Net_prod_long)
rm(Net_prod_CI_long)

#Plot:
Net_prod_barplot <- ggplot(Net_prod_df, aes(fill=Prod_Type, y=Net_rate, x=Experiment_Date)) +
  geom_bar(position=position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin=Net_rate-CI, ymax=Net_rate+CI), width=0.2,
                position=position_dodge(0.9), size = 0.1) +
  scale_x_discrete(limits=c("31-May-17", "13-Jun-17", "27-Jun-17", "6-Jul-17", "12-Jul-17", "18-Jul-17", "25-Jul-17", "1-Aug-17", "15-Aug-17", "22-Aug-17", "30-Aug-17", "31-Aug-17", "6-Sep-17", "12-Sep-17", "19-Sep-17", "26-Sep-17", "4-Oct-17", "5-Oct-17", "10-Jul-18", "24-Jul-18", "31-Jul-18", "3-Aug-18", "7-Aug-18", "10-Aug-18", "14-Aug-18", "21-Aug-18", "14-Sep-18", "18-Sep-18", "23-Jul-19", "2-Aug-19", "6-Aug-19", "24-Aug-19", "17-Sep-19", "20-Sep-19")) +
  scale_fill_manual(values=c("red", "lightblue"), labels=c("0.22 um filtered", "Whole water")) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                     l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                      l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 14, color = "black", angle = 45, hjust = 1,
                                   margin = margin(t = 5, r = 5, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top",
        legend.title = element_blank()) +
  scale_y_continuous(breaks=seq(-50,300, by=50)) +
  coord_cartesian(ylim=c(-50,300)) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

Net_prod_barplot
```
In many cases, net production in whole water is comparable to or greater than net production in 0.22 um filtered water. This is despite simultaneous decomposition of H2O2 in whole water bottles that is absent from 0.22 um filtered water, suggesting that total production in whole water is higher than that in the filtered water and there is some particle dependent production.  
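
A note on the renaming step in the chunk above: it depends on the order of the two `gsub()` calls, because `Net_production_avg` is a substring of `FC_Net_production_avg`. The sketch below shows the pitfall and one order-independent alternative on a made-up vector (the `_demo` names are placeholders, not objects from this notebook):

```{r}
# Made-up key column, mimicking Prod_Type after gather():
prod_type_demo <- c("Net_production_avg", "FC_Net_production_avg",
                    "Net_production_avg", "FC_Net_production_avg")

# Unanchored patterns applied in the other order mangle the filtered label:
wrong_order_demo <- gsub("Net_production_avg", "WW", prod_type_demo)
wrong_order_demo   # the filtered entries become "FC_WW"

# Testing for the "FC_" prefix does not depend on the order of operations:
recoded_demo <- ifelse(startsWith(prod_type_demo, "FC_"), "Filt", "WW")
recoded_demo
```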

Plot of decay in whole and 0.22 um filtered water:  
```{r}
#To make a grouped barplot, we need to rearrange the dataframe a little bit.
#Let's subset the dataframe to only include Experiment date and net decay rates:
Net_decay_df <- subset(Merged_Prod_Decay_WL_only, select=c(Experiment_Date, Net_decay_avg,
                                                          FC_Net_decay_avg))
Net_decay_CI <- subset(Merged_Prod_Decay_WL_only, select=c(Experiment_Date, Net_decay_CI,
                                                          FC_Net_decay_CI))

#Need to convert to long format with a key for filtered and whole water, which will make plotting easier:
Net_decay_long <- gather(Net_decay_df, Prod_Type, Net_decay, c(Net_decay_avg,FC_Net_decay_avg))

#Change the names in the prod type so that they merge better.
Net_decay_long$Prod_Type <- gsub("FC_Net_decay_avg", "Filt", Net_decay_long$Prod_Type)
Net_decay_long$Prod_Type <- gsub("Net_decay_avg", "WW", Net_decay_long$Prod_Type)

#Make sure that the CI prod_type columns match those in the net_decay_long dataframe so that the two dataframes merge by this column
Net_decay_CI_long <- gather(Net_decay_CI, Prod_Type, CI, c(Net_decay_CI,FC_Net_decay_CI))
Net_decay_CI_long$Prod_Type <- gsub("FC_Net_decay_CI", "Filt", Net_decay_CI_long$Prod_Type)
Net_decay_CI_long$Prod_Type <- gsub("Net_decay_CI", "WW", Net_decay_CI_long$Prod_Type)
Net_decay_df <- merge(Net_decay_long, Net_decay_CI_long, by=c("Experiment_Date", "Prod_Type"), all = TRUE)
rm(Net_decay_long)
rm(Net_decay_CI_long)

#Plot:
#Only plot 2017 for which there is Decay data for both whole water and 0.22 um filtered water:  
Net_decay_barplot <- ggplot(Net_decay_df, aes(fill=Prod_Type, y=Net_decay, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Net_decay-CI, ymax=Net_decay+CI), width=0.2, position=position_dodge(0.9),
                  size = 0.1) +
  scale_x_discrete(limits=c("31-May-17", "13-Jun-17", "27-Jun-17", "6-Jul-17", "12-Jul-17", "18-Jul-17", "25-Jul-17", "1-Aug-17", "15-Aug-17", "22-Aug-17", "30-Aug-17", "31-Aug-17", "6-Sep-17", "12-Sep-17", "19-Sep-17", "26-Sep-17", "4-Oct-17", "5-Oct-17", "10-Jul-18", "24-Jul-18", "31-Jul-18", "3-Aug-18", "7-Aug-18", "10-Aug-18", "14-Aug-18", "21-Aug-18", "14-Sep-18", "18-Sep-18", "23-Jul-19", "2-Aug-19", "6-Aug-19", "24-Aug-19", "17-Sep-19", "20-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("0.22 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, hjust = 1,
                                     margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    scale_y_continuous(breaks=seq(-100,500, by=100)) +
    coord_cartesian(ylim=c(-100,500)) +
    ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))

Combo_net_plot <- Net_prod_barplot + Net_decay_barplot + plot_layout(ncol = 1)
Combo_net_plot
ggsave("Combo_net_plot.pdf", Combo_net_plot, width = 12, height = 8, units = "in", dpi=300)
ggsave("Net_prod_barplot.pdf",  Net_prod_barplot, width = 3.5, height = 3.5, units = "in", dpi=300)
ggsave("Net_decay_barplot.pdf",  Net_decay_barplot, width = 3.5, height = 3.5, units = "in", dpi=300)
```
I calculated photochemical production from CDOM absorbance, whole water absorbance, average light intensity over the time window used to calculate net H2O2 production, and pathlength = bottle width. See the Excel file named "Photo_H2O2_RATE_EXPERIMENT" for the calculations and input data. The goal was to compare these values to the net production measured in whole water to see if photochemistry could account for the net production (ignoring the decay in spiked bottles).

I copied the numbers in the excel file into a tab-delimited text file to import into R to make plots for the manuscript. The below chunk imports the data and makes a plot:
```{r}
#Import the text file into an R object:
Photo_vs_measured <- read.table("Photo_vs_Net.txt", header=TRUE, sep="\t")

#Make a barplot:
Photo_vs_measured_barplot <- ggplot(Photo_vs_measured, aes(fill=Type, y=H2O2_production_rate, x=Date)) +
  geom_bar(position=position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin=H2O2_production_rate-CI, ymax=H2O2_production_rate+CI), width=0.2,
                position=position_dodge(0.9), size = 0.1) +
  scale_fill_manual(values=c("red", "lightblue")) +
  scale_x_discrete(limits=c("23-Jul-19", "2-Aug-19", "6-Aug-19", "24-Aug-19", "17-Sep-19", "20-Sep-19")) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                     l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                      l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 14, color = "black", angle = 45, hjust = 1,
                                   margin = margin(t = 5, r = 5, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top",
        legend.title = element_blank()) +
  coord_cartesian(ylim=c(0,250)) +
  ylab(expression("H"[2]*"O"[2]*" production rate (nM/hr)"))

Photo_vs_measured_barplot
ggsave("Photo_vs_measured_barplot.pdf", Photo_vs_measured_barplot, width = 12, height = 8, units = "in", dpi=300)
```
Only three dates had significantly different net measured and calculated photochemical H2O2 production rates. There is some uncertainty in the light pathlength, and because longer pathlengths increase photochemical reaction rates, the true photochemical production rates could be higher if the light path was underestimated due to upwelling light and changes in solar zenith angle over the course of the day.

I calculated new photochemical production rates after multiplying the pathlength by a factor of 1.1 - 2 (see the "Photo_H2O2_RATE_EXPERIMENT" Excel file). A summary is in the tab labeled "Path_multiplier_comparison", which I transferred to this markdown file to make it nicer looking for the manuscript.
```{r}
#Import the table from the excel sheet:
Path_Compare_Table <- read.table("Path_Compare_Table.txt", header=TRUE, sep="\t")

tab_df(Path_Compare_Table, alternate.rows = T, file="TableS1.doc") #print to a file.
```

Summarize net production and decay:  
```{r}
#Net production in Whole water:
print("Net production in whole water")
mean(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)
(sd(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)/sqrt(sum(!is.na(Merged_Prod_Decay_WL_only$Net_production_avg))))*1.96
min(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)
max(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)

#Net production in 0.22 um filtered water:
print("Net production in 0.22 um filtered water")
mean(Merged_Prod_Decay_WL_only$FC_Net_production_avg, na.rm=TRUE)
(sd(Merged_Prod_Decay_WL_only$FC_Net_production_avg, na.rm=TRUE)/sqrt(sum(!is.na(Merged_Prod_Decay_WL_only$FC_Net_production_avg))))*1.96
min(Merged_Prod_Decay_WL_only$FC_Net_production_avg, na.rm=TRUE)
max(Merged_Prod_Decay_WL_only$FC_Net_production_avg, na.rm=TRUE)

#Net decay in whole water:
print("Net decay in whole water")
mean(Merged_Prod_Decay_WL_only$Net_decay_avg, na.rm=TRUE)
(sd(Merged_Prod_Decay_WL_only$Net_decay_avg, na.rm=TRUE)/sqrt(sum(!is.na(Merged_Prod_Decay_WL_only$Net_decay_avg))))*1.96
min(Merged_Prod_Decay_WL_only$Net_decay_avg, na.rm=TRUE)
max(Merged_Prod_Decay_WL_only$Net_decay_avg, na.rm=TRUE)

#Net decay in 0.22 um filtered water:
print("Net decay in 0.22 um filtered water")
mean(Merged_Prod_Decay_WL_only$FC_Net_decay_avg, na.rm=TRUE)
(sd(Merged_Prod_Decay_WL_only$FC_Net_decay_avg, na.rm=TRUE)/sqrt(sum(!is.na(Merged_Prod_Decay_WL_only$FC_Net_decay_avg))))*1.96
min(Merged_Prod_Decay_WL_only$FC_Net_decay_avg, na.rm=TRUE)
max(Merged_Prod_Decay_WL_only$FC_Net_decay_avg, na.rm=TRUE)
```
How much higher are gross H2O2 production rates than net H2O2 production rates on average?  
```{r}
#Divide gross H2O2 production by net production:
Gross_vs_net <- Merged_Prod_Decay_WL_only$PH2O2_avg / Merged_Prod_Decay_WL_only$Net_production_avg
#Remove dates where net production was zero (infinite ratio) or negative (negative ratio):
Gross_vs_net <- Gross_vs_net[Gross_vs_net != Inf]
Gross_vs_net <- Gross_vs_net[Gross_vs_net > 0]
min(Gross_vs_net, na.rm = TRUE)
max(Gross_vs_net, na.rm = TRUE)
mean(Gross_vs_net, na.rm = TRUE)
sd(Gross_vs_net, na.rm = TRUE)

#Create a summary dataframe for the paper:
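#Size the array from the data rather than hard-coding the number of experiments:
Gross_vs_net_df <- array(numeric(),c(nrow(Merged_Prod_Decay_WL_only),6))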
Gross_vs_net_df[,1] <- Merged_Prod_Decay_WL_only$Experiment_Date
Gross_vs_net_df[,2] <- Merged_Prod_Decay_WL_only$PH2O2_avg
Gross_vs_net_df[,3] <- Merged_Prod_Decay_WL_only$PH2O2_CI
Gross_vs_net_df[,4] <- Merged_Prod_Decay_WL_only$Net_production_avg
Gross_vs_net_df[,5] <- Merged_Prod_Decay_WL_only$Net_production_CI
Gross_vs_net_df[,6] <- Merged_Prod_Decay_WL_only$PH2O2_avg / Merged_Prod_Decay_WL_only$Net_production_avg

Gross_vs_net_df <- as.data.frame(Gross_vs_net_df)
colnames(Gross_vs_net_df) <- c("Experiment Date", "Total gross H2O2 production", "95 % CI", "Net H2O2 production", "95 % CI", "Fold difference")
Gross_vs_net_df$`Experiment Date` <- dmy(Gross_vs_net_df$`Experiment Date`)
Gross_vs_net_df <- Gross_vs_net_df[ order(Gross_vs_net_df$`Experiment Date`), ] #sort the dataframe by experiment date
#Remove entries where gross H2O2 production could not be calculated:
Gross_vs_net_df <- Gross_vs_net_df[ Gross_vs_net_df$`Total gross H2O2 production` != "NaN", ]
tab_df(Gross_vs_net_df, alternate.rows = T, file="TableS2.doc") #print to a file.
```
The absolute H2O2 production CI for the 8-30-2017 experiment is "NA" because this date has only n=1: data from one replicate did not fit the model.

The absolute H2O2 production CI for the 9-14-2018 experiment is "NA" because this date has only n=1: one spike replicate did not get the H2O2 addition (by mistake). This precluded calculation of absolute H2O2 production and Kloss.

Note that for the publication, I corrected the reported values for significant digits. I then fixed the fold differences calculated above to match what would be calculated by hand from the reported values in the table.

Re-calculate the mean and range for the fold difference using the values calculated by hand:
```{r}
#Store the values in one vector first: mean(2.2, 2.9, ...) would treat only the first value as data.
Fold_diff_by_hand <- c(2.2, 2.9, 2.2, 2.4, 3.9, 18.4, 2.8, 8.3, 9.4, 23.3, 13.3, 5.6, 2.7, 3.0,
                       3.5, 2.1, 7.6, 3.6, 6.0, 2.7, 2.1, 3.5, 5.8, 1.9, 2.4)
mean(Fold_diff_by_hand)
min(Fold_diff_by_hand)
max(Fold_diff_by_hand)
```

Are net whole water and 0.22 um filtered production rates significantly different from each other on average?
```{r}
t.test(Merged_Prod_Decay_WL_only$Net_production_avg, Merged_Prod_Decay_WL_only$FC_Net_production_avg, paired = FALSE, alternative = "two.sided")
```
On which dates (if any), was net H2O2 production significantly different from 0.22 um filtered production?
```{r}
#Get a vector of dates to loop through:  
dates <- unique(Prod_Decay_df$Experiment_Date)
#Remove the 1-Aug-17 and 24-Aug-19 samples, which only have n=1 for 0.22 um filtered production:
dates <- dates[ dates != "1-Aug-17"]
dates <- dates[ dates != "24-Aug-19"]

#For each date, do a t-test of net H2O2 production in whole water vs 0.22 um filtered water:
for (i in dates){
  #Create a dataframe of samples from only that date
  t_test_df <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  #Print the date
  print(i)
  #run the t-test and print the result:
  print(t.test(t_test_df$Net_production, t_test_df$FC_Net_production, paired = FALSE, alternative = "two.sided"))
}

```
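Because the loop above runs one t-test per date, some p-values below 0.05 are expected by chance alone. One option is to collect the p-values and adjust them. The sketch below uses `p.adjust()` from base R with the Benjamini-Hochberg method on made-up replicate values; the date labels and rates are placeholders, and whether to adjust (and with which method) is a judgment call for the analysis.

```{r}
# Made-up replicate net production rates (nM/hr) for three placeholder dates:
ww_demo   <- list(date_A = c(100, 110, 105), date_B = c(60, 62, 58), date_C = c(80, 95, 70))
filt_demo <- list(date_A = c(40, 45, 42),    date_B = c(59, 61, 60), date_C = c(75, 90, 85))

# Run the same Welch t-test as in the loop above, but keep only the p-value:
p_raw_demo <- sapply(names(ww_demo), function(d) {
  t.test(ww_demo[[d]], filt_demo[[d]], paired = FALSE, alternative = "two.sided")$p.value
})

# Adjust for multiple comparisons (Benjamini-Hochberg false discovery rate):
p_adj_demo <- p.adjust(p_raw_demo, method = "BH")

data.frame(p_raw = p_raw_demo, p_adjusted = p_adj_demo)
```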
Are net decay rates in whole water significantly different from those in 0.22 um filtered water?
```{r}
#t.test() drops missing values on its own and has no na.rm argument:
t.test(Merged_Prod_Decay_WL_only$Net_decay_avg, Merged_Prod_Decay_WL_only$FC_Net_decay_avg, paired = FALSE, alternative = "two.sided")
```
On which dates (if any) was net decay in 0.22 um filtered water significantly different from zero?
```{r}
#Get a vector of dates to loop through:  
dates <- unique(Prod_Decay_df$Experiment_Date)
#Remove the dates from 2018 and 2019, which did not have filtered spike bottles:
drop <- c("10-Jul-18", "24-Jul-18", "31-Jul-18", "3-Aug-18", "7-Aug-18", "10-Aug-18",
          "14-Aug-18", "21-Aug-18", "14-Sep-18", "18-Sep-18", "23-Jul-19", "2-Aug-19" , "6-Aug-19",
          "24-Aug-19", "17-Sep-19", "20-Sep-19")
dates <- dates[!(dates %in% drop)]

#Create an empty list to store the t-test results:
FC_decay_t_tests <- list()

#For each date, do a one-sample t-test of whether net H2O2 decay in 0.22 um filtered water differs from zero:
for (i in dates){
  #Create a dataframe of samples from only that date
  t_test_df <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  #run the t-test and print the result:
  FC_decay_t_tests[[i]] <- t.test(t_test_df$FC_Net_decay, mu = 0, paired = FALSE, alternative = "two.sided")
  print(FC_decay_t_tests[[i]])
}

```
Are net production and decay rates in whole water correlated?
```{r}
cor(Merged_Prod_Decay_WL_only$Net_production_avg, Merged_Prod_Decay_WL_only$Net_decay_avg, method="pearson", use="complete.obs")
WW_net_prod_vs_net_decay <- lm(Merged_Prod_Decay_WL_only$Net_decay_avg ~ Merged_Prod_Decay_WL_only$Net_production_avg,
                               na.action = na.omit)
summary(WW_net_prod_vs_net_decay)
```
Net decay is significantly correlated with net production rate:  
```{r}
#Plot the relationship:
NetDecay_vs_Net_Prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Net_production_avg, y=Net_decay_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE, size=0.25) +
  geom_errorbar(aes(ymin=Net_decay_avg-Net_decay_CI, ymax=Net_decay_avg+Net_decay_CI), width=2,
                size = 0.1) +
  geom_errorbarh(aes(xmin=Net_production_avg-Net_production_CI, xmax=Net_production_avg+Net_production_CI), height=9, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  #scale_y_continuous(breaks=c(0,50,100,150,200,250,300,350)) +
  coord_cartesian(ylim=c(-50,500), xlim=c(-50,300)) +
  xlab(expression("Net H"[2]*"O"[2]*" production (nM/hr)")) +
  ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))

NetDecay_vs_Net_Prod
```
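
`cor()` returns only the coefficient, so the significance above comes from the `lm()` summary. As a cross-check, `cor.test()` reports the Pearson coefficient, its confidence interval, and a p-value in one call. A minimal sketch on made-up values (not the experimental data):

```{r}
# Made-up net production and net decay rates (nM/hr), with one missing value:
net_prod_demo  <- c(20, 50, 80, 120, 150, NA, 200)
net_decay_demo <- c(45, 90, 160, 230, 310, 150, 390)

# cor.test() drops incomplete pairs itself:
cor_result_demo <- cor.test(net_prod_demo, net_decay_demo, method = "pearson")
cor_result_demo$estimate   # Pearson r
cor_result_demo$p.value    # p-value for the null hypothesis r = 0

# For a simple linear regression, the slope p-value from lm() matches cor.test():
lm_demo <- lm(net_decay_demo ~ net_prod_demo)
slope_p_demo <- summary(lm_demo)$coefficients["net_prod_demo", "Pr(>|t|)"]
slope_p_demo
```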

Is absolute Kloss correlated with total gross H2O2 production?
```{r}
cor(Merged_Prod_Decay_WL_only$PH2O2_avg, Merged_Prod_Decay_WL_only$Kloss_avg, method="pearson", use="complete.obs")
WW_gross_prod_vs_kloss <- lm(Merged_Prod_Decay_WL_only$Kloss_avg ~ Merged_Prod_Decay_WL_only$PH2O2_avg,
                               na.action = na.omit)
summary(WW_gross_prod_vs_kloss)
```
There is a significant correlation between decay constants and total gross H2O2 production:
```{r}
#Plot the relationship:
AbsDecay_vs_GrossProd <- filter(Merged_Prod_Decay_WL_only, Model_Fit == "Yes") %>% 
ggplot(aes(x=PH2O2_avg, y=Kloss_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE, size=0.25) +
  geom_errorbar(aes(ymin=Kloss_avg-Kloss_CI, ymax=Kloss_avg+Kloss_CI), width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=PH2O2_avg-PH2O2_CI, xmax=PH2O2_avg+PH2O2_CI), height=0.02, size= 0.1) +
  geom_point(size = 1, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  coord_cartesian(ylim=c(0,1), xlim=c(0,500)) +
  xlab(expression("Total gross H"[2]*"O"[2]*" production (nM/hr)")) +
  ylab(expression("K"["loss,H"[2]*"O"[2]]*" (hr"^{-1}*")"))

Combined_Prod_vs_Decay_plot <- NetDecay_vs_Net_Prod + AbsDecay_vs_GrossProd
Combined_Prod_vs_Decay_plot
ggsave("Combined_Prod_vs_Decay_plot.pdf",  Combined_Prod_vs_Decay_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
When did maximum H2O2 concentrations usually occur?  
```{r}
#Convert exact times into approximate times for binning purposes.
#Patterns are anchored with ^ so that, e.g., '8:40' cannot also match '18:40':
Prod_Decay_df$Time_Max_H2O2 <- gsub('^10:.*', '11:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^8:40', '9:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^8:01', '8:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^13:.*', '14:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^14:.*', '14:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^16:.*', '17:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^17:.*', '17:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^7:4.*', '8:00', Prod_Decay_df$Time_Max_H2O2)
Prod_Decay_df$Time_Max_H2O2 <- gsub('^7:5.*', '8:00', Prod_Decay_df$Time_Max_H2O2)
#make a histogram of Time Max H2O2:
#Remove the dark bottles, where there was always no net change in H2O2
Time_Max_H2O2_histogram <- filter(Prod_Decay_df, Condition != "WD") %>%
  ggplot(aes(x=Time_Max_H2O2)) +
    geom_bar(color="black", fill="white") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
      strip.background = element_blank(),
      panel.spacing = unit(5, "mm"),
      axis.line.x = element_line(size=0.1),
      axis.line.y = element_line(size=0.1),
      axis.text.x = element_text(size = 16, color = "black", angle = 45, hjust = 1,
                                 margin = margin(t = 6, r = 0, b = 0, l = 0)),
      axis.title.x = element_text(size = 14, color = "black",
                                  margin = margin(t = 6, r = 0, b = 0, l = 0)),
      panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
      axis.title.y = element_text(size = 16, color = "black",
                                  margin = margin(t = 0, r = 8, b = 0, l = 0)),
      axis.text.y = element_text(size = 14, color = "black",
                                 margin = margin(t = 0, r = 6, b = 0, l = 0)),
      axis.ticks.length = unit(-0.1, "cm"),
      axis.ticks = element_line(size=0.1),
      legend.title = element_blank()) +
    scale_x_discrete(limits=c("8:00", "9:00", "11:00", "14:00", "17:00", "18:00")) +
    coord_cartesian(ylim=c(0,40)) +
    xlab(expression("Time of Max H"[2]*"O"[2]*" (EDT)")) +
    ylab("Count")

Time_Max_H2O2_histogram
ggsave("Time_Max_H2O2_histogram.pdf",  Time_Max_H2O2_histogram, width = 3.5, height = 3.5, units = "in", dpi=300)
```
Most of the time, max H2O2 occurred between 14:00 and 17:00 EDT. In bottles where max H2O2 occurred at 8-9 am, there was net decay.  

The time of max H2O2 in each replicate was the same on all but 6 dates.  
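
The chained `gsub()` calls above could also be written as one binning step. The sketch below is a minimal, non-evaluated example under stated assumptions: the helper `bin_sampling_time()` is hypothetical, the input times are made up, and the break points (in decimal hours) were chosen only so that they reproduce the replacements in the chunk above.

```r
# Hypothetical helper: bin "H:MM" clock times into nominal sampling times.
# The breaks (decimal hours) are assumptions chosen to mimic the gsub() rules.
bin_sampling_time <- function(times,
                              breaks = c(0, 8.5, 9.5, 12, 15, 18, 24),
                              labels = c("8:00", "9:00", "11:00",
                                         "14:00", "17:00", "18:00")) {
  parts <- strsplit(as.character(times), ":", fixed = TRUE)
  # Convert each "H:MM" string to decimal hours; NA or malformed input gives NA
  decimal_hours <- vapply(parts, function(p) {
    if (length(p) < 2 || anyNA(p)) return(NA_real_)
    as.numeric(p[1]) + as.numeric(p[2]) / 60
  }, numeric(1))
  # right = FALSE makes each bin include its lower bound: [lower, upper)
  as.character(cut(decimal_hours, breaks = breaks, labels = labels, right = FALSE))
}

# Made-up example times; NOT data from this study
example_times <- c("7:45", "8:01", "8:40", "10:15", "13:50",
                   "14:05", "16:30", "17:10", "18:00", NA)
binned <- bin_sampling_time(example_times)
binned
```

Keeping the bin edges in one vector makes the rounding rule explicit and avoids any dependence on the order of the replacements.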

Next, make a similar histogram of time of maximum H2O2 concentration, but for 0.22 um filtered bottles:  
```{r}
#Convert exact times into approximate times for binning purposes:
#Patterns are anchored with ^ so that, e.g., '7:5' cannot also match '17:5':
Prod_Decay_df$FC_Time_Max_H2O2 <- gsub('^17:.*', '17:00', Prod_Decay_df$FC_Time_Max_H2O2)
Prod_Decay_df$FC_Time_Max_H2O2 <- gsub('^16:.*', '17:00', Prod_Decay_df$FC_Time_Max_H2O2)
Prod_Decay_df$FC_Time_Max_H2O2 <- gsub('^7:5.*', '8:00', Prod_Decay_df$FC_Time_Max_H2O2)
Prod_Decay_df$FC_Time_Max_H2O2 <- gsub('^13:5.*', '14:00', Prod_Decay_df$FC_Time_Max_H2O2)
#make a histogram of Time Max H2O2:
#Keep only the WL rows: this removes the dark bottles, where there was always no net change in H2O2,
#and the FL bottles, which would create redundant 0.22 um filtered entries
Time_Max_H2O2_histogram_022um <- filter(Prod_Decay_df, Condition == "WL") %>%
  ggplot(aes(x=FC_Time_Max_H2O2)) +
    geom_bar(color="black", fill="white") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
      strip.background = element_blank(),
      panel.spacing = unit(5, "mm"),
      axis.line.x = element_line(size=0.1),
      axis.line.y = element_line(size=0.1),
      axis.text.x = element_text(size = 16, color = "black", angle = 45, hjust = 1,
                                 margin = margin(t = 6, r = 0, b = 0, l = 0)),
      axis.title.x = element_text(size = 14, color = "black",
                                  margin = margin(t = 6, r = 0, b = 0, l = 0)),
      panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
      axis.title.y = element_text(size = 16, color = "black",
                                  margin = margin(t = 0, r = 8, b = 0, l = 0)),
      axis.text.y = element_text(size = 14, color = "black",
                                 margin = margin(t = 0, r = 6, b = 0, l = 0)),
      axis.ticks.length = unit(-0.1, "cm"),
      axis.ticks = element_line(size=0.1),
      legend.title = element_blank()) +
    scale_x_discrete(limits=c("8:00", "9:00", "11:00", "14:00", "17:00", "18:00")) +
    coord_cartesian(ylim=c(0,60)) +
    xlab(expression("Time of Max H"[2]*"O"[2]*" (EDT)")) +
    ylab("Count")

Time_Max_H2O2_histogram_022um
ggsave("Time_Max_H2O2_histogram_022um.pdf",  Time_Max_H2O2_histogram_022um, width = 3.5, height = 3.5, units = "in", dpi=300)
```
The two rows removed had NAs in their fields. These were the two replicate bottles with problems in the concentration data.  

What was the max H2O2 concentration in whole water and in the 0.22 um filtered controls on average?  
```{r}
#Report the mean, the 95% confidence interval half-width (1.96 * standard error), the minimum, and the maximum.
#The standard error uses the number of non-NA values, to match na.rm=TRUE in sd():
WW_max_H2O2 <- Merged_Prod_Decay_WL_only$Max_H2O2_avg
FC_max_H2O2 <- Merged_Prod_Decay_WL_only$FC_Max_H2O2_avg

print("Whole water concentrations")
mean(WW_max_H2O2, na.rm=TRUE)
(sd(WW_max_H2O2, na.rm=TRUE)/sqrt(sum(!is.na(WW_max_H2O2))))*1.96
min(WW_max_H2O2, na.rm=TRUE)
max(WW_max_H2O2, na.rm=TRUE)

print("Filtered control concentrations")
mean(FC_max_H2O2, na.rm=TRUE)
(sd(FC_max_H2O2, na.rm=TRUE)/sqrt(sum(!is.na(FC_max_H2O2))))*1.96
min(FC_max_H2O2, na.rm=TRUE)
max(FC_max_H2O2, na.rm=TRUE)
```
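The summary above can be wrapped in a small function so that the sample size always matches the values actually used. This is a minimal, non-evaluated sketch: the helper `mean_ci()` and the vector `toy_max_h2o2` are hypothetical, and the 1.96 multiplier assumes a normal approximation.

```r
# Hypothetical helper: mean and 95% CI half-width, counting only non-NA values
mean_ci <- function(x, z = 1.96) {
  x <- x[!is.na(x)]          # drop missing values first
  n <- length(x)             # sample size after dropping NAs
  c(mean = mean(x),
    ci_half_width = z * sd(x) / sqrt(n),
    n = n)
}

# Made-up values; NOT data from this study
toy_max_h2o2 <- c(10, 20, 30, NA)
result <- mean_ci(toy_max_h2o2)
result
```

For small samples, passing a t quantile instead, for example `z = qt(0.975, df = 2)` for three observations, gives a wider and more conservative interval.
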
Is the error between observed H2O2 concentrations and model fit related to any environmental parameters?  
```{r}
#I want to include 105 um filtered water in the regression analysis, but remove the dark bottles, where the model always fit the data, so I'll make a dataframe without the dark data:
Merged_Prod_Decay_no_WD <- Merged_Prod_Decay_df[Merged_Prod_Decay_df$Condition != "WD", ]

#Calculate the correlations between error sum of squares (SSE) and the environmental variables:
#This is a vector of columns to regress SSE against:
vars <- c("Kloss_avg", "Chla", "DIC", "H2CO3", "HCO3", "CO3", "DOC", "Resp", "PrimProd", "CDOM", "Day_Integrated_UVA", "Day_Integrated_UVB", "Day_Integrated_UV", "pH", "TP", "TDP", "Nitrate", "NH4", "SRP", "Incubation_Temp", "Incubation_Temp_SD", "peakA", "peakC", "peakT", "C_A_ratio", "T_A_ratio", "IntFlour", "FI", "SlopeRatio")

#Create empty lists to save results of the loop into:  
WL_cor_results_SSE <- list()
WL_lm_results_SSE <- list()
WL_lm_results_SSE_table <- list()

#Loop through each item of the vector and find the correlation with SSE in the light-incubated samples (whole water and 105 um filtered):
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_SSE[[i]] <- cor(Merged_Prod_Decay_no_WD[[i]],
      Merged_Prod_Decay_no_WD$Sum_Error_Squares_avg, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_SSE[[i]] <- lm(Merged_Prod_Decay_no_WD$Sum_Error_Squares_avg ~ Merged_Prod_Decay_no_WD[[i]], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_SSE_table[[i]] <- glance(WL_lm_results_SSE[[i]])
}

#Print out the stats for the ones that are significant (p < 0.05); isTRUE() skips any model with an NA p-value:
for (i in names(WL_lm_results_SSE_table)){
  if (isTRUE(WL_lm_results_SSE_table[[i]]$p.value < 0.05)){
    print(i)
    print(WL_cor_results_SSE[[i]])
    print(WL_lm_results_SSE_table[[i]]$p.value)
    print(WL_lm_results_SSE_table[[i]]$r.squared)
  }
}
```
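The screening loop above can also be expressed as a function that returns one tidy table. The sketch below is a minimal, non-evaluated example using only base R and the built-in `mtcars` dataset as a stand-in. The helper `screen_predictors()` is hypothetical, and it reads the slope p-value from `summary()` rather than from `glance()`. For a single predictor this should be the same value.

```r
# Hypothetical helper: regress one response on each predictor in turn
screen_predictors <- function(df, response, predictors) {
  rows <- lapply(predictors, function(v) {
    # reformulate() builds the formula response ~ v from column names
    fit <- lm(reformulate(v, response = response), data = df, na.action = na.omit)
    fit_summary <- summary(fit)
    data.frame(
      predictor = v,
      pearson_r = cor(df[[v]], df[[response]], method = "pearson",
                      use = "complete.obs"),
      p_value   = coef(fit_summary)[2, "Pr(>|t|)"],
      r_squared = fit_summary$r.squared
    )
  })
  do.call(rbind, rows)
}

# Stand-in data: built-in mtcars, NOT data from this study
screen <- screen_predictors(mtcars, "mpg", c("wt", "hp", "qsec"))

# Keep only the predictors with p < 0.05
screen[screen$p_value < 0.05, ]
```

Building the formula from column names with `data = df` keeps readable term names in the model output, which the `df$y ~ df[[i]]` form does not.
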
SSE is significantly related to both chlorophyll concentration and primary production. I want to compare these to the relationship with CDOM, so get the stats for that one as well.  
```{r}
WL_cor_results_SSE[["CDOM"]]
WL_lm_results_SSE_table[["CDOM"]]$p.value
WL_lm_results_SSE_table[["CDOM"]]$r.squared
```

Plot the regressions:
```{r}
#Plot Chlorophyll vs SSE
Chla_vs_SSE <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Chla, y=Sum_Error_Squares_avg,
                                                     color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE, size=0.5) +
  geom_errorbar(aes(ymin=Sum_Error_Squares_avg-Sum_Error_Squares_CI,
                    ymax=Sum_Error_Squares_avg+Sum_Error_Squares_CI), width=1.5, size=0.08) +
  geom_errorbarh(aes(xmin=Chla-Chla_CI, xmax=Chla+Chla_CI), height=20000, size=0.08) +
  geom_point(size = 0.5, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  coord_cartesian(ylim=c(-200,1200000), xlim=c(0,200)) +
  xlab(expression("Chlorophyll a ("*mu*"g/L)")) +
  ylab(expression("Sum error of squares in PH2O2 model"))

#Plot Primary Production vs SSE
PrimProd_vs_SSE <- ggplot(Merged_Prod_Decay_no_WD, aes(x=PrimProd, y=Sum_Error_Squares_avg,
                                                         color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE, size=0.5) +
  geom_errorbar(aes(ymin=Sum_Error_Squares_avg-Sum_Error_Squares_CI,
                    ymax=Sum_Error_Squares_avg+Sum_Error_Squares_CI), width=1.5, size=0.08) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=11, size=0.08) +
  geom_point(size = 0.5, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(-200,1200000), xlim=c(0,90)) +
  scale_x_continuous(breaks=seq(0,90, by=15)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("Sum error of squares in PH2O2 model"))

#Plot CDOM vs SSE
CDOM_vs_SSE <- ggplot(Merged_Prod_Decay_no_WD, aes(x=CDOM, y=Sum_Error_Squares_avg,
                                                         color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE, size=0.5) +
  geom_errorbar(aes(ymin=Sum_Error_Squares_avg-Sum_Error_Squares_CI,
                    ymax=Sum_Error_Squares_avg+Sum_Error_Squares_CI), width=0.5, size=0.08) +
  geom_errorbarh(aes(xmin=CDOM-CDOM_CI, xmax=CDOM+CDOM_CI), height=11, size=0.08) +
  geom_point(size = 0.5, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(-200,1200000), xlim=c(0,30)) +
  scale_x_continuous(breaks=seq(0,30, by=5)) +
  xlab(expression("CDOM absorbance (a305)")) +
  ylab(expression("Sum error of squares in PH2O2 model"))

Combo_SSE_regression_plot <- Chla_vs_SSE + PrimProd_vs_SSE + CDOM_vs_SSE
Combo_SSE_regression_plot
```
Repeat the above analysis, but without the points where the model had a poor fit.
```{r}
#Exclude the data points where there was poor fit to the model
Merged_Prod_Decay_no_WD_no_poor_fit <- Merged_Prod_Decay_no_WD[Merged_Prod_Decay_no_WD$Model_Fit != "No", ]

#Calculate the correlations between error sum of squares (SSE) and the environmental variables:
#This is a vector of columns to regress SSE against:
vars <- c("Kloss_avg", "Chla", "DIC", "H2CO3", "HCO3", "CO3", "DOC", "Resp", "PrimProd", "CDOM", "Day_Integrated_UVA", "Day_Integrated_UVB", "Day_Integrated_UV", "pH", "TP", "TDP", "Nitrate", "NH4", "SRP", "Incubation_Temp", "Incubation_Temp_SD", "peakA", "peakC", "peakT", "C_A_ratio", "T_A_ratio", "IntFlour", "FI", "SlopeRatio")

#Create empty lists to save results of the loop into:  
WL_cor_results_SSE_no_poor_fit <- list()
WL_lm_results_SSE_no_poor_fit <- list()
WL_lm_results_SSE_no_poor_fit_table <- list()

#Loop through each item of the vector and find the correlation with SSE in the light-incubated samples, excluding those with poor model fit:
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_SSE_no_poor_fit[[i]] <- cor(Merged_Prod_Decay_no_WD_no_poor_fit[[i]],
      Merged_Prod_Decay_no_WD_no_poor_fit$Sum_Error_Squares_avg, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_SSE_no_poor_fit[[i]] <- lm(Merged_Prod_Decay_no_WD_no_poor_fit$Sum_Error_Squares_avg ~ Merged_Prod_Decay_no_WD_no_poor_fit[[i]], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_SSE_no_poor_fit_table[[i]] <- glance(WL_lm_results_SSE_no_poor_fit[[i]])
}

#Print out the stats for the ones that were considered previously:
for (i in c("Chla", "PrimProd", "CDOM")){
    print(i)
    print(WL_cor_results_SSE_no_poor_fit[[i]])
    print(WL_lm_results_SSE_no_poor_fit_table[[i]]$p.value)
    print(WL_lm_results_SSE_no_poor_fit_table[[i]]$r.squared)
}
```
Import the dataframe of measured and modeled H2O2 concentrations for the incubations with poor model fit:  
```{r}
Poor_fit_curves_df <- read.table("Poor_fit_curves_df.txt", header=TRUE, sep="\t")
```

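Plot the measured H2O2 concentrations (points) against the modeled concentrations (dashed lines) for each of these incubations: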
```{r}
#Plot for WL 22-Aug-17:
WL_22Aug17_1 <- filter(Poor_fit_curves_df, Date == "22-Aug-17" & Rep == 1) %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 22-Aug-17 Rep 1") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top") +
    coord_cartesian(ylim=c(0,2400), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,2400, by=400)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
    
WL_22Aug17_2 <- filter(Poor_fit_curves_df, Date == "22-Aug-17" & Rep == 2) %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 22-Aug-17 Rep 2") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,2400), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,2400, by=400)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))

WL_22Aug17_plot <- WL_22Aug17_1 + WL_22Aug17_2
WL_22Aug17_plot
ggsave("WL_22Aug17_plot.pdf",  WL_22Aug17_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
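The per-date chunks in this part of the notebook repeat the same plotting code with only the date, replicate, title, and y-axis range changing. The sketch below shows one way that pattern could be wrapped in a function. It is a minimal, non-evaluated example: the helper `plot_fit_curve()` and the data frame `toy_curves` are hypothetical, the values are made up, and the detailed theme settings and line sizes are left out for brevity.

```r
library(ggplot2)

# Hypothetical helper: measured vs. modeled H2O2 for one date and replicate
plot_fit_curve <- function(df, date, replicate_id, y_max, y_step,
                           title_prefix = "Whole water",
                           legend_position = "none") {
  sub_df <- df[df$Date == date & df$Rep == replicate_id, ]
  ggplot(sub_df, aes(x = Hours_after_spike, y = measured_H2O2,
                     color = Bottle_type)) +
    geom_point() +
    geom_line(aes(y = model_H2O2), linetype = "dashed") +
    geom_errorbar(aes(ymin = measured_H2O2 - measured_H2O2_se,
                      ymax = measured_H2O2 + measured_H2O2_se), width = 0.5) +
    ggtitle(paste(title_prefix, date, "Rep", replicate_id)) +
    theme_classic() +
    theme(legend.position = legend_position) +
    coord_cartesian(ylim = c(0, y_max), xlim = c(0, 9)) +
    scale_x_continuous(breaks = seq(0, 9, by = 3)) +
    scale_y_continuous(breaks = seq(0, y_max, by = y_step)) +
    xlab("Incubation time (hours)") +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
}

# Made-up data using the column names from the chunks above; NOT study data
toy_curves <- data.frame(
  Date = "01-Jan-00",
  Rep = rep(1:2, each = 6),
  Bottle_type = rep(rep(c("Control", "Spike"), each = 3), times = 2),
  Hours_after_spike = rep(c(0, 3, 6), times = 4),
  measured_H2O2 = c(100, 150, 180, 1000, 700, 500, 110, 160, 190, 1000, 720, 510),
  measured_H2O2_se = 10,
  model_H2O2 = c(100, 145, 185, 1000, 710, 495, 105, 158, 195, 1000, 715, 505)
)

p1 <- plot_fit_curve(toy_curves, "01-Jan-00", 1, y_max = 1200, y_step = 200,
                     legend_position = "top")
p2 <- plot_fit_curve(toy_curves, "01-Jan-00", 2, y_max = 1200, y_step = 200)
```

The two panels could then be combined with patchwork as `p1 + p2`, in the same way as the per-date plots in this notebook.
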
```{r}
#Plot for WL 30-Aug-17:
WL_30Aug17_1 <- filter(Poor_fit_curves_df, Date == "30-Aug-17" & Rep == 1) %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 30-Aug-17 Rep 1") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top") +
    coord_cartesian(ylim=c(0,1200), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,1200, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
    
WL_30Aug17_2 <- filter(Poor_fit_curves_df, Date == "30-Aug-17" & Rep == 2) %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 30-Aug-17 Rep 2") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,1200), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,1200, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))

WL_30Aug17_plot <- WL_30Aug17_1 + WL_30Aug17_2
WL_30Aug17_plot
ggsave("WL_30Aug17_plot.pdf",  WL_30Aug17_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
```{r}
#Plot for WL 23-Jul-19:
WL_23Jul19_1 <- filter(Poor_fit_curves_df, Date == "23-Jul-19" & Rep == 1) %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 23-Jul-19 Rep 1") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top") +
    coord_cartesian(ylim=c(0,1400), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,1400, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
    
WL_23Jul19_2 <- filter(Poor_fit_curves_df, Date == "23-Jul-19" & Rep == 2) %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 23-Jul-19 Rep 2") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,1400), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,1400, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))

WL_23Jul19_plot <- WL_23Jul19_1 + WL_23Jul19_2
WL_23Jul19_plot
ggsave("WL_23Jul19_plot.pdf",  WL_23Jul19_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
```{r}
#Plot for WL 6-Aug-19:
WL_6Aug19_1 <- filter(Poor_fit_curves_df, Date == "6-Aug-19" & Rep == 1 & Condition == "WL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 6-Aug-19 Rep 1") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top") +
    coord_cartesian(ylim=c(0,800), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,800, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
    
WL_6Aug19_2 <- filter(Poor_fit_curves_df, Date == "6-Aug-19" & Rep == 2 & Condition == "WL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 6-Aug-19 Rep 2") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,800), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,800, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))

WL_6Aug19_plot <- WL_6Aug19_1 + WL_6Aug19_2
WL_6Aug19_plot
ggsave("WL_6Aug19_plot.pdf",  WL_6Aug19_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
```{r}
#Plot for FL 6-Aug-19:
FL_6Aug19_1 <- filter(Poor_fit_curves_df, Date == "6-Aug-19" & Rep == 1 & Condition == "FL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("105 um filtered water 6-Aug-19 Rep 1") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top") +
    coord_cartesian(ylim=c(0,1000), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,1000, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
    
FL_6Aug19_2 <- filter(Poor_fit_curves_df, Date == "6-Aug-19" & Rep == 2 & Condition == "FL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("105 um filtered water 6-Aug-19 Rep 2") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,1000), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,1000, by=200)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))

FL_6Aug19_plot <- FL_6Aug19_1 + FL_6Aug19_2
FL_6Aug19_plot
ggsave("FL_6Aug19_plot.pdf",  FL_6Aug19_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
```{r}
#Plot for WL 24-Aug-19:
WL_24Aug19_1 <- filter(Poor_fit_curves_df, Date == "24-Aug-19" & Rep == 1 & Condition == "WL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 24-Aug-19 Rep 1") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top") +
    coord_cartesian(ylim=c(0,900), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,900, by=300)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
    
WL_24Aug19_2 <- filter(Poor_fit_curves_df, Date == "24-Aug-19" & Rep == 2 & Condition == "WL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("Whole water 24-Aug-19 Rep 2") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,900), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,900, by=300)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))

WL_24Aug19_plot <- WL_24Aug19_1 + WL_24Aug19_2
WL_24Aug19_plot
ggsave("WL_24Aug19_plot.pdf",  WL_24Aug19_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
```{r}
#Plot for FL (105 um filtered water) 24-Aug-19:
FL_24Aug19_1 <- filter(Poor_fit_curves_df, Date == "24-Aug-19" & Rep == 1 & Condition == "FL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("105 um filtered water 24-Aug-19 Rep 1") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top") +
    coord_cartesian(ylim=c(0,900), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,900, by=300)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
    
FL_24Aug19_2 <- filter(Poor_fit_curves_df, Date == "24-Aug-19" & Rep == 2 & Condition == "FL") %>%
  ggplot(aes(x=Hours_after_spike, y=measured_H2O2, color=Bottle_type)) +
    geom_point() +
    geom_line(aes(x=Hours_after_spike, y=model_H2O2, color=Bottle_type), linetype="dashed") +
    geom_errorbar(aes(ymin=measured_H2O2-measured_H2O2_se,
                    ymax=measured_H2O2+measured_H2O2_se), width=0.5, size=0.08) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle("105 um filtered water 24-Aug-19 Rep 2") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,900), xlim=c(0,9)) +
    scale_x_continuous(breaks = seq(0,9, by=3)) +
    scale_y_continuous(breaks = seq(0,900, by=300)) +
    xlab(expression("Incubation time (hours)")) +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))

FL_24Aug19_plot <- FL_24Aug19_1 + FL_24Aug19_2
FL_24Aug19_plot
ggsave("FL_24Aug19_plot.pdf",  FL_24Aug19_plot, width = 8, height = 3.5, units = "in", dpi=300)
```
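The plotting chunks above differ only in the date, replicate, condition, title, legend position, and y-axis range. As a minimal sketch, the repeated panel code could be factored into one function. The function name `plot_fit_panel`, its arguments, and the toy data frame `toy_curves` are hypothetical and not part of the original analysis; the column names follow those used in `Poor_fit_curves_df`, and the `Bottle_type` values and all numbers in the toy data are made up. Only the core layers are reproduced here, not the full axis styling.

```{r}
library(ggplot2)
library(patchwork)

# Hypothetical helper (not in the original notebook): draws one measured vs.
# modeled H2O2 panel for a data frame already filtered to one date/rep/condition.
plot_fit_panel <- function(df, title, y_max, y_step, legend_position = "none") {
  ggplot(df, aes(x = Hours_after_spike, y = measured_H2O2, color = Bottle_type)) +
    geom_point() +
    geom_line(aes(y = model_H2O2), linetype = "dashed") +
    geom_errorbar(aes(ymin = measured_H2O2 - measured_H2O2_se,
                      ymax = measured_H2O2 + measured_H2O2_se), width = 0.5) +
    scale_color_manual(values = c("red", "blue", "orange"), name = "Bottle:") +
    ggtitle(title) +
    theme_classic() +
    theme(legend.position = legend_position) +
    coord_cartesian(ylim = c(0, y_max), xlim = c(0, 9)) +
    scale_x_continuous(breaks = seq(0, 9, by = 3)) +
    scale_y_continuous(breaks = seq(0, y_max, by = y_step)) +
    xlab("Incubation time (hours)") +
    ylab(expression("[H"[2]*"O"[2]*"] (nM)"))
}

# Toy data standing in for Poor_fit_curves_df (all values are made up):
toy_curves <- data.frame(
  Hours_after_spike = rep(c(0, 3, 6, 9), times = 2),
  Bottle_type = rep(c("Control", "Spiked"), each = 4),
  measured_H2O2 = c(100, 150, 180, 200, 800, 500, 350, 250),
  model_H2O2 = c(105, 145, 175, 195, 790, 520, 360, 260),
  measured_H2O2_se = rep(10, 8)
)

# Build two panels and combine them side by side with patchwork:
toy_panel_1 <- plot_fit_panel(toy_curves, "Toy data Rep 1", y_max = 900, y_step = 300,
                              legend_position = "top")
toy_panel_2 <- plot_fit_panel(toy_curves, "Toy data Rep 2", y_max = 900, y_step = 300)
toy_fit_plot <- toy_panel_1 + toy_panel_2
toy_fit_plot
```

With the real data, each panel would be a call such as `plot_fit_panel(filter(Poor_fit_curves_df, Date == "24-Aug-19" & Rep == 1 & Condition == "FL"), ...)`, so a change to the styling only has to be made in one place.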
Do the environmental parameters differ significantly between the dates where the model fit and the dates where it did not?  
```{r}
#Separate the data frame without the dark bottles into two data frames based on model fit:
Merged_Prod_Decay_no_WD_poor_fit <- Merged_Prod_Decay_no_WD[Merged_Prod_Decay_no_WD$Model_Fit == "No", ]
Merged_Prod_Decay_no_WD_OK_fit <- Merged_Prod_Decay_no_WD[Merged_Prod_Decay_no_WD$Model_Fit == "Yes", ]

####Perform a Welch's t-test for each of the environmental parameters:
#This is a vector of variables to loop through during the t-test calculations:
vars <- c("Chla", "DIC", "H2CO3", "HCO3", "CO3", "DOC", "Resp", "PrimProd", "CDOM", "Day_Integrated_UVA", "Day_Integrated_UVB", "Day_Integrated_UV", "pH", "TP", "TDP", "Nitrate", "NH4", "SRP", "Incubation_Temp", "Incubation_Temp_SD", "peakA", "peakC", "peakT", "C_A_ratio", "T_A_ratio", "IntFlour", "FI", "SlopeRatio")

for (i in vars){
  #Get the vector for the poor fit data:
  PF_vector <- Merged_Prod_Decay_no_WD_poor_fit[ , colnames(Merged_Prod_Decay_no_WD_poor_fit) == i]
  #Get the vector for the rest of the data:
  OK_vector <- Merged_Prod_Decay_no_WD_OK_fit[ , colnames(Merged_Prod_Decay_no_WD_OK_fit) == i]
  print(i)
  print(t.test(PF_vector, OK_vector, paired = FALSE, alternative = "two.sided"))
  rm(PF_vector)
  rm(OK_vector)
}
```
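The loop above runs 28 separate tests, so a few p-values below 0.05 can be expected by chance alone. One way to screen the output is to collect the results in a single table and adjust the p-values for multiple comparisons. The sketch below does this with base R's `t.test()` and `p.adjust()` on simulated data. The objects `toy_poor_fit`, `toy_ok_fit`, `toy_vars`, and `toy_ttests` are hypothetical, and the Benjamini-Hochberg adjustment is an illustrative choice, not a method used in the original analysis.

```{r}
# Simulated stand-ins for the poor-fit and OK-fit data frames (values are made up):
set.seed(1)
toy_poor_fit <- data.frame(Chla = rnorm(8, 60, 10), pH = rnorm(8, 8.5, 0.3))
toy_ok_fit   <- data.frame(Chla = rnorm(20, 30, 10), pH = rnorm(20, 8.5, 0.3))
toy_vars <- c("Chla", "pH")

# Run one Welch's t-test per variable and keep the group means and p-value:
toy_ttests <- do.call(rbind, lapply(toy_vars, function(v) {
  # t.test() uses var.equal = FALSE by default, which is Welch's test
  tt <- t.test(toy_poor_fit[[v]], toy_ok_fit[[v]])
  data.frame(variable = v,
             mean_poor_fit = unname(tt$estimate[1]),
             mean_ok_fit = unname(tt$estimate[2]),
             p_value = tt$p.value)
}))

# Adjust for multiple comparisons (Benjamini-Hochberg, an illustrative choice):
toy_ttests$p_adjusted <- p.adjust(toy_ttests$p_value, method = "BH")
toy_ttests
```

A table like this also makes it easier to see which parameters to carry forward into the box plots.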
Plot the distribution of each parameter that differed significantly between the poor-fit and OK-fit dates:  
```{r}
#Plot for chlorophyll:
Chla_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=Chla, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,200)) +
  xlab("Model fit") +
  ylab(expression("Chlorophyll a ("*mu*"g/L)"))

H2CO3_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=H2CO3, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  xlab("Model fit") +
  ylab(expression("H"[2]*"CO"[3]*" ("*mu*"M)"))

CO3_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=CO3, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  xlab("Model fit") +
  coord_cartesian(ylim=c(0,150)) +
  ylab(expression("CO"[3]*""^2-" ("*mu*"M)"))

DOC_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=DOC, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  xlab("Model fit") +
  coord_cartesian(ylim=c(200,600)) +
  ylab(expression("DOC ("*mu*"M)"))

CDOM_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=CDOM, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  xlab("Model fit") +
  coord_cartesian(ylim=c(0,25)) +
  ylab("CDOM absorbance (a305)")

UVA_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=Day_Integrated_UVA, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(200000,1200000)) +
  xlab("Model fit") +
  ylab(expression("Day Integrated UVA (J/m"^2*")"))

pH_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=pH, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(7,10)) +
  xlab("Model fit") +
  ylab("pH")

NH4_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=NH4, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  xlab("Model fit") +
  coord_cartesian(ylim=c(0,400)) +
  ylab(expression("NH"[4]*" ("*mu*"M)"))

SRP_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=SRP, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  xlab("Model fit") +
  coord_cartesian(ylim=c(0,80)) +
  ylab(expression("Soluble Reactive P ("*mu*"M)"))

PeakA_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=peakA, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,4)) +
  xlab("Model fit") +
  ylab(expression("FDOM Peak A"))

PeakC_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=peakC, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,1.5)) +
  xlab("Model fit") +
  ylab(expression("FDOM Peak C"))

PeakT_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=peakT, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,0.6)) +
  xlab("Model fit") +
  ylab(expression("FDOM Peak T"))

IntFlour_box_plot <- ggplot(Merged_Prod_Decay_no_WD, aes(x=Model_Fit, y=IntFlour, color=Model_Fit)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(850,16500)) +
  xlab("Model fit") +
  ylab(expression("FDOM (Integrated Fluorescence)"))

All_box_plot <- Chla_box_plot + H2CO3_box_plot + CO3_box_plot + DOC_box_plot + CDOM_box_plot + PeakA_box_plot + PeakC_box_plot + PeakT_box_plot + IntFlour_box_plot + UVA_box_plot + pH_box_plot + NH4_box_plot + SRP_box_plot + plot_layout(ncol=7)

ggsave("All_box_plot.pdf",  All_box_plot, width = 14, height = 10, units = "in", dpi=300)
```
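The box plots above share the same layers and differ only in the y variable, axis label, and y-axis range. As a sketch of a more compact alternative, the parameters can be reshaped to long format with `tidyr::pivot_longer()` and drawn in one call with `facet_wrap()`. The objects `toy_env`, `toy_env_long`, and `toy_facet_box_plot` are hypothetical and the values are simulated. This version gives up the per-panel axis limits and the formatted axis labels used above.

```{r}
library(ggplot2)
library(tidyr)

# Simulated stand-in for Merged_Prod_Decay_no_WD (values are made up):
set.seed(2)
toy_env <- data.frame(
  Model_Fit = rep(c("No", "Yes"), times = c(6, 14)),
  Chla = runif(20, 0, 200),
  DOC = runif(20, 200, 600),
  pH = runif(20, 7, 10)
)

# Reshape to one row per sample and parameter:
toy_env_long <- pivot_longer(toy_env, cols = c(Chla, DOC, pH),
                             names_to = "Parameter", values_to = "Value")

# One panel per parameter, each with its own y-axis scale:
toy_facet_box_plot <- ggplot(toy_env_long, aes(x = Model_Fit, y = Value, color = Model_Fit)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.7) +
  facet_wrap(~ Parameter, scales = "free_y") +
  theme_classic() +
  theme(legend.position = "none") +
  xlab("Model fit")
toy_facet_box_plot
```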

What is the relationship between biotic H2O2 production and other environmental variables?  
Only looking at whole water bottles in the light for now.  
```{r}
#Calculate the correlations between PH2O2 and the environmental variables:
#This is a vector of columns to regress Biotic PH2O2 against:
vars <- c("Kloss_avg", "Chla", "DIC", "H2CO3", "HCO3", "CO3", "DOC", "Resp", "PrimProd", "CDOM", "Day_Integrated_UVA", "Day_Integrated_UVB", "Day_Integrated_UV", "pH", "TP", "TDP", "Nitrate", "NH4", "SRP", "Incubation_Temp", "Incubation_Temp_SD", "peakA", "peakC", "peakT", "C_A_ratio", "T_A_ratio", "IntFlour", "FI", "SlopeRatio")

#Create empty lists to save results of the loop into:  
WL_cor_results <- list()
WL_lm_results <- list()
WL_lm_results_table <- list()

#Loop through each item of the vector and find its correlation with Biotic PH2O2 in whole water light samples:
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results[[i]] <- cor(Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i],
      Merged_Prod_Decay_WL_only$Biotic_PH2O2, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results[[i]] <- lm(Merged_Prod_Decay_WL_only$Biotic_PH2O2 ~ Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_table[[i]] <- glance(WL_lm_results[[i]])
}
```

Print the Pearson's r, F-test p-value, R^2, and mean absolute error (MAE) of each linear model:  
```{r}
for (i in seq_along(WL_lm_results)){
  #Use the list names so that each label always matches its model:
  print(names(WL_lm_results)[i])
  print("Pearson's r")
  print(WL_cor_results[[i]])
  print("F-test p-value")
  print(WL_lm_results_table[[i]]$p.value)
  print("R2")
  print(WL_lm_results_table[[i]]$r.squared)
  print("MAE:")
  print(mean(abs(WL_lm_results[[i]]$residuals)))
}
```
Kloss, chlorophyll a, respiration rate, primary production, CDOM, peak A, peak C, C/A ratio, integrated fluorescence, TP, TDP, nitrate, NH4, and SRP are all significantly correlated with biotic H2O2 production. Only chlorophyll a, respiration rate, and primary production have R2 values above 0.3, so those likely have the most explanatory power.  
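
For a regression with a single predictor, R2 is the square of Pearson's r, so an R2 above 0.3 corresponds to an absolute correlation above about 0.55. The sketch below collects the statistics for each predictor into one table sorted by R2, which makes this kind of comparison easier than reading the printed output. It uses base R only and simulated data; `toy_wl`, `toy_predictors`, and `toy_lm_summary` are hypothetical names, not objects from the original analysis.

```{r}
# Simulated stand-in for Merged_Prod_Decay_WL_only (values are made up):
set.seed(3)
toy_wl <- data.frame(Chla = runif(30, 0, 80), pH = runif(30, 7, 10))
toy_wl$Biotic_PH2O2 <- 2 * toy_wl$Chla + rnorm(30, sd = 30)
toy_predictors <- c("Chla", "pH")

# Fit one simple linear model per predictor and keep the summary statistics:
toy_lm_summary <- do.call(rbind, lapply(toy_predictors, function(v) {
  fit <- lm(reformulate(v, response = "Biotic_PH2O2"), data = toy_wl)
  data.frame(variable = v,
             pearson_r = cor(toy_wl[[v]], toy_wl$Biotic_PH2O2, use = "complete.obs"),
             r_squared = summary(fit)$r.squared,
             # With one predictor, the slope t-test p-value equals the F-test p-value
             p_value = summary(fit)$coefficients[2, "Pr(>|t|)"],
             mae = mean(abs(residuals(fit))))
}))

# Sort so the predictors with the most explanatory power come first:
toy_lm_summary <- toy_lm_summary[order(-toy_lm_summary$r_squared), ]
toy_lm_summary
```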

Make a regression plot of Biotic PH2O2 and the parameters for which there is a significant relationship:  
```{r}
#Plot Chlorophyll vs Biotic PH2O2
Chla_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Chla, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=1.5, size=0.1) +
  geom_errorbarh(aes(xmin=Chla-Chla_CI, xmax=Chla+Chla_CI), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.05, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  coord_cartesian(ylim=c(0,350),xlim=c(0,80)) +
  xlab(expression("Chlorophyll a ("*mu*"g/L)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Respiration vs Biotic PH2O2
Resp_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Resp, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=1.5, size=0.1) +
  geom_errorbarh(aes(xmin=Resp-Resp_CI, xmax=Resp+Resp_CI), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.05, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="top",
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 12)) +
  coord_cartesian(ylim=c(0,200), xlim=c(0,50)) +
  xlab(expression("Respiration ("*mu*"M O"[2]*"/day)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Primary Production vs Biotic PH2O2
PrimProd_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=PrimProd, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=1.5, size=0.1) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.05, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,200), xlim=c(0,60)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot CDOM vs Biotic PH2O2
CDOM_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=CDOM, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=1.5, size = 0.1) +
  geom_errorbarh(aes(xmin=CDOM-CDOM_CI, xmax=CDOM+CDOM_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  coord_cartesian(ylim=c(0,350)) +
  xlab(expression("CDOM absorbance (a305)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Biotic PH2O2 vs Peak A
PeakA_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=peakA, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=0.2, size = 0.1) +
  geom_errorbarh(aes(xmin=peakA-peakA_CI, xmax=peakA+peakA_CI), height=8, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  coord_cartesian(ylim=c(0,350)) +
  xlab(expression("FDOM peak A")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))
```
The warnings from the plots above refer to rows with NAs being excluded, and to negative values in some error bars and on the regression line falling outside the bounds of the plot axes. These plots were not included because of the large uncertainty in Biotic PH2O2.

Are the correlations with primary production and respiration still significant when the high-value point from 2019 (the 2-Aug-19 experiment) is removed?
```{r}
print("Respiration")
#Drop the 2-Aug-19 experiment (the high value point) before recalculating:
check_df <- filter(Merged_Prod_Decay_WL_only, Experiment_Date != "2-Aug-19")
#Pearson's r, ignoring sample pairs which have NAs:
cor(check_df$Resp, check_df$Biotic_PH2O2, method = "pearson", use = "complete.obs")
#Linear model, its summary statistics, and the mean absolute error of the residuals:
check_resp <- lm(check_df$Biotic_PH2O2 ~ check_df$Resp, na.action = na.omit)
summary(check_resp)
mean(abs(check_resp$residuals))

print("Primary Production")
#Same statistics for primary production:
cor(check_df$PrimProd, check_df$Biotic_PH2O2, method = "pearson", use = "complete.obs")
check_PrimProd <- lm(check_df$Biotic_PH2O2 ~ check_df$PrimProd, na.action = na.omit)
summary(check_PrimProd)
mean(abs(check_PrimProd$residuals))
```
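
The check above amounts to refitting the same regression with one experiment excluded and comparing the statistics. A minimal sketch of that comparison on made-up numbers follows. The data frame `toy`, its values, and the helper `fit_stats()` are hypothetical and are not part of this notebook's data or results.

```r
# Hypothetical toy data: six ordinary points plus one high-value point.
toy <- data.frame(
  Experiment_Date = c("a", "b", "c", "d", "e", "f", "high"),
  Resp            = c(5, 8, 10, 12, 15, 18, 45),
  Biotic_PH2O2    = c(20, 18, 35, 25, 40, 30, 180)
)

# Fit y ~ x and return Pearson's r, the slope p-value, R^2,
# and the mean absolute error of the residuals.
fit_stats <- function(df, x, y) {
  fit <- lm(reformulate(x, response = y), data = df, na.action = na.omit)
  s <- summary(fit)
  c(r       = cor(df[[x]], df[[y]], use = "complete.obs"),
    p_value = s$coefficients[2, "Pr(>|t|)"],
    r2      = s$r.squared,
    mae     = mean(abs(residuals(fit))))
}

with_point    <- fit_stats(toy, "Resp", "Biotic_PH2O2")
without_point <- fit_stats(subset(toy, Experiment_Date != "high"),
                           "Resp", "Biotic_PH2O2")

# Side-by-side comparison of the two fits.
round(rbind(with_point, without_point), 3)
```

In this toy case the single high point inflates the correlation, which is the situation the check is meant to detect.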

Plot the linear regressions of biotic PH2O2 with the significantly correlated nutrients (which have weaker significance and lower R^2 values than the above parameters):  
```{r}
#Plot TP vs Biotic PH2O2
TP_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=TP, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=1.5, size=0.1) +
  geom_errorbarh(aes(xmin=TP-TP_CI, xmax=TP+TP_CI), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12)) +
  coord_cartesian(ylim=c(0,350),xlim=c(0,200)) +
  xlab(expression("Total P ("*mu*"g/L)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot TDP vs Biotic PH2O2
TDP_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=TDP, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=1.5, size=0.1) +
  geom_errorbarh(aes(xmin=TDP-TDP_CI, xmax=TDP+TDP_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,350),xlim=c(0,90)) +
  xlab(expression("TDP ("*mu*"g/L)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot SRP vs Biotic PH2O2
SRP_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=SRP, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=1.5, size=0.1) +
  geom_errorbarh(aes(xmin=SRP-SRP_CI, xmax=SRP+SRP_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,350)) +
  xlab(expression("SRP ("*mu*"g/L)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Nitrate vs Biotic PH2O2
Nitrate_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Nitrate, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=0.2, size = 0.1) +
  geom_errorbarh(aes(xmin=Nitrate-Nitrate_CI, xmax=Nitrate+Nitrate_CI), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  coord_cartesian(ylim=c(0,350),xlim=c(0,5)) +
  xlab(expression("NO"[3]*" (mg N/L)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot NH4 vs Biotic PH2O2
NH4_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=log(NH4), y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=0.2, size=0.1) +
  #log() is the natural log, so the propagated CI is NH4_CI/NH4 (the 0.434 factor applies only to log10)
  geom_errorbarh(aes(xmin=log(NH4)-(NH4_CI/NH4), xmax=log(NH4)+(NH4_CI/NH4)), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,350)) +
  xlab(expression("ln NH"[4]*" ("*mu*"g N/L)")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Kloss vs Biotic PH2O2
Kloss_vs_Biotic_PH2O2 <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Kloss_avg, y=Biotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=0.03, size = 0.1) +
  geom_errorbarh(aes(xmin=Kloss_avg-Kloss_CI, xmax=Kloss_avg+Kloss_CI), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position="none") +
  coord_cartesian(ylim=c(0,350)) +
  xlab(expression("K"[loss*","*H[2]*O[2]]*" (hr"^-1*")")) +
  ylab(expression("Gross biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Combine all the regression plots with biological parameters into one panel:
BioPH2O2_bio_param_regressions <- Chla_vs_Biotic_PH2O2 + PrimProd_vs_Biotic_PH2O2 + Resp_vs_Biotic_PH2O2 + Kloss_vs_Biotic_PH2O2 + plot_layout(ncol = 2)

#Combine all the regression plots with chemical parameters into one panel:
BioPH2O2_chem_param_regressions <- CDOM_vs_Biotic_PH2O2 + TP_vs_Biotic_PH2O2 + Nitrate_vs_Biotic_PH2O2 + NH4_vs_Biotic_PH2O2 + SRP_vs_Biotic_PH2O2 + TDP_vs_Biotic_PH2O2 + plot_layout(ncol = 3)

BioPH2O2_bio_param_regressions
BioPH2O2_chem_param_regressions
ggsave("BioPH2O2_bio_param_regressions.pdf",  BioPH2O2_bio_param_regressions, width = 12, height = 10, units = "in", dpi=300)
ggsave("BioPH2O2_chem_param_regressions.pdf",  BioPH2O2_chem_param_regressions, width = 12, height = 10, units = "in", dpi=300)
```
Given the low R^2 values and the wide 95% confidence intervals on PH2O2, nutrients and CDOM likely have little impact on gross biotic H2O2 production rates. Again, these plots were not included because of the large uncertainty in Biotic PH2O2.  
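
The NH4 panel plots ln(NH4) on the x axis, so the confidence interval on NH4 has to be propagated through the logarithm before it can be drawn as an error bar. To first order, the uncertainty of ln(x) is dx/x, while the factor 0.434 (1/ln 10) belongs to log10(x). A small numeric sketch with made-up values for `x` and `ci` (both hypothetical):

```r
# Hypothetical concentration and its 95% CI half-width (illustrative values only).
x  <- 40
ci <- 4

# First-order propagation: d/dx ln(x) = 1/x and d/dx log10(x) = 1/(x * ln(10)).
ci_ln    <- ci / x
ci_log10 <- ci / (x * log(10))

# Exact half-widths of the transformed interval, for comparison.
exact_ln    <- (log(x + ci) - log(x - ci)) / 2
exact_log10 <- (log10(x + ci) - log10(x - ci)) / 2

c(ci_ln = ci_ln, exact_ln = exact_ln,
  ci_log10 = ci_log10, exact_log10 = exact_log10)
```

For a CI that is small relative to the value, the first-order and exact half-widths agree closely; the approximation degrades as the relative uncertainty grows.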

For several experiments, it was not possible to obtain a value for gross biotic H2O2 production for the reasons described above. Let's look at how the data and regressions with net production and decay rates compare with those using the estimated gross biotic production rates.  

What are the range and average of the net production rate?  
```{r}
#Calculate the number of complete (non-NA) observations:
num_obs <- sum(!is.na(Merged_Prod_Decay_WL_only$Net_production_avg))
#Get the stats: mean, 95% CI half-width (1.96 x standard error), minimum, and maximum:
mean(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)
(sd(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)/sqrt(num_obs))*1.96
min(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)
max(Merged_Prod_Decay_WL_only$Net_production_avg, na.rm=TRUE)
```
Net production rates were 31 +/- 15 nM/hr (mean +/- 95% confidence interval) and ranged from -13 to 165 nM/hr.  
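
The +/- value comes from a normal-approximation 95% confidence half-width, i.e. 1.96 times the standard error of the mean. A minimal sketch of the same calculation wrapped in a helper is shown below, run on made-up rates. `mean_ci()` and the vector `rates` are hypothetical; the t-based half-width is included only for comparison, since it is wider when the number of experiments is small.

```r
# Hypothetical helper: mean, CI half-widths, and range of a numeric vector.
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                 # keep complete observations only
  n <- length(x)
  se <- sd(x) / sqrt(n)             # standard error of the mean
  alpha <- 1 - level
  c(mean         = mean(x),
    half_width_z = qnorm(1 - alpha / 2) * se,          # about 1.96 * se at 95%
    half_width_t = qt(1 - alpha / 2, df = n - 1) * se, # small-sample alternative
    min          = min(x),
    max          = max(x),
    n            = n)
}

# Made-up rates (nM/hr), including one missing value.
rates <- c(-13, 5, 12, 20, NA, 31, 44, 165)
round(mean_ci(rates), 2)
```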

Is there a relationship between Net production rate and any of the environmental parameters?  
```{r}
#List of variables:
vars <- c("Chla", "DIC", "H2CO3", "HCO3", "CO3", "DOC", "Resp", "PrimProd", "CDOM", "Day_Integrated_UVA", "Day_Integrated_UVB", "Day_Integrated_UV", "pH", "TP", "TDP", "Nitrate", "NH4", "SRP", "Incubation_Temp", "Incubation_Temp_SD", "peakA", "peakC", "peakT", "C_A_ratio", "T_A_ratio", "IntFlour", "FI", "SlopeRatio")

#Create empty lists to save results of the loop into:  
WL_cor_results_net_prod <- list()
WL_lm_results_net_prod <- list()
WL_lm_results_net_prod_table <- list()

#Loop through each item of the vector and find the correlation with Net H2O2 production in whole water light samples: 
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_net_prod[[i]] <- cor(Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i],
      Merged_Prod_Decay_WL_only$Net_production_avg, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_net_prod[[i]] <- lm(Merged_Prod_Decay_WL_only$Net_production_avg ~ Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_net_prod_table[[i]] <- glance(WL_lm_results_net_prod[[i]])
}
```

Print the Pearson's r, p-value, R^2, and mean absolute error (MAE) of each significant linear model:  
```{r}
for (i in seq_along(WL_lm_results_net_prod)){
  if (WL_lm_results_net_prod_table[[i]]$p.value < 0.05){
    print(vars[i])
    print("Pearson's R")
    print(WL_cor_results_net_prod[[i]])
    print("p-value")
    print(WL_lm_results_net_prod_table[[i]]$p.value)
    print("R2")
    print(WL_lm_results_net_prod_table[[i]]$r.squared)
    print("MAE")
    print(mean(abs(WL_lm_results_net_prod[[i]]$residuals)))
  }
}
```
The results with net production rates are similar to those using gross biotic production. Chlorophyll, respiration, and primary production have the strongest correlations with H2O2 production rates. Nutrients also had a weak but significant correlation, as they did with gross biotic production rates. One difference is that CDOM, FDOM, and DOC have a much stronger relationship with net H2O2 production than with biotic H2O2 production, but this is likely due to the contribution of photochemical H2O2 production to the net rates. Another difference is that the standard deviation in incubation temperature had a weak but significant correlation with net H2O2 production.

Get the slope and intercept for each variable that had a significant correlation:  
```{r}
#For each statistically significant correlation, get the slope and intercept:  
for (i in seq_along(WL_lm_results_net_prod)){
  if (WL_lm_results_net_prod_table[[i]]$p.value < 0.05){
    print(vars[i])
    print(WL_cor_results_net_prod[[i]])
    print(summary(WL_lm_results_net_prod[[i]]))
  }
}
```
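
For reporting, the slope and intercept can also be pulled out of each fit programmatically rather than read off the `summary()` printout. A minimal base-R sketch on simulated data follows. The data frame `toy`, its columns, and the helper `coef_row()` are hypothetical stand-ins for this notebook's variables, and the numbers are not results.

```r
# Simulated stand-in data (not the notebook's measurements).
set.seed(1)
toy <- data.frame(Chla = runif(20, 1, 80), TP = runif(20, 10, 150))
toy$Net_production_avg <- 5 + 1.5 * toy$Chla + rnorm(20, sd = 10)

# Hypothetical helper: one row of coefficients per predictor.
coef_row <- function(df, predictor, response) {
  fit <- lm(reformulate(predictor, response = response), data = df)
  ci  <- confint(fit)   # 95% confidence intervals for intercept and slope
  data.frame(predictor = predictor,
             intercept = unname(coef(fit)[1]),
             slope     = unname(coef(fit)[2]),
             slope_lwr = ci[2, 1],
             slope_upr = ci[2, 2],
             p_value   = summary(fit)$coefficients[2, 4])
}

# Stack the rows for several predictors into one table.
coef_table <- do.call(rbind, lapply(c("Chla", "TP"), coef_row,
                                    df = toy, response = "Net_production_avg"))
coef_table
```

A table in this shape is easier to filter by p-value or export than the printed summaries.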

Now get the regression statistics for total absolute H2O2 production:  
```{r}
#List of variables:
vars <- c("Chla", "DIC", "H2CO3", "HCO3", "CO3", "DOC", "Resp", "PrimProd", "CDOM", "Day_Integrated_UVA", "Day_Integrated_UVB", "Day_Integrated_UV", "pH", "TP", "TDP", "Nitrate", "NH4", "SRP", "Incubation_Temp", "Incubation_Temp_SD", "peakA", "peakC", "peakT", "C_A_ratio", "T_A_ratio", "IntFlour", "FI", "SlopeRatio")

#Create empty lists to save results of the loop into:  
WL_cor_results_abs_prod <- list()
WL_lm_results_abs_prod <- list()
WL_lm_results_abs_prod_table <- list()

#Loop through each item of the vector and find the correlation with Absolute H2O2 production in whole water light samples: 
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_abs_prod[[i]] <- cor(Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i],
      Merged_Prod_Decay_WL_only$PH2O2_avg, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_abs_prod[[i]] <- lm(Merged_Prod_Decay_WL_only$PH2O2_avg ~ Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_abs_prod_table[[i]] <- glance(WL_lm_results_abs_prod[[i]])
}

#For each statistically significant correlation, get the slope and intercept and other stats:  
for (i in seq_along(WL_lm_results_abs_prod)){
  if (WL_lm_results_abs_prod_table[[i]]$p.value < 0.05){
    print(vars[i])
    print(WL_cor_results_abs_prod[[i]])
    print(summary(WL_lm_results_abs_prod[[i]]))
  }
}
```
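
The loop for absolute production repeats the one used for net production with only the response column changed. One way to avoid the duplication is a small function that takes the response name as an argument. The sketch below is base R on simulated data. `screen_predictors()`, `toy`, and its columns are hypothetical, and the function is an illustration rather than the code that produced the results in this notebook.

```r
# Hypothetical helper: regress one response on each predictor in turn and
# collect Pearson's r, the slope p-value, R^2, and the mean absolute error.
screen_predictors <- function(df, response, predictors, alpha = 0.05) {
  rows <- lapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = df, na.action = na.omit)
    s   <- summary(fit)
    data.frame(predictor = p,
               pearson_r = cor(df[[p]], df[[response]], use = "complete.obs"),
               p_value   = s$coefficients[2, 4],
               r_squared = s$r.squared,
               mae       = mean(abs(residuals(fit))))
  })
  out <- do.call(rbind, rows)
  out$significant <- out$p_value < alpha
  out
}

# Simulated stand-in data with one missing predictor value.
set.seed(42)
toy <- data.frame(Chla = runif(25, 1, 80), pH = rnorm(25, 8.5, 0.3))
toy$PH2O2_avg <- 20 + 2 * toy$Chla + rnorm(25, sd = 15)
toy$Net_production_avg <- toy$PH2O2_avg - 30
toy$Chla[3] <- NA

# The same function serves both responses.
screen_predictors(toy, "PH2O2_avg", c("Chla", "pH"))
screen_predictors(toy, "Net_production_avg", c("Chla", "pH"))
```
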
Make plots of the regressions between absolute H2O2 production and chlorophyll a, respiration, and primary production:  
```{r}
#Plot the regression with chlorophyll a:
Chla_vs_abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Chla, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=Chla-Chla_CI, xmax=Chla+Chla_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 0, b = 0, l = 5)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,450, by = 50)) +
  coord_cartesian(ylim=c(0,450), xlim=c(0,100)) +
  xlab(expression("Chlorophyll a ("*mu*"g/L)")) +
  ylab(expression("P"[spike]*"(nM/hr)"))

#Plot the regression with Respiration:
Resp_vs_abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Resp, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=Resp-Resp_CI, xmax=Resp+Resp_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0, 450, by = 50)) +
  coord_cartesian(ylim=c(0, 450), xlim=c(0,100)) +
  xlab(expression("Respiration ("*mu*"M O"[2]*"/day)")) +
  ylab(expression("P"[spike]*"(nM/hr)"))

#Plot the regression with Primary Production:
PrimProd_vs_abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=PrimProd, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12)) +
  scale_y_continuous(breaks=seq(0,450, by = 50)) +
  coord_cartesian(ylim=c(0,450), xlim=c(0,75)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("P"[spike]*"(nM/hr)"))

combined_regressions_abs_prod <- Chla_vs_abs_prod + PrimProd_vs_abs_prod + Resp_vs_abs_prod + plot_layout(ncol = 3)

combined_regressions_abs_prod
ggsave("combined_regressions_abs_prod.pdf",  combined_regressions_abs_prod, width = 12, height = 6, units = "in", dpi=300)
```
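
The regression panels in this notebook share almost identical `theme()` settings, differing mainly in text size and legend position. A minimal sketch of factoring that styling into one helper is given below. `theme_regression()` and the data frame `toy` are hypothetical, the values are made up, and the helper uses `linewidth`, which assumes ggplot2 3.4.0 or later (the chunks in this notebook use the older `size` argument).

```r
library(ggplot2)

# Hypothetical helper: shared styling for the regression panels.
theme_regression <- function(text_size = 10, title_size = 12,
                             legend_position = "none") {
  theme_classic() +
    theme(axis.line         = element_line(linewidth = 0.1),
          axis.ticks        = element_line(linewidth = 0.1),
          axis.ticks.length = unit(-0.1, "cm"),   # ticks point inward
          axis.text         = element_text(size = text_size, color = "black"),
          axis.title        = element_text(size = title_size, color = "black"),
          legend.position   = legend_position)
}

# Made-up data, only to show the helper in use.
toy <- data.frame(Chla      = c(5, 20, 40, 60, 80),
                  PH2O2_avg = c(30, 90, 150, 260, 310),
                  Year      = factor(c(2017, 2018, 2018, 2019, 2019)))

p <- ggplot(toy, aes(x = Chla, y = PH2O2_avg, color = Year)) +
  geom_smooth(method = lm, formula = y ~ x, color = "navy",
              fill = "lightsteelblue", se = TRUE) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values = c("red", "blue", "orange"), name = "Year") +
  theme_regression(legend_position = "top") +
  coord_cartesian(ylim = c(0, 450), xlim = c(0, 100))
p
```

Per-panel differences such as a blank y-axis title can still be added with a further `theme()` call after the helper.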

Make a separate plot of all the FDOM regressions to show Rose:  
```{r}
#Plot Peak A vs Net H2O2:
PeakA_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=peakA, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.05, size = 0.1) +
  geom_errorbarh(aes(xmin=peakA-peakA_CI, xmax=peakA+peakA_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,4)) +
  xlab(expression("FDOM peak A")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Peak C vs Net H2O2:
PeakC_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=peakC, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.05, size = 0.1) +
  geom_errorbarh(aes(xmin=peakC-peakC_CI, xmax=peakC+peakC_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,1.5)) +
  xlab(expression("FDOM peak C")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Peak T vs Net H2O2:
PeakT_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=peakT, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.02, size = 0.1) +
  geom_errorbarh(aes(xmin=peakT-peakT_CI, xmax=peakT+peakT_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,0.6)) +
  xlab(expression("FDOM peak T")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot C/A ratio vs Net H2O2:
CA_Ratio_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=C_A_ratio, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.005, size = 0.1) +
  geom_errorbarh(aes(xmin=C_A_ratio-C_A_ratio_CI, xmax=C_A_ratio+C_A_ratio_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0.25,0.4)) +
  xlab(expression("FDOM C/A ratio")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot T/A ratio vs Net H2O2:
TA_Ratio_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=T_A_ratio, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.005, size = 0.1) +
  geom_errorbarh(aes(xmin=T_A_ratio-T_A_ratio_CI, xmax=T_A_ratio+T_A_ratio_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,0.8)) +
  xlab(expression("FDOM T/A ratio")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot integrated fluorescence vs Net H2O2 (the data column is spelled IntFlour):
IntFluor_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=IntFlour, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=50, size = 0.1) +
  geom_errorbarh(aes(xmin=IntFlour-IntFlour_CI, xmax=IntFlour+IntFlour_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  scale_x_continuous(breaks=seq(0,18000, by = 9000)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,18000)) +
  xlab(expression("FDOM (Integrated Fluorescence)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Slope ratio vs Net H2O2:
Slope_Ratio_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=SlopeRatio, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.005, size = 0.1) +
  geom_errorbarh(aes(xmin=SlopeRatio-SlopeRatio_CI, xmax=SlopeRatio+SlopeRatio_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  scale_x_continuous(breaks=seq(0.6,2, by = 0.2)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0.6,2)) +
  xlab(expression("FDOM Slope Ratio")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot FI vs Net H2O2:
FI_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=FI, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.005, size = 0.1) +
  geom_errorbarh(aes(xmin=FI-FI_CI, xmax=FI+FI_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  scale_x_continuous(breaks=seq(1.4,2, by = 0.2)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(1.4,2)) +
  xlab(expression("FI")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

All_FDOM_plot <- PeakA_vs_Net_prod + PeakC_vs_Net_prod + PeakT_vs_Net_prod + CA_Ratio_vs_Net_prod + TA_Ratio_vs_Net_prod + IntFluor_vs_Net_prod + Slope_Ratio_vs_Net_prod + FI_vs_Net_prod + plot_layout(ncol = 4)

ggsave("All_FDOM_plot.pdf",  All_FDOM_plot, width = 12, height = 10, units = "in", dpi=300)
```
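The panels in this notebook repeat the same layers (regression line, vertical and horizontal error bars, points). As a rough sketch only, the shared part could be wrapped in a helper. `regression_panel()` and its argument names are hypothetical and are not part of the original analysis; the toy data frame is made up, and the line-width and detailed theme settings used above are left out for brevity.

```r
library(ggplot2)

# Hypothetical helper, not part of the original analysis: builds one
# regression panel from column names passed as strings.
regression_panel <- function(df, x_var, y_var, x_ci, y_ci,
                             x_lab, y_lab, bar_width, bar_height) {
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]],
                 color = as.factor(Year))) +
    # Setting color as a constant gives a single regression line for all years
    geom_smooth(method = lm, formula = y ~ x, color = "navy",
                fill = "lightsteelblue", se = TRUE) +
    # Vertical bars: response +/- its confidence interval
    geom_errorbar(aes(ymin = .data[[y_var]] - .data[[y_ci]],
                      ymax = .data[[y_var]] + .data[[y_ci]]),
                  width = bar_width) +
    # Horizontal bars: predictor +/- its confidence interval
    geom_errorbarh(aes(xmin = .data[[x_var]] - .data[[x_ci]],
                       xmax = .data[[x_var]] + .data[[x_ci]]),
                   height = bar_height) +
    geom_point(size = 0.8, alpha = 0.8) +
    theme_classic() +
    theme(legend.position = "none") +
    xlab(x_lab) +
    ylab(y_lab)
}

# Made-up data with the same column naming pattern as the merged table
set.seed(1)
toy <- data.frame(
  Year = rep(c(2017, 2018, 2019), each = 5),
  Chla = runif(15, 0, 200),
  Chla_CI = runif(15, 1, 10),
  Net_production_avg = runif(15, -50, 300),
  Net_production_CI = runif(15, 5, 30)
)

example_panel <- regression_panel(
  toy, "Chla", "Net_production_avg", "Chla_CI", "Net_production_CI",
  x_lab = expression("Chlorophyll a ("*mu*"g/L)"),
  y_lab = expression("Net H"[2]*"O"[2]*" production (nM/hr)"),
  bar_width = 5, bar_height = 11
)
```

Per-panel settings such as `coord_cartesian()` limits or axis breaks can still be added to the returned object with `+`, so each plot keeps its own ranges.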
Let's plot the regressions with the biological parameters and DOC (which were the strongest predictors of net H2O2 production rates):  
```{r}
#Plot the regression with chlorophyll a:
Chla_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Chla, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=Chla-Chla_CI, xmax=Chla+Chla_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top",
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14)) +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,200)) +
  xlab(expression("Chlorophyll a ("*mu*"g/L)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with Respiration:
Resp_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Resp, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=Resp-Resp_CI, xmax=Resp+Resp_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,200)) +
  xlab(expression("Respiration ("*mu*"M O"[2]*"/day)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with Primary Production:
PrimProd_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=PrimProd, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,100)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with CDOM:
CDOM_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=CDOM, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.5, size = 0.1) +
  geom_errorbarh(aes(xmin=CDOM-CDOM_CI, xmax=CDOM+CDOM_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,25)) +
  xlab(expression("CDOM absorbance (a305)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with DOC:
DOC_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=DOC, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=10, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  scale_x_continuous(breaks=seq(200,600, by = 100)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(200,600)) +
  xlab(expression("DOC ("*mu*"M)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

combined_bio_regressions_net_prod <- CDOM_vs_Net_prod + Chla_vs_Net_prod + PrimProd_vs_Net_prod + DOC_vs_Net_prod + Resp_vs_Net_prod + plot_layout(ncol = 3)

combined_bio_regressions_net_prod
ggsave("combined_net_prod_regressions.pdf",  combined_bio_regressions_net_prod, width = 12, height = 10, units = "in", dpi=300)
```
Now plot the regressions between net production rates and the nutrient data, which had weaker relationships (higher p-values and lower R^2 values).  
```{r}
#Plot the regression with TP:
TP_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=TP, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=TP-TP_CI, xmax=TP+TP_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,200)) +
  xlab(expression("Total P ("*mu*"g/L)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with TDP:
TDP_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=TDP, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=2, size=0.1) +
  geom_errorbarh(aes(xmin=TDP-TDP_CI, xmax=TDP+TDP_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top",
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14)) +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  scale_x_continuous(breaks=seq(0,90, by = 15)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,90)) +
  xlab(expression("TDP ("*mu*"g/L)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with SRP:
SRP_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=SRP, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=SRP-SRP_CI, xmax=SRP+SRP_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,80)) +
  xlab(expression("SRP ("*mu*"g/L)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with Nitrate:
Nitrate_vs_Net_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Nitrate, y=Net_production_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                width=0.2, size =0.1) +
  geom_errorbarh(aes(xmin=Nitrate-Nitrate_CI, xmax=Nitrate+Nitrate_CI), height=11, size=0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(-50,300, by = 50)) +
  coord_cartesian(ylim=c(-50,300), xlim=c(0,5)) +
  xlab(expression("NO"[3]*" (mg N/L)")) +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

Combined_nutrient_regressions_net_prod <- TP_vs_Net_prod + TDP_vs_Net_prod + SRP_vs_Net_prod + Nitrate_vs_Net_prod + plot_layout(ncol=3)

Combined_nutrient_regressions_net_prod
ggsave("Combined_nutrient_regressions_net_prod.pdf",  Combined_nutrient_regressions_net_prod, width = 12, height = 10, units = "in", dpi=300)
```
The regressions with net production data also support the conclusion that H2O2 production rates increase with algal biomass and microbial growth rates. There is likely a strong photochemical signal behind the observed net production, but the modeled biotic production rates indicate that biotic sources contributed significantly to total H2O2 production. The magnitude of biotic production increased with algal biomass and microbial growth rates, but not with CDOM or DOC concentrations.

Are net decay rates correlated with any biological and chemical parameters (apart from net H2O2 production)?  
```{r}
#Create empty lists to save results of the loop into:  
WL_cor_results_net_decay <- list()
WL_lm_results_net_decay <- list()
WL_lm_results_net_decay_table <- list()

#Loop through each item of the vector and find the correlation with Net H2O2 decay rate in whole water light samples: 
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_net_decay[[i]] <- cor(Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i],
      Merged_Prod_Decay_WL_only$Net_decay_avg, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_net_decay[[i]] <- lm(Merged_Prod_Decay_WL_only$Net_decay_avg ~ Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_net_decay_table[[i]] <- glance(WL_lm_results_net_decay[[i]])
}
```
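The loop above keeps one correlation and one model per variable in separate lists. As an illustrative sketch only, the same statistics can be gathered into a single data frame with base R. The toy data, `toy_vars`, and `summarise_fit()` are made up for this example and are not the notebook's objects. The p-value is taken from the model F-test, which for a one-predictor model is the value `glance()` reports.

```r
# Made-up data standing in for Merged_Prod_Decay_WL_only
set.seed(42)
toy <- data.frame(
  Net_decay_avg = rnorm(20, mean = 100, sd = 20),
  Chla = runif(20, 0, 200),
  Resp = runif(20, 0, 50)
)
toy$Chla[3] <- NA  # one missing value, to show that incomplete pairs are dropped

toy_vars <- c("Chla", "Resp")  # hypothetical subset of the vars vector

# Hypothetical helper: one row of statistics for one predictor
summarise_fit <- function(df, predictor, response) {
  fit <- lm(reformulate(predictor, response), data = df, na.action = na.omit)
  s <- summary(fit)
  f <- s$fstatistic
  data.frame(
    variable = predictor,
    pearson_r = cor(df[[predictor]], df[[response]], use = "complete.obs"),
    r_squared = s$r.squared,
    # p-value of the overall F-test
    p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    # mean absolute error of the fitted model
    mae = mean(abs(residuals(fit))),
    n = nobs(fit)
  )
}

fit_table <- do.call(
  rbind,
  lapply(toy_vars, summarise_fit, df = toy, response = "Net_decay_avg")
)
print(fit_table)
```

For a regression with a single predictor, R^2 equals the square of Pearson's R computed on the same complete cases, which gives a quick consistency check between the two columns.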

Print out the results that have significant p values:  
```{r}
for (i in seq_along(WL_lm_results_net_decay)){
  #Guard against NA p-values so the comparison cannot stop the loop:
  p_val <- WL_lm_results_net_decay_table[[i]]$p.value
  if (!is.na(p_val) && p_val < 0.05){
    print(vars[i])
    print("Pearson's R")
    print(WL_cor_results_net_decay[[i]])
    print("p-value")
    print(p_val)
    print("R2")
    print(WL_lm_results_net_decay_table[[i]]$r.squared)
    print("MAE")
    print(mean(abs(WL_lm_results_net_decay[[i]]$residuals)))
  }
}
```
Only chlorophyll a and primary production rates are significantly correlated with net H2O2 decay rates. Plot the data:  
```{r}
#Plot the regression between net decay and chlorophyll a concentration:
Chla_vs_Net_decay <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Chla, y=Net_decay_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_decay_avg-Net_decay_CI, ymax=Net_decay_avg+Net_decay_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=Chla-Chla_CI, xmax=Chla+Chla_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,500), xlim=c(0,200)) +
  xlab(expression("Chlorophyll a ("*mu*"g/L)")) +
  ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))

#Plot the regression between net decay and primary production rate:
PrimProd_vs_Net_decay <- ggplot(Merged_Prod_Decay_WL_only, aes(x=PrimProd, y=Net_decay_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_decay_avg-Net_decay_CI, ymax=Net_decay_avg+Net_decay_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,300), xlim=c(0,100)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))

Net_decay_regression <- Chla_vs_Net_decay + PrimProd_vs_Net_decay + plot_layout(ncol=2)

Net_decay_regression
ggsave("Net_decay_regression.pdf",  Net_decay_regression, width = 7, height = 5, units = "in", dpi=300)
```
Are absolute decay rate constants significantly correlated with any biological or chemical parameters?
```{r}
#Create empty lists to save results of the loop into:  
WL_cor_results_Kloss <- list()
WL_lm_results_Kloss <- list()
WL_lm_results_Kloss_table <- list()

#Loop through each item of the vector and find the correlation with Kloss in whole water light samples: 
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_Kloss[[i]] <- cor(Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i],
      Merged_Prod_Decay_WL_only$Kloss_avg, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_Kloss[[i]] <- lm(Merged_Prod_Decay_WL_only$Kloss_avg ~ Merged_Prod_Decay_WL_only[, colnames(Merged_Prod_Decay_WL_only) %in% i], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_Kloss_table[[i]] <- glance(WL_lm_results_Kloss[[i]])
}
```
There is a warning about a perfect fit because Kloss is itself one of the parameters listed in the vars vector, so it is regressed against itself.
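
A small sketch of why the self-regression is degenerate, and one possible way to avoid it. The toy data and the names `toy_vars` and `predictors_only` are made up for illustration. Whether the response appears in the real `vars` vector under the name `Kloss_avg` is an assumption.

```r
# Made-up data; Kloss_avg plays the role of both response and predictor
set.seed(7)
toy <- data.frame(Kloss_avg = runif(12, 0, 1), Chla = runif(12, 0, 100))

# Regressing a column on a copy of itself gives a slope of 1 and residuals
# near zero, which is the situation behind the perfect-fit warning
self_predictor <- toy$Kloss_avg
self_fit <- lm(toy$Kloss_avg ~ self_predictor)
print(coef(self_fit))
print(max(abs(residuals(self_fit))))

# Hypothetical vars vector: dropping the response before looping avoids the
# self-regression altogether
toy_vars <- c("Chla", "Kloss_avg")
predictors_only <- setdiff(toy_vars, "Kloss_avg")
print(predictors_only)
```

If the loop is run over a reduced vector like this, any later code that labels results by position (for example `vars[i]`) should index the same reduced vector, or use the names of the result list instead.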

Print out the results for the parameters with significant correlations:  
```{r}
for (i in seq_along(WL_lm_results_Kloss)){
  #Guard against NA p-values so the comparison cannot stop the loop:
  p_val <- WL_lm_results_Kloss_table[[i]]$p.value
  if (!is.na(p_val) && p_val < 0.05){
    print(vars[i])
    print("Pearson's R")
    print(WL_cor_results_Kloss[[i]])
    print("p-value")
    print(p_val)
    print("R2")
    print(WL_lm_results_Kloss_table[[i]]$r.squared)
    print("MAE")
    print(mean(abs(WL_lm_results_Kloss[[i]]$residuals)))
  }
}
```
Kloss is significantly correlated with chlorophyll a, carbonate concentration, respiration rate, and primary production rate. Plot the data:  
```{r}
#Plot Kloss vs carbonate concentration:
CO3_vs_Kloss <- ggplot(Merged_Prod_Decay_WL_only, aes(x=CO3, y=Kloss_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Kloss_avg-Kloss_CI, ymax=Kloss_avg+Kloss_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=CO3-CO3_CI, xmax=CO3+CO3_CI), height=0.03, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,1), xlim=c(0,100)) +
  xlab(expression("CO"[3]^"2-"*" ("*mu*"M)")) +
  ylab(expression("K"["loss,H"[2]*"O"[2]]*" (hr"^-1*")"))

#Plot Kloss vs Chl a:
Chla_vs_Kloss <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Chla, y=Kloss_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Kloss_avg-Kloss_CI, ymax=Kloss_avg+Kloss_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=Chla-Chla_CI, xmax=Chla+Chla_CI), height=0.03, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,1), xlim=c(0,100)) +
  xlab(expression("Chlorophyll a ("*mu*"g/L)")) +
  ylab(expression("K"["loss,H"[2]*"O"[2]]*" (hr"^-1*")"))

#Plot Kloss vs Respiration Rate:
Resp_vs_Kloss <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Resp, y=Kloss_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Kloss_avg-Kloss_CI, ymax=Kloss_avg+Kloss_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=Resp-Resp_CI, xmax=Resp+Resp_CI), height=0.03, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,1), xlim=c(0,50)) +
  xlab(expression("Respiration ("*mu*"M O"[2]*"/day)")) +
  ylab(expression("Kloss,H2O2 (hr-1)"))

#Plot Kloss vs Primary Production Rate
PrimProd_vs_Kloss <- ggplot(Merged_Prod_Decay_WL_only, aes(x=PrimProd, y=Kloss_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Kloss_avg-Kloss_CI, ymax=Kloss_avg+Kloss_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=0.03, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 14, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,1), xlim=c(0,60)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("Kloss,H2O2 (hr-1)"))

Kloss_regressions_plot <- CO3_vs_Kloss + Chla_vs_Kloss + Resp_vs_Kloss + PrimProd_vs_Kloss + plot_layout(ncol = 2)
ggsave("Kloss_regressions_plot.pdf",  Kloss_regressions_plot, width = 12, height = 10, units = "in", dpi=300)
```
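
The four panels above repeat nearly the same `theme()` settings, differing mainly in whether the y-axis title is drawn. As a hedged sketch, a subset of the shared settings could be collected once and added to each panel. The helper name `kloss_panel_theme`, its `show_y_title` argument, and the toy data are hypothetical and not part of the original analysis:

```{r}
library(ggplot2)

# Hypothetical helper: bundles a subset of the theme settings shared by the Kloss panels.
kloss_panel_theme <- function(show_y_title = TRUE) {
  y_title <- if (show_y_title) {
    element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0))
  } else {
    element_blank()
  }
  theme_classic() +
    theme(axis.text = element_text(size = 14, color = "black"),
          axis.title.x = element_text(size = 16, color = "black", margin = margin(t = 5)),
          axis.title.y = y_title,
          axis.ticks.length = unit(-0.1, "cm"),
          legend.position = "none")
}

# Toy data standing in for Merged_Prod_Decay_WL_only (values are made up):
toy_df <- data.frame(Chla = c(5, 20, 40, 80), Kloss_avg = c(0.1, 0.2, 0.35, 0.6))
toy_panel <- ggplot(toy_df, aes(x = Chla, y = Kloss_avg)) +
  geom_point() +
  kloss_panel_theme(show_y_title = FALSE)
```

Panels that keep the y-axis title would call `kloss_panel_theme()` with its default argument.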

Are the correlations with respiration and primary production rates still significant after removing the single high data point (2-Aug-19)?
```{r}
print("Respiration")
check_df <- filter(Merged_Prod_Decay_WL_only, Experiment_Date != "2-Aug-19")
cor(check_df$Resp, check_df$Kloss_avg, method = "pearson", use = "complete.obs")
check_resp <- lm(check_df$Kloss_avg ~ check_df$Resp, na.action = na.omit)
summary(check_resp)
mean(abs(check_resp$residuals))

print("Primary Production")
cor(check_df$PrimProd, check_df$Kloss_avg, method = "pearson", use = "complete.obs")
check_PrimProd <- lm(check_df$Kloss_avg ~ check_df$PrimProd, na.action = na.omit)
summary(check_PrimProd)
mean(abs(check_PrimProd$residuals))
```
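
Dropping a point by date works when the influential observation is already known. As a complementary sketch on made-up numbers (not the experiment data), Cook's distance from base R's `cooks.distance()` flags which observation most influences a fit, and the `data =` form of `lm()` keeps coefficient names readable. All `toy_*` names are hypothetical:

```{r}
# Toy data (made up): eight ordinary points plus one high, high-leverage point.
toy_check <- data.frame(
  Resp      = c(5, 8, 10, 12, 15, 18, 20, 22, 45),
  Kloss_avg = c(0.10, 0.12, 0.11, 0.15, 0.14, 0.18, 0.17, 0.20, 0.80)
)

# Fit with all points, using the formula/data interface:
toy_fit_all <- lm(Kloss_avg ~ Resp, data = toy_check)

# Cook's distance measures how much each observation shifts the fitted values:
toy_cooks <- cooks.distance(toy_fit_all)
toy_influential <- unname(which.max(toy_cooks))

# Refit without the most influential observation and compare slopes:
toy_fit_reduced <- lm(Kloss_avg ~ Resp, data = toy_check[-toy_influential, ])
coef(toy_fit_all)["Resp"]
coef(toy_fit_reduced)["Resp"]
```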
Now let's compare production in light vs. dark treatments and in whole water vs. <100 um filtered treatments:

Make a bar plot of gross biotic and net H2O2 production rates in dark and light treatments:  
```{r}
#exclude data without paired dark data:
Light_Dark_Biotic_PH2O2 <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  #plot for gross biotic H2O2 production:
  ggplot(aes(fill=Condition, y=Biotic_PH2O2, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=0.2,
                  position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "7-Aug-18", "14-Aug-18", "18-Sep-18",
                              "23-Jul-19", "2-Aug-19", "17-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("Dark", "Light")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 0, b = 0,l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,150)) +
    ylab(expression("Absolute biotic H"[2]*"O"[2]*" production (nM/hr)"))

#exclude data without paired dark data:
Light_Dark_Net_PH2O2 <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  #plot for net H2O2 production:
  ggplot(aes(fill=Condition, y=Net_production_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "7-Aug-18", "14-Aug-18", "18-Sep-18",
                              "23-Jul-19", "2-Aug-19", "17-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("Dark", "Light")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(-10,300)) +
    ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#exclude data without paired dark data:
Light_Dark_Biotic_PH2O2_zoom <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  #plot for gross biotic H2O2 production, zooming in on lower rates.
  ggplot(aes(fill=Condition, y=Biotic_PH2O2, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Biotic_PH2O2-Biotic_PH2O2_CI, ymax=Biotic_PH2O2+Biotic_PH2O2_CI), width=0.2,
                  position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "7-Aug-18", "14-Aug-18", "18-Sep-18",
                              "23-Jul-19", "2-Aug-19", "17-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("Dark", "Light")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 0, b = 0,l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,50)) +
    ylab(expression("Absolute biotic H"[2]*"O"[2]*" production (nM/hr)"))

#exclude data without paired dark data:
Light_Dark_Net_PH2O2_zoom <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  #plot for net H2O2 production, zooming in on lower rates
  ggplot(aes(fill=Condition, y=Net_production_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "7-Aug-18", "14-Aug-18", "18-Sep-18",
                              "23-Jul-19", "2-Aug-19", "17-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("Dark", "Light")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(-10,40)) +
    ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#exclude data without paired dark data:
Light_Dark_Total_PH2O2 <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  #plot for total absolute H2O2 production:
  ggplot(aes(fill=Condition, y=PH2O2_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI), width=0.2,
                  position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "7-Aug-18", "14-Aug-18", "18-Sep-18",
                              "23-Jul-19", "2-Aug-19", "17-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("Dark", "Light")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 0, b = 0,l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(0,200)) +
    ylab(expression("Absolute H"[2]*"O"[2]*" production (nM/hr)"))

#Display each plot:
Light_Dark_Net_PH2O2
Light_Dark_Net_PH2O2_zoom
Light_Dark_Biotic_PH2O2
Light_Dark_Biotic_PH2O2_zoom
Light_Dark_Total_PH2O2
```
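
Each plot above hard-codes the chronological order of the date labels in `scale_x_discrete(limits = ...)`. As a sketch of an alternative, the order can be derived by parsing the labels. This assumes the `Experiment_Date` strings follow the day-month-year pattern seen above and that the session uses an English locale for month abbreviations; the `toy_*` names are hypothetical:

```{r}
# A few date labels in the notebook's format, deliberately out of order:
toy_dates <- c("2-Aug-19", "10-Jul-18", "17-Sep-19", "7-Aug-18")

# Parse as day, abbreviated month, two-digit year; %b depends on the locale.
toy_parsed <- as.Date(toy_dates, format = "%d-%b-%y")

# Chronologically ordered labels, usable as limits = toy_limits:
toy_limits <- toy_dates[order(toy_parsed)]
toy_limits
```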
Make a plot of net H2O2 production in the filtered controls (light vs dark):
```{r}
#Import a dataframe of light and dark production in 0.22 um filtered water.
#Light-dark 0.22 um data were not included in the previous spreadsheet for the dates on which the whole water manipulation was 105 um vs WW.
#Dark controls for the 0.22 um filtered water existed on those dates, but the dataframe structure prevented including them.

#The dataframe imported below includes those data:
Light_Dark_abiotic_df <- read.table("Filtered0.22um_light_dark_prod.txt", header=TRUE, sep="\t")

Light_Dark_abiotic_plot <- ggplot(Light_Dark_abiotic_df, aes(fill=Treatment, y=H2O2_Production_rate, x=Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=H2O2_Production_rate-H2O2_production_CI, ymax=H2O2_Production_rate+H2O2_production_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "3-Aug-18", "7-Aug-18", "10-Aug-18", "14-Aug-18", "21-Aug-18", "14-Sep-18","18-Sep-18", "23-Jul-19", "2-Aug-19", "6-Aug-19", "24-Aug-19", "17-Sep-19", "20-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 10, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(-10,200)) +
    ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

Light_Dark_abiotic_plot
ggsave("Light_Dark_abiotic_plot.pdf", Light_Dark_abiotic_plot, width = 5, height = 4, units = "in", dpi=300)
```

The missing value in the gross biotic production plot corresponds to a date on which the measured data did not fit the model assumptions, so absolute production could not be calculated.

There is no net production in dark bottles, but H2O2 holds at a steady-state concentration despite simultaneous decay, which indicates light-independent gross biological production. Gross biotic production, however, is higher in all light-exposed bottles, which suggests that some biological production is light-dependent.
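
To make the steady-state argument concrete, here is a minimal sketch assuming H2O2 follows zero-order production and first-order loss, dC/dt = P - kloss * C. This model form and all numbers below are assumptions for illustration, not values from the experiments. Under it, a flat concentration in the dark implies gross production P = kloss * C at steady state:

```{r}
# Hypothetical values (not measured data):
toy_kloss <- 0.2   # first-order loss constant, hr^-1
toy_P     <- 10    # gross production rate, nM/hr
toy_C0    <- 50    # starting concentration, nM (already at steady state)

# Analytical solution of dC/dt = P - kloss * C:
toy_conc <- function(t, P, kloss, C0) {
  P / kloss + (C0 - P / kloss) * exp(-kloss * t)
}

# Concentration stays flat, so net production is zero even though P > 0:
toy_conc(c(0, 2, 4, 8), toy_P, toy_kloss, toy_C0)

# Gross production recovered from the steady-state concentration:
toy_kloss * toy_C0
```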

On average, what percentage of the total absolute H2O2 production is light-independent?  
```{r}
#Get the light data:
Light_Data <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  filter(Condition == "WL")
#Get the dark data:
Dark_Data <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  filter(Condition == "WD")
print("total absolute production")
#Get the percent dark production
Perc_Dark_Total_PH2O2 <- Dark_Data$PH2O2_avg / Light_Data$PH2O2_avg * 100
#Calculate mean and 95% CI
mean(Perc_Dark_Total_PH2O2, na.rm = TRUE)
sd(Perc_Dark_Total_PH2O2, na.rm = TRUE)/sqrt(length(Perc_Dark_Total_PH2O2))*1.96
print("total absolute biological production")
#Get the percent dark production
Perc_Dark_Biotic_PH2O2 <- Dark_Data$Biotic_PH2O2 / Light_Data$Biotic_PH2O2 * 100
#Calculate mean and 95% CI
mean(Perc_Dark_Biotic_PH2O2, na.rm = TRUE)
sd(Perc_Dark_Biotic_PH2O2, na.rm = TRUE)/sqrt(length(Perc_Dark_Biotic_PH2O2))*1.96
```
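
The ratio above relies on `Dark_Data` and `Light_Data` listing dates in the same row order, and the interval divides by a length that still counts `NA` entries. The following sketch, on made-up numbers, pairs rows explicitly by `Experiment_Date` with base `merge()` and counts only non-missing ratios. The `toy_*` names are hypothetical:

```{r}
# Made-up light and dark values, listed in different date orders:
toy_light <- data.frame(Experiment_Date = c("A", "B", "C", "D"),
                        PH2O2_avg = c(100, 80, NA, 50))
toy_dark  <- data.frame(Experiment_Date = c("D", "C", "B", "A"),
                        PH2O2_avg = c(15, 10, 20, 20))

# Pair light and dark rows by date rather than by position:
toy_pairs <- merge(toy_light, toy_dark, by = "Experiment_Date",
                   suffixes = c("_light", "_dark"))

# Percent of light production that also occurs in the dark:
toy_perc <- toy_pairs$PH2O2_avg_dark / toy_pairs$PH2O2_avg_light * 100

# Mean and normal-approximation 95% CI half-width over non-missing pairs only:
toy_n    <- sum(!is.na(toy_perc))
toy_mean <- mean(toy_perc, na.rm = TRUE)
toy_ci   <- 1.96 * sd(toy_perc, na.rm = TRUE) / sqrt(toy_n)
```

With only a handful of paired dates, a t quantile such as `qt(0.975, toy_n - 1)` gives a wider interval than the 1.96 multiplier.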

On average, 23 +/- 3 % of the total absolute production was light-independent: a substantial portion, although the majority was light-dependent.  

Summarize absolute H2O2 production and decay constants in the dark:  
```{r}
print("mean dark PH2O2")
mean(Dark_Data$PH2O2_avg)
print("dark PH2O2 95% CI")
sd(Dark_Data$PH2O2_avg)/sqrt(length(Dark_Data$PH2O2_avg))*1.96
print("min dark PH2O2")
min(Dark_Data$PH2O2_avg)
print("max. dark PH2O2")
max(Dark_Data$PH2O2_avg)
print("mean dark kloss")
mean(Dark_Data$Kloss_avg)
print("dark kloss 95% CI")
sd(Dark_Data$Kloss_avg)/sqrt(length(Dark_Data$Kloss_avg))*1.96
print("min dark kloss")
min(Dark_Data$Kloss_avg)
print("max. dark kloss")
max(Dark_Data$Kloss_avg)
```
Is the difference in biotic production in the light and dark related to any environmental parameters?
```{r}
#First, calculate a gross photobiotic production rate as the difference in gross biotic production in the light minus gross biotic production in the dark:
Light_Data$Photobiotic_PH2O2 <- Light_Data$Biotic_PH2O2 - Dark_Data$Biotic_PH2O2
Light_Data$Photobiotic_PH2O2_CI <- sqrt(Light_Data$Biotic_PH2O2_CI^2 + Dark_Data$Biotic_PH2O2_CI^2)

#Create empty lists to save results of the loop into:  
WL_cor_results_PhotoBio <- list()
WL_lm_results_PhotoBio <- list()
WL_lm_results_PhotoBio_table <- list()

#Loop through each item of the "vars" vector and find its correlation with photobiotic PH2O2 in whole water light samples:
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_PhotoBio[[i]] <- cor(Light_Data[, colnames(Light_Data) %in% i],
      Light_Data$Photobiotic_PH2O2, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_PhotoBio[[i]] <- lm(Light_Data$Photobiotic_PH2O2 ~ Light_Data[, colnames(Light_Data) %in% i],
                                    na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_PhotoBio_table[[i]] <- glance(WL_lm_results_PhotoBio[[i]])
}

#Print the pearson's r, p-values and R^2 statistics of each linear model:  
for (i in 1:length(WL_lm_results_PhotoBio)){
  print(vars[i])
  print(WL_cor_results_PhotoBio[[i]])
  print(WL_lm_results_PhotoBio_table[[i]]$p.value)
  print(WL_lm_results_PhotoBio_table[[i]]$r.squared)
}
```
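
The interval assigned to the light-minus-dark difference at the top of the chunk above adds the two half-widths in quadrature, which is the usual result for a difference of independent estimates. A small worked example with made-up numbers follows; independence of the light and dark estimates is an assumption here:

```{r}
# Made-up rates and 95% CI half-widths (nM/hr):
toy_light_rate <- 60
toy_light_ci   <- 4
toy_dark_rate  <- 15
toy_dark_ci    <- 3

# Difference and its propagated half-width:
toy_diff    <- toy_light_rate - toy_dark_rate
toy_diff_ci <- sqrt(toy_light_ci^2 + toy_dark_ci^2)
c(difference = toy_diff, ci = toy_diff_ci)
```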
Calculate RMSE and MAE for the photobio production regressions:  
```{r}
#Print RMSE and MAE for the photobiotic production regressions:
for (i in 1:length(WL_lm_results_PhotoBio)){
  print(vars[i])
  print("RMSE:")
  print(sqrt(mean((WL_lm_results_PhotoBio[[i]]$residuals)^2)))
  print("MAE:")
  print(mean(abs(WL_lm_results_PhotoBio[[i]]$residuals)))
}
```
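
For reference, the two error metrics printed above can be written as small helpers. This is a sketch with hypothetical function names (`toy_rmse`, `toy_mae`) applied to a made-up residual vector; with a fitted model, `residuals(fit)` would be passed instead:

```{r}
# Root mean squared error: penalizes large residuals more heavily.
toy_rmse <- function(res) sqrt(mean(res^2))

# Mean absolute error: average magnitude of the residuals.
toy_mae <- function(res) mean(abs(res))

# Made-up residuals:
toy_res <- c(1, -1, 2, -2)
toy_rmse(toy_res)
toy_mae(toy_res)
```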
Respiration rate, primary production, and CDOM are significantly correlated with the difference between gross biotic production in the light and in the dark. Let's plot the relationships:
```{r}
#Plot the regression with Respiration:
Resp_vs_PhotoBio <- ggplot(Light_Data, aes(x=Resp, y=Photobiotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Photobiotic_PH2O2-Photobiotic_PH2O2_CI,
                    ymax=Photobiotic_PH2O2+Photobiotic_PH2O2_CI), width=2, size = 0.08) +
  geom_errorbarh(aes(xmin=Resp-Resp_CI, xmax=Resp+Resp_CI), height=11, size = 0.08) +
  geom_point(size = 0.5, alpha = 0.8) +
  scale_color_manual(values=c("blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12 / .pt, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 14 / .pt, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14 / .pt, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12 / .pt, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.05, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,120, by = 40)) +
  coord_cartesian(ylim=c(0,120)) +
  xlab(expression("Respiration ("*mu*"M O"[2]*"/day)")) +
  ylab(expression("Light-dependent biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with Primary Production:
PrimProd_vs_PhotoBio <- ggplot(Light_Data, aes(x=PrimProd, y=Photobiotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Photobiotic_PH2O2-Photobiotic_PH2O2_CI,
                    ymax=Photobiotic_PH2O2+Photobiotic_PH2O2_CI), width=2, size = 0.08) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=11, size = 0.08) +
  geom_point(size = 0.5, alpha = 0.8) +
  scale_color_manual(values=c("blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12 / .pt, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 14 / .pt, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14 / .pt, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12 / .pt, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.05, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top",
        legend.title = element_text(size = 14 / .pt),
        legend.text = element_text(size = 12 / .pt)) +
  scale_y_continuous(breaks=seq(0,120, by = 40)) +
  coord_cartesian(ylim=c(0,120), xlim=c(0,60)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("Light-dependent biotic H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with CDOM:
CDOM_vs_PhotoBio <- ggplot(Light_Data, aes(x=CDOM, y=Photobiotic_PH2O2, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Photobiotic_PH2O2-Photobiotic_PH2O2_CI,
                    ymax=Photobiotic_PH2O2+Photobiotic_PH2O2_CI), width=0.5, size = 0.08) +
  geom_errorbarh(aes(xmin=CDOM-CDOM_CI, xmax=CDOM+CDOM_CI), height=11, size = 0.08) +
  geom_point(size = 0.5, alpha = 0.8) +
  scale_color_manual(values=c("blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 12 / .pt, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 14 / .pt, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 14 / .pt, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 12 / .pt, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.05, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,120, by = 40)) +
  coord_cartesian(ylim=c(0,120), xlim=c(0,15)) +
  xlab(expression("CDOM absorbance (a305)")) +
  ylab(expression("Light-dependent biotic H"[2]*"O"[2]*" production (nM/hr)"))


combined_photobio_regressions <- Resp_vs_PhotoBio + PrimProd_vs_PhotoBio + CDOM_vs_PhotoBio + plot_layout(ncol = 3)

combined_photobio_regressions
ggsave("combined_photobio_regressions.pdf",  combined_photobio_regressions, width = 3.5, height = 2, units = "in", dpi=300)
```
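
The text sizes in these panels are divided by `.pt`, the ggplot2 constant that converts between points and millimetres (72.27 / 25.4). Theme `element_text()` sizes are already given in points, so the division shrinks the labels by roughly a factor of 2.8, which here accompanies the small 3.5 x 2 in output. A quick check of the resulting sizes (the `toy_sizes` name is hypothetical):

```{r}
library(ggplot2)

# .pt is the number of points per millimetre used by ggplot2:
.pt

# Effective theme text sizes, in points, after dividing by .pt:
toy_sizes <- c(axis_text = 12, axis_title = 14) / .pt
round(toy_sizes, 1)
```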

It looks like the relationship might be driven by one point. Are the correlations still significant when this point is removed?
```{r}
#Remove the outlier point (which() also drops rows where Photobiotic_PH2O2 is NA, instead of returning all-NA rows):
Light_Data_no_outlier <- Light_Data[ which(Light_Data$Photobiotic_PH2O2 != max(Light_Data$Photobiotic_PH2O2, na.rm = TRUE)), ]

#Create empty lists to save results of the loop into:  
WL_cor_results_PhotoBio_no_outlier <- list()
WL_lm_results_PhotoBio_no_outlier <- list()
WL_lm_results_PhotoBio_no_outlier_table <- list()

#Loop through each item of the "vars" vector and find its correlation with photobiotic PH2O2 in whole water light samples, without the outlier:
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_PhotoBio_no_outlier[[i]] <- cor(Light_Data_no_outlier[, colnames(Light_Data_no_outlier) %in% i],
      Light_Data_no_outlier$Photobiotic_PH2O2, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_PhotoBio_no_outlier[[i]] <- lm(Light_Data_no_outlier$Photobiotic_PH2O2 ~ Light_Data_no_outlier[, colnames(Light_Data_no_outlier) %in% i], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_PhotoBio_no_outlier_table[[i]] <- glance(WL_lm_results_PhotoBio_no_outlier[[i]])
}

#Print the pearson's r, p-values and R^2 statistics of each linear model:  
for (i in 1:length(WL_lm_results_PhotoBio_no_outlier)){
  print(vars[i])
  print(WL_cor_results_PhotoBio_no_outlier[[i]])
  print(WL_lm_results_PhotoBio_no_outlier_table[[i]]$p.value)
  print(WL_lm_results_PhotoBio_no_outlier_table[[i]]$r.squared)
  print("RMSE:")
  print(sqrt(mean((WL_lm_results_PhotoBio_no_outlier[[i]]$residuals)^2)))
  print("MAE:")
  print(mean(abs(WL_lm_results_PhotoBio_no_outlier[[i]]$residuals)))
}
```
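
The screening loop is repeated several times in this notebook with only the data frame and response changing. As a hedged sketch, it could be wrapped in a function. `screen_predictors` and the toy data are hypothetical, and base `summary()` is used here in place of `broom::glance()` to keep the example free of extra dependencies:

```{r}
# Hypothetical helper: regress `response` on each predictor in turn.
screen_predictors <- function(df, response, predictors) {
  rows <- lapply(predictors, function(v) {
    fit <- lm(reformulate(v, response = response), data = df, na.action = na.omit)
    s <- summary(fit)
    data.frame(
      predictor = v,
      pearson_r = cor(df[[v]], df[[response]], method = "pearson", use = "complete.obs"),
      p_value   = coef(s)[2, "Pr(>|t|)"],
      r_squared = s$r.squared,
      rmse      = sqrt(mean(residuals(fit)^2)),
      mae       = mean(abs(residuals(fit)))
    )
  })
  do.call(rbind, rows)
}

# Toy data (made up) with one missing predictor value:
toy_screen <- data.frame(
  Photobiotic_PH2O2 = c(10, 22, 29, 41, 52, 58),
  Resp = c(1, 2, 3, 4, 5, 6),
  CDOM = c(4, NA, 3, 5, 2, 6)
)
toy_screen_results <- screen_predictors(toy_screen, "Photobiotic_PH2O2", c("Resp", "CDOM"))
toy_screen_results
```

Returning one row per predictor keeps the statistics in a single table rather than a stream of `print()` output.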

Are any biological or chemical parameters correlated with absolute dark production?
```{r}
#Create empty lists to save results of the loop into:  
WL_cor_results_dark <- list()
WL_lm_results_dark <- list()
WL_lm_results_dark_table <- list()

#Loop through each item of the "vars" vector and calculate the correlation with absolute H2O2 production in the dark bottles:
for (i in vars){
  #Calculate the Pearson's R correlation statistic, ignoring sample pairs which have NAs
  WL_cor_results_dark[[i]] <- cor(Dark_Data[, colnames(Dark_Data) %in% i],
      Dark_Data$PH2O2_avg, method = "pearson", use = "complete.obs")
  #Build a linear model for each correlation:
  WL_lm_results_dark[[i]] <- lm(Dark_Data$PH2O2_avg ~ Dark_Data[, colnames(Dark_Data) %in% i], na.action = na.omit)
  #Get the statistics into a table format
  WL_lm_results_dark_table[[i]] <- glance(WL_lm_results_dark[[i]])
}

#Print the pearson's r, p-values and R^2 statistics of each linear model:  
for (i in 1:length(WL_lm_results_dark)){
  print(vars[i])
  print("pearson's R:")
  print(WL_cor_results_dark[[i]])
  print("p-value")
  print(WL_lm_results_dark_table[[i]]$p.value)
  print("R2:")
  print(WL_lm_results_dark_table[[i]]$r.squared)
}
```
There are no significant correlations between absolute production in dark bottles and any biological or chemical parameters.

Now let's look at the differences between the whole water and 100 um filtered water (which removed large phytoplankton and Microcystis colonies):

Make a bar plot of absolute PH2O2 in whole water and <100 um water:
```{r}
Size_Frac_PH2O2 <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac") %>%
  #plot for absolute H2O2 production:
  ggplot(aes(fill=Condition, y=PH2O2_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI), width=0.2, size = 0.08, position = position_dodge(0.9)) +
    scale_x_discrete(limits=c("3-Aug-18", "10-Aug-18", "21-Aug-18", "14-Sep-18",
                              "6-Aug-19", "24-Aug-19", "20-Sep-19")) +
    scale_fill_manual(values=c("lightgreen", "green4"), labels=c("105 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    scale_y_continuous(breaks=seq(0,90, by = 15)) +
    coord_cartesian(ylim=c(0,90)) +
    ylab(expression("Absolute H"[2]*"O"[2]*" production (nM/hr)"))

#exclude data without paired size fractionated data:
Size_Frac_Net_PH2O2 <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac") %>%
  #plot for net H2O2 production:
  ggplot(aes(fill=Condition, y=Net_production_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI, ymax=Net_production_avg+Net_production_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("3-Aug-18", "10-Aug-18", "21-Aug-18", "14-Sep-18",
                              "6-Aug-19", "24-Aug-19", "20-Sep-19")) +
    scale_fill_manual(values=c("lightgreen", "green4"), labels=c("105 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(0,50)) +
    ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Combine the plots of H2O2 production in 105 um filtration experiments with the plots of the light-dark experiments:
#For absolute H2O2 production rates
Treatment_plot_abs_production <- Light_Dark_Total_PH2O2  + Size_Frac_PH2O2
#For net production rates:
Treatment_plot_net_production <- Light_Dark_Net_PH2O2_zoom   + Size_Frac_Net_PH2O2
ggsave("Treatment_plot_abs_production.pdf",  Treatment_plot_abs_production, width = 7, height = 4, units = "in", dpi=300)
ggsave("Treatment_plot_net_production.pdf",  Treatment_plot_net_production, width = 7, height = 4, units = "in", dpi=300)

Treatment_plot_abs_production
Treatment_plot_net_production
```
There are values missing in the absolute production for two dates, because the data did not fit the spike-batch incubation model for those dates.  

Plot H2O2 decay rates in whole vs. 105 um filtered water and in light vs. dark bottles:  
```{r}
#This is for the net decay in size fractionated experiment dates:
Size_Frac_Decay <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac") %>%
  ggplot(aes(fill=Condition, y=Net_decay_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Net_decay_avg-Net_decay_CI, ymax=Net_decay_avg+Net_decay_CI), width=0.2,
                  position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("3-Aug-18", "10-Aug-18", "21-Aug-18", "14-Sep-18", "6-Aug-19", "24-Aug-19", "20-Sep-19")) +
    scale_fill_manual(values=c("lightgreen", "green4"), labels=c("105 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))

#Below is for net decay in the light-dark experiments:
Light_Dark_Decay <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  ggplot(aes(fill=Condition, y=Net_decay_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Net_decay_avg-Net_decay_CI, ymax=Net_decay_avg+Net_decay_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "7-Aug-18", "14-Aug-18", "18-Sep-18", "23-Jul-19", "2-Aug-19", "17-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("Dark", "Light")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(0,300)) +
    ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))

#This is for absolute kloss in size fractionated experiment dates:
Size_Frac_Kloss <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac") %>%
  ggplot(aes(fill=Condition, y=Kloss_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Kloss_avg-Kloss_CI, ymax=Kloss_avg+Kloss_CI), width=0.2,
                  position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("3-Aug-18", "10-Aug-18", "21-Aug-18", "14-Sep-18", "6-Aug-19", "24-Aug-19", "20-Sep-19")) +
    scale_fill_manual(values=c("lightgreen", "green4"), labels=c("105 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(0,0.5)) +
    ylab(expression("K"["loss,H"[2]*"O"[2]]*" (hr"^-1*")"))

#This is for absolute Kloss in light-dark experiments:
Light_Dark_Kloss <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark") %>%
  ggplot(aes(fill=Condition, y=Kloss_avg, x=Experiment_Date)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Kloss_avg-Kloss_CI, ymax=Kloss_avg+Kloss_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.08) +
    scale_x_discrete(limits=c("10-Jul-18", "24-Jul-18", "31-Jul-18", "7-Aug-18", "14-Aug-18", "18-Sep-18", "23-Jul-19", "2-Aug-19", "17-Sep-19")) +
    scale_fill_manual(values=c("red", "lightblue"), labels=c("Dark", "Light")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
    coord_cartesian(ylim=c(0,0.9)) +
    scale_y_continuous(breaks=seq(0,0.9, by=0.3)) +
    ylab(expression("K"["loss,H"[2]*"O"[2]]*" (hr"^-1*")"))

#Combine the net decay panels into one plot:
Treatment_plot_net_decay <- Light_Dark_Decay + Size_Frac_Decay
ggsave("Treatment_plot_net_decay.pdf",  Treatment_plot_net_decay, width = 7, height = 4, units = "in", dpi=300)
Treatment_plot_net_decay
#Combine the absolute Kloss panels into one plot:
Treatment_plot_Kloss <- Light_Dark_Kloss + Size_Frac_Kloss
ggsave("Treatment_plot_Kloss.pdf",  Treatment_plot_Kloss, width = 7, height = 4, units = "in", dpi=300)
Treatment_plot_Kloss
```
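
The bar plots above all repeat the same `theme()` block. As a minimal sketch (not part of the original analysis), the shared settings could be collected in one helper and added to each plot. The helper name `theme_h2o2_bar` and the toy data frame are hypothetical, and the values in the toy data are invented.

```r
library(ggplot2)

# Hypothetical helper collecting the theme settings repeated in the bar plots.
# `size` is kept to match the notebook; ggplot2 >= 3.4.0 prefers `linewidth` for lines.
theme_h2o2_bar <- function() {
  y_margin <- margin(t = 0, r = 5, b = 0, l = 0)
  theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size = 0.1),
          axis.line.y = element_line(size = 0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = y_margin),
          axis.title.y = element_text(size = 14, color = "black", margin = y_margin),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size = 0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12))
}

# Toy data with invented values, only to show the helper in use.
toy_df <- data.frame(Experiment_Date = rep(c("3-Aug-18", "10-Aug-18"), each = 2),
                     Condition = rep(c("FL", "WL"), times = 2),
                     Net_decay_avg = c(80, 120, 60, 65))

toy_plot <- ggplot(toy_df, aes(fill = Condition, y = Net_decay_avg, x = Experiment_Date)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  scale_fill_manual(values = c("lightgreen", "green4"),
                    labels = c("105 um filtered", "Whole water")) +
  theme_h2o2_bar() +
  ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))
```

Plot-specific pieces such as `coord_cartesian()`, `scale_x_discrete()`, and the fill colors would still be set per plot, since they differ between the light-dark and size fractionation panels.
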
Are the differences in net decay between whole water and 105 um filtered water significant?  
```{r}
#Get a vector of dates to loop through:
#I removed the 14-Sep-18 experiment here because n=1 for whole water decay rates.
dates <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac" & Experiment_Date != "14-Sep-18")
dates <- unique(dates$Experiment_Date)

#Do a t_test for each date. Test if mean whole water decay is different from 105 um filtered water decay:
for (i in dates){
  WW <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  filt <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "FL")
  print(i)
  print(t.test(filt$Net_decay, WW$Net_decay, paired = FALSE, alternative = "two.sided"))
}
```
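
The loop above prints a full test report for each date. A compact alternative, sketched here in base R with made-up replicate values, collects the Welch statistic and p-value for each date into one data frame. The `toy_reps` numbers and the `welch_by_date` helper are hypothetical and do not come from the experiments.

```r
# Toy replicate data; values are invented for illustration only.
toy_reps <- data.frame(
  Experiment_Date = rep(c("3-Aug-18", "10-Aug-18"), each = 6),
  Condition = rep(rep(c("WL", "FL"), each = 3), times = 2),
  Net_decay = c(120, 131, 125, 80, 86, 91,
                60, 64, 58, 59, 66, 61),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one Welch two-sided t-test per date, returned as a table.
welch_by_date <- function(dat, value_col) {
  dates <- unique(dat$Experiment_Date)
  rows <- lapply(dates, function(d) {
    ww   <- dat[dat$Experiment_Date == d & dat$Condition == "WL", value_col]
    filt <- dat[dat$Experiment_Date == d & dat$Condition == "FL", value_col]
    res <- t.test(filt, ww, paired = FALSE, alternative = "two.sided")
    data.frame(Experiment_Date = d,
               mean_filt = mean(filt),
               mean_ww = mean(ww),
               t = unname(res$statistic),
               welch_df = unname(res$parameter),
               p_value = res$p.value)
  })
  do.call(rbind, rows)
}

welch_table <- welch_by_date(toy_reps, "Net_decay")
print(welch_table)
```

The same helper could be pointed at other columns (for example `"Net_production"` or `"Kloss"`) by changing `value_col`, assuming the replicate data frame has the same layout.
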
Are the differences between net H2O2 production in the size fractionated experiments significant?
```{r}
#Get a vector of dates to loop through:
dates <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac")
dates <- unique(dates$Experiment_Date)

#Do a t_test for each date.
for (i in dates){
  WW <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  filt <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "FL")
  print(i)
  print(t.test(filt$Net_production, WW$Net_production, paired = FALSE, alternative = "two.sided"))
}
```
Are absolute production rates significantly different between 105 um filtered water and whole water?
```{r}
#Get a vector of dates to loop through:
#Removed 14-Sep-18 because whole water has n=1
#Removed 6-Aug-19 and 24-Aug-19 because data did not fit absolute production and decay model.
date_drop <- c("6-Aug-19", "24-Aug-19", "14-Sep-18")
dates <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac" & !(Experiment_Date %in% date_drop))
dates <- unique(dates$Experiment_Date)

#Do a t-test for each date (Welch's two-sided test).
for (i in dates){
  WW <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  filt <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "FL")
  print(i)
  print(t.test(filt$PH2O2, WW$PH2O2, paired = FALSE, alternative = "two.sided"))
}
```
Is Kloss significantly different in size fractionated experiments?  
```{r}
#Get a vector of dates to loop through:
#Removed 14-Sep-18 because whole water has n=1
#Removed 6-Aug-19 and 24-Aug-19 because data did not fit absolute production and decay model.
date_drop <- c("6-Aug-19", "24-Aug-19", "14-Sep-18")
dates <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac" & !(Experiment_Date %in% date_drop))
dates <- unique(dates$Experiment_Date)

#Do a t-test for each date (Welch's two-sided test).
for (i in dates){
  WW <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  filt <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "FL")
  print(i)
  print(t.test(filt$Kloss, WW$Kloss, paired = FALSE, alternative = "two.sided"))
}
```
Are the differences between net H2O2 decay in the light and dark significant?
```{r}
#Get a vector of dates to loop through:
dates <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark")
dates <- unique(dates$Experiment_Date)

#Do a t_test for each date.
for (i in dates){
  Light <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  Dark <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WD")
  print(i)
  print(t.test(Dark$Net_decay, Light$Net_decay, paired = FALSE, alternative = "two.sided"))
}
```
Are the differences between Kloss in light and dark bottles significant?
```{r}
#Get a vector of dates to loop through:
#Removed 23-Jul-19 because Kloss could not be calculated. Data did not fit spike-batch incubation model.
dates <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark" & Experiment_Date != "23-Jul-19")
dates <- unique(dates$Experiment_Date)

#Do a t_test for each date.
for (i in dates){
  Light <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  Dark <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WD")
  print(i)
  print(t.test(Dark$Kloss, Light$Kloss, paired = FALSE, alternative = "two.sided"))
}
```
Are the differences between absolute production in light and dark bottles significant?
```{r}
#Get a vector of dates to loop through:
#Removed 23-Jul-19 because absolute production could not be calculated. Data did not fit spike-batch incubation model.
dates <- filter(Merged_Prod_Decay_df, Experiment_type == "Light_Dark" & Experiment_Date != "23-Jul-19")
dates <- unique(dates$Experiment_Date)

#Do a t_test for each date.
for (i in dates){
  Light <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WL")
  Dark <- filter(Prod_Decay_df, Experiment_Date == i & Condition == "WD")
  print(i)
  print(t.test(Dark$PH2O2, Light$PH2O2, paired = FALSE, alternative = "two.sided"))
}
```

By how much were chlorophyll a concentrations reduced in the 105 um filtered experiments?  
```{r}
size_frac_df <- filter(Merged_Prod_Decay_df, Experiment_type == "Size_Frac")
ww_df <- filter(size_frac_df, Condition == "WL")
filt_df <- filter(size_frac_df, Condition == "FL")
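#Chla_decline is the percent of whole water Chla remaining after filtration.
#This assumes the rows of ww_df and filt_df are in the same date order.
filt_df$Chla_decline <- filt_df$Chla / ww_df$Chla * 100
print("mean")
mean(100 - filt_df$Chla_decline)
print("95% CI half-width")
sd(100 - filt_df$Chla_decline)/sqrt(length(filt_df$Chla_decline))*1.96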
print("min")
min(100 - filt_df$Chla_decline)
print("max")
max(100 - filt_df$Chla_decline)
```
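
The calculation above divides the filtered values by the whole water values row by row, which only works if both subsets list the dates in the same order. The sketch below makes the pairing explicit by merging on date first. The `toy_chla` values are invented, and the `_WL`/`_FL` column suffixes are an assumption made for this illustration.

```r
# Toy data with the two conditions deliberately listed in different date orders.
toy_chla <- data.frame(
  Experiment_Date = c("3-Aug-18", "10-Aug-18", "21-Aug-18",
                      "21-Aug-18", "3-Aug-18", "10-Aug-18"),
  Condition = c("WL", "WL", "WL", "FL", "FL", "FL"),
  Chla = c(100, 50, 80, 20, 40, 10),
  stringsAsFactors = FALSE
)

toy_ww <- toy_chla[toy_chla$Condition == "WL", c("Experiment_Date", "Chla")]
toy_fl <- toy_chla[toy_chla$Condition == "FL", c("Experiment_Date", "Chla")]

# Pair whole water and filtered values by date rather than by row position.
toy_paired <- merge(toy_ww, toy_fl, by = "Experiment_Date", suffixes = c("_WL", "_FL"))
toy_paired$Pct_reduction <- 100 - toy_paired$Chla_FL / toy_paired$Chla_WL * 100

mean_reduction <- mean(toy_paired$Pct_reduction)
# Half-width of a normal-approximation 95% confidence interval on the mean.
ci_half_width <- sd(toy_paired$Pct_reduction) / sqrt(nrow(toy_paired)) * 1.96
print(toy_paired)
```
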

Plot the difference:  
```{r}
Chla_filt_vs_ww <- ggplot(size_frac_df, aes(y=Chla, x=Experiment_Date, fill=Condition)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Chla-Chla_CI, ymax=Chla+Chla_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.2) +
    scale_x_discrete(limits=c("3-Aug-18", "10-Aug-18", "21-Aug-18", "14-Sep-18",
                              "6-Aug-19", "24-Aug-19", "20-Sep-19")) +
    scale_fill_manual(values=c("lightgreen", "green4"), labels=c("105 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
  coord_cartesian(ylim=c(0,180)) +
  scale_y_continuous(breaks=seq(0,180, by=30)) +
  ylab(expression("Chlorophyll a ("*mu*"g/L)"))

Chla_filt_vs_ww
```
Is the reduction in chlorophyll on each date significant?  
```{r}
#Import a dataframe of replicate chlorophyll a values on each date:
chl_t_test_df <- read.table("Chl_t_test_df.txt", header=TRUE, sep="\t")

#Get a vector of dates to loop through:
dates <- unique(chl_t_test_df$Date_Collected)

#Do a t_test for each date. Test if mean chlorophyll a concentrations in filtered water are significantly lower than mean chlorophyll concentrations in whole water on the same date.
for (i in dates){
  WW <- filter(chl_t_test_df, Date_Collected == i & Condition == "WL")
  filt <- filter(chl_t_test_df, Date_Collected == i & Condition == "FL")
  print(i)
  print(t.test(filt$Chla, WW$Chla, paired = FALSE, alternative = "less"))
}
```

How was respiration rate impacted by 105 micron filtration?
```{r}
Resp_filt_vs_ww <- ggplot(size_frac_df, aes(y=Resp, x=Experiment_Date, fill=Condition)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=Resp-Resp_CI, ymax=Resp+Resp_CI),
                  width=0.2, position=position_dodge(0.9), size = 0.2) +
    scale_x_discrete(limits=c("3-Aug-18", "10-Aug-18", "21-Aug-18", "14-Sep-18",
                              "6-Aug-19", "24-Aug-19", "20-Sep-19")) +
    scale_fill_manual(values=c("lightblue", "blue2"), labels=c("105 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
  coord_cartesian(ylim=c(0,180)) +
  scale_y_continuous(breaks=seq(0,180, by=30)) +
  ylab(expression("Respiration rate ("*mu*"M O"[2]*"/day)"))

Resp_filt_vs_ww
ggsave("Resp_WW_vs_105_barplot.pdf", Resp_filt_vs_ww, width = 12, height = 8, units = "in", dpi=300)
```
Is the reduction in respiration on each date significant?  
```{r}
#Import a dataframe of replicate Respiration values on each date:
resp_t_test_df <- read.table("Resp_t_test_df.txt", header=TRUE, sep="\t")

#Get a vector of dates to loop through:
dates <- unique(resp_t_test_df$Date_Collected)

#Do a t_test for each date. Test if mean respiration in filtered water is significantly lower than mean respiration in whole water on the same date.
for (i in dates){
  WW <- filter(resp_t_test_df, Date_Collected == i & Condition == "WW")
  filt <- filter(resp_t_test_df, Date_Collected == i & Condition == "FL")
  print(i)
  print(t.test(filt$Resp, WW$Resp, paired = FALSE, alternative = "less"))
}
```

Analysis with sequencing data
----
First, let's load the data and convert to absolute abundance.  
```{r}
#Import the dataframes:
LE_H2O2.shared <- read.table("LE_H2O2.shared", header = TRUE, sep="\t")
LE_H2O2.metadata <- read.table("LE_H202.metadata.txt", header= TRUE, sep="\t")
LE_H2O2.taxonomy <- read.table("LE_H2O2.taxonomy", header= TRUE, sep="\t")

#Fix up OTU table and merge with taxonomy table:
rownames(LE_H2O2.shared) <- LE_H2O2.shared$Group
drop <- c("label", "numOtus", "Group")
LE_H2O2.shared <- LE_H2O2.shared[ , !(names(LE_H2O2.shared) %in% drop)]
LE_H2O2.shared <- t(LE_H2O2.shared) #transpose table so that OTUs are row names for merging
LE_H2O2.shared <- LE_H2O2.shared[rowSums(LE_H2O2.shared) != 0,] #Remove OTUs with zero abundance in all samples
rownames(LE_H2O2.taxonomy) <- LE_H2O2.taxonomy$OTU
drop <- "OTU"
LE_H2O2.taxonomy <- LE_H2O2.taxonomy[ , !(names(LE_H2O2.taxonomy) %in% drop)]
LE_H2O2.taxonomy <- LE_H2O2.taxonomy[ LE_H2O2.taxonomy$Size > 2, ] #Only keep OTUs that have more than 2 sequences
LE_H2O2.merged <- merge(LE_H2O2.shared, LE_H2O2.taxonomy, by="row.names", all = FALSE)
LE_H2O2.merged$Taxonomy <- gsub("\\([0-9]*\\)", "", LE_H2O2.merged$Taxonomy)
#Divide the taxonomy into separate columns for each rank for sorting
LE_H2O2.merged <- separate(LE_H2O2.merged, Taxonomy, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), sep=";", remove=TRUE)

#Remove samples that were controls and had a low number of reads:
drop <- LE_H2O2.metadata$Sortchem[ LE_H2O2.metadata$Enough_reads == "No" ]
LE_H2O2.merged <- LE_H2O2.merged[ , !(names(LE_H2O2.merged) %in% drop)]

#Remove samples that were cross-contaminated during plate loading:
drop <- c("E2019_1152_1", "E2019_1177_1")
LE_H2O2.merged <- LE_H2O2.merged[ , !(names(LE_H2O2.merged) %in% drop)]

#Sum the T. thermophilus reads in each sample, then remove those OTUs from the data:
LE_H2O2.merged.std_reads <- as.data.frame(colSums(LE_H2O2.merged[2:214][LE_H2O2.merged$Genus == "Thermus",]))
colnames(LE_H2O2.merged.std_reads) <- c("Total_Thermus_reads")
LE_H2O2.merged.std_reads$Sortchem <- row.names(LE_H2O2.merged.std_reads)
LE_H2O2.merged <- LE_H2O2.merged[LE_H2O2.merged$Genus != "Thermus", ]

#Convert dataframe to long format:
LE_H2O2.merged.long <- gather(LE_H2O2.merged, Sortchem, Counts, 2:214)

#Add the tallied Thermus reads to the dataframe:
LE_H2O2.merged.long <- merge(LE_H2O2.merged.long, LE_H2O2.merged.std_reads, by="Sortchem", all = TRUE)

#Add metadata:
LE_H2O2.merged.long <- merge(LE_H2O2.merged.long, LE_H2O2.metadata, by="Sortchem", all.x = TRUE, all.y = FALSE)

#Calculate absolute abundance:
LE_H2O2.merged.long$Reads_mL <- (LE_H2O2.merged.long$Counts * LE_H2O2.merged.long$Spiked_copy_number) / (LE_H2O2.merged.long$Total_Thermus_reads * LE_H2O2.merged.long$mLs_filtered)

#Calculate mean abundance, sd, se, and 95% CI for all OTUs in each sample:
LE_H2O2.merged.summary <- LE_H2O2.merged.long %>%
  group_by(Row.names, Experiment_Date, Experiment_type, Condition, Phylum, Class, Order, Family, Genus) %>%
  summarise(n=n(), mean=mean(Reads_mL), sd=sd(Reads_mL)) %>%
  mutate(se=sd/sqrt(n)) %>%
  mutate(ci=se*1.96)

#Merge with the environmental data, only keep dates that have both H2O2 data and 16S data:  
LE_H2O2.merged.summary.environ <- merge(Merged_Prod_Decay_df, LE_H2O2.merged.summary,
                                        by=c("Experiment_Date", "Condition", "Experiment_type"), all = FALSE)

#Remove columns carried over from the other dataframes that are no longer needed:
remove <- c("n.x", "n.y", "sd", "se", "Notes")

LE_H2O2.merged.summary.environ <- LE_H2O2.merged.summary.environ[ , !(colnames(LE_H2O2.merged.summary.environ) %in% remove)]

#rename a few columns so that it is more clear what they are:
colnames(LE_H2O2.merged.summary.environ)[86] <- "OTU"
colnames(LE_H2O2.merged.summary.environ)[92] <- "OTU_mean_abund"
colnames(LE_H2O2.merged.summary.environ)[93] <- "OTU_CI"
```
The warning about expecting 6 pieces comes from the trailing semicolon at the end of each string in the Taxonomy column. `separate()` treats the empty text after that semicolon as a seventh field, and because only six column names were supplied, it discards that field. This is fine: nothing follows the final semicolon, so there is nothing worth keeping.  
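
To illustrate where the extra piece comes from, here is a small base R sketch using a made-up taxonomy string (the string is hypothetical; only the trailing-semicolon format is taken from the notebook). Six delimiters imply seven fields, the last of which is empty. Stripping the final semicolon first leaves exactly six.

```r
# Made-up taxonomy string in the same format, ending with a semicolon.
toy_tax <- "Bacteria;Cyanobacteria;Cyanobacteriia;Cyanobacteriales;Microcystaceae;Microcystis_PCC-7914;"

# Count the delimiters: 6 semicolons means 7 fields, the last one empty.
n_delims <- lengths(regmatches(toy_tax, gregexpr(";", toy_tax, fixed = TRUE)))

# Remove the trailing semicolon, then split into the six named ranks.
toy_tax_clean <- sub(";$", "", toy_tax)
toy_ranks <- strsplit(toy_tax_clean, ";", fixed = TRUE)[[1]]
names(toy_ranks) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus")
print(toy_ranks)
```

Alternatively, `separate()` accepts `extra = "drop"`, which discards the additional piece without emitting the warning.
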

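The absolute abundance step scales each OTU's read count by the ratio of spiked internal standard copies to recovered Thermus reads, then divides by the volume filtered. Below is a worked example with invented numbers. None of these values come from the dataset, and the interpretation of `Spiked_copy_number` as the number of standard copies added per sample is an assumption.

```r
# Invented values for a single OTU in a single sample.
toy_counts <- 500               # reads assigned to the OTU
toy_spiked_copy_number <- 1e6   # assumed: copies of the T. thermophilus standard added
toy_total_thermus_reads <- 2000 # reads recovered from the standard in this sample
toy_mLs_filtered <- 100         # volume of water filtered (mL)

# Same formula as the Reads_mL calculation in the notebook.
toy_reads_mL <- (toy_counts * toy_spiked_copy_number) /
  (toy_total_thermus_reads * toy_mLs_filtered)
toy_reads_mL
```

Here each recovered Thermus read stands for 1e6 / 2000 = 500 copies, so 500 OTU reads correspond to 250,000 copies, or 2500 per mL in 100 mL. A sample with zero recovered Thermus reads would give `Inf` or `NaN`, so such samples need to be excluded beforehand.
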
By how much was Microcystis abundance reduced in the 105 um filtered experiments on average?  
```{r}
#Sum all the counts for each Microcystis OTU in each sample:
#There is only 1 major OTU, the rest have hundreds or tens of reads:
Microcystis_abundance_df <- filter(LE_H2O2.merged.long, Genus == "Microcystis_PCC-7914") %>%
  group_by(Experiment_Date, Bottle_name, Experiment_type, Condition) %>%
  summarise(Total_Reads_mL=sum(Reads_mL))
Microcystis_abundance_df$Total_Reads_mL <- round(Microcystis_abundance_df$Total_Reads_mL, 0)

#Reserve the size fractionated experiments for T-tests later:
Microcystis_abundance_size_frac <- filter(Microcystis_abundance_df, Experiment_type == "Size_Frac")

#Average the total Microcystis read counts across the replicate bottles for each experiment:
Microcystis_abundance_df <- Microcystis_abundance_df %>%
  group_by(Experiment_Date, Experiment_type, Condition) %>%
  summarise(n=n(), mean=mean(Total_Reads_mL), sd=sd(Total_Reads_mL)) %>%
  mutate(se=sd/sqrt(n)) %>%
  mutate(ci=se*1.96)

#Round to nearest whole number:  
Microcystis_abundance_df$mean <- round(Microcystis_abundance_df$mean, 0)
Microcystis_abundance_df$ci <- round(Microcystis_abundance_df$ci, 0)

#Calculate average reduction in Microcystis abundance:  
ww_df <- filter(Microcystis_abundance_df, Condition == "WL" & Experiment_type == "Size_Frac")
filt_df <- filter(Microcystis_abundance_df, Condition == "FL" & Experiment_type == "Size_Frac")
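#MC_decline is the percent of whole water Microcystis abundance remaining after filtration.
#This assumes the rows of ww_df and filt_df are in the same date order.
filt_df$MC_decline <- filt_df$mean / ww_df$mean * 100
print("mean")
mean(100 - filt_df$MC_decline)
print("95% CI half-width")
sd(100 - filt_df$MC_decline)/sqrt(length(filt_df$MC_decline))*1.96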
print("min")
min(100 - filt_df$MC_decline)
print("max")
max(100 - filt_df$MC_decline)
```
Plot:  
```{r}
Microcystis_filt_vs_ww <- filter(Microcystis_abundance_df, Experiment_type == "Size_Frac") %>% 
  ggplot(aes(y=mean, x=Experiment_Date, fill=Condition)) +
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci),
                  width=0.2, position=position_dodge(0.9), size = 0.2) +
    scale_x_discrete(limits=c("3-Aug-18", "10-Aug-18", "21-Aug-18", "14-Sep-18",
                              "6-Aug-19", "24-Aug-19", "20-Sep-19")) +
    scale_fill_manual(values=c("lightgreen", "green4"), labels=c("105 um filtered", "Whole water")) +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                       l = 0)),
          axis.title.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 5, b = 0,
                                                                                        l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_blank(),
          axis.text.x = element_text(size = 14, color = "black", angle = 45, vjust = 1,
                                     hjust = 1, margin = margin(t = 5, r = 5, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 12)) +
  ylab("Microcystis abundance (reads/mL)")

Size_Frac_MC_plot <- Chla_filt_vs_ww + Microcystis_filt_vs_ww
ggsave("Size_Frac_MC_plot.pdf",  Size_Frac_MC_plot, width = 7, height = 4, units = "in", dpi=300)
Size_Frac_MC_plot
```
The two rows removed are data from the 26 Jul 2019 experiment, which had oddly staggered H2O2 addition times that prevent it from being combined with the rest of the experiments.  

Are the differences significant?  
```{r}
#Get a vector of dates to loop through:
dates <- unique(Microcystis_abundance_size_frac$Experiment_Date)

#Do a t_test for each date. Test if mean Microcystis abundance in filtered water is significantly lower than mean Microcystis abundance in whole water on the same date.
for (i in dates){
  WW <- filter(Microcystis_abundance_size_frac, Experiment_Date == i & Condition == "WL")
  filt <- filter(Microcystis_abundance_size_frac, Experiment_Date == i & Condition == "FL")
  print(i)
  print(t.test(filt$Total_Reads_mL, WW$Total_Reads_mL, paired = FALSE, alternative = "less"))
}
```
P is less than 0.05 on all dates.  

Is Microcystis abundance correlated with H2O2 production rates?  
```{r}
#Get only the whole water light data from the Microcystis abundance dataframe:  
Microcystis_abundance_df_WW <- filter(Microcystis_abundance_df, Condition == "WL")
drop <- c("Condition", "Experiment_type", "n", "sd", "se")
Microcystis_abundance_df_WW <- Microcystis_abundance_df_WW[ , !(colnames(Microcystis_abundance_df_WW) %in% drop)]
colnames(Microcystis_abundance_df_WW) <- c("Experiment_Date", "Microcystis_abund", "Microcystis_CI")

#Merge with the dataframe containing the average H2O2 data:
Merged_Prod_Decay_WL_only <- merge(Merged_Prod_Decay_WL_only, Microcystis_abundance_df_WW, by="Experiment_Date", all=FALSE)
```

Calculate the correlation:  
```{r}
#Calculate correlation with absolute H2O2 production:
cor(Merged_Prod_Decay_WL_only$Microcystis_abund, Merged_Prod_Decay_WL_only$PH2O2_avg, use = "complete.obs")
lm_Microcystis_PH2O2 <- lm(PH2O2_avg ~ Microcystis_abund, data=Merged_Prod_Decay_WL_only, na.action = na.omit)
summary(lm_Microcystis_PH2O2)
```
```{r}
#Calculate correlation with net H2O2 production:
cor( Merged_Prod_Decay_WL_only$Microcystis_abund, Merged_Prod_Decay_WL_only$Net_production_avg, use = "complete.obs")
lm_Microcystis_net_H2O2 <- lm(Net_production_avg ~ Microcystis_abund, data=Merged_Prod_Decay_WL_only, na.action = na.omit)
summary(lm_Microcystis_net_H2O2)
```
```{r}
#Calculate correlation with net H2O2 decay:
cor( Merged_Prod_Decay_WL_only$Microcystis_abund, Merged_Prod_Decay_WL_only$Net_decay_avg, use = "complete.obs")
lm_Microcystis_net_H2O2_decay <- lm(Net_decay_avg ~ Microcystis_abund, data=Merged_Prod_Decay_WL_only, na.action = na.omit)
summary(lm_Microcystis_net_H2O2_decay)
```
```{r}
#Calculate correlation with Kloss:
cor( Merged_Prod_Decay_WL_only$Microcystis_abund, Merged_Prod_Decay_WL_only$Kloss_avg, use = "complete.obs")
lm_Microcystis_Kloss <- lm(Kloss_avg ~ Microcystis_abund, data=Merged_Prod_Decay_WL_only, na.action = na.omit)
summary(lm_Microcystis_Kloss)
```
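
For a simple linear regression with one predictor, the squared Pearson correlation equals the R-squared reported by `summary(lm(...))`, so the `cor()` and `lm()` calls above describe the same fit as long as both drop the same incomplete rows. A sketch with invented values (`toy_reg` is hypothetical):

```r
# Invented values, including missing entries in each column.
toy_reg <- data.frame(
  Microcystis_abund = c(1e4, 5e4, 1e5, 2e5, NA, 4e5),
  PH2O2_avg = c(12, 20, 35, 60, 40, NA)
)

# cor() with use = "complete.obs" keeps only rows where both values are present.
toy_r <- cor(toy_reg$Microcystis_abund, toy_reg$PH2O2_avg, use = "complete.obs")

# lm() with na.omit drops the same incomplete rows before fitting.
toy_fit <- lm(PH2O2_avg ~ Microcystis_abund, data = toy_reg, na.action = na.omit)
toy_r_squared <- summary(toy_fit)$r.squared

all.equal(toy_r^2, toy_r_squared)
nobs(toy_fit)
```

Checking `nobs()` on each fitted model is a quick way to see how many dates actually contributed to a given regression.
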

Plot the regressions of H2O2 production and decay (both net and absolute) against Microcystis abundance:  
```{r}
#Plot against Absolute production
MC_vs_PH2O2_plot <- filter(Merged_Prod_Decay_WL_only, Model_Fit == "Yes") %>% 
  ggplot(aes(x=Microcystis_abund, y=PH2O2_avg, color=as.factor(Year))) +
    geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
    geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI,
                      ymax=PH2O2_avg+PH2O2_CI), size = 0.2, width=1e4) +
    geom_errorbarh(aes(xmin=Microcystis_abund-Microcystis_CI,
                      xmax=Microcystis_abund+Microcystis_CI), size = 0.2, height=4) +
    geom_point() +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
          axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,400)) +
    #scale_y_continuous(breaks=seq(0,350, by=50)) +
    xlab("Microcystis abundance (reads/mL)") +
    ylab(expression("Absolute H"[2]*"O"[2]*" production (nM/hr)"))

#Plot against net production
MC_vs_net_production <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Microcystis_abund, y=Net_production_avg,
                                                                 color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI,
                    ymax=Net_production_avg+Net_production_CI), size = 0.2, width=1e4) +
  geom_errorbarh(aes(xmin=Microcystis_abund-Microcystis_CI,
                     xmax=Microcystis_abund+Microcystis_CI), size = 0.2, height=10) +
  geom_point() +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top") +
  coord_cartesian(ylim=c(-50,300)) +
  scale_y_continuous(breaks=seq(-50,300, by=50)) +
  xlab("Microcystis abundance (reads/mL)") +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

#Plot against net decay
MC_vs_net_decay <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Microcystis_abund, y=Net_decay_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_decay_avg-Net_decay_CI,
                    ymax=Net_decay_avg+Net_decay_CI), size = 0.2, width=1e4) +
  geom_errorbarh(aes(xmin=Microcystis_abund-Microcystis_CI,
                     xmax=Microcystis_abund+Microcystis_CI), size = 0.2, height=10) +
  geom_point() +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top") +
  coord_cartesian(ylim=c(-10,450)) +
  scale_y_continuous(breaks=seq(0,450, by=90)) +
  xlab("Microcystis abundance (reads/mL)") +
  ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))

MC_vs_PH2O2_plot
Microcystis_regressions <- MC_vs_net_production + MC_vs_net_decay
Microcystis_regressions
ggsave("Microcystis_regressions.pdf",  Microcystis_regressions, width = 8, height = 3.5, units = "in", dpi=300)
```
Add the Microcystis abundance regression into the panel with net production rates:
```{r}
MC_vs_net_production <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Microcystis_abund, y=Net_production_avg,
                                                                 color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=Net_production_avg-Net_production_CI,
                    ymax=Net_production_avg+Net_production_CI), size = 0.2, width=1e4) +
  geom_errorbarh(aes(xmin=Microcystis_abund-Microcystis_CI,
                     xmax=Microcystis_abund+Microcystis_CI), size = 0.2, height=10) +
  geom_point() +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(-50,300)) +
  scale_y_continuous(breaks=seq(-50,300, by=50)) +
  xlab("Microcystis abundance (reads/mL)") +
  ylab(expression("Net H"[2]*"O"[2]*" production (nM/hr)"))

combined_bio_regressions_net_prod <- CDOM_vs_Net_prod + Chla_vs_Net_prod + PrimProd_vs_Net_prod + DOC_vs_Net_prod + Resp_vs_Net_prod + MC_vs_net_production + plot_layout(ncol = 3)

combined_bio_regressions_net_prod
ggsave("combined_net_prod_regressions_with_Microcystis.pdf",  combined_bio_regressions_net_prod, width = 12, height = 10, units = "in", dpi=300)
```
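
The `theme()` settings are repeated nearly verbatim in every regression plot in this section. One way to cut the repetition is a small helper that returns the shared settings. The sketch below uses made-up data and only a subset of the settings; `theme_h2o2_regression()` is a hypothetical name, not a function defined elsewhere in this notebook.

```{r}
library(ggplot2)

#Hypothetical helper returning theme settings shared by the regression plots:
theme_h2o2_regression <- function(legend_position = "none") {
  theme_classic() +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.text = element_text(size = 10, color = "black"),
          axis.title = element_text(size = 12, color = "black"),
          axis.ticks.length = unit(-0.1, "cm"),
          legend.position = legend_position)
}

#Toy data with made-up values:
toy_plot_df <- data.frame(x = c(1, 2, 3, 4), y = c(10, 25, 28, 41),
                          Year = c(2017, 2018, 2018, 2019))
toy_plot <- ggplot(toy_plot_df, aes(x = x, y = y, color = as.factor(Year))) +
  geom_point() +
  scale_color_manual(values = c("red", "blue", "orange"), name = "Year") +
  theme_h2o2_regression(legend_position = "top")
toy_plot
```

Per-plot differences, such as blanking the y-axis title for the inner panels, can still be layered on with an extra `theme()` call after the helper.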

Create a plot similar to the one above for the net production rates, but use the regressions with absolute H2O2 production instead:
```{r}
#Plot the regression with chlorophyll a:
Chla_vs_Abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Chla, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=Chla-Chla_CI, xmax=Chla+Chla_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "top",
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14)) +
  scale_y_continuous(breaks=seq(0,450, by = 50)) +
  coord_cartesian(ylim=c(0,450), xlim=c(0,200)) +
  xlab(expression("Chlorophyll a ("*mu*"g/L)")) +
  ylab(expression("Abs. H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with Respiration:
Resp_vs_Abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=Resp, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=5, size = 0.1) +
  geom_errorbarh(aes(xmin=Resp-Resp_CI, xmax=Resp+Resp_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,450, by = 50)) +
  coord_cartesian(ylim=c(0,450), xlim=c(0,200)) +
  xlab(expression("Respiration ("*mu*"M O"[2]*"/day)")) +
  ylab(expression("Abs H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with Primary Production:
PrimProd_vs_Abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=PrimProd, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=2, size = 0.1) +
  geom_errorbarh(aes(xmin=PrimProd-PrimProd_CI, xmax=PrimProd+PrimProd_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,450, by = 50)) +
  coord_cartesian(ylim=c(0,450), xlim=c(0,100)) +
  xlab(expression("Primary Production ("*mu*"M C/hr)")) +
  ylab(expression("Abs H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with CDOM:
CDOM_vs_Abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=CDOM, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=0.5, size = 0.1) +
  geom_errorbarh(aes(xmin=CDOM-CDOM_CI, xmax=CDOM+CDOM_CI), height=11, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,450, by = 50)) +
  coord_cartesian(ylim=c(0,450), xlim=c(0,25)) +
  xlab(expression("CDOM absorbance (a305)")) +
  ylab(expression("Abs H"[2]*"O"[2]*" production (nM/hr)"))

#Plot the regression with DOC:
DOC_vs_Abs_prod <- ggplot(Merged_Prod_Decay_WL_only, aes(x=DOC, y=PH2O2_avg, color=as.factor(Year))) +
  geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
  geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI, ymax=PH2O2_avg+PH2O2_CI),
                width=10, size = 0.1) +
  geom_point(size = 0.8, alpha = 0.8) +
  scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
  theme_classic() +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        axis.title.y = element_text(size = 12, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,450, by = 50)) +
  scale_x_continuous(breaks=seq(200,600, by = 100)) +
  coord_cartesian(ylim=c(0,450), xlim=c(200,600)) +
  xlab(expression("DOC ("*mu*"M)")) +
  ylab(expression("Abs H"[2]*"O"[2]*" production (nM/hr)"))

#Plot Microcystis abundance vs absolute H2O2 production rate
MC_vs_PH2O2_plot <- filter(Merged_Prod_Decay_WL_only, Model_Fit == "Yes") %>% 
  ggplot(aes(x=Microcystis_abund, y=PH2O2_avg, color=as.factor(Year))) +
    geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE) +
    geom_errorbar(aes(ymin=PH2O2_avg-PH2O2_CI,
                      ymax=PH2O2_avg+PH2O2_CI), size = 0.2, width=1e4) +
    geom_errorbarh(aes(xmin=Microcystis_abund-Microcystis_CI,
                      xmax=Microcystis_abund+Microcystis_CI), size = 0.2, height=4) +
    geom_point() +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 10, color = "black", margin = margin(t = 0, r = 5, b = 0, l = 0)),
          axis.title.y = element_blank(),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 12, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 10, color = "black", margin = margin(t = 5, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.1, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.position = "none") +
    coord_cartesian(ylim=c(0,450)) +
    scale_y_continuous(breaks=seq(0,450, by=50)) +
    xlab("Microcystis abundance (reads/mL)") +
    ylab(expression("Absolute H"[2]*"O"[2]*" production (nM/hr)"))

combined_bio_regressions_Abs_prod <- CDOM_vs_Abs_prod + Chla_vs_Abs_prod + PrimProd_vs_Abs_prod + DOC_vs_Abs_prod + Resp_vs_Abs_prod + MC_vs_PH2O2_plot + plot_layout(ncol = 3)

combined_bio_regressions_Abs_prod
ggsave("combined_bio_regressions_Abs_prod.pdf",  combined_bio_regressions_Abs_prod, width = 12, height = 10, units = "in", dpi=300)
```



In 2021, Dhurba made measurements similar to the light-dark bottle experiments run in the outdoor tank at the Lake Erie Center (the data from 2018-2019), but he spiked in 18-O2 labeled H2O2 and incubated the bottles indoors under temperature- and light-controlled conditions.  

How do the absolute production rates and absolute decay rate constants obtained in the light and dark compare between the two methods?  
```{r}
#Import the dataframe that has the summarized data from the outdoor spike-batch experiments and the indoor 18-O2 experiments:
IndoorOutdoor_df <- read.table("Outdoor_Indoor_Exp_Compare.txt", header=TRUE, sep="\t")
#Make a plot for PH2O2 in the light:
IndoorOutdoor_PH2O2_light <- filter(IndoorOutdoor_df, Condition == "light") %>% ggplot(aes(x=Experiment_Type, y=PH2O2, color=Experiment_Type)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  ggtitle("Light") +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  scale_y_continuous(breaks=seq(0,450, by=75)) +
  coord_cartesian(ylim=c(0,450)) +
  xlab("Experiment Type") +
  ylab(expression("Absolute H2O2 production (nM/hr)"))

#Make a plot for PH2O2 in the dark:
IndoorOutdoor_PH2O2_dark <- filter(IndoorOutdoor_df, Condition == "dark") %>% ggplot(aes(x=Experiment_Type, y=PH2O2, color=Experiment_Type)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  ggtitle("Dark") + 
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,40)) +
  xlab("Experiment Type") +
  ylab(expression("Absolute H2O2 production (nM/hr)"))

#Make a plot for Kloss in the light:
IndoorOutdoor_Kloss_light <- filter(IndoorOutdoor_df, Condition == "light") %>% ggplot(aes(x=Experiment_Type, y=Kloss, color=Experiment_Type)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  ggtitle("Light") +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,0.8)) +
  xlab("Experiment Type") +
  ylab(expression("Kloss,H2O2 (hr-1)"))

#Make a plot for Kloss in the dark:
IndoorOutdoor_Kloss_dark <- filter(IndoorOutdoor_df, Condition == "dark") %>% ggplot(aes(x=Experiment_Type, y=Kloss, color=Experiment_Type)) +
  geom_boxplot(outlier.shape = NA, position=position_dodge(width=1.5)) +
  geom_jitter(alpha=0.7) +
  theme_classic() +
  ggtitle("Dark") +
  theme(plot.background = element_rect(color = "NA"),
        axis.line.x = element_line(size=0.1),
        axis.line.y = element_line(size=0.1),
        axis.text.y = element_text(size = 14, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.y = element_text(size = 16, color = "black", margin = margin(t = 0, r = 10, b = 0, l = 0)),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 16, color = "black", margin = margin(t = 10, r = 0, b = 0, l = 0)),
        axis.ticks.length = unit(-0.1, "cm"),
        axis.ticks = element_line(size=0.1),
        legend.position = "none") +
  coord_cartesian(ylim=c(0,0.8)) +
  xlab("Experiment Type") +
  ylab(expression("Kloss,H2O2 (hr-1)"))

Combined_IndoorOutdoor_plot <- IndoorOutdoor_PH2O2_light + IndoorOutdoor_PH2O2_dark + IndoorOutdoor_Kloss_light + IndoorOutdoor_Kloss_dark + plot_layout(ncol = 2)
Combined_IndoorOutdoor_plot
ggsave("Combined_IndoorOutdoor_plot.pdf",  Combined_IndoorOutdoor_plot, width = 7, height = 6, units = "in", dpi=300)
```
The rows reported as missing for the PH2O2 and Kloss plots correspond to the outdoor experiments for which PH2O2 and Kloss could not be calculated, and to the 21-Sep-21 value with PH2O2 listed as NA.  

Are the differences observed between the two methods significant?  
```{r}
### FOR PH2O2 IN THE LIGHT ###
#The following line extracts the data for the isotope experiments:
PH2O2_light_isotope <- filter(IndoorOutdoor_df, Experiment_Type == "Indoor_Isotope" & Condition == "light")$PH2O2
#The next line removes any missing values (NA or NaN):
PH2O2_light_isotope <- PH2O2_light_isotope[!is.na(PH2O2_light_isotope)]
#The following line extracts the data for the spike-batch experiments:
PH2O2_light_spike_batch <- filter(IndoorOutdoor_df, Experiment_Type == "Outdoor" & Condition == "light")$PH2O2
#The next line removes any missing values (NA or NaN):
PH2O2_light_spike_batch <- PH2O2_light_spike_batch[!is.na(PH2O2_light_spike_batch)]
#Print some text to keep track of what is printed to the screen:
print("T-test for PH2O2 in the light")
#Run the T-test and print the result:
print(t.test(PH2O2_light_isotope, PH2O2_light_spike_batch, paired = FALSE, alternative = "two.sided"))

### FOR PH2O2 IN THE DARK ###
#The following line extracts the data for the isotope experiments:
PH2O2_dark_isotope <- filter(IndoorOutdoor_df, Experiment_Type == "Indoor_Isotope" & Condition == "dark")$PH2O2
#The next line removes any missing values (NA or NaN):
PH2O2_dark_isotope <- PH2O2_dark_isotope[!is.na(PH2O2_dark_isotope)]
#The following line extracts the data for the spike-batch experiments:
PH2O2_dark_spike_batch <- filter(IndoorOutdoor_df, Experiment_Type == "Outdoor" & Condition == "dark")$PH2O2
#The next line removes any missing values (NA or NaN):
PH2O2_dark_spike_batch <- PH2O2_dark_spike_batch[!is.na(PH2O2_dark_spike_batch)]
#Print some text to keep track of what is printed to the screen:
print("T-test for PH2O2 in the dark")
#Run the T-test and print the result:
print(t.test(PH2O2_dark_isotope, PH2O2_dark_spike_batch, paired = FALSE, alternative = "two.sided"))

### FOR Kloss IN THE LIGHT ###
#The following line extracts the data for the isotope experiments:
Kloss_light_isotope <- filter(IndoorOutdoor_df, Experiment_Type == "Indoor_Isotope" & Condition == "light")$Kloss
#The next line removes any missing values (NA or NaN):
Kloss_light_isotope <- Kloss_light_isotope[!is.na(Kloss_light_isotope)]
#The following line extracts the data for the spike-batch experiments:
Kloss_light_spike_batch <- filter(IndoorOutdoor_df, Experiment_Type == "Outdoor" & Condition == "light")$Kloss
#The next line removes any missing values (NA or NaN):
Kloss_light_spike_batch <- Kloss_light_spike_batch[!is.na(Kloss_light_spike_batch)]
#Print some text to keep track of what is printed to the screen:
print("T-test for Kloss in the light")
#Run the T-test and print the result:
print(t.test(Kloss_light_isotope, Kloss_light_spike_batch, paired = FALSE, alternative = "two.sided"))

### FOR Kloss IN THE DARK ###
#The following line extracts the data for the isotope experiments:
Kloss_dark_isotope <- filter(IndoorOutdoor_df, Experiment_Type == "Indoor_Isotope" & Condition == "dark")$Kloss
#The next line removes any missing values (NA or NaN):
Kloss_dark_isotope <- Kloss_dark_isotope[!is.na(Kloss_dark_isotope)]
#The following line extracts the data for the spike-batch experiments:
Kloss_dark_spike_batch <- filter(IndoorOutdoor_df, Experiment_Type == "Outdoor" & Condition == "dark")$Kloss
#The next line removes any missing values (NA or NaN):
Kloss_dark_spike_batch <- Kloss_dark_spike_batch[!is.na(Kloss_dark_spike_batch)]
#Print some text to keep track of what is printed to the screen:
print("T-test for Kloss in the dark")
#Run the T-test and print the result:
print(t.test(Kloss_dark_isotope, Kloss_dark_spike_batch, paired = FALSE, alternative = "two.sided"))
```
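
The four comparisons above follow one pattern: pull a column for each experiment type under one condition, drop missing values, and run an unpaired two-sided t-test. Because `t.test()` defaults to `var.equal = FALSE`, this is Welch's t-test. A minimal sketch of that pattern as a function is shown below with made-up numbers; `compare_methods()` and `toy_compare_df` are hypothetical names.

```{r}
#Toy data with made-up values standing in for IndoorOutdoor_df:
toy_compare_df <- data.frame(
  Experiment_Type = rep(c("Indoor_Isotope", "Outdoor"), each = 4),
  Condition = "light",
  PH2O2 = c(120, 150, 135, NA, 210, 260, 240, 225))

#Hypothetical helper: Welch's t-test of one variable between the two experiment types
compare_methods <- function(df, variable, condition) {
  subset_df <- df[df$Condition == condition, ]
  isotope <- subset_df[subset_df$Experiment_Type == "Indoor_Isotope", variable]
  spike_batch <- subset_df[subset_df$Experiment_Type == "Outdoor", variable]
  #Drop missing values before testing:
  isotope <- isotope[!is.na(isotope)]
  spike_batch <- spike_batch[!is.na(spike_batch)]
  t.test(isotope, spike_batch, paired = FALSE, alternative = "two.sided")
}

toy_test <- compare_methods(toy_compare_df, "PH2O2", "light")
toy_test
```

With a helper like this, each of the four comparisons reduces to one call, for example `compare_methods(IndoorOutdoor_df, "Kloss", "dark")`.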
Dhurba found that Kloss did not depend on the amount of labeled H2O2 added to light and dark bottles. I'm going to calculate the significance of the differences below.

To do this, I compiled Dhurba's data in the following directory on the Cory Lab Z Drive: `Lab_Members\PandeyDR\Research\_Chapters\Chapter 2\_DATA\MIMS\Petrel_MIMS\Experiments_using_H218O2_Stock\AbsoluteProduction_AbsoluteDecay_Tests`

I copy-pasted the PH2O2 and Kloss data for each replicate bottle in Dhurba's experiments into a tab-delimited text file to import the data into R. I then run the t-tests in the same way as for my light-dark and filtered-unfiltered comparisons above.  
```{r}
#First import a data frame compiled from Dhurba's excel spreadsheets:
Isotope_Replicate_Data <- read.table("Isotope_Replicate_Data.txt", header=TRUE, sep="\t")

#Run the T-tests:
#Get a vector of dates to loop through:  
dates <- unique(Isotope_Replicate_Data$Date)
#Remove the dates that don't have measurements with high initial [H2O2]:
drop <- c("23-Jun-21", "31-Aug-21")
dates <- dates[!(dates %in% drop)]

#For each date, do a t-test of Kloss in the light with low vs high initial H2O2
for (i in dates){
  #Create a vector of Kloss from only that date with low initial H2O2
  lowH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "low")$Kloss
  #Create a vector of Kloss from only that date with high initial H2O2
  highH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "high")$Kloss
  #Print the date and whether it was light or dark
  print(i)
  print("light")
  #run the t-test and print the result:
  print(t.test(lowH2O2_vector, highH2O2_vector, paired = FALSE, alternative = "two.sided"))
}

#For each date, do a t-test of Kloss in the dark with low vs high initial H2O2
for (i in dates){
  #Create a vector of Kloss from only that date with low initial H2O2
  lowH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "low")$Kloss
  #Create a vector of Kloss from only that date with high initial H2O2
  highH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "high")$Kloss
  #Print the date and whether it was light or dark
  print(i)
  print("dark")
  #run the t-test and print the result:
  print(t.test(lowH2O2_vector, highH2O2_vector, paired = FALSE, alternative = "two.sided"))
}
```
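
The loops above print each test in full, which is hard to scan across dates. As a sketch, the per-date results could instead be gathered into one data frame. The example uses made-up replicate values; `toy_reps` and `kloss_tests` are hypothetical names.

```{r}
#Toy replicate data with made-up values standing in for Isotope_Replicate_Data:
toy_reps <- data.frame(
  Date = rep(c("dateA", "dateB"), each = 6),
  Condition = "light",
  H2O2_initial = rep(rep(c("low", "high"), each = 3), times = 2),
  Kloss = c(0.10, 0.12, 0.11, 0.11, 0.13, 0.12,
            0.30, 0.28, 0.33, 0.31, 0.27, 0.35))

#Run one t-test per date and keep the t statistic, degrees of freedom, and p-value:
kloss_tests <- do.call(rbind, lapply(unique(toy_reps$Date), function(d) {
  one_date <- toy_reps[toy_reps$Date == d & toy_reps$Condition == "light", ]
  low <- one_date$Kloss[one_date$H2O2_initial == "low"]
  high <- one_date$Kloss[one_date$H2O2_initial == "high"]
  res <- t.test(low, high, paired = FALSE, alternative = "two.sided")
  data.frame(Date = d, Condition = "light", t = unname(res$statistic),
             df = unname(res$parameter), p_value = res$p.value)
}))
kloss_tests
```

A table like this makes it easier to see at a glance which dates show a difference between low and high initial H2O2.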
Now see if Kloss is significantly different in the light and dark:  
```{r}
#Run the T-tests:
#Get a vector of dates to loop through:  
dates <- unique(Isotope_Replicate_Data$Date)
#For each date, test if Kloss with low initial H2O2 are different in the light and dark
for (i in dates){
  #Create a vector of Kloss from only that date with low initial H2O2 in the light
  lightKloss_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "low")$Kloss
  #Create a vector of Kloss from only that date with low initial H2O2 in the dark
  darkKloss_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "low")$Kloss
  #Print the date and initial H2O2 level
  print(i)
  print("low H2O2")
  #run the t-test and print the result:
  print(t.test(lightKloss_vector, darkKloss_vector, paired = FALSE, alternative = "two.sided"))
}

#Run the T-tests:
#Get a vector of dates to loop through:  
dates <- unique(Isotope_Replicate_Data$Date)
#Remove the dates that don't have measurements with high initial [H2O2] and one date with no differences in Kloss:
drop <- c("23-Jun-21", "31-Aug-21", "14-Jul-21")
dates <- dates[!(dates %in% drop)]

#For each date, test if Kloss with high initial H2O2 are different in the light and dark
for (i in dates){
  #Create a vector of Kloss from only that date with high initial H2O2 in the light
  lightKloss_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "high")$Kloss
  #Create a vector of Kloss from only that date with high initial H2O2 in the dark
  darkKloss_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "high")$Kloss
  #Print the date and initial H2O2 level
  print(i)
  print("high H2O2")
  #run the t-test and print the result:
  print(t.test(lightKloss_vector, darkKloss_vector, paired = FALSE, alternative = "two.sided"))
}
```
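
Some dates are dropped by hand above because `t.test()` cannot give a usable result for them: the test needs at least two observations per group, and it fails or returns an undefined statistic when neither group shows any variation (for example, identical replicates). A hedged sketch of a pre-check that flags such dates is shown below with made-up values; `can_t_test()` is a hypothetical helper, not part of this analysis.

```{r}
#Hypothetical helper: TRUE only if both groups have at least 2 non-missing values
#and at least one group shows some variation
can_t_test <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  enough_obs <- length(x) >= 2 && length(y) >= 2
  enough_obs && (var(x) > 0 || var(y) > 0)
}

#Made-up examples:
usable <- can_t_test(c(0.10, 0.12, 0.11), c(0.30, 0.28, 0.33))     #replicates that vary
identical_reps <- can_t_test(c(0.25, 0.25, 0.25), c(0.25, 0.25, 0.25)) #no variation in either group
missing_group <- can_t_test(c(0.10, 0.12, 0.11), numeric(0))       #no measurements in one group
c(usable = usable, identical_reps = identical_reps, missing_group = missing_group)
```

A check like this could replace the hard-coded `drop` vectors, so the list of skipped dates follows the data rather than being maintained manually.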
Is absolute H2O2 production significantly different between low and high initial H2O2?
```{r}
#Run the T-tests:
#Get a vector of dates to loop through:  
dates <- unique(Isotope_Replicate_Data$Date)
#Remove the dates that don't have measurements with high initial [H2O2] (23-Jun-21 and 31-Aug-21), that don't have any differences in the replicates (14-Jul-21), or that had problems with PH2O2 (21-Sep-21)
drop <- c("23-Jun-21", "31-Aug-21", "14-Jul-21", "21-Sep-21")
dates <- dates[!(dates %in% drop)]

#For each date, do a t-test of PH2O2 in the light with low vs high initial H2O2
for (i in dates){
  #Create a vector of PH2O2 from only that date with low initial H2O2
  lowH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "low")$PH2O2
  #Create a vector of PH2O2 from only that date with high initial H2O2
  highH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "high")$PH2O2
  #Print the date and whether it was light or dark
  print(i)
  print("light")
  #run the t-test and print the result:
  print(t.test(lowH2O2_vector, highH2O2_vector, paired = FALSE, alternative = "two.sided"))
}

#For each date, do a t-test of PH2O2 in the dark with low vs high initial H2O2
for (i in dates){
  #Create a vector of PH2O2 from only that date with low initial H2O2
  lowH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "low")$PH2O2
  #Create a vector of PH2O2 from only that date with high initial H2O2
  highH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "high")$PH2O2
  #Print the date and whether it was light or dark
  print(i)
  print("dark")
  #run the t-test and print the result:
  print(t.test(lowH2O2_vector, highH2O2_vector, paired = FALSE, alternative = "two.sided"))
}
```
Is absolute H2O2 production significantly different between light exposed and dark control bottles?  
```{r}
#Run the T-tests for low initial H2O2:
#Get a vector of dates to loop through:  
dates <- unique(Isotope_Replicate_Data$Date)
#Remove the date that doesn't have PH2O2 data (21-Sep-21) and the date that has 0 production for both the light and dark (14-Jul-21).
drop <- c("14-Jul-21", "21-Sep-21")
dates <- dates[!(dates %in% drop)]

#For each date, test if PH2O2 with low initial H2O2 are different in the light and dark
for (i in dates){
  #Create a vector of PH2O2 from only that date with low initial H2O2 in the light
  lightPH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "low")$PH2O2
  #Create a vector of PH2O2 from only that date with low initial H2O2 in the dark
  darkPH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "low")$PH2O2
  #Print the date and initial H2O2 level
  print(i)
  print("low H2O2")
  #run the t-test and print the result:
  print(t.test(lightPH2O2_vector, darkPH2O2_vector, paired = FALSE, alternative = "two.sided"))
}

#Run the T-tests for high initial H2O2:
#Get a vector of dates to loop through:  
dates <- unique(Isotope_Replicate_Data$Date)
#Remove the dates that don't have measurements with high initial [H2O2] ("23-Jun-21", "31-Aug-21"), the date with 0 PH2O2 in the light and dark for all reps ("14-Jul-21"), and the date with no PH2O2 data ("21-Sep-21"):
drop <- c("23-Jun-21", "31-Aug-21", "14-Jul-21", "21-Sep-21")
dates <- dates[!(dates %in% drop)]

#For each date, test if PH2O2 with high initial H2O2 are different in the light and dark
for (i in dates){
  #Create a vector of PH2O2 from only that date with high initial H2O2 in the light
  lightPH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "light" & H2O2_initial == "high")$PH2O2
  #Create a vector of PH2O2 from only that date with high initial H2O2 in the dark
  darkPH2O2_vector <- filter(Isotope_Replicate_Data, Date == i & Condition == "dark" & H2O2_initial == "high")$PH2O2
  #Print the date and initial H2O2 level
  print(i)
  print("high H2O2")
  #run the t-test and print the result:
  print(t.test(lightPH2O2_vector, darkPH2O2_vector, paired = FALSE, alternative = "two.sided"))
}
```
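R's `t.test()` with `paired = FALSE` and the default `var.equal = FALSE` is Welch's unequal-variance t-test. As a rough sketch of what each loop iteration above computes (the function name `welch_t` and the replicate values are made up for illustration, and the p-value lookup is left out):

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)  # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical replicate values, for illustration only (not measured data)
light = [2.0, 4.0, 6.0]
dark = [1.0, 2.0, 3.0]
t_stat, dof = welch_t(light, dark)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

If every replicate in both groups is identical, both variances are zero and the statistic is undefined, which is presumably why dates with no differences between replicates are dropped before the loops.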

Next we will ask the question: do H2O2 production and decay in the outdoor experiments depend on the microbial community composition?  

First we need to calculate how similar the communities are. We will use Bray-Curtis dissimilarity.  

**Notes on Bray-Curtis Dissimilarity:**  
It is defined as:  
$$BC_{ij} = 1 - \frac{2C_{ij}}{S_i + S_j}$$  
Where:  
1. *i* and *j* represent samples to be compared  
2. Cij is the sum, over all OTUs (species), of the lesser of the two counts in samples *i* and *j*. An OTU absent from either sample contributes 0.  
    + For example, consider two imaginary forests. Forest 1 has 10 squirrels, 3 bears, and 1 leopard. Forest 2 has 13 squirrels, 1 bear, and no leopards. When counting squirrels for Cij the value would be 10, and for bears it would be 1. Leopards would not contribute to Cij because they are not present in both forests. The value for Cij would be 10 + 1 = 11.  
3. Si and Sj are the total counts of all individuals (summed over all species) in samples *i* and *j*, respectively.  
    + Continuing the example from above, Si would be 10 + 3 + 1 = 14 and Sj would be 13 + 1 + 0 = 14.  
    + The Bray-Curtis dissimilarity for the example would be  
      1 - [(2 * 11) / (14 + 14)] = 1 - [22 / 28] = 1 - 0.786 = 0.214  
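
The forest example can be checked with a few lines of code. This is only a minimal sketch of the formula above, not how `vegdist()` is implemented; the function name `bray_curtis` and the dictionary layout are assumptions made for illustration:

```python
def bray_curtis(counts_i, counts_j):
    """Bray-Curtis dissimilarity between two samples given as {taxon: count} dicts."""
    taxa = set(counts_i) | set(counts_j)
    # C_ij: sum of the lesser count for every taxon (0 if absent from either sample)
    c_ij = sum(min(counts_i.get(t, 0), counts_j.get(t, 0)) for t in taxa)
    s_i = sum(counts_i.values())
    s_j = sum(counts_j.values())
    return 1 - (2 * c_ij) / (s_i + s_j)

forest_1 = {"squirrel": 10, "bear": 3, "leopard": 1}
forest_2 = {"squirrel": 13, "bear": 1, "leopard": 0}
bc = bray_curtis(forest_1, forest_2)
print(round(bc, 3))  # 0.214
```

Identical samples give a dissimilarity of 0, and samples with no shared taxa give 1.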

First, we need to reorganize the data and calculate Bray-Curtis dissimilarity:  
```{r}
#Extract just the Whole water community samples from the summary dataframe:
LE_H2O2.merged.summary.WW <- LE_H2O2.merged.summary[ LE_H2O2.merged.summary$Condition == "WL", ]

#Convert the dataframe into a matrix where rows are samples and columns are OTU abundances:
LE_H2O2.pivot <- pivot_wider(LE_H2O2.merged.summary.WW, id_cols = Experiment_Date, names_from = Row.names, values_from = mean)
Date_vector <- LE_H2O2.pivot$Experiment_Date #Save date vector
LE_H2O2.matrix <- round(LE_H2O2.pivot[,-1], 0) #Round to the nearest whole number, exclude dates
#Convert to matrix format
LE_H2O2.matrix <- as.matrix(LE_H2O2.matrix)
LE_H2O2.matrix <- apply(LE_H2O2.matrix, 2, as.numeric)
rownames(LE_H2O2.matrix) <- Date_vector #Set row names as dates

#Calculate Bray-Curtis dissimilarity:
dist.matrix <- vegdist(LE_H2O2.matrix, method = "bray")
``` 

Build the PCoA with 2 axes:  
```{r}
WW_PCoA_k2 <- cmdscale(dist.matrix, k = 2, eig = TRUE)
WW_PCoA_k2
```
Build a scree plot to determine the number of principal components to consider:
```{r}
plot(WW_PCoA_k2$eig, xlab = "Component Number", ylab = "Eigenvalue")
```
It looks like the eigenvalues taper off after 4 axes.  

```{r}
WW_PCoA_k4 <- cmdscale(dist.matrix, k = 4, eig = TRUE)
WW_PCoA_k4
```
The goodness of fit is improved with four axes. We will use this moving forward.  
How much variation does each axis explain?  
```{r}
#Remove the PCoA with two axes:
rm(WW_PCoA_k2)

#Calculate % variance explained by each PCoA axis:
WW_PCoA_k4$eig[1] / sum(WW_PCoA_k4$eig) * 100
WW_PCoA_k4$eig[2] / sum(WW_PCoA_k4$eig) * 100
WW_PCoA_k4$eig[3] / sum(WW_PCoA_k4$eig) * 100
WW_PCoA_k4$eig[4] / sum(WW_PCoA_k4$eig) * 100
```
Axis 1 explains 34.9 % of the variance, axis 2 explains 18.1 % of the variance, axis 3 explains 7.7 % of the variance, and axis 4 explains 6.9 %. These four axes explain 67.6% of the variance.
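
The percent-variance calculation is just each eigenvalue divided by the sum of all eigenvalues. A minimal sketch, assuming made-up eigenvalues (these are not the values returned by `cmdscale()`, and `percent_explained` is a hypothetical helper name):

```python
def percent_explained(eigenvalues, n_axes):
    """Percent of total variance captured by each of the first n_axes eigenvalues."""
    total = sum(eigenvalues)
    return [100 * ev / total for ev in eigenvalues[:n_axes]]

# Hypothetical eigenvalues, for illustration only (not the values from cmdscale)
eig = [4.0, 2.0, 1.0, 1.0, 2.0]
per_axis = percent_explained(eig, 4)
print(per_axis)       # [40.0, 20.0, 10.0, 10.0]
print(sum(per_axis))  # 80.0
```

The sum over the retained axes is the cumulative variance explained, which is the number quoted as the goodness of fit on the plot.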

Add the PCoA coordinates to the environmental data and plot the PCoA:
```{r}
#Merge the environmental matrix with the PCoA coordinates:
Merged_Prod_Decay_WL_only <- merge(Merged_Prod_Decay_WL_only, WW_PCoA_k4$points, by.x = "Experiment_Date", by.y = "row.names", all.x = TRUE, all.y = FALSE)
```

Plot PCoA:
```{r}
#Make 3D PCoA plot:
PCoA_3D <- plot_ly(Merged_Prod_Decay_WL_only, x = ~V1, y= ~V3, z = ~V2, color = ~Net_production_avg, symbol = ~Year,
                   symbols = c("circle", "diamond", "square")) %>%
                    add_markers(marker = list(line = list(color = "black", width = 0.5), size = 6)) %>%
                    layout(scene = list(xaxis = list(title = "PCoA 1 (34.9 %)"),
                                        yaxis = list(title = "PCoA 3 (7.7 %)"),
                                        zaxis = list(title = "PCoA 2 (18.1 %)")),
                           annotations = list(text = "GOF = 0.68", showarrow = F, align = "left",
                                              x=0.8, y=0.9, z=2.5))
PCoA_3D
```
Do any of the PCoA Axes correlate with H2O2 production or decay?

Perform backward stepwise regression to find the best model to explain variance along PCoA axis 1. Start with all variables in the model, then remove variables whose removal does not significantly decrease model performance (measured with BIC). We have a high number of potential variables and a low number of data points, so we will do a bootstrapping analysis to assess model stability.  
```{r}
library(MASS)
library(bootStepAIC)
#Do the stepwise regression
set.seed(278)
PCoA_df <- Merged_Prod_Decay_WL_only[ !is.na(Merged_Prod_Decay_WL_only$pH), ] #Remove two samples from EC cruises missing pH data
PCoA_df <- PCoA_df[ !is.na(PCoA_df$TP), ] #Remove one sample missing phosphorus data (index on PCoA_df itself so the rows line up)
fit <- lm(V1 ~ Net_production_avg+Net_decay_avg+Max_H2O2_avg+pH+Chla+CDOM+TP+TDP+Nitrate+NH4+SRP+Incubation_Temp+peakA+peakC+peakT+C_A_ratio+T_A_ratio+IntFlour+FI+SlopeRatio+Microcystis_abund, data=PCoA_df)
step <- boot.stepAIC(fit, PCoA_df, B = 300, alpha=0.01, k = log(26), direction="backward")
step
```
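The `k` argument is the penalty per estimated parameter: `k = 2` gives AIC and `k = log(n)` gives BIC, which is why `k = log(26)` is passed above. A small sketch of the difference, assuming a made-up log-likelihood and parameter count (`information_criterion` is a hypothetical helper, not part of `MASS` or `bootStepAIC`):

```python
import math

def information_criterion(log_likelihood, n_params, k):
    """Penalized fit criterion: -2 * logLik + k * (number of estimated parameters)."""
    return -2 * log_likelihood + k * n_params

n_obs = 26                    # sample size implied by k = log(26) in the chunk above
loglik, n_params = -40.0, 5   # hypothetical values, for illustration only
aic = information_criterion(loglik, n_params, k=2)
bic = information_criterion(loglik, n_params, k=math.log(n_obs))
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```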
Repeat the stepwise regression for the second PCoA axis:  
```{r}
set.seed(156)
fit <- lm(V2 ~ Net_production_avg+Net_decay_avg+Max_H2O2_avg+pH+Chla+CDOM+TP+TDP+Nitrate+NH4+SRP+Incubation_Temp+peakA+peakC+peakT+C_A_ratio+T_A_ratio+IntFlour+FI+SlopeRatio+Microcystis_abund, data=PCoA_df)
step2 <- boot.stepAIC(fit, PCoA_df, B = 300, alpha=0.01, k = log(26), direction="backward")
step2
```
Repeat the stepwise regression for axis 3:  
```{r}
set.seed(546)
fit <- lm(V3 ~ Net_production_avg+Net_decay_avg+Max_H2O2_avg+pH+Chla+CDOM+TP+TDP+Nitrate+NH4+SRP+Incubation_Temp+peakA+peakC+peakT+C_A_ratio+T_A_ratio+IntFlour+FI+SlopeRatio+Microcystis_abund, data=PCoA_df)
step3 <- boot.stepAIC(fit, PCoA_df, B = 300, alpha=0.01, k = log(26), direction="backward")
step3
```
Repeat the stepwise regression for axis 4:  
```{r}
set.seed(293)
fit <- lm(V4 ~ Net_production_avg+Net_decay_avg+Max_H2O2_avg+pH+Chla+CDOM+TP+TDP+Nitrate+NH4+SRP+Incubation_Temp+peakA+peakC+peakT+C_A_ratio+T_A_ratio+IntFlour+FI+SlopeRatio+Microcystis_abund, data=PCoA_df)
step4 <- boot.stepAIC(fit, PCoA_df, B = 300, alpha=0.01, k = log(26), direction="backward")
step4
```
Let's summarize all the stepwise linear models in a table:  
```{r}
stepAIC_table <- tab_model(step$OrigStepAIC, step2$OrigStepAIC, step3$OrigStepAIC, step4$OrigStepAIC, show.df = TRUE, show.intercept = FALSE, show.aic = TRUE, show.ci = FALSE, file="stepAIC_table.doc")

stepAIC_table
```

Not much stands out: many of the parameters are retained in the models with high significance. Many of the H2O2 production and water chemistry variables are correlated with changes in bacterial community composition.  

We will try to use OTU relative abundances along with other environmental data to predict H2O2 production rates using a random forest model. The Pearson correlation coefficient between H2O2 production and chlorophyll concentration, along with the corresponding R2 value, will serve as the baseline against which to compare the model using OTU abundances.  

Because of the uncertainty in absolute biotic H2O2 production rates, we will stick with net H2O2 production and decay rates in whole water.

First, merge LE_H2O2.matrix with the environmental data:  
```{r}
#Remove everything except Experiment_Date, net production, and net decay:
keep <- c("Experiment_Date", "Net_production_avg", "Net_production_CI", "Net_decay_avg",
          "Net_decay_CI")

RF_H2O2_df <- Merged_Prod_Decay_WL_only[ , colnames(Merged_Prod_Decay_WL_only) %in% keep]

LE_H2O2_matrix_merged <- merge(LE_H2O2.matrix, RF_H2O2_df, by.x = "row.names",
                               by.y = "Experiment_Date", all.x = FALSE, all.y = TRUE)
#all.x is FALSE because there are 16S samples from filtered and dark bottles that are not being considered here (focusing on whole water light production and decay rates)

#Make a column name more informative:  
names(LE_H2O2_matrix_merged)[names(LE_H2O2_matrix_merged) == "Row.names"] <- "Experiment_Date"

#Make a dataframe that is only net H2O2 production and OTU abundances for the first model:
drop <- c("Experiment_Date", "Net_production_CI", "Net_decay_avg", "Net_decay_CI")

LE_H2O2_matrix_merged_Net_H2O2_production <- LE_H2O2_matrix_merged[ , !(colnames(LE_H2O2_matrix_merged) %in% drop)]

rm(drop)
drop <- c()
#Only keep OTUs with a maximum abundance above 500 reads/mL. I am doing this because many of the OTUs with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each column of the dataframe
for (i in 1:9844){ 
  #Set to 9844, because there are 9845 columns, but the last column is the H2O2 data, which we want to ignore here
  if (max(LE_H2O2_matrix_merged_Net_H2O2_production[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (OTU number) to a vector of OTUs to drop
    drop[i] <- colnames(LE_H2O2_matrix_merged_Net_H2O2_production)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those OTUs from the dataframe for the random forest model:
LE_H2O2_matrix_merged_Net_H2O2_production <- LE_H2O2_matrix_merged_Net_H2O2_production[ , !(colnames(LE_H2O2_matrix_merged_Net_H2O2_production) %in% drop)]
```
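
The abundance filter above keeps any OTU whose maximum abundance reaches the 500 reads/mL cutoff in at least one sample. The same idea, sketched without a hard-coded column count (the table contents and the helper name `keep_abundant` are hypothetical):

```python
def keep_abundant(table, threshold, ignore=()):
    """Keep columns whose maximum is at least `threshold`; always keep names in `ignore`."""
    return {
        name: values
        for name, values in table.items()
        if name in ignore or max(values) >= threshold
    }

# Hypothetical OTU names and abundances, for illustration only
table = {
    "OtuA": [0, 120, 750],
    "OtuB": [10, 40, 499],
    "Net_production_avg": [12.5, 30.1, 8.4],
}
filtered = keep_abundant(table, threshold=500, ignore=("Net_production_avg",))
print(list(filtered))  # ['OtuA', 'Net_production_avg']
```

Naming the column to skip, rather than relying on its position, means the filter would not need to change if the number of OTUs changed.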

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(31415)

#import the R dataframe with field metadata and OTU abundance as a pandas dataframe:
features = r.LE_H2O2_matrix_merged_Net_H2O2_production
#display a summary of the columns and their data types:
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_production_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_production_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```
The next step is to create training and testing data sets. During training, the model sees the net H2O2 production rates so that it can learn how to predict them from the features. The model then predicts rates for held-out samples whose production rates it has not seen, and its effectiveness is judged by comparing the predicted and measured net H2O2 production rates.   

I'll split the data into 4 subsets, hold one out as the validation set, and train the model on the other three. This is repeated until each subset has served as the validation set once (4-fold cross validation).  

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_PH2O2 = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_Net_PH2O2 = pd.DataFrame(model_PH2O2.cv_results_).sort_values(by='rank_test_score')
```
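The fold splitting that the grid search relies on can be sketched in a few lines. This is an illustration of unshuffled k-fold splitting under stated assumptions, not scikit-learn's code: `k_fold_indices` is a hypothetical helper and the sample count of 26 is only a placeholder. Each fold is held out once for validation while the model is trained on the remaining folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

# 26 is a hypothetical sample count used only for illustration
for train, validation in k_fold_indices(26, k=4):
    print(len(train), len(validation))
```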
Export the grid search results to R:
```{r}
scores_df_Net_PH2O2 <- py$scores_df_Net_PH2O2
```
 
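Make predictions of net H2O2 production from the features (OTU abundances) with the best model from the grid search, then determine the R2 score. The predictions are made on the same samples the model was fit to, so this is an in-sample score:  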
```{python}
#Use the forest's predict method on the data
predictions = model_PH2O2.predict(features)

#Calculate the absolute residuals:
errors = abs(predictions - labels)

#Calculate the model R2 score
model_PH2O2.score(features, labels)
```
What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```
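
For reference, the two statistics used to judge the model can be written out by hand. This is a minimal sketch with made-up observed and predicted rates; the helper names are assumptions, and the regressor's `score()` method reports the same R2 quantity:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def mean_absolute_error(observed, predicted):
    """Average size of the prediction errors, in the units of the observations."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical rates, for illustration only
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(r_squared(observed, predicted), 3))  # 0.948
print(mean_absolute_error(observed, predicted))  # 2.5
```

R2 is unitless, while the MAE is in the same units as the rates, so the two are reported together.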

How does this compare to the baseline predictions using the linear regression model with Chlorophyll and CDOM?
```{r}
#Get the R2 values for these regressions:
print("R squared chla")
WL_lm_results_net_prod_table$Chla$r.squared
print("R squared CDOM")
WL_lm_results_net_prod_table$CDOM$r.squared
print("R squared PrimProd")
WL_lm_results_net_prod_table$PrimProd$r.squared
#Get the MAE values for these regressions
print("mean absolute error Chla")
mean(abs(WL_lm_results_net_prod$Chla$residuals))
print("mean absolute error CDOM")
mean(abs(WL_lm_results_net_prod$CDOM$residuals))
print("mean absolute error PrimProd")
mean(abs(WL_lm_results_net_prod$PrimProd$residuals))
```
```{r}
#What is the %increase in R2 using the Random Forest Regression over chlorophyll regression?
abs(0.86 - 0.29)/0.29 * 100

#What is the change in MAE?
abs(9.87 - 27.26)/27.26 * 100

#What is the % increase in R2 over the CDOM regression?
abs(0.86 - 0.44)/0.44 * 100

#What is the change in MAE?
abs(9.87 - 20.78)/20.78 * 100
```
The R2 with the RF model is ~197 % higher than the chlorophyll regression model, and the MAE is 64 % lower.
The R2 with the RF model is ~95 % higher than the CDOM regression model, and the MAE is ~53 % lower.
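
Wrapping the percent-difference arithmetic in a small function keeps the numerator and denominator tied to the same baseline value. A sketch using the R2 and MAE values reported in this notebook (`percent_change` is a hypothetical helper name):

```python
def percent_change(new, baseline):
    """Absolute percent difference of `new` relative to `baseline`."""
    return abs(new - baseline) / baseline * 100

# R2 and MAE values as reported in this notebook: RF vs. chlorophyll a and CDOM
print(round(percent_change(0.86, 0.29)))    # R2 vs. chlorophyll a regression
print(round(percent_change(9.87, 27.26)))   # MAE vs. chlorophyll a regression
print(round(percent_change(0.86, 0.44)))    # R2 vs. CDOM regression
print(round(percent_change(9.87, 20.78)))   # MAE vs. CDOM regression
```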

Calculate the importance of each OTU as a predictor in the model via permutation:  
```{python}
#Only permutation_importance is needed for this step:
from sklearn.inspection import permutation_importance

result = permutation_importance(model_PH2O2, features, labels, n_repeats=10,
                                random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```

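List the importances, keeping only OTUs whose importance stays above 0.009 at the lower bound of the 95% confidence interval. The printed +/- value is the standard deviation across the 10 permutations:    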
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```
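The cutoff used in the loop above can be sketched on its own. A feature is kept only if its mean importance minus the 95% CI half-width (1.96 * std / sqrt(number of repeats)) is still above the threshold. The OTU names and importance values here are made up, and `significant_features` is a hypothetical helper. Unlike the chunk above, this version prints the CI half-width rather than the standard deviation.

```python
import math

def significant_features(names, means, stds, n_repeats, threshold, z=1.96):
    """Return (name, mean, half_width) for features whose lower CI bound exceeds threshold."""
    kept = []
    for name, mean, std in zip(names, means, stds):
        half_width = z * std / math.sqrt(n_repeats)  # 95% CI half-width on the mean
        if mean - half_width > threshold:
            kept.append((name, mean, half_width))
    # Most important first
    return sorted(kept, key=lambda row: row[1], reverse=True)

# Hypothetical feature names and importances, for illustration only
names = ["OtuX", "OtuY", "OtuZ"]
means = [0.050, 0.012, 0.020]
stds = [0.010, 0.008, 0.004]
for name, mean, hw in significant_features(names, means, stds, n_repeats=10, threshold=0.009):
    print(f"{name}: {mean:.3f} +/- {hw:.3f}")
```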
Determine how these OTUs correlate with net H2O2 production rates alone:  
```{r}
#sum the important OTUs using the 95% CI cutoff:
#Only included the OTUS with importances 0.01 or more (including CI error).
important_Otus_95_CI_net_H2O2_production <- c("Otu00822", "Otu00579", "Otu00378", "Otu00323", "Otu00258", "Otu00123", "Otu00433")

#Create an empty matrix to save results into
reg_results <- list()
reg_results_table <- list()
Net_H2O2_production_rf_important_OTUs <- array(numeric(),c(length(important_Otus_95_CI_net_H2O2_production),12))

#This is a vector of model importances from the python output:
RF_Model_Importance <- c(0.092, 0.045, 0.030, 0.029, 0.029, 0.017, 0.017)

#This is a vector of confidence intervals on the importance from the python output:
RF_Model_95CI <- c(0.020, 0.009, 0.006, 0.007, 0.006, 0.005, 0.004)

#Start a counter to keep track of the number of loop iterations:  
count = 0
for (i in important_Otus_95_CI_net_H2O2_production){
  #Get a dataframe that is just the OTU of interest at this stage of the loop, is only whole water, and only includes samples with fit to model in both reps.
 temp_df <- LE_H2O2.merged.summary.environ[LE_H2O2.merged.summary.environ$OTU == i & LE_H2O2.merged.summary.environ$Condition == "WL", ]
 #Add one to the counter:
 count <- count + 1
 #Get importance rank
 Net_H2O2_production_rf_important_OTUs[ count, 1] <- count
 #Get OTU number
 Net_H2O2_production_rf_important_OTUs[ count, 2] <- i
 #Get the Taxonomy: 
 Net_H2O2_production_rf_important_OTUs[ count, 3] <- LE_H2O2.taxonomy[row.names(LE_H2O2.taxonomy) == i, 2]
 #Get the model importance:
 Net_H2O2_production_rf_important_OTUs[ count, 4] <- RF_Model_Importance[count]
 #Get the confidence interval on the importance:
 Net_H2O2_production_rf_important_OTUs[ count, 5] <- RF_Model_95CI[count]
 #Get the mean abundance:  
 Net_H2O2_production_rf_important_OTUs[ count, 6] <- mean(temp_df$OTU_mean_abund)
 #Get the abundance standard deviation:  
 Net_H2O2_production_rf_important_OTUs[ count, 7] <- sd(temp_df$OTU_mean_abund)
 #Get the max. abundance:  
 Net_H2O2_production_rf_important_OTUs[ count, 8] <- max(temp_df$OTU_mean_abund)
#Get the confidence interval on the maximum abundance:   
 Net_H2O2_production_rf_important_OTUs[ count, 9] <- temp_df$OTU_CI[ temp_df$OTU_mean_abund == max(temp_df$OTU_mean_abund) ]
 #Get pearson's R:
 Net_H2O2_production_rf_important_OTUs[ count, 10] <- cor(temp_df$OTU_mean_abund, temp_df$Net_production_avg)
 #Get the regression p-value:
 reg_results[[i]] <- lm(Net_production_avg ~ OTU_mean_abund, data=temp_df, na.action = na.omit)
 reg_results_table[[i]] <- glance(reg_results[[i]])
 Net_H2O2_production_rf_important_OTUs[ count, 11] <- reg_results_table[[i]]$p.value
 #Get the regression R2 value:  
 Net_H2O2_production_rf_important_OTUs[ count, 12] <- reg_results_table[[i]]$r.squared
}

#Rename columns to something useful
colnames(Net_H2O2_production_rf_important_OTUs) <- c("Importance rank", "OTU number", "Taxonomy", "Model Importance", "Importance 95% CI", "Mean abundance", "Abundance standard deviation", "Maximum abundance", "Max abundance CI", "Pearson's R", "p-value", "R2")
```
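For a regression with a single predictor, the R2 from `lm()` is the square of Pearson's R, so columns 10 and 12 of the table carry the same information apart from the sign. A minimal sketch with made-up abundances and rates (`pearson_r` is a hypothetical helper):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical abundances and rates, for illustration only
abundance = [100.0, 200.0, 300.0, 400.0, 500.0]
rate = [12.0, 9.0, 15.0, 11.0, 18.0]
r = pearson_r(abundance, rate)
print(f"r = {r:.3f}, R2 = {r ** 2:.3f}")  # for one predictor, R2 of lm() equals r squared
```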
Create a summary table for the paper:  
```{r}
Net_H2O2_production_rf_important_OTUs <- as.data.frame(Net_H2O2_production_rf_important_OTUs)

tab_df(Net_H2O2_production_rf_important_OTUs[1:5], alternate.rows = T, title = "Important OTUs in Net H2O2 Production Random Forest Model", file="RF_Gross_Biotic_Important_OTUs.doc")

write.table(Net_H2O2_production_rf_important_OTUs, file = "Net_H2O2_production_rf_important_OTUs_table.txt", sep = "\t", col.names = TRUE)
```

None of these OTUs are correlated with net H2O2 production rates on their own.  

Next, see if OTUs can predict net decay rates better than linear regressions with other environmental parameters:  

First, merge LE_H2O2.matrix with the environmental data:  
```{r}
#Remove everything except Experiment_Date, net production, and net decay:
keep <- c("Experiment_Date", "Net_production_avg", "Net_production_CI", "Net_decay_avg",
          "Net_decay_CI")

RF_H2O2_df <- Merged_Prod_Decay_WL_only[ , colnames(Merged_Prod_Decay_WL_only) %in% keep]

LE_H2O2_matrix_merged <- merge(LE_H2O2.matrix, RF_H2O2_df, by.x = "row.names",
                               by.y = "Experiment_Date", all.x = FALSE, all.y = TRUE)
#all.x is FALSE because there are 16S samples from filtered and dark bottles that are not being considered here (focusing on whole water light production and decay rates)

#Make a column name more informative:  
names(LE_H2O2_matrix_merged)[names(LE_H2O2_matrix_merged) == "Row.names"] <- "Experiment_Date"

#Make a dataframe that is only net H2O2 decay rates and OTU abundances for the decay model:
drop <- c("Experiment_Date", "Net_production_avg", "Net_production_CI", "Net_decay_CI")

LE_H2O2_matrix_merged_Net_H2O2_decay <- LE_H2O2_matrix_merged[ , !(colnames(LE_H2O2_matrix_merged) %in% drop)]

rm(drop)
drop <- c()
#Only keep OTUs with a maximum abundance above 500 reads/mL. I am doing this because many of the OTUs with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each column of the dataframe
for (i in 1:9844){ 
  #Set to 9844, because there are 9845 columns, but the last column is the H2O2 data, which we want to ignore here
  if (max(LE_H2O2_matrix_merged_Net_H2O2_decay[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (OTU number) to a vector of OTUs to drop
    drop[i] <- colnames(LE_H2O2_matrix_merged_Net_H2O2_decay)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those OTUs from the dataframe for the random forest model:
LE_H2O2_matrix_merged_Net_H2O2_decay <- LE_H2O2_matrix_merged_Net_H2O2_decay[ , !(colnames(LE_H2O2_matrix_merged_Net_H2O2_decay) %in% drop)]
```

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(31415)

#import the R dataframe with field metadata and OTU abundance as a pandas dataframe:
features = r.LE_H2O2_matrix_merged_Net_H2O2_decay
#display a summary of the columns and their data types:
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_decay_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_decay_avg', axis = 1) #axis tells drop to look at column names

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```
The next step is to create training and testing data sets. During training, the model sees the net H2O2 decay rates so that it can learn how to predict them from the features. The model then predicts rates for held-out samples whose decay rates it has not seen, and its effectiveness is judged by comparing the predicted and measured net H2O2 decay rates.   

I'll split the data into 4 subsets, hold one out as the validation set, and train the model on the other three. This is repeated until each subset has served as the validation set once (4-fold cross validation).  

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_decay = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_decay = pd.DataFrame(model_H2O2_decay.cv_results_).sort_values(by='rank_test_score')
```
Export the grid search results to R:
```{r}
scores_df_H2O2_decay <- py$scores_df_H2O2_decay
```
 
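Make predictions of net H2O2 decay from the features (OTU abundances) with the best model from the grid search, then determine the R2 score. The predictions are made on the same samples the model was fit to, so this is an in-sample score:  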
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_decay.predict(features)

#Calculate the absolute residuals:
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_decay.score(features, labels)
```
What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

How does this compare to the baseline predictions using the linear regression model with Chlorophyll and CDOM?
```{r}
#Get the R2 values for these regressions:
print("R squared chla")
WL_lm_results_net_decay_table$Chla$r.squared
print("R squared CDOM")
WL_lm_results_net_decay_table$CDOM$r.squared
print("R squared primary production")
WL_lm_results_net_decay_table$PrimProd$r.squared
#Get the MAE values for these regressions
print("mean absolute error Chla")
mean(abs(WL_lm_results_net_decay$Chla$residuals))
print("mean absolute error CDOM")
mean(abs(WL_lm_results_net_decay$CDOM$residuals))
print("mean absolute error primary production")
mean(abs(WL_lm_results_net_decay$PrimProd$residuals))
```
```{r}
#What is the %increase in R2 using the Random Forest Regression over chlorophyll regression?
abs(0.84 - 0.13)/0.13 * 100

#What is the % change in MAE?
abs(19.10 - 42.60)/42.60 * 100

#What is the % increase in R2 over the primary production regression?
abs(0.84 - 0.38)/0.38 * 100

#What is the change in MAE?
abs(19.10 - 22.09)/22.09 * 100
```
The RF model using bacterial OTUs has a much higher R2 and a lower mean absolute error than the regressions on chlorophyll a concentration and primary production rates (R2 ~546 % and ~121 % higher; MAE ~55 % and ~14 % lower, respectively).  

Calculate the importance of each OTU as a predictor in the model via permutation:  
```{python}
#Only permutation_importance is needed for this step:
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_decay, features, labels, n_repeats=10,
                                random_state=42)
sorted_idx = result.importances_mean.argsort()
```

List the importances of all OTUs (mean +/- standard deviation across the 10 permutations), from most to least important:    
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```
No single OTU had very large impacts on the predictive power of the random forest model.  

Create a table showing how the random forest models compared to the chlorophyll, CDOM, and primary production regression models:  
```{r}
#Generate the table of comparative stats:  
RF_OTUs <- c(0.86, 9.87, 0.84, 19.10)
chl_reg <- c(0.29, 27.26, 0.13, 42.60)
CDOM_reg <- c(0.44, 20.78, 0.02, 47.66)
PrimProd_reg <- c(0.54, 18.84, 0.38, 22.09)
RF_compare_df <- array(numeric(),c(4,5))
RF_compare_df[,1] <- c("Net H2O2 production R2", "Net H2O2 production MAE",
                       "Net H2O2 decay R2", "Net H2O2 decay MAE")
RF_compare_df[,2] <- RF_OTUs
RF_compare_df[,3] <- chl_reg
RF_compare_df[,4] <- CDOM_reg
RF_compare_df[,5] <- PrimProd_reg
colnames(RF_compare_df) <- c("Statistic", "OTU abundance random forest", "Chlorophyll a regression", "CDOM regression", "Primary Production regression")

#Export into a publication ready format:  
RF_compare_df <- as.data.frame(RF_compare_df)

tab_df(RF_compare_df, alternate.rows = T, file="RF_compare_df.doc")
```

Are specific phyla or classes of bacteria better predictors of net H2O2 production and decay?  

Group the OTU abundances by phylum and sum the abundances, then recalculate the 95% CI for phylum abundance:  
```{r}
#First clean up phylum names to match NCBI names:
LE_H2O2.merged.long$Phylum <- gsub("Acidobacteriota", "Acidobacteria", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Actinobacteriota", "Actinobacteria", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Armatimonadota", "Armatimonadetes", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Bacteroidota", "Bacteroidetes", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Planctomycetota", "Planctomycetes", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Caldisericota", "Caldiserica", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Calditrichota", "Calditrichaeota", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Campilobacterota", "Epsilonproteobacteria", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Cloacimonadota", "Cloacimonetes", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Deferrisomatota", "Deltaproteobacteria", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Deinococcota", "Deinococcus_Thermus", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Elusimicrobiota", "Elusimicrobia", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Entotheonellaeota", "Tectomicrobia", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Fibrobacterota", "Fibrobacteres", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Fusobacteriota", "Fusobacteria", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Gemmatimonadota", "Gemmatimonadetes", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Halobacterota", "Euryarchaeota", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Nitrospinota", "Nitrospinae", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Nitrospirota", "Nitrospirae", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("SAR324_clade(Marine_group_B)", "Deltaproteobacteria", LE_H2O2.merged.long$Phylum, fixed = TRUE) #fixed = TRUE so the parentheses are matched literally instead of as a regex group
LE_H2O2.merged.long$Phylum <- gsub("Spirochaetota", "Spirochaetes", LE_H2O2.merged.long$Phylum)
LE_H2O2.merged.long$Phylum <- gsub("Verrucomicrobiota", "Verrucomicrobia", LE_H2O2.merged.long$Phylum)

#First get a dataframe of the summed abundance of all OTUs grouped by phylum in each sample:
LE_H2O2_phylum_df <- LE_H2O2.merged.long %>%
    group_by(Experiment_Date, Bottle_name, Experiment_type, Condition, Phylum) %>%
  summarise(Total_Reads_mL=sum(Reads_mL))

#Compute the average abundance of each phylum and the 95% confidence intervals on the mean:  
LE_H2O2_phylum_df <- LE_H2O2_phylum_df %>%
 group_by(Experiment_Date, Experiment_type, Condition, Phylum) %>%
    summarise(n=n(), mean=mean(Total_Reads_mL), sd=sd(Total_Reads_mL)) %>%
  mutate(se=sd/sqrt(n)) %>%
  mutate(ci=se*1.96)

#Make a dataframe of only whole water data to put into random forest model:
LE_H2O2_phylum_df_WL_only <- filter(LE_H2O2_phylum_df, Condition == "WL")

#remove columns that are not needed:
drop <- c("Experiment_type", "Condition", "n", "sd", "se")
LE_H2O2_phylum_df_WL_only <- LE_H2O2_phylum_df_WL_only[ , !(colnames(LE_H2O2_phylum_df_WL_only) %in% drop)]

#Convert the table to wide format so that each row is one experiment date and the phylum abundances and errors are their own columns of data:  
LE_H2O2_phylum_df_WL_only_wide <- dcast(melt(LE_H2O2_phylum_df_WL_only, id.vars=c("Experiment_Date", "Phylum")), Experiment_Date~variable+Phylum)

#Combine with the data frame of environmental data, only keeping samples which have 16S and H2O2 data:  
LE_H2O2_phylum_df_environ <- merge(Merged_Prod_Decay_WL_only,
                                   LE_H2O2_phylum_df_WL_only_wide,
                                   by=c("Experiment_Date"), all = FALSE)
```
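
One thing to watch in the renaming step is that `gsub()` treats its pattern as a regular expression by default, so parentheses in a taxon name act as a grouping operator instead of literal characters. Here is a minimal standalone sketch of that behavior using Python's `re` module, which handles parentheses the same way. The variable names are only for illustration:

```python
import re

# The taxon name that contains parentheses in the renaming step above.
name = "SAR324_clade(Marine_group_B)"
pattern = "SAR324_clade(Marine_group_B)"

# As a regular expression, "(...)" is a capture group, so the pattern looks for
# the text "SAR324_cladeMarine_group_B" and leaves the real name untouched.
as_regex = re.sub(pattern, "Deltaproteobacteria", name)

# Escaping the pattern (the analogue of fixed = TRUE in R's gsub) matches it literally.
as_literal = re.sub(re.escape(pattern), "Deltaproteobacteria", name)

print(as_regex)
print(as_literal)
```

Names without regex metacharacters, which is most of the list above, are unaffected either way.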

Make a dataframe to put into the random forest model:  
```{r}
#Make a dataframe that is only net H2O2 production rates and summed phylum abundances for the first model:
keep <- c("Net_production_avg", "mean_Acidobacteria", "mean_Actinobacteria", "mean_Alphaproteobacteria", "mean_Archaea_unclassified", "mean_Armatimonadetes", "mean_Asgardarchaeota", "mean_Bacteria_unclassified", "mean_Bacteroidetes", "mean_Betaproteobacteria", "mean_Caldiserica", "mean_Calditrichaeota", "mean_Epsilonproteobacteria", "mean_Chlamydiae", "mean_Chloroflexi", "mean_Cloacimonetes", "mean_Crenarchaeota", "mean_Cyanobacteria", "mean_Deinococcus_Thermus", "mean_Deltaproteobacteria", "mean_Dependentiae", "mean_Desulfobacterota", "mean_DTB120", "mean_Elusimicrobia", "mean_Tectomicrobia", "mean_Euryarchaeota", "mean_FCPU426", "mean_Fibrobacteres", "mean_Firmicutes", "mean_Fusobacteria", "mean_Gammaproteobacteria", "mean_Gemmatimonadetes", "mean_Hydrogenedentes", "mean_Latescibacterota", "mean_Margulisbacteria", "mean_MBNT15", "mean_Methylomirabilota", "mean_Myxococcota", "mean_Nanoarchaeota", "mean_NB1-j", "mean_Nitrospinae", "mean_Nitrospirae", "mean_NKB15", "mean_Oligoflexia", "mean_Patescibacteria", "mean_PAUC34f", "mean_Planctomycetes", "mean_Proteobacteria", "mean_Spirochaetes", "mean_Sumerlaeota", "mean_TA06", "mean_Thermoplasmatota", "mean_Verrucomicrobia", "mean_WOR-1", "mean_WPS-2", "mean_WS1", "mean_WS2", "mean_WS4", "mean_Zixibacteria")
          
LE_H2O2_phylum_Net_production_df <- LE_H2O2_phylum_df_environ[ , (colnames(LE_H2O2_phylum_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep phyla with a maximum abundance of at least 500 reads/mL. I am doing this because many of the phyla with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each feature column of the dataframe
for (i in 2:ncol(LE_H2O2_phylum_Net_production_df)){ 
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_phylum_Net_production_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (phylum name) to a vector of phyla to drop
    drop[i] <- colnames(LE_H2O2_phylum_Net_production_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those phyla from the dataframe for the random forest model:
LE_H2O2_phylum_Net_production_df <- LE_H2O2_phylum_Net_production_df[ , !(colnames(LE_H2O2_phylum_Net_production_df) %in% drop)]
```
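
For reference, the same abundance filter can be written by column name instead of by column position. This is a minimal standalone sketch, assuming the data are held as a dictionary of columns; the function name and the example values are hypothetical:

```python
def drop_low_abundance(columns, threshold=500.0, target="Net_production_avg"):
    """Keep the target column and every feature whose maximum reaches the threshold."""
    kept = {}
    for name, values in columns.items():
        # The rate column is always kept; features are kept if max >= threshold.
        if name == target or max(values) >= threshold:
            kept[name] = values
    return kept

# Hypothetical abundances (reads/mL) for three sampling dates.
example = {
    "Net_production_avg": [12.0, 30.5, 8.2],
    "mean_Cyanobacteria": [15000.0, 820.0, 40.0],
    "mean_WS4": [0.0, 35.0, 120.0],
}
filtered = drop_low_abundance(example)
print(list(filtered))
```

Selecting by name means the filter does not depend on how many taxa ended up in the table or on the rate column being first.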

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(31415)

#import the R dataframe with net H2O2 production rates and phylum abundances as a pandas dataframe:
features = r.LE_H2O2_phylum_Net_production_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_production_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_production_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```
The next step is to train and validate the model. During training, the model sees the net H2O2 production rates so that it can learn how to use the features to predict them. It then predicts the rates for held-out samples, and model effectiveness is determined by comparing the predicted and actual net H2O2 production rates.   

I'll bin the data into 4 subsets, hold one out as the validation set, and train the model on the other three subsets. Repeat until each subset has been used for validation once (4-fold cross validation).  
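
As a reminder of how k-fold cross validation partitions the samples, here is a minimal standalone sketch. It assumes contiguous, unshuffled folds, which is the default behavior of scikit-learn's `KFold`. The function name and the sample count are hypothetical:

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; yield (train, validation) index lists."""
    # The first n_samples % k folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

# Hypothetical number of experiments; the real count comes from the merged data frame.
for train, validation in k_fold_indices(n_samples=10, k=4):
    print("train:", train, "validate:", validation)
```

Each sample is used for validation exactly once and for training in the remaining k - 1 rounds.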

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_production_phylum = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_production_phylum = pd.DataFrame(model_H2O2_production_phylum.cv_results_).sort_values(by='rank_test_score')
```
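
To get a feel for how long this grid search takes, the number of model fits can be counted directly. This is a small standalone sketch using the same grid and `cv=4`; the extra fit assumes `GridSearchCV` is left at its default `refit=True`:

```python
from itertools import product

# Same hyperparameter grid as in the chunk above.
model_params = {
    "n_estimators": [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
    "max_features": ["sqrt", "log2", 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [2, 3, 4],
}
n_folds = 4

# Every combination of the three hyperparameters is evaluated on every fold.
combinations = list(product(*model_params.values()))
n_fits = len(combinations) * n_folds + 1  # +1 for the final refit on all of the data

print(len(combinations), n_fits)
```

Because the larger `n_estimators` values dominate the run time, trimming that list is the quickest way to shorten the search if needed.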
Export the grid search results to R:
```{r}
scores_df_H2O2_production_phylum <- py$scores_df_H2O2_production_phylum
```
 
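Make predictions of net H2O2 production from the phylum abundances, then determine the R2 score. These predictions are made on the same samples the model was fit on, so the R2 is an in-sample score rather than a held-out test score:  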
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_production_phylum.predict(features)

#Calculate the absolute residuals:
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_production_phylum.score(features, labels)
```
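
For reference, the two statistics reported here can be computed by hand. With no `scoring` argument, the `score` method falls back to the regressor's own score, which is the coefficient of determination (R2). This is a minimal standalone sketch with made-up rates, not values from the experiments:

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(observed, predicted):
    """Average of the absolute differences between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical rates (nM/hr), only for illustration.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(r2_score(observed, predicted), mean_absolute_error(observed, predicted))
```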
What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

What is the percent change in R2 and MAE when using phylum abundances instead of OTU abundances?  
```{r}
#What is the % decrease in R2?
abs(0.86 - 0.77)/0.86 * 100

#What is the % increase in MAE (relative to the OTU-level MAE of 9.87)?
abs(9.87 - 14.27)/9.87 * 100
```
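
To keep the baseline explicit, the percent change can be wrapped in a small helper. This is a standalone sketch that assumes 0.86 and 9.87 are the OTU-level R2 and MAE (the baseline) and 0.77 and 14.27 are the phylum-level values; the function name is hypothetical:

```python
def percent_change(baseline, new):
    """Signed percent change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100

# OTU-level model is the baseline, phylum-level model is the new value (assumed).
r2_change = percent_change(0.86, 0.77)
mae_change = percent_change(9.87, 14.27)
print(round(r2_change, 1), round(mae_change, 1))
```

A negative result means the statistic went down relative to the baseline, so the sign carries the "increase" or "decrease" that the comments above spell out by hand.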

Calculate the importance of each phylum as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_production_phylum, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```
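
Permutation importance measures how much the model score drops when one feature column is shuffled, which breaks its relationship with the target. Here is a minimal standalone sketch of that idea with a toy model. It is not scikit-learn's implementation, and all names and data are hypothetical:

```python
import random

def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, rows, targets, n_repeats=10, seed=42):
    """Mean drop in R2 when one feature column is shuffled, for each column."""
    rng = random.Random(seed)
    baseline = r2_score(targets, [predict(row) for row in rows])
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only this column, leaving the other features in place.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - r2_score(targets, [predict(row) for row in permuted]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy model that only uses the first feature.
rows = [[float(i), float((i * 7) % 5)] for i in range(8)]
targets = [2.0 * row[0] for row in rows]
toy_importances = permutation_importance_sketch(lambda row: 2.0 * row[0], rows, targets)
print(toy_importances)
```

The second feature gets an importance of zero because the toy model never reads it, while shuffling the first feature degrades the fit.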

List the phyla whose lower 95% confidence bound on the mean permutation importance exceeds 0.009:    
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```
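
The cutoff in the loop above is the lower bound of a normal-approximation 95% confidence interval on the mean importance, using the 10 permutation repeats as the sample size. This is a minimal standalone sketch with made-up importances; the helper name is hypothetical:

```python
import math

def lower_ci_bound(mean, std, n_repeats, z=1.96):
    """Lower bound of a normal-approximation 95% CI on the mean importance."""
    return mean - z * std / math.sqrt(n_repeats)

# Hypothetical importances: (name, mean, standard deviation over 10 permutations).
candidates = [
    ("mean_Cyanobacteria", 0.120, 0.030),
    ("mean_Firmicutes", 0.015, 0.020),
]
selected = [name for name, mean, std in candidates
            if lower_ci_bound(mean, std, n_repeats=10) > 0.009]
print(selected)
```

Note that the printed "+/-" value in the loop is the standard deviation of the importances, not the confidence interval used for the cutoff.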

Now do a phylum-level analysis for net H2O2 decay rates: 
```{r}
#Make a dataframe that is only net H2O2 decay rates and summed phylum abundances for the first model:
keep <- c("Net_decay_avg", "mean_Acidobacteria", "mean_Actinobacteria", "mean_Alphaproteobacteria", "mean_Archaea_unclassified", "mean_Armatimonadetes", "mean_Asgardarchaeota", "mean_Bacteria_unclassified", "mean_Bacteroidetes", "mean_Betaproteobacteria", "mean_Caldiserica", "mean_Calditrichaeota", "mean_Epsilonproteobacteria", "mean_Chlamydiae", "mean_Chloroflexi", "mean_Cloacimonetes", "mean_Crenarchaeota", "mean_Cyanobacteria", "mean_Deinococcus_Thermus", "mean_Deltaproteobacteria", "mean_Dependentiae", "mean_Desulfobacterota", "mean_DTB120", "mean_Elusimicrobia", "mean_Tectomicrobia", "mean_Euryarchaeota", "mean_FCPU426", "mean_Fibrobacteres", "mean_Firmicutes", "mean_Fusobacteria", "mean_Gammaproteobacteria", "mean_Gemmatimonadetes", "mean_Hydrogenedentes", "mean_Latescibacterota", "mean_Margulisbacteria", "mean_MBNT15", "mean_Methylomirabilota", "mean_Myxococcota", "mean_Nanoarchaeota", "mean_NB1-j", "mean_Nitrospinae", "mean_Nitrospirae", "mean_NKB15", "mean_Oligoflexia", "mean_Patescibacteria", "mean_PAUC34f", "mean_Planctomycetes", "mean_Proteobacteria", "mean_Spirochaetes", "mean_Sumerlaeota", "mean_TA06", "mean_Thermoplasmatota", "mean_Verrucomicrobia", "mean_WOR-1", "mean_WPS-2", "mean_WS1", "mean_WS2", "mean_WS4", "mean_Zixibacteria")
          
LE_H2O2_phylum_Net_decay_df <- LE_H2O2_phylum_df_environ[ , (colnames(LE_H2O2_phylum_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep phyla with a maximum abundance of at least 500 reads/mL. I am doing this because many of the phyla with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each feature column of the dataframe
for (i in 2:ncol(LE_H2O2_phylum_Net_decay_df)){ 
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_phylum_Net_decay_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (phylum name) to a vector of phyla to drop
    drop[i] <- colnames(LE_H2O2_phylum_Net_decay_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those phyla from the dataframe for the random forest model:
LE_H2O2_phylum_Net_decay_df <- LE_H2O2_phylum_Net_decay_df[ , !(colnames(LE_H2O2_phylum_Net_decay_df) %in% drop)]
```

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(31415)

#import the R dataframe with net H2O2 decay rates and phylum abundances as a pandas dataframe:
features = r.LE_H2O2_phylum_Net_decay_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_decay_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_decay_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```
The next step is to train and validate the model. During training, the model sees the net H2O2 decay rates so that it can learn how to use the features to predict them. It then predicts the rates for held-out samples, and model effectiveness is determined by comparing the predicted and actual net H2O2 decay rates.   

I'll bin the data into 4 subsets, hold one out as the validation set, and train the model on the other three subsets. Repeat until each subset has been used for validation once (4-fold cross validation).  

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_decay_phylum = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_decay_phylum = pd.DataFrame(model_H2O2_decay_phylum.cv_results_).sort_values(by='rank_test_score')
```
Export the grid search results to R:
```{r}
scores_df_H2O2_decay_phylum <- py$scores_df_H2O2_decay_phylum
```
 
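Make predictions of net H2O2 decay from the phylum abundances, then determine the R2 score. These predictions are made on the same samples the model was fit on, so the R2 is an in-sample score rather than a held-out test score:  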
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_decay_phylum.predict(features)

#Calculate the absolute residuals:
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_decay_phylum.score(features, labels)
```
What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```


What is the percent change in R2 and MAE when using phylum abundances instead of OTU abundances?  
```{r}
#What is the % increase in R2?
abs(0.86 - 0.84)/0.84 * 100

#What is the % decrease in MAE?
abs(19.10 - 17)/19.10 * 100
```

Calculate the importance of each phylum as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_decay_phylum, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```

List the phyla whose lower 95% confidence bound on the mean permutation importance exceeds 0.009:    
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```
What is the range of abundance in Nitrospirae?
```{r}
print("minimum Nitrospirae abundance")
min(LE_H2O2_phylum_Net_decay_df$mean_Nitrospirae)
print("maximum Nitrospirae abundance")
max(LE_H2O2_phylum_Net_decay_df$mean_Nitrospirae)
print("mean Nitrospirae abundance")
mean(LE_H2O2_phylum_Net_decay_df$mean_Nitrospirae)
print("95% CI Nitrospirae abundance")
sd(LE_H2O2_phylum_Net_decay_df$mean_Nitrospirae)/sqrt(dim(LE_H2O2_phylum_Net_decay_df)[1])*1.96
```
Plot Nitrospirae abundance vs net H2O2 decay rate:  
```{r}
Nitrospirae_decay_plot <- ggplot(LE_H2O2_phylum_df_environ, aes(x=mean_Nitrospirae, y=Net_decay_avg, color=as.factor(Year))) +
    geom_smooth(method=lm, color="navy", fill="lightsteelblue", se=TRUE, size=0.25) +
    geom_point(size = 1, alpha = 0.8) +
    geom_errorbar(aes(ymin=Net_decay_avg-Net_decay_CI, ymax=Net_decay_avg+Net_decay_CI),
                  width=10, size = 0.1) +
    geom_errorbarh(aes(xmin=mean_Nitrospirae-ci_Nitrospirae,
                       xmax=mean_Nitrospirae+ci_Nitrospirae),
                   height=10, size=0.1) +
    scale_color_manual(values=c("red", "blue", "orange"), name = "Year") +
    theme_classic() +
    theme(plot.background = element_rect(color = "NA"),
          strip.text = element_text(size = 10),
          strip.background = element_rect(size = 0.25),
          axis.line.x = element_line(size=0.1),
          axis.line.y = element_line(size=0.1),
          axis.text.y = element_text(size = 8, color = "black",
                                     margin = margin(t = 0, r = 5, b = 0, l = 0)),
          axis.title.y = element_text(size = 10, color = "black",
                                      margin = margin(t = 0, r = 5, b = 0, l = 0)),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          axis.title.x = element_text(size = 10, color = "black",
                                      margin = margin(t = 5, r = 0, b = 0, l = 0)),
          axis.text.x = element_text(size = 8, color = "black", angle=45, hjust=1,
                                     margin = margin(t = 5, r = 0, b = 0, l = 0)),
          axis.ticks.length = unit(-0.05, "cm"),
          axis.ticks = element_line(size=0.1),
          legend.text = element_text(size = 8),
          legend.position = "top") +
    #coord_cartesian(ylim=c(-100,250)) +
    #scale_y_continuous(breaks=seq(-100,250, by=50)) +
    xlab(expression("Nitrospirae abundance (reads/mL)")) +
    ylab(expression("Net H"[2]*"O"[2]*" decay (nM/hr)"))
Nitrospirae_decay_plot
```
Next, try predicting H2O2 production and decay at the order level:  
```{r}
#First get a dataframe of the summed abundance of all OTUs grouped by taxonomic order in each sample:
LE_H2O2_order_df <- LE_H2O2.merged.long %>%
    group_by(Experiment_Date, Bottle_name, Experiment_type, Condition, Order) %>%
  summarise(Total_Reads_mL=sum(Reads_mL))

#Compute the average abundance of each order and the 95% confidence intervals on the mean:
LE_H2O2_order_df <- LE_H2O2_order_df %>%
 group_by(Experiment_Date, Experiment_type, Condition, Order) %>%
    summarise(n=n(), mean=mean(Total_Reads_mL), sd=sd(Total_Reads_mL)) %>%
  mutate(se=sd/sqrt(n)) %>%
  mutate(ci=se*1.96)

#Make a dataframe of only whole water data to put into random forest model:
LE_H2O2_order_df_WL_only <- filter(LE_H2O2_order_df, Condition == "WL")

#remove columns that are not needed:
drop <- c("Experiment_type", "Condition", "n", "sd", "se")
LE_H2O2_order_df_WL_only <- LE_H2O2_order_df_WL_only[ , !(colnames(LE_H2O2_order_df_WL_only) %in% drop)]

#Convert the table to wide format so that each row is one experiment date and the order abundances and errors are their own columns of data:  
LE_H2O2_order_df_WL_only_wide <- dcast(melt(LE_H2O2_order_df_WL_only, id.vars=c("Experiment_Date", "Order")), Experiment_Date~variable+Order)

#Combine with the data frame of environmental data, only keeping samples which have 16S and H2O2 data:  
LE_H2O2_order_df_environ <- merge(Merged_Prod_Decay_WL_only,
                                   LE_H2O2_order_df_WL_only_wide,
                                   by=c("Experiment_Date"), all = FALSE)
```
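
The replicate averaging above uses the same recipe at every taxonomic rank: mean, sample standard deviation, standard error, and a 1.96 x SE half-width. This is a minimal standalone sketch of that arithmetic for one taxon on one date; the function name and the bottle values are hypothetical:

```python
import math
import statistics

def mean_with_ci(values, z=1.96):
    """Return (n, mean, sd, se, ci) using the normal-approximation multiplier z."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, like R's sd()
    se = sd / math.sqrt(n)
    return n, mean, sd, se, z * se

# Hypothetical summed reads/mL for one order in three replicate bottles.
n, mean, sd, se, ci = mean_with_ci([900.0, 1000.0, 1100.0])
print(n, mean, sd, round(ci, 1))
```

If there are only a few replicate bottles per date, a t-based multiplier would give a wider interval (for n = 3 the two-sided 95% value is about 4.30 rather than 1.96), so these intervals are best read as approximate.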
Make a dataframe to put into the random forest model:  
```{r}
#Make a dataframe that is only net H2O2 production rates and summed order-level abundances for the first model:
keep <- c("Net_production_avg")
          
LE_H2O2_order_Net_production_df <- LE_H2O2_order_df_environ[ , grepl("mean_", colnames(LE_H2O2_order_df_environ)) | (colnames(LE_H2O2_order_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep orders with a maximum abundance of at least 500 reads/mL. I am doing this because many of the orders with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each feature column of the dataframe
for (i in 2:ncol(LE_H2O2_order_Net_production_df)){ 
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_order_Net_production_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (order name) to a vector of orders to drop
    drop[i] <- colnames(LE_H2O2_order_Net_production_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those orders from the dataframe for the random forest model:
LE_H2O2_order_Net_production_df <- LE_H2O2_order_Net_production_df[ , !(colnames(LE_H2O2_order_Net_production_df) %in% drop)]
```

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(20000)

#import the R dataframe with field metadata and order-level 16S rRNA abundance as a pandas dataframe:
features = r.LE_H2O2_order_Net_production_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_production_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_production_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_production_order = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_production_order = pd.DataFrame(model_H2O2_production_order.cv_results_).sort_values(by='rank_test_score')
```
Export the grid search results to R:
```{r}
scores_df_H2O2_production_order <- py$scores_df_H2O2_production_order
```

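Make predictions of net H2O2 production from the order-level abundances, then determine the R2 score. These predictions are made on the same samples the model was fit on, so the R2 is an in-sample score rather than a held-out test score:  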
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_production_order.predict(features)

#Calculate the absolute residuals:
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_production_order.score(features, labels)
```

What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

What is the percent change in R2 and MAE when using order-level abundances instead of OTU abundances?  
```{r}
#What is the % change in R2?
abs(0.86 - 0.86)/0.86 * 100

#What is the % increase in MAE?
abs(10.58 - 9.87)/9.87 * 100
```

Calculate the importance of each order as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_production_order, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```

List the orders whose lower 95% confidence bound on the mean permutation importance exceeds 0.009:    
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```

Now do an order-level analysis for net H2O2 decay rates:  
```{r}
#Make a dataframe that is only net H2O2 decay rates and summed order-level abundances for the first model:
keep <- c("Net_decay_avg")
          
LE_H2O2_order_Net_decay_df <- LE_H2O2_order_df_environ[ , grepl("mean_", colnames(LE_H2O2_order_df_environ)) | (colnames(LE_H2O2_order_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep orders with a maximum abundance of at least 500 reads/mL. I am doing this because many of the orders with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each feature column of the dataframe
for (i in 2:ncol(LE_H2O2_order_Net_decay_df)){ 
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_order_Net_decay_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (order name) to a vector of orders to drop
    drop[i] <- colnames(LE_H2O2_order_Net_decay_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those orders from the dataframe for the random forest model:
LE_H2O2_order_Net_decay_df <- LE_H2O2_order_Net_decay_df[ , !(colnames(LE_H2O2_order_Net_decay_df) %in% drop)]
```

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(20000)

#import the R dataframe with field metadata and order-level 16S rRNA abundance as a pandas dataframe:
features = r.LE_H2O2_order_Net_decay_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_decay_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_decay_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_decay_order = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_decay_order = pd.DataFrame(model_H2O2_decay_order.cv_results_).sort_values(by='rank_test_score')
```
Export the grid search results to R:
```{r}
scores_df_H2O2_decay_order <- py$scores_df_H2O2_decay_order
```

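Make predictions of net H2O2 decay from the order-level abundances, then determine the R2 score. These predictions are made on the same samples the model was fit on, so the R2 is an in-sample score rather than a held-out test score:  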
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_decay_order.predict(features)

#Calculate the absolute residuals:
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_decay_order.score(features, labels)
```

What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

Calculate the importance of each order as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_decay_order, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```

List the orders whose lower 95% confidence bound on the mean permutation importance exceeds 0.009:    
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```

Next, try predicting H2O2 production and decay at the class level:  
```{r}
#First get a dataframe of the summed abundance of all OTUs grouped by taxonomic class in each sample:
LE_H2O2_class_df <- LE_H2O2.merged.long %>%
    group_by(Experiment_Date, Bottle_name, Experiment_type, Condition, Class) %>%
  summarise(Total_Reads_mL=sum(Reads_mL))

#Compute the average abundance of each class and the 95% confidence intervals on the mean:
LE_H2O2_class_df <- LE_H2O2_class_df %>%
 group_by(Experiment_Date, Experiment_type, Condition, Class) %>%
    summarise(n=n(), mean=mean(Total_Reads_mL), sd=sd(Total_Reads_mL)) %>%
  mutate(se=sd/sqrt(n)) %>%
  mutate(ci=se*1.96)

#Make a data frame of only whole water data to put into random forest model:
LE_H2O2_class_df_WL_only <- filter(LE_H2O2_class_df, Condition == "WL")

#remove columns that are not needed:
drop <- c("Experiment_type", "Condition", "n", "sd", "se")
LE_H2O2_class_df_WL_only <- LE_H2O2_class_df_WL_only[ , !(colnames(LE_H2O2_class_df_WL_only) %in% drop)]

#Convert the table to wide format so that each row is one experiment date and the class abundances and errors are their own columns of data:  
LE_H2O2_class_df_WL_only_wide <- dcast(melt(LE_H2O2_class_df_WL_only, id.vars=c("Experiment_Date", "Class")), Experiment_Date~variable+Class)

#Combine with the data frame of environmental data, only keeping samples which have 16S and H2O2 data:  
LE_H2O2_class_df_environ <- merge(Merged_Prod_Decay_WL_only,
                                   LE_H2O2_class_df_WL_only_wide,
                                   by=c("Experiment_Date"), all = FALSE)
```
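
For reference, the same sum, average, and reshape steps can be sketched in pandas. This is a rough equivalent on made-up numbers, not the code used for the analysis; the table `toy_long`, the dates `d1`/`d2`, and the classes `X`/`Y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up long-format reads: two replicate bottles on each of two dates
toy_long = pd.DataFrame({
    "Experiment_Date": ["d1"] * 5 + ["d2"] * 4,
    "Bottle_name": ["b1", "b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"],
    "Class": ["X", "X", "Y", "X", "Y", "X", "Y", "X", "Y"],
    "Reads_mL": [100.0, 50.0, 10.0, 250.0, 30.0, 300.0, 5.0, 500.0, 15.0],
})

# Step 1: sum all OTUs of the same class within each bottle
per_bottle = (toy_long
              .groupby(["Experiment_Date", "Bottle_name", "Class"], as_index=False)["Reads_mL"]
              .sum())

# Step 2: average across replicate bottles and attach a 95% CI half-width
stats = (per_bottle
         .groupby(["Experiment_Date", "Class"])["Reads_mL"]
         .agg(n="count", mean="mean", sd="std")  # pandas std uses n - 1, like R's sd()
         .reset_index())
stats["se"] = stats["sd"] / np.sqrt(stats["n"])
stats["ci"] = 1.96 * stats["se"]

# Step 3: one row per date, one "mean_<Class>" column per class
wide = stats.pivot(index="Experiment_Date", columns="Class", values="mean").add_prefix("mean_")
print(wide)
```
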
Make a dataframe to put into the random forest model:  
```{r}
#Make a dataframe that is only net H2O2 production rates and summed class-level abundances for the first model:
keep <- c("Net_production_avg")
          
LE_H2O2_class_Net_production_df <- LE_H2O2_class_df_environ[ , grepl("mean_", colnames(LE_H2O2_class_df_environ)) | (colnames(LE_H2O2_class_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep classes with a maximum abundance above 500 reads/mL. I am doing this because many of the OTUs with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each column of the dataframe
for (i in 2:ncol(LE_H2O2_class_Net_production_df)){
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_class_Net_production_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (class name) to a vector of classes to drop
    drop[i] <- colnames(LE_H2O2_class_Net_production_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those classes from the dataframe for the random forest model:
LE_H2O2_class_Net_production_df <- LE_H2O2_class_Net_production_df[ , !(colnames(LE_H2O2_class_Net_production_df) %in% drop)]
```
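
The abundance filter above can also be written without a fixed column count. Below is a minimal pandas sketch of the same rule (drop a taxon when its maximum is below 500 reads/mL); the helper name `drop_low_abundance` and the toy table are my own assumptions.

```python
import pandas as pd


def drop_low_abundance(df, target, min_max_abundance=500):
    """Keep the target column plus every feature whose maximum reaches the cutoff."""
    feature_cols = [c for c in df.columns if c != target]
    keep = [c for c in feature_cols if df[c].max() >= min_max_abundance]
    return df[[target] + keep]


# Hypothetical table: one rate column and three class-abundance columns
toy_table = pd.DataFrame({
    "Net_production_avg": [1.0, 2.0],
    "mean_A": [10.0, 600.0],   # maximum 600, kept
    "mean_B": [499.0, 20.0],   # maximum 499, dropped
    "mean_C": [500.0, 0.0],    # maximum exactly 500, kept
})

filtered = drop_low_abundance(toy_table, target="Net_production_avg")
print(list(filtered.columns))
```

Selecting columns by name rather than by position means the filter keeps working if the number of taxa changes.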

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(21040)

#import the R dataframe with field metadata and class-level 16S rRNA abundance as a pandas dataframe:
features = r.LE_H2O2_class_Net_production_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_production_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_production_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```
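
As a quick check of what the chunk above produces, here is the same split of target and features on a tiny made-up table. The `toy_` names and values are hypothetical; only the column name `Net_production_avg` comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataframe passed over from R
toy_df = pd.DataFrame({
    "Net_production_avg": [1.5, 2.5, 3.5],
    "mean_A": [10.0, 20.0, 30.0],
    "mean_B": [5.0, 0.0, 7.0],
})

# The target is pulled out as a 1-D array
toy_labels = toy_df["Net_production_avg"].to_numpy()

# Everything else becomes the 2-D feature matrix (rows = samples, columns = taxa)
toy_feature_table = toy_df.drop(columns="Net_production_avg")
toy_feature_list = list(toy_feature_table.columns)
toy_features = toy_feature_table.to_numpy()

print(toy_features.shape, toy_labels.shape, toy_feature_list)
```

The order of `toy_feature_list` matches the column order of `toy_features`, which is what lets the importance listing map indices back to taxon names.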

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_production_class = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_production_class = pd.DataFrame(model_H2O2_production_class.cv_results_).sort_values(by='rank_test_score')
```
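
To show what the grid search returns, here is a much smaller sketch on synthetic data. The data, the reduced grid, and the variable names are assumptions for illustration. For a regressor, an integer `cv` gives unshuffled K-fold splits, and the default score is R2 on each held-out fold.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data: 24 samples, 5 features, target driven by the first feature
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(24, 5))
toy_y = 2.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=24)

# A deliberately small grid (2 x 2 x 2 = 8 combinations) so it runs quickly
small_grid = {
    "n_estimators": [25, 50],
    "max_features": ["sqrt", 1.0],
    "min_samples_split": [2, 3],
}

search = GridSearchCV(RandomForestRegressor(random_state=42), small_grid, cv=4)
search.fit(toy_X, toy_y)

# One row per parameter combination, best combination first
toy_scores = pd.DataFrame(search.cv_results_).sort_values(by="rank_test_score")
print(search.best_params_)
print(toy_scores[["params", "mean_test_score", "rank_test_score"]].head())
```

The `mean_test_score` column is the cross-validated R2 averaged over the 4 folds, so it is the number to compare between parameter combinations.
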
Export the grid search results to R:
```{r}
scores_df_H2O2_production_class <- py$scores_df_H2O2_production_class
```

Make predictions of H2O2 production with the model features (class-level abundances), which are the same samples used for fitting. Then, determine the R2 score:  
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_production_class.predict(features)

#Calculate the absolute errors (absolute values of the residuals):
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_production_class.score(features, labels)
```
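
Because the score above is computed on the samples the model was fit to, it helps to compare it with a cross-validated score. The sketch below does this on synthetic data; the data and names are made up, and it is meant as an illustration rather than a result from this experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: 30 samples, 4 features, noisy target
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(30, 4))
toy_y = toy_X[:, 0] + rng.normal(scale=0.5, size=30)

# R2 on the same data the forest was trained on
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(toy_X, toy_y)
train_r2 = forest.score(toy_X, toy_y)

# R2 averaged over 4 held-out folds
cv_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42), toy_X, toy_y, cv=4
).mean()

print(f"training R2: {train_r2:.2f}")
print(f"4-fold cross-validated R2: {cv_r2:.2f}")
```

For a fitted `GridSearchCV` object, the comparable cross-validated number is stored in its `best_score_` attribute.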

What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

Calculate the importance of each class as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_production_class, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```
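
As a sanity check on how permutation importance behaves, here is a small sketch on synthetic data where only one feature carries signal. The data and variable names are assumptions; the call to `permutation_importance` uses the same arguments as the chunk above, minus `n_jobs`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only column 0 influences the target
rng = np.random.default_rng(2)
toy_X = rng.normal(size=(40, 3))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=40)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(toy_X, toy_y)

# Each feature is shuffled n_repeats times; importance is the resulting drop in R2
toy_result = permutation_importance(forest, toy_X, toy_y, n_repeats=10, random_state=42)

# Feature indices from most to least important
ranked = toy_result.importances_mean.argsort()[::-1]
for i in ranked:
    print(f"feature {i}: {toy_result.importances_mean[i]:.3f}"
          f" +/- {toy_result.importances_std[i]:.3f}")
```

The `importances` attribute holds one value per feature per repeat, which is where the mean and standard deviation come from.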

List the importances, keeping only the classes whose lower 95% confidence bound on the mean importance is above 0.009:  
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```

Make a class-level dataframe with net decay rates to put into the random forest model:  
```{r}
#Make a dataframe that is only net H2O2 decay rates and summed class-level abundances for the first model:
keep <- c("Net_decay_avg")
          
LE_H2O2_class_Net_decay_df <- LE_H2O2_class_df_environ[ , grepl("mean_", colnames(LE_H2O2_class_df_environ)) | (colnames(LE_H2O2_class_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep classes with a maximum abundance above 500 reads/mL. I am doing this because many of the OTUs with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each column of the dataframe
for (i in 2:ncol(LE_H2O2_class_Net_decay_df)){
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_class_Net_decay_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (class name) to a vector of classes to drop
    drop[i] <- colnames(LE_H2O2_class_Net_decay_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those classes from the dataframe for the random forest model:
LE_H2O2_class_Net_decay_df <- LE_H2O2_class_Net_decay_df[ , !(colnames(LE_H2O2_class_Net_decay_df) %in% drop)]
```

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(21880)

#import the R dataframe with field metadata and class-level 16S rRNA abundance as a pandas dataframe:
features = r.LE_H2O2_class_Net_decay_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_decay_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_decay_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_decay_class = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_decay_class = pd.DataFrame(model_H2O2_decay_class.cv_results_).sort_values(by='rank_test_score')
```
Export the grid search results to R:
```{r}
scores_df_H2O2_decay_class <- py$scores_df_H2O2_decay_class
```

Make predictions of H2O2 decay with the model features (class-level abundances), which are the same samples used for fitting. Then, determine the R2 score:  
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_decay_class.predict(features)

#Calculate the absolute errors (absolute values of the residuals):
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_decay_class.score(features, labels)
```

What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

Calculate the importance of each class as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_decay_class, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```

List the importances, keeping only the classes whose lower 95% confidence bound on the mean importance is above 0.009:  
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```
Next, try predicting H2O2 production and decay at the genus level:  
```{r}
#First get a dataframe of the summed abundance of all OTUs grouped by genus in each sample:
LE_H2O2_genus_df <- LE_H2O2.merged.long %>%
    group_by(Experiment_Date, Bottle_name, Experiment_type, Condition, Genus) %>%
  summarise(Total_Reads_mL=sum(Reads_mL))

#Compute the average abundance of each genus and the 95% confidence intervals on the mean:
LE_H2O2_genus_df <- LE_H2O2_genus_df %>%
 group_by(Experiment_Date, Experiment_type, Condition, Genus) %>%
    summarise(n=n(), mean=mean(Total_Reads_mL), sd=sd(Total_Reads_mL)) %>%
  mutate(se=sd/sqrt(n)) %>%
  mutate(ci=se*1.96)

#Make a data frame of only whole water data to put into random forest model:
LE_H2O2_genus_df_WL_only <- filter(LE_H2O2_genus_df, Condition == "WL")

#remove columns that are not needed:
drop <- c("Experiment_type", "Condition", "n", "sd", "se")
LE_H2O2_genus_df_WL_only <- LE_H2O2_genus_df_WL_only[ , !(colnames(LE_H2O2_genus_df_WL_only) %in% drop)]

#Convert the table to wide format so that each row is one experiment date and the genus abundances and errors are their own columns of data:  
LE_H2O2_genus_df_WL_only_wide <- dcast(melt(LE_H2O2_genus_df_WL_only, id.vars=c("Experiment_Date", "Genus")), Experiment_Date~variable+Genus)

#Combine with the data frame of environmental data, only keeping samples which have 16S and H2O2 data:  
LE_H2O2_genus_df_environ <- merge(Merged_Prod_Decay_WL_only,
                                   LE_H2O2_genus_df_WL_only_wide,
                                   by=c("Experiment_Date"), all = FALSE)
```
Make a dataframe to put into the random forest model:  
```{r}
#Make a dataframe that is only net H2O2 production rates and summed genus-level abundances for the first model:
keep <- c("Net_production_avg")
          
LE_H2O2_genus_Net_production_df <- LE_H2O2_genus_df_environ[ , grepl("mean_", colnames(LE_H2O2_genus_df_environ)) | (colnames(LE_H2O2_genus_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep genera with a maximum abundance above 500 reads/mL. I am doing this because many of the OTUs with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each column of the dataframe
for (i in 2:ncol(LE_H2O2_genus_Net_production_df)){
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_genus_Net_production_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (genus name) to a vector of genera to drop
    drop[i] <- colnames(LE_H2O2_genus_Net_production_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those genera from the dataframe for the random forest model:
LE_H2O2_genus_Net_production_df <- LE_H2O2_genus_Net_production_df[ , !(colnames(LE_H2O2_genus_Net_production_df) %in% drop)]
```

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(21440)

#import the R dataframe with field metadata and genus-level 16S rRNA abundance as a pandas dataframe:
features = r.LE_H2O2_genus_Net_production_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_production_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_production_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_production_genus = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_production_genus = pd.DataFrame(model_H2O2_production_genus.cv_results_).sort_values(by='rank_test_score')
```
Export the grid search results to R:
```{r}
scores_df_H2O2_production_genus <- py$scores_df_H2O2_production_genus
```

Make predictions of H2O2 production with the model features (genus-level abundances), which are the same samples used for fitting. Then, determine the R2 score:  
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_production_genus.predict(features)

#Calculate the absolute errors (absolute values of the residuals):
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_production_genus.score(features, labels)
```

What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

Calculate the importance of each genus as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_production_genus, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```

List the importances, keeping only the genera whose lower 95% confidence bound on the mean importance is above 0.009:  
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```
Now build a model for H2O2 decay rates at the genus level. First, make a dataframe to put into the random forest model:  
```{r}
#Make a dataframe that is only net H2O2 decay rates and summed genus-level abundances for the first model:
keep <- c("Net_decay_avg")
          
LE_H2O2_genus_Net_decay_df <- LE_H2O2_genus_df_environ[ , grepl("mean_", colnames(LE_H2O2_genus_df_environ)) | (colnames(LE_H2O2_genus_df_environ) %in% keep)]

rm(drop)
drop <- c()
#Only keep genera with a maximum abundance above 500 reads/mL. I am doing this because many of the OTUs with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each column of the dataframe
for (i in 2:ncol(LE_H2O2_genus_Net_decay_df)){
  #Ignoring the first column, which has H2O2 rate data
  if (max(LE_H2O2_genus_Net_decay_df[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (genus name) to a vector of genera to drop
    drop[i] <- colnames(LE_H2O2_genus_Net_decay_df)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those genera from the dataframe for the random forest model:
LE_H2O2_genus_Net_decay_df <- LE_H2O2_genus_Net_decay_df[ , !(colnames(LE_H2O2_genus_Net_decay_df) %in% drop)]
```

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(21440)

#import the R dataframe with field metadata and genus-level 16S rRNA abundance as a pandas dataframe:
features = r.LE_H2O2_genus_Net_decay_df
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['Net_decay_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('Net_decay_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_H2O2_decay_genus = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_H2O2_decay_genus = pd.DataFrame(model_H2O2_decay_genus.cv_results_).sort_values(by='rank_test_score')
```
Export the grid search results to R:
```{r}
scores_df_H2O2_decay_genus <- py$scores_df_H2O2_decay_genus
```

Make predictions of H2O2 decay with the model features (genus-level abundances), which are the same samples used for fitting. Then, determine the R2 score:  
```{python}
#Use the forest's predict method on the data
predictions = model_H2O2_decay_genus.predict(features)

#Calculate the absolute errors (absolute values of the residuals):
errors = abs(predictions - labels)

#Calculate the model R2 score
model_H2O2_decay_genus.score(features, labels)
```

What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```

Calculate the importance of each genus as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_H2O2_decay_genus, features, labels, n_repeats=10, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
```

List the importances, keeping only the genera whose lower 95% confidence bound on the mean importance is above 0.009:  
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```

Summarize Cyanobium abundances for superoxide production calculations:  
```{r}
#Sum all the counts for each Cyanobium OTU in each sample:
Cyanobium_abundance_df <- filter(LE_H2O2.merged.long, Genus == "Cyanobium_PCC-6307") %>%
  group_by(Experiment_Date, Bottle_name, Experiment_type, Condition) %>%
  summarise(Total_Reads_mL=sum(Reads_mL))
Cyanobium_abundance_df$Total_Reads_mL <- round(Cyanobium_abundance_df$Total_Reads_mL, 0)

#Average the total Cyanobium read counts across the replicate bottles for each experiment:
Cyanobium_abundance_df <- Cyanobium_abundance_df %>%
 group_by(Experiment_Date, Experiment_type, Condition) %>%
    summarise(n=n(), mean=mean(Total_Reads_mL), sd=sd(Total_Reads_mL)) %>%
  mutate(se=sd/sqrt(n)) %>%
  mutate(ci=se*1.96)
```

Do OTUs predict absolute H2O2 production rates in the Random Forest Model like they did for Net H2O2 production?

First, merge LE_H2O2.matrix with the environmental data:  
```{r}
#Remove everything except Experiment_Date, absolute production rates, and Kloss
keep <- c("Experiment_Date", "PH2O2_avg", "PH2O2_CI", "Kloss_avg",
          "Kloss_CI")

RF_H2O2_df <- Merged_Prod_Decay_WL_only[ , colnames(Merged_Prod_Decay_WL_only) %in% keep]
RF_H2O2_df <- RF_H2O2_df[ RF_H2O2_df$PH2O2_avg != "NaN", ] #Removing the H2O2 data with NaNs from poor model fit.

LE_H2O2_matrix_merged <- merge(LE_H2O2.matrix, RF_H2O2_df, by.x = "row.names",
                               by.y = "Experiment_Date", all.x = FALSE, all.y = TRUE)
#all.x is FALSE because there are 16S samples without any matching rate analysis, due to problems with conducting the experiment or the H2O2 measurements.

#Make a column name more informative:  
names(LE_H2O2_matrix_merged)[names(LE_H2O2_matrix_merged) == "Row.names"] <- "Experiment_Date"

#Make a dataframe that is only absolute H2O2 production rates and OTU abundances for the first model:
drop <- c("Experiment_Date", "PH2O2_CI", "Kloss_avg", "Kloss_CI")

LE_H2O2_matrix_merged_Abs_H2O2_production <- LE_H2O2_matrix_merged[ , !(colnames(LE_H2O2_matrix_merged) %in% drop)]

rm(drop)
drop <- c()
#Only keep OTUs with a maximum abundance above 500 reads/mL. I am doing this because many of the OTUs with lower abundances are basically at the limit of detection when looking at the confidence intervals on their abundances.
#Loop through each column of the dataframe
for (i in 1:(ncol(LE_H2O2_matrix_merged_Abs_H2O2_production) - 1)){
  #Stop at ncol - 1 (9844 of the 9845 columns), because the last column is the H2O2 data, which we want to ignore here
  if (max(LE_H2O2_matrix_merged_Abs_H2O2_production[,i]) < 500){
      #if the maximum value in the column is less than 500, add the column name (OTU number) to a vector of OTUs to drop
    drop[i] <- colnames(LE_H2O2_matrix_merged_Abs_H2O2_production)[i]
  }
}

drop <- drop[!(is.na(drop))] #Remove the NA entries, where values in drop were skipped
#Now remove those OTUs from the dataframe for the random forest model:
LE_H2O2_matrix_merged_Abs_H2O2_production <- LE_H2O2_matrix_merged_Abs_H2O2_production[ , !(colnames(LE_H2O2_matrix_merged_Abs_H2O2_production) %in% drop)]
```
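
The merge above (`all.x = FALSE, all.y = TRUE`) behaves like a right join: every row of the rate table is kept, and 16S samples without rate data are dropped. Here is a small pandas sketch of that behaviour; the dates, the `Otu1` column, and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical OTU matrix indexed by experiment date (like row names in R)
toy_otu = pd.DataFrame({"Otu1": [100, 200, 300]},
                       index=["2018-07-01", "2018-07-08", "2018-07-15"])

# Hypothetical rate table; the last date has no 16S sample
toy_rates = pd.DataFrame({
    "Experiment_Date": ["2018-07-08", "2018-07-15", "2018-07-22"],
    "PH2O2_avg": [1.0, 2.0, 3.0],
})

# Right join: keep all rate rows, matched or not
right_joined = toy_otu.merge(toy_rates, left_index=True,
                             right_on="Experiment_Date", how="right")

# Inner join: keep only dates present in both tables
inner_joined = toy_otu.merge(toy_rates, left_index=True,
                             right_on="Experiment_Date", how="inner")

print(len(right_joined), int(right_joined["Otu1"].isna().sum()))
print(len(inner_joined))
```

If any rate rows lack a matching 16S sample, the right join leaves missing abundances in those rows, so it is worth confirming that none are present before filtering on the column maxima.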

Import the dataframe into python and format for random forest model:  
```{python}
#import the required python packages for data manipulation
import pandas as pd
import numpy as np
np.random.seed(31419)

#import the R dataframe with field metadata and OTU abundance as a pandas dataframe:
features = r.LE_H2O2_matrix_merged_Abs_H2O2_production
#display a summary of the columns (names, non-null counts, and dtypes):
features.info()

#Separate the data into the features and targets.
#The target (aka label) is the value that we want to predict. Features are what the model uses to make the prediction
labels = np.array(features['PH2O2_avg']) #The algorithm needs a numpy array so we do that conversion here

#Remove the labels from features
features = features.drop('PH2O2_avg', axis = 1) #axis refers to the columns

#Save a list of features for use later
feature_list = list(features.columns)

#Convert to numpy array
features = np.array(features)
print(feature_list)
```

Tune the random forest hyperparameters by performing a grid search with 4-fold cross validation:  
```{python}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#Define a dictionary of hyper parameter values to iterate through:  
model_params = {
  'n_estimators': [500, 1000, 2000, 3000, 4000, 5000, 7500, 10000, 25000],
  'max_features': ['sqrt', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  'min_samples_split': [2, 3, 4]
}

#Create random forest regressor:
rf_model = RandomForestRegressor(random_state = 42)
#Set up grid search using 4-fold cross validation and the model params defined above:  
clf = GridSearchCV(rf_model, model_params, cv=4)
#Train the grid search to find the best model:
model_PH2O2 = clf.fit(features, labels)
#Save the results of the gridsearch in a table:
scores_df_Abs_PH2O2 = pd.DataFrame(model_PH2O2.cv_results_).sort_values(by='rank_test_score')
```

Export the grid search results to R:
```{r}
scores_df_Abs_PH2O2 <- py$scores_df_Abs_PH2O2
```

Make predictions of absolute H2O2 production with the model features (OTU abundances), which are the same samples used for fitting. Then, determine the R2 score:  
```{python}
#Use the forest's predict method on the data
predictions = model_PH2O2.predict(features)

#Calculate the absolute errors (absolute values of the residuals):
errors = abs(predictions - labels)

#Calculate the model R2 score
model_PH2O2.score(features, labels)
```
What is the mean absolute error of the model?  
```{r}
mean(abs(py$errors))
```
```{r}
#What is the %increase in R2 using the Random Forest Regression over chlorophyll regression?
abs(0.90 - 0.49)/0.49 * 100

#What is the % increase in R2 over the CDOM regression?
abs(0.90 - 0.62)/0.62 * 100
```
The R2 with the RF model is ~84 % higher than the chlorophyll regression model.  
The R2 with the RF model is ~45 % higher than the CDOM regression model, and the MAE is 37 % lower.
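
The relative-increase arithmetic from the chunk above can be reproduced in a few lines. The helper name `percent_increase` is my own; the three R2 values are the ones typed into that chunk.

```python
def percent_increase(new, old):
    """Relative change of `new` over `old`, expressed in percent."""
    return abs(new - old) / old * 100


# R2 values as entered in the R chunk above
rf_r2, chl_r2, cdom_r2 = 0.90, 0.49, 0.62

vs_chl = percent_increase(rf_r2, chl_r2)
vs_cdom = percent_increase(rf_r2, cdom_r2)

print(f"RF vs chlorophyll regression: {vs_chl:.1f} % higher R2")  # prints 83.7
print(f"RF vs CDOM regression: {vs_cdom:.1f} % higher R2")        # prints 45.2
```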

Calculate the importance of each OTU as a predictor in the model via permutation:  
```{python}
from sklearn.inspection import permutation_importance

result = permutation_importance(model_PH2O2, features, labels, n_repeats=10,
                                random_state=42)
sorted_idx = result.importances_mean.argsort()
```

List the importances, keeping only the OTUs whose lower 95% confidence bound on the mean importance is above 0.009:  
```{python}
import math
for i in result.importances_mean.argsort()[::-1]:
     if result.importances_mean[i] - ((result.importances_std[i]/math.sqrt(10))*1.96) > 0.009:
         print(f"{feature_list[i]:<8}"
               f": {result.importances_mean[i]:.3f}"
               f" +/- {result.importances_std[i]:.3f}")
```