diff --git a/assets/css/main.css b/assets/css/main.css
index fbbfdd4..ff65bc4 100644
--- a/assets/css/main.css
+++ b/assets/css/main.css
@@ -380,3 +380,30 @@ table.dataTable.display tbody tr.even > .sorting_1 {
table.dataTable.display tbody tr.even {
background-color: green;
} */
+
+
+.infobox {
+ padding: 1em 1em 1em 4em;
+ margin-bottom: 10px;
+ border: 5px solid orange;
+ border-radius: 10px;
+ background: #6f0f0f 5px center/3em no-repeat;
+ color: #ffffff;
+ align-content: center;
+ font-family:'Courier New', Courier, monospace
+}
+
+.today {
+ background-image: url("static/img/vintage_reading_duck.jpg");
+}
+
+.insights {
+ padding: 1em 1em 1em 4em;
+ margin-bottom: 10px;
+ border: 5px solid rgb(27, 186, 27);
+ border-radius: 10px;
+ background: #4e0777 5px center/3em no-repeat;
+ color: #ffffff;
+ align-content: center;
+ font-family:'Courier New', Courier, monospace
+}
\ No newline at end of file
diff --git a/content/posts/2024-10-28-the-taskmaster-s-potty-mouth/index.Rmd b/content/posts/2024-10-28-the-taskmaster-s-potty-mouth/index.Rmd
new file mode 100644
index 0000000..32c2f65
--- /dev/null
+++ b/content/posts/2024-10-28-the-taskmaster-s-potty-mouth/index.Rmd
@@ -0,0 +1,408 @@
+---
+title: The Taskmaster's Potty Mouth
+author: Christopher Nam
+date: '2024-11-09'
+slug: the-taskmaster-s-potty-mouth
+categories: [analysis, profanity, greg davies]
+tags: []
+draft: no
+output:
+ blogdown::html_page:
+ toc: true
+ toc_depth: 1
+---
+
+
+```{r, echo = FALSE}
+blogdown::shortcode("callout", text="Warning This Post Contains Strong Language...Reader Discretion is advised!")
+```
+
+# Your Task
+
+> Find out whether the Taskmaster (Greg Davies) has become more or less foul mouth over time.
+```{r, echo = FALSE, out.width = "25%", fig.align='center', error = FALSE, warning=FALSE}
+knitr::include_graphics(path = "https://static1.colliderimages.com/wordpress/wordpress/wp-content/uploads/2024/03/greg-davies-from-taskmaster.jpg")
+```
+
+This post is an extension of [this profanity based post](/themedianduck/2024/10/profanity-insanity).
+
+```{r , fig.show = "hold", out.width = "25%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path =file.path(here::here(), "img", "gifs", "greg_fuckingbus.gif"), error = FALSE)
+```
+
+```{r preamble, echo = FALSE, warning = FALSE, message = FALSE, error=FALSE, collapse = TRUE, include = FALSE}
+library(here)
+here("static")
+
+preamble_dir <- here("static", "code", "R", "preamble")
+preamble_file <- "post_preamble.R"
+
+source(file.path(preamble_dir, preamble_file))
+source(file.path(preamble_dir, "database_preamble.R"))
+source(file.path(preamble_dir, "graphics_preamble.R"))
+
+```
+
+
+```{sql sql_query, connection = tm_db, output.var = "profanity_enh", echo = FALSE}
+-- Stored as an R dataframe profanity_enh
+
+SELECT
+pf.series,
+pf.episode,
+pf.task,
+pf.speaker as speaker_id,
+pp.name as speaker_name,
+pf.roots,
+pf.quote,
+pf.studio,
+pp.gender,
+pp.hand,
+pp.champion,
+pp.tmi as speaker_tmi,
+sp.name as series_name,
+sp.episodes as num_episodes_in_series,
+sp.champion as series_champion_id,
+sp.special
+FROM profanity pf
+LEFT JOIN people pp
+ ON (pf.speaker = pp.id
+ AND pf.series = pp.series) OR
+(pf.speaker = pp.id)
+LEFT JOIN series sp
+ ON pf.series = sp.id
+
+```
+
+```{r, message = FALSE, tidy = FALSE, echo = FALSE}
+library(reticulate)
+library(dplyr)
+
+series_profanity <- profanity_enh %>%
+ rowwise() %>%
+ mutate(num_profanity = length(reticulate::py_eval(roots))) %>%
+ # To count the number of profanities utter in a quote.
+ group_by(series, series_name, special, speaker_id, speaker_name, speaker_tmi, gender, hand) %>%
+ # Aggregating and summarising data at a series, speaker level.
+ summarise(
+ speaker_episode_count = dplyr::n_distinct(episode),
+ sum_profanity_series = sum(num_profanity),
+ no_episodes_in_series = max(num_episodes_in_series)
+ ) %>%
+ mutate(profanity_per_episode = sum_profanity_series/no_episodes_in_series)
+```
+
+```{r, include = FALSE}
+greg_image_url <- "https://taskmaster.info/images/people/0019_greg_davies_3.png"
+```
+
+# The Profanity Rate Approach
+One way to answer our question is to use Profanity Rate that we previously defined in the aforementioned [post](/themedianduck/2024/10/profanity-insanity).
+
+Recall that this sums up the number of profanity occurrences within a series for a particular person, and divides by the number of episodes in that series. This provides the number of times the person in question will swear in an episode of that series, on average.
+```{r, include = FALSE}
+tm_series <- series_profanity %>%
+ filter(special == 0 & speaker_name == "Greg Davies")
+
+tm_series$image_url <- greg_image_url
+```
+
+```{r tm-pr-basic-plot, fig.cap = "The Taskmaster's Profanity Rateover Time", tidy = FALSE, echo = FALSE}
+ggplot(tm_series, aes(x= series, y=profanity_per_episode)) +
+ geom_rect(aes(xmin = 0, ymin = 4,
+xmax = 7.5, ymax = 7), fill = "#f3b0b0") +
+ geom_rect(aes(xmin = 7.5, ymin = 2.5,
+xmax = 16, ymax = 5), fill = "#b0c6f3") +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ geom_line(linewidth = 1.5) +
+ scale_x_continuous(breaks = seq(0, 20, 1)) +
+ scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
+ geom_image(aes(image = image_url), size= 0.07) +
+ xlab("Series") + ylab("Profanity Rate (Profanity per Episode)") +
+ ggtitle("The Taskmaster's Potty Mouth")
+```
+
+Figure \@ref(fig:tm-pr-basic-plot) shows the Taskmaster's profanity rate over time (captured by series). Profanity rates range between approximately 2.5 (series 8) and 7 (series 7). Visually, there does appear to be a change in profanity rate from series 8 onwards; the profanity rate drops from between 4 to 7 (red area), to 2.5 and 5 (blue area).
+
+
+The uptick in profanity rate in series 16 such that it could plausibly associated and drawn from the Series 1-7 profanity rate range. Could Greg be on the cusp of returning to his old foul mouthed ways?
+
+
+:::{.insights}
+There is some evidence to suggest that the Taskmaster has becomes less potty mouthed over time with a significant drop in profanity instances per episode from series 8 onwards. **The profanity rate drops from between 4 to 7 (Series 1 to 7), to 2.5 and 5 (Series 8 to 16)**.
+
+This could be seen as slightly counter intuitive as:
+
+1. We might assume that Greg, in his old age, has become more frustrated with life and contestants and thus more likely to swear.
+2. We might assume that as the show has progressed and evolved, Greg has played up his angry persona as the Taskmaster and thus more likely to swear.
+:::
+
+However, there are questions surrounding whether this change in profanity is statistically significant. There is also a noticeable overlap between between two coloured areas in which the profanity rate values are common to both; the profanity rates associated with series 9, 10 and 16 could have plausibly been drawn from the "Pre Series 8" swearing regime.
+
+There also potential questions around whether the the profanity rate is the ideal statistic to answer our question since it can be swayed by rogue observations.
+
+## Potential drawbacks to the Profanity Rate.
+The profanity rate, which essentially is an average (mean) summary statistic, is highly influenced by outliers and extreme values. It is not considered a [robust statistic](https://statisticsbyjim.com/basics/robust-statistics/) and is highly sensitive to the data.
+
+In addition, the Profanity Rate, by itself, also does not capture the potential distribution (spread) of profanity utterances sufficiently. That is, if we were to watch many episodes of Taskmaster, what are the range of profanity utterances we can expect to see (or hear), and how much do they vary across episodes.
+
+For this reason, we might want to consider additional statistics to the profanity rate (the mean profanity utterances), which are more robust and highlight the spread of distribution. The [median](https://statisticsbyjim.com/basics/median/), and [percentiles](https://statisticsbyjim.com/basics/percentiles/) in general, are one way to address these two issues.
+
+But before we can calculate these metrics, some additional work is required.
+
+# Case of the Missing Profanities
+One feature of our data source is that if Greg (or any other person for this matter) did not swear at all in an episode, no records will be present in our underlying dataset. 0 profanities should be associated with these episodes, which currently aren't being captured.
+
+If we want to consider something beyond profanity rate (for example the median and percentiles of profanities uttered in an episode across a series), we would need to ensure that these profanity free episodes are captured. Without capturing this profanity free phenomena, our statistics would not be be accurate; here, the median and percentile would be inflated.
+
+The "no records for profanity free" feature is not a flaw of the data source or of its design. However, due to the question we want to answer, it is an important consideration that has to be explicitly accounted for in our methodology.
+
+
+## Why Wasn't This an Issue with Profanity Rate?
+It is also worth remarking that that this "zero profanities" phenomena was not an issue for the profanity rate calculation. Recall the equation for profanity rate was:
+
+\begin{equation}
+\texttt{Profanity Rate for Contestant C in series S} = \frac{\sum{\texttt{Profanity by contestant C in series S}}}{\texttt{Number of episodes in series S}}
+(\#eq:profanityrate)
+\end{equation}
+
+The numerator (the top of the fraction) in Equation \@ref(eq:profanityrate) will be unaffected if zero profanities were observed (whether explicitly captured or not). The denominator (the bottom) in Equation \@ref(eq:profanityrate) shows that we normalise by the **number of episodes in a series**, and _not_ the **number of episodes in the series in which profanity was observed**. It is this normalisation that means that the profanity rate is not affected by this phenomena.
+
+
+```{r, tidy = FALSE, execute = FALSE, echo = FALSE, include = FALSE}
+ggplot(tm_series, aes(x = series, y = no_episodes_in_series)) +
+ geom_col() +
+ geom_line(aes(y = speaker_episode_count)) +
+ geom_point(aes(y = speaker_episode_count)) +
+ geom_image(aes(image = image_url, y = speaker_episode_count), size= 0.07) +
+ xlab("series") + ylab("Number of Episodes") +
+ scale_y_continuous(breaks = seq(0, 10, 1)) +
+ ggtitle("Taskmaster's Profanity Consistency")
+
+```
+
+## Profanity Consistency
+Before we potentially start capturing these "profanity free episodes" explicitly, we should assess whether this is an actual problem first. To do this, we define the Profanity Consistency Rate.
+
+\begin{equation}
+
+\texttt{Profanity Consistency Rate for Series $i$} = \frac{\texttt{Number of Episodes in Series $i$ which featured at least 1 swear}}{\texttt{Number of Episodes in Series $i$}}
+(\#eq:profanityconsis)
+\end{equation}
+
+The Consistency Rate can be thought of as the proportion of episodes in a series in which at least one swear word was uttered by the Taskmaster. A Consistency Rate of 100% means that the Taskmaster swore in all episodes of the series at least once; consistency rate of 50% means the Taskmaster swore in half of the episodes of the series at least once.
+
+
+For the purpose of this post, we will only consider and calculate the profanity consistency rate for the Taskmaster. However, the same logic applies for any other person of interest.
+
+```{r tm-prof-consistency, tidy = FALSE, fig.cap = "Taskmaster's Profanity Consistency", echo = FALSE}
+ggplot(tm_series, aes(x = series, y = speaker_episode_count/no_episodes_in_series)) +
+ geom_line(aes(y = speaker_episode_count/no_episodes_in_series), linewidth = 1.5) +
+ geom_point(aes(y = speaker_episode_count/no_episodes_in_series)) +
+ geom_image(aes(image = image_url, y = speaker_episode_count/no_episodes_in_series), size= 0.07) +
+ scale_x_continuous(breaks = seq(0, 20, 1)) +
+ scale_y_continuous(labels = scales::percent_format(accuraacy = 1), limits = c(0, 1)) +
+ xlab("Series") + ylab("Profanity Consistency Rate") +
+ ggtitle("Taskmaster's Profanity Consistency")
+
+```
+
+
+Figure \@ref(fig:tm-prof-consistency) plots the Taskmaster's Profanity Consistency over time (series). Anything below 100% indicates that the Taskmaster did not swear in all episodes of that particular series.
+
+From this we deduce that Greg was less irate in some episodes of Series 10, 11 and 15 as he did not swear in them.
+
+Consequently, our current dataset does not include these "profanity free episodes".
+
+
+# Putting on a Spread
+These "profanity free episodes" records can be captured through data munging steps, namely `LEFT JOIN` with the `episodes` table (left table), such that if an episode does appear in our enhanced profanity table, we set the profanity utterance to 0.
+
+```{r, echo = FALSE, error = FALSE, warning = FALSE, message= FALSE}
+tm_profanity <- profanity_enh %>%
+ filter(special == 0 & speaker_name == "Greg Davies") %>%
+ rowwise() %>%
+ mutate(num_profanity = length(reticulate::py_eval(roots))) %>%
+ group_by(series, series_name, special, episode, speaker_id, speaker_name, speaker_tmi) %>%
+ summarise(
+ task_count = dplyr::n_distinct(task),
+ sum_profanity = sum(num_profanity)
+ )
+
+# Filter to Greg Davies
+#Filter to non special.
+
+#Aggregate to episode level for profanity per episode
+
+#dbWriteTable(tm_db, "tm_profanity", tm_profanity)
+
+```
+```{sql connection = tm_db, output.var = "fill_tm_profanity", echo = FALSE, , error = FALSE, warning = FALSE, message= FALSE}
+SELECT
+COALESCE(ep.series, tmp.series) as series,
+tmp.series_name,
+tmp.special,
+COALESCE(tmp.episode, ep.id) as show_ep_id,
+ep.episode series_ep_id,
+ep.title as ep_title,
+tmp.speaker_id,
+tmp.speaker_name,
+tmp.speaker_tmi,
+COALESCE(tmp.task_count, 0) as task_count,
+COALESCE(tmp.sum_profanity, 0) as sum_profanity
+FROM
+episodes ep
+LEFT OUTER JOIN tm_profanity tmp
+ON ep.id = tmp.episode
+AND ep.series = tmp.series
+WHERE ep.series >= 1 -- To filter out specials
+```
+
+```{r, echo = FALSE , error = FALSE, warning = FALSE, message= FALSE}
+# Dealing with addition missing data columns through a carry forward strategy
+fill_tm_profanity <- fill_tm_profanity %>% arrange(show_ep_id) %>%
+ tidyr::fill(special, speaker_id, speaker_name, speaker_tmi, .direction = "down") %>%
+ mutate(series_name = dplyr::if_else(is.na(series_name), paste("Series", series, sep = " "), series_name)
+ )
+```
+
+
+```{r, echo = FALSE, error = FALSE, warning = FALSE, message= FALSE}
+summary_tm_profanity <-
+ fill_tm_profanity %>%
+ group_by(series, series_name, special, speaker_id, speaker_name, speaker_tmi) %>%
+ summarise(
+ num_episodes = dplyr::n_distinct(series_ep_id),
+ total_profanity = sum(sum_profanity),
+ avg_profanity = mean(sum_profanity),
+ median_profanity = median(sum_profanity),
+ p10_profanity = quantile(sum_profanity, probs = 0.1),
+ p25_profanity = quantile(sum_profanity,probs = 0.25),
+ p75_profanity = quantile(sum_profanity,probs = 0.75),
+ p90_profanity = quantile(sum_profanity,probs = 0.90)
+ ) %>%
+ ungroup()
+
+summary_tm_profanity$image_url <- greg_image_url
+```
+
+
+```{r tm-prof-boxplot, fig.cap= "Boxplots of Profanity Utterances", tidy = FALSE, echo = FALSE}
+ggplot(fill_tm_profanity, aes(x=as.factor(series), y= sum_profanity)) +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ geom_boxplot(coef = 0) +
+ geom_image(data = summary_tm_profanity, aes(x=series, y = median_profanity, image = image_url), alpha = 0.1) +
+ ylab("Profanity Utterances in an Episode") +
+ xlab("Series") +
+ ggtitle("Boxplots of Profanity Utterances in an Episode by Series") +
+ scale_y_continuous(breaks = seq(0, 15, 2), limits = c(0, 15))
+```
+Figure \@ref(fig:tm-prof-boxplot) shows the boxplot of profanity utterances per episode, for each series. [Boxplots](https://statisticsbyjim.com/graphs/box-plot/) are one way to show the distribution and spread of a quantity that is random, in this case, profanity utterances in an episode.
+
+The thick black line in the center of the box (which the Taskmaster sits upon in this figure), represents the *median profanity*. 50% of profanity utterances in that series will lie above and below it respectively. The bottom and top bottom of the box represent the 25th and 75th percentile respectively (proportion of data lying below these values). *The box will represent where at least 50% of observations will lie between.* For sake of simplicity, I have not included the "whiskers" that are commonly used with boxplot figures; these whiskers show another spread range of observations. Individual observations which lie outside of the range are also displayed.
+
+
+::: {.insights}
+Some observations (but not all):
+
+- The Profanity Boxplots show a similar behaviour and conclusion to that when considering the mean profanity; **Greg has become less foul mouthed post Series 8 onwards. The median, and the main box, is noticeably lower from Series 8 onwards compared to pre Series 7.**
+- The size of the box varies more from Series 1-7 than after Series 8 onwards. This suggests that Greg was more volatile with his profanity usage in early seasons.
+ - However it is worth noting that Series 1 to 5 were shorter in length than Series 6 onwards. From Series 6 onwards, a series contains 10 episodes. Prior to this, a series could be as short as 5 episodes (Series 2 and 3), and long as 8 episodes (Series 4 and 5). Due to the limited number of data points in these earlier, short series, some care needs to be taken from the conclusions we draw from them.
+- We continue to see an overlap in data from the two regimes; the lower proportion of the boxplots in the "High Profanity Regime" (Series 1 to 7) overlaps with the top proportion of the boxplots in the "Low Profanity Regime" (Series 8 onwards)
+- Series 4 has the the smallest sized boxplot. This indicates relatively little spread and deviation in the profanity utterances per episode. Greg is pretty consistent in uttering five profanities per episode in this season.
+- Series 7 has the largest boxplot and thus the greatest spread. Greg isn't as consistent and is more random with is profanity utterances in this series.
+- Series 16 shows an uptick in profanities utterances to a level similar to pre Series 8; the median for series 16 is 5. Could this be the start of a new regime? Data from Series 17 onwards would help support or debunk this hypothesis.
+- Not all series boxplots are symmetrical, for example see Series 8 and 9. This indicates that there is some skewness in the profanity utterance distribution.
+ - Series 8 is negatively skewed; the median is closer to the top of the box, and a greater concentration of observations are in the top of the box.
+ - Conversely, Series 9 is positively skewed, the median closer to the bottom of the box, and a greater concentration of observations are in the bottom of the box.[^1]
+:::
+
+[^1]: Readers may be reassured to know that I still get negative and positive skew mixed up in direction. It might be because I also don't know my left and right instinctively.
+
+# To Mean or Median...
+As we start to conclude this post, we bring it back to two single summary statistics, namely the profanity rate (also known as the mean profanity utterance), and the median profanity utterance. To end, we simply compare the profanity rate and the median profanity to see if there are any substantial differences between the two statistics.
+
+```{r med-mean-comp, fig.cap = "Comparing the Mean (Average) and Median Profanity Utterances per episode" , tidy = FALSE, echo = FALSE,}
+
+colours <- c("Mean" = "darkblue", "Median" = "darkred")
+
+ggplot(summary_tm_profanity, aes(x = series), linewidth = 2, size = 2) +
+ geom_rect(aes(xmin = 1, ymin = 4,
+xmax = 7.5, ymax = 7), fill = "#f3b0b0") +
+ geom_rect(aes(xmin = 7.5, ymin = 2,
+xmax = 16, ymax = 5), fill = "#b0c6f3") +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ annotate(geom = "segment", x = 1, xend = 7, y = 5.5, linetype = 3) +
+ annotate(geom = "segment", x = 8, xend = 16, y= 3.5, linetype = 3) +
+ geom_line(aes(y = avg_profanity, color = "Mean")) +
+ geom_point(aes(y = avg_profanity, color = "Mean")) +
+ geom_line(aes(y = median_profanity, color = "Median" ),linetype = 2) +
+ geom_point(aes(y = median_profanity, color = "Median"), shape = 25) +
+ labs(x = "Series",
+ y = "Profanity Utterance per Episode",
+ title = "Mean and Median Profanity Utterance comparison",
+ caption = "N.B. Mean Profanity Utterance is the same as Profanity Rate",
+ color = "Legend"
+ ) +
+ scale_color_manual(values = colours) +
+ theme(legend.position="top") +
+ scale_y_continuous(breaks = seq(0, 15, 1), limits = c(0, 10)) +
+ scale_x_continuous(breaks = seq(1, 16, 1))
+
+```
+
+:::{ .insights}
+Figure \@ref(fig:med-mean-comp) indicates that there is relatively little difference between the mean and median:
+
+- the mean and median are generally aligned sharing similar, but not identical, values
+ - similar in value suggests that there are no extreme values or outliers which would affect the mean more than the median.
+ - series where the mean and median deviate the most correspond to skewed boxplots (see Series 9 and 10).
+- the two statistics exhibit the same overall trend over Series time is the same; profanity in Series 1-7 was generally at a higher occurrence rate than profanity from Series 8 onwards.
+ - This could have been different if the mean and median were vastly different in value.
+:::
+
+
+With the mean and the median being so similar in value and behaviour from Figure \@ref(fig:med-mean-comp), we may ask ourselves what was the whole point of this exercise if we achieve the same conclusions. Well,
+I would say that we were "lucky" in this scenario and if Taskmaster has taught us anything, it is that there doesn't necessary to be a point for everything.
+
+## The Copout Answer
+Those new to statistics may want to definitively know which statistic to use in life. Unfortunately there is no clear cut answer for this, and it largely depends on the problem and the application. Both statistics have their advantages and disadvantages, and its important to consider what is best for the task in hand.
+
+- The mean is more commonly accepted amongst the general public and can be efficiently computed over time (for example if we were being drip fed observations slowly over time, it easy to calculate the new mean). However, it is very sensitive to outliers and extreme values.
+- The median is less sensitive to outliers and extreme values. However, it can be more computationally intensive to compute over time and with large datasets (reordering the data is necessary to find the "new" midpoint observation)
+
+One common theme that you can expect to see in the field of Statistics ( and life in general) is that there is often no single answer for everything and it is very rare to to have a clear, black-and-white answer and conclusion. Conclusions drawn from data and and statistical methods should also come with an understanding of potential drawbacks and limitations.
+
+I hope the reader is prepared for the "50 Shades of Gray" conclusions we may be getting from The Median Duck project!
+
+
+# What Have We Learnt Today?
+
+::: {.infobox .today data-latex="{today}"}
+```{r, , out.width = "25%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = "https://www.beyondthejoke.co.uk/sites/default/files/styles/large/public/screen_shot_2021-09-09_at_11.17.00.png")
+```
+
+There is evidence to suggest that the **Taskmaster has become less foul mouthed in recent series of the show**.
+
+The **profanity uttered per episode has noticeably decreased:**
+
+- **from 4 to 7 utterances in Series 1 to 7**
+- **to 2 to 5 utterances from Series 8 to 16**.
+
+This can be seen by two different profanity statistics, the profanity rate and median profanity uttered per episode in a series, and a shift in the distribution (boxplots).
+
+Little Alex Horne's wholesome presence must be having an effect on him...
+
+```{r , out.width = "55%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = "https://media.zenfs.com/en/digital_spy_281/f88e9f7ea4c5bd5bd5cfd53aa9f2d541")
+```
+
+The uptick in Series 16's profanity statistics does suggest that the Taskmaster may be returning to his high profanity rate regime.
+
+:::
+
+
+```{r , fig.show = "hold", out.width = "25%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = file.path(here(), "img", "gifs", "greg_horseshit.gif"), error = FALSE)
+```
\ No newline at end of file
diff --git a/content/posts/2024-10-28-the-taskmaster-s-potty-mouth/index.html b/content/posts/2024-10-28-the-taskmaster-s-potty-mouth/index.html
new file mode 100644
index 0000000..dd26906
--- /dev/null
+++ b/content/posts/2024-10-28-the-taskmaster-s-potty-mouth/index.html
@@ -0,0 +1,191 @@
+---
+title: The Taskmaster's Potty Mouth
+author: Christopher Nam
+date: '2024-11-09'
+slug: the-taskmaster-s-potty-mouth
+categories: [analysis, profanity, greg davies]
+tags: []
+draft: no
+output:
+ blogdown::html_page:
+ toc: true
+ toc_depth: 1
+---
+
+
+
One way to answer our question is to use Profanity Rate that we previously defined in the aforementioned post.
+
Recall that this sums up the number of profanity occurrences within a series for a particular person, and divides by the number of episodes in that series. This provides the number of times the person in question will swear in an episode of that series, on average.
+
+
+Figure 1: The Taskmaster’s Profanity Rateover Time
+
+
+
+
Figure 1 shows the Taskmaster’s profanity rate over time (captured by series). Profanity rates range between approximately 2.5 (series 8) and 7 (series 7). Visually, there does appear to be a change in profanity rate from series 8 onwards; the profanity rate drops from between 4 to 7 (red area), to 2.5 and 5 (blue area).
+
The uptick in profanity rate in series 16 such that it could plausibly associated and drawn from the Series 1-7 profanity rate range. Could Greg be on the cusp of returning to his old foul mouthed ways?
+
+
There is some evidence to suggest that the Taskmaster has becomes less potty mouthed over time with a significant drop in profanity instances per episode from series 8 onwards. The profanity rate drops from between 4 to 7 (Series 1 to 7), to 2.5 and 5 (Series 8 to 16).
+
This could be seen as slightly counter intuitive as:
+
+
We might assume that Greg, in his old age, has become more frustrated with life and contestants and thus more likely to swear.
+
We might assume that as the show has progressed and evolved, Greg has played up his angry persona as the Taskmaster and thus more likely to swear.
+
+
+
However, there are questions surrounding whether this change in profanity is statistically significant. There is also a noticeable overlap between between two coloured areas in which the profanity rate values are common to both; the profanity rates associated with series 9, 10 and 16 could have plausibly been drawn from the “Pre Series 8” swearing regime.
+
There also potential questions around whether the the profanity rate is the ideal statistic to answer our question since it can be swayed by rogue observations.
+
+
Potential drawbacks to the Profanity Rate.
+
The profanity rate, which essentially is an average (mean) summary statistic, is highly influenced by outliers and extreme values. It is not considered a robust statistic and is highly sensitive to the data.
+
In addition, the Profanity Rate, by itself, also does not capture the potential distribution (spread) of profanity utterances sufficiently. That is, if we were to watch many episodes of Taskmaster, what are the range of profanity utterances we can expect to see (or hear), and how much do they vary across episodes.
+
For this reason, we might want to consider additional statistics to the profanity rate (the mean profanity utterances), which are more robust and highlight the spread of distribution. The median, and percentiles in general, are one way to address these two issues.
+
But before we can calculate these metrics, some additional work is required.
+
+
+
+
Case of the Missing Profanities
+
One feature of our data source is that if Greg (or any other person for this matter) did not swear at all in an episode, no records will be present in our underlying dataset. 0 profanities should be associated with these episodes, which currently aren’t being captured.
+
If we want to consider something beyond profanity rate (for example the median and percentiles of profanities uttered in an episode across a series), we would need to ensure that these profanity free episodes are captured. Without capturing this profanity free phenomena, our statistics would not be be accurate; here, the median and percentile would be inflated.
+
The “no records for profanity free” feature is not a flaw of the data source or of its design. However, due to the question we want to answer, it is an important consideration that has to be explicitly accounted for in our methodology.
+
+
Why Wasn’t This an Issue with Profanity Rate?
+
It is also worth remarking that that this “zero profanities” phenomena was not an issue for the profanity rate calculation. Recall the equation for profanity rate was:
+
\[\begin{equation}
+\texttt{Profanity Rate for Contestant C in series S} = \frac{\sum{\texttt{Profanity by contestant C in series S}}}{\texttt{Number of episodes in series S}}
+\tag{1}
+\end{equation}\]
+
The numerator (the top of the fraction) in Equation (1) will be unaffected if zero profanities were observed (whether explicitly captured or not). The denominator (the bottom) in Equation (1) shows that we normalise by the number of episodes in a series, and not the number of episodes in the series in which profanity was observed. It is this normalisation that means that the profanity rate is not affected by this phenomena.
+
+
+
Profanity Consistency
+
Before we potentially start capturing these “profanity free episodes” explicitly, we should assess whether this is an actual problem first. To do this, we define the Profanity Consistency Rate.
+
\[\begin{equation}
+
+\texttt{Profanity Consistency Rate for Series $i$} = \frac{\texttt{Number of Episodes in Series $i$ which featured at least 1 swear}}{\texttt{Number of Episodes in Series $i$}}
+\tag{2}
+\end{equation}\]
+
The Consistency Rate can be thought of as the proportion of episodes in a series in which at least one swear word was uttered by the Taskmaster. A Consistency Rate of 100% means that the Taskmaster swore in all episodes of the series at least once; consistency rate of 50% means the Taskmaster swore in half of the episodes of the series at least once.
+
For the purpose of this post, we will only consider and calculate the profanity consistency rate for the Taskmaster. However, the same logic applies for any other person of interest.
+
+
+Figure 2: Taskmaster’s Profanity Consistency
+
+
+
+
Figure 2 plots the Taskmaster’s Profanity Consistency over time (series). Anything below 100% indicates that the Taskmaster did not swear in all episodes of that particular series.
+
From this we deduce that Greg was less irate in some episodes of Series 10, 11 and 15 as he did not swear in them.
+
Consequently, our current dataset does not include these “profanity free episodes”.
+
+
+
+
Putting on a Spread
+
These “profanity free episodes” records can be captured through data munging steps, namely LEFT JOIN with the episodes table (left table), such that if an episode does appear in our enhanced profanity table, we set the profanity utterance to 0.
+
+
+Figure 3: Boxplots of Profanity Utterances
+
+
+
+
Figure 3 shows the boxplot of profanity utterances per episode, for each series. Boxplots are one way to show the distribution and spread of a quantity that is random, in this case, profanity utterances in an episode.
+
The thick black line in the center of the box (which the Taskmaster sits upon in this figure), represents the median profanity. 50% of profanity utterances in that series will lie above and below it respectively. The bottom and top bottom of the box represent the 25th and 75th percentile respectively (proportion of data lying below these values). The box will represent where at least 50% of observations will lie between. For sake of simplicity, I have not included the “whiskers” that are commonly used with boxplot figures; these whiskers show another spread range of observations. Individual observations which lie outside of the range are also displayed.
+
+
Some observations (but not all):
+
+
The Profanity Boxplots show a similar behaviour and conclusion to that when considering the mean profanity; Greg has become less foul mouthed post Series 8 onwards. The median, and the main box, is noticeably lower from Series 8 onwards compared to pre Series 7.
+
The size of the box varies more from Series 1-7 than after Series 8 onwards. This suggests that Greg was more volatile with his profanity usage in early seasons.
+
+
However it is worth noting that Series 1 to 5 were shorter in length than Series 6 onwards. From Series 6 onwards, a series contains 10 episodes. Prior to this, a series could be as short as 5 episodes (Series 2 and 3), and long as 8 episodes (Series 4 and 5). Due to the limited number of data points in these earlier, short series, some care needs to be taken from the conclusions we draw from them.
+
+
We continue to see an overlap in data from the two regimes; the lower proportion of the boxplots in the “High Profanity Regime” (Series 1 to 7) overlaps with the top proportion of the boxplots in the “Low Profanity Regime” (Series 8 onwards)
+
Series 4 has the the smallest sized boxplot. This indicates relatively little spread and deviation in the profanity utterances per episode. Greg is pretty consistent in uttering five profanities per episode in this season.
+
Series 7 has the largest boxplot and thus the greatest spread. Greg isn’t as consistent and is more random with is profanity utterances in this series.
+
Series 16 shows an uptick in profanities utterances to a level similar to pre Series 8; the median for series 16 is 5. Could this be the start of a new regime? Data from Series 17 onwards would help support or debunk this hypothesis.
+
Not all series boxplots are symmetrical, for example see Series 8 and 9. This indicates that there is some skewness in the profanity utterance distribution.
+
+
Series 8 is negatively skewed; the median is closer to the top of the box, and a greater concentration of observations are in the top of the box.
+
Conversely, Series 9 is positively skewed, the median closer to the bottom of the box, and a greater concentration of observations are in the bottom of the box.1
+
+
+
+
+
+
To Mean or Median…
+
As we start to conclude this post, we bring it back to two single summary statistics, namely the profanity rate (also known as the mean profanity utterance), and the median profanity utterance. To end, we simply compare the profanity rate and the median profanity to see if there are any substantial differences between the two statistics.
+
+
+Figure 4: Comparing the Mean (Average) and Median Profanity Utterances per episode
+
+
+
+
+
Figure 4 indicates that there is relatively little difference between the mean and median:
+
+
the mean and median are generally aligned sharing similar, but not identical, values
+
+
similar in value suggests that there are no extreme values or outliers which would affect the mean more than the median.
+
series where the mean and median deviate the most correspond to skewed boxplots (see Series 9 and 10).
+
+
the two statistics exhibit the same overall trend over Series time is the same; profanity in Series 1-7 was generally at a higher occurrence rate than profanity from Series 8 onwards.
+
+
This could have been different if the mean and median were vastly different in value.
+
+
+
+
With the mean and the median being so similar in value and behaviour from Figure 4, we may ask ourselves what was the whole point of this exercise if we achieve the same conclusions. Well,
+I would say that we were “lucky” in this scenario and if Taskmaster has taught us anything, it is that there doesn’t necessary to be a point for everything.
+
+
The Copout Answer
+
Those new to statistics may want to definitively know which statistic to use in life. Unfortunately there is no clear cut answer for this, and it largely depends on the problem and the application. Both statistics have their advantages and disadvantages, and its important to consider what is best for the task in hand.
+
+
The mean is more commonly accepted amongst the general public and can be efficiently computed over time (for example if we were being drip fed observations slowly over time, it easy to calculate the new mean). However, it is very sensitive to outliers and extreme values.
+
The median is less sensitive to outliers and extreme values. However, it can be more computationally intensive to compute over time and with large datasets (reordering the data is necessary to find the “new” midpoint observation)
+
+
One common theme that you can expect to see in the field of Statistics ( and life in general) is that there is often no single answer for everything and it is very rare to to have a clear, black-and-white answer and conclusion. Conclusions drawn from data and and statistical methods should also come with an understanding of potential drawbacks and limitations.
+
I hope the reader is prepared for the “50 Shades of Gray” conclusions we may be getting from The Median Duck project!
+
+
+
+
What Have We Learnt Today?
+
+
+
There is evidence to suggest that the Taskmaster has become less foul mouthed in recent series of the show.
+
The profanity uttered per episode has noticeably decreased:
+
+
from 4 to 7 utterances in Series 1 to 7
+
to 2 to 5 utterances from Series 8 to 16.
+
+
This can be seen by two different profanity statistics, the profanity rate and median profanity uttered per episode in a series, and a shift in the distribution (boxplots).
+
Little Alex Horne’s wholesome presence must be having an effect on him…
+
+
The uptick in Series 16’s profanity statistics does suggest that the Taskmaster may be returning to his high profanity rate regime.
+
+
+
+
+
+
+
Readers may be reassured to know that I still get negative and positive skew mixed up in direction. It might be because I also don’t know my left and right instinctively.↩︎
href="https://github.com/athul/archie">Archie Theme | Built with Hugo
+
+
+
+
+
+
+
+
+
+
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index.Rmd b/public/2024/10/the-taskmaster-s-potty-mouth/index.Rmd
new file mode 100644
index 0000000..2b44f1b
--- /dev/null
+++ b/public/2024/10/the-taskmaster-s-potty-mouth/index.Rmd
@@ -0,0 +1,407 @@
+---
+title: The Taskmaster's Potty Mouth
+author: Christopher Nam
+date: '2024-11-09'
+slug: the-taskmaster-s-potty-mouth
+categories: [analysis, profanity, greg davies]
+tags: []
+draft: no
+output:
+ blogdown::html_page:
+ toc: true
+ toc_depth: 1
+---
+
+
+```{r, echo = FALSE}
+blogdown::shortcode("callout", text="Warning This Post Contains Strong Language...Reader Discretion is advised!")
+```
+
+# Your Task
+
+> Find out whether the Taskmaster (Greg Davies) has become more or less foul mouth over time.
+```{r, echo = FALSE, out.width = "55%", fig.align='center', error = FALSE, warning=FALSE}
+knitr::include_graphics(path = "https://static1.colliderimages.com/wordpress/wordpress/wp-content/uploads/2024/03/greg-davies-from-taskmaster.jpg")
+```
+
+This post is an extension of [this profanity based post](/themedianduck/2024/10/profanity-insanity).
+
+```{r , fig.show = "hold", out.width = "15%", fig.align='default', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path =file.path(here::here(), "img", "gifs", "greg_fuckingbus.gif"), error = FALSE)
+```
+
+```{r preamble, echo = FALSE, warning = FALSE, message = FALSE, error=FALSE, collapse = TRUE, include = FALSE}
+library(here)
+here("static")
+
+preamble_dir <- here("static", "code", "R", "preamble")
+preamble_file <- "post_preamble.R"
+
+source(file.path(preamble_dir, preamble_file))
+source(file.path(preamble_dir, "database_preamble.R"))
+source(file.path(preamble_dir, "graphics_preamble.R"))
+
+```
+
+
+```{sql sql_query, connection = tm_db, output.var = "profanity_enh", echo = FALSE}
+-- Stored as an R dataframe profanity_enh
+
+SELECT
+pf.series,
+pf.episode,
+pf.task,
+pf.speaker as speaker_id,
+pp.name as speaker_name,
+pf.roots,
+pf.quote,
+pf.studio,
+pp.gender,
+pp.hand,
+pp.champion,
+pp.tmi as speaker_tmi,
+sp.name as series_name,
+sp.episodes as num_episodes_in_series,
+sp.champion as series_champion_id,
+sp.special
+FROM profanity pf
+LEFT JOIN people pp
+ ON (pf.speaker = pp.id
+ AND pf.series = pp.series) OR
+(pf.speaker = pp.id)
+LEFT JOIN series sp
+ ON pf.series = sp.id
+
+```
+
+```{r, message = FALSE, tidy = FALSE, echo = FALSE}
+library(reticulate)
+library(dplyr)
+
+series_profanity <- profanity_enh %>%
+ rowwise() %>%
+ mutate(num_profanity = length(reticulate::py_eval(roots))) %>%
+ # To count the number of profanities utter in a quote.
+ group_by(series, series_name, special, speaker_id, speaker_name, speaker_tmi, gender, hand) %>%
+ # Aggregating and summarising data at a series, speaker level.
+ summarise(
+ speaker_episode_count = dplyr::n_distinct(episode),
+ sum_profanity_series = sum(num_profanity),
+ no_episodes_in_series = max(num_episodes_in_series)
+ ) %>%
+ mutate(profanity_per_episode = sum_profanity_series/no_episodes_in_series)
+```
+
+```{r, include = FALSE}
+greg_image_url <- "https://taskmaster.info/images/people/0019_greg_davies_3.png"
+```
+
+# The Profanity Rate Approach
+One way to answer our question is to use Profanity Rate that we previously defined in the aforementioned [post](/themedianduck/2024/10/profanity-insanity).
+
+Recall that this sums up the number of profanity occurrences within a series for a particular person, and divides by the number of episodes in that season. This provides the number of times the person in question will swear in an episode of that series, on average.
+```{r, include = FALSE}
+tm_series <- series_profanity %>%
+ filter(special == 0 & speaker_name == "Greg Davies")
+
+tm_series$image_url <- greg_image_url
+```
+
+```{r tm-pr-basic-plot, fig.cap = "The Taskmaster's Profanity Rateover Time", tidy = FALSE, echo = FALSE}
+ggplot(tm_series, aes(x= series, y=profanity_per_episode)) +
+ geom_rect(aes(xmin = 0, ymin = 4,
+xmax = 7.5, ymax = 7), fill = "#f3b0b0") +
+ geom_rect(aes(xmin = 7.5, ymin = 2.5,
+xmax = 16, ymax = 5), fill = "#b0c6f3") +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ geom_line(linewidth = 1.5) +
+ scale_x_continuous(breaks = seq(0, 20, 1)) +
+ scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
+ geom_image(aes(image = image_url), size= 0.07) +
+ xlab("Series") + ylab("Profanity Rate (Profanity per Episode)") +
+ ggtitle("The Taskmaster's Potty Mouth")
+```
+
+Figure \@ref(fig:tm-pr-basic-plot) shows the Taskmaster's profanity rate over time (captured by series). Profanity rates range between approximately 2.5 (series 8) and 7 (series 7). Visually, there does appear to be a change in profanity rate from series 8 onwards; the profanity rate drops from between 4 to 7 (red area), to 2.5 and 5 (blue area).
+
+
+The uptick in profanity rate in series 16 such that it could plausibly associated and drawn from the Series 1-7 profanity rate range. Could Greg be on the cusp of returning to his old foul mouthed ways?
+
+
+:::{.insights}
+There is some evidence to suggest that the Taskmaster has becomes less potty mouthed over time with a significant drop in profanity instances per episode from series 8 onwards. This could be seen as slightly counter intuitive as:
+
+1. We might assume that Greg, in his old age, has become more frustrated with life and contestants and thus more likely to swear.
+2. We might assume that as the show has progressed and evolved, Greg has played up his angry persona as the Taskmaster and thus more likely to swear.
+:::
+
+However, there are questions surrounding whether this change in profanity is statistically significant. There is a noticeable overlap between between two coloured areas in which the profanity rate values are common to both. The profanity rates associated with series 9, 10 and 16 could have been drawn from the "Pre Series 8" swearing regime.
+
+There also potential questions around whether the the profanity rate is the ideal metric to answer our question since it can be swayed by rogue observations.
+
+## Potential drawbacks to the Profanity Rate.
+The profanity rate, which essentially is an average (mean) summary statistic, is highly influenced by outliers and extreme values. It is not considered a [robust statistic](https://statisticsbyjim.com/basics/robust-statistics/) and is highly sensitive to the data.
+
+In addition, the Profanity Rate, by itself, also does not capture the potential distribution (spread) of profanity utterances sufficiently. That is, what are the potential values of number of profanities uttered were we to watch a series of episodes.
+
+For this reason, we might want to consider additional statistics to the profanity rate (mean profanity utterances), which are more robust and highlight the spread of distribution. The [median](https://statisticsbyjim.com/basics/median/), and [percentiles](https://statisticsbyjim.com/basics/percentiles/) in general, are one way to address these two issues.
+
+But before we can calculate these metrics, some additional work is required.
+
+# Case of the Missing Profanities
+One feature of our data source is that if Greg (or any other person for this matter) did not swear at all in an episode, no records will be present in our underlying dataset. 0 profanities should be associated with these episodes, which currently aren't being captured.
+
+If we want to consider something beyond profanity rate (for example the median and percentiles of profanities uttered in an episode across a series), we would need to ensure that these profanity free episodes are captured. Without capturing this profanity free phenomena, our statistics would not be be accurate; here, the median and percentile would be inflated.
+
+The "no records for profanity free" feature is not a flaw of the data source or of its design. However, due to the question we want to answer, it is an important consideration that has to be explicitly accounted for in our methodology.
+
+
+## Why Wasn't This an Issue with Profanity Rate?
+It is also worth remarking that that this "zero profanities" phenomena was not an issue for the profanity rate calculation. Recall the equation for profanity rate was:
+
+\begin{equation}
+\texttt{Profanity Rate for Contestant C in series S} = \frac{\sum{\texttt{Profanity by contestant C in series S}}}{\texttt{Number of episodes in series S}}
+(\#eq:profanityrate)
+\end{equation}
+
+The numerator (the top of the fraction) in Equation \@ref(eq:profanityrate) will be unaggected if zero profanities were observed (whether explicitly captured or not). The denominator (the bottom) in Equation \@ref(eq:profanityrate) shows that we normalise by the **number of episodes in a series**, and _not_ the **number of episodes in the series in which profanity was observed**. It is this normalisation that means that the profanity rate is not affected by this phenomena.
+
+
+```{r, tidy = FALSE, execute = FALSE, echo = FALSE, include = FALSE}
+ggplot(tm_series, aes(x = series, y = no_episodes_in_series)) +
+ geom_col() +
+ geom_line(aes(y = speaker_episode_count)) +
+ geom_point(aes(y = speaker_episode_count)) +
+ geom_image(aes(image = image_url, y = speaker_episode_count), size= 0.07) +
+ xlab("series") + ylab("Number of Episodes") +
+ scale_y_continuous(breaks = seq(0, 10, 1)) +
+ ggtitle("Taskmaster's Profanity Consistency")
+
+```
+
+## Profanity Consistency
+Before we potentially start capturing these "profanity free episodes" explicitly, we should assess whether this is an actual problem first. To do this, we define the Profanity Consistency Rate.
+
+\begin{equation}
+
+\texttt{Profanity Consistency Rate for Series $i$} = \frac{\texttt{Number of Episodes in Series $i$ which featured at least 1 swear}}{\texttt{Number of Episodes in Series $i$}}
+(\#eq:profanityconsis)
+\end{equation}
+
+The Consistency Rate can be thought of as the proportion of episodes in a series in which at least one swear word was uttered by the Taskmaster. A Consistency Rate of 100% means that the Taskmaster swore in all episodes of the series at least once; consistency rate of 50% means the Taskmaster swore in half of the episodes of the series at least once.
+
+
+For the purpose of this post, we will only consider and calculate the profanity consistency rate for the Taskmaster. However, the same logic applies for any other person of interest.
+
+```{r tm-prof-consistency, tidy = FALSE, fig.cap = "Taskmaster's Profanity Consistency", echo = FALSE}
+ggplot(tm_series, aes(x = series, y = speaker_episode_count/no_episodes_in_series)) +
+ geom_line(aes(y = speaker_episode_count/no_episodes_in_series), linewidth = 1.5) +
+ geom_point(aes(y = speaker_episode_count/no_episodes_in_series)) +
+ geom_image(aes(image = image_url, y = speaker_episode_count/no_episodes_in_series), size= 0.07) +
+ scale_x_continuous(breaks = seq(0, 20, 1)) +
+ scale_y_continuous(labels = scales::percent_format(accuraacy = 1), limits = c(0, 1)) +
+ xlab("Series") + ylab("Profanity Consistency Rate") +
+ ggtitle("Taskmaster's Profanity Consistency")
+
+```
+
+
+Figure \@ref(fig:tm-prof-consistency) plots the Taskmaster's Profanity Consistency over time (series). Anything below 100% indicates that the Taskmaster did not swear in all episodes of that particular series.
+
+From this we deduce that Greg was less irate in some episodes of Series 10, 11 and 15 as he did not swear in them.
+
+Consequently, our current dataset does not include these "profanity free episodes".
+
+
+# Putting on a Spread
+These "profanity free episodes" records can be captured through data munging steps, namely `LEFT JOIN` with the `episodes` table (left table), such that if an episode does appear in our enhanced profanity table, we set the profanity to 0.
+
+```{r, echo = FALSE, error = FALSE, warning = FALSE, message= FALSE}
+tm_profanity <- profanity_enh %>%
+ filter(special == 0 & speaker_name == "Greg Davies") %>%
+ rowwise() %>%
+ mutate(num_profanity = length(reticulate::py_eval(roots))) %>%
+ group_by(series, series_name, special, episode, speaker_id, speaker_name, speaker_tmi) %>%
+ summarise(
+ task_count = dplyr::n_distinct(task),
+ sum_profanity = sum(num_profanity)
+ )
+
+# Filter to Greg Davies
+#Filter to non special.
+
+#Aggregate to episode level for profanity per episode
+
+#dbWriteTable(tm_db, "tm_profanity", tm_profanity)
+
+```
+```{sql connection = tm_db, output.var = "fill_tm_profanity", echo = FALSE, , error = FALSE, warning = FALSE, message= FALSE}
+SELECT
+COALESCE(ep.series, tmp.series) as series,
+tmp.series_name,
+tmp.special,
+COALESCE(tmp.episode, ep.id) as show_ep_id,
+ep.episode series_ep_id,
+ep.title as ep_title,
+tmp.speaker_id,
+tmp.speaker_name,
+tmp.speaker_tmi,
+COALESCE(tmp.task_count, 0) as task_count,
+COALESCE(tmp.sum_profanity, 0) as sum_profanity
+FROM
+episodes ep
+LEFT OUTER JOIN tm_profanity tmp
+ON ep.id = tmp.episode
+AND ep.series = tmp.series
+WHERE ep.series >= 1 -- To filter out specials
+```
+
+```{r, echo = FALSE , error = FALSE, warning = FALSE, message= FALSE}
+# Dealing with addition missing data columns through a carry forward strategy
+fill_tm_profanity <- fill_tm_profanity %>% arrange(show_ep_id) %>%
+ tidyr::fill(special, speaker_id, speaker_name, speaker_tmi, .direction = "down") %>%
+ mutate(series_name = dplyr::if_else(is.na(series_name), paste("Series", series, sep = " "), series_name)
+ )
+```
+
+
+```{r, echo = FALSE, error = FALSE, warning = FALSE, message= FALSE}
+summary_tm_profanity <-
+ fill_tm_profanity %>%
+ group_by(series, series_name, special, speaker_id, speaker_name, speaker_tmi) %>%
+ summarise(
+ num_episodes = dplyr::n_distinct(series_ep_id),
+ total_profanity = sum(sum_profanity),
+ avg_profanity = mean(sum_profanity),
+ median_profanity = median(sum_profanity),
+ p10_profanity = quantile(sum_profanity, probs = 0.1),
+ p25_profanity = quantile(sum_profanity,probs = 0.25),
+ p75_profanity = quantile(sum_profanity,probs = 0.75),
+ p90_profanity = quantile(sum_profanity,probs = 0.90)
+ ) %>%
+ ungroup()
+
+summary_tm_profanity$image_url <- greg_image_url
+```
+
+
+```{r tm-prof-boxplot, fig.cap= "Boxplots of Profanity Utterances", tidy = FALSE, echo = FALSE}
+ggplot(fill_tm_profanity, aes(x=as.factor(series), y= sum_profanity)) +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ geom_boxplot(coef = 0) +
+ geom_image(data = summary_tm_profanity, aes(x=series, y = median_profanity, image = image_url), alpha = 0.1) +
+ ylab("Profanity Utterances in an Episode") +
+ xlab("Series") +
+ ggtitle("Boxplots of Profanity Utterances in an Episode by Series") +
+ scale_y_continuous(breaks = seq(0, 15, 2), limits = c(0, 15))
+```
+Figure \@ref(fig:tm-prof-boxplot) shows the boxplot of profanity utterances per episode, for each series. [Boxplots](https://statisticsbyjim.com/graphs/box-plot/) are one way to show the distribution and spread of a quantity that is random, in this case, profanity utterances in an episode.
+
+The thick black line in the center of the box (which the Taskmaster sits upon in this figure), represents the *median profanity*. 50% of profanity utterances in that series will lie above and below it respectively. The bottom and top bottom of the box represent the 25th and 75th percentile respectively (proportion of data lying below these values). *The box will represent where at least 50% of observations will lie between.* For sake of simplicity, I have not included the "whiskers" in this particular boxplot figure; these whiskers show another spread range of observations. Individual observations which lie outside of the range are also displayed.
+
+
+::: {.insights}
+Some observations (but not all):
+
+- The Profanity Boxplots show a similar behaviour and conclusion from analysing just the profanity rate; Greg has become less foul mouthed post Series 8 onwards. **The median, and the main box, is noticeably lower from Series 8 onwards compared to pre Series 7.**
+- The size of the box varies more from Series 1-7 than after Series 8 onwards. This suggests that Greg was more volatile with his profanity usage in early seasons.
+- As some of the boxes overlap between the two regimes; We continue to see an overlap in data from the two regimes; the lower proportion of the boxplots in the "High Profanity Regime" (Series 1 to 7) overlaps with the top proportion of the boxplots in the "Low Profanity Regime" (Series 8 onwards)
+- Series 4 has the the smallest sized boxplot. This indicates relatively little spread and deviation in the profanity utterances per episode. Greg is pretty consistent in uttering five profanities per episode in this season.
+- Series 7 has the largest boxplot and thus the greatest spread. Greg isn't as consistent and is more random with is profanity utterances in this series.
+- Series 16 shows an uptick in profanities utterances to a level similar to pre Series 8; the median for series 16 is 5. Could this be the start of a new regime? Data from Series 17 onwards would help support or debunk this hypothesis.
+- Not all series boxplots are symmetrical; see Series 8 and 9. This indicates that there is some skewness in the profanity utterance distribution.
+ - Series 8 is negatively skewed; the median is closer to the top of the box, and a greater concentration of observations are in the top of the box.
+ - Conversely, Series 9 is positively sked, the median closer to the bottom of the box, and a greater concentration of observations are in the bottom of the box.[^1]
+
+- It is also worth noting that number of episodes varies per series. From Series 6 onwards, a series contains 10 episodes. Prior to this, a series could be as short as 5 episodes (Series 2 and 3), and long as 8 episodes (Series 4 and 5). Due to the limited number of data points in these earlier, short series, some care needs to be taken from the conclusions we draw from them.
+:::
+
+[^1]: Readers may be reassured to know that I still get the negative and positive skew mixed up. It might be because I also don't know my left and right instinctively.
+
+# To Mean or Median...
+As we start to conclude this post, we bring it back to two single summary statistics, namely the profanity rate (also known as the mean), and the median. To end, we simply compare the profanity rate and the median profanity to see if there are any substantial differences between the two statistics.
+
+Figure \@ref(fig:med-mean-comp) indicates that there is relatively little difference between the mean and median:
+
+- the mean and median are generally aligned sharing similar but not identical values
+ - similar in value suggests that there are no extreme values or outliers which would affect the mean more than the median.
+ - Series where the mean and median deviate the most are where skewness in observed the most in the boxplots (see Series 9 and 10).
+- the two statistics exhibit the same overall trend over Series time is the same; Profanity in Series 1-7 was generally at a higher occurrence rate than Proganity from Series 8 onwards.
+ - This could have been different if the mean and median were vastly different in value.
+
+
+```{r med-mean-comp, fig.cap = "Comparing the Mean (Average) and Median Profanity Utterances per episode" , tidy = FALSE, echo = FALSE,}
+
+colours <- c("Mean" = "darkblue", "Median" = "darkred")
+
+ggplot(summary_tm_profanity, aes(x = series), linewidth = 2, size = 2) +
+ geom_rect(aes(xmin = 1, ymin = 4,
+xmax = 7.5, ymax = 7), fill = "#f3b0b0") +
+ geom_rect(aes(xmin = 7.5, ymin = 2,
+xmax = 16, ymax = 5), fill = "#b0c6f3") +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ annotate(geom = "segment", x = 1, xend = 7, y = 5.5, linetype = 3) +
+ annotate(geom = "segment", x = 8, xend = 16, y= 3.5, linetype = 3) +
+ #geom_segment(aes(x = 1, xend = 7, y = 5.5, colour = "#f3b0b0"), alpha = 0.9, linetype = 4) +
+ #geom_segment(aes(x = 8, xend = 16, y = 3.5, colour = "#b0c6f3"), alpha = 0.9, linetype = 4) +
+ geom_line(aes(y = avg_profanity, color = "Mean")) +
+ geom_point(aes(y = avg_profanity, color = "Mean")) +
+ geom_line(aes(y = median_profanity, color = "Median" ),linetype = 2) +
+ geom_point(aes(y = median_profanity, color = "Median"), shape = 25) +
+ labs(x = "Series",
+ y = "Profanity Utterance per Episode",
+ title = "Mean and Median Profanity Utterance comparison",
+ caption = "N.B. Mean Profanity Utterance is the same as Profanity Rate",
+ color = "Legend"
+ ) +
+ scale_color_manual(values = colours) +
+ theme(legend.position="top") +
+ scale_y_continuous(breaks = seq(0, 15, 1), limits = c(0, 10)) +
+ scale_x_continuous(breaks = seq(1, 16, 1))
+
+```
+
+With the mean and the median being so similar in value and behaviour from Figure \@ref(fig:med-mean-comp), we may ask ourselves what was the whole point of this exercise if we achieve the same conclusions. Well,
+I would say that we were "lucky" in this scenario. If Taskmaster has taught us anything, it is that there doesn't necessary to be a point for everything.
+
+## The Copout Answer
+Those new to statistics may want to definitively know which statistic to use in life. Unfortunately there is no clear cut answer for this, and it largely depends on the problem and the application. Both statistics have their advantages and disadvantages.
+
+- The mean is more commonly accepted amongst the general public and can be efficiently computed over time (for example if we were being drip fed observations slowly over time, it easy to calculate the new mean). However, it is very sensitive to outlier and extreme values
+- The median is less sensitive to outliers and extreme values. However, it can be more computationally intensive to compute over time and with large datasets (reordering the data is necessary to find the "new" midpoint observation)
+
+One common theme that you can expect to see in the field of Statistics ( and life in general) is that there is often no single answer for everything and it is very rare to to have a clear, black-and-white answer and conclusion. Conclusions drawn from data and and statistical methods should also come up an understanding of potential drawbacks and limitations.
+
+I hope the reader is prepared for the "50 Shades of Gray" conclusions we may be getting from The Median Duck project!
+
+
+# What Have We Learnt Today?
+
+::: {.infobox .today data-latex="{today}"}
+```{r, , out.width = "25%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = "https://www.beyondthejoke.co.uk/sites/default/files/styles/large/public/screen_shot_2021-09-09_at_11.17.00.png")
+```
+
+There is evidence to suggest that the **Taskmaster has become less foul mouthed in recent series of the show**.
+
+The **profanity uttered per episode has noticeably decreased:**
+
+- **from 4 to 7 utterances in Series 1 to 7**
+- **to 2 to 5 utterances from Series 8 to 16**.
+
+This can be seen by two different profanity statistics, the profanity rate and median profanity uttered per episode in a series, and a shift in the distribution (boxplots).
+
+Little Alex Horne's wholesome presence must be having an effect on him...
+
+```{r , out.width = "55%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = "https://media.zenfs.com/en/digital_spy_281/f88e9f7ea4c5bd5bd5cfd53aa9f2d541")
+```
+
+The uptick in Series 16's profanity statistics does suggest that the Taskmaster may be returning to his high profanity rate regime.
+
+:::
+
+
+```{r , fig.show = "hold", out.width = "15%", fig.align='default', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = file.path(here(), "img", "gifs", "greg_horseshit.gif"), error = FALSE)
+```
\ No newline at end of file
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index.html b/public/2024/10/the-taskmaster-s-potty-mouth/index.html
new file mode 100644
index 0000000..fef1be5
--- /dev/null
+++ b/public/2024/10/the-taskmaster-s-potty-mouth/index.html
@@ -0,0 +1,305 @@
+
+
+
+ The Taskmaster's Potty Mouth - The Median Duck
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
One way to answer our question is to use Profanity Rate that we previously defined in the aforementioned post.
+
Recall that this sums up the number of profanity occurrences within a series for a particular person, and divides by the number of episodes in that season. This provides the number of times the person in question will swear in an episode of that series, on average.
+
+
+Figure 1: The Taskmaster’s Profanity Rateover Time
+
+
+
+
Figure 1 shows the Taskmaster’s profanity rate over time (captured by series). Profanity rates range between approximately 2.5 (series 8) and 7 (series 7). Visually, there does appear to be a change in profanity rate from series 8 onwards; the profanity rate drops from between 4 to 7 (red area), to 2.5 and 5 (blue area).
+
The uptick in profanity rate in series 16 such that it could plausibly associated and drawn from the Series 1-7 profanity rate range. Could Greg be on the cusp of returning to his old foul mouthed ways?
+
+
There is some evidence to suggest that the Taskmaster has becomes less potty mouthed over time with a significant drop in profanity instances per episode from series 8 onwards. This could be seen as slightly counter intuitive as:
+
+
We might assume that Greg, in his old age, has become more frustrated with life and contestants and thus more likely to swear.
+
We might assume that as the show has progressed and evolved, Greg has played up his angry persona as the Taskmaster and thus more likely to swear.
+
+
+
However, there are questions surrounding whether this change in profanity is statistically significant. There is a noticeable overlap between between two coloured areas in which the profanity rate values are common to both. The profanity rates associated with series 9, 10 and 16 could have been drawn from the “Pre Series 8” swearing regime.
+
There also potential questions around whether the the profanity rate is the ideal metric to answer our question since it can be swayed by rogue observations.
+
+
Potential drawbacks to the Profanity Rate.
+
The profanity rate, which essentially is an average (mean) summary statistic, is highly influenced by outliers and extreme values. It is not considered a robust statistic and is highly sensitive to the data.
+
In addition, the Profanity Rate, by itself, also does not capture the potential distribution (spread) of profanity utterances sufficiently. That is, what are the potential values of number of profanities uttered were we to watch a series of episodes.
+
For this reason, we might want to consider additional statistics to the profanity rate (mean profanity utterances), which are more robust and highlight the spread of distribution. The median, and percentiles in general, are one way to address these two issues.
+
But before we can calculate these metrics, some additional work is required.
+
+
+
+
Case of the Missing Profanities
+
One feature of our data source is that if Greg (or any other person for this matter) did not swear at all in an episode, no records will be present in our underlying dataset. 0 profanities should be associated with these episodes, which currently aren’t being captured.
+
If we want to consider something beyond profanity rate (for example the median and percentiles of profanities uttered in an episode across a series), we would need to ensure that these profanity free episodes are captured. Without capturing this profanity free phenomena, our statistics would not be be accurate; here, the median and percentile would be inflated.
+
The “no records for profanity free” feature is not a flaw of the data source or of its design. However, due to the question we want to answer, it is an important consideration that has to be explicitly accounted for in our methodology.
+
+
Why Wasn’t This an Issue with Profanity Rate?
+
It is also worth remarking that that this “zero profanities” phenomena was not an issue for the profanity rate calculation. Recall the equation for profanity rate was:
+
\[\begin{equation}
+\texttt{Profanity Rate for Contestant C in series S} = \frac{\sum{\texttt{Profanity by contestant C in series S}}}{\texttt{Number of episodes in series S}}
+\tag{1}
+\end{equation}\]
+
The numerator (the top of the fraction) in Equation (1) will be unaggected if zero profanities were observed (whether explicitly captured or not). The denominator (the bottom) in Equation (1) shows that we normalise by the number of episodes in a series, and not the number of episodes in the series in which profanity was observed. It is this normalisation that means that the profanity rate is not affected by this phenomena.
+
+
+
Profanity Consistency
+
Before we potentially start capturing these “profanity free episodes” explicitly, we should assess whether this is an actual problem first. To do this, we define the Profanity Consistency Rate.
+
\[\begin{equation}
+
+\texttt{Profanity Consistency Rate for Series $i$} = \frac{\texttt{Number of Episodes in Series $i$ which featured at least 1 swear}}{\texttt{Number of Episodes in Series $i$}}
+\tag{2}
+\end{equation}\]
+
The Consistency Rate can be thought of as the proportion of episodes in a series in which at least one swear word was uttered by the Taskmaster. A Consistency Rate of 100% means that the Taskmaster swore in all episodes of the series at least once; consistency rate of 50% means the Taskmaster swore in half of the episodes of the series at least once.
+
For the purpose of this post, we will only consider and calculate the profanity consistency rate for the Taskmaster. However, the same logic applies for any other person of interest.
+
+
+Figure 2: Taskmaster’s Profanity Consistency
+
+
+
+
Figure 2 plots the Taskmaster’s Profanity Consistency over time (series). Anything below 100% indicates that the Taskmaster did not swear in all episodes of that particular series.
+
From this we deduce that Greg was less irate in some episodes of Series 10, 11 and 15 as he did not swear in them.
+
Consequently, our current dataset does not include these “profanity free episodes”.
+
+
+
+
Putting on a Spread
+
These “profanity free episodes” records can be captured through data munging steps, namely LEFT JOIN with the episodes table (left table), such that if an episode does appear in our enhanced profanity table, we set the profanity to 0.
+
+
+Figure 3: Boxplots of Profanity Utterances
+
+
+
+
Figure 3 shows the boxplot of profanity utterances per episode, for each series. Boxplots are one way to show the distribution and spread of a quantity that is random, in this case, profanity utterances in an episode.
+
The thick black line in the center of the box (which the Taskmaster sits upon in this figure), represents the median profanity. 50% of profanity utterances in that series will lie above and below it respectively. The bottom and top bottom of the box represent the 25th and 75th percentile respectively (proportion of data lying below these values). The box will represent where at least 50% of observations will lie between. For sake of simplicity, I have not included the “whiskers” in this particular boxplot figure; these whiskers show another spread range of observations. Individual observations which lie outside of the range are also displayed.
+
+
Some observations (but not all):
+
+
The Profanity Boxplots show a similar behaviour and conclusion from analysing just the profanity rate; Greg has become less foul mouthed post Series 8 onwards. The median, and the main box, is noticeably lower from Series 8 onwards compared to pre Series 7.
+
The size of the box varies more from Series 1-7 than after Series 8 onwards. This suggests that Greg was more volatile with his profanity usage in early seasons.
+
As some of the boxes overlap between the two regimes; We continue to see an overlap in data from the two regimes; the lower proportion of the boxplots in the “High Profanity Regime” (Series 1 to 7) overlaps with the top proportion of the boxplots in the “Low Profanity Regime” (Series 8 onwards)
+
Series 4 has the the smallest sized boxplot. This indicates relatively little spread and deviation in the profanity utterances per episode. Greg is pretty consistent in uttering five profanities per episode in this season.
+
Series 7 has the largest boxplot and thus the greatest spread. Greg isn’t as consistent and is more random with is profanity utterances in this series.
+
Series 16 shows an uptick in profanities utterances to a level similar to pre Series 8; the median for series 16 is 5. Could this be the start of a new regime? Data from Series 17 onwards would help support or debunk this hypothesis.
+
Not all series boxplots are symmetrical; see Series 8 and 9. This indicates that there is some skewness in the profanity utterance distribution.
+
+
Series 8 is negatively skewed; the median is closer to the top of the box, and a greater concentration of observations are in the top of the box.
+
Conversely, Series 9 is positively sked, the median closer to the bottom of the box, and a greater concentration of observations are in the bottom of the box.1
+
+
It is also worth noting that number of episodes varies per series. From Series 6 onwards, a series contains 10 episodes. Prior to this, a series could be as short as 5 episodes (Series 2 and 3), and long as 8 episodes (Series 4 and 5). Due to the limited number of data points in these earlier, short series, some care needs to be taken from the conclusions we draw from them.
+
+
+
+
+
To Mean or Median…
+
As we start to conclude this post, we bring it back to two single summary statistics, namely the profanity rate (also known as the mean), and the median. To end, we simply compare the profanity rate and the median profanity to see if there are any substantial differences between the two statistics.
+
Figure 4 indicates that there is relatively little difference between the mean and median:
+
+
the mean and median are generally aligned sharing similar but not identical values
+
+
similar in value suggests that there are no extreme values or outliers which would affect the mean more than the median.
+
Series where the mean and median deviate the most are where skewness in observed the most in the boxplots (see Series 9 and 10).
+
+
the two statistics exhibit the same overall trend over Series time is the same; Profanity in Series 1-7 was generally at a higher occurrence rate than Proganity from Series 8 onwards.
+
+
This could have been different if the mean and median were vastly different in value.
+
+
+
+
+Figure 4: Comparing the Mean (Average) and Median Profanity Utterances per episode
+
+
+
+
With the mean and the median being so similar in value and behaviour from Figure 4, we may ask ourselves what was the whole point of this exercise if we achieve the same conclusions. Well,
+I would say that we were “lucky” in this scenario. If Taskmaster has taught us anything, it is that there doesn’t necessary to be a point for everything.
+
+
The Copout Answer
+
Those new to statistics may want to definitively know which statistic to use in life. Unfortunately there is no clear cut answer for this, and it largely depends on the problem and the application. Both statistics have their advantages and disadvantages.
+
+
The mean is more commonly accepted amongst the general public and can be efficiently computed over time (for example if we were being drip fed observations slowly over time, it easy to calculate the new mean). However, it is very sensitive to outlier and extreme values
+
The median is less sensitive to outliers and extreme values. However, it can be more computationally intensive to compute over time and with large datasets (reordering the data is necessary to find the “new” midpoint observation)
+
+
One common theme that you can expect to see in the field of Statistics ( and life in general) is that there is often no single answer for everything and it is very rare to to have a clear, black-and-white answer and conclusion. Conclusions drawn from data and and statistical methods should also come up an understanding of potential drawbacks and limitations.
+
I hope the reader is prepared for the “50 Shades of Gray” conclusions we may be getting from The Median Duck project!
+
+
+
+
What Have We Learnt Today?
+
+
+
There is evidence to suggest that the Taskmaster has become less foul mouthed in recent series of the show.
+
The profanity uttered per episode has noticeably decreased:
+
+
from 4 to 7 utterances in Series 1 to 7
+
to 2 to 5 utterances from Series 8 to 16.
+
+
This can be seen by two different profanity statistics, the profanity rate and median profanity uttered per episode in a series, and a shift in the distribution (boxplots).
+
Little Alex Horne’s wholesome presence must be having an effect on him…
+
+
The uptick in Series 16’s profanity statistics does suggest that the Taskmaster may be returning to his high profanity rate regime.
+
+
+
+
+
+
+
Readers may be reassured to know that I still get the negative and positive skew mixed up. It might be because I also don’t know my left and right instinctively.↩︎
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/allpr-plot-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/allpr-plot-1.png
new file mode 100644
index 0000000..8717296
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/allpr-plot-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/foul-plot-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/foul-plot-1.png
new file mode 100644
index 0000000..90ecde8
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/foul-plot-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/med-mean-comp-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/med-mean-comp-1.png
new file mode 100644
index 0000000..3b472b1
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/med-mean-comp-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-pr-basic-plot-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-pr-basic-plot-1.png
new file mode 100644
index 0000000..136066f
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-pr-basic-plot-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-prof-boxplot-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-prof-boxplot-1.png
new file mode 100644
index 0000000..ad4ba68
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-prof-boxplot-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-prof-consistency-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-prof-consistency-1.png
new file mode 100644
index 0000000..6fc74e2
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm-prof-consistency-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_pr_basic-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_pr_basic-1.png
new file mode 100644
index 0000000..8d1a6b9
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_pr_basic-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_pr_basic-plot-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_pr_basic-plot-1.png
new file mode 100644
index 0000000..5427777
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_pr_basic-plot-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_prof_boxplot-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_prof_boxplot-1.png
new file mode 100644
index 0000000..1f8f151
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_prof_boxplot-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_prof_consistency-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_prof_consistency-1.png
new file mode 100644
index 0000000..5f54466
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/tm_prof_consistency-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-10-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-10-1.png
new file mode 100644
index 0000000..f25bcab
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-10-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-11-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-11-1.png
new file mode 100644
index 0000000..b188eba
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-11-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-12-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-12-1.png
new file mode 100644
index 0000000..b188eba
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-12-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-13-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-13-1.png
new file mode 100644
index 0000000..9ce15cd
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-13-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-5-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-5-1.png
new file mode 100644
index 0000000..ee5ae9a
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-5-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-6-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-6-1.png
new file mode 100644
index 0000000..9a82801
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-6-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-7-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-7-1.png
new file mode 100644
index 0000000..21e226a
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-7-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-8-1.png b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-8-1.png
new file mode 100644
index 0000000..80353e0
Binary files /dev/null and b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/figure-html/unnamed-chunk-8-1.png differ
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/kePrint/kePrint.js b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/kePrint/kePrint.js
new file mode 100644
index 0000000..e6fbbfc
--- /dev/null
+++ b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/kePrint/kePrint.js
@@ -0,0 +1,8 @@
+$(document).ready(function(){
+ if (typeof $('[data-toggle="tooltip"]').tooltip === 'function') {
+ $('[data-toggle="tooltip"]').tooltip();
+ }
+ if ($('[data-toggle="popover"]').popover === 'function') {
+ $('[data-toggle="popover"]').popover();
+ }
+});
diff --git a/public/2024/10/the-taskmaster-s-potty-mouth/index_files/lightable/lightable.css b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/lightable/lightable.css
new file mode 100644
index 0000000..3be3be9
--- /dev/null
+++ b/public/2024/10/the-taskmaster-s-potty-mouth/index_files/lightable/lightable.css
@@ -0,0 +1,272 @@
+/*!
+ * lightable v0.0.1
+ * Copyright 2020 Hao Zhu
+ * Licensed under MIT (https://github.com/haozhu233/kableExtra/blob/master/LICENSE)
+ */
+
+.lightable-minimal {
+ border-collapse: separate;
+ border-spacing: 16px 1px;
+ width: 100%;
+ margin-bottom: 10px;
+}
+
+.lightable-minimal td {
+ margin-left: 5px;
+ margin-right: 5px;
+}
+
+.lightable-minimal th {
+ margin-left: 5px;
+ margin-right: 5px;
+}
+
+.lightable-minimal thead tr:last-child th {
+ border-bottom: 2px solid #00000050;
+ empty-cells: hide;
+
+}
+
+.lightable-minimal tbody tr:first-child td {
+ padding-top: 0.5em;
+}
+
+.lightable-minimal.lightable-hover tbody tr:hover {
+ background-color: #f5f5f5;
+}
+
+.lightable-minimal.lightable-striped tbody tr:nth-child(even) {
+ background-color: #f5f5f5;
+}
+
+.lightable-classic {
+ border-top: 0.16em solid #111111;
+ border-bottom: 0.16em solid #111111;
+ width: 100%;
+ margin-bottom: 10px;
+ margin: 10px 5px;
+}
+
+.lightable-classic tfoot tr td {
+ border: 0;
+}
+
+.lightable-classic tfoot tr:first-child td {
+ border-top: 0.14em solid #111111;
+}
+
+.lightable-classic caption {
+ color: #222222;
+}
+
+.lightable-classic td {
+ padding-left: 5px;
+ padding-right: 5px;
+ color: #222222;
+}
+
+.lightable-classic th {
+ padding-left: 5px;
+ padding-right: 5px;
+ font-weight: normal;
+ color: #222222;
+}
+
+.lightable-classic thead tr:last-child th {
+ border-bottom: 0.10em solid #111111;
+}
+
+.lightable-classic.lightable-hover tbody tr:hover {
+ background-color: #F9EEC1;
+}
+
+.lightable-classic.lightable-striped tbody tr:nth-child(even) {
+ background-color: #f5f5f5;
+}
+
+.lightable-classic-2 {
+ border-top: 3px double #111111;
+ border-bottom: 3px double #111111;
+ width: 100%;
+ margin-bottom: 10px;
+}
+
+.lightable-classic-2 tfoot tr td {
+ border: 0;
+}
+
+.lightable-classic-2 tfoot tr:first-child td {
+ border-top: 3px double #111111;
+}
+
+.lightable-classic-2 caption {
+ color: #222222;
+}
+
+.lightable-classic-2 td {
+ padding-left: 5px;
+ padding-right: 5px;
+ color: #222222;
+}
+
+.lightable-classic-2 th {
+ padding-left: 5px;
+ padding-right: 5px;
+ font-weight: normal;
+ color: #222222;
+}
+
+.lightable-classic-2 tbody tr:last-child td {
+ border-bottom: 3px double #111111;
+}
+
+.lightable-classic-2 thead tr:last-child th {
+ border-bottom: 1px solid #111111;
+}
+
+.lightable-classic-2.lightable-hover tbody tr:hover {
+ background-color: #F9EEC1;
+}
+
+.lightable-classic-2.lightable-striped tbody tr:nth-child(even) {
+ background-color: #f5f5f5;
+}
+
+.lightable-material {
+ min-width: 100%;
+ white-space: nowrap;
+ table-layout: fixed;
+ font-family: Roboto, sans-serif;
+ border: 1px solid #EEE;
+ border-collapse: collapse;
+ margin-bottom: 10px;
+}
+
+.lightable-material tfoot tr td {
+ border: 0;
+}
+
+.lightable-material tfoot tr:first-child td {
+ border-top: 1px solid #EEE;
+}
+
+.lightable-material th {
+ height: 56px;
+ padding-left: 16px;
+ padding-right: 16px;
+}
+
+.lightable-material td {
+ height: 52px;
+ padding-left: 16px;
+ padding-right: 16px;
+ border-top: 1px solid #eeeeee;
+}
+
+.lightable-material.lightable-hover tbody tr:hover {
+ background-color: #f5f5f5;
+}
+
+.lightable-material.lightable-striped tbody tr:nth-child(even) {
+ background-color: #f5f5f5;
+}
+
+.lightable-material.lightable-striped tbody td {
+ border: 0;
+}
+
+.lightable-material.lightable-striped thead tr:last-child th {
+ border-bottom: 1px solid #ddd;
+}
+
+.lightable-material-dark {
+ min-width: 100%;
+ white-space: nowrap;
+ table-layout: fixed;
+ font-family: Roboto, sans-serif;
+ border: 1px solid #FFFFFF12;
+ border-collapse: collapse;
+ margin-bottom: 10px;
+ background-color: #363640;
+}
+
+.lightable-material-dark tfoot tr td {
+ border: 0;
+}
+
+.lightable-material-dark tfoot tr:first-child td {
+ border-top: 1px solid #FFFFFF12;
+}
+
+.lightable-material-dark th {
+ height: 56px;
+ padding-left: 16px;
+ padding-right: 16px;
+ color: #FFFFFF60;
+}
+
+.lightable-material-dark td {
+ height: 52px;
+ padding-left: 16px;
+ padding-right: 16px;
+ color: #FFFFFF;
+ border-top: 1px solid #FFFFFF12;
+}
+
+.lightable-material-dark.lightable-hover tbody tr:hover {
+ background-color: #FFFFFF12;
+}
+
+.lightable-material-dark.lightable-striped tbody tr:nth-child(even) {
+ background-color: #FFFFFF12;
+}
+
+.lightable-material-dark.lightable-striped tbody td {
+ border: 0;
+}
+
+.lightable-material-dark.lightable-striped thead tr:last-child th {
+ border-bottom: 1px solid #FFFFFF12;
+}
+
+.lightable-paper {
+ width: 100%;
+ margin-bottom: 10px;
+ color: #444;
+}
+
+.lightable-paper tfoot tr td {
+ border: 0;
+}
+
+.lightable-paper tfoot tr:first-child td {
+ border-top: 1px solid #00000020;
+}
+
+.lightable-paper thead tr:last-child th {
+ color: #666;
+ vertical-align: bottom;
+ border-bottom: 1px solid #00000020;
+ line-height: 1.15em;
+ padding: 10px 5px;
+}
+
+.lightable-paper td {
+ vertical-align: middle;
+ border-bottom: 1px solid #00000010;
+ line-height: 1.15em;
+ padding: 7px 5px;
+}
+
+.lightable-paper.lightable-hover tbody tr:hover {
+ background-color: #F9EEC1;
+}
+
+.lightable-paper.lightable-striped tbody tr:nth-child(even) {
+ background-color: #00000008;
+}
+
+.lightable-paper.lightable-striped tbody td {
+ border: 0;
+}
+
diff --git a/public/2024/11/the-taskmaster-s-potty-mouth/index.Rmd b/public/2024/11/the-taskmaster-s-potty-mouth/index.Rmd
new file mode 100644
index 0000000..32c2f65
--- /dev/null
+++ b/public/2024/11/the-taskmaster-s-potty-mouth/index.Rmd
@@ -0,0 +1,408 @@
+---
+title: The Taskmaster's Potty Mouth
+author: Christopher Nam
+date: '2024-11-09'
+slug: the-taskmaster-s-potty-mouth
+categories: [analysis, profanity, greg davies]
+tags: []
+draft: no
+output:
+ blogdown::html_page:
+ toc: true
+ toc_depth: 1
+---
+
+
+```{r, echo = FALSE}
+blogdown::shortcode("callout", text="Warning This Post Contains Strong Language...Reader Discretion is advised!")
+```
+
+# Your Task
+
+> Find out whether the Taskmaster (Greg Davies) has become more or less foul mouth over time.
+```{r, echo = FALSE, out.width = "25%", fig.align='center', error = FALSE, warning=FALSE}
+knitr::include_graphics(path = "https://static1.colliderimages.com/wordpress/wordpress/wp-content/uploads/2024/03/greg-davies-from-taskmaster.jpg")
+```
+
+This post is an extension of [this profanity based post](/themedianduck/2024/10/profanity-insanity).
+
+```{r , fig.show = "hold", out.width = "25%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path =file.path(here::here(), "img", "gifs", "greg_fuckingbus.gif"), error = FALSE)
+```
+
+```{r preamble, echo = FALSE, warning = FALSE, message = FALSE, error=FALSE, collapse = TRUE, include = FALSE}
+library(here)
+here("static")
+
+preamble_dir <- here("static", "code", "R", "preamble")
+preamble_file <- "post_preamble.R"
+
+source(file.path(preamble_dir, preamble_file))
+source(file.path(preamble_dir, "database_preamble.R"))
+source(file.path(preamble_dir, "graphics_preamble.R"))
+
+```
+
+
+```{sql sql_query, connection = tm_db, output.var = "profanity_enh", echo = FALSE}
+-- Stored as an R dataframe profanity_enh
+
+SELECT
+pf.series,
+pf.episode,
+pf.task,
+pf.speaker as speaker_id,
+pp.name as speaker_name,
+pf.roots,
+pf.quote,
+pf.studio,
+pp.gender,
+pp.hand,
+pp.champion,
+pp.tmi as speaker_tmi,
+sp.name as series_name,
+sp.episodes as num_episodes_in_series,
+sp.champion as series_champion_id,
+sp.special
+FROM profanity pf
+LEFT JOIN people pp
+ ON (pf.speaker = pp.id
+ AND pf.series = pp.series) OR
+(pf.speaker = pp.id)
+LEFT JOIN series sp
+ ON pf.series = sp.id
+
+```
+
+```{r, message = FALSE, tidy = FALSE, echo = FALSE}
+library(reticulate)
+library(dplyr)
+
+series_profanity <- profanity_enh %>%
+ rowwise() %>%
+ mutate(num_profanity = length(reticulate::py_eval(roots))) %>%
+ # To count the number of profanities utter in a quote.
+ group_by(series, series_name, special, speaker_id, speaker_name, speaker_tmi, gender, hand) %>%
+ # Aggregating and summarising data at a series, speaker level.
+ summarise(
+ speaker_episode_count = dplyr::n_distinct(episode),
+ sum_profanity_series = sum(num_profanity),
+ no_episodes_in_series = max(num_episodes_in_series)
+ ) %>%
+ mutate(profanity_per_episode = sum_profanity_series/no_episodes_in_series)
+```
+
+```{r, include = FALSE}
+greg_image_url <- "https://taskmaster.info/images/people/0019_greg_davies_3.png"
+```
+
+# The Profanity Rate Approach
+One way to answer our question is to use Profanity Rate that we previously defined in the aforementioned [post](/themedianduck/2024/10/profanity-insanity).
+
+Recall that this sums up the number of profanity occurrences within a series for a particular person, and divides by the number of episodes in that series. This provides the number of times the person in question will swear in an episode of that series, on average.
+```{r, include = FALSE}
+tm_series <- series_profanity %>%
+ filter(special == 0 & speaker_name == "Greg Davies")
+
+tm_series$image_url <- greg_image_url
+```
+
+```{r tm-pr-basic-plot, fig.cap = "The Taskmaster's Profanity Rateover Time", tidy = FALSE, echo = FALSE}
+ggplot(tm_series, aes(x= series, y=profanity_per_episode)) +
+ geom_rect(aes(xmin = 0, ymin = 4,
+xmax = 7.5, ymax = 7), fill = "#f3b0b0") +
+ geom_rect(aes(xmin = 7.5, ymin = 2.5,
+xmax = 16, ymax = 5), fill = "#b0c6f3") +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ geom_line(linewidth = 1.5) +
+ scale_x_continuous(breaks = seq(0, 20, 1)) +
+ scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
+ geom_image(aes(image = image_url), size= 0.07) +
+ xlab("Series") + ylab("Profanity Rate (Profanity per Episode)") +
+ ggtitle("The Taskmaster's Potty Mouth")
+```
+
+Figure \@ref(fig:tm-pr-basic-plot) shows the Taskmaster's profanity rate over time (captured by series). Profanity rates range between approximately 2.5 (series 8) and 7 (series 7). Visually, there does appear to be a change in profanity rate from series 8 onwards; the profanity rate drops from between 4 to 7 (red area), to 2.5 and 5 (blue area).
+
+
+The uptick in profanity rate in series 16 such that it could plausibly associated and drawn from the Series 1-7 profanity rate range. Could Greg be on the cusp of returning to his old foul mouthed ways?
+
+
+:::{.insights}
+There is some evidence to suggest that the Taskmaster has becomes less potty mouthed over time with a significant drop in profanity instances per episode from series 8 onwards. **The profanity rate drops from between 4 to 7 (Series 1 to 7), to 2.5 and 5 (Series 8 to 16)**.
+
+This could be seen as slightly counter intuitive as:
+
+1. We might assume that Greg, in his old age, has become more frustrated with life and contestants and thus more likely to swear.
+2. We might assume that as the show has progressed and evolved, Greg has played up his angry persona as the Taskmaster and thus more likely to swear.
+:::
+
+However, there are questions surrounding whether this change in profanity is statistically significant. There is also a noticeable overlap between between two coloured areas in which the profanity rate values are common to both; the profanity rates associated with series 9, 10 and 16 could have plausibly been drawn from the "Pre Series 8" swearing regime.
+
+There also potential questions around whether the the profanity rate is the ideal statistic to answer our question since it can be swayed by rogue observations.
+
+## Potential drawbacks to the Profanity Rate.
+The profanity rate, which essentially is an average (mean) summary statistic, is highly influenced by outliers and extreme values. It is not considered a [robust statistic](https://statisticsbyjim.com/basics/robust-statistics/) and is highly sensitive to the data.
+
+In addition, the Profanity Rate, by itself, also does not capture the potential distribution (spread) of profanity utterances sufficiently. That is, if we were to watch many episodes of Taskmaster, what are the range of profanity utterances we can expect to see (or hear), and how much do they vary across episodes.
+
+For this reason, we might want to consider additional statistics to the profanity rate (the mean profanity utterances), which are more robust and highlight the spread of distribution. The [median](https://statisticsbyjim.com/basics/median/), and [percentiles](https://statisticsbyjim.com/basics/percentiles/) in general, are one way to address these two issues.
+
+But before we can calculate these metrics, some additional work is required.
+
+# Case of the Missing Profanities
+One feature of our data source is that if Greg (or any other person for this matter) did not swear at all in an episode, no records will be present in our underlying dataset. 0 profanities should be associated with these episodes, which currently aren't being captured.
+
+If we want to consider something beyond profanity rate (for example the median and percentiles of profanities uttered in an episode across a series), we would need to ensure that these profanity free episodes are captured. Without capturing this profanity free phenomena, our statistics would not be be accurate; here, the median and percentile would be inflated.
+
+The "no records for profanity free" feature is not a flaw of the data source or of its design. However, due to the question we want to answer, it is an important consideration that has to be explicitly accounted for in our methodology.
+
+
+## Why Wasn't This an Issue with Profanity Rate?
+It is also worth remarking that that this "zero profanities" phenomena was not an issue for the profanity rate calculation. Recall the equation for profanity rate was:
+
+\begin{equation}
+\texttt{Profanity Rate for Contestant C in series S} = \frac{\sum{\texttt{Profanity by contestant C in series S}}}{\texttt{Number of episodes in series S}}
+(\#eq:profanityrate)
+\end{equation}
+
+The numerator (the top of the fraction) in Equation \@ref(eq:profanityrate) will be unaffected if zero profanities were observed (whether explicitly captured or not). The denominator (the bottom) in Equation \@ref(eq:profanityrate) shows that we normalise by the **number of episodes in a series**, and _not_ the **number of episodes in the series in which profanity was observed**. It is this normalisation that means that the profanity rate is not affected by this phenomena.
+
+
+```{r, tidy = FALSE, execute = FALSE, echo = FALSE, include = FALSE}
+ggplot(tm_series, aes(x = series, y = no_episodes_in_series)) +
+ geom_col() +
+ geom_line(aes(y = speaker_episode_count)) +
+ geom_point(aes(y = speaker_episode_count)) +
+ geom_image(aes(image = image_url, y = speaker_episode_count), size= 0.07) +
+ xlab("series") + ylab("Number of Episodes") +
+ scale_y_continuous(breaks = seq(0, 10, 1)) +
+ ggtitle("Taskmaster's Profanity Consistency")
+
+```
+
+## Profanity Consistency
+Before we potentially start capturing these "profanity free episodes" explicitly, we should assess whether this is an actual problem first. To do this, we define the Profanity Consistency Rate.
+
+\begin{equation}
+
+\texttt{Profanity Consistency Rate for Series $i$} = \frac{\texttt{Number of Episodes in Series $i$ which featured at least 1 swear}}{\texttt{Number of Episodes in Series $i$}}
+(\#eq:profanityconsis)
+\end{equation}
+
+The Consistency Rate can be thought of as the proportion of episodes in a series in which at least one swear word was uttered by the Taskmaster. A Consistency Rate of 100% means that the Taskmaster swore in all episodes of the series at least once; consistency rate of 50% means the Taskmaster swore in half of the episodes of the series at least once.
+
+
+For the purpose of this post, we will only consider and calculate the profanity consistency rate for the Taskmaster. However, the same logic applies for any other person of interest.
+
+```{r tm-prof-consistency, tidy = FALSE, fig.cap = "Taskmaster's Profanity Consistency", echo = FALSE}
+ggplot(tm_series, aes(x = series, y = speaker_episode_count/no_episodes_in_series)) +
+ geom_line(aes(y = speaker_episode_count/no_episodes_in_series), linewidth = 1.5) +
+ geom_point(aes(y = speaker_episode_count/no_episodes_in_series)) +
+ geom_image(aes(image = image_url, y = speaker_episode_count/no_episodes_in_series), size= 0.07) +
+ scale_x_continuous(breaks = seq(0, 20, 1)) +
+ scale_y_continuous(labels = scales::percent_format(accuraacy = 1), limits = c(0, 1)) +
+ xlab("Series") + ylab("Profanity Consistency Rate") +
+ ggtitle("Taskmaster's Profanity Consistency")
+
+```
+
+
+Figure \@ref(fig:tm-prof-consistency) plots the Taskmaster's Profanity Consistency over time (series). Anything below 100% indicates that the Taskmaster did not swear in all episodes of that particular series.
+
+From this we deduce that Greg was less irate in some episodes of Series 10, 11 and 15 as he did not swear in them.
+
+Consequently, our current dataset does not include these "profanity free episodes".
+
+
+# Putting on a Spread
+These "profanity free episodes" records can be captured through data munging steps, namely `LEFT JOIN` with the `episodes` table (left table), such that if an episode does appear in our enhanced profanity table, we set the profanity utterance to 0.
+
+```{r, echo = FALSE, error = FALSE, warning = FALSE, message= FALSE}
+tm_profanity <- profanity_enh %>%
+ filter(special == 0 & speaker_name == "Greg Davies") %>%
+ rowwise() %>%
+ mutate(num_profanity = length(reticulate::py_eval(roots))) %>%
+ group_by(series, series_name, special, episode, speaker_id, speaker_name, speaker_tmi) %>%
+ summarise(
+ task_count = dplyr::n_distinct(task),
+ sum_profanity = sum(num_profanity)
+ )
+
+# Filter to Greg Davies
+#Filter to non special.
+
+#Aggregate to episode level for profanity per episode
+
+#dbWriteTable(tm_db, "tm_profanity", tm_profanity)
+
+```
+```{sql connection = tm_db, output.var = "fill_tm_profanity", echo = FALSE, , error = FALSE, warning = FALSE, message= FALSE}
+SELECT
+COALESCE(ep.series, tmp.series) as series,
+tmp.series_name,
+tmp.special,
+COALESCE(tmp.episode, ep.id) as show_ep_id,
+ep.episode series_ep_id,
+ep.title as ep_title,
+tmp.speaker_id,
+tmp.speaker_name,
+tmp.speaker_tmi,
+COALESCE(tmp.task_count, 0) as task_count,
+COALESCE(tmp.sum_profanity, 0) as sum_profanity
+FROM
+episodes ep
+LEFT OUTER JOIN tm_profanity tmp
+ON ep.id = tmp.episode
+AND ep.series = tmp.series
+WHERE ep.series >= 1 -- To filter out specials
+```
+
+```{r, echo = FALSE , error = FALSE, warning = FALSE, message= FALSE}
+# Dealing with addition missing data columns through a carry forward strategy
+fill_tm_profanity <- fill_tm_profanity %>% arrange(show_ep_id) %>%
+ tidyr::fill(special, speaker_id, speaker_name, speaker_tmi, .direction = "down") %>%
+ mutate(series_name = dplyr::if_else(is.na(series_name), paste("Series", series, sep = " "), series_name)
+ )
+```
+
+
+```{r, echo = FALSE, error = FALSE, warning = FALSE, message= FALSE}
+summary_tm_profanity <-
+ fill_tm_profanity %>%
+ group_by(series, series_name, special, speaker_id, speaker_name, speaker_tmi) %>%
+ summarise(
+ num_episodes = dplyr::n_distinct(series_ep_id),
+ total_profanity = sum(sum_profanity),
+ avg_profanity = mean(sum_profanity),
+ median_profanity = median(sum_profanity),
+ p10_profanity = quantile(sum_profanity, probs = 0.1),
+ p25_profanity = quantile(sum_profanity,probs = 0.25),
+ p75_profanity = quantile(sum_profanity,probs = 0.75),
+ p90_profanity = quantile(sum_profanity,probs = 0.90)
+ ) %>%
+ ungroup()
+
+summary_tm_profanity$image_url <- greg_image_url
+```
+
+
+```{r tm-prof-boxplot, fig.cap= "Boxplots of Profanity Utterances", tidy = FALSE, echo = FALSE}
+ggplot(fill_tm_profanity, aes(x=as.factor(series), y= sum_profanity)) +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ geom_boxplot(coef = 0) +
+ geom_image(data = summary_tm_profanity, aes(x=series, y = median_profanity, image = image_url), alpha = 0.1) +
+ ylab("Profanity Utterances in an Episode") +
+ xlab("Series") +
+ ggtitle("Boxplots of Profanity Utterances in an Episode by Series") +
+ scale_y_continuous(breaks = seq(0, 15, 2), limits = c(0, 15))
+```
+Figure \@ref(fig:tm-prof-boxplot) shows the boxplot of profanity utterances per episode, for each series. [Boxplots](https://statisticsbyjim.com/graphs/box-plot/) are one way to show the distribution and spread of a quantity that is random, in this case, profanity utterances in an episode.
+
+The thick black line in the center of the box (which the Taskmaster sits upon in this figure), represents the *median profanity*. 50% of profanity utterances in that series will lie above and below it respectively. The bottom and top bottom of the box represent the 25th and 75th percentile respectively (proportion of data lying below these values). *The box will represent where at least 50% of observations will lie between.* For sake of simplicity, I have not included the "whiskers" that are commonly used with boxplot figures; these whiskers show another spread range of observations. Individual observations which lie outside of the range are also displayed.
+
+
+::: {.insights}
+Some observations (but not all):
+
+- The Profanity Boxplots show a similar behaviour and conclusion to that when considering the mean profanity; **Greg has become less foul mouthed post Series 8 onwards. The median, and the main box, is noticeably lower from Series 8 onwards compared to pre Series 7.**
+- The size of the box varies more from Series 1-7 than after Series 8 onwards. This suggests that Greg was more volatile with his profanity usage in early seasons.
+ - However it is worth noting that Series 1 to 5 were shorter in length than Series 6 onwards. From Series 6 onwards, a series contains 10 episodes. Prior to this, a series could be as short as 5 episodes (Series 2 and 3), and long as 8 episodes (Series 4 and 5). Due to the limited number of data points in these earlier, short series, some care needs to be taken from the conclusions we draw from them.
+- We continue to see an overlap in data from the two regimes; the lower proportion of the boxplots in the "High Profanity Regime" (Series 1 to 7) overlaps with the top proportion of the boxplots in the "Low Profanity Regime" (Series 8 onwards)
+- Series 4 has the the smallest sized boxplot. This indicates relatively little spread and deviation in the profanity utterances per episode. Greg is pretty consistent in uttering five profanities per episode in this season.
+- Series 7 has the largest boxplot and thus the greatest spread. Greg isn't as consistent and is more random with is profanity utterances in this series.
+- Series 16 shows an uptick in profanities utterances to a level similar to pre Series 8; the median for series 16 is 5. Could this be the start of a new regime? Data from Series 17 onwards would help support or debunk this hypothesis.
+- Not all series boxplots are symmetrical, for example see Series 8 and 9. This indicates that there is some skewness in the profanity utterance distribution.
+ - Series 8 is negatively skewed; the median is closer to the top of the box, and a greater concentration of observations are in the top of the box.
+ - Conversely, Series 9 is positively skewed, the median closer to the bottom of the box, and a greater concentration of observations are in the bottom of the box.[^1]
+:::
+
+[^1]: Readers may be reassured to know that I still get negative and positive skew mixed up in direction. It might be because I also don't know my left and right instinctively.
+
+# To Mean or Median...
+As we start to conclude this post, we bring it back to two single summary statistics, namely the profanity rate (also known as the mean profanity utterance), and the median profanity utterance. To end, we simply compare the profanity rate and the median profanity to see if there are any substantial differences between the two statistics.
+
+```{r med-mean-comp, fig.cap = "Comparing the Mean (Average) and Median Profanity Utterances per episode" , tidy = FALSE, echo = FALSE,}
+
+colours <- c("Mean" = "darkblue", "Median" = "darkred")
+
+ggplot(summary_tm_profanity, aes(x = series), linewidth = 2, size = 2) +
+ geom_rect(aes(xmin = 1, ymin = 4,
+xmax = 7.5, ymax = 7), fill = "#f3b0b0") +
+ geom_rect(aes(xmin = 7.5, ymin = 2,
+xmax = 16, ymax = 5), fill = "#b0c6f3") +
+ geom_vline(aes(xintercept = 7.5), linetype = 4, linewidth = 1.5, alpha = 0.75, colour = "gray") +
+ annotate(geom = "segment", x = 1, xend = 7, y = 5.5, linetype = 3) +
+ annotate(geom = "segment", x = 8, xend = 16, y= 3.5, linetype = 3) +
+ geom_line(aes(y = avg_profanity, color = "Mean")) +
+ geom_point(aes(y = avg_profanity, color = "Mean")) +
+ geom_line(aes(y = median_profanity, color = "Median" ),linetype = 2) +
+ geom_point(aes(y = median_profanity, color = "Median"), shape = 25) +
+ labs(x = "Series",
+ y = "Profanity Utterance per Episode",
+ title = "Mean and Median Profanity Utterance comparison",
+ caption = "N.B. Mean Profanity Utterance is the same as Profanity Rate",
+ color = "Legend"
+ ) +
+ scale_color_manual(values = colours) +
+ theme(legend.position="top") +
+ scale_y_continuous(breaks = seq(0, 15, 1), limits = c(0, 10)) +
+ scale_x_continuous(breaks = seq(1, 16, 1))
+
+```
+
+:::{ .insights}
+Figure \@ref(fig:med-mean-comp) indicates that there is relatively little difference between the mean and median:
+
+- the mean and median are generally aligned sharing similar, but not identical, values
+ - similar in value suggests that there are no extreme values or outliers which would affect the mean more than the median.
+ - series where the mean and median deviate the most correspond to skewed boxplots (see Series 9 and 10).
+- the two statistics exhibit the same overall trend over Series time is the same; profanity in Series 1-7 was generally at a higher occurrence rate than profanity from Series 8 onwards.
+ - This could have been different if the mean and median were vastly different in value.
+:::
+
+
+With the mean and the median being so similar in value and behaviour from Figure \@ref(fig:med-mean-comp), we may ask ourselves what was the whole point of this exercise if we achieve the same conclusions. Well,
+I would say that we were "lucky" in this scenario and if Taskmaster has taught us anything, it is that there doesn't necessary to be a point for everything.
+
+## The Copout Answer
+Those new to statistics may want to definitively know which statistic to use in life. Unfortunately there is no clear cut answer for this, and it largely depends on the problem and the application. Both statistics have their advantages and disadvantages, and its important to consider what is best for the task in hand.
+
+- The mean is more commonly accepted amongst the general public and can be efficiently computed over time (for example if we were being drip fed observations slowly over time, it easy to calculate the new mean). However, it is very sensitive to outliers and extreme values.
+- The median is less sensitive to outliers and extreme values. However, it can be more computationally intensive to compute over time and with large datasets (reordering the data is necessary to find the "new" midpoint observation)
+
+One common theme that you can expect to see in the field of Statistics ( and life in general) is that there is often no single answer for everything and it is very rare to to have a clear, black-and-white answer and conclusion. Conclusions drawn from data and and statistical methods should also come with an understanding of potential drawbacks and limitations.
+
+I hope the reader is prepared for the "50 Shades of Gray" conclusions we may be getting from The Median Duck project!
+
+
+# What Have We Learnt Today?
+
+::: {.infobox .today data-latex="{today}"}
+```{r, , out.width = "25%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = "https://www.beyondthejoke.co.uk/sites/default/files/styles/large/public/screen_shot_2021-09-09_at_11.17.00.png")
+```
+
+There is evidence to suggest that the **Taskmaster has become less foul mouthed in recent series of the show**.
+
+The **profanity uttered per episode has noticeably decreased:**
+
+- **from 4 to 7 utterances in Series 1 to 7**
+- **to 2 to 5 utterances from Series 8 to 16**.
+
+This can be seen by two different profanity statistics, the profanity rate and median profanity uttered per episode in a series, and a shift in the distribution (boxplots).
+
+Little Alex Horne's wholesome presence must be having an effect on him...
+
+```{r , out.width = "55%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = "https://media.zenfs.com/en/digital_spy_281/f88e9f7ea4c5bd5bd5cfd53aa9f2d541")
+```
+
+The uptick in Series 16's profanity statistics does suggest that the Taskmaster may be returning to his high profanity rate regime.
+
+:::
+
+
+```{r , fig.show = "hold", out.width = "25%", fig.align='center', error = FALSE, warning=FALSE, echo=FALSE}
+knitr::include_graphics(path = file.path(here(), "img", "gifs", "greg_horseshit.gif"), error = FALSE)
+```
\ No newline at end of file
diff --git a/public/2024/11/the-taskmaster-s-potty-mouth/index.html b/public/2024/11/the-taskmaster-s-potty-mouth/index.html
new file mode 100644
index 0000000..7b2b144
--- /dev/null
+++ b/public/2024/11/the-taskmaster-s-potty-mouth/index.html
@@ -0,0 +1,311 @@
+
+
+
+ The Taskmaster's Potty Mouth - The Median Duck
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
One way to answer our question is to use Profanity Rate that we previously defined in the aforementioned post.
+
Recall that this sums up the number of profanity occurrences within a series for a particular person, and divides by the number of episodes in that series. This provides the number of times the person in question will swear in an episode of that series, on average.
+
+
+Figure 1: The Taskmaster’s Profanity Rateover Time
+
+
+
+
Figure 1 shows the Taskmaster’s profanity rate over time (captured by series). Profanity rates range between approximately 2.5 (series 8) and 7 (series 7). Visually, there does appear to be a change in profanity rate from series 8 onwards; the profanity rate drops from between 4 to 7 (red area), to 2.5 and 5 (blue area).
+
The uptick in profanity rate in series 16 such that it could plausibly associated and drawn from the Series 1-7 profanity rate range. Could Greg be on the cusp of returning to his old foul mouthed ways?
+
+
There is some evidence to suggest that the Taskmaster has becomes less potty mouthed over time with a significant drop in profanity instances per episode from series 8 onwards. The profanity rate drops from between 4 to 7 (Series 1 to 7), to 2.5 and 5 (Series 8 to 16).
+
This could be seen as slightly counter intuitive as:
+
+
We might assume that Greg, in his old age, has become more frustrated with life and contestants and thus more likely to swear.
+
We might assume that as the show has progressed and evolved, Greg has played up his angry persona as the Taskmaster and thus more likely to swear.
+
+
+
However, there are questions surrounding whether this change in profanity is statistically significant. There is also a noticeable overlap between between two coloured areas in which the profanity rate values are common to both; the profanity rates associated with series 9, 10 and 16 could have plausibly been drawn from the “Pre Series 8” swearing regime.
+
There also potential questions around whether the the profanity rate is the ideal statistic to answer our question since it can be swayed by rogue observations.
+
+
Potential drawbacks to the Profanity Rate.
+
The profanity rate, which essentially is an average (mean) summary statistic, is highly influenced by outliers and extreme values. It is not considered a robust statistic and is highly sensitive to the data.
+
In addition, the Profanity Rate, by itself, also does not capture the potential distribution (spread) of profanity utterances sufficiently. That is, if we were to watch many episodes of Taskmaster, what are the range of profanity utterances we can expect to see (or hear), and how much do they vary across episodes.
+
For this reason, we might want to consider additional statistics to the profanity rate (the mean profanity utterances), which are more robust and highlight the spread of distribution. The median, and percentiles in general, are one way to address these two issues.
+
But before we can calculate these metrics, some additional work is required.
+
+
+
+
Case of the Missing Profanities
+
One feature of our data source is that if Greg (or any other person for this matter) did not swear at all in an episode, no records will be present in our underlying dataset. 0 profanities should be associated with these episodes, which currently aren’t being captured.
+
If we want to consider something beyond profanity rate (for example the median and percentiles of profanities uttered in an episode across a series), we would need to ensure that these profanity free episodes are captured. Without capturing this profanity free phenomena, our statistics would not be be accurate; here, the median and percentile would be inflated.
+
The “no records for profanity free” feature is not a flaw of the data source or of its design. However, due to the question we want to answer, it is an important consideration that has to be explicitly accounted for in our methodology.
+
+
Why Wasn’t This an Issue with Profanity Rate?
+
It is also worth remarking that that this “zero profanities” phenomena was not an issue for the profanity rate calculation. Recall the equation for profanity rate was:
+
\[\begin{equation}
+\texttt{Profanity Rate for Contestant C in series S} = \frac{\sum{\texttt{Profanity by contestant C in series S}}}{\texttt{Number of episodes in series S}}
+\tag{1}
+\end{equation}\]
+
The numerator (the top of the fraction) in Equation (1) will be unaffected if zero profanities were observed (whether explicitly captured or not). The denominator (the bottom) in Equation (1) shows that we normalise by the number of episodes in a series, and not the number of episodes in the series in which profanity was observed. It is this normalisation that means that the profanity rate is not affected by this phenomena.
+
+
+
Profanity Consistency
+
Before we potentially start capturing these “profanity free episodes” explicitly, we should assess whether this is an actual problem first. To do this, we define the Profanity Consistency Rate.
+
\[\begin{equation}
+
+\texttt{Profanity Consistency Rate for Series $i$} = \frac{\texttt{Number of Episodes in Series $i$ which featured at least 1 swear}}{\texttt{Number of Episodes in Series $i$}}
+\tag{2}
+\end{equation}\]
+
The Consistency Rate can be thought of as the proportion of episodes in a series in which at least one swear word was uttered by the Taskmaster. A Consistency Rate of 100% means that the Taskmaster swore in all episodes of the series at least once; consistency rate of 50% means the Taskmaster swore in half of the episodes of the series at least once.
+
For the purpose of this post, we will only consider and calculate the profanity consistency rate for the Taskmaster. However, the same logic applies for any other person of interest.
+
+
+Figure 2: Taskmaster’s Profanity Consistency
+
+
+
+
Figure 2 plots the Taskmaster’s Profanity Consistency over time (series). Anything below 100% indicates that the Taskmaster did not swear in all episodes of that particular series.
+
From this we deduce that Greg was less irate in some episodes of Series 10, 11 and 15 as he did not swear in them.
+
Consequently, our current dataset does not include these “profanity free episodes”.
+
+
+
+
Putting on a Spread
+
These “profanity free episodes” records can be captured through data munging steps, namely LEFT JOIN with the episodes table (left table), such that if an episode does appear in our enhanced profanity table, we set the profanity utterance to 0.
+
+
+Figure 3: Boxplots of Profanity Utterances
+
+
+
+
Figure 3 shows the boxplot of profanity utterances per episode, for each series. Boxplots are one way to show the distribution and spread of a quantity that is random, in this case, profanity utterances in an episode.
+
The thick black line in the center of the box (which the Taskmaster sits upon in this figure), represents the median profanity. 50% of profanity utterances in that series will lie above and below it respectively. The bottom and top bottom of the box represent the 25th and 75th percentile respectively (proportion of data lying below these values). The box will represent where at least 50% of observations will lie between. For sake of simplicity, I have not included the “whiskers” that are commonly used with boxplot figures; these whiskers show another spread range of observations. Individual observations which lie outside of the range are also displayed.
+
+
Some observations (but not all):
+
+
The Profanity Boxplots show a similar behaviour and conclusion to that when considering the mean profanity; Greg has become less foul mouthed post Series 8 onwards. The median, and the main box, is noticeably lower from Series 8 onwards compared to pre Series 7.
+
The size of the box varies more from Series 1-7 than after Series 8 onwards. This suggests that Greg was more volatile with his profanity usage in early seasons.
+
+
However it is worth noting that Series 1 to 5 were shorter in length than Series 6 onwards. From Series 6 onwards, a series contains 10 episodes. Prior to this, a series could be as short as 5 episodes (Series 2 and 3), and long as 8 episodes (Series 4 and 5). Due to the limited number of data points in these earlier, short series, some care needs to be taken from the conclusions we draw from them.
+
+
We continue to see an overlap in data from the two regimes; the lower proportion of the boxplots in the “High Profanity Regime” (Series 1 to 7) overlaps with the top proportion of the boxplots in the “Low Profanity Regime” (Series 8 onwards)
+
Series 4 has the the smallest sized boxplot. This indicates relatively little spread and deviation in the profanity utterances per episode. Greg is pretty consistent in uttering five profanities per episode in this season.
+
Series 7 has the largest boxplot and thus the greatest spread. Greg isn’t as consistent and is more random with is profanity utterances in this series.
+
Series 16 shows an uptick in profanities utterances to a level similar to pre Series 8; the median for series 16 is 5. Could this be the start of a new regime? Data from Series 17 onwards would help support or debunk this hypothesis.
+
Not all series boxplots are symmetrical, for example see Series 8 and 9. This indicates that there is some skewness in the profanity utterance distribution.
+
+
Series 8 is negatively skewed; the median is closer to the top of the box, and a greater concentration of observations are in the top of the box.
+
Conversely, Series 9 is positively skewed, the median closer to the bottom of the box, and a greater concentration of observations are in the bottom of the box.1
+
+
+
+
+
+
To Mean or Median…
+
As we start to conclude this post, we bring it back to two single summary statistics, namely the profanity rate (also known as the mean profanity utterance), and the median profanity utterance. To end, we simply compare the profanity rate and the median profanity to see if there are any substantial differences between the two statistics.
+
+
+Figure 4: Comparing the Mean (Average) and Median Profanity Utterances per episode
+
+
+
+
+
Figure 4 indicates that there is relatively little difference between the mean and median:
+
+
the mean and median are generally aligned sharing similar, but not identical, values
+
+
similar in value suggests that there are no extreme values or outliers which would affect the mean more than the median.
+
series where the mean and median deviate the most correspond to skewed boxplots (see Series 9 and 10).
+
+
the two statistics exhibit the same overall trend over Series time is the same; profanity in Series 1-7 was generally at a higher occurrence rate than profanity from Series 8 onwards.
+
+
This could have been different if the mean and median were vastly different in value.
+
+
+
+
With the mean and the median being so similar in value and behaviour from Figure 4, we may ask ourselves what was the whole point of this exercise if we achieve the same conclusions. Well,
+I would say that we were “lucky” in this scenario and if Taskmaster has taught us anything, it is that there doesn’t necessary to be a point for everything.
+
+
The Copout Answer
+
Those new to statistics may want to definitively know which statistic to use in life. Unfortunately there is no clear cut answer for this, and it largely depends on the problem and the application. Both statistics have their advantages and disadvantages, and its important to consider what is best for the task in hand.
+
+
The mean is more commonly accepted amongst the general public and can be efficiently computed over time (for example if we were being drip fed observations slowly over time, it easy to calculate the new mean). However, it is very sensitive to outliers and extreme values.
+
The median is less sensitive to outliers and extreme values. However, it can be more computationally intensive to compute over time and with large datasets (reordering the data is necessary to find the “new” midpoint observation)
+
+
One common theme that you can expect to see in the field of Statistics ( and life in general) is that there is often no single answer for everything and it is very rare to to have a clear, black-and-white answer and conclusion. Conclusions drawn from data and and statistical methods should also come with an understanding of potential drawbacks and limitations.
+
I hope the reader is prepared for the “50 Shades of Gray” conclusions we may be getting from The Median Duck project!
+
+
+
+
What Have We Learnt Today?
+
+
+
There is evidence to suggest that the Taskmaster has become less foul mouthed in recent series of the show.
+
The profanity uttered per episode has noticeably decreased:
+
+
from 4 to 7 utterances in Series 1 to 7
+
to 2 to 5 utterances from Series 8 to 16.
+
+
This can be seen by two different profanity statistics, the profanity rate and median profanity uttered per episode in a series, and a shift in the distribution (boxplots).
+
Little Alex Horne’s wholesome presence must be having an effect on him…
+
+
The uptick in Series 16’s profanity statistics does suggest that the Taskmaster may be returning to his high profanity rate regime.
+
+
+
+
+
+
+
Readers may be reassured to know that I still get negative and positive skew mixed up in direction. It might be because I also don’t know my left and right instinctively.↩︎
To entertain and educate the general public of statistical and analytical concepts through data associated with the hit UK TV show “Taskmaster”.
-
-
-
-
The Objective
-
Using data from the UK hit TV show “Taskmaster”, The Median Duck aims to entertain and educate the general public of statistical and analytical concepts. Articles and analysis will be written for a non-technical audience predominantly.
-
Types of analysis and content fall under the following umbrellas with some examples.
-
The various datasets, code snippets and analysis will be open source on this git repo, such that others are able to reproduce it at their own pace, and contribute to the project.
-
-
Exploratory Analysis
-
-
Based on existing series and historical data, what insights can we gleam.
-
-
Examples:
-- Do older contestants fair better on the show? Is there a correlation between series ranking and age of contestant?
-- Has Greg been harsher as the show has progressed along?
-- Was Sally Phillips better at creative tasks, than objective tasks on average?
-- Existing analysis that you have performed thus far would fall into this category.
-
-
-
Predictive Analysis
-
-
Based on existing series and historical data, can we make a prediction on an outcome. The outcome will be realised eventually, and we can compare our prediction to the observed outcome to see how (in)accurate we were.
-
-
Examples:
-- Prior to a series starting, can we predict who is likely to win the series? What are the probabilities (odds) of each contestant winning?
-- As the series progresses and we start to see actual contestant performance, how do these odds change?
-
-
-
Hypothetical Analysis
-
-
Based on existing data, can we make a hypothesise as to what an outcome may be. The outcome will not be realised (or at least very low probability), and we thus can’t compare our hypothesised prediction to reality.
-
-
Examples:
-- Would Katherine Ryan still have won her series if it was a 10 episode series, rather than 5 episodes?
-- What would be the outcome of a hypothetical series featuring Ed Gamble, Victoria Coren Mitchell, Nish Kumar, Josh Widdicombe and Sally Phillips?
-
-
-
Game Theory and Strategy
-
-
Based on existing literature, what is a optimal way to succeed in a task.
-
-
Examples:
-- What is the optimal strategy in Series 13’s “Guess Shoe” task?
-- What is the optimal strategy in Series 4’s “Draw the median duck” task?
-- What is the optimal strategy in Series 13’s “Give Alex a high-five. The third-fastest high-fiver wins.” task?
-
-
-
-
Final Thoughts
-
This vision is ambitious and I’m not entirely sure if I can pull it all (or any) of it off. However, in the spirit of Taskmaster, I am willing to give it a try, and potentially make a fool out of myself as I succeed or fail.
-
I’m hoping that documenting this vision will also inspire others to contribute and collaborate on the project in the future.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+
+ http://localhost:4321/themedianduck/vision/
+
+
+
+
+
diff --git a/static/data/taskmaster.db b/static/data/taskmaster.db
index ca1e739..f56d2e6 100644
Binary files a/static/data/taskmaster.db and b/static/data/taskmaster.db differ
diff --git a/static/img/gifs/greg_fuckingbus.gif b/static/img/gifs/greg_fuckingbus.gif
new file mode 100644
index 0000000..fc40e92
Binary files /dev/null and b/static/img/gifs/greg_fuckingbus.gif differ
diff --git a/static/img/gifs/greg_horseshit.gif b/static/img/gifs/greg_horseshit.gif
new file mode 100644
index 0000000..0b7d71e
Binary files /dev/null and b/static/img/gifs/greg_horseshit.gif differ