Commit fe4d99b (1 parent: f26c192)
Showing 71 changed files with 3,469 additions and 327 deletions.

@@ -0,0 +1,50 @@
---
title: "Hello R Markdown"
author: "Frida Gomam"
date: 2020-12-01T21:13:14-05:00
categories: ["R"]
tags: ["R Markdown", "plot", "regression"]
draft: yes
---

<div id="r-markdown" class="section level1">
<h1>R Markdown</h1>
<p>This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <a href="http://rmarkdown.rstudio.com" class="uri">http://rmarkdown.rstudio.com</a>.</p>
<p>You can embed an R code chunk like this:</p>
<pre class="r"><code>summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
fit <- lm(dist ~ speed, data = cars)
fit
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932</code></pre>
</div>
<div id="including-plots" class="section level1">
<h1>Including Plots</h1>
<p>You can also embed plots. See Figure <a href="#fig:pie">1</a> for example:</p>
<pre class="r"><code>par(mar = c(0, 1, 0, 1))
pie(
  c(280, 60, 20),
  c('Sky', 'Sunny side of pyramid', 'Shady side of pyramid'),
  col = c('#0292D8', '#F7EA39', '#C4B632'),
  init.angle = -50, border = NA
)</code></pre>
<div class="figure"><span style="display:block;" id="fig:pie"></span>
<img src="{{< blogdown/postref >}}index_files/figure-html/pie-1.png" alt="A fancy pie chart." width="672" />
<p class="caption">
Figure 1: A fancy pie chart.
</p>
</div>
</div>
Binary file added (+876 KB):
...osts/2024-07-10-strength-in-data-connecting-to-the-taskmaster-database/Data/taskmaster.db
Binary file not shown.

179 changes: 179 additions & 0 deletions
...t/posts/2024-07-10-strength-in-data-connecting-to-the-taskmaster-database/index.Rmarkdown

@@ -0,0 +1,179 @@
---
title: 'Strength in Data: Connecting to the Taskmaster Database'
author: Christopher Nam
date: '2024-07-10'
slug: []
categories: ["Getting Started", "Introduction", "Beginner", "Setup"]
tags:
- Introduction
- Setup
- Beginner
- Getting Started
draft: no
output:
  blogdown::html_page:
    toc: true
    toc_depth: 1
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, root.dir = "../")
```

# Your Task

> Successfully connect to the Taskmaster database from within `R`. Fastest wins; your time starts now!

# Introduction and Objective
This article provides an overview of *Trabajo de las Mesas*, a pivotal Taskmaster database that will be central to the multitude of analyses and questions we may want to answer regarding Taskmaster.

The article will also provide guidance on how to connect to the database from within `R`.
# *Trabajo de las Mesas* Database

[*Trabajo de las Mesas*](https://tdlm.fly.dev/) (TdlM^[Taskmaster fanatics will know that this is a reference to the hint in S2E5's task, *Build a bridge for the potato*, which has since become one of the key pieces of advice for all Taskmaster contestants. It has been suitably adapted for working on data tables in a database, rather than a piece of furniture.]) provides a plethora of data associated with Taskmaster in a database format. The database includes information pertaining to series, episodes, contestants, task attempts, and even profanity uttered by contestants.

The exhaustive nature of the data truly opens the door to potential questions we may want to answer in the Taskmaster universe. For this reason, I am immensely grateful to the contributors of this project.
## Data Quality

As with any analysis and modelling project, the insights and conclusions generated are only as good as the data supplied.

I do not know the specifics of how this data is collated and reviewed (my intention is to dedicate a future article to this), but I believe the data is inputted by fellow (hardcore) Taskmaster fans from [taskmaster.info](https://taskmaster.info/), an equally exhaustive Taskmaster resource.

For now, and so as not to derail my initial interest and excitement in The Median Duck project, I will assume that the data is of high quality (accurate, consistent, etc.).

If any instances arise where the data quality is suspect, or a contradictory insight or conclusion is identified, a deep dive will likely occur; that process should in turn provide useful insight for any aspiring individuals hoping to get into data analytics.
## Why This Datasource?

As Taskmaster is a global phenomenon, there are no doubt other datasources that could be used for this project. Most notably, Jack Bernhardt maintains an exhaustive [Google sheet document](https://docs.google.com/spreadsheets/d/1Us84BGInJw8Ef32xCVSVNo1W5mjri9CpUffYfLnq5xA/edit?usp=sharing) on which similar analysis and modelling could be performed.

However, for the purposes of this project, being able to query a database has several advantages:

- Quality: data is held in a structured tabular format, which often leads to better data quality.
- Manipulations: greater manipulations and transformations can be employed (joins, group-bys, etc.).
- Automation, Repeatability and Scalability: if we wanted to repeat the same or similar analysis on a new subset of data (for example, updated data due to a new series being broadcast, or new parameters being employed), it is more convenient to do so against a structured data source such as a database.

However, a database approach is by no means perfect either. The barrier to entry is considerably higher than for data stored in a spreadsheet (whether adding, manipulating or analysing data), and spreadsheets are good for ad-hoc, interactive analysis.

Considering the overall vision of The Median Duck, I believe that a database approach is ideal.
## Potential Areas to Explore in the Future

- Greater understanding of how the data is being collected.
  - Is it manual, and are there quality checks in place? Is there any opportunity to automate?
  - Can we introduce an SLA (service level agreement) for when the data can be expected to be populated? Data associated with more recent series does not appear to be present, despite having already been broadcast.
  - Introduction of an ETL timestamp.
- Generate a data dictionary page.
  - What tables are available, samples of the data, what each table pertains to, and key columns.
- A dashboard on data quality.
  - A high-level overview of the quality and how recent the data is.
# Connecting to the Database from `R`

## Downloading the `.db` file

It is possible to view and query the numerous tables in TdlM from the [website itself](https://tdlm.fly.dev/). However, this does not lend itself to repeatable and reproducible analysis. Connecting to the database from a statistical programming language such as `R` or `python` naturally leads to repeatability and reproducibility.

I opted to use `R` for this project due to my familiarity with it, and the high-level visualisations and modelling that can be employed.

The tables displayed on the website are powered by the following [database file](https://tdlm.fly.dev/taskmaster.db), which can be downloaded and stored locally. The following code chunk downloads the database file locally (relative to the repo directory); a corresponding folder will be created if it does not already exist.
```{r download}
# URL where the database file resides. We will download from here.
db_url <- "https://tdlm.fly.dev/taskmaster.db"

# Where the data will be stored locally
repo_root_dir <- getwd()
db_file_name <- "taskmaster.db"
data_dir <- "Data"

db_data_location <- file.path(repo_root_dir, data_dir, db_file_name)

# Create the Data directory if it does not already exist
if(!dir.exists(file.path(repo_root_dir, data_dir))){
  dir.create(file.path(repo_root_dir, data_dir))
}

# Download the file specified by the URL and save it in the local destination
if(!file.exists(db_data_location)){
  download.file(url = db_url, destfile = db_data_location, mode = "wb")
}
```

## Connecting to the `.db` file

Now that the database file has been successfully downloaded, we can connect to it from `R` directly. The `DBI` interface, via its `RSQLite` backend, will be employed to establish this connection.
```{r db_connect}
package_name <- "RSQLite"

# Install the package first if it is not already available, then load it
if(!require(package_name, character.only = TRUE)){
  install.packages(package_name)
  library(package_name, character.only = TRUE)
}

# Driver used to establish the database connection
sqlite_driver <- dbDriver("SQLite")

# Making the connection
tm_db <- dbConnect(sqlite_driver, dbname = db_data_location)
```

If successful, we should be able to list all the tables included in the database.

```{r list_tables}
# List all tables that are available in the database
dbListTables(tm_db)
```
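
We can also inspect the columns of an individual table with `DBI`'s `dbListFields()`. A quick sketch, using the `series` table that features in the queries below (the chunk assumes the `tm_db` connection from the previous chunk):

```{r series_fields}
# List the columns available in the series table
dbListFields(tm_db, "series")
```

This is a convenient way to discover column names before writing a `SELECT` statement by hand.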

## Querying the Database

Now that we can successfully connect to the database, we can write queries and execute them directly from `R` to access the data. For example:

### A Basic `SELECT` query

```{r series_output, cols.print=25}
# A basic SELECT query on the series table
query <- "SELECT * FROM series LIMIT 10"

dbGetQuery(tm_db, query)
```
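
When a query depends on values computed in `R`, it is safer to bind them as parameters than to paste them into the SQL string. `DBI` supports this through the `params` argument of `dbGetQuery()`. A minimal sketch, assuming the `tm_db` connection from earlier and that the `special` column is stored as 0/1 (as the `ts.special <> 1` filter later in this post suggests):

```{r param_query}
# Placeholders ('?') are filled, in order, from the params list
query <- "SELECT name FROM series WHERE special = ? LIMIT ?"

dbGetQuery(tm_db, query, params = list(0, 5))
```

Parameter binding avoids both quoting mistakes and SQL injection, which matters once queries are built from user input or loop variables.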

### Advanced query

A more advanced query, involving a `JOIN` and some date manipulation:

```{r advanced_query, max.print=25}
# A join, plus some date manipulation
query <- "SELECT ts.name,
       ts.special as special_flag,
       tp.name as champion_name,
       tp.seat as champion_seat,
       DATE(ts.studio_end) as studio_end,
       DATE(ts.air_start) as air_start,
       JULIANDAY(ts.air_start) - JULIANDAY(ts.studio_end) as broadcast_lag_days
FROM series ts
LEFT JOIN people tp
  ON ts.id = tp.series
  AND ts.champion = tp.id
WHERE ts.special <> 1
"

results <- dbGetQuery(tm_db, query)
results
```

The results of this query already indicate interesting insights, namely that 204 days (approximately `r round(204/7)` weeks) elapsed between the studio recording and the first air date for Series 13, a noticeable deviation from prior series (a greater broadcast lag). Subsequent series also seem delayed, although to a lesser extent. Could the pandemic have initiated this lag? Or were there other production changes that led to it?
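
Once you have finished querying, it is good practice to close the connection and release the underlying resources (this assumes the `tm_db` connection established earlier):

```{r disconnect}
# Close the database connection when finished
dbDisconnect(tm_db)
```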

# Time's Up!

And that concludes this task! Hopefully you've been able to connect to the TdlM database directly through `R`, and are perhaps inspired to start performing your own analysis.