---
title: 'Strength in Data: Connecting to the Taskmaster Database'
author: Christopher Nam
date: '2024-07-10'
slug: []
categories: ["Getting Started", "Introduction", "Beginner", "Setup"]
tags:
- Introduction
- Setup
- Beginner
- Getting Started
draft: no
output:
blogdown::html_page:
toc: true
toc_depth: 1
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "../")
```

# Your Task

> Successfully connect to the Taskmaster database from within `R`. Fastest wins; your time starts now!

# Introduction and Objective

This article provides an overview of *Trabajo de las Mesas*, a pivotal Taskmaster database that will be central to the multitude of analyses and questions we may want to answer about Taskmaster.

The article will also provide guidance on how to connect to the database from within `R`.

# *Trabajo de las Mesas* Database

[*Trabajo de las Mesas*](https://tdlm.fly.dev/) (TdlM^[Taskmaster fanatics will know that this is a reference to the hint in S2E5's task, *Build a bridge for the potato*, which has since become one of the key pieces of advice for all Taskmaster contestants. It has been suitably adapted here for working on data tables in a database, rather than a piece of furniture.]) provides a plethora of Taskmaster data in database format. The database includes information pertaining to series, episodes, contestants, task attempts, and even profanities uttered by contestants.

The exhaustive nature of the data truly opens the door to the questions we may want to answer about the Taskmaster universe. For this reason, I am immensely grateful to the contributors of this project.

## Data Quality

As with any analysis and modelling project, the insights and conclusions generated are only as good as the data that feeds them.

I do not know the specifics of how this data is collated and reviewed (my intention is to dedicate a future article to this), but I believe the data is entered by fellow (hardcore) Taskmaster fans from [taskmaster.info](https://taskmaster.info/), an equally exhaustive Taskmaster resource.

For now, and so as not to derail my initial interest and excitement in The Median Duck project, I will assume that the data is of high quality (accurate, consistent, etc.).

If there are any instances where the data quality is suspect, and/or a contradictory insight or conclusion is identified, a deep dive will likely occur; that process should in turn provide useful insight for any aspiring individuals hoping to get into data analytics.

## Why This Datasource?

As Taskmaster is a global phenomenon, there are no doubt other datasources that could be used for this project. Most notably, Jack Bernhardt maintains an exhaustive [Google Sheets document](https://docs.google.com/spreadsheets/d/1Us84BGInJw8Ef32xCVSVNo1W5mjri9CpUffYfLnq5xA/edit?usp=sharing) on which similar analysis and modelling could be performed.

However, for the purposes of this project, being able to query a database has several advantages, including:

- Quality: data in a structured, tabular format often leads to better data quality.
- Manipulation: richer manipulations and transformations can be employed (joins, group-bys, etc.).
- Automation, repeatability and scalability: if we want to repeat the same or similar analysis on a new subset of data (for example, updated data after a new series is broadcast, or new parameters being employed), it is more convenient to do so against a structured data source such as a database.

However, a database approach is by no means perfect either. The barrier to entry is considerably higher than for data stored in a spreadsheet (whether adding, manipulating or analysing data), and spreadsheets are good for ad-hoc, interactive analysis.

Considering the overall vision of The Median Duck, I believe that a database approach is ideal.

## Potential Areas to Explore in the Future

- Greater understanding of how the data is being collected.
  - Is it manual, and are there quality checks in place? Is there any opportunity to automate?
  - Can we introduce an SLA (service level agreement) for when the data can be expected to be populated? Data associated with more recent series doesn't appear to be present, despite having already been broadcast.
  - Introduction of an ETL timestamp.
- Generate a data dictionary page.
  - What tables are available, samples of the data, what each table pertains to, and key columns.
- A dashboard on data quality.
  - A high-level overview of the quality and freshness of the data.

# Connecting to the Database from `R`

## Downloading the `.db` file

It is possible to view and query the numerous tables in TdlM from the [website itself](https://tdlm.fly.dev/). However, this does not intuitively lead to repeatable and reproducible analysis. Connecting to the database from a statistical programming language such as `R` or `python` naturally leads to repeatability and reproducibility.

I opted to use `R` for this project due to my familiarity with it, and the high-level visualisations and modelling it enables.

The tables displayed on the website are powered by the following [database file](https://tdlm.fly.dev/taskmaster.db), which can be downloaded and stored locally. The following code chunk downloads the database file locally (relative to the repo directory); the corresponding folder will be created if it does not already exist.

```{r download}
# URL where the database file resides. We will download from here.
db_url <- "https://tdlm.fly.dev/taskmaster.db"

# Where the data will be stored locally
repo_root_dir <- getwd()
db_file_name  <- "taskmaster.db"
data_dir      <- "Data"

db_data_location <- file.path(repo_root_dir, data_dir, db_file_name)

# Create the Data directory if it does not already exist
if (!dir.exists(file.path(repo_root_dir, data_dir))) {
  dir.create(file.path(repo_root_dir, data_dir))
}

# Download the file specified by the URL and save it in the local destination.
if (!file.exists(db_data_location)) {
  download.file(url = db_url, destfile = db_data_location, mode = "wb")
}
```
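
As a quick sanity check (a minimal sketch using only base `R` and the path defined above), we can confirm that the file arrived and is non-trivial in size:

```{r verify_download}
# Confirm the database file exists locally, and report its size in bytes.
file.exists(db_data_location)
file.size(db_data_location)
```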

## Connecting to the `.db` file

Now that the database file has been successfully downloaded, we can connect to it from within `R`. The `DBI` and `RSQLite` packages will be employed to establish this connection.

```{r db_connect}
package_name <- "RSQLite"

# Load the package, installing it first if it is not already available.
# (RSQLite imports DBI, which supplies the dbConnect() interface.)
if (!require(package_name, character.only = TRUE)) {
  install.packages(package_name)
  library(package_name, character.only = TRUE)
}

# Driver used to establish the database connection
sqlite_driver <- dbDriver("SQLite")

# Making the connection
tm_db <- dbConnect(sqlite_driver, dbname = db_data_location)
```
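
Before going any further, `DBI`'s `dbIsValid()` offers a quick check (a minimal sketch) that the connection object is live:

```{r check_connection}
# Returns TRUE if the connection was established successfully.
dbIsValid(tm_db)
```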

If successful, we should be able to list all the tables included in the database.

```{r list_tables}
# List all tables that are available in the database
dbListTables(tm_db)
```
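
To drill into a single table, `dbListFields()` lists its columns. A minimal sketch using the `series` table, which the queries below rely on:

```{r list_fields}
# List the columns of the series table.
dbListFields(tm_db, "series")
```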

## Querying the Database

Now that we are successfully able to connect to the database, we are able to write queries and execute them directly from `R` to access the data. For example:

### A Basic `SELECT` query

```{r series_output, cols.print=25}
# A basic SELECT query on the series table.
query <- "SELECT * FROM series LIMIT 10"

dbGetQuery(tm_db, query)
```
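
Where a query depends on a value supplied from `R`, it is safer to bind it as a parameter than to paste it into the SQL string. A minimal sketch using `DBI`'s `params` argument, assuming the `series` table's numeric `id` column (the same column used in the join below):

```{r parameterised_query}
# The ? placeholder is bound via params, avoiding string pasting.
query <- "SELECT * FROM series WHERE id = ? LIMIT 10"

dbGetQuery(tm_db, query, params = list(13))
```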

### Advanced query

A more involved query, involving a `JOIN` and some date manipulation:

```{r advanced_query, max.print=25}
# A join, plus date manipulation
query <- "SELECT ts.name,
    ts.special AS special_flag,
    tp.name AS champion_name,
    tp.seat AS champion_seat,
    DATE(ts.studio_end) AS studio_end,
    DATE(ts.air_start) AS air_start,
    JULIANDAY(ts.air_start) - JULIANDAY(ts.studio_end) AS broadcast_lag_days
  FROM series ts
  LEFT JOIN people tp
    ON ts.id = tp.series
    AND ts.champion = tp.id
  WHERE ts.special <> 1
"

results <- dbGetQuery(tm_db, query)
results
```
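
To put any single series in context, the lag column can be summarised directly in `R` (a minimal sketch using base `R` on the `results` data frame above):

```{r lag_summary}
# Distribution of the broadcast lag (in days) across series.
summary(results$broadcast_lag_days)
```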

The results of this query already reveal interesting insights: namely that 204 days (approximately `r round(204/7)` weeks) elapsed between the studio recording and the first air date for Series 13, a noticeable deviation from prior series (a greater broadcast lag). Subsequent series also appear delayed, although to a lesser extent. Could the pandemic have initiated this lag? Or were there other production changes that led to it?

# Time's Up!

And that concludes this task! Hopefully you've been able to connect to the TdlM database directly through `R`, and are perhaps inspired to start performing your own analysis.
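
One last piece of housekeeping: when you are finished querying, it is good practice to release the connection (standard `DBI` usage):

```{r disconnect}
# Close the database connection once finished.
dbDisconnect(tm_db)
```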