---
title: 'Strength in Data: Connecting to the Taskmaster Database'
author: Christopher Nam
date: '2024-07-10'
slug: []
categories: ["Getting Started", "Introduction", "Beginner", "Setup"]
tags:
- Introduction
- Setup
- Beginner
- Getting Started
draft: no
output:
blogdown::html_page:
toc: true
toc_depth: 1
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "../")
```

# Your Task

> Successfully connect to the Taskmaster database from within `R`. Fastest wins; your time starts now!

# Introduction and Objective

This article provides an overview of *Trabajo de las Mesas*, a pivotal Taskmaster database that will be central to the multitude of analyses and questions we may want to answer about Taskmaster.

The article will also provide guidance on how to connect to the database from within `R`.

# *Trabajo de las Mesas* Database

[*Trabajo de las Mesas*](https://tdlm.fly.dev/) (TdlM^[Taskmaster fanatics will know that this is a reference to the hint in S2E5's task, *Build a bridge for the potato*, which has since become one of the key pieces of advice for all Taskmaster contestants. It has been suitably adapted here for working on data tables in a database, rather than a piece of furniture.]) provides a plethora of Taskmaster data in database format. The database includes information pertaining to series, episodes, contestants, task attempts, and even profanities uttered by contestants.

The exhaustive nature of the data truly opens the door to the questions we may want to answer about the Taskmaster universe. For this reason, I am immensely grateful to the contributors of this project.

## Data Quality

As with any analysis and modelling project, the insights and conclusions generated are only as good as the data that feeds them.

I do not know the specifics of how this data is collated and reviewed (my intention is to dedicate a future article to this), but I believe the data is entered by fellow (hardcore) Taskmaster fans from [taskmaster.info](https://taskmaster.info/), an equally exhaustive Taskmaster resource.

For now, and so as not to derail my initial interest and excitement in The Median Duck project, I will assume that the data is of high quality (accurate, consistent, etc.).

If there are any instances where the data quality is suspect, and/or a contradictory insight or conclusion is identified, a deep dive will likely occur; that process should in turn provide useful insight for any aspiring individuals hoping to get into data analytics.

## Why This Datasource?

As Taskmaster is a global phenomenon, there are no doubt other datasources that could be used for this project. Most notably, Jack Bernhardt maintains an exhaustive [Google Sheets document](https://docs.google.com/spreadsheets/d/1Us84BGInJw8Ef32xCVSVNo1W5mjri9CpUffYfLnq5xA/edit?usp=sharing) on which similar analysis and modelling could be performed.

However, for the purposes of this project, being able to query a database has several advantages, including:

- Quality: data in a structured, tabular format often leads to better data quality.
- Manipulation: richer manipulations and transformations can be employed (joins, group-bys, etc.).
- Automation, repeatability and scalability: if we want to repeat the same or similar analysis on a new subset of data (for example, updated data after a new series is broadcast, or new parameters being employed), it is more convenient to do so against a structured data source such as a database.

However, a database approach is by no means perfect either. The barrier to entry is considerably higher than for data stored in a spreadsheet (whether adding, manipulating or analysing data), and spreadsheets are good for ad-hoc, interactive analysis.

Considering the overall vision of The Median Duck, I believe that a database approach is ideal.

## Potential Areas to Explore in the Future

- Greater understanding of how the data is being collected.
  - Is it manual, and are there quality checks in place? Is there any opportunity to automate?
  - Can we introduce an SLA (service level agreement) for when the data can be expected to be populated? Data associated with more recent series doesn't appear to be present, despite having already been broadcast.
  - Introduction of an ETL timestamp.
- Generate a data dictionary page.
  - What tables are available, samples of the data, what each table pertains to, and key columns.
- A dashboard on data quality.
  - A high-level overview of the quality and freshness of the data.

# Connecting to the Database from `R`

## Downloading the `.db` file

It is possible to view and query the numerous tables in TdlM from the [website itself](https://tdlm.fly.dev/). However, this does not intuitively lead to repeatable and reproducible analysis. Connecting to the database from a statistical programming language such as `R` or `python` naturally leads to repeatability and reproducibility.

I opted to use `R` for this project due to my familiarity with it, and the high-level visualisations and modelling it enables.

The tables displayed on the website are powered by the following [database file](https://tdlm.fly.dev/taskmaster.db), which can be downloaded and stored locally. The following code chunk downloads the database file locally (relative to the repo directory); the corresponding folder will be created if it does not already exist.

```{r download}
# URL where the database file resides. We will download from here.
db_url <- "https://tdlm.fly.dev/taskmaster.db"

# Where the data will be stored locally
repo_root_dir <- getwd()
db_file_name  <- "taskmaster.db"
data_dir      <- "Data"

db_data_location <- file.path(repo_root_dir, data_dir, db_file_name)

# Create the Data directory if it does not already exist
if (!dir.exists(file.path(repo_root_dir, data_dir))) {
  dir.create(file.path(repo_root_dir, data_dir))
}

# Download the file specified by the URL and save it in the local destination.
if (!file.exists(db_data_location)) {
  download.file(url = db_url, destfile = db_data_location, mode = "wb")
}
```
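
As a quick sanity check (a minimal sketch using only base `R` and the path defined above), we can confirm that the file arrived and is non-trivial in size:

```{r verify_download}
# Confirm the database file exists locally, and report its size in bytes.
file.exists(db_data_location)
file.size(db_data_location)
```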

## Connecting to the `.db` file

Now that the database file has been successfully downloaded, we can connect to it from within `R`. The `DBI` and `RSQLite` packages will be employed to establish this connection.

```{r db_connect}
package_name <- "RSQLite"

# Load the package, installing it first if it is not already available.
# (RSQLite imports DBI, which supplies the dbConnect() interface.)
if (!require(package_name, character.only = TRUE)) {
  install.packages(package_name)
  library(package_name, character.only = TRUE)
}

# Driver used to establish the database connection
sqlite_driver <- dbDriver("SQLite")

# Making the connection
tm_db <- dbConnect(sqlite_driver, dbname = db_data_location)
```
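
Before going any further, `DBI`'s `dbIsValid()` offers a quick check (a minimal sketch) that the connection object is live:

```{r check_connection}
# Returns TRUE if the connection was established successfully.
dbIsValid(tm_db)
```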

If successful, we should be able to list all the tables included in the database.

```{r list_tables}
# List all tables that are available in the database
dbListTables(tm_db)
```
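
To drill into a single table, `dbListFields()` lists its columns. A minimal sketch using the `series` table, which the queries below rely on:

```{r list_fields}
# List the columns of the series table.
dbListFields(tm_db, "series")
```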

## Querying the Database

Now that we are successfully able to connect to the database, we are able to write queries and execute them directly from `R` to access the data. For example:

### A Basic `SELECT` query

```{r series_output, cols.print=25}
# A basic SELECT query on the series table.
query <- "SELECT * FROM series LIMIT 10"

dbGetQuery(tm_db, query)
```
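
Where a query depends on a value supplied from `R`, it is safer to bind it as a parameter than to paste it into the SQL string. A minimal sketch using `DBI`'s `params` argument, assuming the `series` table's numeric `id` column (the same column used in the join below):

```{r parameterised_query}
# The ? placeholder is bound via params, avoiding string pasting.
query <- "SELECT * FROM series WHERE id = ? LIMIT 10"

dbGetQuery(tm_db, query, params = list(13))
```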

### Advanced query

A more involved query, involving a `JOIN` and some date manipulation:

```{r advanced_query, max.print=25}
# A join, plus date manipulation
query <- "SELECT ts.name,
    ts.special AS special_flag,
    tp.name AS champion_name,
    tp.seat AS champion_seat,
    DATE(ts.studio_end) AS studio_end,
    DATE(ts.air_start) AS air_start,
    JULIANDAY(ts.air_start) - JULIANDAY(ts.studio_end) AS broadcast_lag_days
  FROM series ts
  LEFT JOIN people tp
    ON ts.id = tp.series
    AND ts.champion = tp.id
  WHERE ts.special <> 1
"

results <- dbGetQuery(tm_db, query)
results
```
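
To put any single series in context, the lag column can be summarised directly in `R` (a minimal sketch using base `R` on the `results` data frame above):

```{r lag_summary}
# Distribution of the broadcast lag (in days) across series.
summary(results$broadcast_lag_days)
```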

The results of this query already reveal interesting insights: namely that 204 days (approximately `r round(204/7)` weeks) elapsed between the studio recording and the first air date for Series 13, a noticeable deviation from prior series (a greater broadcast lag). Subsequent series also appear delayed, although to a lesser extent. Could the pandemic have initiated this lag? Or were there other production changes that led to it?

# Time's Up!

And that concludes this task! Hopefully you've been able to connect to the TdlM database directly through `R`, and are perhaps inspired to start performing your own analysis.
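
One last piece of housekeeping: when you are finished querying, it is good practice to release the connection (standard `DBI` usage):

```{r disconnect}
# Close the database connection once finished.
dbDisconnect(tm_db)
```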