-
Notifications
You must be signed in to change notification settings - Fork 1
/
SQL.Rmd
1233 lines (904 loc) · 50.6 KB
/
SQL.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "SQL in R"
output:
learnr::tutorial:
css: "css/style.css"
progressive: true
allow_skip: true
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
# Author: Russell McCreath
# Original Date: Feb 2022
# Version of R: 3.6.1
# See here for learnr package documentation: https://rstudio.github.io/learnr/
# Include packages here that are required throughout the training
library(learnr) # Required to build the Shiny app
library(gradethis) # For code checking and user-specific feedback
library(readr) # For reading/writing data
library(RSQLite) # For interfacing with SQLite databases
library(odbc) # For interacting with databases
library(kableExtra)
library(shiny)
library(dplyr)
library(dbplyr)
knitr::opts_chunk$set(echo = FALSE)
tutorial_options(
exercise.checker = gradethis::grade_learnr
)
### Database Setup
# Load Borders data
borders_data <- readRDS("www/data/borders.rds")
# Load pantheon data
pantheon <- read_csv("www/data/gb_us_pantheon.csv")
# Load baby names data
baby <- read_csv("www/data/Baby5.csv")
baby_family <- read_csv("www/data/Baby6.csv") %>%
select(FAMILYID, SURNAME)
# Load hospital lookup data
hospital_lookup <- read_csv("www/data/hospital_lookup.csv")
# Create a connection to the database
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
# Write Borders data into a table within the database
dbWriteTable(conn, "borders_data", borders_data, overwrite = TRUE)
# Write pantheon data into a table within the database
dbWriteTable(conn, "pantheon", pantheon, overwrite = TRUE)
# Write baby data into a table within the database
dbWriteTable(conn, "baby", baby, overwrite = TRUE)
dbWriteTable(conn, "baby_family", baby_family, overwrite = TRUE)
# Write hospital lookup data into a table within the database
dbWriteTable(conn, "hospital_lookup", hospital_lookup, overwrite = TRUE)
# List tables available in the database
dbListTables(conn)
```
```{r phs-logo, echo=FALSE, fig.align='right', out.width="40%"}
knitr::include_graphics("images/phs-logo.png")
```
## Introduction
Welcome to **SQL in R**. This is a self-led course on accessing and utilising databases within Public Health Scotland, specifically SQL databases, in R. The first part is a complete introductory course on SQL, then the knowledge of integrating that within R is next. Throughout this course there will be quizzes to test your knowledge and opportunities to modify and write both R and SQL code.
<div class="info_box">
<h4>Course Info</h4>
<ul>
<li>This course is built to flow through sections and build on previous knowledge. If you're comfortable with a particular section, you can skip it.</li>
<li>The course will also show progress through sections, a green tick will appear on sections you've completed, and it will remember your place if you decide to close your browser and come back later.</li>
</ul>
</div>
</br>
### What is SQL?
SQL (you can read about the [pronounciation *conflict* here](https://database.guide/is-it-pronounced-s-q-l-or-sequel/)) is the **S**tructured **Q**uery **L**anguage used to interact with relational databases. This is, by far, the most popular database structure and for our purposes makes storing and interacting with data so much more efficient.
<div class="supporting-image-right">
```{r intro-image, echo=FALSE, fig.align='center', out.width="90%"}
knitr::include_graphics("images/sql/data_boxes.png")
```
</div>
A relational database is just like a collection of tables, rows and columns, like an Excel spreadsheet. Each table represents a single entity, columns (or fields) represent attributes, and rows (or records) represent information about a single entity. The columns then have rules applied to show what kind of data they should contain, e.g. a text string that is maximum 4 characters long, or a number less than 10. Other rules and logic can also be applied to the setup of the database to ensure consistency and accuracy of data. Have a look at the below table as an example, other tables in this database might include further information about the individual patients or Health Board areas (such as deprivation data):
```{r, echo=FALSE}
graphics_grammar_table <- data.frame(
"ID" = c("1", "2", "3"),
"Name" = c("Harry", "Luna", "Ronald"),
"Specialty" = c("A1", "A11", "A2"),
"LoS" = c("5", "7", "9"),
"HB_Res" = c("S08000031", "S08000024", "S08000031"),
"HB_Treat" = c("S08000031", "S08000031", "S08000031")
)
kableExtra::kbl(graphics_grammar_table) %>%
kable_paper(full_width = FALSE)
```
```{r intro-quiz}
quiz(
question("How many records are in the above table?",
answer("0"),
answer("1"),
answer("2"),
answer("3", correct = TRUE),
answer("4"),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = FALSE
),
question("What was the LoS (Length of Stay) for Luna?",
answer("5"),
answer("9"),
answer("7", correct = TRUE),
answer("2"),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE)
)
```
</br>
PHS have many databases for different purposes, with varying technical infrastructures (Oracle, PostgreSQL, MySQL). All of these utilise a common language, SQL although there are differences as things get more advanced. This course will minimise the need to interact with any live PHS database but will still reference familiar databases and make sure the knowledge gained is transferable. We’ll stay practical in this course though, building knowledge and skill by:
1. **Learning SQL**
* [Query Basics](#query-basics)
* [Aggregate Queries](#aggregate-queries)
* [Joins](#joins)
* [Subqueries](#subqueries)
* [Table Management](#table-management)
2. **Implementing SQL within R**
* [SQL in R](#sql-in-r)
3. Top it off with a **Tidyverse package, [`dbplyr`](https://dbplyr.tidyverse.org/),** used to interact with databases with the Tidyverse style of coding
* [Tidyverse](#tidyverse)
RStudio has functionality to make connecting to and interacting with databases slightly easier but the real power comes from including the SQL code in-line with your R code. This means there’s no interruption to your workflow as you can include it within your automation pipelines, and add supporting data manipulation code in R. However, R relies on processing power and memory and larger data extracts can have an exponential impact on performance. We’ll therefore go over writing efficient queries and other best practice to make our lives that bit easier.
## Query Basics
The basics in this course start with understanding SQL, the language of interacting with databases. We can then plug this SQL into R to keep our code and workflows streamlined. However, as we're learning SQL, the syntax is quite different from R but with a straightforward structure, you're going to be able to do a lot with a little.
SQL can be used for creating and modifying databases but our focus will be querying databases. We work with and rely on data throughout our daily workload, and querying a database is how we can get this data from a database table (or combination of tables).
In the next couple of sections, we're going to be using the `pantheon` table, which is a list of well-known people from the United Kingdom and United States. There's some other variables but why don't you have a look yourself after running this SQL code to retrieve all the data from that table:
```{sql basics-1, connection="conn", exercise=TRUE}
SELECT * FROM pantheon;
```
</br>
```{r basics-1-quiz}
question("What is the listed occupation on the first record, that of Abraham Lincoln?",
answer("President"),
answer("Tennis Player"),
answer("Politician", correct = TRUE),
random_answer_order = TRUE)
```
```{r basics-welcome, echo=FALSE, fig.align='center', out.width="100%"}
knitr::include_graphics("images/sql/tables-welcome.png")
```
</br>
### SELECT
`SELECT` and `FROM` are keywords in SQL, and make the foundation of a query. The `SELECT` statement refers to the attributes (columns) and the `FROM` is the table in the database where that data exists. This forms that most basic query. You can select some (or all) of the columns which will return a two-dimensional copy of the table with the columns requested and all the rows. To select all the columns, we use the wildcard character, asterisk (`*`):
```sql
-- This is a SQL comment
-- Selecting specific columns from the table
SELECT some_column, another_column, ...
FROM my_table;
-- Selecting all columns from the table
SELECT *
FROM my_table;
```
SQL keywords are not case-sensitive but it's here that we can introduce some of our best practice:
* Make SQL keyboards UPPER-CASE, this will distinguish them from other parts of your query.
* End queries with a semicolon, sometimes this is necessary to run the query, other times its not.
* We can already create efficiencies in our code at this stage. Limit the columns queried to what you require in your immediate workflow. If you want to explore/check the data across all columns, you can add a `LIMIT` to your query which will only pull the specified number of rows.
Feel free to play around with this on the `pantheon` table and then finish by creating a table of names (`name`), birth cities (`birthcity`), and occupations (`occupation`). Then, limit the number of results to 30:
```{sql, select-1, connection="conn", exercise=TRUE}
SELECT *
FROM pantheon
LIMIT 10;
```
```{sql, select-1-solution}
SELECT name, birthcity, occupation
FROM pantheon
LIMIT 30;
```
</br>
#### DISTINCT
There may be a situation in which you need the unique values from a column, this is where the `DISTINCT` keyword comes in. As an example, we can get a list of all industries from the `pantheon` table:
```{sql distinct-1, connection="conn", exercise=TRUE}
SELECT DISTINCT industry
FROM pantheon;
```
*`DISTINCT` is attached to the column that you're looking for unique values from. This will completely remove any rows that are duplicated based on that simple identification. Later on, we'll find out about grouping (`GROUP_BY`).*
Now, `DISTINCT` needs to be used with care. To use it, the database system has to use a lot of processing and memory as literally every row is compared. It can also be misused based on how the rest of the query is created, `DISTINCT` will not resolve problems in your code. Only use `DISTINCT` where there are genuine duplicates in the data.
</br>
#### COUNT
Occasionally, it's not even the data that we need, it's the number of records that match the query in the specified table. This is where the `COUNT()` function comes in, and as we learn more about forming SQL queries, each of these components can be used in different ways to return our desired result.
```sql
-- Count total number of rows in a table
SELECT COUNT(*) FROM my_table;
-- Count non-missing values of specific attribute
SELECT COUNT(column) FROM my_table;
-- Count missing values of a specific attribute (more on WHERE next)
SELECT COUNT(column) FROM my_table WHERE column IS NULL;
-- Count the number of distinct values in a column
SELECT COUNT(DISTINCT column) FROM my_table;
```
Now, to get practical, let's get the number of unique cities (`birthcity`) from the `pantheon` table:
```{sql count-1, connection="conn", exercise=TRUE}
```
```{sql count-1-solution}
SELECT COUNT(DISTINCT birthcity)
FROM pantheon;
```
</br>
### WHERE
<div class="supporting-image-right">
```{r where-image, echo=FALSE, fig.align='center', out.width="90%"}
knitr::include_graphics("images/sql/where.jpg")
```
</div>
We often deal with datasets with millions of rows but it's not very often we need every row. It's much more efficient to only extract the data that we need. We can do this by adding constraints with the `WHERE` clause as part of your query. The clause is then applied to each record to determine if that record is included in the results, all at the database system side. This means that no processing is required locally and that only the required data is transferred.
As an example, you might be looking for a particular hospital so you only want records that refer to that hospital. Similarly, you could be looking at groups of specialties, or a range of admission dates.
```sql
-- Selecting specific columns from the table with constraints
SELECT some_column, another_column, ...
FROM my_table
WHERE condition
AND/OR another_condition
...;
```
</br>
We now need to focus on creating the conditions, these will have a familiar logic. Below are some useful operators to start us off:
| Numeric Operator | Condition | Example |
| :---------------: | --------------------------------------- | ------------- |
| `=` `<>` | Equal to and not equal to | `col <> 100` |
| `<` `<=` `>` `>=` | Comparison operators | `col >= 18` |
| `BETWEEN ... AND ...` | Within inclusive range of 2 values | `col BETWEEN 0 AND 5.5` |
| `NOT BETWEEN ... AND ...` | Not within inclusive range of 2 values | `col NOT BETWEEN 0 AND 5.5` |
| `IN (...)` | Exists within list (short for multiple OR conditions) | `col IN (2, 4, 6)` |
| `NOT IN (...)` | Doesn't exist within list | `col NOT IN (1, 3, 5)` |
| String Operator | Condition | Example |
| :---------------: | --------------------------------------- | ------------- |
| `=` `<>` | Equal to and not equal to | `col <> 'Glasgow'` |
| `LIKE` | Case insensitive exact string comparison | `col LIKE 'Glasgow'` |
| `NOT LIKE` | Case insensitive exact string inequality comparison | `col NOT LIKE 'Glasgow'` |
| `%` | Used anywhere in a string to match a sequence of zero or more characters (only with LIKE or NOT LIKE) | `col LIKE '%AT%'` (matches "AT", "ATTIC", "CAT" or "BATS") |
| `_` | Used anywhere in a string to match a single character (only with LIKE or NOT LIKE) | `col LIKE 'AN_'` (matches "AND", but not "AN") |
| `IN (...)` | Exists within list (short for multiple OR conditions) | `col IN ('EDI', 'GLA')` |
| `NOT IN (...)` | Doesn't exist within list | `col NOT IN ('G', 'E', 'A')` |
*For most purposes, this implementation of text search is efficient and works well. However, if the workflow becomes more specified, there are dedicated libraries designed around full text search.*
To test what we've been learning, build a query with the following constraints on the `pantheon` table:
* `birthcity` contains "burgh" in the name
* `domain` is the Arts
* `birthyear` is in the 1920s
```{sql where-1, connection="conn", exercise=TRUE}
```
```{sql where-1-hint-1}
SELECT name
FROM pantheon
WHERE birthcity ___
AND domain ___
AND birthyear ___
```
```{sql where-1-hint-2}
SELECT name
FROM pantheon
WHERE birthcity LIKE ___
AND domain = ___
AND birthyear BETWEEN ___
```
```{sql where-1-solution}
SELECT name
FROM pantheon
WHERE birthcity LIKE '%burgh%'
AND domain = 'Arts'
AND birthyear BETWEEN 1920 AND 1929;
```
```{r where-1-quiz}
question("Who is the result of the above query?",
answer("Gene Kelly"),
answer("F. Murray Abraham"),
answer("Stuart Sutcliffe"),
answer("Andy Warhol", correct = TRUE),
random_answer_order = TRUE)
```
</br>
### ORDER BY
The final, and short, section on basics is about ordering. This isn't always necessary as part of a larger piece of work but can save time if you're looking to prettify a data dump without having to do anything else locally. This is achieved by using `ORDER_BY` which defaults to ascending order but you can append a `DESC` keyword after any of the specified columns to achieve a descending order. You might want to do this by Health Board or date to make the data easier to view as a whole.
```sql
SELECT some_column, another_column, ...
FROM my_table
WHERE condition(s)
ORDER BY some_column, another_column DESC;
-- some_column defaults to ASC (ascending) order
```
## Aggregate Queries
SQL also has the power to provide summarised information through aggregate expressions, rather than just returning the raw data from the database. Here's the template syntax and a list of the most common aggregate functions.
```sql
SELECT AGR_FUNC(column_or_expression), ...
FROM my_table
WHERE condition(s);
```
| Function | Description |
| :---------------: | --------------------------------------- |
| `COUNT(*)` | A count of the total number of rows |
| `COUNT(some_column)` | A count of the total number of non-NULL rows from the specified column |
| `MIN(some_column)` | The smallest numerical value from the specified column |
| `MAX(some_column)` | The largest numerical value from the specified column |
| `AVG(some_column)` | The average numerical value from the specified column |
| `SUM(some_column)` | The total sum of all numerical values from the specified column |
#### AS
Aggregate functions allow you to write efficient queries that save time and processing power locally. However, this can make the SQL and its output harder to read or understand. We can solve this using aliasing (`AS`). To do this, we provide an alias to any part of the `SELECT` line, including column names and our aggregate functions.
To get familiar, let's return the average birth year (`birthyear`) as "Average Birth Year of UK Writers". This will include writers (`occupation`) born in the United Kingdom (`countryName`) from the `pantheon` table.
```{sql agr-1, connection="conn", exercise=TRUE}
```
```{sql agr-1-hint-1}
SELECT ___
FROM pantheon
WHERE occupation ___
AND countryName ___
```
```{sql agr-1-hint-2}
SELECT AVG(birthyear) ___
FROM pantheon
WHERE occupation LIKE "writer"
AND countryName LIKE "United Kingdom"
```
```{sql agr-1-solution}
SELECT AVG(birthyear) AS "Average Birth Year of UK Writers"
FROM pantheon
WHERE occupation LIKE "writer"
AND countryName LIKE "United Kingdom";
```
</br>
### GROUP BY
What's even more useful, perhaps, is the ability to perform these aggregate functions over groups. Similar to how `dplyr::group_by` works, you select the column(s) and every unique group will be generated by the `GROUP BY` clause. Here's some more template syntax:
```sql
SELECT AGR_FUNC(column_or_expression) AS agr_description, ...
FROM my_table
WHERE condition(s)
GROUP BY some_column;
```
</br>
Taking a slightly different version of the problem from the last section, let's find the youngest (`birthyear`) writers (`occupation`) from the UK and US (`countryName`) in the `pantheon` dataset, including a split on gender (`gender`). Let's be explicit and include details in the output such as writer's name (`name`). Think about assigning an alias to any columns too (*note: you can refer to the original column name or alias throughout the rest of your query*)
```{sql agr-2, connection="conn", exercise=TRUE}
```
```{sql agr-2-hint-1}
SELECT name AS Name,
gender AS Sex,
countryName AS "Country of Birth",
___(birthyear) AS "Birth Year"
FROM pantheon
WHERE occupation LIKE "writer"
___
```
```{sql agr-2-hint-2}
SELECT name AS Name,
gender AS Sex,
countryName AS "Country of Birth",
MAX(birthyear) AS "Birth Year"
FROM pantheon
WHERE occupation LIKE "writer"
___
```
```{sql agr-2-hint-3}
SELECT name AS Name,
gender AS Sex,
countryName AS "Country of Birth",
MAX(birthyear) AS "Birth Year"
FROM pantheon
WHERE occupation LIKE "writer"
GROUP BY ___
```
```{sql agr-2-solution}
SELECT name AS Name,
gender AS Sex,
countryName AS "Country of Birth",
MAX(birthyear) AS "Birth Year"
FROM pantheon
WHERE occupation LIKE "writer"
GROUP BY countryName, gender
```
</br>
#### HAVING
With all of this syntax under our belts, our queries can get quite complex. Something that you may encounter is the need to filter rows after your `WHERE` clause, and after they've been grouped and you've used an aggregate function. That's where the `HAVING` clause comes in and works with `GROUP BY` to filter the grouped rows exactly like we would with the `WHERE` clause.
```sql
SELECT group_column, AGR_FUNC(column_or_expression) AS agr_description, ...
FROM my_table
WHERE condition(s)
GROUP BY some_column
HAVING condition(s);
```
As an example, see if you can alter this query to list those occupations where there is more than 50 people. Then, order the table in descending count order.
```{sql agr-3, connection="conn", exercise=TRUE}
SELECT occupation, COUNT(occupation) AS number
FROM pantheon
WHERE birthyear > 1920
GROUP BY occupation;
```
```{sql agr-3-solution}
SELECT occupation, COUNT(occupation) AS number
FROM pantheon
WHERE birthyear > 1920
GROUP BY occupation
HAVING count(occupation) > 50
ORDER BY number DESC;
```
</br>
### Summary
Now is a good point to reflect on our journey with SQL so far. We've looked at a complete SQL query:
```sql
SELECT DISTINCT some_column, AGR_FUNC(column_or_expression) AS agr_description, ...
FROM my_table
WHERE condition(s)
GROUP BY some_column
HAVING condition(s)
ORDER BY some_column ASC/DESC
LIMIT x OFFSET y;
```
However, you may have noticed that has only included a single table, we'll next take a look at how to query multiple related tables and join the data together. Before that, lets run through some questions and check the knowledge we've been building.
```{r queries-quiz}
quiz(
question("In `SELECT * FROM patients;` what does patients represent?",
answer("SQL query"),
answer("SQL statement"),
answer("Database"),
answer("Table", correct = TRUE),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
),
question("What does the following SQL statement return? `SELECT * FROM patients WHERE name LIKE 'a%'`",
answer("It records in the patients table where the value in the name column doesn't have an 'a'."),
answer("It records in the patients table where the value in the name column has an 'a'."),
answer("It records in the patients table where the value in the name column ends with 'a'."),
answer("It records in the patients table where the value in the name column starts with 'a'.", correct = TRUE),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
),
question("Which choice is NOT a statement you would use to filter data?",
answer("`WHERE`"),
answer("`LIMIT`"),
answer("`LIKE`"),
answer("`GROUP_BY`", correct = TRUE),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
),
question("How can you filter duplicate data while retrieving records from a table?",
answer("`WHERE`"),
answer("`LIMIT`"),
answer("`AS`"),
answer("`DISTINCT`", correct = TRUE),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
),
question("How do you select every row in a given table named 'hospitals'?",
answer("`SELECT all FROM hospitals;`"),
answer("`FROM hospitals SELECT all;`"),
answer("`FROM hospitals SELECT *;`"),
answer("`SELECT * FROM hospitals;`", correct = TRUE),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
## Joins
Data is often split across multiple tables in a database as a result of a process known as [normalisation](https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description). This is useful as it minimises duplication of data and simplifies the addition of new data. In some cases our internal databases have views created where tables are already joined but knowing how joins work and how to handle them will be useful.
Now, these tables that share information need to have a *primary key* that uniquely identifies the entity across the database. Commonly, this is a increasing integer referred to as an ID, but can be anything that is unique, e.g. a CHI number in some cases.
### INNER JOIN
The first type of join we’re going to look at is the inner join. This process uses a variable common to both tables to match rows and will only keep rows which have a record in both tables - we specify the common variable using the ON constraint. The other arguments we learned about in the previous sections can then be applied as required.
```{r join-inner, echo=FALSE, fig.align='center', out.width="100%"}
knitr::include_graphics("images/sql/join-inner.png")
```
```sql
SELECT some_column_table1, another_column_table2, ...
FROM my_table
INNER JOIN another_table
ON my_table.id = another_table.id;
```
*You may see queries where an `INNER JOIN` is simply written as `JOIN`. This is the default behaviour, however, it's better practice to be explicit.*
#### USING
The IDs from each table don't have to have the same name, this is where the explicit naming of the columns using the `ON` keyword works well. However, it's likely that the columns will have the same name as they are referring to the same thing. When this happens we can use the `USING` keyword and just include the common column name within brackets, like this:
```sql
SELECT some_column_table1, another_column_table2, ...
FROM my_table
INNER JOIN another_table
USING (id);
```
This is most common type of join so let's get practical with it.
You may have come across this example in the ['Introduction to R' course](https://public-health-scotland.github.io/knowledge-base/develop/introduction-to-r). It's small but this allows you to visually check the outcome. We have 2 tables in our database, one refers to newborn babies (`baby`), the other is information relating to the baby's family (`baby_family`). Explore the tables and then join based on (`FAMILYID`), it's the same key for each table.
```{sql inner-join, connection="conn", exercise=TRUE}
```
```{sql inner-join-solution}
SELECT *
FROM baby
INNER JOIN baby_family
USING (FAMILYID);
```
</br>
### OUTER (& CROSS) JOINS
Restricting the data to what's available in both tables might not always be suitable. If the tables are asymmetrical then you might need to use an outer join: left join, right join, or full join. We'll also look at cross joins at the end of this section.
#### LEFT JOIN
When joining table A to table B, a `LEFT JOIN` includes all rows from A regardless of if a matching row is found in B. As an example, you might want to do this when using a postcode to lookup an SIMD. If it's a "lookup" table, this is likely the type of join you're looking for.
```{r join-left, echo=FALSE, fig.align='center', out.width="100%"}
knitr::include_graphics("images/sql/join-left.png")
```
```sql
SELECT some_column_table1, another_column_table2, ...
FROM my_table
LEFT JOIN another_table
ON my_table.id = another_table.id;
```
</br>
#### RIGHT JOIN
The `RIGHT JOIN` is the same as `LEFT JOIN` but reversed. This means that all rows are kept from table B regardless of if a matching row is found in A.
```{r join-right, echo=FALSE, fig.align='center', out.width="100%"}
knitr::include_graphics("images/sql/join-right.png")
```
```sql
SELECT some_column_table1, another_column_table2, ...
FROM my_table
RIGHT JOIN another_table
ON my_table.id = another_table.id;
```
</br>
#### FULL JOIN
A `FULL JOIN` means that all rows from both tables are kept regardless of any match from either side so we don't lose any information.
```{r join-full, echo=FALSE, fig.align='center', out.width="100%"}
knitr::include_graphics("images/sql/join-full.png")
```
```sql
SELECT some_column_table1, another_column_table2, ...
FROM my_table
FULL JOIN another_table
ON my_table.id = another_table.id;
```
</br>
#### CROSS JOIN
A `CROSS JOIN` creates all possible combinations of the 2 given tables, also known as the Cartesian product.
```{r join-cross, echo=FALSE, fig.align='center', out.width="60%"}
knitr::include_graphics("images/sql/join-cross.png")
```
```sql
SELECT some_column_table1, another_column_table2, ...
FROM my_table
CROSS JOIN another_table;
```
</br>
### Knowledge Check
You may be familiar with the Borders dataset from previous courses, this is what we're going to use here but with SQL. The table is called `borders_data` which has a `HospitalCode` to show which hospital the record refers to. We also have another table, `hospital_lookup`, which includes information about all the hospitals. We want to create a table with the `LocationName` (Hospital name), `Specialty`, and `LengthOfStay`. Feel free to use alias' to make your code more readable too. Explore the tables and select a join type that will allow you to create this table.
```{sql join-kc, connection="conn", exercise=TRUE}
```
```{sql join-kc-solution}
SELECT hosp.LocationName, borders.Specialty, borders.LengthOfStay
FROM borders_data AS borders
LEFT JOIN hospital_lookup AS hosp
ON borders.HospitalCode = hosp.Location;
```
```{r joins-quiz}
quiz(
question("Left and right joins are also known as ___?",
answer("Inner join"),
answer("Natural join"),
answer("Cartesian join"),
answer("Outer join", correct = TRUE),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
),
question("What is wrong with this code? `\n` `SELECT c.name AS country, l.name AS language \nFROM countries AS c \n \t INNER JOIN languages AS l;`",
answer("The number of rows in `languages` and `countries` is different."),
answer("There is more than one language per country."),
answer("A `JOIN` query can't be ended with a semi-colon."),
answer("`INNER JOIN` requires the specification of the key field(s) in each table.", correct = TRUE),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
## Subqueries
<div class="supporting-image-right">
```{r subquery-image, echo=FALSE, fig.align='center', out.width="60%"}
knitr::include_graphics("images/sql/lightbulb.jpg")
```
</div>
Subqueries (also called nested queries, inner queries, and inner selects), are getting to the more advanced end of our journey with SQL. They can be used in a few places, including inside themselves, and are used to return a restricted set of data to the calling query. Here's a few rules to shape our understanding:
* Subqueries must be enclosed within brackets (parentheses)
* A subquery can only have one column in the `SELECT` clause
* Subqueries cannot use `ORDER BY`
* `BETWEEN` can be used within a subquery but will not work to compare between subqueries.
The syntax is pretty much going to be the same as you'd expect, it's a query within another query. However, make sure to consider readability and performance, and confirm that using a subquery is the most appropriate.
#### Inside SELECT
* Return a single aggregate value
* Create calculated columns with a single numeric value
```sql
-- You'll want to give the subquery an alias in this case, can you think why?
SELECT some_column, another_column, ...
(SELECT some_column
FROM my_table
WHERE condition) AS subquery_name
FROM my_table;
```
#### Inside FROM
* Help restructure and transform your data
* Provide additional processing before analysis (e.g. filtering or reshaping data)
* Calculate aggregates of summary information
* Create more than one query in one `FROM` statement
* Subqueries can be joined to each other or to other tables if that have a column in common
```sql
SELECT some_column, another_column, ...
FROM my_table
(SELECT some_column
FROM my_table
WHERE condition) AS subquery_name
WHERE condition(s);
```
#### Inside WHERE
* Filters results, based on data that would otherwise have to be created separately
* Can create a list for filtering
* Returns only a single column
```sql
SELECT some_column, another_column, ...
FROM my_table
WHERE condition(s)
(SELECT some_column
FROM my_table
WHERE condition);
```
This is most common type of subquery so let's get practical for this one.
Find the records where the length of stay (`LengthOfStay`) is more than the average for `HospitalCode` "B120H" in `borders_data`.
```{sql sub-where, connection="conn", exercise=TRUE}
```
```{sql sub-where-solution}
SELECT *
FROM borders_data
WHERE LengthOfStay >
(SELECT AVG(LengthOfStay)
FROM borders_data
WHERE HospitalCode = "B120H")
AND HospitalCode = "B120H";
```
## Table Management
This course has taken a pretty wide view at interacting with existing tables in databases. Using these as building blocks, the power to build all sorts of queries in SQL is in your hands. However, what about changing the tables in some way, either creating new ones, editing, or deleting? Well, this all falls under an acronym (and we love an acronym) to describe operations, [**CRUD**](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete) (create, read, update, and delete).
Databases that we use most likely have access restrictions to prevent issues with our live data. However, other use cases exist where we create a view for ourselves or we may even have the need to utilise a database for a bespoke project. This section explores the creation, updating, and deleting of tables within a database. Just to reiterate, there's a lot of supporting theory around database design and optimal database solutions, this course doesn't cover this in any detail but resources will be made available if this is something you're interested in.
*Most of the exercises in this section don't have output, you can still build a query to view the table after to check the results though. However, you can only run one query at a time.*
### CREATE TABLE
We know that a table in a database has a particular structure with constraints in place, this is an explicit step in the creation of the table. The structure is defined by its [table schema](https://www.sqlite.org/schematab.html). As part of this, each table has:
* A name (ideally in the singular with no spaces)
* Data type
* Constraints (*optional*)
* Default values (*optional*)
*If a table already exists with the same name, SQL will throw an error. You can suppress the error and stop the creation of the table using the `IF NOT EXISTS` clause but this isn't necessary so shown in brackets below. Additionally, the fields which are user-determined, e.g. `DataType` are wrapped in angle brackets (`<>`), these shouldn't be used in the actual query.*
```sql
CREATE TABLE (IF NOT EXISTS) new_table (
new_column <DataType> <TableConstraint> DEFAULT <default_value>...
);
```
#### Data Types
Different databases support different data types, just like programming languages. There are common types, like numeric, string, dates, etc. Here's some of the types you might use:
| Data Type | Description |
| :---------------: | --------------------------------------- |
| `INTEGER` `BOOLEAN` | A whole integer value, e.g. age. Sometimes a boolean value is also represented as an integer value of `0` or `1`. |
| `FLOAT` `DOUBLE` `REAL` | Floating point to represent fractional values. |
| `CHARACTER(num_chars)` `VARCHAR(num_chars)` `TEXT` | Strings and text with the difference related to efficiency of the database. The types with `num_chars` specify the maximum number of characters. |
| `DATE` `DATETIME` | Storing date and time stamps. |
#### Table Constraints
Staying at a high level, here are some of the constraints that can be placed on columns:
| Constraint | Description |
| :---------------: | --------------------------------------- |
| `PRIMARY KEY` | The values in this column are unique and can identify a single row in the table. |
| `AUTOINCREMENT` | For integer values, the value of this will be automatically filled with a automatically incrementing value with each row insertion (not supported in all databases). |
| `UNIQUE` | Similar to a primary key in that the values have to be unique but are not used as a key for that table. |
| `NOT NULL` | The value cannot be `NULL`. |
| `CHECK (expression)` | A more complex expression to test values, e.g. positive values, greater than values, starting with certain prefix, etc. |
| `FOREIGN KEY` | Consistency check that each value in this column corresponds to another value in another table. |
To test this out, let's create a new table called `smra_test`. This will return `NULL` as the query doesn't ask for anything to be returned. Until you add some data, there won't be anything to display either, we'll get to that next. Until then, let's look at our columns:
* `id` - this is just going to be an automatically increasing integer but will be used to uniquely identify the row.
* `first_name` - while it shouldn't be null, it also shouldn't be longer than 50 characters.
* `specialty` - this alphanumeric code will not be longer than 4 characters.
* `los` - (length of stay) is a positive integer.
```{sql create-table, connection="conn", exercise=TRUE}
```
```{sql create-table-solution}
CREATE TABLE IF NOT EXISTS smra_test (
id INTEGER PRIMARY KEY AUTOINCREMENT,
first_name CHARACTER(50) NOT NULL,
specialty CHARACTER(4),
los INTEGER
);
```
</br>
### INSERT INTO
When inserting data into a database, we use the `INSERT` statement followed by the name of the table to write to, the columns of data we are filling, and one or more rows of data to insert. If you are relying on default values and not listing every column you need to state the column names explicitly. It's also possible to utilise expressions for the values to insert.
```sql
INSERT INTO my_table
-- optional explicit column listing
(some_column, another_column, ...)
VALUES (some_value, another_value...)...;
```
In our created table, `smra_test`, let's add a couple of rows:
* `first_name`: Harry, `specialty`: A1, `los`: 5
* `first_name`: Luna, `specialty`: A11, `los`: 7
* `first_name`: Ron, `specialty`: A2, `los`: 9
```{sql prepare-insert-into, connection="conn"}
CREATE TABLE IF NOT EXISTS smra_test (
id INTEGER PRIMARY KEY AUTOINCREMENT,
first_name CHARACTER(50) NOT NULL,
specialty CHARACTER(4),
los INTEGER
);
```
```{sql insert-into, connection="conn", exercise=TRUE, exercise.setup="prepare-insert-into"}
```
```{sql insert-into-solution}
INSERT INTO smra_test
(first_name, specialty, LoS)
VALUES ("Harry", "A1", 5),
("Luna", "A11", 7),
("Ron", "A2", 9);
```
Now that we should have some data in our database, let's take a look at what the right SQL will have done:
```{sql insert-output, connection="conn", exercise=TRUE, exercise.setup="prepare-update-table"}
SELECT * FROM smra_test;
```
</br>
### UPDATE
Depending on your project, it may be a common task to update the data within a table. This is similar to the `INSERT` statement in that you have to be precise with the naming of your table, columns, and also which rows to use the `UPDATE` statement on. Any of the data you are trying to change will still have to match the table constraints from the schema. However, updating data is done with a `WHERE` clause and this is very important or you could end up updating the data across all rows. To avoid this, you could test the query with `SELECT` and ensure you see the right rows before writing the rest of the `UPDATE` statement.
```sql
UPDATE my_table
SET column = some_value,
another_column = another_value, ...
WHERE condition(s);
```
We need to update a couple of things on one of our rows:
* `first_name`: Ron, `specialty`: A2, `los`: 9 **should be** `first_name`: Ronald, `specialty`: A77, `los`: 9
```{sql prepare-update-table, connection="conn", exercise.setup="prepare-insert-into"}
INSERT INTO smra_test
(first_name, specialty, LoS)
VALUES ("Harry", "A1", 5),
("Luna", "A11", 7),
("Ron", "A2", 9);
```
```{sql update-table, connection="conn", exercise=TRUE, exercise.setup="prepare-update-table"}
```
```{sql update-table-solution}
UPDATE smra_test
SET first_name = "Ronald",
specialty = "A77"
WHERE id = 3;
```
Let's see how our data should look like now:
```{sql update-output, connection="conn", exercise=TRUE, exercise.setup="prepare-delete-from"}
SELECT * FROM smra_test;
```
</br>
### DELETE FROM
If you need to delete data from your table, then it's the `DELETE` statement. Similar to the `UPDATE` statement, you specify the table and use a `WHERE` clause to describe the relevant rows to delete. If you leave the `WHERE` clause, all rows will be removed... useful if you mean that. To avoid mistakes, you can test the query like above with a `SELECT` query.
```sql
DELETE FROM my_table
WHERE condition(s);
```
Looks like we have some erroneous data, every record where `specialty` is A77 should be deleted.
```{sql prepare-delete-from, connection="conn", exercise.setup="prepare-update-table"}
UPDATE smra_test
SET first_name = "Ronald",
specialty = "A77"
WHERE id = 3;
```
```{sql delete-from, connection="conn", exercise=TRUE, exercise.setup='prepare-delete-from'}
```