forked from smithjd/sql-pet
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path130-sql-dplyr-joins-data.Rmd
511 lines (390 loc) · 16.7 KB
/
130-sql-dplyr-joins-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
# SQL & dplyr joins additional data {#chapter_sql-dplyr-data}
> This chapter demonstrates how to:
>
> * Use primary and foreign keys to retrieve specific rows of a table
> * do different kinds of join queries
> * Exercises
> * Query the database to get basic information about each dvdrental story
> * How to interact with the database using different strategies
> ** This chapter still points to `dvdrentals` **
These packages are called in almost every chapter of the book:
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(DiagrammeR)
library(RPostgres)
library(glue)
library(here)
require(knitr)
library(dbplyr)
library(sqlpetr)
```
Verify Docker is up and running:
```{r Verify Docker is up}
sp_check_that_docker_is_up()
```
Verify pet DB is available, it may be stopped.
```{r Verify pet DB is available}
sp_show_all_docker_containers()
```
Start up the `docker-pet` container
```{r Start up the docker-pet container}
sp_docker_start("sql-pet")
```
Now connect to the database with R. Need to wait for Docker & Postgres to come up before connecting.
```{r connect to the database}
con <- sp_get_postgres_connection(
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "dvdrental",
seconds_to_test = 30, connection_tab = TRUE
)
```
## Making up data for Join Examples
Each chapter in the book stands on its own. If you have worked through the code blocks in this chapter in a previous session, you created some new customer records in order to work through material in the rest of the chapter.
The DVD rental database data is too clean to demonstrate some join concepts. To dirty the data, this chapter performs a number of database operations on data tables that a data analyst is typically restricted from doing in the real world.
1. Deleting records from tables.
2. Inserting records from tables.
3. Enabling and disabling table constraints.
In our Docker environment, you have no restrictions on the database operations you can perform.
In the next couple of code blocks, we delete the new data and then recreate the data for the join examples in this next chapter.
### SQL Delete Data Syntax
```
DELETE FROM <source> WHERE <where_clause>;
```
### Delete New Practice Store from the Store Table.
In the next code block we delete out the new stores that were added when the book was compliled or added working through the exercises. Out of the box, the DVD rental database's highest store_id = 2.
```{r Delete new stores}
dbExecute(con, "delete from store where store_id > 2;")
```
### Delete film 1001, Sophie's Choice, records in film_category, rental, inventory, and film
The records need to be deleted in a specific order to not violate constraints.
```{r Delete new film}
dbExecute(con, "delete from film_category where film_id >= 1001;")
dbExecute(con, "delete from rental where rental_id >= 16050;")
dbExecute(con, "delete from inventory where film_id >= 1001;")
dbExecute(con, "delete from film where film_id >= 1001;")
```
### Delete New Practice Customers from the Customer table.
In the next code block we delete out the new customers that were added when the book was compliled or added while working through the chapter. Out of the box, the DVD rental database's highest customer_id = 599.
### Delete New Practice Customers from the Customer table.
In the next code block we delete out the new customers that were added when the book was compliled or added while working through the chapter. Out of the box, the DVD rental database's highest customer_id = 599.
```
dbExecute() always returns a scalar numeric that specifies the number of rows affected by the statement.
```
```{r Delete new customers}
dbExecute(
con,
"delete from customer
where customer_id >= 600;
"
)
```
The number above tells us how many rows were actually deleted from the customer table.
### SQL Single Row Insert Data Syntax
```
INSERT INTO <target> <column_list> VALUES <values list>;
<target> : target table/view
<column list> : csv list of columns
<values list> : values assoicated with the column list.
```
The `column list` is the list of column names on the table and the corresponding list of values must have the correct data type. The following code block returns the `CUSTOMER` column names and data types.
```{r SQL Customer Columns}
customer_cols <- dbGetQuery(
con,
"select table_name, column_name, ordinal_position, data_type
from information_schema.columns
where table_catalog = 'dvdrental'
and table_name = 'customer'
;"
)
sp_print_df(customer_cols)
```
In the next code block, we insert Sophie as a new customer into the customer table via a SQL insert statement. The columns list clause has three id columns, customer_id, store_id, and address_id. The customer_id is a primary key column and the other two 'look like' foreign key columns.
For now, we are interested in getting some new customers into the customer table. We look at the relations between the customer and the store tables later in this chapter.
```{r SQL Single Row Insert}
dbExecute(
con,
"
insert into customer
(customer_id,store_id,first_name,last_name,email,address_id,activebool
,create_date,last_update,active)
values(600,3,'Sophie','Yang','sophie.yang@sakilacustomer.org',1,TRUE,now(),now()::date,1)
"
)
```
The number above should be 1 indicating that one record was inserted.
```{r display new customers}
new_customers <- dbGetQuery(con
,"select customer_id,store_id,first_name,last_name
from customer where customer_id >= 600;")
sp_print_df(new_customers)
```
### Primary Key Constraint Error Message
For the new customers, we are concerned with not violating the PK and FK constraints.
In the next SQL code block, we try and reinsert the newly created customer record inserted above. Instead of having the code block fail, it throws a duplicate key exception error message. If you `knit` the document, the exception error message is thrown to the `R Markdown` tab.
```{r Proess Duplicate Customer Key Error}
dbExecute(con, "
do $$
DECLARE v_customer_id INTEGER;
begin
v_customer_id = 600;
insert into customer
(customer_id,store_id,first_name,last_name,email,address_id,activebool
,create_date,last_update,active)
values(v_customer_id,3,'Sophie','Yang','sophie.yang@sakilacustomer.org',1,TRUE
,now(),now()::date,1);
exception
when unique_violation then
raise notice 'SQLERRM = %, customer_id = %', SQLERRM, v_customer_id;
when others then
raise 'SQLERRM = % SQLSTATE =%', SQLERRM, SQLSTATE;
end;
$$ language 'plpgsql';")
```
The number above shows how many rows were inserted. To ensure that the thrown error message is part of the book, the error message is shown below.
```
NOTICE: SQLERRM = duplicate key value violates unique constraint "customer_pkey", customer_id = 600
CONTEXT: PL/pgSQL function inline_code_block line 12 at RAISE
```
### R Exercise: Inserting a Single Row via a Dataframe
In the following code block replace Sophie Yang with your name where appropriate.
Note:
1. The last data frame parameter sets the stringsAsFactors is `FALSE`. Databases do not have a native `FACTOR` type.
2. The dataframe column names must match the table column names.
3. The dbWriteTable function needs `append` = true to actually insert the new row.
4. The dbWriteTable function has an option 'overwrite'. It is set to FALSE by default. If it is set to TRUE, the table is first truncated before the row is inserted.
5. No write occurs if both overwrite and append = FALSE.
```{r R Dataframe Insert}
df <- data.frame(
customer_id = 601
, store_id = 2
, first_name = "Sophie"
, last_name = "Yang"
, email = "sophie.yang@sakilacustomer.org"
, address_id = 1
, activebool = TRUE
, create_date = Sys.Date()
, last_update = Sys.time()
, active = 1
, stringsAsFactors = FALSE
)
dbWriteTable(con, "customer", value = df, append = TRUE, row.names = FALSE)
new_customers <- dbGetQuery(con
, "select customer_id,store_id,first_name,last_name
from customer where customer_id >= 600;")
sp_print_df(new_customers)
```
## SQL Multi-Row Insert Data Syntax
```
INSERT INTO <target> <column_list> VALUES <values list1>, ... <values listn>;
<target> : target table/view
<column list> : csv list of columns
(<values list>) : values assoicated with the column list.
```
Postgres and some other flavors of SQL allow multiple rows to be inserted at a time. The syntax is identical to the Single Row syntax, but includes multiple `(<values list>)` clauses separated by commas. Note that each value list is enclosed it a set of parenthesis. The following code block illustrates the SQL multi-row insert. Note that the customer_id column takes on sequential values to satisfy the PK constraint.
## SQL Multi-Row Insert Data Example
```{r SQL Multi-Row Insert Data}
#
dbExecute(
con,
"insert into customer
(customer_id,store_id,first_name,last_name,email,address_id,activebool
,create_date,last_update,active)
values(602,4,'John','Smith','john.smith@sakilacustomer.org',2,TRUE
,now()::date,now()::date,1)
,(603,5,'Ian','Frantz','ian.frantz@sakilacustomer.org',3,TRUE
,now()::date,now()::date,1)
,(604,6,'Ed','Borasky','ed.borasky@sakilacustomer.org',4,TRUE
,now()::date,now()::date,1)
;"
)
```
## DPLYR Multi-Row Insert Data Example
The Postgres R multi-row insert is similar to the single row insert. The single column values are converted to a vector of values.
### R Exercise: Inserting Multiple Rows via a Dataframe
Replace the two first_name, last_name, and email column values with your own made up values in the following code block. The output should be all of our new customers, customer_id = {600 - 606}.
```{r DPLYR Multi-Row Insert Data}
customer_id <- c(605, 606)
store_id <- c(3, 4)
first_name <- c("John", "Ian")
last_name <- c("Smith", "Frantz")
email <- c("john.smith@sakilacustomer.org", "ian.frantz@sakilacustomer.org")
address_id <- c(3, 4)
activebool <- c(TRUE, TRUE)
create_date <- c(Sys.Date(), Sys.Date())
last_update <- c(Sys.time(), Sys.time())
active <- c(1, 1)
df2 <- data.frame(customer_id, store_id, first_name, last_name, email,
address_id, activebool, create_date, last_update, active,
stringsAsFactors = FALSE
)
dbWriteTable(con, "customer",
value = df2, append = TRUE, row.names = FALSE
)
new_customers <- dbGetQuery(con
, "select customer_id,store_id,first_name,last_name
from customer where customer_id >= 600;")
sp_print_df(new_customers)
```
Confirm that the two new rows, customer_id = { 605, 606} are in the output.
The next two code block show all the rows in the store and staff tables. Notice that neither table has a staff_id or a manager_staff_id = 10. We will attempt to insert such a row in the upcoming code blocks.
```{r new store data}
stores <- dbGetQuery(con,"select * from store;")
sp_print_df(stores)
```
```{r staff table}
staff <- dbGetQuery(con
,"select staff_id, first_name, last_name, address_id, email, store_id
from staff;")
sp_print_df(staff)
```
### Creating a Messy Store Row
A new store row is needed to illustrate a right outer join in a future code block. However, one cannot insert/update a row into the `store` table with a manager_staff_id = 10 because of a foreign key constraint on the manager_staff_id column.
The manager_staff_id value must satisfy two conditions before the database will allow the new store row to be inserted into the table when the table constraints are enabled.:
1. The manager_staff_id must be unique when inserted into the store table.
2. The manager_staff_id must match a `staff` table staff_id value.
Next we show both error messages:
1. The next code block attempts to insert a new store, `store_id = 10`, with manager_staff_id = 1, but fails with a unique constraint error message. The manager_staff_id = 1 already exists in the store table.
```{r Proess Updating of Store manager_staff_id Duplicate Key Error}
dbExecute(con, "
do $$
DECLARE v_manager_staff_id INTEGER;
begin
v_manager_staff_id = 1;
insert into store (store_id,manager_staff_id,address_id,last_update)
values (10,v_manager_staff_id,10,now()::date);
exception
when foreign_key_violation then
raise notice 'SQLERRM = %, manager_staff_id = %', SQLERRM, v_manager_staff_id;
when others then
raise notice 'SQLERRM = % SQLSTATE =%', SQLERRM, SQLSTATE;
end;
$$ language 'plpgsql';")
```
```
Error in result_create(conn@ptr, statement) : Failed to prepare query: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
```
The number above should be 0 and indicates no row was inserted.
2. The next code block attempts to insert a new store, `store_id = 10`, with manager_staff_id = 10, but fails with a foreign key constraint error message because there does not exist a staff table row with staff_id = 10.
```{r Proess Updating of Store manager_staff_id Foreign Key Error}
dbExecute(con, "
do $$
DECLARE v_manager_staff_id INTEGER;
begin
v_manager_staff_id = 10;
insert into store (store_id,manager_staff_id,address_id,last_update)
values (10,v_manager_staff_id,10,now()::date);
exception
when foreign_key_violation then
raise notice 'SQLERRM = %, manager_staff_id = %', SQLERRM, v_manager_staff_id;
when others then
raise notice 'SQLERRM = % SQLSTATE =%', SQLERRM, SQLSTATE;
end;
$$ language 'plpgsql';")
```
```
NOTICE: SQLERRM = insert or update on table "store" violates foreign key constraint "store_manager_staff_id_fkey", manager_staff_id = 10
CONTEXT: PL/pgSQL function inline_code_block line 9 at RAISE
```
Again, the number above should be 0 and indicates no row was inserted.
The following three code blocks
1. disables all the database constraints on the `store` table
2. Inserts the store row with store_id = 10 via a dataframe.
3. Re-enabes the database constraints on the store table
```{r Disable store trigger}
#
dbExecute(con, "ALTER TABLE store DISABLE TRIGGER ALL;")
```
```{r "insert store id = 10 row"}
df <- data.frame(
store_id = 10
, manager_staff_id = 10
, address_id = 10
, last_update = Sys.time()
)
dbWriteTable(con, "store", value = df, append = TRUE, row.names = FALSE)
```
```{r enable store trigger}
dbExecute(con, "ALTER TABLE store ENABLE TRIGGER ALL;")
```
The zeros after the dbExecute code blocks indicate that the dbExecute calls did not alter any rows on the table.
In the next code block we confirm our new row, store_id = 10, was actually inserted.
```{r }
stores <- dbGetQuery(con,"select * from store;")
sp_print_df(stores)
```
## Create a film record
```{r}
dbExecute(
con,
"insert into film
(film_id,title,description,release_year,language_id
,rental_duration,rental_rate,length,replacement_cost,rating
,last_update,special_features,fulltext)
values(1001,'Sophie''s Choice','orphaned language_id=10',2018,1
,7,4.99,120,14.99,'PG'
,now()::date,'{Trailers}','')
,(1002,'Sophie''s Choice','orphaned language_id=10',2018,1
,7,4.99,120,14.99,'PG'
,now()::date,'{Trailers}','')
;
")
```
```{r}
dbExecute(
con,
"insert into film_category
(film_id,category_id,last_update)
values(1001,6,now()::date)
,(1001,7,now()::date)
,(1002,6,now()::date)
,(1002,7,now()::date)
;")
```
```{r}
dbExecute(
con,
"insert into inventory
(inventory_id,film_id,store_id,last_update)
values(4582,1001,1,now()::date)
,(4583,1001,2,now()::date)
;")
```
```{r}
dbExecute(
con,
"insert into rental
(rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update)
values(16050,now()::date - interval '1 week',4582,600,now()::date,1,now()::date)
;")
```
<!--
## Creating Duplicate Customer Rows
In the next section we create a new table, `smy_customer`. We will load all customers with customer_id > 594 twice. The `smy_customer` table will be used in the dplyr semi-join section.
```{r}
dbExecute(con,"drop table if exists smy_customer;")
dbExecute(con,"create table smy_customer
as select *
from customer
where customer_id > 594;")
dbExecute(con,"insert into smy_customer
select *
from customer
where customer_id > 594;")
smy_cust_dupes <- dbGetQuery(con,'select *
from smy_customer
order by customer_id')
sp_print_df(smy_cust_dupes)
```
-->
## Disconnect from the database and stop Docker
```{r}
dbDisconnect(con)
# or if using the connections package, use:
# connection_close(con)
sp_docker_stop("sql-pet")
```
```{r}
knitr::knit_exit()
```