# Drilling into Your DB Environment (21-29)
These packages are used in this chapter:
```{r setup, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(glue)
library(here)
library(knitr)
library(dbplyr)
library(sqlpetr)
display_rows <- 5
```
Start up the `sql-pet` container:
```{r}
sp_docker_start("sql-pet")
```
Now connect to the `dvdrental` database with R
```{r}
con <- sp_get_postgres_connection(
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "dvdrental",
seconds_to_test = 30)
con
```
### Which database?
Your DBA will create your user account and grant privileges for the database(s) you can access.
One of the challenges when working with databases is finding where your data actually resides. Your best resources are one or more subject matter experts (SMEs) and your DBA. Your data may reside in multiple databases, e.g., separate detail and summary databases. In this tutorial we focus on a single database, `dvdrental`. Database names usually reflect something about the data they contain.
Your laptop acts as the server for the Docker PostgreSQL database. A database is a collection of files that PostgreSQL manages in the background.
### How many databases reside in the Docker Container?
```{r}
rs <-
DBI::dbGetQuery(
con,
"SELECT 'DB Names in Docker' showing
,datname DB
FROM pg_database
WHERE datistemplate = false;
"
)
kable(rs)
```
Which databases are available?
```
Modify the connection call to connect to the `postgres` database.
```
```{r}
con <- sp_get_postgres_connection(
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "your code goes here",
seconds_to_test = 30)
con
if (con != 'There is no connection')
dbDisconnect(con)
#Answer: con <PqConnection> postgres@localhost:5439
# Reconnect to dvdrental
con <- sp_get_postgres_connection(
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "dvdrental",
seconds_to_test = 30)
con
```
Note that the two `Sys.getenv()` calls work in this tutorial because the same user and password are valid for both databases. This is a common practice in organizations that have implemented single sign-on.
```
Gotcha:
If you have data in multiple databases or multiple environments (Development, Integration, and Production), it is very easy to connect to the wrong database in the wrong environment. Always double-check your connection information when logging in and before performing any inserts, updates, or deletes against the database.
```
The following code block helps guard against the gotcha above. `current_database()`, `CURRENT_DATE`/`CURRENT_TIMESTAMP`, and the 'result set description' column are the most useful; the remaining columns less so. Showing the actual host name instead of the host IP address would be a nice addition.
```{r}
rs1 <-
DBI::dbGetQuery(
con,
"SELECT current_database() DB
,CURRENT_DATE
,CURRENT_TIMESTAMP
,'result set description' showing
,session_user
,inet_server_addr() host
,inet_server_port() port
"
)
kable(rs1)
```
Since we will only be working in the `dvdrental` database in this tutorial, we will show just the 'result set description' column to reduce the number of output columns.
### Which Schema?
In the code block below, we look at `information_schema.tables`, which contains information about all the schemas and tables/views within our `dvdrental` database. Databases can have one or more schemas, containers that hold tables or views. Schemas partition the database into large logical blocks of related data. Schema names usually reflect an application or a logically related set of data. Occasionally a DBA will set up a new schema named after a user.
What schemas are in the `dvdrental` database?
How many entries are in each schema?
```{r}
## Database Schemas
#
rs1 <-
DBI::dbGetQuery(
con,
"SELECT 'DB Schemas' showing,t.table_catalog DB,t.table_schema,COUNT(*) tbl_vws
FROM information_schema.tables t
GROUP BY t.table_catalog,t.table_schema
"
)
kable(rs1)
```
We see that there are three schemas. The `pg_catalog` schema holds PostgreSQL's core metadata; PostgreSQL uses it to manage the internal workings of the database, and DBAs are its primary users. We used `pg_catalog` earlier to answer the question 'How many databases reside in the Docker container?', but normally a data analyst is not interested in analyzing database internals.
The `information_schema` contains ANSI-standardized views implemented across the different SQL vendors (Oracle, Sybase, MS SQL Server, IBM DB2, etc.). It contains a plethora of metadata that will help you locate your data tables, understand the relationships between the tables, and write efficient SQL queries.
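For example, `information_schema.columns` describes every column's name, data type, and nullability. Here is a minimal sketch that looks up the columns of the `film` table (a table we know exists in the standard `dvdrental` sample database), using the same `con` connection:

```{r}
# Column metadata for one table, via the ANSI-standard information_schema.
# ordinal_position preserves the order the columns were defined in.
film_cols <- DBI::dbGetQuery(con, "
  SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
   WHERE table_schema = 'public'
     AND table_name   = 'film'
   ORDER BY ordinal_position
")
kable(head(film_cols, display_rows))
```

The same query, with a different `table_name`, works on any table you can see — a handy first step when exploring an unfamiliar schema.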
### Exercises
```{r order-by}
#
# Add an order by clause to order the output by the table catalog.
rs1 <- DBI::dbGetQuery(con,"SELECT '1. ORDER BY table_catalog' showing
,t.table_catalog DB,t.table_schema,COUNT(*) tbl_vws
FROM information_schema.tables t
GROUP BY t.table_catalog,t.table_schema
"
)
kable(rs1)
# Add an order by clause to order the output by tbl_vws in descending order.
rs2 <- DBI::dbGetQuery(con,"SELECT '2. ORDER BY tbl_vws desc' showing
,t.table_catalog DB,t.table_schema,COUNT(*) tbl_vws
FROM information_schema.tables t
GROUP BY t.table_catalog,t.table_schema
"
)
kable(rs2)
```
```{r}
# Complete the SQL statement to show everything about all the tables.
rs3 <- DBI::dbGetQuery(con,"SELECT '3. all information_schema tables' showing
,'your code goes here'
FROM information_schema.tables t
"
)
kable(head(rs3, display_rows))
```
```{r}
# Use the results from above to pull interesting columns from just the information_schema
rs4 <- DBI::dbGetQuery(con,"SELECT '4. information_schema.tables' showing
,'your code goes here'
FROM information_schema.tables t
where 'your code goes here' = 'your code goes here'
"
)
head(rs4,display_rows)
# Modify the SQL below with your interesting column names.
# Update the where clause to return only rows from the information schema and begin with 'tab'
rs5 <- DBI::dbGetQuery(con,"SELECT '5. information_schema.tables' showing
,'your code goes here'
FROM information_schema.tables t
where 'your code goes here' = 'your code goes here'
"
)
kable(head(rs5,display_rows))
# Modify the SQL below with your interesting column names.
# Update the where clause to return only rows from the information schema and begin with 'col'
rs6 <- DBI::dbGetQuery(con,"SELECT '6. information_schema.tables' showing
,'your code goes here'
FROM information_schema.tables t
where 'your code goes here' = 'your code goes here'
"
)
kable(head(rs6,display_rows))
```
In the next exercise we combine the table and column output from the previous exercises. Review the following code block: the last two lines of the WHERE clause are switched between the two queries. Will the result sets be the same or different? Execute the code block and compare the two result sets.
```{r}
rs7 <- DBI::dbGetQuery(con,"SELECT '7. information_schema.tables' showing
,table_catalog||'.'||table_schema db_info, table_name, table_type
FROM information_schema.tables t
where table_schema = 'information_schema'
and table_name like 'table%' OR table_name like '%col%'
and table_type = 'VIEW'
"
)
kable(head(rs7,display_rows))
rs8 <- DBI::dbGetQuery(con,"SELECT '8. information_schema.tables' showing
,table_catalog||'.'||table_schema db_info, table_name, table_type
FROM information_schema.tables t
where table_schema = 'information_schema'
and table_type = 'VIEW'
and table_name like 'table%' OR table_name like '%col%'
"
)
kable(head(rs8,display_rows))
```
PostgreSQL evaluates `AND` before `OR`, which is why queries 7 and 8 above can return different results. The full operator precedence, from highest to lowest:

Operator/Element | Associativity | Description
-----------------|----------------|-------------
. |left| table/column name separator
::| left| PostgreSQL-style typecast
[ ]| left| array element selection
-| right| unary minus
^| left| exponentiation
* / %| left| multiplication, division, modulo
+ -| left| addition, subtraction
IS|| IS TRUE, IS FALSE, IS UNKNOWN, IS NULL
ISNULL|| test for null
NOTNULL|| test for not null
(any other)| left| all other native and user-defined operators
IN|| set membership
BETWEEN|| range containment
OVERLAPS|| time interval overlap
LIKE ILIKE SIMILAR|| string pattern matching
< >|| less than, greater than
=| right| equality, assignment
NOT| right| logical negation
AND| left| logical conjunction
OR| left| logical disjunction
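Given the precedence above, parentheses make the intent of queries 7 and 8 unambiguous. A minimal sketch (reusing the `con` connection and `display_rows` from the setup chunk):

```{r}
# Parentheses group the two LIKE patterns before ANDing them with the other
# predicates, so only information_schema views whose names match either
# pattern are returned -- regardless of the order the predicates are written in.
rs9 <- DBI::dbGetQuery(con, "
  SELECT table_catalog||'.'||table_schema db_info, table_name, table_type
    FROM information_schema.tables t
   WHERE table_schema = 'information_schema'
     AND table_type = 'VIEW'
     AND (table_name LIKE 'table%' OR table_name LIKE '%col%')
")
kable(head(rs9, display_rows))
```

When in doubt, parenthesize: an explicit grouping costs nothing and saves you from precedence surprises.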
```{r}
rs1 <- DBI::dbGetQuery(con,"SELECT t.table_catalog DB ,t.table_schema
,t.table_name,t.table_type
FROM information_schema.tables t")
rs2 <- DBI::dbGetQuery(con,"SELECT t.table_catalog DB ,t.table_schema
,t.table_type,COUNT(*) tbls
FROM information_schema.tables t
group by t.table_catalog ,t.table_schema
,t.table_type
")
rs3 <- DBI::dbGetQuery(con,"SELECT distinct t.table_catalog DB ,t.table_schema
,t.table_type tbls
FROM information_schema.tables t
")
#kable(head(rs1 %>% arrange (table_name)))
kable(head(rs1))
kable(head(rs2))
kable(head(rs3))
```
For more on PostgreSQL internals, see www.dataquest.io/blog/postgres-internals.
```{r}
## Gather the PK/FK metadata needed to explain a `dplyr::join`
tbl_pk_fk_df <- DBI::dbGetQuery(con,
"
SELECT --t.table_catalog,t.table_schema,
c.table_name
,kcu.column_name
,c.constraint_name
,c.constraint_type
,coalesce(c2.table_name, '') ref_table
,coalesce(kcu2.column_name, '') ref_table_col
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
ON t.table_catalog = c.table_catalog
AND t.table_schema = c.table_schema
AND t.table_name = c.table_name
LEFT JOIN information_schema.key_column_usage kcu
ON c.constraint_schema = kcu.constraint_schema
AND c.constraint_name = kcu.constraint_name
LEFT JOIN information_schema.referential_constraints rc
ON c.constraint_schema = rc.constraint_schema
AND c.constraint_name = rc.constraint_name
LEFT JOIN information_schema.table_constraints c2
ON rc.unique_constraint_schema = c2.constraint_schema
AND rc.unique_constraint_name = c2.constraint_name
LEFT JOIN information_schema.key_column_usage kcu2
ON c2.constraint_schema = kcu2.constraint_schema
AND c2.constraint_name = kcu2.constraint_name
AND kcu.ordinal_position = kcu2.ordinal_position
WHERE c.constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY')
AND c.table_catalog = 'dvdrental'
AND c.table_schema = 'public'
ORDER BY c.table_name;
")
# View(tbl_pk_fk_df)
tables_df <- tbl_pk_fk_df %>% distinct(table_name)
# View(tables_df)
```
```{r create nodes}
library(DiagrammeR)
table_nodes_ndf <- create_node_df(
  n = nrow(tables_df),
  type = 'table',
  label = tables_df$table_name,
  shape = "rectangle",
  width = 1,
  height = .5,
  fontsize = 18
)
tbl_pk_fk_ids_df <- inner_join(tbl_pk_fk_df, table_nodes_ndf,
    by = c('table_name' = 'label'),
    suffix = c('.st', '.s')
  ) %>%
  rename('src_tbl_id' = id) %>%
  left_join(table_nodes_ndf,
    by = c('ref_table' = 'label'),
    suffix = c('.st', '.t')
  ) %>%
  rename('fk_tbl_id' = id)
tbl_fk_df <- tbl_pk_fk_ids_df %>% filter(constraint_type == 'FOREIGN KEY')
tbl_pk_df <- tbl_pk_fk_ids_df %>% filter(constraint_type == 'PRIMARY KEY')
# View(tbl_pk_fk_ids_df)
kable(head(tbl_fk_df))
kable(head(tbl_pk_df))
```
```{r}
# Create an edge data frame, edf
fk_edf <-
create_edge_df(
from = tbl_fk_df$src_tbl_id,
to = tbl_fk_df$fk_tbl_id,
rel = "fk",
label=tbl_fk_df$constraint_name,
fontsize = 15
)
# View(fk_edf)
```
```{r}
fkgraph_widget <-
create_graph(
nodes_df = table_nodes_ndf,
edges_df = fk_edf,
graph_name = 'Simple FK Graph'
) %>% render_graph()
# export to image files
fkgraph_file <- sqlpetr::sp_make_image_files(
fkgraph_widget,
"diagrams",
"fkgraph"
)
```
```{r echo=FALSE}
# display the PNG/PDF - this works in both HTML and PDF
knitr::include_graphics(fkgraph_file, auto_pdf = TRUE)
```
```{r}
dbDisconnect(con)
# system2('docker','stop sql-pet')
```