forked from smithjd/sql-pet
-
Notifications
You must be signed in to change notification settings - Fork 0
/
180-r-query-postgres-metadata.Rmd
355 lines (291 loc) · 11.9 KB
/
180-r-query-postgres-metadata.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
# Getting metadata about and from PostgreSQL {#chapter_postgresql-metadata}
> This chapter demonstrates:
>
> * What kind of data about the database is contained in a dbms
> * Several methods for obtaining metadata from the dbms
The following packages are used in this chapter:
```{r package list, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(glue)
library(here)
require(knitr)
library(dbplyr)
library(sqlpetr)
```
Assume that the Docker container with PostgreSQL and the dvdrental database are ready to go.
```{r}
sp_docker_start("adventureworks")
```
Connect to the database:
```{r}
con <- sqlpetr::sp_get_postgres_connection(
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "adventureworks",
port = 5432,
seconds_to_test = 20,
connection_tab = TRUE
)
```
## Views trick parked here for the time being
### Explore the vsalelsperson and vsalespersonsalesbyfiscalyearsdata views
The following trick goes later in the book, where it's used to prove the finding that to make sense of othe data you need to
```{r}
cat(unlist(dbGetQuery(con, "select pg_get_viewdef('sales.vsalesperson', true)")))
```
```{r,echo=FALSE}
dbGetQuery(con, "select pg_get_viewdef('sales.vsalespersonsalesbyfiscalyearsdata', true)")
# cat(unlist(dbGetQuery(con,"select pg_get_viewdef('sa.soh', true)")))
```
## Database contents and structure
After just looking at the data you seek, it might be worthwhile stepping back and looking at the big picture.
### Database structure
For large or complex databases you need to use both the available documentation for your database (e.g., [the dvdrental](http://www.postgresqltutorial.com/postgresql-sample-database/) database) and the other empirical tools that are available. For example it's worth learning to interpret the symbols in an [Entity Relationship Diagram](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model):
![](./screenshots/ER-diagram-symbols.png)
The `information_schema` is a trove of information *about* the database. Its format is more or less consistent across the different SQL implementations that are available. Here we explore some of what's available using several different methods. PostgreSQL stores [a lot of metadata](https://www.postgresql.org/docs/current/static/infoschema-columns.html).
### Contents of the `information_schema`
For this chapter R needs the `dbplyr` package to access alternate schemas. A [schema](http://www.postgresqltutorial.com/postgresql-server-and-database-objects/) is an object that contains one or more tables. Most often there will be a default schema, but to access the metadata, you need to explicitly specify which schema contains the data you want.
### What tables are in the database?
The simplest way to get a list of tables is with ... *NO LONGER WORKS*:
```{r}
schema_list <- tbl(con, in_schema("information_schema", "schemata")) %>%
select(catalog_name, schema_name, schema_owner) %>%
collect()
sp_print_df(head(schema_list))
```
### Digging into the `information_schema`
We usually need more detail than just a list of tables. Most SQL databases have an `information_schema` that has a standard structure to describe and control the database.
The `information_schema` is in a different schema from the default, so to connect to the `tables` table in the `information_schema` we connect to the database in a different way:
```{r}
table_info_schema_table <- tbl(con, dbplyr::in_schema("information_schema", "tables"))
```
The `information_schema` is large and complex and contains `r table_info_schema_table %>% collect %>% dim %>% pluck(1)` tables. So it's easy to get lost in it.
This query retrieves a list of the tables in the database that includes additional detail, not just the name of the table.
```{r}
table_info <- table_info_schema_table %>%
# filter(table_schema == "public") %>%
select(table_catalog, table_schema, table_name, table_type) %>%
arrange(table_type, table_name) %>%
collect()
sp_print_df(head(table_info))
```
In this context `table_catalog` is synonymous with `database`.
Notice that *VIEWS* are composites made up of one or more *BASE TABLES*.
The SQL world has its own terminology. For example `rs` is shorthand for `result set`. That's equivalent to using `df` for a `data frame`. The following SQL query returns the same information as the previous dplyr code.
```{r}
rs <- dbGetQuery(
con,
"select table_catalog, table_schema, table_name, table_type
from information_schema.tables
where table_schema not in ('pg_catalog','information_schema')
order by table_type, table_name
;"
)
sp_print_df(head(rs))
```
## What columns do those tables contain?
Of course, the `DBI` package has a `dbListFields` function that provides the simplest way to get the minimum, a list of column names:
```{r}
# DBI::dbListFields(con, "rental")
```
But the `information_schema` has a lot more useful information that we can use.
```{r}
columns_info_schema_table <- tbl(con, dbplyr::in_schema("information_schema", "columns"))
```
Since the `information_schema` contains `r columns_info_schema_table %>% collect %>% dim %>% pluck(1)` columns, we are narrowing our focus to just one table. This query retrieves more information about the `rental` table:
```{r}
columns_info_schema_info <- columns_info_schema_table %>%
# filter(table_schema == "public") %>%
select(
table_catalog, table_schema, table_name, column_name, data_type, ordinal_position,
character_maximum_length, column_default, numeric_precision, numeric_precision_radix
) %>%
collect(n = Inf) %>%
mutate(data_type = case_when(
data_type == "character varying" ~ paste0(data_type, " (", character_maximum_length, ")"),
data_type == "real" ~ paste0(data_type, " (", numeric_precision, ",", numeric_precision_radix, ")"),
TRUE ~ data_type
)) %>%
# filter(table_name == "rental") %>%
select(-table_schema, -numeric_precision, -numeric_precision_radix)
glimpse(columns_info_schema_info)
sp_print_df(head(columns_info_schema_info))
```
### What is the difference between a `VIEW` and a `BASE TABLE`?
The `BASE TABLE` has the underlying data in the database
```{r}
table_info_schema_table %>%
filter( table_type == "BASE TABLE") %>%
# filter(table_schema == "public" & table_type == "BASE TABLE") %>%
select(table_name, table_type) %>%
left_join(columns_info_schema_table, by = c("table_name" = "table_name")) %>%
select(
table_type, table_name, column_name, data_type, ordinal_position,
column_default
) %>%
collect(n = Inf) %>%
filter(str_detect(table_name, "cust")) %>%
head() %>%
sp_print_df()
```
Probably should explore how the `VIEW` is made up of data from BASE TABLEs.
```{r}
table_info_schema_table %>%
filter( table_type == "VIEW") %>%
# filter(table_schema == "public" & table_type == "VIEW") %>%
select(table_name, table_type) %>%
left_join(columns_info_schema_table, by = c("table_name" = "table_name")) %>%
select(
table_type, table_name, column_name, data_type, ordinal_position,
column_default
) %>%
collect(n = Inf) %>%
filter(str_detect(table_name, "cust")) %>%
head() %>%
sp_print_df()
```
### What data types are found in the database?
```{r}
columns_info_schema_info %>%
count(data_type) %>%
head() %>%
sp_print_df()
```
## Characterizing how things are named
Names are the handle for accessing the data. Tables and columns may or may not be named consistently or in a way that makes sense to you. You should look at these names *as data*.
### Counting columns and name reuse
Pull out some rough-and-ready but useful statistics about your database. Since we are in SQL-land we talk about variables as `columns`.
*this is wrong!*
```{r}
public_tables <- columns_info_schema_table %>%
# filter(str_detect(table_name, "pg_") == FALSE) %>%
# filter(table_schema == "public") %>%
collect()
public_tables %>%
count(table_name, sort = TRUE) %>%
head(n = 15) %>%
sp_print_df()
```
How many *column names* are shared across tables (or duplicated)?
```{r}
public_tables %>% count(column_name, sort = TRUE) %>%
filter(n > 1) %>%
head()
```
How many column names are unique?
```{r}
public_tables %>%
count(column_name) %>%
filter(n == 1) %>%
count() %>%
head()
```
## Database keys
### Direct SQL
How do we use this output? Could it be generated by dplyr?
```{r}
rs <- dbGetQuery(
con,
"
--SELECT conrelid::regclass as table_from
select table_catalog||'.'||table_schema||'.'||table_name table_name
, conname, pg_catalog.pg_get_constraintdef(r.oid, true) as condef
FROM information_schema.columns c,pg_catalog.pg_constraint r
WHERE 1 = 1 --r.conrelid = '16485'
AND r.contype in ('f','p') ORDER BY 1
;"
)
glimpse(rs)
sp_print_df(head(rs))
```
The following is more compact and looks more useful. What is the difference between the two?
```{r}
rs <- dbGetQuery(
con,
"select conrelid::regclass as table_from
,c.conname
,pg_get_constraintdef(c.oid)
from pg_constraint c
join pg_namespace n on n.oid = c.connamespace
where c.contype in ('f','p')
and n.nspname = 'public'
order by conrelid::regclass::text, contype DESC;
"
)
glimpse(rs)
sp_print_df(head(rs))
dim(rs)[1]
```
### Database keys with dplyr
This query shows the primary and foreign keys in the database.
```{r}
tables <- tbl(con, dbplyr::in_schema("information_schema", "tables"))
table_constraints <- tbl(con, dbplyr::in_schema("information_schema", "table_constraints"))
key_column_usage <- tbl(con, dbplyr::in_schema("information_schema", "key_column_usage"))
referential_constraints <- tbl(con, dbplyr::in_schema("information_schema", "referential_constraints"))
constraint_column_usage <- tbl(con, dbplyr::in_schema("information_schema", "constraint_column_usage"))
keys <- tables %>%
left_join(table_constraints, by = c(
"table_catalog" = "table_catalog",
"table_schema" = "table_schema",
"table_name" = "table_name"
)) %>%
# table_constraints %>%
filter(constraint_type %in% c("FOREIGN KEY", "PRIMARY KEY")) %>%
left_join(key_column_usage,
by = c(
"table_catalog" = "table_catalog",
"constraint_catalog" = "constraint_catalog",
"constraint_schema" = "constraint_schema",
"table_name" = "table_name",
"table_schema" = "table_schema",
"constraint_name" = "constraint_name"
)
) %>%
# left_join(constraint_column_usage) %>% # does this table add anything useful?
select(table_name, table_type, constraint_name, constraint_type, column_name, ordinal_position) %>%
arrange(table_name) %>%
collect()
glimpse(keys)
sp_print_df(head(keys))
```
What do we learn from the following query? How is it useful?
```{r}
rs <- dbGetQuery(
con,
"SELECT r.*,
pg_catalog.pg_get_constraintdef(r.oid, true) as condef
FROM pg_catalog.pg_constraint r
WHERE 1=1 --r.conrelid = '16485' AND r.contype = 'f' ORDER BY 1;
"
)
head(rs)
```
## Creating your own data dictionary
If you are going to work with a database for an extended period it can be useful to create your own data dictionary. This can take the form of [keeping detaild notes](https://caitlinhudon.com/2018/10/30/data-dictionaries/) as well as extracting metadata from the dbms. Here is an illustration of the idea.
*This probably doens't work anymore*
```{r}
# some_tables <- c("rental", "city", "store")
#
# all_meta <- map_df(some_tables, sp_get_dbms_data_dictionary, con = con)
#
# all_meta
#
# glimpse(all_meta)
#
# sp_print_df(head(all_meta))
```
## Save your work!
The work you do to understand the structure and contents of a database can be useful for others (including future-you). So at the end of a session, you might look at all the data frames you want to save. Consider saving them in a form where you can add notes at the appropriate level (as in a Google Doc representing table or columns that you annotate over time).
```{r}
ls()
```
```
## Cleaning up
Always have R disconnect from the database when you're done and stop the Adventureworks Container
```{r}
dbDisconnect(con)
sp_docker_stop("adventureworks")
```