---
output:
  html_document:
    code_folding: "hide"
---

# SQL Joins Exercises Answered {#chapter_sql-joins-exercises-answered}

> This chapter contains questions one may be curious about, or be asked about, regarding the DVD Rental business.  
> The goal of the exercises is to extract useful or questionable insights from one or more tables.  Each exercise has some or all of the following parts.

>  1.  The question.
>  2.  The tables used to answer the question.
>  3.  A hidden SQL code block showing the desired output. Click the code button to see the SQL code.   
>  4.  A table of derived values or renamed columns shown in the SQL block to facilitate replicating the desired dplyr solution.  Abbreviated column names are used to squeeze more columns into the answer and reduce scrolling across the screen.
>  5.  A replication section where you recreate the desired output using dplyr syntax.  Most columns come directly out of the tables.  Each replication code block has three commented function calls 
>    *  sp_tbl_descr('store')        --describes a table, store
>    *  sp_tbl_pk_fk('table_name')   --shows a table's primary and foreign keys
>    *  sp_print_df(table_rows_sql)  --shows table row counts.
>    
> 6. To keep the exercises concentrated on the joins, all derived dates drop their timestamp.
>    *  SQL syntax:   date_column::DATE
>    *  Dplyr syntax: as.Date(date_column)
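
The effect of dropping the timestamp can be tried without a database. The following is a minimal sketch in plain R; `rental_ts` is a made-up value and a hypothetical name, not an object from this chapter. It shows that `as.Date()` keeps only the calendar date.

```r
# A made-up timestamp standing in for a column such as rental.rental_date
rental_ts <- as.POSIXct("2005-05-25 11:30:37", tz = "UTC")

# as.Date() drops the time-of-day portion; tz is given explicitly so the
# result does not depend on the local time zone
rental_day <- as.Date(rental_ts, tz = "UTC")

print(rental_day)
```

On a lazy `dplyr::tbl`, dbplyr is expected to translate `as.Date()` into a SQL cast, so the conversion should happen in the database rather than in R.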

##  Exercise Instructions

1.  Manually execute all the code blocks up to the "SQL Union Exercise."  
2.  Most of the exercises can be performed in any order.  
    *  Some exercises create a function and are followed by another code block that calls the function created in the previous exercise, so those must be run in order.
3.  Use the Show Document Outline, Ctrl-Shift-O, to navigate to the different exercises.

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
# These packages are called in almost every chapter of the book:
library(tidyverse)
library(DBI)
library(RPostgres)
library(glue)
library(here)
library(knitr)
library(dbplyr)
library(sqlpetr)
```

```{r codeblock options,echo=FALSE}
ECHO_CODE_BLOCK = TRUE
HEAD_N = 10  #set to 0 for all rows or head(x,n=HEAD_N)
INCLUDE_OUTPUT  = TRUE
```

Verify Docker is up and running:

```{r}
sp_check_that_docker_is_up()
```

Verify that the pet DB container is available; it may be stopped.

```{r}
sp_show_all_docker_containers()
```

Start up the `sql-pet` container:

```{r}
sp_docker_start("sql-pet")
```

Now connect to the database with R:

```{r}

# need to wait for Docker & Postgres to come up before connecting.

con <- sp_get_postgres_connection(
  user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
  password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
  dbname = "dvdrental",
  seconds_to_test = 30, connection_tab = TRUE
)
```

## Dplyr tables

All the tables defined in the DVD Rental System will fit into memory, which is rarely the case when working with a database.  Instead of loading all the DVD Rental System tables into memory via `DBI::dbReadTable`, each table is referenced through an R object named `TableName_table`, created with a `dplyr::tbl` call.

*  actor_table <- dplyr::tbl(con,"actor")

```{r Declare Dplyr Tables}
source(here('book-src','dvdrental-table-declarations.R'), echo = FALSE)
```

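The key difference between `DBI::dbReadTable` and a `dplyr::tbl` reference is that the first is `not lazy` (the whole table is pulled into R immediately) and the second is `lazy` (no rows are retrieved until they are needed, for example by `collect()`).  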
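
Laziness can be observed without a running database. The sketch below assumes only that dplyr and dbplyr are installed; it uses dbplyr's simulated Postgres connection and a made-up three-row table (`actor_lazy` and `first_names_query` are hypothetical names). Building the pipeline only records the steps, and `show_query()` prints the SQL that would be sent.

```r
library(dplyr)
library(dbplyr)

# A simulated Postgres-backed table: no real connection is needed
actor_lazy <- lazy_frame(
  actor_id = 1:3,
  first_name = c("A", "B", "C"),
  con = simulate_postgres()
)

# Nothing is executed here; the pipeline is stored as a query description
first_names_query <- actor_lazy %>%
  filter(actor_id <= 2) %>%
  select(first_name)

# Print the SQL that would be generated for the pipeline
show_query(first_names_query)
```

With a real connection, rows would only travel from the database to R once `collect()` (or a printing function) is called on the pipeline.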

The following code block deletes and inserts records into the different tables used in the exercises in this chapter.  The techniques used in this code block are discussed in detail in the appendix, ??add link here.??  

```{r collapse=TRUE}
source(file=here::here('book-src','sql_pet_data.R'),echo=FALSE)
```

## Overview Exercise

When joining many tables, it is helpful to have the number of rows from each table as an initial sanity check that the joins are returning a reasonable number of rows.

### 1.  How many rows are in each table?

```{r sql union, code_folding='unhide',echo=ECHO_CODE_BLOCK, tidy=TRUE}
table_rows_sql <- dbGetQuery(con, "select *
                 from (      select 'actor' tbl_name,count(*) from actor 
                       union select 'category' tbl_name,count(*) from category
                       union select 'film' tbl_name,count(*) from film
                       union select 'film_actor' tbl_name,count(*) from film_actor
                       union select 'film_category' tbl_name,count(*) from film_category
                       union select 'language' tbl_name,count(*) from language
                       union select 'inventory' tbl_name,count(*) from inventory
                       union select 'rental' tbl_name,count(*) from rental
                       union select 'payment' tbl_name,count(*) from payment
                       union select 'staff' tbl_name,count(*) from staff
                       union select 'customer' tbl_name,count(*) from customer
                       union select 'address' tbl_name,count(*) from address
                       union select 'city' tbl_name,count(*) from city
                       union select 'country' tbl_name,count(*) from country
                       union select 'store' tbl_name,count(*) from store
                       ) counts
                  order by tbl_name
                 ;
                ")
sp_print_df(table_rows_sql)
```

#### Replicate the output above using dplyr syntax.

```{r dplyr union, include=INCLUDE_OUTPUT,echo=ECHO_CODE_BLOCK}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

table_rows_dplyr <- 
  as.data.frame(actor_table %>% mutate(name = "actor") %>% group_by(name) %>% 
                  summarize(rows = n())) %>% 
  union(as.data.frame(address_table %>% mutate(name = "address") %>% group_by(name) %>% 
                        summarize(rows = n()))) %>% 
  union (as.data.frame(category_table %>% mutate(name = "category") %>% group_by(name) %>% 
                         summarize(rows = n()))) %>% 
  union(as.data.frame(city_table %>% mutate(name = "city") %>% group_by(name) %>% 
                        summarize(rows = n()))) %>%    
  union(as.data.frame(country_table %>% mutate(name = "country") %>% group_by(name) %>% 
                        summarize(rows = n()))) %>% 
  union(as.data.frame(customer_table %>% mutate(name = "customer") %>% group_by(name) %>% 
                        summarize(rows = n()))) %>% 
  union(as.data.frame(film_table %>% mutate(name = "film") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>% 
  union(as.data.frame(film_actor_table %>% mutate(name = "film_actor") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>% 
  union(as.data.frame(film_category_table %>% mutate(name = "film_category") %>% group_by(name) %>%
                      summarize(rows = n()))) %>% 
  union(as.data.frame(inventory_table %>% mutate(name = "inventory") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>% 
  union(as.data.frame(language_table %>% mutate(name = "language") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>% 
  union(as.data.frame(rental_table %>% mutate(name = "rental") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>% 
  union(as.data.frame(payment_table %>% mutate(name = "payment") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>% 
  union(as.data.frame(staff_table %>% mutate(name = "staff") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>% 
  union(as.data.frame(store_table %>% mutate(name = "store") %>% group_by(name) %>% 
                      summarize(rows = n()))) %>%
  arrange(name)

sp_print_df(table_rows_dplyr)
```
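
The row-count pattern above repeats one `summarize()` per table. As an illustration only, the sketch below applies the same idea to a named list of made-up in-memory tables (`demo_tables` and `table_rows_demo` are hypothetical names), so the per-table step is written once.

```r
library(dplyr)

# Made-up stand-ins for three of the database tables
demo_tables <- list(
  store    = tibble(store_id = 1:2),
  actor    = tibble(actor_id = 1:3),
  category = tibble(category_id = 1:4)
)

# One summarize() per table, stacked into a single data frame
table_rows_demo <- bind_rows(
  lapply(names(demo_tables), function(tbl_name) {
    demo_tables[[tbl_name]] %>% summarize(name = tbl_name, rows = n())
  })
) %>%
  arrange(name)

print(table_rows_demo)
```

The same loop could be pointed at the lazy `*_table` objects, with each per-table result collected before stacking; that variation has not been run against the chapter's database.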


## Exercises

### 1.  Where is the DVD Rental Business located?

To answer this question we look at the `store`, `address`, `city`, and `country` tables.

```{r ex1-s, code_folding='unhide', tidy=TRUE}
store_locations_sql <- dbGetQuery(con,
"select s.store_id
       ,a.address
       ,c.city
       ,a.district
       ,a.postal_code
       ,c2.country
       ,s.last_update
   from store s 
         join address a on s.address_id = a.address_id
         join city c on a.city_id = c.city_id
         join country c2 on c.country_id = c2.country_id
")
sp_print_df(store_locations_sql)
```

`r sp_color_markdown_text("Our DVD Rental business is international and operates in three countries, Canada, Australia, and the United States.  Each country has one store.","blue")` 

#### Replicate the output above using dplyr syntax.

```{r ex1-d, include=INCLUDE_OUTPUT, tidy=TRUE  }
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

store_locations_dplyr <- store_table %>%
    # columns shared by both tables but not used in `by` get the default .x/.y suffixes
    inner_join(address_table, by = c("address_id" = "address_id")) %>%
    inner_join(city_table, by = c("city_id" = "city_id")) %>%
    inner_join(country_table, by = c("country_id" = "country_id")) %>%
    select(store_id,address,city,district,postal_code,country,last_update.x) %>%
    collect()
sp_print_df(store_locations_dplyr)
```
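
The `last_update.x` column selected above gets its name from dplyr's default suffixes for non-key columns that appear in both tables. The sketch below shows this on two made-up in-memory tables (all values are illustrative) and then shows how the named `suffix` argument changes the result.

```r
library(dplyr)

# Made-up rows; both tables carry a last_update column
store <- tibble(
  store_id = 1:2,
  address_id = c(10L, 20L),
  last_update = as.Date(c("2006-02-15", "2006-02-15"))
)
address <- tibble(
  address_id = c(10L, 20L),
  address = c("1 Example Road", "2 Sample Street"),
  last_update = as.Date(c("2006-02-16", "2006-02-16"))
)

# Default: clashing non-key columns get .x (left table) and .y (right table)
default_suffix <- inner_join(store, address, by = "address_id")
names(default_suffix)

# suffix must be passed as a named argument to take effect
named_suffix <- inner_join(store, address, by = "address_id",
                           suffix = c(".store", ".addr"))
names(named_suffix)
```

Descriptive suffixes make long join chains easier to read, at the cost of having to update any later `select()` or `rename()` that refers to the `.x`/`.y` names.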

### 2.  List each store and its staff contact information

To answer this question we look at the `store`, `staff`, `address`, `city`, and `country` tables.

```{r ex2-s, code_folding='unhide', tidy=TRUE}
store_employees_sql <- dbGetQuery(con,
"select st.store_id
       ,s.first_name
       ,s.last_name
       ,s.email
       ,a.phone
       ,a.address
       ,c.city
       ,a.district
       ,a.postal_code
       ,c2.country
   from store st left join staff s on st.manager_staff_id = s.staff_id 
         left join address a on s.address_id = a.address_id
         left join city c on a.city_id = c.city_id
         left join country c2 on c.country_id = c2.country_id
")
sp_print_df(store_employees_sql)
```

`r sp_color_markdown_text("Our DVD Rental business is international and operates in three countries, Canada, Australia, and the United States.  Each country has one store.  The stores in Canada and Australia have one employee each, Mike Hillyer and Jon Stephens respectively.  The store in the United States has no employees yet.","blue")`

#### Replicate the output above using dplyr syntax.

```{r ex2-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

store_employees_dplyr <- store_table %>%
  # store and staff share address_id and store_id, so they get the default .x/.y suffixes
  left_join(staff_table, by = c("manager_staff_id" = "staff_id")) %>%
  left_join(address_table, by = c("address_id.y" = "address_id")) %>%
  left_join(city_table, by = c("city_id" = "city_id")) %>%
  left_join(country_table, by = c("country_id" = "country_id")) %>%
  select(store_id.x,first_name,last_name,email,phone,address,city,district,postal_code,country) %>%
  collect()
sp_print_df(store_employees_dplyr)

```
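
The exercise uses left joins so that a store with no matching staff row still appears in the output. A minimal in-memory sketch of that difference, with made-up rows (the third store has no manager):

```r
library(dplyr)

# Made-up data: store 3 has no manager assigned
store <- tibble(store_id = 1:3, manager_staff_id = c(1L, 2L, NA))
staff <- tibble(staff_id = 1:2, first_name = c("Ann", "Bob"))

# left_join keeps every store; unmatched staff columns become NA
kept_by_left <- left_join(store, staff,
                          by = c("manager_staff_id" = "staff_id"))

# inner_join drops the store that has no matching staff row
kept_by_inner <- inner_join(store, staff,
                            by = c("manager_staff_id" = "staff_id"))

print(kept_by_left)
print(kept_by_inner)
```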


### 3.  How many active, inactive, and total customers does the DVD rental business have?

To answer this question we look at the `customer` table.  In a previous chapter we observed that there are two columns, `activebool` and `active`.  We consider `active = 1` as active.

```{r ex3-s, code_folding='unhide', tidy=TRUE}
customer_cnt_sql <- dbGetQuery(con,
"SELECT sum(case when active = 1 then 1 else 0 end) active
       ,sum(case when active = 0 then 1 else 0 end) inactive
       ,count(*) total
   from customer
")

sp_print_df(customer_cnt_sql)
```

`r sp_color_markdown_text('
Our DVD Rental business is international and operates in three countries, Canada, Australia, and the United States.  Each country has one store.  The stores in Canada and Australia have one employee each.  The store in the United States has no employees yet.  

The business has 604 international customers, 589 are active and 15 inactive.','blue')`


#### Replicate the output above using dplyr syntax.

```{r ex3-d, include=INCLUDE_OUTPUT, tidy=TRUE}

customer_cnt_dplyr <- customer_table %>% 
  mutate(inactive = ifelse(active==0,1,0)) %>%
    summarize(active   = sum(active)
             ,inactive = sum(inactive)
             ,total = n()
             ) %>%
  collect()
sp_print_df(customer_cnt_dplyr)
```
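
The SQL `sum(case when ... then 1 else 0 end)` idiom corresponds to summing a logical condition in R. The sketch below uses a made-up five-row customer table; the output columns are deliberately named `active_cnt` and `inactive_cnt` (hypothetical names) so that later expressions inside `summarize()` still see the original `active` column.

```r
library(dplyr)

# Made-up customers: three active, two inactive
customer <- tibble(customer_id = 1:5, active = c(1L, 1L, 0L, 1L, 0L))

customer_cnt_demo <- customer %>%
  summarize(
    active_cnt   = sum(active == 1),  # TRUE counts as 1, FALSE as 0
    inactive_cnt = sum(active == 0),
    total        = n()
  )

print(customer_cnt_demo)
```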

### 4.  How many and what percent of customers are from each country?

To answer this question we look at the `customer`, `address`, `city`, and `country` tables. 

```{r ex4-s, code_folding='unhide', tidy=TRUE}
customers_sql <- dbGetQuery(con,
"select c.active,country.country,count(*) count
              ,round(100 * count(*) / sum(count(*)) over(),4) as pct
         from customer c
              join address a on c.address_id = a.address_id
              join city  on a.city_id = city.city_id
              join country on city.country_id = country.country_id
         group by c.active,country
order by count(*) desc
")
sp_print_df(customers_sql)
```

`r sp_color_markdown_text('
Based on the table above, the DVD Rental business has customers in 118 countries.  The DVD Rental business cannot have many walk-in customers.  It may use a mail-order distribution model.

For an international company, how are the different currencies converted to a standard currency?  Looking at the ERD, there is no currency conversion rate.','blue')`

#### Replicate the output above using dplyr syntax.

> The following snippet of code fails.

```{r ex4-d, include=INCLUDE_OUTPUT, eval=FALSE, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

cust_cnt <- customer_table %>% summarize(rows=n()) %>% collect()

customers_dplyr <- customer_table %>%
    inner_join(address_table, by = c("address_id" = "address_id")) %>%
    inner_join(city_table, by = c("city_id" = "city_id")) %>%
    inner_join(country_table, by = c("country_id" = "country_id")) %>%
    group_by(active,country) %>%
    summarize(count=n()) %>%
   
    mutate(total=cust_cnt$rows #  nrow(customer_table)
          ,pct=round(100 * count/total,4)
          ) %>%
    arrange(desc(count)) %>% 
    select (active,country,count,pct) %>%
  collect()


sp_print_df(customers_dplyr)

```
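
The chunk above is marked as failing and is not evaluated. As a point of comparison only, and not as a verified fix for the lazy version, the sketch below shows the percent-of-total calculation on a made-up in-memory table, where `sum(count)` over the summarized rows plays the role of the SQL `sum(count(*)) over()`.

```r
library(dplyr)

# Made-up customers in three countries
customers <- tibble(
  customer_id = 1:8,
  active = 1L,
  country = c("India", "India", "India", "India",
              "China", "China", "China", "Japan")
)

customers_pct_demo <- customers %>%
  count(active, country, name = "count") %>%        # rows per active/country
  mutate(pct = round(100 * count / sum(count), 4)) %>%  # share of all rows
  arrange(desc(count))

print(customers_pct_demo)
```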

### 5.  What countries constitute the top 25% of the customer base?

Using the previous code, add two new columns.  One column shows a running total and the second column shows a running percentage.  Order the data by count then by country.

To answer this question we look at the `customer`, `address`, `city`, and `country` tables again.

```{r ex5-s, code_folding='unhide', tidy=TRUE}
country_sql <- dbGetQuery(con,
"select active,country,count
       ,sum(count) over (order by count desc,country rows between unbounded preceding and current row) running_total
       , pct
       ,sum(pct) over (order by pct desc,country rows between unbounded preceding and current row) running_pct
  from (-- Start of inner SQL Block
        select c.active,country.country,count(*) count
              ,round(100 * count(*) / sum(count(*)) over(),4) as pct
         from customer c
              join address a on c.address_id = a.address_id
              join city  on a.city_id = city.city_id
              join country on city.country_id = country.country_id
         group by c.active,country
       ) ctry  -- End of inner SQL Block
 order by count desc,country
")
sp_print_df(country_sql)
```

`r sp_color_markdown_text('
The top 25% of the customer base are from India, China, the United States, and Japan.  The next six countries, Mexico, Brazil, Russian Federation, Philippines, Indonesia, and Turkey, complete the top 10 and round out the top 50% of the customer base of the business.','blue')`

#### Replicate the output above using dplyr syntax.

```{r ex5-d, include=INCLUDE_OUTPUT, eval=FALSE, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

cust_cnt <- customer_table %>% summarize(rows=n()) %>% collect()

country_dplyr <- customer_table %>%
    inner_join(address_table, by = c("address_id" = "address_id")) %>%
    inner_join(city_table, by = c("city_id" = "city_id")) %>%
    inner_join(country_table, by = c("country_id" = "country_id")) %>%
    group_by(active,country) %>%
    summarize(count=n()) %>%  
    mutate(total=cust_cnt$rows
          ,pct=round(100 * count/total,4)
          ,csp=1
          ) %>% 
    arrange(desc(count)) %>%

    group_by(csp) %>%

    mutate(running_pct=cumsum(pct)
          ,running_total=cumsum(count)) %>% 
    select (csp,active,country,count,running_total,pct,running_pct) %>%  
    collect()  

sp_print_df(country_dplyr)
```
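
The running total and running percentage in the SQL are cumulative sums over rows ordered by count and then country. A minimal in-memory sketch of that step, using made-up per-country counts (`country_counts` is a hypothetical name):

```r
library(dplyr)

# Made-up per-country customer counts
country_counts <- tibble(
  country = c("China", "India", "Japan"),
  count = c(3L, 4L, 1L)
)

country_running_demo <- country_counts %>%
  mutate(pct = round(100 * count / sum(count), 4)) %>%
  arrange(desc(count), country) %>%          # order before accumulating
  mutate(running_total = cumsum(count),
         running_pct   = cumsum(pct))

print(country_running_demo)
```

The order of rows matters: `cumsum()` accumulates in the current row order, so `arrange()` has to come first.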

### 6.  How many customers are in Australia and Canada?

To answer this question we use the results from the previous exercise.

```{r ex6-s, code_folding='unhide', tidy=TRUE}
country_au_ca_sql <- country_sql %>% filter(country == 'Australia' | country == 'Canada')
sp_print_df(country_au_ca_sql)
```

`r sp_color_markdown_text("There are 10 customers in Australia and Canada, where the brick and mortar stores are located.  These customers are less than 2% of the worldwide customer base.  ",'blue')`

#### Replicate the output above using dplyr syntax.

> The following code fails:

```{r ex6-d, include=INCLUDE_OUTPUT, eval=FALSE, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

country_au_ca_dplyr <- country_dplyr %>% filter(country == 'Australia' | country == 'Canada')
sp_print_df(country_au_ca_dplyr)

```

### 7.  How many languages?

With an international customer base, how many languages does the DVD Rental business distribute DVDs in?

To answer this question we look at the `language` table.

```{r ex7-s, code_folding='unhide', tidy=TRUE}
languages_sql <- dbGetQuery(con,
"
select * from language
")

sp_print_df(languages_sql)
```

`r sp_color_markdown_text("DVD's are distributed in six languages.","blue")`

#### Replicate the output above using dplyr syntax.

```{r ex7-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

languages_dplyr <- language_table %>% collect()
sp_print_df(languages_dplyr)
```

### 8.  What is the distribution of DVDs by language?

To answer this question we look at the `language` and `film` tables.

```{r ex8-s, code_folding='unhide', tidy=TRUE}
language_distribution_sql <- dbGetQuery(con,
'
select l.language_id,name "language",count(f.film_id)
  from language l left join film f on l.language_id = f.language_id
group by l.language_id,name
order by l.language_id
')

sp_print_df(language_distribution_sql)
```

`r sp_color_markdown_text("This is a surprise.  For an international customer base, the entire stock of 1001 DVDs is in English only.",'blue')`

#### Replicate the output above using dplyr syntax.

```{r ex8-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

language_distribution_dplyr <- language_table %>%
    left_join(film_table, by = c("language_id" = "language_id")) %>%
    group_by(language_id,name) %>%
    # na.rm belongs inside sum(); outside it would create a column named na.rm
    summarize(count = sum(ifelse(!is.na(title),1,0), na.rm=TRUE)) %>% 
  collect()

sp_print_df(language_distribution_dplyr)
```
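
In the SQL, `count(f.film_id)` counts only non-NULL values, which is why languages with no films show a count of zero rather than one. The sketch below reproduces that behaviour on made-up in-memory tables by counting the non-missing `film_id` values after a left join.

```r
library(dplyr)

# Made-up data: every film is in language 1, none in language 2
language <- tibble(language_id = 1:2, name = c("English", "Italian"))
film <- tibble(film_id = 1:3, language_id = c(1L, 1L, 1L))

language_distribution_demo <- language %>%
  left_join(film, by = "language_id") %>%
  group_by(language_id, name) %>%
  # n() would count the NA row for Italian as 1; counting non-missing
  # film_id values gives 0, like count(f.film_id) in SQL
  summarize(count = sum(!is.na(film_id)), .groups = "drop")

print(language_distribution_demo)
```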

### 9.  What are the number of rentals and rented amount by store, by month?

To answer this question we look at the `rental`, `inventory`, and `film` tables.

```{r ex9-s, code_folding='unhide', tidy=TRUE}
store_rentals_by_mth_sql <- dbGetQuery(con,
"select *
       ,sum(rental_amt) over (order by yyyy_mm,store_id rows 
                              between unbounded preceding and current row) running_rental_amt
   from (select yyyy_mm,store_id,rentals,rental_amt
               ,sum(rentals) over(partition by yyyy_mm order by store_id) mo_rentals
               ,sum(rental_amt) over (partition by yyyy_mm order by store_id) mo_rental_amt
           from (select to_char(rental_date,'yyyy-mm') yyyy_mm
                       ,i.store_id,count(*) rentals, sum(f.rental_rate) rental_amt
                   from rental r join inventory i on r.inventory_id = i.inventory_id 
                        join film f on i.film_id = f.film_id
                 group by to_char(rental_date,'yyyy-mm'),i.store_id
                ) as details
        ) as mo_running
order by yyyy_mm,store_id
")
sp_print_df(store_rentals_by_mth_sql)
```


`r sp_color_markdown_text("The current entry, row 11, is our new rental row we added to show the different joins in a previous chapter.
",'blue')`

#### Replicate the output above using dplyr syntax.

```{r ex9-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

store_rentals_by_mth_dplyr <- rental_table %>%
    inner_join(inventory_table, by = c("inventory_id" = "inventory_id")) %>%
    inner_join(film_table, by = c("film_id" = "film_id")) %>%
    mutate(YYYY_MM = to_char(rental_date,"YYYY-MM")
          ,running_total = 'running_total'
          ) %>%
    group_by(running_total,YYYY_MM,store_id) %>%
    summarise(rentals = n()
             ,rental_amt = sum(rental_rate,na.rm = TRUE)
             ) %>%
    mutate(mo_rentals=order_by(store_id,cumsum(rentals))
          ,mo_rental_amt=order_by(store_id,cumsum(rental_amt))
          ) %>%
    group_by(running_total) %>% 
    arrange(YYYY_MM,store_id) %>%
    mutate(running_rental_amt = cumsum(rental_amt)) %>%
    select(-running_total) %>% 
    collect()
    
sp_print_df(head(store_rentals_by_mth_dplyr, n=25))
```
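
The SQL combines two kinds of window: a cumulative sum that restarts each month (`partition by yyyy_mm`) and one that runs across all rows. A minimal in-memory sketch of both, using made-up monthly amounts:

```r
library(dplyr)

# Made-up rental amounts for two stores over two months
monthly <- tibble(
  yyyy_mm    = c("2005-05", "2005-05", "2005-06", "2005-06"),
  store_id   = c(1L, 2L, 1L, 2L),
  rental_amt = c(10, 20, 30, 40)
)

monthly_running_demo <- monthly %>%
  arrange(yyyy_mm, store_id) %>%
  group_by(yyyy_mm) %>%                          # like PARTITION BY yyyy_mm
  mutate(mo_rental_amt = cumsum(rental_amt)) %>%
  ungroup() %>%                                  # no partition: whole table
  mutate(running_rental_amt = cumsum(rental_amt))

print(monthly_running_demo)
```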

### 10.  Rank films based on the number of times rented and associated revenue

To answer this question we look at the `rental`, `inventory`, and `film` tables.

```{r ex10-s, code_folding='unhide', tidy=TRUE}
film_rank_sql <- dbGetQuery(con,
"select f.film_id,f.title,f.rental_rate,count(*) count,f.rental_rate * count(*) rental_amt
   from rental r join inventory i on r.inventory_id = i.inventory_id 
        join film f on i.film_id = f.film_id
 group by f.film_id,f.title,f.rental_rate
 order by count(*) desc")
  
sp_print_df(film_rank_sql)
```

`r sp_color_markdown_text("The most frequently rented movie, 34 times, is 'Bucket Brotherhood' followed by 'Rocketeer Mother', 33 times.",'blue')`

#### Replicate the output above using dplyr syntax.


```{r ex10-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_rank_dplyr <- rental_table %>%
    inner_join(inventory_table, by = c("inventory_id" = "inventory_id")) %>%
    inner_join(film_table, by = c("film_id" = "film_id")) %>%
    group_by(film_id,title,rental_rate) %>%
    summarize(count = n()
             ,rental_amt = sum(rental_rate)
             ) %>%
    arrange(desc(count)) %>% 
  collect()

sp_print_df(film_rank_dplyr)
```

### 11.  What is the rental distribution/DVD for the top two rented films?

From the previous exercise we know that the top two films are `Bucket Brotherhood` and `Rocketeer Mother`.  

To answer this question we look at the `rental`, `inventory`, and `film` tables again.  

Instead of looking at the film level, we need to drill down to the individual DVDs for each film to answer this question.


```{r ex11-s, code_folding='unhide', tidy=TRUE}
film_rank2_sql <- dbGetQuery(con,
"select i.store_id,i.film_id,f.title,i.inventory_id,count(*) 
   from rental r join inventory i on r.inventory_id = i.inventory_id 
        join film f on i.film_id = f.film_id
  where i.film_id in (103,738)
group by i.store_id,i.film_id,f.title,i.inventory_id")

sp_print_df(film_rank2_sql)
```

`r sp_color_markdown_text("The 'Bucket Brotherhood' and 'Rocketeer Mother' DVDs are equally distributed between the two stores, 4 DVDs each per film.  The 'Bucket Brotherhood' was rented 17 times from each store.  The 'Rocketeer Mother' was rented 15 times from store 1 and 18 times from store 2.",'blue')`

#### Replicate the output above using dplyr syntax.

```{r ex11-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_rank2_dplyr <- rental_table %>%
    inner_join(inventory_table, by = c("inventory_id" = "inventory_id")) %>%
    inner_join(film_table, by = c("film_id" = "film_id")) %>%
    filter(film_id %in% c(103,738)) %>%
    group_by(store_id,film_id,title,inventory_id) %>%
    summarize(count = n()) %>%
    arrange(film_id,store_id,inventory_id) %>% 
  collect()

sp_print_df(film_rank2_dplyr)
```


### 12.  List staffing information for store 1 associated with the `Bucket Brotherhood` rentals

To answer this question we look at the `rental`, `inventory`, `film`, `staff`, `address`, `city`, and `country` tables.  

```{r ex12-s, code_folding='unhide', tidy=TRUE}
film_103_details_sql <- dbGetQuery(con,
"select i.store_id,i.film_id,f.title,i.inventory_id inv_id,i.store_id inv_store_id
       ,r.rental_date::date rented,r.return_date::date returned
       ,s.staff_id,s.store_id staff_store_id,concat(s.first_name,' ',s.last_name) staff,ctry.country
   from rental r join inventory i on r.inventory_id = i.inventory_id 
        join film f on i.film_id = f.film_id
        join staff s on r.staff_id = s.staff_id
        join address a on s.address_id = a.address_id
        join city c on a.city_id = c.city_id
        join country ctry on c.country_id = ctry.country_id
  where i.film_id in (103)
    and r.rental_date::date between '2005-05-01'::date and '2005-06-01'::date
order by r.rental_date
")
sp_print_df(film_103_details_sql)
```

`r sp_color_markdown_text("In a previous exercise we saw that store 1, based in Canada, and store 2, based in Australia, each had one employee, staff_id 1 and 2 respectively.  We see that Mike from store 1, Canada, had transactions in store 1 and store 2 on 5/25/2005.  Similarly, Jon from store 2, Australia, had transactions in store 2 and store 1 on 5/31/2005.  Is this physically possible, or is it a data entry error?"
,'blue')`

#### Replicate the output above using dplyr syntax.

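column         | mapping               |definition
---------------|-----------------------|-----------
inv_id         |inventory.inventory_id |renamed column
inv_store_id   |inventory.store_id     |renamed column
rented         |rental.rental_date     |date only, timestamp dropped
returned       |rental.return_date     |date only, timestamp dropped
staff_store_id |staff.store_id         |renamed column
staff          |first_name+last_name   |first and last name joined with a space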

```{r ex12-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_103_details_dplyr <- inventory_table %>% filter(film_id == 103) %>%
  inner_join(film_table, by=c('film_id' = 'film_id')) %>%
  inner_join(rental_table, by=c('inventory_id' = 'inventory_id')) %>%
  filter(rental_date < '2005-06-01') %>% 
  # inventory and staff both have store_id, giving store_id.x and store_id.y
  inner_join(staff_table, by=c('staff_id' = 'staff_id')) %>%
  inner_join(address_table, by=c('address_id' = 'address_id')) %>%
  inner_join(city_table, by=c('city_id' = 'city_id')) %>%
  inner_join(country_table, by=c('country_id' = 'country_id')) %>%
  mutate(rented = as.Date(rental_date)
        ,returned = as.Date(return_date)
        ,staff = paste0(first_name,' ',last_name)
        ) %>%
  rename(inv_store = store_id.x
        ,staff_store_id=store_id.y
        ,inv_id = inventory_id
        ) %>%
  select(inv_store,film_id,title,inv_id,rented,returned,staff_id,staff_store_id
         ,staff,country) %>%
  arrange(rented) %>%
  collect() 

sp_print_df(film_103_details_dplyr)
```


### 13.  Which film(s) have never been rented?

To answer this question we look at the `film`, `inventory` and `rental` tables.

```{r ex13-s, code_folding='unhide', tidy=TRUE}
never_rented_dvds_sql <- dbGetQuery(con,
'select i.store_id,f.film_id, f.title,f.description, i.inventory_id
   from film f join inventory i on f.film_id = i.film_id
        left join rental r on i.inventory_id = r.inventory_id 
  where r.inventory_id is null 
'
)

sp_print_df(never_rented_dvds_sql)
```

`r sp_color_markdown_text("There are only two movies that have not been rented, Academy Dinosaur and Sophie's Choice."
,'blue')`

#### Replicate the output above using dplyr syntax.

```{r ex13-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

never_rented_dvds_dplyr <- film_table %>%
    inner_join(inventory_table, by = c("film_id" = "film_id")) %>%
    # keep only inventory rows with no matching rental
    anti_join(rental_table, by = "inventory_id") %>%
    select(store_id,film_id,title,description,inventory_id) %>% 
  collect()

sp_print_df(never_rented_dvds_dplyr)
```
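
The SQL finds unrented DVDs with a left join followed by `where r.inventory_id is null`, while the dplyr version uses `anti_join`. The sketch below, on made-up in-memory tables, shows that the two formulations select the same inventory rows.

```r
library(dplyr)

# Made-up data: inventory item 2 has never been rented
inventory <- tibble(inventory_id = 1:3, film_id = c(101L, 101L, 102L))
rental    <- tibble(rental_id = 1:3, inventory_id = c(1L, 1L, 3L))

# anti_join: rows of inventory with no match in rental
via_anti_join <- anti_join(inventory, rental, by = "inventory_id")

# left join + filter on a missing right-hand column, as in the SQL
via_left_join <- inventory %>%
  left_join(rental, by = "inventory_id") %>%
  filter(is.na(rental_id)) %>%
  select(inventory_id, film_id)

print(via_anti_join)
print(via_left_join)
```

`anti_join` never duplicates rows from the left table and returns only its columns, which is why no suffix handling is needed.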

### 14.  How many films are in each film rating?

To answer this question we look at the `film` table.

```{r ex14-s, code_folding='unhide', tidy=TRUE}
film_ratings_sql <- dbGetQuery(con,
'select f.rating,count(*)
   from film f 
group by f.rating
order by count(*) desc
'
)

sp_print_df(film_ratings_sql)
```

`r sp_color_markdown_text("There are 5 ratings and all 5 have roughly 200 movies."
,'blue')`

#### Replicate the output above using dplyr syntax.

```{r ex14-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_ratings_dplyr <- film_table %>%
  group_by(rating) %>%
  summarize(count=n()) %>%
  arrange(desc(count)) %>% 
  collect()

sp_print_df(film_ratings_dplyr)
```

### 15.  What are the different film categories?

To answer this question we look at the `category` table.

```{r ex15-s, code_folding='unhide', tidy=TRUE}
film_categories_sql <- dbGetQuery(con,
'select * from category'
)

sp_print_df(film_categories_sql)
```

`r sp_color_markdown_text("There are 16 different categories","blue")`

#### Replicate the output above using dplyr syntax.

```{r ex15-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_categories_dplyr <- category_table %>% 
  collect()

sp_print_df(film_categories_dplyr)
```

### 16.  How many DVD's are in each film category?

To answer this question we look at the `category` and `film_category` tables.

```{r ex16-s, code_folding='unhide', tidy=TRUE}
film_categories2_sql <- dbGetQuery(con,
'select c.name,count(*) count
   from category c join film_category fc on c.category_id = fc.category_id
group by c.name
order by count(*) desc
'
)

sp_print_df(film_categories2_sql)
```

`r sp_color_markdown_text('There are 16 film categories.  The highest category, Sports, has 77 films followed by the International category which has 76 films.  What is an example of an international category film where all films are currently in English?','blue')`

#### Replicate the output above using dplyr syntax.

```{r ex16-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_categories2_dplyr <- category_table %>%
  inner_join(film_category_table, by =c('category_id'='category_id')) %>%
  group_by(name) %>%
  summarise(count=n()) %>%
  arrange(desc(count)) %>% 
  collect()

sp_print_df(film_categories2_dplyr)

```

### 17.  Which films are listed in multiple categories?

To answer this question we look at the `film`, `film_category` and `category` tables.

```{r ex17-s, code_folding='unhide', tidy=TRUE}
multiple_categories_sql <- dbGetQuery(con,
'select f.film_id, f.title,c.name
   from film_category fc join film f on fc.film_id = f.film_id
        join category c on fc.category_id = c.category_id
  where fc.film_id in (select fc.film_id
                         from film f join film_category fc on f.film_id = fc.film_id
                       group by fc.film_id
                       having count(*) > 1
                       ) 
'
)

sp_print_df(multiple_categories_sql)
```

`r sp_color_markdown_text("There is only one film which has two categories, Sophie's Choice.",'blue')`
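
The sub-query pattern above (find the keys that occur more than once, then keep only the rows with those keys) can be sketched on toy data. The tibbles below are invented for illustration and are not the database tables:

```r
library(dplyr)

# Hypothetical toy data: film 2 is filed under two categories.
toy_film <- tibble(film_id = 1:3, title = c("Alpha", "Bravo", "Charlie"))
toy_film_category <- tibble(
  film_id     = c(1L, 2L, 2L, 3L),
  category_id = c(1L, 1L, 2L, 2L)
)
toy_category <- tibble(category_id = 1:2, name = c("Action", "Drama"))

# Equivalent of the "group by ... having count(*) > 1" sub-query.
multi_ids <- toy_film_category %>%
  group_by(film_id) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  distinct(film_id)

# semi_join() plays the role of "where film_id in (...)".
toy_multi <- toy_film %>%
  semi_join(multi_ids, by = "film_id") %>%
  inner_join(toy_film_category, by = "film_id") %>%
  inner_join(toy_category, by = "category_id") %>%
  select(film_id, title, name)

toy_multi
```

A `semi_join()` filters rows without adding columns, which keeps the later joins free of duplicated count columns.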

#### Replicate the output above using dplyr syntax.

```{r ex17-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

multiple_categories_dplyr <- 
  # compute films with multiple categories
  film_table %>% 
    inner_join(film_category_table,by=c('film_id'='film_id'), suffix = c('.f','.fc')) %>% 
    group_by(film_id,title) %>% 
    summarise(count=n()) %>%
    filter(count > 1) %>% 
  # get the category ids
  inner_join(film_category_table, by = c('film_id'='film_id'),suffix = c('.f','.fc')) %>%
  # get the category names
  inner_join(category_table, by=c('category_id'='category_id')) %>%
  select(film_id,title,name) %>% 
  collect()

sp_print_df(multiple_categories_dplyr)
```

### 18.  Which DVD's are in one store's inventory but not the other?

In the table below we show the first `r HEAD_N` rows.  

To answer this question we look at the `inventory` and `film` tables.

```{r ex18-s, code_folding='unhide', tidy=TRUE}

dvd_in_1_store_sql <- dbGetQuery(
  con,
  "
--   select store1,count(count1) films_not_in_store_2,sum(coalesce(count1,0)) dvds_not_in_store_1
--         ,store2,count(count2) films_not_in_store_1,sum(coalesce(count2,0)) dvds_not_in_store_2
--     from (
             select coalesce(i1.film_id,i2.film_id) film_id,f.title,f.rental_rate
                   ,1 store1,coalesce(i1.count,0) count1
                   ,2 store2,coalesce(i2.count,0) count2
                  -- dvd inventory in store 1
               from (select film_id,store_id,count(*) count 
                       from inventory where store_id = 1 
                      group by film_id,store_id
                    ) as i1
                    full outer join 
                  -- dvd inventory in store 2
                    (select film_id,store_id,count(*) count
                       from inventory where store_id = 2 
                     group by film_id,store_id
                    ) as i2
                 on i1.film_id = i2.film_id 
               join film f 
                 on coalesce(i1.film_id,i2.film_id) = f.film_id
             where i1.film_id is null or i2.film_id is null
             order by f.title  
--          ) as src
--    group by store1,store2
"
)
if(HEAD_N > 0) {
    sp_print_df(head(dvd_in_1_store_sql,n=HEAD_N))
} else {
    sp_print_df(dvd_in_1_store_sql)
}
```


`r sp_color_markdown_text("Store 1 has 196 films (576 DVD's) that are not in store 2.  Store 2 has 199 films (607 DVD's) that are not in store 1.",'blue')`
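
The full outer join logic can be sketched on a small in-memory inventory. `toy_inventory` is a made-up stand-in with one row per DVD, and the column names `count1` and `count2` simply mirror the SQL aliases:

```r
library(dplyr)

# Hypothetical toy inventory: film 1 is in both stores,
# film 2 only in store 1, film 3 only in store 2.
toy_inventory <- tibble(
  film_id  = c(1L, 1L, 2L, 3L, 3L, 3L),
  store_id = c(1L, 2L, 1L, 2L, 2L, 2L)
)

store1 <- toy_inventory %>% filter(store_id == 1) %>% count(film_id, name = "count1")
store2 <- toy_inventory %>% filter(store_id == 2) %>% count(film_id, name = "count2")

toy_one_store <- full_join(store1, store2, by = "film_id") %>%
  # an NA count means the other store has no copy of the film
  filter(is.na(count1) | is.na(count2)) %>%
  mutate(count1 = coalesce(count1, 0L), count2 = coalesce(count2, 0L)) %>%
  arrange(film_id)

toy_one_store
```

Filtering on the `NA` values before replacing them with zero matters: once the counts are coalesced, the information about which side was missing is only recoverable from the zeros.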

#### Replicate the output above using dplyr syntax.

> The following code isn't working yet.

```{r ex18-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

inv_tbl1 <- inventory_table %>% 
    filter(store_id == 1 ) %>% 
    group_by(film_id) %>% 
    summarise(count=n()) 

inv_tbl2 <- inventory_table %>% 
    filter(store_id == 2 ) %>% 
    group_by(film_id) %>% 
    summarise(count=n()) 

dvd_in_1_store_dplyr <- inv_tbl1 %>% 
    full_join(inv_tbl2, by='film_id', suffix=c('.i1','.i2')) %>%
    # keep films stocked by only one of the two stores
    filter(is.na(count.i1) | is.na(count.i2)) %>%
    # a missing count means the store holds no copies
    mutate(count1 = coalesce(count.i1, 0L)
          ,count2 = coalesce(count.i2, 0L)
          ,store1 = 1L, store2 = 2L) %>%
    inner_join(film_table, by='film_id') %>%
    select(film_id,title,rental_rate,store1,count1,store2,count2) %>%
    arrange(title) %>% 
  collect()

if(HEAD_N > 0) {
    sp_print_df(head(dvd_in_1_store_dplyr,n=HEAD_N))
} else {
    sp_print_df(dvd_in_1_store_dplyr)
}
```

### 19.  Which films are not tracked in inventory?

To answer this question we look at the `film` and `inventory` tables.

```{r ex19-s, code_folding='unhide', tidy=TRUE}
films_no_inventory_sql <- dbGetQuery(con,
"
select f.film_id,title,rating,rental_rate,replacement_cost
  from film f left outer join inventory i on f.film_id = i.film_id
 where i.film_id is null;
")

if(HEAD_N > 0) {
    sp_print_df(head(films_no_inventory_sql,n=HEAD_N))
} else {
    sp_print_df(films_no_inventory_sql)
}

```


`r sp_color_markdown_text("There are 42 films that do not exist in inventory or in either store.  These may be DVD's that have been ordered but the business has not received them.  Looking at the price and the replacement cost, it doesn't look like there is any rhyme or reason to the setting of the price.",'blue')`

#### Replicate the output above using dplyr syntax.

```{r ex19-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

films_no_inventory_dplyr <- film_table %>%
    anti_join(inventory_table, by=(c('film_id'='film_id'))) %>%
    select (film_id,title,rating,rental_rate,replacement_cost) %>% 
  collect()

if(HEAD_N > 0) {
    sp_print_df(head(films_no_inventory_dplyr,n=HEAD_N))
} else {
    sp_print_df(films_no_inventory_dplyr)
}
```

### 20.  List film categories in descending accounts receivable order.

To answer this question we look at the `rental`, `inventory`, `film`, `film_category` and `category`  tables.


```{r ex20-s, code_folding='unhide', tidy=TRUE}
film_category_AR_rank_sql <- dbGetQuery(con,
"
select category,AR
       ,sum(AR) over (order by AR desc rows between unbounded preceding and current row) running_AR
       ,rentals
       ,sum(rentals) over (order by AR desc rows between unbounded preceding and current row) running_rentals
  from (select c.name category, sum(f.rental_rate) AR, count(*) rentals
          from rental r join inventory i on r.inventory_id = i.inventory_id 
               join film f on i.film_id = f.film_id
               join film_category fc on f.film_id = fc.film_id
               join category c on fc.category_id = c.category_id
       group by c.name
      ) src
")
  
sp_print_df(film_category_AR_rank_sql)
```

`r sp_color_markdown_text('There are 16 film categories.  The top three categories based on highest AR amounts are Sports, Drama, and Sci-Fi.  The total number of rentals is 16046 with an AR amount of 47221.54.','blue')`

#### Replicate the output above using dplyr syntax.

column         | mapping          |definition
---------------|------------------|-----------
category       | category.name    |
ar             | f.rental_rate    | sum of the rental rates for the category
running_ar     |                  | accumulated ar amounts based on categories
rentals        |                  | number of rentals associated with the category
running_rentals|                  | accumulated category rentals

```{r ex20-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_category_AR_rank_dplyr <- rental_table %>%
    inner_join(inventory_table, by=c('inventory_id'='inventory_id')) %>%
    inner_join(film_table, by=c('film_id'='film_id')) %>%
    inner_join(film_category_table, by=c('film_id'='film_id')) %>%
    inner_join(category_table,by=c('category_id'='category_id')) %>%
    group_by(name) %>%
    summarize(rentals=n()
             ,AR=sum(rental_rate, na.rm = TRUE)
             ) %>%
    arrange(desc(AR)) %>%
    mutate(running_ar=cumsum(AR)
          ,running_rentals=cumsum(rentals)
          ) %>%
    rename(category=name) %>%
    select(category,AR,running_ar,rentals,running_rentals) %>% 
  collect()
    
sp_print_df(film_category_AR_rank_dplyr)
```

### 21.  List film ratings in descending accounts receivable order.

To answer this question we look at the `rental`, `inventory`, and `film` tables.

```{r ex21-s, code_folding='unhide', tidy=TRUE}
film_rating_rank_sql <- dbGetQuery(con,
"select rating,AR
       ,sum(AR) over (order by AR desc rows 
        between unbounded preceding and current row) running_AR
       ,rentals
       ,sum(rentals) over (order by AR desc rows 
        between unbounded preceding and current row) running_rentals
from (select f.rating, sum(f.rental_rate) AR, count(*) rentals
        from rental r join inventory i on r.inventory_id = i.inventory_id 
        join film f on i.film_id = f.film_id
      group by f.rating
     ) as src 
")
  
sp_print_df(film_rating_rank_sql)
```

`r sp_color_markdown_text('There are 5 film ratings.  The total number of rentals is 16045 with an AR amount of 47216.55.

Why do the film category totals, 16046 rentals and 47221.54, differ from the film rating totals?','blue')`  
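
One plausible explanation, given that exercise 17 found a film listed in two categories, is join fan-out: every rental of such a film is counted once per category. The following sketch uses invented toy tibbles (not the database tables) to show how a one-to-many join inflates both the row count and the summed rate:

```r
library(dplyr)

# Hypothetical toy data: film 1 is filed under two categories.
toy_rental <- tibble(rental_id = 1:2, film_id = 1:2)
toy_film <- tibble(
  film_id     = 1:2,
  rating      = c("G", "PG"),
  rental_rate = c(4.99, 0.99)
)
toy_film_category <- tibble(
  film_id  = c(1L, 1L, 2L),
  category = c("Drama", "New", "Sports")
)

by_rating <- toy_rental %>%
  inner_join(toy_film, by = "film_id") %>%
  summarise(rentals = n(), ar = sum(rental_rate))

# The extra join repeats the rental of film 1, once per category.
by_category <- toy_rental %>%
  inner_join(toy_film, by = "film_id") %>%
  inner_join(toy_film_category, by = "film_id") %>%
  summarise(rentals = n(), ar = sum(rental_rate))

by_category$rentals - by_rating$rentals
by_category$ar - by_rating$ar
```

In the toy data the category totals exceed the rating totals by one rental and by that film's rental rate, which is the same shape of discrepancy as in the two reports.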

#### Replicate the output above using dplyr syntax.

column         | mapping          |definition
---------------|------------------|-----------
rating         | film.rating      |
ar             | f.rental_rate    | sum of the rental rates for the rating
running_ar     |                  | accumulated ar amounts based on ratings
rentals        |                  | number of rentals associated with the rating
running_rentals|                  | accumulated rating rentals


```{r ex21-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

film_rating_rank_dplyr <- rental_table %>%
    inner_join(inventory_table, by=c('inventory_id'='inventory_id')) %>%
    inner_join(film_table, by=c('film_id'='film_id')) %>% 
    group_by(rating) %>%
    summarize(rentals=n()
             ,AR=sum(rental_rate, na.rm = TRUE)
             ) %>%
    arrange(desc(AR)) %>%
    mutate(running_ar=cumsum(AR)
          ,running_rentals=cumsum(rentals)
          ) %>%
    select(rating,AR,running_ar,rentals,running_rentals) %>% 
  collect()


sp_print_df(film_rating_rank_dplyr)
```


### 22.  How many rentals were returned on time, returned late, never returned?

To answer this question we look at the `rental`, `inventory`, and `film` tables.

```{r ex22-s, code_folding='unhide', tidy=TRUE}
returned_sql <- dbGetQuery(con,
"with details as
    (select case when r.return_date is null
                 then null
                 else r.return_date::date  - (r.rental_date + INTERVAL '1 day'  * f.rental_duration)::date
            end rtn_days
           ,case when r.return_date is null
                 then 1
                 else 0
            end not_rtn
       from rental r join inventory i on r.inventory_id = i.inventory_id
                     join film f on i.film_id = f.film_id
    )
 select sum(case when rtn_days <= 0 then 1 else 0 end) on_time
       ,sum(case when rtn_days >  0 then 1 else 0 end) late
       ,sum(not_rtn) not_rtn
       ,count(*) rented
       ,round(100. * sum(case when rtn_days <= 0 then 1 else 0 end)/count(*),2) on_time_pct
       ,round(100. * sum(case when rtn_days >  0 then 1 else 0 end)/count(*),2) late_pct
       ,round(100. * sum(not_rtn)/count(*),2)  not_rtn_pct
   from details
")

sp_print_df(returned_sql)
```

`r sp_color_markdown_text("To date 53.56% of the rented DVD's were returned on time, 45.30% were returned late, and 1.14% were never returned.",'blue')`

#### Replicate the output above using dplyr syntax.

column         | mapping          |definition
---------------|------------------|-----------
on_time        |                  |number of DVD's where rental.return_date <= rental.rental_date + film.rental_duration
late           |                  |number of DVD's where rental.return_date > rental.rental_date + film.rental_duration
not_rtn        |                  |number of DVD's not returned; rental.return_date is null
rented         |                  |number of DVD's rented.
on_time_pct    |                  |Percent of DVD's returned on time
late_pct       |                  |Percent of DVD's returned late
not_rtn_pct    |                  |Percent of DVD's not returned.
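
The definitions in the table can be checked on a few hand-made rows. This is a local sketch on an invented tibble, `toy_rentals`, where the due date is taken to be `rental_date + rental_duration` days, as in the SQL above:

```r
library(dplyr)

# Hypothetical toy rentals, all rented on the same day for 3 days:
# one returned on the due date, one returned late, one never returned.
toy_rentals <- tibble(
  rental_date     = as.Date(c("2005-05-24", "2005-05-24", "2005-05-24")),
  return_date     = as.Date(c("2005-05-27", "2005-06-02", NA)),
  rental_duration = c(3L, 3L, 3L)
)

toy_returned <- toy_rentals %>%
  # days past the due date; NA when the DVD was never returned
  mutate(rtn_days = as.integer(return_date - (rental_date + rental_duration))) %>%
  summarise(
    on_time = sum(rtn_days <= 0, na.rm = TRUE),
    late    = sum(rtn_days > 0, na.rm = TRUE),
    not_rtn = sum(is.na(return_date)),
    rented  = n()
  ) %>%
  mutate(
    on_time_pct = round(100 * on_time / rented, 2),
    late_pct    = round(100 * late / rented, 2),
    not_rtn_pct = round(100 * not_rtn / rented, 2)
  )

toy_returned
```

The `na.rm = TRUE` arguments keep the unreturned rental out of the on-time and late counts while `n()` still includes it in `rented`.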

```{r ex22-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

returned_dplyr <- rental_table %>%
    inner_join(inventory_table, by=c('inventory_id'='inventory_id')) %>%
    inner_join(film_table, by=c('film_id'='film_id')) %>%
    mutate(rtn_days= date(return_date) - (date(rental_date) + rental_duration)
          ,not_returned=ifelse(is.na(return_date),1,0)
          ) %>%
    summarize(on_time = sum(ifelse(rtn_days <= 0,1,0),na.rm = TRUE)
             ,late = sum(ifelse(rtn_days > 0,1,0),na.rm = TRUE)
             ,not_rtn=sum(not_returned)
             ,rented = n()
             ) %>%
    mutate(on_time_pct = round(100.0 * on_time/rented,2)
          ,late_pct    = round(100.0 * late/rented,2)
          ,not_rtn_pct = round(100.0 * not_rtn/rented,2)
          ) %>%
    collect()

sp_print_df(returned_dplyr)
```

### 23.  Are there duplicate customers?

To answer this question we look at the `customer`, `address`, `city`, and `country` tables.

We assume that if the customer first and last name match in two different rows, then it is a duplicate customer. 

```{r ex23-s, code_folding='unhide', tidy=TRUE}
customer_dupes_sql <- dbGetQuery(
  con,
  "select cust.customer_id id
         ,cust.store_id store
         ,concat(cust.first_name,' ',cust.last_name) customer
         ,cust.email
--         ,a.phone
         ,a.address
         ,c.city
         ,a.postal_code zip
         ,a.district
         ,ctry.country
     from customer cust join address a on cust.address_id = a.address_id
                     join city c on a.city_id = c.city_id
                     join country ctry on c.country_id = ctry.country_id
    where concat(cust.first_name,cust.last_name)
          in (select concat(first_name,last_name)
                from customer
              group by concat(first_name,last_name)
             having count(*) >1
             )
  ")
sp_print_df(customer_dupes_sql)

```

`r sp_color_markdown_text('Sophie is the only duplicate customer.  The only difference between the two records is the store.  Record 600 is associated with store 3, which has no employees, and record 601 is associated with store 2.','blue')`

#### Replicate the output above using dplyr syntax.

column         | mapping              |definition
---------------|----------------------|-----------
id             |customer.customer_id  |
store          |customer.store_id     |
customer       |first_name + last_name|
zip            |address.postal_code   |

```{r ex23-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

customer_dupes_dplyr <- customer_table %>%
    group_by(first_name,last_name) %>%
    summarize(n = n()) %>%
    filter(n > 1) %>%
    inner_join(customer_table,by=c("first_name"="first_name","last_name"="last_name")) %>%
    inner_join(address_table, by = c("address_id" = "address_id"), suffix = c(".cust", ".addr")) %>%
    inner_join(city_table, by = c("city_id" = "city_id"), suffix = c(".addr", ".city")) %>%
    inner_join(country_table, by = c("country_id" = "country_id"), suffix = c(".city", ".ctry")) %>%
    mutate(customer=paste(first_name,last_name,sep=' ')) %>%
    ungroup() %>%
    rename(id=customer_id
          ,store=store_id
          ,zip=postal_code
          ) %>%
    select(id,store,customer,email,address,city,zip,district,country) %>% 
  collect()
    
sp_print_df(customer_dupes_dplyr)
```


### 24.  Which customers have never rented a movie?

To answer this question we look at the `customer`, `rental`, `address`, `city`, and `country` tables.

```{r ex24-s, code_folding='unhide', tidy=TRUE}
customer_no_rentals_sql <- dbGetQuery(
  con,
  "select c.customer_id id
         ,c.first_name
         ,c.last_name
         ,c.email
         ,a.phone
         ,city.city
         ,ctry.country
         ,c.active 
         ,c.create_date
--         ,c.last_update
     from customer c left join rental r on c.customer_id = r.customer_id
                     left join address a on c.address_id = a.address_id
                     left join city on a.city_id = city.city_id
                     left join country ctry on city.country_id = ctry.country_id
    where r.rental_id is null
  order by c.customer_id

  "
)
sp_print_df(customer_no_rentals_sql)
```

`r sp_color_markdown_text('We see that there are four new customers who have never rented a movie.  These four customers are in the countries that have a manned store.','blue')`

#### Replicate the output above using dplyr syntax.

column         | mapping              |definition
---------------|----------------------|-----------
id             |customer.customer_id  |

```{r ex24-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

customer_no_rentals_dplyr <- customer_table %>%
    anti_join(rental_table, by = "customer_id" ) %>%
    inner_join(address_table, by = c('address_id'='address_id')) %>%
    inner_join(city_table, by = c('city_id'='city_id')) %>%
    inner_join(country_table, by=c('country_id'='country_id')) %>%
    rename(id=customer_id) %>%
    select(id,first_name,last_name,email,phone,city,country,active,create_date) %>% 
  collect()

sp_print_df(customer_no_rentals_dplyr)
```


### 25.  Who are the top 5 customers with the most rentals and associated payments?

This exercise uses the `customer`, `rental`, and `payment` tables.

```{r ex25-s, code_folding='unhide', tidy=TRUE}
customer_top_rentals_sql <- dbGetQuery(
  con,
  "select c.customer_id id,c.store_id
         ,concat(c.first_name,' ',c.last_name) customer
         ,min(rental_date)::date mn_rental_dt
         ,max(rental_date)::date mx_rental_dt
         ,sum(COALESCE(p.amount,0.)) paid
         ,count(r.rental_id) rentals
     from customer c
          left join rental r on c.customer_id = r.customer_id
          left join payment p on r.rental_id = p.rental_id 
   group by  c.customer_id
            ,c.first_name
            ,c.last_name
            ,c.store_id
   order by count(r.rental_id) desc
limit 5
  "
)
sp_print_df(customer_top_rentals_sql)
```

`r sp_color_markdown_text("The top 5 customers each rented between 41 and 46 DVD's.  Three of the top 5 rented about 14 DVD's per month over a three month period.  The other two customers rented 41 and 42 DVD's over 12 months.",'blue')`

#### Replicate the output above using dplyr syntax

column         | mapping              |definition
---------------|----------------------|-----------
id             |customer.customer_id  |
customer       |first_name + last_name|
mn_rental_dt   |                      |minimum rental date
mx_rental_dt   |                      |maximum rental date
paid           |                      |paid amount
rentals        |                      |customer rentals

Use the dplyr left_join verb to find the top 5 customers who have rented the most movies.

```{r ex25-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

customer_top_rentals_dplyr <- customer_table %>%
  left_join(rental_table, by = c("customer_id" = "customer_id"), suffix = c(".c", ".r")) %>%
  left_join(payment_table, by = c("rental_id" = "rental_id"), suffix = c(".x", ".y")) %>%
  mutate(customer=paste(first_name,last_name,sep=' ')) %>%
  group_by(customer_id.x,customer,store_id) %>%
  summarize(rentals=n()
           ,paid = sum(ifelse(is.na(amount),0,amount),na.rm=TRUE)
           ,mn_rental_dt = as.Date(min(rental_date,na.rm=TRUE))
           ,mx_rental_dt = as.Date(max(rental_date,na.rm=TRUE))
           ) %>%
  arrange(desc(rentals)) %>%
  rename(id = customer_id.x) %>%
  select(id,store_id,customer,mn_rental_dt,mx_rental_dt,paid,rentals) %>%
  collect()

sp_print_df(head(customer_top_rentals_dplyr,n=5))
```        

### 26.  Combine the top 5 rental customers, (more than 40 rentals), and zero rental customers

To answer this question we look at the `customer`, `rental`, and `payment` tables again.

```{r ex26-s, code_folding='unhide', tidy=TRUE}
customer_rental_high_low_sql <- dbGetQuery(
  con,
  "select c.customer_id id
         ,concat(c.first_name,' ',c.last_name) customer
         ,count(*) cust_cnt
         ,count(r.rental_id) rentals
         ,count(p.payment_id) payments
         ,sum(coalesce(p.amount,0)) paid
     from customer c
          left outer join rental r on c.customer_id = r.customer_id
          left outer join payment p on r.rental_id = p.rental_id
   group by  c.customer_id
            ,c.first_name
            ,c.last_name
   having count(r.rental_id) = 0 or count(r.rental_id) > 40
   order by count(r.rental_id) desc
  "
)
sp_print_df(customer_rental_high_low_sql)
```

`r sp_color_markdown_text('We see that there are four new customers who have never rented a movie.  These four customers are in the countries that have a manned store.','blue')`

#### Replicate the output above using dplyr syntax.

Column          | Mapping              |Definition
----------------|----------------------|-----------------------------------
id              |customer.customer_id  |
customer        |first_name + last_name|
cust_cnt        |                      |customer count
rentals         |                      |customer rentals
payments        |                      |customer payments
paid_amt        |payment.amount        |aggregated payment amount 

```{r ex26-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

customer_rental_high_low_dplyr <- customer_table %>%
    left_join(rental_table, by = c("customer_id" = "customer_id"), suffix = c(".c", ".r")) %>%
    left_join(payment_table, by = c("rental_id" = "rental_id"), suffix = c(".x", ".y")) %>%
    mutate(customer=paste(first_name,last_name,sep=' ')
          ,rented = if_else(is.na(rental_id),0, 1)
          ,paid = if_else(is.na(payment_id),0,1)
          ) %>%
    group_by(customer_id.x,customer,rented) %>%
    summarize(cust_cnt = n()
             ,rentals=sum(rented, na.rm = TRUE)
             ,payments = sum(paid, na.rm = TRUE)
             ,paid_amt = sum(ifelse(is.na(amount),0,amount), na.rm = TRUE)
            ) %>%
    filter( rentals == 0 | rentals > 40) %>%
    rename(id = customer_id.x) %>%
    select(id,customer,cust_cnt,rentals,payments,paid_amt) %>%
    arrange(desc(rentals)) %>% 
    collect()

sp_print_df(customer_rental_high_low_dplyr)
```

### 27.  Who are the top-n1 and bottom-n2 customers?

The issue with the two previous reports is that the top end is hardcoded, rentals  > 40.  Over time, the current customers will always be in the top section and new customers will get added.  Another way of looking at the previous report is to show just the top and bottom 5 customers.  

Parameterize the previous exercise to show the top 5 and bottom 5 customers. 

To answer this question we look at the `customer`, `rental`, and `payment` tables again.

```{r ex27-s, code_folding='unhide', eval=FALSE, tidy=TRUE}
customer_rentals_hi_low_sql <- function(con,high_n,low_n) {
    customer_rental_high_low_sql <- dbGetQuery(con,
        "select *
           from (     select *
                            ,ROW_NUMBER() OVER(ORDER BY rentals desc) rent_hi_low
                            ,ROW_NUMBER() OVER(ORDER BY rentals ) rent_low_hi
                       FROM (    
                                 select c.customer_id id
                                       ,concat(c.first_name,' ',c.last_name) customer
                                       ,count(*) cust_cnt
                                       ,count(r.rental_id) rentals
                                       ,count(p.payment_id) payments
                                       ,sum(coalesce(p.amount,0)) paid_amt
                                  from customer c 
                                       left outer join rental r on c.customer_id = r.customer_id
                                       left outer join payment p on r.rental_id = p.rental_id
                                 group by c.customer_id
                                        ,c.first_name
                                        ,c.last_name
                            ) as summary
                ) row_nums
           where rent_hi_low <= $1 or rent_low_hi <= $2
          order by rent_hi_low
        "
        ,params = list(high_n,low_n)
        )
    return (customer_rental_high_low_sql)
}
```

The next code block executes the SQL version of the function.  With top_n = 5 and bot_n = 5, it replicates the hard coded version of the previous exercise. With top_n = 5 and bot_n = 0, it gives a top 5 report.  With top_n = 0 and bot_n = 5, the report returns the bottom 5.  Change the two parameters to see the output from the different combinations.

```{r customer_rentals_hi_low_sql, eval=FALSE}
top_n = 5
bot_n = 5
sp_print_df(customer_rentals_hi_low_sql(con,top_n,bot_n))
```

#### Replicate the function above using dplyr syntax.

Column          | Mapping             |Definition
----------------|---------------------|-----------------------------------
id              |customer.customer_id |
cust_cnt        |                     |customer count
rentals         |                     |customer rentals
payments        |                     |customer payments
paid_amt        |payment.amount       |aggregated payment amount 
rent_hi_low     |                     |sequence with 1 = customer with highest rentals
rent_low_hi     |                     |sequence with 1 = customer with the lowest rentals


```{r ex27-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

# Parameters
#   con: database connection
#   high_n: top n customers
#   low_n: bottom n customers

customer_rentals_hi_low_dplr <- function(con,high_n,low_n) {
  customer_table <- tbl(con, "customer")
  rental_table   <- tbl(con, "rental")
  payment_table   <- tbl(con, "payment")

customer_rental_loj_hi_low_d <- customer_table %>%
  left_join(rental_table, by = c("customer_id" = "customer_id"), suffix = c(".c", ".r")) %>%
  left_join(payment_table, by = c("rental_id" = "rental_id"), suffix = c(".x", ".y")) %>%
  mutate(customer=paste(first_name,last_name,sep=' ')
        ,rented = if_else(is.na(rental_id),0, 1)
        ,paid = if_else(is.na(payment_id),0,1)
        ) %>%
  group_by(customer_id.x,customer,rented) %>%
  summarize(cust_cnt = n()
           ,rentals=sum(rented,na.rm = TRUE)
           ,payments = sum(paid,na.rm = TRUE)
           ,paid_amt = sum(ifelse(is.na(amount),0,amount),na.rm = TRUE)
          ) %>%
  rename(id=customer_id.x) %>%
  select(id,customer,cust_cnt,rentals,payments,paid_amt) %>%
  arrange(desc(rentals)) %>%
  collect()
#
#   Add the rankings 
#    
  customer_rental_loj_hi_low_d <- cbind(customer_rental_loj_hi_low_d
                                       ,rent_hi_low = 1:nrow(customer_rental_loj_hi_low_d)
                                       ,rent_low_hi = nrow(customer_rental_loj_hi_low_d):1
                                       )
  customer_rental_loj_hi_low_d %>% 
    filter(rent_hi_low <= high_n | rent_low_hi <= low_n) %>%
    arrange(rent_hi_low) 
}
```

The next code block executes your dplyr version of such a function.  With top_n = 5 and bot_n = 5, it replicates the hard coded version of the previous exercise. With top_n = 5 and bot_n = 0, it gives a top 5 report.  With top_n = 0 and bot_n = 5, the report returns the bottom 5.  Change the two parameters to see the output from the different combinations.

```{r customer_rentals_hi_low_dplr}
# con is the connection object opened at the top of the file.
top_n = 5
bot_n = 5
sp_print_df(customer_rentals_hi_low_dplr(con,top_n,bot_n))
```
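
As an alternative to the `cbind()` step in the function above, the two rankings can be computed with dplyr's `row_number()`, which mirrors the `ROW_NUMBER() OVER(...)` columns in the SQL version. The sketch below works on an invented, already-summarised tibble; `toy_summary` and `toy_hi_low()` are hypothetical names used only for illustration:

```r
library(dplyr)

# Hypothetical per-customer rental counts.
toy_summary <- tibble(id = 1:6, rentals = c(46L, 0L, 41L, 12L, 45L, 30L))

toy_hi_low <- function(df, high_n, low_n) {
  df %>%
    mutate(
      rent_hi_low = row_number(desc(rentals)),  # 1 = most rentals
      rent_low_hi = row_number(rentals)         # 1 = fewest rentals
    ) %>%
    filter(rent_hi_low <= high_n | rent_low_hi <= low_n) %>%
    arrange(rent_hi_low)
}

top2_bottom1 <- toy_hi_low(toy_summary, high_n = 2, low_n = 1)
bottom2      <- toy_hi_low(toy_summary, high_n = 0, low_n = 2)
top2_bottom1
```

Ties are broken by row position with `row_number()`, so customers with equal rental counts still receive distinct ranks, as they do with `ROW_NUMBER()` in SQL.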

### 28.  How much has each store collected?

How are the stores performing?  The SQL code shows the payments made to each store in the business.

```{r ex28-s, code_folding='unhide', tidy=TRUE}
store_payments_sql <- dbGetQuery(
  con,
  "select s.store_id,sum(p.amount) amount,count(*) cnt 
                   from payment p 
                        join staff s 
                          on p.staff_id = s.staff_id  
                 group by store_id order by 2 desc
                 ;
                "
)
sp_print_df(store_payments_sql)
```

`r sp_color_markdown_text('Each store collected just over 30,000 in revenue and each store had about 7300 rentals.','blue')`

#### Replicate the output above using dplyr syntax.


```{r ex28-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

store_payments_dplyr <- payment_table %>% 
  inner_join(staff_table,by='staff_id') %>%
  group_by(store_id) %>% 
  summarize(amount=sum(amount,na.rm=TRUE),cnt=n()) %>%  
  arrange(desc(amount)) %>% 
  collect()

sp_print_df(store_payments_dplyr)
```


### 29.  What is the business' distribution of payments?

To answer this question we look at the `rental`, `payment`, `inventory`, and `film` tables.

As a sanity check, we first check the number of rentals and the total amount of payments.


```{r ex29-s1, code_folding='unhide', tidy=TRUE}
rentals_payments_sql <- dbGetQuery(con,
"select 'rentals' rec_type, count(*) cnt_amt from rental
 union
 select 'payments' rec_type, sum(amount) from payment ")
sp_print_df(rentals_payments_sql)
```


```{r ex29-s2, code_folding='unhide',collapse=TRUE}
business_payment_dist_sql <- dbGetQuery(
  con,
 "select no_pay_rec_due
      ,no_pay_rec_cnt
      ,round(100.0 * no_pay_rec_cnt/rentals,2) no_pay_rec_pct
      ,rate_eq_paid
      ,rate_eq_paid_cnt
      ,round(100.0 * rate_eq_paid_cnt/rentals,2) rate_eq_paid_pct
      ,rate_lt_paid
      ,rate_lt_over_paid
      ,rate_lt_paid_cnt
      ,round(100.0 * rate_lt_paid_cnt/rentals,2) rate_lt_paid_pct
      ,rate_gt_paid_due
      ,rate_gt_paid_cnt
      ,round(100.0 * rate_gt_paid_cnt/rentals,2) rate_gt_paid_pct
      ,rentals
      ,rate_eq_paid_cnt + rate_lt_paid_cnt + rate_gt_paid_cnt payments
      ,round(100.0 * 
            (no_pay_rec_cnt + rate_eq_paid_cnt + rate_lt_paid_cnt + rate_gt_paid_cnt)/rentals
            ,2) pct
      ,rate_eq_paid + rate_lt_paid + rate_lt_over_paid amt_paid
      ,no_pay_rec_due + rate_gt_paid_due amt_due
  from (
        select sum(case when p.rental_id is null then rental_rate else 0 end ) no_pay_rec_due
              ,sum(case when p.rental_id is null then 1 else 0 end) no_pay_rec_cnt
              ,sum(case when f.rental_rate = p.amount 
                        then p.amount else 0 end) rate_eq_paid
              ,sum(case when f.rental_rate = p.amount 
                        then 1 else 0 end ) rate_eq_paid_cnt
              ,sum(case when f.rental_rate < p.amount 
                        then f.rental_rate else 0 end) rate_lt_paid
              ,sum(case when f.rental_rate < p.amount 
                        then p.amount-f.rental_rate else 0 end) rate_lt_over_paid
              ,sum(case when f.rental_rate < p.amount 
                        then 1 else 0 end) rate_lt_paid_cnt
              ,sum(case when f.rental_rate > p.amount 
                        then f.rental_rate - p.amount else 0 end ) rate_gt_paid_due
              ,sum(case when f.rental_rate > p.amount 
                        then 1 else 0 end ) rate_gt_paid_cnt
              ,count(*) rentals
            FROM rental r
                 LEFT JOIN payment p ON r.rental_id = p.rental_id and r.customer_id = p.customer_id
                 INNER JOIN inventory i ON r.inventory_id = i.inventory_id
                 INNER JOIN film f ON i.film_id = f.film_id
       ) as details
;"
)
# Rental counts
sp_print_df(business_payment_dist_sql %>% select(ends_with("cnt"),rentals))
# Payments
sp_print_df(business_payment_dist_sql %>% select(ends_with("paid")))
# Not paid amounts
sp_print_df(business_payment_dist_sql %>% select(ends_with("due")))
# Rental payments
sp_print_df(business_payment_dist_sql %>% select(ends_with("pct")))

```

`r sp_color_markdown_text('These are interesting results.  

*  9.06% of the rentals have no associated payment record; the amount due on them is 4302.47.
*  49.39% of the rentals have been paid in full, 23397.75.
*  41.40% of the rentals collected more than the rental rate, an over-collection of 18456.75.
*  0.15% of the rentals collected less than the rental rate, a shortfall of 67.76.
*  no_pay_rec_cnt + rate_gt_paid_cnt, $1453 + 24 = 1477$, is the number of rentals that have not been paid in full.
*  The total outstanding balance is $4302.47 + 67.76 = 4370.23$.

With over 40 percent of the rentals over-collected, someone needs to find out what is wrong with the collection process.  Many customers are owed credits or free rentals.
','blue')`
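
The arithmetic in the bullets above can be re-checked in a few lines of base R.  This is only a sketch: the counts and amounts are copied by hand from the bullets, the rental count of 16045 is taken from the sanity check for this exercise, and the variable names are made up for the illustration.  Nothing here queries the database.

```r
# Figures copied by hand from the results above; no database connection needed.
rentals        <- 16045    # rental count from the sanity check for this exercise
no_pay_rec_cnt <- 1453     # rentals without a payment record
rate_gt_cnt    <- 24       # rentals where less than the rental rate was paid
no_pay_rec_due <- 4302.47  # amount due on rentals without a payment record
rate_gt_due    <- 67.76    # shortfall on under-paid rentals

not_paid_in_full <- no_pay_rec_cnt + rate_gt_cnt
amt_due          <- no_pay_rec_due + rate_gt_due
pct_no_pay_rec   <- round(100 * no_pay_rec_cnt / rentals, 2)
pct_rate_gt      <- round(100 * rate_gt_cnt / rentals, 2)
pct_total        <- 9.06 + 49.39 + 41.40 + 0.15   # the four reported percentages

print(c(not_paid_in_full = not_paid_in_full, amt_due = amt_due))
print(c(pct_no_pay_rec = pct_no_pay_rec, pct_rate_gt = pct_rate_gt,
        pct_total = pct_total))
```

If the recomputed percentages match the reported ones and the four percentages sum to 100, every rental has been assigned to exactly one category.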

#### Replicate the output above using dplyr syntax.

This table describes the columns in the code block answer that follows.  Some payment records have a rental rate, the amount charged, that is less than the amount paid.  Each of these payments is split into two pieces: rate_lt_paid is the rental rate portion, and rate_lt_over_paid is the amount paid minus the rental rate, the overpayment.

Column           | Mapping             |Definition
-----------------|---------------------|-------------
no_pay_rec_cnt   |                     |number of DVD rentals without an associated payment record.
rate_eq_paid_cnt |                     |number of DVD rentals where the amount paid matches the film rental rate.
rate_lt_paid_cnt |                     |number of DVD rentals with a rental rate less than the amount paid.
rate_gt_paid_cnt |                     |number of DVD rentals with a rental rate greater than the amount paid.
rentals          |                     |number of rental records analyzed.
payments         |rate_eq_paid_cnt + rate_lt_paid_cnt + rate_gt_paid_cnt|number of rentals with a payment record.
rate_eq_paid     |                     |amount paid where the rate charged = amount paid.
rate_lt_paid     |                     |rental rate portion of the amount paid where the rate charged < amount paid.
rate_lt_over_paid|                     |amount over paid, amount paid - rate charged, where the rate charged < amount paid.
amt_paid         |rate_eq_paid + rate_lt_paid + rate_lt_over_paid|total amount paid.
no_pay_rec_due   |                     |DVD rental charges due without a payment record.
rate_gt_paid_due |                     |DVD rental charges due with a payment record, rate charged - amount paid.
amt_due          |no_pay_rec_due + rate_gt_paid_due|total amount due and not collected.
no_pay_rec_pct   |                     |percent of rentals without a payment record.
rate_eq_paid_pct |                     |percent of rentals where the rental charge equals the paid amount.
rate_lt_paid_pct |                     |percent of rentals where the rental charge is less than the paid amount.
rate_gt_paid_pct |                     |percent of rentals where the rental charge is greater than the paid amount.
pct              |                     |sum of percentages.
  
```{r ex29-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

# A payment belongs to a rental only when both rental_id and customer_id match.
business_payment_dist_dplyr <- rental_table %>%
  left_join(payment_table, by = c("rental_id", "customer_id")
            , suffix = c(".r", ".p")) %>%
  inner_join(inventory_table, by = "inventory_id", suffix = c(".r", ".i")) %>%
  inner_join(film_table, by = "film_id", suffix = c(".i", ".f")) %>%
  summarize(rentals = n()
           ,no_pay_rec_due = sum(ifelse(is.na(payment_id),rental_rate,0),na.rm = TRUE)
           ,no_pay_rec_cnt = sum(ifelse(is.na(payment_id),1,0),na.rm = TRUE)
           ,rate_eq_paid   = sum(ifelse(rental_rate == amount,amount,0),na.rm = TRUE)
           ,rate_eq_paid_cnt   = sum(ifelse(rental_rate == amount,1,0),na.rm = TRUE)
           ,rate_lt_paid = sum(ifelse(rental_rate < amount, rental_rate,0),na.rm = TRUE)
           ,rate_lt_over_paid  = sum(ifelse(rental_rate < amount,amount-rental_rate,0),na.rm = TRUE)
           ,rate_lt_paid_cnt  = sum(ifelse(rental_rate < amount,1,0),na.rm = TRUE)
           ,rate_gt_paid_due = sum(ifelse(amount < rental_rate,rental_rate-amount,0),na.rm = TRUE)
           ,rate_gt_paid_cnt = sum(ifelse(amount < rental_rate,1,0),na.rm = TRUE)
           ) %>%
  mutate(no_pay_rec_pct = round(100 * no_pay_rec_cnt/rentals,2)
        ,rate_eq_paid_pct   = round(100 * rate_eq_paid_cnt/rentals,2)
        ,rate_lt_paid_pct  = round(100 * rate_lt_paid_cnt/rentals,2)
        ,rate_gt_paid_pct = round(100 * rate_gt_paid_cnt/rentals,2)
        ,payments = rate_eq_paid_cnt + rate_lt_paid_cnt + rate_gt_paid_cnt
        ,amt_paid = rate_eq_paid + rate_lt_paid + rate_lt_over_paid
        ,pct = no_pay_rec_pct + rate_eq_paid_pct + rate_lt_paid_pct + rate_gt_paid_pct
        ,amt_due = no_pay_rec_due + rate_gt_paid_due
        ) %>% 
  select(no_pay_rec_due,no_pay_rec_cnt,no_pay_rec_pct
         ,rate_eq_paid,rate_eq_paid_cnt,rate_eq_paid_pct
         ,rate_lt_paid,rate_lt_over_paid,rate_lt_paid_cnt,rate_lt_paid_pct
         ,rate_gt_paid_due,rate_gt_paid_cnt,rate_gt_paid_pct
         ,rentals
         ,payments
         ,pct
         ,amt_paid
         ,amt_due) %>%
  collect()
    

# Rental counts
sp_print_df(business_payment_dist_dplyr %>% select(ends_with("cnt"),rentals))
# Payments
sp_print_df(business_payment_dist_dplyr %>% select(ends_with("paid")))
# Not paid amounts
sp_print_df(business_payment_dist_dplyr %>% select(ends_with("due")))
# Rental payments
sp_print_df(business_payment_dist_dplyr %>% select(ends_with("pct")))

```
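
The way an overpayment is split into a rate portion and an excess can be seen on a small local data frame.  The sketch below assumes made-up rows (`toy` and its values are hypothetical, not taken from the database) and reuses the column names from the table above.

```r
library(dplyr)

# Hypothetical toy data: three rentals with a payment record, one without.
toy <- tibble(
  rental_id   = 1:4,
  rental_rate = c(2.99, 2.99, 4.99, 0.99),
  amount      = c(2.99, 5.99, 0.00, NA)
)

toy_summary <- toy %>%
  summarize(
    rate_eq_paid      = sum(ifelse(rental_rate == amount, amount, 0), na.rm = TRUE),
    # an overpayment contributes its rate portion here ...
    rate_lt_paid      = sum(ifelse(rental_rate < amount, rental_rate, 0), na.rm = TRUE),
    # ... and its excess over the rate here
    rate_lt_over_paid = sum(ifelse(rental_rate < amount, amount - rental_rate, 0), na.rm = TRUE),
    rate_gt_paid_due  = sum(ifelse(rental_rate > amount, rental_rate - amount, 0), na.rm = TRUE),
    no_pay_rec_due    = sum(ifelse(is.na(amount), rental_rate, 0))
  ) %>%
  mutate(
    amt_paid = rate_eq_paid + rate_lt_paid + rate_lt_over_paid,
    amt_due  = no_pay_rec_due + rate_gt_paid_due
  )

print(toy_summary)
```

Because the two overpayment pieces add back up to the payment, `amt_paid` equals the plain sum of the `amount` column, which makes a convenient cross-check against the payments sanity number.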

#### Bad data analysis

Here are the sanity check numbers calculated at the beginning of this exercise.  

  rec_type |cnt_amt
-----------|-------
payments   |61312.04
 rentals   |16045.00
 
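Note that the sanity check numbers do not match the totals produced by the queries above.  If your query returned different numbers, use the following result set to see where the differences come from.  It contrasts the correct join, on both rental_id and customer_id, with the incorrect join on rental_id alone.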

```{r bad data help}
rs <- dbGetQuery(
  con,
 "SELECT  'correct join' hint,r.rental_id,r.customer_id,p.customer_id payment_customer_id,p.rental_id payment_rental_id,p.amount
    FROM rental r
         LEFT JOIN payment p ON r.rental_id = p.rental_id and r.customer_id = p.customer_id
   where r.rental_id = 4591
  UNION 
 SELECT  'incorrect join' hint,r.rental_id,r.customer_id,p.customer_id payment_customer_id,p.rental_id payment_rental_id,p.amount
    FROM rental r
         LEFT JOIN payment p ON r.rental_id = p.rental_id
   where r.rental_id = 4591
     and p.customer_id != 182
;")
sp_print_df(head(rs))
```
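
The effect of the join keys can also be reproduced without the database.  The sketch below uses made-up payment rows (the customer ids other than 182 and all amounts are hypothetical) to show how joining on rental_id alone fans one rental out to several payment rows.

```r
library(dplyr)

# Hypothetical rows: one rental, and three payment rows that share its
# rental_id, only one of which belongs to the renting customer.
rental_toy <- tibble(rental_id = 4591L, customer_id = 182L)
payment_toy <- tibble(
  rental_id   = c(4591L, 4591L, 4591L),
  customer_id = c(182L, 16L, 259L),
  amount      = c(3.99, 1.99, 1.99)
)

# Correct join: both keys must match, so the rental keeps a single row.
correct_join <- rental_toy %>%
  left_join(payment_toy, by = c("rental_id", "customer_id"))

# Incorrect join: rental_id alone matches every payment row for that id.
incorrect_join <- rental_toy %>%
  left_join(payment_toy, by = "rental_id", suffix = c(".r", ".p"))

print(correct_join)
print(incorrect_join)
print(c(correct   = sum(correct_join$amount),
        incorrect = sum(incorrect_join$amount)))
```

With the single-key join, both the row count and the summed amount are inflated, which is the kind of discrepancy the sanity check is meant to expose.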

### 30.  Which customers have the highest open amounts?

From the previous exercise, we know that 1477 rentals either have no payment record or have not been paid in full.  List the top 5 customers from each category based on balance due.

To answer this question we look at the `rental`, `payment`, `inventory`, `film` and `customer` tables.

```{r ex30-s, code_folding='unhide', tidy=TRUE}

customer_open_amts_sql <- dbGetQuery(
  con,
"  select customer_id
         ,concat(first_name,' ',last_name) customer
         ,pay_record
         ,rental_amt
         ,paid_amt
         ,due_amt
         ,cnt
         ,rn
  from (select c.customer_id
              ,c.first_name
              ,c.last_name
              ,case when p.amount is null then 'No' else 'Yes' end Pay_record
              ,sum(f.rental_rate) rental_amt
              ,sum(coalesce(p.amount,0))  paid_amt
              ,sum(f.rental_rate - coalesce(p.amount,0)) due_amt
              ,count(*) cnt
              ,row_number() over (partition by case when p.amount is null then 'No' else 'Yes' end
                                  order by sum(f.rental_rate - coalesce(p.amount,0)) desc,c.customer_id) rn
          FROM rental r
               LEFT JOIN payment p
                 ON r.rental_id = p.rental_id and r.customer_id = p.customer_id
               INNER JOIN inventory i
                 ON r.inventory_id = i.inventory_id
               INNER JOIN film f
                 ON i.film_id = f.film_id
               INNER JOIN customer c
                 ON r.customer_id = c.customer_id
        WHERE f.rental_rate > coalesce(p.amount, 0)
       group by c.customer_id,c.first_name,c.last_name,case when p.amount is null then 'No' else 'Yes' end
       ) as src
  where rn <= 5 -- and Pay_record = 'No' or Pay_record = 'Yes'
  order by Pay_record,rn
")
sp_print_df(customer_open_amts_sql)
```

`r sp_color_markdown_text("From the previous exercise we see that 1477 rentals have not been paid in full.  The 24 rentals that have a payment record, pay_record = 'Yes', all have a paid amount of 0.  The other 1453 rented DVDs have no payment record.  The top 3 customers each have 10 DVDs that have not been paid for.
",'blue')`


#### Replicate the output above using dplyr syntax.

column      | definition            | mapping
------------|-----------------------|------------------------------------------------------------
customer    | first_name + last_name|
pay_record  | payment record exists, Yes/No| case when p.amount is null then 'No' else 'Yes' end
rental_amt  | aggregated film.rental_rate|
paid_amt    | aggregated payment.amount |
due_amt     | aggregated film.rental_rate - payment.amount|
cnt         | number of rentals/customer|
rn          | row number within each pay_record category, ordered by due_amt descending|


```{r ex30-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

# rn replicates the SQL row_number(): rank by due_amt within pay_record,
# with ties broken by customer_id.
customer_open_amts_dplyr <- rental_table %>%
  left_join(payment_table
           , by = c("rental_id", "customer_id")
           , suffix = c(".r", ".p")) %>%
  inner_join(inventory_table, by = "inventory_id", suffix = c(".r", ".i")) %>%
  inner_join(film_table, by = "film_id", suffix = c(".i", ".f")) %>%
  inner_join(customer_table, by = "customer_id") %>%
  filter(rental_rate > ifelse(is.na(amount), 0,amount)) %>%
  mutate(customer=paste0(first_name,' ',last_name)
         ,pay_record = ifelse(is.na(amount),'No','Yes')
         ,paid = ifelse(is.na(amount),0,amount)
        ) %>%
  group_by(customer_id,customer,pay_record) %>%    
  summarize(rental_amt = sum(rental_rate, na.rm = TRUE)
           ,paid_amt = sum(paid, na.rm = TRUE)
           ,due_amt  = sum(rental_rate - paid, na.rm = TRUE)
           ,cnt = n()
           ) %>%
  arrange(pay_record,desc(due_amt),customer_id) %>% 
  group_by(pay_record) %>% 
  mutate(rn = row_number()) %>%
  filter(rn <= 5) %>%
  select(customer_id,customer,pay_record,rental_amt,paid_amt,due_amt,cnt,rn) %>% 
  collect()

sp_print_df(customer_open_amts_dplyr)
```
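
The "top N rows per category" pattern used in this exercise can be tried on a local data frame.  This is a minimal sketch, assuming made-up balances; `balances` and `top_n_per_group()` are hypothetical names introduced only for the illustration.

```r
library(dplyr)

# Hypothetical per-customer balances, already aggregated.
balances <- tibble(
  customer_id = 1:6,
  pay_record  = c("No", "No", "No", "Yes", "Yes", "Yes"),
  due_amt     = c(10.5, 30.0, 30.0, 2.0, 0.5, 4.0)
)

top_n_per_group <- function(df, n) {
  df %>%
    # sort first so that row_number() numbers the rows in ranked order;
    # customer_id breaks ties, as in the SQL order by clause
    arrange(pay_record, desc(due_amt), customer_id) %>%
    group_by(pay_record) %>%
    mutate(rn = row_number()) %>%
    filter(rn <= n) %>%
    ungroup()
}

top2 <- top_n_per_group(balances, 2)
print(top2)
```

Without the tie-breaker, customers 2 and 3, who owe the same amount, could be numbered in either order, so two runs of the query might not return the same ranking.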


### 31.  What is the business' cash flow?

In exercise 29 we saw that about 50% of the rentals collected the correct amount and about 40% of the rentals over-collected.  The remaining 10% or so were not collected in full.

Calculate the number of days it took to collect each payment and the amount collected.

To answer this question we look at the `rental`, `customer`, `inventory`, `payment` and `film` tables.

```{r ex31-s, code_folding='unhide', tidy=TRUE}
cash_flow_sql <- dbGetQuery(con,
"SELECT payment_date - exp_rtn_dt payment_days
    ,sum(coalesce(amount, charges)) paid_or_due
    ,count(*) late_returns
FROM (
    SELECT payment_date::DATE 
        ,(r.rental_date + INTERVAL '1 day' * f.rental_duration)::DATE exp_rtn_dt
        ,p.amount 
        ,f.rental_rate charges
        ,r.rental_date
        ,r.return_date
    FROM rental r
         LEFT JOIN customer c ON c.customer_id = r.customer_id
         LEFT JOIN address a  ON c.address_id = a.address_id
         LEFT JOIN city       ON city.city_id = a.city_id
         LEFT JOIN country ctry ON ctry.country_id = city.country_id
         LEFT JOIN inventory i  ON r.inventory_id = i.inventory_id
         LEFT JOIN payment p    ON c.customer_id = p.customer_id 
                               AND p.rental_id = r.rental_id
         LEFT JOIN film f       ON i.film_id = f.film_id
    WHERE return_date > (r.rental_date + INTERVAL '1 day' * f.rental_duration)::DATE 
    ) AS src
GROUP BY payment_date - exp_rtn_dt
ORDER BY payment_date - exp_rtn_dt DESC")

sp_print_df(cash_flow_sql)
```

`r sp_color_markdown_text("Wow, those are really generous terms.  Customers are paying 1.2 to 1.7 years after the DVD was due back.  This business is in serious financial trouble!
",'blue')`

#### Replicate the output above using dplyr syntax.

column      | definition            | mapping
------------|-----------------------|------------------------------------------------------------
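paid_or_due |paid amt associated with rental or the rental_rate |ifelse(is.na(amount),rental_rate,amount)
payment_days|days from expected return date to payment | payment_date - (rental_date + rental_duration)
late_returns|number of late returns | count of rentals where return_date > expected return date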

```{r ex31-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

cash_flow_dplyr <- rental_table %>%
  left_join(payment_table, by=c('rental_id','customer_id')) %>%
  left_join(inventory_table, by='inventory_id') %>%
  left_join(film_table, by='film_id') %>%
  mutate(pay_dt = date(payment_date)
        ,exp_rtn_dt = date(rental_date) + rental_duration
        ,rdate=date(rental_date)
        ,payment_days = date(payment_date) - (date(rental_date) + rental_duration)
        ) %>%
  filter(return_date > exp_rtn_dt) %>%
  group_by(payment_days) %>%
  summarize(paid_or_due=sum(ifelse(is.na(amount),rental_rate,amount), na.rm = TRUE)
           ,late_returns=n()
           ) %>%
  arrange(desc(payment_days)) %>%
  select(payment_days,paid_or_due,late_returns) %>%
  collect()

sp_print_df(cash_flow_dplyr)
```
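
The date arithmetic behind `exp_rtn_dt` and `payment_days` can be worked through for a single rental in base R.  The timestamps and the 7 day rental duration below are made-up values chosen for illustration, not rows from the database.

```r
# Hypothetical values for a single rental.
rental_date     <- as.POSIXct("2005-05-24 22:53:30", tz = "UTC")
return_date     <- as.POSIXct("2005-06-02 10:15:00", tz = "UTC")
payment_date    <- as.POSIXct("2007-02-15 22:25:46", tz = "UTC")
rental_duration <- 7L   # days, as stored in film.rental_duration

# Dropping the time of day first mirrors the ::DATE casts in the SQL.
exp_rtn_dt   <- as.Date(rental_date, tz = "UTC") + rental_duration
late_return  <- as.Date(return_date, tz = "UTC") > exp_rtn_dt
payment_days <- as.integer(as.Date(payment_date, tz = "UTC") - exp_rtn_dt)

print(exp_rtn_dt)
print(late_return)
print(payment_days)
print(round(payment_days / 365, 1))   # the same gap expressed in years
```

Note that `payment_days` is measured from the expected return date, not from the rental date or the actual return date.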

### 32.  Customer information

Create a function that takes a customer id and returns

*  customer address information
*  films rented and returned information
*  customer payment information

The hidden code block implements such a function in SQL.

To answer this question we look at the `rental`, `customer`, `address`, `city`, `country`, `inventory`, `payment` and `film` tables.  

```{r ex32-s, code_folding='unhide', tidy=TRUE}
customer_details_fn_sql <- function(cust_id) {
    customer_details_sql <- dbGetQuery(con,
    "select c.customer_id id,concat(first_name,' ',c.last_name) customer
          ,c.email,a.phone,a.address,address2,city.city,a.postal_code,ctry.country
          ,c.store_id cust_store_id
          ,i.store_id inv_store_id
          ,f.film_id
          ,f.title
          ,r.rental_date::date rented
          ,r.return_date::date returned
          ,(r.rental_date + INTERVAL '1 day'  * f.rental_duration)::date exp_rtn_dt
          ,case when r.return_date is null
                then null
                else r.return_date::date  - (r.rental_date + INTERVAL '1 day'  * f.rental_duration)::date
           end rtn_stat
          ,case when r.rental_id is null
                then null
                      -- dvd not yet returned
                when r.return_date is null
                then 1
                else 0
           end not_rtn
          ,payment_date::date pay_dt
          ,f.rental_rate charges
          ,p.amount paid
          ,p.amount-f.rental_rate delta
          ,p.staff_id pay_staff_id
          ,payment_date::date - rental_date::date pay_days
          ,r.rental_id,i.inventory_id,payment_id
      from customer c left join rental r on c.customer_id = r.customer_id
                      left join address a on c.address_id = a.address_id
                      left join city on city.city_id = a.city_id
                      left join country ctry on ctry.country_id = city.country_id
                      left join inventory i on r.inventory_id = i.inventory_id
                      left join payment p on c.customer_id = p.customer_id and p.rental_id = r.rental_id
                      left join film f on i.film_id = f.film_id
     where c.customer_id = $1
    order by id,rented desc
    "
    ,params = list(cust_id)
    )
    return(customer_details_sql)
}
```

The following code block executes the customer function. Change the `cust_id` value to see different customers.

```{r customer_details_fn_sql}
cust_id <- 600
sp_print_df( customer_details_fn_sql(cust_id))

```

#### Replicate the output above using dplyr syntax.

column      | definition            | mapping
------------|-----------------------|------------------------------------------------------------
id          |customer_id            | 
customer    |first_name + last_name |
exp_rtn_dt  |expected return date   | rental.rental_date + film.rental_duration
rtn_stat    |return status, days late if positive | rental.return_date - (rental.rental_date + film.rental_duration)
not_rtn     |dvd not returned       | null if rental_id is null (nothing rented); 1 if return_date is null; else 0
pay_dt      |payment date           | payment.payment_date
delta       |amount paid minus amount charged | payment.amount - film.rental_rate
pay_staff_id|staff who took the payment | payment.staff_id
pay_days    |days until payment     | payment_date - rental_date

```{r ex32-d, include=INCLUDE_OUTPUT, tidy=TRUE}
# sp_tbl_descr('table_name')
# sp_tbl_pk_fk('table_name')
# sp_print_df(table_rows_sql)

customer_details_fn_dplyr <- function(cust_id) {

customer_details_dplyr <- customer_table %>%
    left_join(rental_table, by=c('customer_id'='customer_id')) %>%    
    left_join(address_table, by=c('address_id'='address_id')) %>%
    left_join(city_table,by=c('city_id'='city_id'))  %>%
    left_join(country_table,by=c('country_id'='country_id'))  %>%
    left_join(inventory_table,by=c('inventory_id'='inventory_id')) %>% 
    mutate(inv_store_id = store_id.y) %>%    
    left_join(payment_table,by=c('customer_id'='customer_id','rental_id'='rental_id'))  %>%
    left_join(film_table,by=c('film_id'='film_id')) %>%
    filter(customer_id == cust_id ) %>% 
    mutate(customer=paste0(first_name,' ',last_name)
          ,exp_rtn_dt = date(rental_date) + rental_duration
          ,rtn_stat = date(return_date) - (date(rental_date) + rental_duration)
          ,rented = date(rental_date)
          ,returned = date(return_date)
          ,not_rtn=ifelse(is.na(rental_id),rental_id,ifelse(is.na(return_date),1,0))
          ,delta = amount-rental_rate
          ,pay_dt = date(payment_date)
          ,pay_days = date(payment_date) - date(rental_date)
          ) %>%
    rename(id = customer_id
          ,cust_store_id = store_id.x
          ,charges = rental_rate
          ,paid = amount
          ,pay_staff_id = staff_id.y
          ) %>%
    select(id,customer,email,phone,address,address2,city,postal_code,country
          ,cust_store_id
          ,inv_store_id
          ,film_id,title,rented,returned
          ,exp_rtn_dt,rtn_stat,not_rtn
          ,pay_dt
          ,charges,paid,delta,pay_staff_id
          ,pay_days,rental_id,inventory_id,payment_id
          ) %>% 
  collect()

return(customer_details_dplyr)
}
```

Use the following code block to test the dplyr function.

```{r customer_details_fn_dplyr}
cust_id <- 601
sp_print_df(customer_details_fn_dplyr(cust_id))
```

## Different strategies for interacting with the database

DBI offers two ways to run a select statement.

*  `dbGetQuery` returns the entire result set as a data frame.  For large result sets, or for complex or inefficient SQL statements, this may take a long time.
*  `dbSendQuery` submits the statement to the database for execution and returns a result object without retrieving any rows.
*  `dbFetch` retrieves the next batch of rows from the result object.
*  `dbClearResult` frees the resources, in R and on the database, held by the result object.

### 1.  dbGetQuery Versus dbSendQuery+dbFetch+dbClearResult

How many customers are there in the DVD Rental System?

```{r dbGetQuery, code_folding='unhide', tidy=TRUE}
rs1 <- dbGetQuery(con, "select * from customer;")
sp_print_df(head(rs1))

fetch <- 0
rows <- 0
pco <- dbSendQuery(con, "select * from customer;")
while(!dbHasCompleted(pco)) {
    rs2 <- dbFetch(pco,n=100)
    fetch <- fetch + 1
    rows <- rows + nrow(rs2)
    print(paste0("fetch=",fetch," fetched rows=",nrow(rs2)," running rows fetched=",rows))
    # add additional code to process fetched records
}    
dbClearResult(pco)
# rs2 holds only the last batch fetched; rows holds the total row count
sp_print_df(head(rs2))
```
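
The loop above overwrites `rs2` on every pass, so only the last batch survives.  When all rows are needed, each batch can be kept and the batches combined at the end.  The sketch below assumes the RSQLite package is installed and uses an in-memory SQLite database in place of the PostgreSQL connection; `con_demo` and `customer_demo` are hypothetical names.

```r
library(DBI)

# Assumption: RSQLite is installed; it stands in for the PostgreSQL connection.
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "customer_demo", data.frame(customer_id = 1:250))

res <- dbSendQuery(con_demo, "select * from customer_demo")
batches <- list()
while (!dbHasCompleted(res)) {
  batch <- dbFetch(res, n = 100)
  # keep every non-empty batch instead of overwriting the previous one
  if (nrow(batch) > 0) {
    batches[[length(batches) + 1]] <- batch
  }
}
dbClearResult(res)
dbDisconnect(con_demo)

all_rows <- do.call(rbind, batches)
print(sapply(batches, nrow))   # rows per batch
print(nrow(all_rows))          # total rows fetched
```

Batches are usually processed one at a time to keep memory use low; combining them, as done here, only makes sense when the full result fits in memory.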

### 2.  Write dplyr results to the database

```{r compute, tidy=TRUE}
smy_customer_details_dplyr <- customer_table %>%
    left_join(rental_table, by=c('customer_id'='customer_id')) %>%    
    left_join(address_table, by=c('address_id'='address_id')) %>%
    left_join(city_table,by=c('city_id'='city_id'))  %>%
    left_join(country_table,by=c('country_id'='country_id'))  %>%
    left_join(inventory_table,by=c('inventory_id'='inventory_id')) %>% 
    mutate(inv_store_id = store_id.y) %>%    
    left_join(payment_table,by=c('customer_id'='customer_id','rental_id'='rental_id'))  %>%
    left_join(film_table,by=c('film_id'='film_id')) %>%

    mutate(customer=paste0(first_name,' ',last_name)
          ,exp_rtn_dt = date(rental_date) + rental_duration
          ,rtn_days= date(return_date) - (date(rental_date) + rental_duration)
          ,rented = Date(rental_date)
          ,returned = Date(return_date)
          ,not_rtn=ifelse(is.na(rental_id),rental_id,ifelse(is.na(return_date),1,0))
          ,delta = amount-rental_rate
          ,pay_days = date(payment_date) - (date(rental_date) + rental_duration)
          ) %>%
    rename(id = customer_id
          ,cust_store_id = store_id.x
          ,charges = rental_rate
          ,paid = amount
          ,pay_dt = payment_date
          ,pay_staff_id = staff_id.y
          ) %>%
    select(id,customer,email,phone,address,address2,city,postal_code,country
          ,cust_store_id
          ,inv_store_id
          ,film_id,title,rented,returned
          ,exp_rtn_dt,rtn_days,not_rtn
          ,pay_dt
          ,charges,paid,delta,pay_staff_id
          ,pay_days,film_id,rental_id,inventory_id,payment_id
          ) 

# drop the table if it already exists
if(dbExistsTable(con,'smy_compute_exercise')) {
  dbRemoveTable(con,'smy_compute_exercise')
}
# create the database table
compute(smy_customer_details_dplyr,name='smy_compute_exercise',temporary = FALSE)

  
```

## Disconnect from the database and stop Docker

```{r}
dbDisconnect(con)
sp_docker_stop("adventureworks")
```

```{r}
knitr::knit_exit()
```