Skip to content

Commit

Permalink
versioning permission note
Browse files Browse the repository at this point in the history
  • Loading branch information
tomwilsonsco committed Aug 12, 2024
1 parent 9930cb0 commit f005357
Show file tree
Hide file tree
Showing 2 changed files with 62 additions and 37 deletions.
93 changes: 56 additions & 37 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -16,63 +16,74 @@ knitr::opts_chunk$set(
# RtoSQLServer

<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![R-CMD-check](https://github.com/DataScienceScotland/RtoSQLServer/workflows/R-CMD-check/badge.svg)](https://github.com/DataScienceScotland/RtoSQLServer/actions)

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental) [![R-CMD-check](https://github.com/DataScienceScotland/RtoSQLServer/workflows/R-CMD-check/badge.svg)](https://github.com/DataScienceScotland/RtoSQLServer/actions)

<!-- badges: end -->

R package aiming to make importing of R dataframes into a MS SQL Server database more robust.

## Installation

R package can be installed directly from Github or locally from zip.
R package can be installed directly from Github or locally from zip.\
To install directly from GitHub:

```r
``` r
remotes::install_github("DataScienceScotland/rtosqlserver", upgrade = "never")
```

If the above does not work, install by downloading:

1. Download the [zip of the repository](https://github.com/DataScienceScotland/RtoSQLServer/archive/refs/heads/main.zip) from GitHub.
2. Save the downloaded zip to a specific directory (e.g. C:/temp).
3. Install with this command specifying the path to the downloaded zip:
```r
1. Download the [zip of the repository](https://github.com/DataScienceScotland/RtoSQLServer/archive/refs/heads/main.zip) from GitHub.
2. Save the downloaded zip to a specific directory (e.g. C:/temp).
3. Install with this command specifying the path to the downloaded zip:

``` r
remotes::install_local("C:/temp/RtoSQLServer-main.zip", upgrade="never")
```

## Functionality

As well as loading R dataframes into SQL Server databases, functions are also currently available to:
As well as loading R dataframes into SQL Server databases, functions are also currently available to:

- Read a database table into an R dataframe, optionally specifying a subset of table columns or row filter.
- Drop a table from the database.
- Run any other input sql in the database and return a dataframe if a select statement.
- Read a database table into an R dataframe, optionally specifying a subset of table columns or row filter.
- Drop a table from the database.
- Run any other input sql in the database and return a dataframe if a select statement.

It is recommend to ensure using the latest versions of [Rcpp](https://cran.r-project.org/web/packages/Rcpp/index.html), [odbc](https://cran.r-project.org/web/packages/odbc/index.html) and [DBI](https://cran.r-project.org/web/packages/DBI/index.html). If using Windows in a secure environment install these from source or a Windows binary compiled at the version of R you are using.

## Loading method used

When loading an R dataframe into SQL Server using `write_dataframe_to_db`, following steps are followed:

1. The R dataframe is loaded into a staging table in the database in batches of n rows at a time.
1. The R dataframe is loaded into a staging table in the database in batches of n rows at a time.

2.

2. a) If table of the specified name does NOT already exist in the database schema:
i) Create target table in the database.
ii) Insert all rows from staging table to target table.
a) If table of the specified name does NOT already exist in the database schema:
i) Create target table in the database.\
ii) Insert all rows from staging table to target table.

3. b) If table of same name does already exist in the database schema:
3.

If 'append_to_existing'=FALSE (this will result in an overwrite):
i) Drop the existing copy of the target table and create a new one from staging table definition.
ii) Insert all rows from staging table into target table.
b) If table of same name does already exist in the database schema:

If 'append_to_existing'=TRUE:
i) Check that staging table columns and existing target table columns are the same. If not, cancel loading and give a warning.
ii) If check passes, insert all rows from staging table into target table.
If 'append_to_existing'=FALSE (this will result in an overwrite):

4. Delete the staging table.
i) Drop the existing copy of the target table and create a new one from staging table definition.\
ii) Insert all rows from staging table into target table.

If 'append_to_existing'=TRUE:

i) Check that staging table columns and existing target table columns are the same. If not, cancel loading and give a warning.\
ii) If check passes, insert all rows from staging table into target table.

4. Delete the staging table.

## Example Usage

Here is an example using the main functions:

```{r eval=FALSE}
# Make a test dataframe with n rows
test_n_rows <- 1234567
Expand Down Expand Up @@ -137,10 +148,12 @@ drop_table_from_db(
table_name = "test_r_tbl"
)
```

## Read database table with column or row filtering
The `read_table_from_db` function includes optional arguments so you can read only specific columns into an R dataframe or so you can filter which rows are read.

To select only specific columns, specify a vector of column_names for example `c("Sepal_Length", "Species")` for the `columns` argument of `read_table_from_db`.
The `read_table_from_db` function includes optional arguments so you can read only specific columns into an R dataframe or so you can filter which rows are read.

To select only specific columns, specify a vector of column_names for example `c("Sepal_Length", "Species")` for the `columns` argument of `read_table_from_db`.

To select only specific rows, specify a filter string using R syntax (not SQL syntax) for the `filter_stmt` argument. For example `"(Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0"`. The function will convert this to SQL.

Expand All @@ -159,41 +172,47 @@ db_test_iris <- read_table_from_db(
include_pk = FALSE
)
```
For the `filter_stmt` argument, this should be passed as a string. If you find the syntax confusing when wrapping your filter within quotation marks, then instead try using `deparse1(substitute())`. For example, the above filter example could be passed as `deparse1(substitute((Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0))`.

For the `filter_stmt` argument, this should be passed as a string. If you find the syntax confusing when wrapping your filter within quotation marks, then instead try using `deparse1(substitute())`. For example, the above filter example could be passed as `deparse1(substitute((Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0))`.

If you are comfortable with SQL, you may prefer to use the `execute_sql` function where a SQL select statement can be input to return an R dataframe from a database.

## Background Information - Why use this package

The `RtoSQLServer` package relies on `DBI` and `odbc` packages for its database functionality, however you should consider using it instead of `DBI` or `odbc` for the following reasons:

#### Importing large dataframes
The main reason to use the package is to ensure the data loading process is more robust than using `DBI` / `odbc` packages alone.

When loading large dataframes from R into MS SQL Server using `DBI` it was found that the loading process will often hang and never complete.
The main reason to use the package is to ensure the data loading process is more robust than using `DBI` / `odbc` packages alone.

When loading large dataframes from R into MS SQL Server using `DBI` it was found that the loading process will often hang and never complete.

People using R to load data into SQL Server may be working at home and have varying quality of connection to the database server.

`RtoSQLServer` does two things to help with data loading:

1. Loading the dataframe into a staging table in the database first of all. When the load into the staging table is complete, the target table is created as a copy of the staging table. This helps to prevent misleading problems of incomplete loads. The staging table is dropped once the target table has been created successfully.
1. Loading the dataframe into a staging table in the database first of all. When the load into the staging table is complete, the target table is created as a copy of the staging table. This helps to prevent misleading problems of incomplete loads. The staging table is dropped once the target table has been created successfully.

2. Loading in batches. The `RtoSQLServer` package's `write_dataframe_to_db` function includes a `batch_size` argument. Batch size is the number of rows of the R dataframe that are loaded into the database staging table at once. Using the batch loading process in `RtoSQLServer` it has been possible to load large tables it was not possible to load in one using `DBI`. The batch size is set for the user at a default of 100K so the user does not need to worry about a suitable batch size or splitting up the source dataframe in R.
2. Loading in batches. The `RtoSQLServer` package's `write_dataframe_to_db` function includes a `batch_size` argument. Batch size is the number of rows of the R dataframe that are loaded into the database staging table at once. Using the batch loading process in `RtoSQLServer` it has been possible to load large tables it was not possible to load in one using `DBI`. The batch size is set for the user at a default of 100K so the user does not need to worry about a suitable batch size or splitting up the source dataframe in R.

#### Self-contained database transactions
#### Self-contained database transactions

Unlike `DBI` and `odbc` where user must open a database connection object in R and then pass this connection to functions (before hopefully remembering to close it at the end of the R session), in `RtoSQLServer` a database connection is established when calling each function and closed at the end of that function. This means there are not hung connections to the database, waiting until the user closes R.
Unlike `DBI` and `odbc` where user must open a database connection object in R and then pass this connection to functions (before hopefully remembering to close it at the end of the R session), in `RtoSQLServer` a database connection is established when calling each function and closed at the end of that function. This means there are not hung connections to the database, waiting until the user closes R.

The `RtoSQLServer` functions can be used within user custom functions as self-contained tasks and do not rely on an open connection input argument. This also means a user does not need to know how to setup a connection, they simply pass the server and database name to the function. This can make it easier to run code others have written.
The `RtoSQLServer` functions can be used within user custom functions as self-contained tasks and do not rely on an open connection input argument. This also means a user does not need to know how to setup a connection, they simply pass the server and database name to the function. This can make it easier to run code others have written.

Repeatedly opening and closing connections might make things slightly slower, however it is thought this will not have much impact on the workflow of the average R user where there will be only a few database transactions in an R session.
Repeatedly opening and closing connections might make things slightly slower, however it is thought this will not have much impact on the workflow of the average R user where there will be only a few database transactions in an R session.

#### Checking of table, column names and datatypes from R to SQL Server

The `DBI` and `odbc` packages are designed to work with many different database engines. The `RtoSQLServer` package has focused on Microsoft SQL Server. This allows checks to be written to ensure the table name is compatible with SQL Server when being read by an ODBC driver.
The `DBI` and `odbc` packages are designed to work with many different database engines. The `RtoSQLServer` package has focused on Microsoft SQL Server. This allows checks to be written to ensure the table name is compatible with SQL Server when being read by an ODBC driver.

The package maps R dataframe datatypes to specific SQL Server datatypes ensuring consistent, predictable data storage. The current max length of string in an R character column vector is used to select the size of the SQL Server `nvarchar()` database column.

#### System Versioning
`RtoSQLServer` allows a user to create a table with MS SQL Server [System Versioning](https://docs.microsoft.com/en-us/sql/relational-databases/tables/creating-a-system-versioned-temporal-table?view=sql-server-ver15) enabled. To do this using `RtoSQLServer` the `write_dataframe_to_db` function is used with the `versioned_table` argument set to TRUE.

`RtoSQLServer` allows a user to create a table with MS SQL Server [System Versioning](https://docs.microsoft.com/en-us/sql/relational-databases/tables/creating-a-system-versioned-temporal-table?view=sql-server-ver15) enabled. To do this using `RtoSQLServer` the `write_dataframe_to_db` function is used with the `versioned_table` argument set to TRUE.

System Versioning adds date / timestamp (SQL Server datatype `DATETIME2`) start and end columns to the table. A `<table name>History` table is created in the database schema, storing the history of previous records following updates and deletes. This allows users to maintain an archive of their table setup in R, without needing to know the SQL used to establish System Versioning as part of a `create table` SQL expression.

Note dropping a system versioned table requires extra privileges and you may need to contact a system administrator to do this if you get a warning the table is versioned followed by an error when using the package's drop_table_from_db function.
6 changes: 6 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
[![Lifecycle:
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![R-CMD-check](https://github.com/DataScienceScotland/RtoSQLServer/workflows/R-CMD-check/badge.svg)](https://github.com/DataScienceScotland/RtoSQLServer/actions)

<!-- badges: end -->

R package aiming to make importing of R dataframes into a MS SQL Server
Expand Down Expand Up @@ -278,3 +279,8 @@ records following updates and deletes. This allows users to maintain an
archive of their table setup in R, without needing to know the SQL used
to establish System Versioning as part of a `create table` SQL
expression.

Note dropping a system versioned table requires extra privileges and you
may need to contact a system administrator to do this if you get a
warning the table is versioned followed by an error when using the
package’s drop_table_from_db function.

0 comments on commit f005357

Please sign in to comment.