Add initial files
jamesmbaazam committed Mar 23, 2024
1 parent eb07d56 commit 4c35148
Showing 138 changed files with 9,454 additions and 28 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -4,4 +4,3 @@
.Ruserdata

/.quarto/
/_site/
62 changes: 61 additions & 1 deletion README.md
@@ -1 +1,61 @@
# My Quarto Presentation Template
# Intro to Arrow Workshop

by Steph Hazlitt & Nic Crane


### Workshop Website

This repository contains materials for the **Intro to Arrow** workshop.

### Workshop Overview

This workshop will focus on using the arrow R package, a mature R interface to Apache Arrow, to process larger-than-memory files and multi-file datasets using familiar dplyr syntax. You'll learn to create and use interoperable data file formats like Parquet for efficient data storage and access, and also how to exercise fine control over data types to avoid common large data pipeline problems. This workshop will provide a foundation for using Arrow, giving you access to a powerful suite of tools for performant analysis of larger-than-memory data in R.

*This course is for you if you:*

- want to learn how to work with tabular data that is too large to fit in memory using existing R and tidyverse syntax implemented in Arrow
- want to learn about Parquet and other file formats that are powerful alternatives to CSV files
- want to learn how to engineer your tabular data storage for more performant access and analysis with Apache Arrow

### Workshop Prework

Detailed instructions for software requirements and data sources are shown below.

#### Packages

To install the required core packages for the workshop, run the following:

```{r}
install.packages(c(
  "arrow", "dplyr", "stringr", "lubridate", "tictoc"
))
```

#### Seattle Checkouts by Title Data

This is the data we will use in the workshop. It's a good-sized, single CSV file (*9GB* on disk in total) that can be downloaded from an AWS S3 bucket over HTTPS:

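```{r}
# Allow up to 30 minutes for the download; the file is large
options(timeout = 1800)
# download.file() does not create missing directories, so make ./data first
dir.create("./data", showWarnings = FALSE)
download.file(
  url = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  destfile = "./data/seattle-library-checkouts.csv"
)
```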

#### Tiny Data Option

If you don't have the time or disk space to download the 9GB dataset (but still have enough disk space to do the exercises), you can run the workshop code with a "tiny" version of this data. Although this course focuses on working with larger-than-memory data, you can still learn the concepts and workflows with smaller data, though you may not see the same performance improvements that you would get with larger data.

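```{r}
options(timeout = 1800)
# download.file() does not create missing directories, so make ./data first
dir.create("./data", showWarnings = FALSE)
download.file(
  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv",
  destfile = "./data/seattle-library-checkouts-tiny.csv"
)
```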

If you want to participate in the coding exercises or follow along, please do your best to arrive at the workshop with the required software and packages installed and the data downloaded onto your laptop.
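
One way to confirm the prework is done is a quick check like the sketch below. It is not part of the workshop materials: `check_prework()` is a hypothetical helper written for illustration, and it assumes the packages and file paths listed above. It uses only base R functions.

```{r}
# Hypothetical helper (an assumption, not from the workshop materials):
# reports which required packages are missing and whether a data file exists.
check_prework <- function(
    packages = c("arrow", "dplyr", "stringr", "lubridate", "tictoc"),
    data_file = "./data/seattle-library-checkouts.csv") {
  # requireNamespace() returns TRUE when a package can be loaded
  installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  found <- file.exists(data_file)
  list(
    missing_packages = packages[!installed],
    data_found = found,
    # Size in GB (10^9 bytes), or NA when the file is absent
    data_size_gb = if (found) file.size(data_file) / 1e9 else NA_real_
  )
}

check_prework()
```

If you chose the tiny data option, pass its path instead, for example `check_prework(data_file = "./data/seattle-library-checkouts-tiny.csv")`.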

------------------------------------------------------------------------

![](https://i.creativecommons.org/l/by/4.0/88x31.png) This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
14 changes: 14 additions & 0 deletions _freeze/index/execute-results/html.json
@@ -0,0 +1,14 @@
{
"hash": "21ece2c7abde2d9f3883038a17ddad68",
"result": {
"markdown": "---\ntitle: Intro to Arrow in R\nsubtitle: A short workshop\neditor: source\n---\n\n\n![](images/logo.png){width=\"30%\" fig-align=\"center\"}\n\n### Workshop Overview\n\nThis workshop will focus on using the arrow R package to process larger-than-memory files and multi-file datasets with arrow using familiar dplyr syntax. You'll learn to create and use interoperable data file formats like Parquet for efficient data storage and access, and also how to exercise fine control over data types to avoid common large data pipeline problems. This workshop will provide a foundation for using Arrow, giving you access to a powerful suite of tools for performant analysis of larger-than-memory data in R.\n\n*This course is for you if you:*\n\n- want to learn how to work with tabular data that is too large to fit in memory using existing R and tidyverse syntax implemented in Arrow\n- want to learn about Parquet and other file formats that are powerful alternatives to CSV files\n- want to learn how to engineer your tabular data storage for more performant access and analysis with Apache Arrow\n\n### Workshop Prework\n\nDetailed instructions for software requirements and data sources are show below.\n\n#### Packages\n\nTo install the required core packages for the workshop, run the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(c(\n \"arrow\", \"dplyr\", \"stringr\", \"lubridate\", \"tictoc\"\n))\n```\n:::\n\n\n\n#### Seattle Checkouts by Title Data\n\nThis is the data we will use in the workshop. 
It's a good-sized, single CSV file---*9GB* on-disk in total, which can be downloaded from an AWS S3 bucket via https:\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(timeout = 1800)\ndownload.file(\n url = \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\",\n destfile = \"./data/seattle-library-checkouts.csv\"\n)\n```\n:::\n\n\n#### Tiny Data Option\n\nIf you don't have time or disk space to download the 9Gb dataset (and still have disk space to do the exercises), you can run the code in the workshop with the \"tiny\" version of this data. Although the focus in this course is working with larger-than-memory data, you can still learn about the concepts and workflows with smaller data---although note you may not see the same performance improvements that you would get when working with larger data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(timeout = 1800)\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv\",\n destfile = \"./data/seattle-library-checkouts-tiny.csv\"\n)\n```\n:::\n\n\nIf you want to participate in the coding exercise or follow along, please try your very best to begin the workshop ready with the required software & packages installed and the data downloaded on to your laptop.\n\n------------------------------------------------------------------------\n\n![](https://i.creativecommons.org/l/by/4.0/88x31.png) This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
7 changes: 7 additions & 0 deletions _freeze/site_libs/clipboard/clipboard.min.js

Some generated files are not rendered by default.

30 changes: 30 additions & 0 deletions _freeze/site_libs/revealjs/dist/reset.css
@@ -0,0 +1,30 @@
/* http://meyerweb.com/eric/tools/css/reset/
v4.0 | 20180602
License: none (public domain)
*/

html, body, div, span, applet, object, iframe,
h1, h2, h3, h4, h5, h6, p, blockquote, pre,
a, abbr, acronym, address, big, cite, code,
del, dfn, em, img, ins, kbd, q, s, samp,
small, strike, strong, sub, sup, tt, var,
b, u, i, center,
dl, dt, dd, ol, ul, li,
fieldset, form, label, legend,
table, caption, tbody, tfoot, thead, tr, th, td,
article, aside, canvas, details, embed,
figure, figcaption, footer, header, hgroup,
main, menu, nav, output, ruby, section, summary,
time, mark, audio, video {
margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
/* HTML5 display-role reset for older browsers */
article, aside, details, figcaption, figure,
footer, header, hgroup, main, menu, nav, section {
display: block;
}
8 changes: 8 additions & 0 deletions _freeze/site_libs/revealjs/dist/reveal.css

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions _freeze/site_libs/revealjs/dist/reveal.esm.js

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions _freeze/site_libs/revealjs/dist/reveal.esm.js.map

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions _freeze/site_libs/revealjs/dist/reveal.js

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions _freeze/site_libs/revealjs/dist/reveal.js.map

Large diffs are not rendered by default.

@@ -0,0 +1,2 @@
SIL Open Font License (OFL)
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL
@@ -0,0 +1,10 @@
@font-face {
font-family: 'League Gothic';
src: url('./league-gothic.eot');
src: url('./league-gothic.eot?#iefix') format('embedded-opentype'),
url('./league-gothic.woff') format('woff'),
url('./league-gothic.ttf') format('truetype');

font-weight: normal;
font-style: normal;
}
Binary file not shown.
Binary file not shown.
Binary file not shown.