---
title: "Web Scraping with R"
subtitle: "Workshop at the CorrelAid Community Event"
author: "Zoé Wolter"
date: "2022-11-26"
output:
  ioslides_presentation:
    widescreen: true
    logo: logo.png
    css: styles.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(knitr)
library(forcats)
library(countrycode)
library(tidyr)
library(ggplot2)
```
# Introduction
## Outline
- Understanding the structure of **HTML** documents
- Extracting information based on its **XPaths**
- **"Web etiquette"** & the robots.txt
- **Hands-On**: Scraping a Website
# Before we start - Setup
## Setup & Installation
- You need to have R and RStudio installed!
- Please download or clone the repository: `https://github.com/ZoeWolter/workshop-webscraping/`
- Please install the packages via this code:
```{r, eval = FALSE}
source(knitr::purl('code/packages.Rmd', quiet = TRUE))
```
# Some theory | Web Data Collection
## Why?
<blockquote>
**Web Scraping** = collecting information from websites by extracting it directly from the HTML code
</blockquote>
- Data, data, and more data
- No more copy & paste
- Automated data collection
- Reproducible and updatable data collection
## What are websites built of?
![](assets/img/websites.png){width=80%}
## HTML (1)
- **H**ypertext **M**arkup **L**anguage
- Instructions to the browser on what to **display**, when and where
- for web scraping: we don't need to write HTML, but understanding it helps a lot!
- hierarchical tree structure
- tags with attributes
## HTML (2)
![](assets/img/html.png){width=60% height=100%}
## HTML (3)
| Tag | Description |
|-----------------------------------|--------------------------------|
| `<a href=""></a>` | Link / URL |
| `<div>` and `<span>` | Blocks to structure the page |
| `<p></p>` | Paragraph |
| `<h1>`, `<h2>`,... | Headers |
| `<ul>`, `<ol>`, `<dl>` | Lists |
| `<li></li>` | Single list element |
| `<br>` | Line break |
| `<b>`, `<i>`, `<strong>` | Layout options |
| `<table>`, `<th>`, `<td>`, `<tr>` | Tables |
| `<script></script>` | Script container |
## XPath
- **X**ML **Path** Language
- **query language** to extract parts of HTML/XML-files
- use tags, attributes, and relations between **nodes** and tags
- based on **hierarchical structure** of nodes
- absolute paths: '/html/body/div/p'
- relative paths: '//p'
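
To make the difference between absolute and relative paths concrete, here is a minimal sketch on a toy page. The HTML snippet and the object names (`toy_page`, `abs_hits`, ...) are made up for illustration; `rvest::minimal_html()` only wraps the snippet in a complete HTML document.

```{r}
library(rvest)

# toy page, invented for illustration (not taken from a real website)
toy_page <- minimal_html('
  <div><p>First</p></div>
  <div>
    <p class="intro">Second</p>
    <div><p>Nested</p></div>
  </div>
')

# absolute path: only <p> nodes that sit exactly at html > body > div > p
abs_hits <- toy_page %>%
  html_elements(xpath = '/html/body/div/p') %>%
  html_text2()

# relative path: every <p> node, no matter how deeply it is nested
rel_hits <- toy_page %>%
  html_elements(xpath = '//p') %>%
  html_text2()

# attributes narrow a query down: only <p> nodes with class "intro"
intro_hits <- toy_page %>%
  html_elements(xpath = '//p[@class="intro"]') %>%
  html_text2()
```

The absolute path misses the nested paragraph, while the relative path finds all three.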
## robots.txt
- **Robots Exclusion Standard**
- message to (search engine) crawlers about the URLs they are allowed to access
- goal: keep a website from breaking down under too many requests at once
- definition of a **crawl-delay** (e.g. 5 sec)
<br>
![](assets/img/robotstxt.png){width=50%}
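
To see how such rules can be read in R, here is a small sketch with the `robotstxt` package. It parses a made-up robots.txt text instead of downloading a real file; the rules and the object names (`toy_txt`, `toy_rules`, `toy_allowed`) are assumptions for illustration only.

```{r}
library(robotstxt)

# made-up robots.txt content, invented for illustration
toy_txt <- "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"

# parse the text instead of downloading a real robots.txt file
toy_rules <- robotstxt(text = toy_txt)

# which paths are restricted, and for which bots?
toy_rules$permissions

# which crawl-delay is requested?
toy_rules$crawl_delay

# may any bot ("*") access these two paths?
toy_allowed <- toy_rules$check(
  paths = c("/public/page.html", "/private/page.html"),
  bot   = "*"
)
toy_allowed
```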
## Web Scraping in R - Packages
![](assets/img/packages.png){width=100%}
# Hands-On | Web Scraping Workflow in R
## Website
$\longrightarrow$ find the URL to the website you want to scrape
```{r, results = 'hide'}
base_url <- 'https://www.bertelsmann-stiftung.de/'
projects_url <- 'https://www.bertelsmann-stiftung.de/en/our-projects/project-search?page=1'
```
## Be polite! (1)
$\longrightarrow$ tell the website who you are!<br>
$\longrightarrow$ check whether you are allowed to scrape!<br>
```{r, results = 'hide'}
polite::bow(url = stringr::str_c(base_url),
            user_agent = 'Workshop Web Data Collection - zoe.w@correlaid.org') -> session
```
```{r echo = FALSE}
session
```
## Be polite! (2)
```{r, results = 'hide'}
session
session$robotstxt
session$robotstxt$permissions
session$robotstxt$crawl_delay
```
$\longrightarrow$ Are we **allowed** to scrape? <br>
$\longrightarrow$ Which **crawl-delay** is set for this website? <br>
$\longrightarrow$ Are there any **rules** for some bots? <br>
## Download htmls (1)
1 - Load the html page (as list of `<head>` and `<body>`) in R:
```{r, results = 'hide'}
# call the session you created
session %>%
  # be polite & specify url path
  polite::nod(stringr::str_c('en/our-projects/project-search?page=1')) %>%
  # scrape!
  polite::scrape() -> projects_html
```
## Download htmls (2)
2 - Best practice: download htmls and save them locally:
```{r, results = 'hide'}
# create directory to store the htmls (including 'assets' if it is missing)
if (!dir.exists(here::here('assets', 'htmls'))) {
  dir.create(here::here('assets', 'htmls'), recursive = TRUE)
}
# function to download htmls
download_html <- function(url, filename) {
  polite::nod(session, url) %>%
    polite::rip(destfile = filename,
                path = here::here('assets', 'htmls'),
                overwrite = TRUE)
}
# call function to download html
download_html(stringr::str_c(base_url, 'en/our-projects/project-search?page=1'),
              'projects.html')
```
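
A saved page can later be parsed from disk, without sending another request to the website. The sketch below writes a tiny made-up HTML file to a temporary directory so that it runs on its own; in the workshop you would point `read_html()` at the file stored in `assets/htmls` instead. The file content and the object names are assumptions for illustration.

```{r}
library(rvest)

# write a tiny made-up HTML file (stand-in for a downloaded page)
toy_path <- file.path(tempdir(), 'toy_projects.html')
writeLines(
  '<html><body><h2><a href="/en/example">Example project</a></h2></body></html>',
  toy_path
)

# parse the local file, no internet connection needed
toy_saved <- read_html(toy_path)

# query it exactly like a freshly scraped page
toy_title <- toy_saved %>%
  html_element(xpath = '//h2/a') %>%
  html_text2()
toy_title
```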
## XPath: Extract data (1)
$\longrightarrow$ right-click on the webpage $\rightarrow$ Inspect $\rightarrow$ search for the HTML node <br>
$\longrightarrow$ [Selector Gadget](https://selectorgadget.com/): "SelectorGadget is an open source tool that makes CSS selector generation and discovery on complicated sites a breeze" <br>
$\longrightarrow$ Can you find the XPath to the first project on the website?
## XPath: Extract data (2)
//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/<br>
div/div/div[**1**]/article/div[2]/div/div[2]/h2/a <br>
//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/<br>
div/div/div[**2**]/article/div[2]/div/div[2]/h2/a
Now extract the title of the first project:
```{r}
projects_html %>%
  rvest::html_element(xpath = '//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/
                      div/div/div[1]/article/div[2]/div/div[2]/h2/a') %>%
  rvest::html_text2()
```
## XPath: Extract data (3)
What else can we extract?
```{r}
# url to project
projects_html %>%
  rvest::html_element(xpath = '//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/
                      div/div/div[1]/article/div[2]/div/div[2]/h2/a') %>%
  rvest::html_attr('href')
# project description
projects_html %>%
  rvest::html_element(xpath = '//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/
                      div/div/div[2]/article/div[2]/div/div[3]/div/p') %>%
  rvest::html_text2()
```
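
Links in HTML are often written relative to the site root. Assuming that is the case for the extracted `href` (not verified here), `xml2::url_absolute()` turns it into a complete address. The link below is hypothetical and only serves as an illustration.

```{r}
library(rvest)

# made-up link, invented for illustration
toy_link <- minimal_html(
  '<h2><a href="/en/our-projects/example-project">Example project</a></h2>'
)

# the href attribute exactly as it is written in the HTML
toy_href <- toy_link %>%
  html_element(xpath = '//h2/a') %>%
  html_attr('href')

# combine it with the base URL to get a complete address
toy_url <- xml2::url_absolute(toy_href, base = 'https://www.bertelsmann-stiftung.de/')
toy_url
```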
## Data Cleaning
Before cleaning: we need a data frame!
```{r}
data.frame(
  # html_elements() is the current name of the superseded html_nodes()
  project = projects_html %>%
    rvest::html_elements(xpath = '//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/
                         div/div/div[*]/article/div[2]/div/div[2]/h2/a') %>%
    rvest::html_text2(),
  text = projects_html %>%
    rvest::html_elements(xpath = '//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/
                         div/div/div[*]/article/div[2]/div/div[3]/div/p') %>%
    rvest::html_text2()
) -> df
```
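
`data.frame()` only works if both columns have the same length, which can fail when a project has no description. One possible way around this, shown here on a made-up page, is to select each project block first and then use `html_element()` (singular) inside it, which yields `NA` where something is missing. The `<article>` layout and all object names are assumptions for illustration, not the structure of the real website.

```{r}
library(rvest)

# made-up page: the second project has no description
toy_projects <- minimal_html('
  <article><h2><a href="/a">Project A</a></h2><p>Description A</p></article>
  <article><h2><a href="/b">Project B</a></h2></article>
')

# step 1: one node per project
toy_articles <- toy_projects %>% html_elements(xpath = '//article')

# step 2: exactly one result per project, NA if the element is missing
toy_df <- data.frame(
  project = toy_articles %>% html_element(xpath = './/h2/a') %>% html_text2(),
  text    = toy_articles %>% html_element(xpath = './/p') %>% html_text2()
)
toy_df
```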
## Store data
Since you don't want to run your scraping script each time you do some analysis:
```{r}
saveRDS(df, file = here::here('data', 'projects.RDS'))
```
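
In a later analysis script the stored data can be loaded again with `readRDS()`. The sketch below uses a small made-up data frame and a temporary file so that it runs on its own; the object names are assumptions for illustration.

```{r}
# small made-up data frame, invented for illustration
toy_data <- data.frame(
  project = c('Project A', 'Project B'),
  text    = c('Description A', 'Description B')
)

# save it to a temporary file ...
toy_rds <- file.path(tempdir(), 'toy_projects.RDS')
saveRDS(toy_data, file = toy_rds)

# ... and load it again: the object comes back unchanged
toy_restored <- readRDS(toy_rds)
identical(toy_data, toy_restored)
```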
# Scraping at Scale
## Scraping at Scale (1)
- https://www.bertelsmann-stiftung.de/en/our-projects/project-search?page=1
- https://www.bertelsmann-stiftung.de/en/our-projects/project-search?page=2
```{r}
# define base URL
base_url <- 'https://www.bertelsmann-stiftung.de/'
# Be polite
session <- polite::bow(url = base_url,
                       user_agent = 'Workshop Web Data Collection - zoe.w@correlaid.org')
# Vector of all pages we want to have a look at
pages <- 1:7
```
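
Only the page number changes between the URLs, so all paths can be built from the vector of page numbers. The following sketch does not send any requests; the object names (`toy_pages`, `toy_paths`, `toy_urls`) are made up for illustration.

```{r}
# page numbers we want to visit
toy_pages <- 1:7

# str_c() is vectorised: it returns one path per page number
toy_paths <- stringr::str_c('en/our-projects/project-search?page=', toy_pages)

# map_chr() applies a function to each element and returns a character vector
toy_urls <- purrr::map_chr(
  toy_paths,
  ~ stringr::str_c('https://www.bertelsmann-stiftung.de/', .x)
)
head(toy_urls, 2)
```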
## Scraping at Scale (2)
1 - Load ALL the html pages in R:
```{r}
# With purrr you can map over all numbers in the vector "pages"
purrr::map(.x = pages, ~ {
  #...you create the url for each of the pages...
  polite::nod(session, stringr::str_c('en/our-projects/project-search?page=', .x)) %>%
    #...and scrape the htmls!
    polite::scrape()
}) -> results
```
## Scraping at Scale (3)
2 - Best practice: download htmls and save them locally:
```{r, results = 'hide'}
# With purrr you can map over all numbers in the vector "pages"
purrr::map(.x = pages, ~ {
  #...you create the url for each of the pages...
  polite::nod(session, stringr::str_c('en/our-projects/project-search?page=', .x)) %>%
    #...and save!
    polite::rip(destfile = stringr::str_c('projects_', .x, '.html'),
                path = here::here('assets', 'htmls'),
                overwrite = TRUE)
})
```
## Scraping at Scale (4)
Create a data frame with ALL projects:
```{r}
purrr::map_dfr(.x = results, ~ {
  data.frame(
    # html_elements() is the current name of the superseded html_nodes()
    project = .x %>%
      rvest::html_elements(xpath = '//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/
                           div/div/div[*]/article/div[2]/div/div[2]/h2/a') %>%
      rvest::html_text2(),
    text = .x %>%
      rvest::html_elements(xpath = '//*[@id="c199640"]/div[2]/div/div[2]/div[2]/div[2]/
                           div/div/div[*]/article/div[2]/div/div[3]/div/p') %>%
      rvest::html_text2()
  )
}) -> all_events
```
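
If you also want to know which page a project came from, `purrr::map_dfr()` can add an index column via its `.id` argument. The sketch below uses two made-up pages instead of the scraped results; the HTML layout and the object names are assumptions for illustration.

```{r}
library(rvest)

# two made-up result pages, invented for illustration
toy_results <- list(
  minimal_html('<article><h2><a>Project A</a></h2><p>Text A</p></article>'),
  minimal_html('<article><h2><a>Project B</a></h2><p>Text B</p></article>
                <article><h2><a>Project C</a></h2><p>Text C</p></article>')
)

# one data frame per page, stacked into one; "page" holds the list position
toy_all <- purrr::map_dfr(toy_results, ~ {
  data.frame(
    project = .x %>% html_elements(xpath = '//article//h2/a') %>% html_text2(),
    text    = .x %>% html_elements(xpath = '//article/p') %>% html_text2()
  )
}, .id = 'page')
toy_all
```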
# Thank you! | Contact me via zoe.w@correlaid.org if you have any questions!