Commit 01bacd8: fix some check errors
JBGruber committed Oct 5, 2023
1 parent 92c8ce8
Showing 5 changed files with 33 additions and 21 deletions.
6 changes: 5 additions & 1 deletion DESCRIPTION
@@ -12,10 +12,13 @@ Imports:
adaR,
callr,
cli,
cookiemonster,
curl,
dplyr,
jsonlite,
lubridate,
magrittr,
methods,
praise,
purrr,
rlang,
@@ -32,7 +35,8 @@ Suggests:
rmarkdown,
rstudioapi,
spelling,
-testthat
+testthat,
+withr
URL: https://github.com/JBGruber/paperboy
Encoding: UTF-8
BugReports: https://github.com/JBGruber/paperboy/issues
2 changes: 1 addition & 1 deletion R/collect.R
@@ -61,7 +61,7 @@ pb_collect <- function(urls,

res <- purrr::map(url_batches, function(b) {
domain <- adaR::ada_get_domain(b[1])
-cookies_str <- cookiemonster::get_cookies(paste0("\\b", domain, "\\b"), as = "string")
+cookies_str <- cookiemonster::get_cookies(paste0(domain, "\\b"), as = "string")
rp <- callr::r_bg(async_requests,
args = list(
urls = b,
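The pattern handed to `cookiemonster::get_cookies()` above is a regular expression matched against stored cookie domains, with a trailing `\b` word boundary so the domain only matches where it ends cleanly. A minimal Python sketch of that matching idea (illustrative only — the function name and data are invented, and unlike the R pattern it escapes the dots so they match literally):

```python
import re

def match_cookie_domains(domain, stored_domains):
    # Mirror of paste0(domain, "\\b"): the domain followed by a word boundary.
    # The R code uses the domain verbatim (so "." is a regex wildcard there);
    # here we escape it to match literally.
    pattern = re.compile(re.escape(domain) + r"\b")
    return [d for d in stored_domains if pattern.search(d)]

stored = ["www.theguardian.com", ".theguardian.com", "theguardian.community"]
print(match_cookie_domains("theguardian.com", stored))
```

Because the pattern is unanchored at the start, subdomain-prefixed entries such as `www.theguardian.com` still match, while the boundary check rejects longer domains that merely start with the same characters.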
12 changes: 6 additions & 6 deletions R/utils_dev.R
@@ -42,7 +42,7 @@ use_new_parser <- function(x,
rss = NULL,
test_data = NULL) {

-x <- head(adaR::ada_get_domain(x), 1)
+x <- utils::head(adaR::ada_get_domain(x), 1)

cli::cli_progress_step(
"Creating R file",
@@ -73,7 +73,7 @@
)

if (file.exists("inst/status.csv")) {
-status <- read.csv("inst/status.csv")
+status <- utils::read.csv("inst/status.csv")
if (!gsub("^www.", "", x) %in% status$domain) {
status <- status %>%
rbind(list(domain = sub("^www.", "", x),
@@ -82,7 +82,7 @@
issues = issue,
rss = rss)) %>%
dplyr::arrange(domain)
-write.csv(status, "inst/status.csv", row.names = FALSE)
+utils::write.csv(status, "inst/status.csv", row.names = FALSE)
} else if (rss == "") {
# if entry already present, get rss value
rss <- status[grepl(gsub("^www.", "", x), status$domain), "rss"]
@@ -154,13 +154,13 @@ use_new_parser <- function(x,
msg_done = "status.csv updated."
)
x <- utils::head(adaR::ada_get_domain(x), 1)
-status <- read.csv("inst/status.csv")
+status <- utils::read.csv("inst/status.csv")
status[status$domain == gsub("^www.", "", x), "status"] <-
"![](https://img.shields.io/badge/status-gold-%23ffd700.svg)"
cli::cli_alert_info("Check the entry manually. Press quit when you're happy.")
status[status$domain == gsub("^www.", "", x), ] <-
-edit(status[status$domain == gsub("^www.", "", x), ])
-write.csv(status, "inst/status.csv", row.names = FALSE)
+utils::edit(status[status$domain == gsub("^www.", "", x), ])
+utils::write.csv(status, "inst/status.csv", row.names = FALSE)

}

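The hunks above follow a read-modify-write pattern on `inst/status.csv`: read the table, append a row if the domain is not yet listed, keep the table sorted by domain, and write it back. A rough Python sketch of that upsert step (illustrative only — the helper name is invented and the column set is reduced from the one in the diff):

```python
import csv
import io

def upsert_status(csv_text, domain, author, status_badge):
    # Read the status table (mirrors read.csv in the diff).
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # The R code strips a leading "www." before comparing domains.
    domain = domain.removeprefix("www.")
    if all(r["domain"] != domain for r in rows):
        # Append the new entry and re-sort by domain
        # (mirrors the rbind + dplyr::arrange sequence).
        rows.append({"domain": domain, "author": author, "status": status_badge})
        rows.sort(key=lambda r: r["domain"])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["domain", "author", "status"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The insert-if-absent check is what keeps re-running the dev helper idempotent: an existing domain entry is left untouched rather than duplicated.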
23 changes: 12 additions & 11 deletions README.md
@@ -44,9 +44,9 @@ df <- pb_deliver("https://tinyurl.com/386e98k5")
df
```

-| url | expanded_url | domain | status | datetime | author | headline | text | misc |
-|:-------------------------------|:----------------------------------------------------------------------------------|:--------------------|-------:|:--------------------|:------------------------------------------------------|:------------------------|:-------------------------------|:-----|
-| <https://tinyurl.com/386e98k5> | <https://www.theguardian.com/tv-and-radio/2021/jul/12/should-marge-divorce-homer> | www.theguardian.com | 200 | 2021-07-12 12:00:13 | <https://www.theguardian.com/profile/stuart-heritage> | ’A woman trapped in an… | In the Guide’s weekly Solved!… | NULL |
+| url | expanded_url | domain | status | datetime | author | headline | text | misc |
+|:-------------------------------|:----------------------------------------------------------------------------------|:----------------|-------:|:--------------------|:------------------------------------------------------|:------------------------|:-------------------------------|:-----|
+| <https://tinyurl.com/386e98k5> | <https://www.theguardian.com/tv-and-radio/2021/jul/12/should-marge-divorce-homer> | theguardian.com | 200 | 2021-07-12 12:00:13 | <https://www.theguardian.com/profile/stuart-heritage> | ’A woman trapped in an… | In the Guide’s weekly Solved!… | NULL |

The returned `data.frame` contains important meta information about the
news items and their full text. Notice that the function had no problem
@@ -56,13 +56,12 @@ therefore often encounter this warning:

``` r
pb_deliver("google.com")
-#> Warning: ℹ No parser for domain www.google.com yet, attempting generic
-#> approach.
+#> Warning: ℹ No parser for domain google.com yet, attempting generic approach.
```

-| url | expanded_url | domain | status | datetime | author | headline | text | misc |
-|:-----------|:-------------------------|:---------------|-------:|:---------|:-------|:---------|:---------------------------------------------|:-----|
-| google.com | <http://www.google.com/> | www.google.com | 200 | NA | NA | Google | © 2023 - Ochrana soukromí - Smluvní podmínky | NULL |
+| url | expanded_url | domain | status | datetime | author | headline | text | misc |
+|:-----------|:-------------------------|:-----------|-------:|:---------|:-------|:---------|:----------------------------------------------------|:-----|
+| google.com | <http://www.google.com/> | google.com | 200 | NA | NA | Google | © 2023 - Datenschutzerklärung - Nutzungsbedingungen | NULL |

The function still returns a data.frame, but important information is
missing — in this case because it isn’t there. The other URLs will be
@@ -77,9 +76,9 @@ later parse it yourself:
pb_collect("google.com")
```

-| url | expanded_url | domain | status | content_raw |
-|:-----------|:-------------------------|:---------------|-------:|:-----------------------------------|
-| google.com | <http://www.google.com/> | www.google.com | 200 | \<!doctype html\>\<html itemscope… |
+| url | expanded_url | domain | status | content_raw |
+|:-----------|:-------------------------|:-----------|-------:|:-----------------------------------|
+| google.com | <http://www.google.com/> | google.com | 200 | \<!doctype html\>\<html itemscope… |

`pb_collect` uses concurrent requests to download many pages at the same
time, making the function very quick to collect large amounts of data.
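Concurrency here means the requests in a batch are in flight at the same time rather than issued one after another. A toy Python sketch of that idea using a thread pool (illustrative only — `fetch` is a stand-in stub, not how paperboy downloads pages, which goes through `callr` and `curl` in R):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for an HTTP GET; a real version would use urllib.request.
    return {"url": url, "status": 200, "content_raw": "<!doctype html>…"}

def collect(urls, max_workers=8):
    # Issue the requests concurrently; map() preserves the input order,
    # so results line up with the URLs that produced them.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

pages = collect(["https://example.org/a", "https://example.org/b"])
```

Since each request spends most of its time waiting on the network, even a modest worker pool gives a near-linear speed-up over sequential downloads.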
@@ -113,6 +112,7 @@ column was included so these can be retained.

| domain | status | author | issues |
|:-------------------------------|:--------------------------------------------------------------|:------------------------------------------|:-----------------------------------------------------|
+| ad.nl | ad.nl | [@JBGruber](https://github.com/JBGruber/) | |
| anotherangryvoice.blogspot.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | [@JBGruber](https://github.com/JBGruber/) | |
| boston.com | ![](https://img.shields.io/badge/status-requested-lightgrey) | | [\#1](https://github.com/JBGruber/paperboy/issues/1) |
| bostonglobe.com | ![](https://img.shields.io/badge/status-requested-lightgrey) | | [\#1](https://github.com/JBGruber/paperboy/issues/1) |
@@ -157,6 +157,7 @@
| tribpub.com | ![](https://img.shields.io/badge/status-requested-lightgrey) | | [\#1](https://github.com/JBGruber/paperboy/issues/1) |
| us.cnn.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | [@JBGruber](https://github.com/JBGruber/) | |
| usatoday.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | [@JBGruber](https://github.com/JBGruber/) | |
+| volkskrant.nl | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | [@JBGruber](https://github.com/JBGruber/) | |
| washingtonpost.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | [@JBGruber](https://github.com/JBGruber/) | |
| wsj.com | ![](https://img.shields.io/badge/status-gold-%23ffd700.svg) | [@JBGruber](https://github.com/JBGruber/) | |

11 changes: 9 additions & 2 deletions inst/WORDLIST
@@ -1,6 +1,7 @@
CMD
Codecov
-Datenschutzerklrung
+Datenschutzerklärung
+Guide’s
Lifecycle
Nutzungsbedingungen
POSIXct
@@ -16,6 +17,8 @@ cbsnews
cnet
cnn
com
+csv
+cz
dailymail
datetime
doctype
@@ -28,13 +31,17 @@ foxnews
ftw
huffingtonpost
huffpost
+idnes
itemscope
latimes
lnk
marketwatch
mediacloud
msnbc
newsweek
+nl
+nos
+nrc
nypost
nytimes
org
@@ -55,9 +62,9 @@ uk
un
urls
usatoday
+volkskrant
washingtonpost
webscraper
webscraping
wsj
www
’A
