Hyphenated words #250

Open
seb-29 opened this issue May 15, 2024 · 1 comment

seb-29 commented May 15, 2024

The spaCy tokenizer splits hyphenated words by inserting a space before and after the hyphen. For example, "eye-opening" becomes "eye - opening". Is there a way to keep hyphenated words together, like with the quanteda tokenizers? (@JBGruber : Any idea? :))

@JBGruber
Collaborator

Changing this behaviour requires changing spaCy's infix patterns, which would be non-trivial from R. But we can use a quick workaround:

library("spacyr")
txt <- "The spaCy tokenizer splits hyphenated words, like eye-opening, by inserting a space before and after the hyphen."
# replace hyphens with a symbol that is not part of the infix patterns
txt2 <- gsub("-", "§", txt, fixed = TRUE)
spacy_parse(txt2)$lemma
#>  [1] "the"         "spacy"       "tokenizer"   "split"       "hyphenated" 
#>  [6] "word"        ","           "like"        "eye§opening" ","          
#> [11] "by"          "insert"      "a"           "space"       "before"     
#> [16] "and"         "after"       "the"         "hyphen"      "."

I use the obscure section sign (§) to replace hyphens, but you can use anything else not in the infix pattern list. After parsing, you can then just change it back:

gsub("§", "-", spacy_parse(txt2)$lemma, fixed = TRUE)
#>  [1] "the"         "spacy"       "tokenizer"   "split"       "hyphenated" 
#>  [6] "word"        ","           "like"        "eye-opening" ","          
#> [11] "by"          "insert"      "a"           "space"       "before"     
#> [16] "and"         "after"       "the"         "hyphen"      "."

Created on 2024-05-17 with reprex v2.1.0

Just make sure you don't happen to have any § (or other replacement symbol) in your original text.
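
To guard against that, the replace-and-restore steps can be wrapped in two small helpers that refuse to run when the replacement symbol already occurs in the input. This is only a sketch: `protect_hyphens()` and `restore_hyphens()` are hypothetical names, not part of spacyr, and the default `§` is assumed to stay outside the infix patterns as in the example above.

```r
# Hypothetical helpers (not part of spacyr) wrapping the workaround above.

# Swap hyphens for a placeholder, but stop if the placeholder is already
# present, since restoring would otherwise turn it into a hyphen.
protect_hyphens <- function(txt, placeholder = "§") {
  if (any(grepl(placeholder, txt, fixed = TRUE))) {
    stop("Placeholder '", placeholder, "' already occurs in the text; ",
         "choose a different one.")
  }
  gsub("-", placeholder, txt, fixed = TRUE)
}

# Swap the placeholder back to hyphens in a character vector of tokens or lemmas.
restore_hyphens <- function(tokens, placeholder = "§") {
  gsub(placeholder, "-", tokens, fixed = TRUE)
}

# Intended use with spacyr (needs a working spaCy installation, so it is
# left commented out here):
# restore_hyphens(spacy_parse(protect_hyphens(txt))$lemma)
```

If the check fails, pass another symbol through the `placeholder` argument of both functions, after confirming that spaCy does not split on it.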
