Common misspellings #43
Comments
The list is likely not the biggest problem, as it can be built up over time by adding anything that gets noticed.
List of typos: https://lv.wikipedia.org/wiki/Vikiprojekts:Vikip%C4%93dijas_uzlabo%C5%A1ana/Raksti/Typo The spellchecker dictionary used by Firefox is based on this: https://dict.dv.lv/documentation.php?prj=lv I think we can try to use one or both of them.
So are we/you proposing to run against known incorrect values (blacklist) or against all known correct values (spellcheck, i.e. whitelist)? I don't want to speculate before even trying it out, but I suspect it will have either way too many false positives with a whitelist or way too few actionable true positives with a blacklist. This is why I was pessimistic about this task and called producing an actual "list" the "biggest problem". I also imagine names on a map likely have a somewhat different mix of words than prose text on Wikipedia. But hey, I'd like to be proven wrong. Thanks for looking into it at least; I might download those and try it to see what the results are.
I think both approaches would be (or at least might be) useful. A check against the dictionary gives us a quick reference for general spelling. It will probably produce a lot of false positives, but I hope they will be manageable. One possible caveat with this approach: words with an upper-case first letter might be ignored or checked less strictly, at least as I understand it from the browser's spellcheck behaviour. A check against known incorrect words would give a narrower list of words that are not expected. It might also make it possible to check for words that exist and are spelled correctly but are not expected to be on a map, such as obscenities or commonly confused words.
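To make the two approaches concrete, here is a minimal sketch of both checks side by side. The word sets, the typo entries and the function names are made up for illustration; a real run would load the dictionary and the typo list linked above, and this is not how Osmalyzer implements it.

```python
import re

# Hypothetical sample data, NOT taken from the real dictionary or typo list.
KNOWN_WORDS = {"iela", "parks", "skola", "veikals"}
KNOWN_TYPOS = {"ieal": "iela", "skoal": "skola"}

# Runs of letters only (Unicode-aware, so Latvian diacritics are kept).
WORD_RE = re.compile(r"[^\W\d_]+")


def check_whitelist(name, known_words=KNOWN_WORDS):
    """Return the words of `name` that are not in the dictionary."""
    return [w for w in WORD_RE.findall(name) if w.lower() not in known_words]


def check_blacklist(name, known_typos=KNOWN_TYPOS):
    """Return (word, suggestion) pairs for words on the known-typo list."""
    return [
        (w, known_typos[w.lower()])
        for w in WORD_RE.findall(name)
        if w.lower() in known_typos
    ]


if __name__ == "__main__":
    # The whitelist flags a word simply because the tiny dictionary lacks it.
    print(check_whitelist("Brīvības iela"))
    # The blacklist only reports what is explicitly listed as a typo.
    print(check_blacklist("Brīvības ieal"))
```

With this toy data the whitelist flags "Brīvības" only because the sample dictionary does not contain it (the false-positive case), while the blacklist stays silent unless a word is explicitly listed.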
I suspect Wikipedia's typo list might not be a good fit for OSM, like "paronoja -> paranoja" :) A spellcheck dictionary wouldn't account for shop names, brand names, etc., but extending it with these could be a great long-term goal. Which tags are meant to be covered here? I guess name, brand, operator, inscription...
As an experiment, I put all names through a spellchecker (i.e. whitelist), with the predictable result of "There are 13035 unknown-spelling values from 27861 (out of 56202) elements" https://osmlatvija.github.io/Osmalyzer/Spelling%20report.html So it currently flags about half of all the names. I have some ideas, but draw your own conclusions for now ;)
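For reference, a rough sketch of how the three numbers in that summary relate to each other. It assumes one name value per element and that elements sharing the same unknown value are counted once among the "values"; that is a guess about the report, not its actual logic, and `summarize` and `is_unknown` are hypothetical names.

```python
def summarize(names, is_unknown):
    """Return (distinct flagged values, flagged elements, total elements).

    `names` holds one name per element; `is_unknown` is any predicate that
    says whether a name failed the spellcheck (assumed, see lead-in).
    """
    flagged = [name for name in names if is_unknown(name)]
    return len(set(flagged)), len(flagged), len(names)


if __name__ == "__main__":
    # Toy data: two elements share the same unknown value "b".
    print(summarize(["a", "b", "b", "c"], lambda name: name == "b"))

    # Share of flagged elements using the numbers quoted from the report.
    print(f"{27861 / 56202:.1%}")  # 49.6%
```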
Took a peek at the results. One thing that immediately stood out is transborder objects that have names in multiple languages, like
Yes, this was something I also noticed. I was working on splitting on slashes, I just hadn't committed it because I kept finding new ways for things to break. Notably, there are valid reasons to have a I have committed my changes now: it splits the name into parts and checks each one individually, though it doesn't know that the parts aren't all Latvian. Ideally, OSM object would have
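A minimal sketch of the split-and-check idea described above, assuming the parts are separated by "/" and nothing else. The function names, the toy checker and the sample name are hypothetical, and this deliberately ignores the edge cases mentioned (such as names where a slash is legitimate).

```python
def split_name_parts(name):
    """Split a name on "/" and drop surrounding whitespace and empty parts."""
    return [part.strip() for part in name.split("/") if part.strip()]


def check_name(name, find_unknown_words):
    """Run `find_unknown_words` on each part separately.

    Returns a dict mapping each part to whatever the checker reports for it.
    The checker itself is passed in, so any spellcheck can be plugged in.
    """
    return {part: find_unknown_words(part) for part in split_name_parts(name)}


if __name__ == "__main__":
    # Toy checker: treats every word outside this tiny set as unknown.
    known = {"valka"}

    def toy_checker(text):
        return [w for w in text.split() if w.lower() not in known]

    # Hypothetical two-language name; the non-Latvian part is still flagged.
    print(check_name("Valka / Valga", toy_checker))
```

As noted in the thread, splitting alone does not help with the foreign-language part: it is checked against the Latvian word list and comes back as unknown.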
I meant ignoring those altogether, because a part in a different language will (always?) cause a false positive. * : partially might be implemented in the transliteration check, but only for roads.
Latvian language-specific known misspellings and unlikely spellings/words/terms. The biggest problem is coming up with a list: someone would need to manually go through all the names, or the list would have to be built up slowly over time. False positives and exceptions need to be considered.