Missing words and bad hyphenation in french #4

Open
aadant opened this issue Jul 28, 2015 · 3 comments

@aadant

aadant commented Jul 28, 2015

java -jar target/wikiforia-1.2.1.jar --pages ../frwiki-20150602-pages-articles-multistream.xml.bz2 -lang fr -o xml

Interrupt the run after a couple of minutes; the issue already shows up in the first pages.

Example: Amsterdam, id = 245

Le est considéré comme l'âge d'or d'Amsterdam car elle devient à cette époque la ville la plus riche du monde.

should be

Le XVIIe siècle est considéré comme l'âge d'or d'Amsterdam car elle devient à cette époque la ville la plus riche du monde

LAndalousie

should be

L'Andalousie

@marcusklang marcusklang self-assigned this Aug 2, 2015
@marcusklang
Owner

Sorry for the late response. This is a problem with unsupported template expansion.

The raw wiki markup for the incorrectly converted text is:

Le {{s|XVII|e}} est considéré comme l'[[âge d'or]] d'Amsterdam car elle devient à cette époque la ville la plus riche du monde

This uses the template "s". The French edition uses templates for common formatting far more frequently than, e.g., the English and Swedish editions do.
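To illustrate what expanding this one template involves, here is a minimal Java sketch. It is not wikiforia code: the class name `CenturyTemplateSketch`, the method `expandCentury`, and the regular expression are all hypothetical, and the rule "numeral + suffix + siècle" is only inferred from the expected output in the report above. The real template on French Wikipedia is more elaborate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of wikiforia: special-cases the French "s" (century) template.
class CenturyTemplateSketch {
    // Matches {{s|XVII|e}} or {{s|XVII}}: group 1 is the Roman numeral,
    // group 2 the optional ordinal suffix.
    private static final Pattern S_TEMPLATE =
            Pattern.compile("\\{\\{s\\|([IVXLCDM]+)(?:\\|([^|}]*))?\\}\\}");

    public static String expandCentury(String wikitext) {
        Matcher m = S_TEMPLATE.matcher(wikitext);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption: default to the suffix "e" when none is given.
            String suffix = m.group(2) == null ? "e" : m.group(2);
            // "si\u00e8cle" is "siecle" with a grave accent, escaped to avoid source-encoding issues.
            String replacement = m.group(1) + suffix + " si\u00e8cle";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandCentury("Le {{s|XVII|e}} est consid\u00e9r\u00e9 comme ..."));
    }
}
```

Applied to the raw markup above, this rewrites `{{s|XVII|e}}` to "XVIIe siècle" and leaves the rest of the sentence, including the `[[âge d'or]]` link, untouched. Hard-coding templates one by one like this does not scale, which is why general template expansion is needed.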

I plan to implement template expansion using a fast disk-based hashmap, but performance will depend on how much memory is available for caching, and it will require two passes over the data.

I cannot give you a timeline for this feature, only that it is on the TODO list and is considered highly important.
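The two-pass idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Page` type, the method names, the regular expression, and the toy template body are hypothetical, and an in-memory `HashMap` stands in for the disk-based hashmap. It handles only flat calls with positional parameters (`{{{1}}}`, `{{{2}}}`), not nested templates, named parameters, or parser functions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of two-pass template expansion; not wikiforia code.
class TwoPassExpansionSketch {
    /** Hypothetical page record: title plus raw wikitext. */
    static final class Page {
        final String title;
        final String text;
        Page(String title, String text) { this.title = title; this.text = text; }
    }

    // Template namespace prefix on French Wikipedia ("Modele:" with a grave accent).
    private static final String TEMPLATE_PREFIX = "Mod\u00e8le:";

    // Flat template call such as {{s|XVII|e}}: group 1 is the name, group 2 the "|arg" list.
    private static final Pattern CALL =
            Pattern.compile("\\{\\{([^{}|]+)((?:\\|[^{}|]*)*)\\}\\}");

    // Simplification: template names are compared case-insensitively.
    private static String key(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    /** Pass 1: collect template bodies. A disk-based map would replace the HashMap here. */
    public static Map<String, String> collectTemplates(List<Page> pages) {
        Map<String, String> templates = new HashMap<>();
        for (Page p : pages) {
            if (p.title.startsWith(TEMPLATE_PREFIX)) {
                templates.put(key(p.title.substring(TEMPLATE_PREFIX.length())), p.text);
            }
        }
        return templates;
    }

    /** Pass 2: replace each known template call with its body, filling positional parameters. */
    public static String expand(String text, Map<String, String> templates) {
        Matcher m = CALL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = templates.get(key(m.group(1)));
            if (body == null) {
                // Unknown template: keep the call as-is instead of dropping it.
                m.appendReplacement(out, Matcher.quoteReplacement(m.group()));
                continue;
            }
            String rawArgs = m.group(2);
            String[] args = rawArgs.isEmpty() ? new String[0] : rawArgs.substring(1).split("\\|", -1);
            for (int i = 0; i < args.length; i++) {
                body = body.replace("{{{" + (i + 1) + "}}}", args[i]);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(body));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The first pass is what forces reading the dump twice: a template page can appear anywhere in the dump, so every definition has to be collected before any article can be expanded reliably. Keeping unknown calls untouched in the second pass at least avoids silently dropping words, which is the symptom reported here.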

@aadant
Author

aadant commented Aug 2, 2015

Thank you for your feedback. It might be a Sweble issue. I will raise a separate issue for the missing apostrophe in "L'Andalousie".

@aadant
Author

aadant commented Aug 30, 2015

Hey Marcus, I was looking at this project: attardi/wikiextractor#32 (comment)

It looks like you will also need to support Modules (and Lua!). Fortunately, there are Java implementations of Lua, so it can still be pure Java.
