
Support for non-ASCII Characters in fluent_text Validator #47

Closed
khashashin opened this issue Sep 18, 2023 · 2 comments

Comments


khashashin commented Sep 18, 2023

Hello,

I have been using the ckanext-fluent extension to support multilingual inputs in my CKAN instance. While using it, I encountered an issue where non-ASCII characters (like "ä", "ö", "ü", etc.) are stored as Unicode-escaped strings in the database. This happens because the `json.dumps` call in the fluent_text validator encodes these characters as Unicode escape sequences.

For instance, a text like:

Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie

is being stored in the database as:

Stromtarif Tarifanteil KEV Standardprodukt gem\u00e4ss ElCom pro Kategorie

Currently, the relevant part of the code in the fluent_text validator looks like this:

data[key] = json.dumps(value)

and

data[key] = json.dumps(output)

This not only affects how the data is stored but also adversely impacts search in CKAN, because the Solr search engine fails to match these Unicode-escaped sequences against the actual characters in search queries.

To resolve this, I propose updating the above lines to:

data[key] = json.dumps(value, ensure_ascii=False)

and

data[key] = json.dumps(output, ensure_ascii=False)

This modification will ensure that non-ASCII characters are stored as they are, without being converted to their Unicode escape sequences, thus preserving the original characters and facilitating accurate search results.
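The difference can be demonstrated in isolation. This is a minimal sketch of `json.dumps` behavior on a sample fluent-style language dict (the dict itself is illustrative, not taken from the extension's code):

```python
import json

# A fluent-style multilingual value containing a non-ASCII character.
value = {"de": "Stromtarif gemäss ElCom"}

# Default behavior: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(value)
print(escaped)  # {"de": "Stromtarif gem\u00e4ss ElCom"}

# With ensure_ascii=False the original characters are preserved.
preserved = json.dumps(value, ensure_ascii=False)
print(preserved)  # {"de": "Stromtarif gemäss ElCom"}

# Both forms decode back to the same dict, so round-tripping is unaffected.
assert json.loads(escaped) == json.loads(preserved) == value
```

Note that both outputs remain valid JSON and parse identically; `ensure_ascii=False` only changes the stored representation, which is what Solr matches against.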

Moreover, I noticed that other extensions use a validator called "unicode_safe" to handle non-ASCII characters gracefully. I tried it, but the fluent_text validator does not seem to recognize it. It would therefore be very helpful if fluent_text could be updated to integrate or recognize the "unicode_safe" validator so that non-ASCII characters are handled properly.

I look forward to hearing your thoughts on this and would greatly appreciate any guidance or support in this regard.

Thank you.

@wardi
Contributor

wardi commented Sep 18, 2023

Sure I'd accept a PR that makes these changes.

@khashashin
Contributor Author

Fixed in #50.
