remove Kconv.toutf8 conversion #16
…to be misinterpreted as chinese characters
Thanks for your report and for this PR. 😃 But first, I should say that I don't recommend using this rubygem in production 🙏🏼
@@ -17,7 +17,6 @@ def fetch(headers = {})
acceptable_content!(head.headers[:content_type])

res = send_request(:get, @uri, headers)
Kconv.toutf8(res.to_str)
Summary: in my opinion, this behavior (converting string encodings in this gem) should be configurable for different use cases, rather than simply removing this line.
Please read the following for details. 🙏🏼
First, let's check the `String` value in the OGP spec.
https://ogp.me/#string 👀
As you can see in the official docs, a `String` value is described as "A sequence of Unicode characters" (Unicode, but not `UTF-8`).
So I think this gem should follow the `String` value spec as far as possible.
Based on this, and just for my personal use, I decided to convert the web contents (meta tags) to `UTF-8`.
(I think this bad decision of mine is the root cause of the encoding issues in this gem. 😢)
However, web contents (especially meta tag values in HTML files, in this context) can be in various encodings, as you know.
After merging your PR, users of this library would have to deal with OGP string encodings without any additional information (such as which encoding each website used).
For this reason, I don't think that removing the encoding conversion, as this PR does, is the best approach. 🤔
So, as I wrote at the top of this comment, my opinion is that this behavior (converting string encodings in this gem) should be configurable for various cases.
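To make the "configurable" idea concrete, here is a minimal sketch of what such an option could look like. Everything in it is hypothetical: `EncodingOption`, `apply`, `declared_encoding`, and the `convert:` flag are illustrative names and not part of this gem. It also assumes the `Content-Type` header carries a trustworthy `charset` parameter, and falls back to UTF-8 when it does not.

```ruby
# Hypothetical sketch: none of these names exist in the gem today.
module EncodingOption
  # Captures the charset parameter of a Content-Type header value.
  CHARSET_PATTERN = /charset=["']?([^\s;"']+)/i

  # convert: false -> return the body untouched (what this PR proposes).
  # convert: true  -> transcode to UTF-8 from the declared charset,
  #                   assuming UTF-8 when no usable charset is declared.
  def self.apply(body, content_type: nil, convert: true)
    return body unless convert

    source = declared_encoding(content_type) || Encoding::UTF_8
    body.dup.force_encoding(source)
        .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end

  # Returns the Encoding named in a Content-Type header, or nil when the
  # header is missing or names an encoding Ruby does not know.
  def self.declared_encoding(content_type)
    name = content_type.to_s[CHARSET_PATTERN, 1]
    name && Encoding.find(name)
  rescue ArgumentError
    nil
  end
end
```

With something like this, callers who want raw bytes could pass `convert: false`, while the default would keep returning UTF-8 without guessing the source encoding.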
For now, I have simply opened a GitHub issue for this encoding problem: #17
In lib/ogpr/fetcher/html_fetcher.rb:20 the fetched meta tag content is forced to UTF-8 using the stdlib Kconv. This conversion seems unnecessary, and it also introduces a lot of wrongly converted characters. In my use case, many accented Latin letters are converted to Chinese characters. This also seems to happen with some punctuation.
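A plausible mechanism for this, sketched without Kconv itself: Kconv wraps NKF, which guesses among Japanese encodings, so the two bytes of a UTF-8 accented letter can be read as one double-byte character. The example below is an illustration under that assumption and not a trace of what Kconv did for any particular page. `misread_as_euc_jp` is a hypothetical helper that decodes UTF-8 bytes as EUC-JP and transcodes the result back to UTF-8.

```ruby
# Hypothetical helper: treat the raw bytes of a string as EUC-JP and
# transcode them to UTF-8, mimicking a wrong encoding guess.
def misread_as_euc_jp(text)
  text.b.force_encoding(Encoding::EUC_JP).encode(Encoding::UTF_8)
end

original = "caf\u00E9" # "é" is the two bytes C3 A9 in UTF-8
garbled  = misread_as_euc_jp(original)

# The ASCII letters survive, but the two bytes of "é" collapse into a
# single character from the CJK ideograph range.
puts garbled
puts garbled[-1].ord.between?(0x4E00, 0x9FFF)
```

This is why a conversion step that guesses the source encoding can do more harm than leaving the bytes alone when the content is not Japanese.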