Properly decode UTF-8 from gsheet csv #1548
Open
What I was referring to in #1546 (comment) is not actually a problem with extended ASCII or Unicode characters. Unicode should be fine to use as long as it is decoded properly.
In Update Google Docs Meta Data #1546, I saw weird ASCII characters (which I thought might have been escaped to comply with some CSV format) replaced with what I took to be proper, plain quotation marks (they were actually Unicode's stylized quotes).
db_signals.csv in the current main branch has the weird extended ASCII chars (search for "Children cannot get COVID-19" to see an example: delphi-epidata/src/server/endpoints/covidcast_utils/db_signals.csv, line 139 in 9e21bfb).
Somehow, that broken encoding was fixed in Update Google Docs Meta Data #1546. You will have to take my word for it or carefully examine that version of the file for yourself; the diff from that PR is too big for GitHub to display nicely.
You can see that the broken decoding came back in https://github.com/cmu-delphi/delphi-epidata/pull/1547/files#diff-5acd8942e330087af27801ddefd15a2ece4fef9aedcd37ea2e5bcbdb42808bf1R22 after I did a find/replace for some of the quotes. I don't know why we have a mix of low-ASCII and higher Unicode characters for quotes in the sheet, but maybe Google does some "smart" autoformatting for us.
I still can't explain how they were "fixed" in Update Google Docs Meta Data #1546 but reverted in Update Google Docs Meta Data #1547 ¯\_(ツ)_/¯
This code demonstrates that it is a bad decoding:
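A minimal sketch of such a check, assuming the garbling comes from UTF-8 bytes being decoded as ISO-8859-1 (an assumption for illustration; the sample string and variable names are hypothetical, not taken from the sheet):

```python
# Illustrative sample: a phrase wrapped in Unicode curly quotes (U+201C, U+201D).
original = "\u201cChildren cannot get COVID-19\u201d"

# The bytes as they would travel over the wire when encoded as UTF-8.
raw = original.encode("utf-8")

# Wrong decoding: every byte becomes its own character, so each 3-byte
# curly quote turns into three "extended ASCII" characters.
garbled = raw.decode("iso-8859-1")
print(repr(garbled))

# Reversing the wrong step and decoding as UTF-8 recovers the original,
# which shows that the data itself is intact and only the decoding is off.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

If the round trip recovers the original text, the stored file is not corrupted at the source; it was only decoded with the wrong charset.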
For background on the `.text` accessor in `requests`, see: https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8/52615216#52615216
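One way to avoid the problem is to decode the raw response bytes explicitly instead of relying on a guessed charset. The sketch below uses only the standard library and a hypothetical helper name, `parse_sheet_csv`, with made-up sample data; it is not necessarily what this PR changes. With `requests`, the analogous step would be `response.content.decode("utf-8")`, or setting `response.encoding = "utf-8"` before reading `response.text`.

```python
import csv
import io


def parse_sheet_csv(raw: bytes) -> list[list[str]]:
    """Decode CSV bytes explicitly as UTF-8 and return the parsed rows.

    Hypothetical helper for illustration only.
    """
    text = raw.decode("utf-8")  # explicit charset, no guessing
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


# Made-up sample standing in for the bytes of a downloaded sheet export.
sample = (
    "Name,Description\r\n"
    'example_signal,"Belief that \u201cchildren cannot get COVID-19\u201d"\r\n'
).encode("utf-8")

rows = parse_sheet_csv(sample)
print(rows[1][1])
```

Decoding the bytes directly keeps the curly quotes intact regardless of which charset, if any, the server advertises in its headers.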