Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly decode UTF-8 from gsheet csv #1548

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from
Open

Conversation

melange396
Copy link
Collaborator

What i was referring to in #1546 (comment) is not actually a problem with extended ascii or unicode characters -- unicode seems like it should actually be fine to use as long as it is decoded properly...

>>> chr(8220) # unicode quote (left double quote, to be specific)
'“'
>>> chr(8220).encode("utf8") # how its represented in utf8 (3 bytes in the extended ascii range)
b'\xe2\x80\x9c'
>>> chr(8220).encode("utf8").decode("8859") # what it looks like when its improperly decoded (an "a-hat" and two effectively unprintable chars)
\x80\x9c'

Copy link

sonarcloud bot commented Oct 9, 2024

@melange396
Copy link
Collaborator Author

I ran the CSV file sync GH action using the branch from this PR, so it should be using the associated changes: https://github.com/cmu-delphi/delphi-epidata/actions/runs/11247472116

It produced its own PR that appears to preserve the unicode quotes from #1546: #1549 . This is the desired result, but since i cant currently explain the difference in behavior between #1546 and #1547, i cant rule out that this run could also be a fluke.

The code at the core of the GH action is pretty hokey. For example, it uses a strange procedure to omit lines at the end of the csv that are ~empty, but doesnt explain itself. If it were any more complicated than it is, i would suggest redoing it with csv-specific methods/packages/libraries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants