Palestine row violates schema, renders dataset unusable. #26
Comments
Thanks for pointing this out. This is a great example of when reality is difficult to standardize! Palestine is not a universally recognized state and includes disputed territories. The goal of this dataset is to aggregate up-to-date country codes from various standards bodies, not to take a political position. Since this is how the upstream sources have formatted these data, I'm inclined to leave it as-is. Clarity on how to handle this (split into two rows, or redefine the schema to allow for multiple codes) should come from the upstream standards body. Unless this is resolved in the geopolitical arena, I think we should maintain a dataset that reflects the current reality of the upstream standards, even if it is malformed.
While I can appreciate this perspective, it seems the real problem is that the package spec isn't conforming to the upstream data sources; i.e. if there are legitimately multiple values for some columns, shouldn't those be arrays in the spec? I would suggest that in its current form the Palestine data is simply unusable unless the consumer knows that the spec is wrong and that certain fields have commas in them. (Ideally the datapackage spec would accommodate exceptions like this.) I don't really have a solution other than duplicating the rows to blow out the embedded multi-value columns, but it seems that having two entries would be better than zero entries. OTOH, this doesn't really affect me since I don't think Palestine will ever come up in my datasets :)
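The row-duplication suggestion above can be sketched in a few lines. This is a minimal illustration, not the dataset's actual content: the column names and values below are hypothetical stand-ins for a row whose GAUL field holds two comma-separated codes in one cell.

```python
import csv
import io

# Hypothetical sample mirroring the problem row: the "gaul" cell
# contains two comma-separated codes inside a single quoted field.
raw = io.StringIO(
    'name,gaul\n'
    '"Palestine","91,267"\n'
    '"France","85"\n'
)

rows = []
for record in csv.DictReader(raw):
    # Emit one row per embedded value, so every output row carries a
    # single scalar GAUL code.
    for code in record['gaul'].split(','):
        rows.append({'name': record['name'], 'gaul': code.strip()})

print(rows)
# → [{'name': 'Palestine', 'gaul': '91'},
#    {'name': 'Palestine', 'gaul': '267'},
#    {'name': 'France', 'gaul': '85'}]
```

The trade-off is that the country now appears twice, so consumers who treat `name` as a unique key would need to deduplicate.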
I just wanted to follow up on this, since this dataset is still broken and unusable. Here is the workaround code I have to employ to consume this dataset with the python datapackage library:

```python
with open(downloaded_files.source, 'rb') as readfp:
    lines = readfp.readlines()

# Currently there is a bug in the dataset for the Palestine row.
# We'll just remove it, since we won't need it anyway.
with open(downloaded_files.source, 'wb') as writefp:
    for line in lines:
        if not line.decode('utf-8').startswith(r'"Palestine'):
            writefp.write(line)
```

I understand the desire not to take a political stance, but at least update the spec for the datapackage to match the underlying data. As it stands, the mismatch between the data and the specification renders the dataset unusable.
@hozn I do not encounter problems when using pandas.read_csv to load the data for PS
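For what it's worth, the pandas observation above is easy to reproduce: a plain CSV parser accepts the quoted, comma-embedded cell without error, because at that layer it is just a string. The sample data here is a hypothetical stand-in for the real row, not an excerpt from the dataset.

```python
import io

import pandas as pd

# Minimal sketch, assuming a row shaped like the dataset's Palestine
# entry. pandas parses it fine; the schema violation only surfaces
# when a consumer tries to treat GAUL as a single integer.
csv_text = 'name,GAUL\n"Palestine","91,267"\n'
df = pd.read_csv(io.StringIO(csv_text))
print(df.loc[0, 'GAUL'])  # → 91,267  (one string, comma preserved)
```

So `pandas.read_csv` "works" in the sense that it loads the file, but the multi-value cell still has to be handled downstream.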
@ewheeler Customary names may be preferred over official names on various grounds, especially with the aim of being user friendly. #16 More here: "Customary names" section in http://cldr.unicode.org/translation/country-names
I'm trying to parse this dataset using the python datapackage utility. Among other issues (bugs in datapackage), I am unable to deal with the Palestine row, due to the fact that there are two values for the GAUL column. I also see that there are multiple country codes specified (comma-separated), and while that isn't triggering an error (since it's just a "string"), it is going to result in unhelpful data.

If this is conforming to the tabular data spec, I can work with the datapackage author to get this resolved there. If not, I'd propose either: