Palestine row violates schema, renders dataset unusable. #26

Open
hozn opened this issue Sep 10, 2015 · 4 comments

@hozn
Contributor

hozn commented Sep 10, 2015

I'm trying to parse this dataset using the Python datapackage utility. Among other issues (bugs in datapackage), I am unable to deal with the Palestine row because there are two values for the GAUL column:

"Palestine, State of","Palestine, État de",PS,PSE,275, ,"gz,wj", , ,970,PLE,"GZ,WE","91,267",PLE,,"PALESTINE, STATE OF",,No universal currency,,In contention

I also see that there are multiple country codes specified (comma-separated), and while that isn't triggering an error (since it's just a "string"), it is going to result in unhelpful data.
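To make the failure concrete, here is a minimal sketch using only the standard library (not the datapackage library itself). It parses the row quoted above and tries an integer cast on the GAUL field. It assumes, as this report implies, that GAUL is declared as an integer in the schema; the helper name `casts_to_int` is made up for illustration.

```python
import csv
import io

# The Palestine row exactly as quoted in this issue.
ROW = (
    '"Palestine, State of","Palestine, État de",PS,PSE,275, ,"gz,wj", , ,970,'
    'PLE,"GZ,WE","91,267",PLE,,"PALESTINE, STATE OF",,No universal currency,,'
    'In contention\n'
)

# The CSV itself is well-formed: quoted commas stay inside their fields.
fields = next(csv.reader(io.StringIO(ROW)))
gaul = fields[12]  # 13th column; GAUL, going by the column order shown later in this thread


def casts_to_int(value):
    """Return True if the raw CSV value can be cast to an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


print(len(fields), repr(gaul), casts_to_int(gaul))
```

The row splits into the expected number of columns, so the problem is not the CSV syntax. It is the cast of `"91,267"` to a single integer that fails.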

If this is conforming to the tabular data spec, I can work with datapackage author to get this resolved there. If not, I'd propose either:

  1. Moving this entry to two rows.
  2. Changing the affected field types to arrays (though this seems quite inelegant, since all other rows would have arrays of a single value and seems to run contrary to the essence of this table)
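Option 1 could look roughly like the sketch below. It pairs the comma-separated values by position, so that `gz`, `GZ` and `91` end up in one row and `wj`, `WE` and `267` in the other. That pairing is an assumption on my part, not something the upstream sources confirm. The function name and the reduced example row are hypothetical.

```python
def split_multi_value_row(row, columns, sep=","):
    """Expand one row into several, one per value in the multi-value columns.

    `row` is a dict of raw CSV strings; `columns` lists the columns that may
    hold several values. Values are paired by position (an assumption).
    """
    parts = {col: [v.strip() for v in row[col].split(sep)] for col in columns}
    count = max(len(values) for values in parts.values())
    rows = []
    for i in range(count):
        new_row = dict(row)
        for col, values in parts.items():
            if len(values) == count:
                new_row[col] = values[i]
            elif len(values) != 1:
                # Neither a single value nor one value per output row.
                raise ValueError(f"cannot pair values in column {col!r}")
        rows.append(new_row)
    return rows


# Reduced version of the Palestine row, keeping only a few columns.
palestine = {"name": "Palestine, State of", "MARC": "gz,wj",
             "FIPS": "GZ,WE", "GAUL": "91,267", "Dial": "970"}
expanded = split_multi_value_row(palestine, ["MARC", "FIPS", "GAUL"])
```

Only the listed columns are split, so the comma inside the `name` value is left alone.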
@ewheeler
Contributor

Thanks for pointing this out. This is a great example of when reality is difficult to standardize!

Palestine is not a universally recognized state and includes disputed territories. The goal of this dataset is to aggregate up-to-date country codes from various standards bodies, not to take a political position.

Since this is how the upstream sources have formatted these data, I'm inclined to leave it as-is. Clarity on how to handle this (split into two rows or redefine the schema to allow multiple codes) should come from the upstream standards body. Unless this is resolved in the geopolitical arena, I think we should maintain a malformed dataset that reflects the current reality of the upstream standards.

@hozn
Contributor Author

hozn commented Sep 21, 2015

While I can appreciate this perspective, it seems the real problem is that the package spec doesn't conform to the upstream data sources, i.e. if there are legitimately multiple values for some columns, then shouldn't those be arrays in the spec? I would suggest that in its current form the Palestine data is simply unusable unless the consumer knows that the spec is wrong and that certain fields have commas in them. (Ideally the datapackage spec would accommodate exceptions like this.) I don't really have a solution other than duplicating the rows to blow out the embedded multi-value columns, but it seems that having two entries would be better than zero entries.
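For the array alternative, a consumer-side cast might look like the sketch below. This is a plain Python helper, not part of the datapackage library or the tabular data spec, and the name `parse_int_list` is hypothetical. It treats every value in the column as a list, so single-code rows become one-element lists.

```python
def parse_int_list(value, sep=","):
    """Cast a raw CSV value such as "91,267" to a list of integers.

    Blank values (empty or whitespace-only) become an empty list.
    """
    value = value.strip()
    if not value:
        return []
    return [int(part) for part in value.split(sep)]


gaul_codes = parse_int_list("91,267")
```

With this cast, the Palestine row yields two GAUL codes, a single-code row yields one, and a blank field yields none.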

OTOH, this doesn't really affect me since I don't think Palestine will ever come up in my datasets :)

@hozn
Contributor Author

hozn commented Feb 16, 2016

I just wanted to follow up on this, since this dataset is still broken and unusable.

Here is my workaround code that I have to employ to consume this dataset with the python datapackage library:

        with open(downloaded_files.source, 'rb') as readfp:
            lines = readfp.readlines()

        # Currently there is a bug in the dataset for the Palestine row.  We'll just remove it, since we won't need it anyway.
        with open(downloaded_files.source, 'wb') as writefp:
            for line in lines:
                if not line.decode('utf-8').startswith(r'"Palestine'):
                    writefp.write(line)

I understand the desire not to take a political stance, but at least update the spec for the datapackage to match the underlying data. As it is currently, the mismatch between the data and the specification renders the dataset unusable.

@hozn changed the title from "Palestine row violates schema" to "Palestine row violates schema, renders dataset unusable." on Feb 16, 2016
@hanteng
Contributor

hanteng commented May 25, 2016

@hozn I do not encounter problems when using pandas.read_csv to load the data for PS:

df_cc_indexd.loc['PS']
name                        Palestine
name_fr                     Palestine, État de
ISO3166-1-Alpha-3           PSE
ISO3166-1-numeric           275
ITU
MARC                        gz,wj
WMO
DS
Dial                        970
FIFA                        PLE
FIPS                        GZ,WE
GAUL                        91,267
IOC                         PLE
currency_alphabetic_code
currency_country_name       PALESTINE, STATE OF
currency_minor_unit
currency_name               No universal currency
currency_numeric_code
is_independent              In contention
Name: PS, dtype: object

@ewheeler Customary names may be preferred over official names on various grounds, especially with the aim of being user friendly (see #16). More here: the "Customary names" section in http://cldr.unicode.org/translation/country-names
