Palestine row violates schema, renders dataset unusable. #26

hozn · 2015-09-10T15:22:12Z

I'm trying to parse this dataset using the Python datapackage utility. Among other issues (bugs in datapackage), I am unable to deal with the Palestine row because it has two values in the GAUL column:

"Palestine, State of","Palestine, État de",PS,PSE,275, ,"gz,wj", , ,970,PLE,"GZ,WE","91,267",PLE,,"PALESTINE, STATE OF",,No universal currency,,In contention

I also see that multiple country codes are specified (comma-separated) in other columns. That isn't triggering an error, since those fields are just typed as "string", but it is going to result in unhelpful data.
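To make the failure concrete, here is a minimal sketch using only Python's standard csv module rather than the datapackage library. It assumes GAUL is the 13th column and that the schema expects a single integer there; neither the column position nor the declared type is stated in this thread, and the helper names are hypothetical.

```python
import csv
import io

# The Palestine row exactly as quoted above.
ROW = ('"Palestine, State of","Palestine, État de",PS,PSE,275, ,"gz,wj", , ,970,PLE,'
       '"GZ,WE","91,267",PLE,,"PALESTINE, STATE OF",,No universal currency,,In contention\n')

GAUL_INDEX = 12  # assumption: GAUL is the 13th column of the row


def parse_row(text):
    """Parse one CSV line with the standard csv module."""
    return next(csv.reader(io.StringIO(text)))


def gaul_as_int(fields):
    """Return the GAUL cell as an int, or None if it is not a single integer."""
    try:
        return int(fields[GAUL_INDEX])
    except ValueError:
        return None


fields = parse_row(ROW)
# The quoting is valid CSV, so the row splits cleanly into fields;
# it is only the integer cast of the GAUL cell that fails.
print(len(fields), repr(fields[GAUL_INDEX]), gaul_as_int(fields))
```

Under these assumptions the row itself is well-formed CSV; the problem only appears once a type is imposed on the GAUL cell.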

If this is conforming to the tabular data spec, I can work with datapackage author to get this resolved there. If not, I'd propose either:

Moving this entry to two rows.
Changing the affected field types to arrays (though this seems quite inelegant, since all other rows would have arrays of a single value and seems to run contrary to the essence of this table)
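The first option could be sketched roughly as follows. This is only an illustration: `split_multi_valued` is a hypothetical helper, and pairing the values by position (first MARC code with first FIPS code with first GAUL code) is an assumption that the dataset does not confirm.

```python
def split_multi_valued(row, columns):
    """Expand one dict row into several, one per comma-separated value.

    `columns` lists the fields that may hold several values. Values are
    paired by position, which is an assumption, not something the dataset
    guarantees.
    """
    parts = {c: [v.strip() for v in row[c].split(",")] for c in columns}
    count = max((len(v) for v in parts.values()), default=1)
    if count == 1:
        return [dict(row)]
    for c, values in parts.items():
        # A column must either have one value (repeated) or one per new row.
        if len(values) not in (1, count):
            raise ValueError(f"cannot pair values for {c!r}")
    out = []
    for i in range(count):
        new = dict(row)
        for c, values in parts.items():
            new[c] = values[i] if len(values) == count else values[0]
        out.append(new)
    return out


# Only the affected columns of the Palestine row, taken from the line quoted above.
palestine = {"ISO3166-1-Alpha-2": "PS", "MARC": "gz,wj", "FIPS": "GZ,WE", "GAUL": "91,267"}
rows = split_multi_valued(palestine, ["MARC", "FIPS", "GAUL"])
for r in rows:
    print(r)
```

The second option would instead keep one row and store a list in each affected field, at the cost of single-element lists everywhere else.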


ewheeler · 2015-09-21T09:09:43Z

thanks for pointing this out. this is a great example of when reality is difficult to standardize!

Palestine is not a universally recognized state and includes disputed territories. The goal of this dataset is to aggregate up-to-date country codes from various standards bodies-- not to take a political position.

Since this is how the upstream sources have formatted these data, I'm inclined to leave as-is. Clarity on how to handle this (split to two rows or redefine schema to allow for multiple codes) should come from the upstream standards body. Unless this is resolved in the geopolitical arena, I think we should maintain a malformed dataset that reflects the current reality of the upstream standards.

hozn · 2015-09-21T11:36:43Z

While I can appreciate this perspective, it seems the real problem is that the package spec isn't conforming to the upstream data sources, i.e. if there are legitimately multiple values for some columns, then shouldn't those be arrays in the spec? I would suggest that in its current form the Palestine data is simply unusable unless the consumer knows that the spec is wrong and that certain fields have commas in them. (Ideally the datapackage spec would accommodate exceptions like this.) I don't really have a solution other than duplicating the rows to blow out the embedded multi-value columns, but it seems that having two entries would be better than zero entries.

OTOH, this doesn't really affect me since I don't think Palestine will ever come up in my datasets :)

hozn · 2016-02-16T12:56:24Z

I just wanted to follow up on this, since this dataset is still broken and unusable.

Here is my workaround code that I have to employ to consume this dataset with the python datapackage library:

        with open(downloaded_files.source, 'rb') as readfp:
            lines = readfp.readlines()

        # Currently there is a bug in the dataset for the Palestine row.  We'll just remove it, since we won't need it anyway.
        with open(downloaded_files.source, 'wb') as writefp:
            for line in lines:
                if not line.decode('utf-8').startswith(r'"Palestine'):
                    writefp.write(line)

I understand the desire to not take a political stance, but please at least update the spec for the datapackage to match the underlying data. As it is currently, the mismatch between the data and the specification renders the dataset unusable.
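A variant of the workaround that does not depend on the row's name prefix might look like the sketch below. It drops any row whose GAUL cell is neither empty nor a single integer. It assumes the CSV header contains a column literally named GAUL; the function name and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def drop_rows_with_bad_gaul(csv_text):
    """Return CSV text without rows whose GAUL cell is neither empty nor an integer."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        value = (row["GAUL"] or "").strip()
        if value:
            try:
                int(value)
            except ValueError:
                continue  # e.g. "91,267": skip the row instead of failing later
        writer.writerow(row)
    return out.getvalue()


# Made-up sample data; only the first data row mirrors the issue discussed here.
sample = 'name,GAUL\n"Palestine, State of","91,267"\nExampleland,1\nNowhere,\n'
cleaned = drop_rows_with_bad_gaul(sample)
print(cleaned)
```

Like the original workaround, this discards the Palestine entry rather than repairing it, so it is only suitable when that row is not needed.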

hanteng · 2016-05-25T07:29:36Z

@hozn I do not encounter problems when using pandas.read_csv to load the data for PS

        df_cc_indexd.loc['PS']
        name                        Palestine
        name_fr                     Palestine, État de
        ISO3166-1-Alpha-3           PSE
        ISO3166-1-numeric           275
        ITU
        MARC                        gz,wj
        WMO
        DS
        Dial                        970
        FIFA                        PLE
        FIPS                        GZ,WE
        GAUL                        91,267
        IOC                         PLE
        currency_alphabetic_code
        currency_country_name       PALESTINE, STATE OF
        currency_minor_unit
        currency_name               No universal currency
        currency_numeric_code
        is_independent              In contention
        Name: PS, dtype: object

@ewheeler Customary names may be preferred over official names on various grounds, especially when the aim is to be user friendly (see #16). More here: the "Customary names" section in http://cldr.unicode.org/translation/country-names

hozn changed the title ~~Palestine row violates schema~~ Palestine row violates schema, renders dataset unusable. Feb 16, 2016
