Issue writing the dataset as parquet in add_features #105

Open
radkoff opened this issue Dec 5, 2020 · 1 comment

radkoff commented Dec 5, 2020

I'm running the following locally (not in Docker) with the latest from mainline: python -m deepform.data.add_features data/3_year_manifest.csv

And I'm getting this traceback:

Traceback (most recent call last):
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/evan/deepform/deepform/data/add_features.py", line 263, in <module>
    extend_and_write_docs(
  File "/Users/evan/deepform/deepform/data/add_features.py", line 98, in extend_and_write_docs
    doc_index.to_parquet(pq_index)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 2365, in to_parquet
    to_parquet(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 270, in to_parquet
    return impl.write(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 101, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1376, in pyarrow.lib.Table.from_pandas
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 565, in convert_column
    raise e
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 0.0 with type str: tried to convert to double', 'Conversion failed for column gross_amount with type object')

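I inspected the DataFrame, and the issue appears to be that the document with slug 499480-cancel-68803-13518579030793-_-pdf has gross_amount stored as the string '0.0' rather than a float. That leaves the column with object dtype, and pyarrow fails when it tries to convert it to double.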
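
For anyone who wants to run the same kind of inspection, here is a minimal sketch of one way to find the offending rows. It uses a toy stand-in for doc_index: the slugs and amounts are made up for illustration (only the '0.0' string mirrors the traceback), and the variable names are assumptions rather than anything from add_features.py.

```python
import pandas as pd

# Hypothetical stand-in for doc_index; the real one is built in add_features.py.
doc_index = pd.DataFrame(
    {
        "slug": ["doc-a", "doc-b", "doc-c"],
        "gross_amount": [1250.0, "0.0", 300.5],  # one str among floats
    }
)

# Count the Python types that actually occur in the column.
type_counts = doc_index["gross_amount"].map(type).value_counts()
print(type_counts)

# Select the rows whose value is a str rather than a float.
is_str = doc_index["gross_amount"].map(lambda v: isinstance(v, str))
str_rows = doc_index[is_str]
print(str_rows)
```

If more than one type shows up in the counts, the column has object dtype with mixed contents, which matches what the error message reports.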

One solution might be to do:
doc_index['gross_amount'] = doc_index.gross_amount.apply(pd.to_numeric, errors='coerce')
before exporting to parquet format, but I wanted to confirm with you all that this field is supposed to be float, and that the 0.0 amount isn't a mistake. I'm also not sure why no one else has run into this, so maybe something else is up.
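
A rough sketch of that idea, using the vectorized form of pd.to_numeric and checking which values get coerced to NaN so nothing is lost silently. The toy data is an assumption: the "n/a" entry is a made-up example of an unparseable value, not something observed in the manifest.

```python
import pandas as pd

# Hypothetical stand-in for doc_index with a mixed-type column.
doc_index = pd.DataFrame({"gross_amount": [1250.0, "0.0", "n/a", None]})

raw = doc_index["gross_amount"]

# Parse the whole column at once; unparseable values become NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present before but turned into NaN during coercion.
newly_missing = raw[numeric.isna() & raw.notna()]
print(newly_missing)

# Replace the column only after reviewing what would be dropped.
doc_index["gross_amount"] = numeric
print(doc_index["gross_amount"].dtype)
```

With errors='coerce', anything that cannot be parsed is replaced by NaN without a warning, so looking at newly_missing first is a cheap way to confirm that only genuinely bad values are affected.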

radkoff commented Dec 12, 2020

I think this is happening because gross_amount has mixed dtypes (both str and float), which pyarrow can't convert (at least not version 1.0.1, which is what I have). I got the pd.to_numeric solution working, but it required too many other changes around the codebase, so I'm casting to str in my PR instead.
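
As an illustration only (this is not necessarily how the PR does it), a cast to str could look like the sketch below. It assumes a toy doc_index and casts just the non-null values, because a plain astype(str) can turn missing values into literal text such as 'nan' or 'None'.

```python
import pandas as pd

# Hypothetical stand-in for doc_index with str, float, and missing values.
doc_index = pd.DataFrame({"gross_amount": [1250.0, "0.0", None]})

col = doc_index["gross_amount"]

# Keep missing entries as they are; convert everything else to str.
as_text = col.where(col.isna(), col.astype(str))

doc_index["gross_amount"] = as_text
print(doc_index["gross_amount"].tolist())
```

After this step every non-null value in the column is a str, so the column is no longer a mix of str and float when it reaches the existing doc_index.to_parquet(pq_index) call. The trade-off is that downstream code has to parse the amounts back into numbers when it needs them.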
