I'm running the following (locally, not docker) with the latest from mainline: python -m deepform.data.add_features data/3_year_manifest.csv
And I'm getting this traceback:
Traceback (most recent call last):
File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/evan/deepform/deepform/data/add_features.py", line 263, in <module>
extend_and_write_docs(
File "/Users/evan/deepform/deepform/data/add_features.py", line 98, in extend_and_write_docs
doc_index.to_parquet(pq_index)
File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 199, in wrapper
return func(*args, **kwargs)
File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 2365, in to_parquet
to_parquet(
File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 270, in to_parquet
return impl.write(
File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 101, in write
table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
File "pyarrow/table.pxi", line 1376, in pyarrow.lib.Table.from_pandas
File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
arrays[i] = maybe_fut.result()
File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 565, in convert_column
raise e
File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 0.0 with type str: tried to convert to double', 'Conversion failed for column gross_amount with type object')
I inspected the DataFrame, and the issue appears to be that the document with slug 499480-cancel-68803-13518579030793-_-pdf has 0.0 for gross_amount stored as a str (per the error message), which prevents pyarrow from converting the column to double.
One solution might be to do: doc_index['gross_amount'] = doc_index.gross_amount.apply(pd.to_numeric, errors='coerce')
before exporting to parquet format, but I wanted to confirm with you all that this field is supposed to be float, and that the 0.0 amount isn't a mistake. I'm also not sure why no one else has run into this, so maybe something else is up.
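For reference, here is a minimal sketch of the coercion idea on a toy frame. The data below is hypothetical and only stands in for doc_index (the real manifest contents are not shown here), and it assumes the column holds a mix of floats and strings. It uses the vectorized pd.to_numeric(series, errors="coerce") form, which should be equivalent to the .apply version above:

```python
import pandas as pd

# Hypothetical stand-in for doc_index: an object column mixing floats and strings.
doc_index = pd.DataFrame(
    {
        "slug": ["doc-a", "doc-b", "doc-c"],
        "gross_amount": [1250.0, "0.0", "n/a"],
    }
)

# Check which Python types actually occur in the column.
types_seen = {type(value).__name__ for value in doc_index["gross_amount"]}

# Coerce to float; anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(doc_index["gross_amount"], errors="coerce")
doc_index["gross_amount"] = numeric
```

Note that errors="coerce" silently turns unparseable values into NaN, so it may be worth counting the NaNs before and after to see how many rows were affected.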
I think this is happening because of mixed dtypes in dollar_amount (both str and float), which doesn't vibe with pyarrow (or at least version 1.0.1 for me). I got the pd.to_numeric solution working, but it required too many other changes around the codebase, so instead I'm casting to str in my PR.
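A rough sketch of what casting to str could look like, again on hypothetical toy data rather than the real doc_index, and not necessarily what the PR does:

```python
import pandas as pd

# Hypothetical stand-in for doc_index with a mixed float/str column.
doc_index = pd.DataFrame(
    {
        "slug": ["doc-a", "doc-b"],
        "gross_amount": [1250.0, "0.0"],
    }
)

# Cast every value to str so the column has a single, consistent type.
doc_index["gross_amount"] = doc_index["gross_amount"].astype(str)
amounts = doc_index["gross_amount"].tolist()
```

The tradeoff is that downstream code has to parse the amounts back into numbers whenever it needs arithmetic.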