GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data #3131

seisman · 2024-03-21T05:38:14Z

Sometimes GMT outputs nothing to a virtual file. For example, gmt select may find no data points that satisfy the specified criteria.

In such cases, GMT_DATASET.to_dataframe raises an exception, as shown below:

In [1]: import pandas as pd

In [2]: pd.concat(objs=[], axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 pd.concat(objs=[], axis=1)

File ~/opt/miniconda/envs/pygmt/lib/python3.12/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    379 elif copy and using_copy_on_write():
    380     copy = False
--> 382 op = _Concatenator(
    383     objs,
    384     axis=axis,
    385     ignore_index=ignore_index,
    386     join=join,
    387     keys=keys,
    388     levels=levels,
    389     names=names,
    390     verify_integrity=verify_integrity,
    391     copy=copy,
    392     sort=sort,
    393 )
    395 return op.get_result()

File ~/opt/miniconda/envs/pygmt/lib/python3.12/site-packages/pandas/core/reshape/concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    442 self.verify_integrity = verify_integrity
    443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
    447 # figure out what our result ndim is going to be
    448 ndims = self._get_ndims(objs)

File ~/opt/miniconda/envs/pygmt/lib/python3.12/site-packages/pandas/core/reshape/concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
    504     objs_list = list(objs)
    506 if len(objs_list) == 0:
--> 507     raise ValueError("No objects to concatenate")
    509 if keys is None:
    510     objs_list = list(com.not_none(*objs_list))

ValueError: No objects to concatenate

Instead of raising exceptions, GMT_DATASET.to_dataframe should return a reasonable value. I think an empty DataFrame (i.e., pd.DataFrame()) makes more sense than None.

This PR updates the GMT_DATASET.to_dataframe() method to return an empty DataFrame in such cases and also adds two tests (one for benchmarking and one for testing the empty DataFrame).
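As a rough illustration of the guard described above, outside of PyGMT: pd.concat raises ValueError for an empty list, so the list is checked first. The function name concat_or_empty is hypothetical and only sketches the idea; it is not the PyGMT implementation.

```python
import pandas as pd


def concat_or_empty(vectors: list[pd.Series]) -> pd.DataFrame:
    """Concatenate column vectors, or return an empty DataFrame if there are none.

    Hypothetical helper for illustration; not part of PyGMT.
    """
    if not vectors:
        # pd.concat(objs=[]) would raise "ValueError: No objects to concatenate".
        return pd.DataFrame()
    return pd.concat(objs=vectors, axis=1)


empty = concat_or_empty([])
filled = concat_or_empty([pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0])])
```

Here empty has no rows and no columns, while filled is an ordinary two-column DataFrame.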

seisman · 2024-03-21T06:00:21Z

pygmt/tests/test_datatypes_dataset.py

+    return df
+
+
+def dataframe_from_gmt(fname):


For reference, GMT provides two special/undocumented modules, read and write (their source code is in gmt/src/gmtread.c and gmt/src/gmtwrite.c), that can read a file into a GMT object (e.g., reading a tabular file as GMT_DATASET, or reading a grid as GMT_GRID). Currently, we frequently use the special read module in the doctests of the pygmt.clib.session module (similar to lines 46-50 below). We may want to make it public in the future, as already done in GMT.jl (https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtread-Tuple{String} and https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtwrite).

pygmt/tests/test_datatypes_dataset.py

Co-authored-by: Yvonne Fröhlich <94163266+yvonnefroehlich@users.noreply.github.com>

weiji14 · 2024-03-22T08:18:41Z

pygmt/datatypes/dataset.py

@@ -211,5 +213,5 @@ def to_dataframe(self) -> pd.DataFrame:
                pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype())
            )

-        df = pd.concat(objs=vectors, axis=1)
+        df = pd.concat(objs=vectors, axis=1) if vectors else pd.DataFrame()


An empty pd.DataFrame() won't have any columns. Should there still be columns returned (even if there are no rows)? How would this work with #3117 for example?

A DataFrame with columns but no rows is still empty. So I guess it's fine.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: df
Out[3]:
Empty DataFrame
Columns: []
Index: []

In [4]: df = pd.DataFrame(columns=None)

In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: []

In [6]: df = pd.DataFrame(columns=["col1", "col2"])

In [7]: df
Out[7]:
Empty DataFrame
Columns: [col1, col2]
Index: []

In [8]: df.empty
Out[8]: True

Column names are set here like so:

pygmt/pygmt/clib/session.py

Lines 1853 to 1861 in 1eb6dec

    # Read the virtual file as a GMT dataset and convert to pandas.DataFrame
    result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe()
    if output_type == "numpy":  # numpy.ndarray output
        return result.to_numpy()

    # Assign column names
    if column_names is not None:
        result.columns = column_names
    return result  # pandas.DataFrame output

So we would do something like:

    import pandas as pd

    df = pd.DataFrame(columns=None)
    assert df.empty
    df.columns = ["col1", "col2"]

But setting column names to ["col1", "col2"] errors with:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 df.columns = ["col1", "col2"]

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:6310, in NDFrame.__setattr__(self, name, value)
   6308 try:
   6309     object.__getattribute__(self, name)
-> 6310     return object.__setattr__(self, name, value)
   6311 except AttributeError:
   6312     pass

File properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:813, in NDFrame._set_axis(self, axis, labels)
    808 """
    809 This is called from the cython code when we set the `index` attribute
    810 directly, e.g. `series.index = [1, 2, 3]`.
    811 """
    812 labels = ensure_index(labels)
--> 813 self._mgr.set_axis(axis, labels)
    814 self._clear_item_cache()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/managers.py:238, in BaseBlockManager.set_axis(self, axis, new_labels)
    236 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
    237     # Caller is responsible for ensuring we have an Index object.
--> 238     self._validate_set_axis(axis, new_labels)
    239     self.axes[axis] = new_labels

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/base.py:98, in DataManager._validate_set_axis(self, axis, new_labels)
     95     pass
     97 elif new_len != old_len:
---> 98     raise ValueError(
     99         f"Length mismatch: Expected axis has {old_len} elements, new "
    100         f"values have {new_len} elements"
    101     )

ValueError: Length mismatch: Expected axis has 0 elements, new values have 2 elements

I think we need to refactor GMT_DATASET.to_dataframe() to accept more pandas-specific parameters (e.g., column_names, index_col). Then virtualfile_to_dataset will call it like this:

    result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe(columns=column_names)
    if output_type == "numpy":  # numpy.ndarray output
        return result.to_numpy()
    return result  # pandas.DataFrame output
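One way to picture the proposed refactor is a conversion helper that receives the column names itself, so an empty result can be constructed with the requested columns rather than renamed afterwards. This is only a sketch under assumptions: vectors_to_dataframe and its column_names parameter are hypothetical stand-ins, not the actual to_dataframe signature.

```python
import pandas as pd


def vectors_to_dataframe(vectors, column_names=None):
    """Build a DataFrame from a list of pandas.Series (hypothetical helper)."""
    if len(vectors) == 0:
        # Pass the names to the constructor; assigning df.columns afterwards
        # would fail with a length mismatch on a zero-column DataFrame.
        return pd.DataFrame(columns=column_names)
    df = pd.concat(objs=vectors, axis=1)
    if column_names is not None:
        df.columns = column_names
    return df


empty = vectors_to_dataframe([], column_names=["col1", "col2"])
filled = vectors_to_dataframe(
    [pd.Series([1, 2]), pd.Series([3, 4])], column_names=["col1", "col2"]
)
```

With this shape, the caller no longer needs to touch result.columns, which sidesteps the length mismatch for empty output.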

Right, so moving these lines that handle the column names from virtualfile_to_dataset to to_dataframe:

pygmt/pygmt/clib/session.py

Lines 1861 to 1863 in 62eb5d6

    # Assign column names
    if column_names is not None:
        result.columns = column_names

Do you want to just do that in #3117? Or have a separate PR to handle this?

Better to do it in a separate PR so that #3117 can focus on parsing the column names from header.

Done in #3140, so need to refactor this PR after #3140 is merged.

seisman · 2024-03-27T04:55:24Z

pygmt/datatypes/dataset.py

@@ -226,8 +228,13 @@ def to_dataframe(
                pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype())
            )

+        # Return an empty DataFrame if no columns are found.
+        if len(vectors) == 0:
+            return pd.DataFrame()


Currently, it returns an empty DataFrame without columns and rows, but an empty DataFrame with columns is also allowed, e.g.,

return pd.DataFrame(columns=column_names)

I guess either is fine. I think we can use return pd.DataFrame() now and make changes if necessary.

I think we should return the column_names, so that users who want to e.g. do pd.concat on multiple pd.DataFrame outputs from running an algorithm like pygmt.select in a for-loop can do so in a more straightforward way. Note that we should also set the dtypes of the columns properly, even if the rows are empty, otherwise the dtypes will all become object:

    df1 = pd.DataFrame(data=[[0, 1, 2]], columns=["x", "y", "z"])
    print(df1.dtypes)
    # x    int64
    # y    int64
    # z    int64
    # dtype: object

    df2 = pd.DataFrame(columns=["x", "y", "z"])
    print(df2.dtypes)
    # x    object
    # y    object
    # z    object
    # dtype: object

    pd.concat(objs=[df1, df2]).dtypes
    # x    object
    # y    object
    # z    object
    # dtype: object

See my other suggestion at #3131 (comment) on not returning an empty pd.DataFrame() early, until the dtype is set with df.astype(dtype) below.
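The dtype concern can be sketched as follows: give the empty frame explicit dtypes with astype, so that concatenating it with a non-empty frame of the same dtypes keeps those dtypes. The helper name empty_frame and the dtype mapping are assumptions made for illustration only.

```python
import pandas as pd


def empty_frame(dtypes: dict[str, str]) -> pd.DataFrame:
    """Return a zero-row DataFrame whose columns carry the given dtypes.

    Hypothetical helper for illustration; not part of PyGMT.
    """
    return pd.DataFrame(columns=list(dtypes)).astype(dtypes)


df1 = pd.DataFrame(data=[[0, 1, 2]], columns=["x", "y", "z"])
df2 = empty_frame({"x": "int64", "y": "int64", "z": "int64"})
combined = pd.concat(objs=[df1, df2])
```

Because both frames now agree on int64, the concatenated result keeps int64 columns instead of falling back to object.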

pygmt/datatypes/dataset.py

weiji14 · 2024-03-29T01:30:26Z

Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>

pygmt/datatypes/dataset.py

weiji14

Should probably find a way to test the case where an empty pd.DataFrame is returned with column names and set dtypes, but it looks a bit tricky 🙂

pygmt/datatypes/dataset.py

Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>

seisman added the enhancement Improving an existing feature label Mar 21, 2024

seisman added this to the 0.12.0 milestone Mar 21, 2024

seisman added the needs review This PR has higher priority and needs review. label Mar 21, 2024

seisman force-pushed the dataset/empty_dataframe branch 2 times, most recently from 9d4abf9 to 3f3c0c5 Compare March 21, 2024 05:41

seisman added the run/benchmark Trigger the benchmark workflow in PRs label Mar 21, 2024

GMT_DATASET: Return an empty DataFrame if the file has no data

175ba3c

seisman force-pushed the dataset/empty_dataframe branch from 3f3c0c5 to 175ba3c Compare March 21, 2024 05:46

seisman commented Mar 21, 2024

yvonnefroehlich reviewed Mar 21, 2024

Apply suggestions from code review

2e6e277

Co-authored-by: Yvonne Fröhlich <94163266+yvonnefroehlich@users.noreply.github.com>

seisman changed the title ~~GMT_DATASET: Return an empty DataFrame if a file has no data~~ GMT_DATASET: Return an empty DataFrame if a file contains no data Mar 21, 2024

weiji14 reviewed Mar 22, 2024

michaelgrund approved these changes Mar 22, 2024

seisman added 2 commits March 23, 2024 22:57

Merge branch 'main' into dataset/empty_dataframe

7482b25

Merge branch 'main' into dataset/empty_dataframe

3246e5c

seisman mentioned this pull request Mar 26, 2024

Session.virtualfile_to_dataset: Add new parameters 'dtype'/'index_col' for pandas output #3140

Merged

seisman added 2 commits March 27, 2024 10:30

Merge branch 'main' into dataset/empty_dataframe

ec59f9c

Fixes

a2c48d5

seisman changed the title ~~GMT_DATASET: Return an empty DataFrame if a file contains no data~~ GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data Mar 27, 2024

seisman commented Mar 27, 2024

Add more comments

1281ec0

seisman removed the run/benchmark Trigger the benchmark workflow in PRs label Mar 27, 2024

Merge branch 'main' into dataset/empty_dataframe

b817e91

weiji14 reviewed Mar 29, 2024

Return an empty DataFrame with column names

71cc9b7

Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>

seisman commented Mar 29, 2024

seisman added 2 commits March 29, 2024 09:57

Do not assign column names again for empty DataFrame

065ec12

Improve type hints

dbfc2ae

seisman force-pushed the dataset/empty_dataframe branch from d5283cd to dbfc2ae Compare March 29, 2024 03:30

weiji14 approved these changes Mar 29, 2024

weiji14 added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Mar 29, 2024

Apply suggestions from code review

06790e2

Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>

seisman removed the final review call This PR requires final review and approval from a second reviewer label Mar 29, 2024

seisman merged commit 85d4ed2 into main Mar 29, 2024
19 checks passed

seisman deleted the dataset/empty_dataframe branch March 29, 2024 05:52
