clib: Add virtualfile_to_dataset method for converting virtualfile to a dataset #3083

Merged on Mar 11, 2024 (13 commits)
1 change: 1 addition & 0 deletions doc/api/index.rst
@@ -294,6 +294,7 @@ conversion of Python variables to GMT virtual files:
    clib.Session.virtualfile_from_grid
    clib.Session.virtualfile_in
    clib.Session.virtualfile_out
    clib.Session.virtualfile_to_dataset
Member
@weiji14, Mar 11, 2024

At L287 above, could you change the sentence to read "These methods are context managers that automate the conversion of Python variables to and from GMT virtual files"? We can convert GMT virtual files to Python objects now.

Member Author

Done in aee0499.


Low level access (these are mostly used by the :mod:`pygmt.clib` package):

121 changes: 121 additions & 0 deletions pygmt/clib/session.py
@@ -1738,6 +1738,127 @@ def read_virtualfile(
        dtype = {"dataset": _GMT_DATASET, "grid": _GMT_GRID}[kind]
        return ctp.cast(pointer, ctp.POINTER(dtype))

    def virtualfile_to_dataset(
        self,
        output_type: Literal["pandas", "numpy", "file"],
Member

Should we set a default output type here? It looks like we're using pandas as the default in #3092.

Suggested change:
- output_type: Literal["pandas", "numpy", "file"],
+ output_type: Literal["pandas", "numpy", "file"] = "pandas",

Member Author

It makes no difference, because we always call the function with the output_type parameter set explicitly, e.g.:

        return lib.return_dataset(
            output_type=output_type,
            vfile=vouttbl,
        )

Member

Yes, it doesn't make any difference in the PyGMT modules, but this is a good central location to document that output_type="pandas" is the default output (though in #1318, it seemed like most of us were in favour of output_type="input" or auto as the default).

Member Author

output_type="input" or auto may not make sense for PyGMT, especially in cases like these:

  1. The input data is a file. Then auto means writing to a file by default, so outfile is required.
  2. The input data is vectors (e.g., x/y/z), and each vector can be a list, ndarray, or pd.Series. What is the expected output format if auto/input is used?
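
The two cases above can be sketched in a few lines. The resolver below is purely hypothetical: the name `resolve_auto_output_type` and its `outfile` argument are assumptions made for illustration and are not PyGMT API. It only shows where an "auto" rule either needs extra information or has no single answer.

```python
from pathlib import Path


def resolve_auto_output_type(data, outfile=None):
    """Hypothetical resolver for output_type="auto"; not part of PyGMT."""
    # Case 1: file input would map to file output, which needs a destination.
    if isinstance(data, (str, Path)):
        if outfile is None:
            raise ValueError("file input implies file output, so outfile is required")
        return "file"
    # Case 2: a sequence of vectors, each of which may be a different container.
    kinds = {type(vector).__name__ for vector in data}
    if len(kinds) != 1:
        raise ValueError(f"cannot pick one output format for input types {sorted(kinds)}")
    return kinds.pop()


print(resolve_auto_output_type("input.txt", outfile="output.txt"))  # file
print(resolve_auto_output_type([[1.0, 2.0], [3.0, 4.0]]))  # list
```

Passing a file name without `outfile`, or vectors of mixed container types such as a list next to a tuple, raises `ValueError` in this sketch, which mirrors the two objections raised in the comment.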

Member
@weiji14, Mar 11, 2024

Yeah, not saying that output_type="auto" would be easy to implement 🙂 I think the default output_type="pandas" is fine for now since it is an in-memory format that can be converted to virtualfiles relatively easily. We can discuss more about what the ideal output type would be in #1318 (if there is still any debate that needs to be had).
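
As a small illustration of the point about documenting the default: a default value in the signature does not change anything for callers that pass `output_type` explicitly, but it stays visible to readers and tools. The function `to_dataset` below is a hypothetical stand-in, not the PyGMT method.

```python
import inspect
from typing import Literal


def to_dataset(output_type: Literal["pandas", "numpy", "file"] = "pandas") -> str:
    """Hypothetical stand-in that only echoes the requested output type."""
    return output_type


# Callers that pass output_type explicitly are unaffected by the default.
explicit = to_dataset(output_type="numpy")

# The default is still discoverable from the signature itself.
documented_default = inspect.signature(to_dataset).parameters["output_type"].default

print(explicit, documented_default)  # numpy pandas
```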

        vfile: str,
        column_names: list[str] | None = None,
    ) -> pd.DataFrame | np.ndarray | None:
        """
        Output a tabular dataset stored in a virtual file to a different format.

        The format of the dataset is determined by the ``output_type`` parameter.

        Parameters
        ----------
        output_type
            Desired output type of the result data.

            - ``"pandas"`` will return a :class:`pandas.DataFrame` object.
            - ``"numpy"`` will return a :class:`numpy.ndarray` object.
            - ``"file"`` means the result was saved to a file and will return ``None``.
        vfile
            The virtual file name that stores the result data. Required for ``"pandas"``
            and ``"numpy"`` output type.
        column_names
            The column names for the :class:`pandas.DataFrame` output.

        Returns
        -------
        result
            The result dataset. If ``output_type="file"`` returns ``None``.

        Examples
        --------
        >>> from pathlib import Path
        >>> import numpy as np
        >>> import pandas as pd
        >>>
        >>> from pygmt.helpers import GMTTempFile
        >>> from pygmt.clib import Session
        >>>
        >>> with GMTTempFile(suffix=".txt") as tmpfile:
        ...     # prepare the sample data file
        ...     with open(tmpfile.name, mode="w") as fp:
        ...         print(">", file=fp)
        ...         print("1.0 2.0 3.0 TEXT1 TEXT23", file=fp)
        ...         print("4.0 5.0 6.0 TEXT4 TEXT567", file=fp)
        ...         print(">", file=fp)
        ...         print("7.0 8.0 9.0 TEXT8 TEXT90", file=fp)
        ...         print("10.0 11.0 12.0 TEXT123 TEXT456789", file=fp)
        ...
        ...     # file output
        ...     with Session() as lib:
        ...         with GMTTempFile(suffix=".txt") as outtmp:
        ...             with lib.virtualfile_out(
        ...                 kind="dataset", fname=outtmp.name
        ...             ) as vouttbl:
        ...                 lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...                 result = lib.virtualfile_to_dataset(
        ...                     output_type="file", vfile=vouttbl
        ...                 )
        ...                 assert result is None
        ...                 assert Path(outtmp.name).stat().st_size > 0
        ...
        ...     # numpy output
        ...     with Session() as lib:
        ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
        ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...             outnp = lib.virtualfile_to_dataset(
        ...                 output_type="numpy", vfile=vouttbl
        ...             )
        ...     assert isinstance(outnp, np.ndarray)
        ...
        ...     # pandas output
        ...     with Session() as lib:
        ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
        ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...             outpd = lib.virtualfile_to_dataset(
        ...                 output_type="pandas", vfile=vouttbl
        ...             )
        ...     assert isinstance(outpd, pd.DataFrame)
        ...
        ...     # pandas output with specified column names
        ...     with Session() as lib:
        ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
        ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...             outpd2 = lib.virtualfile_to_dataset(
        ...                 output_type="pandas",
        ...                 vfile=vouttbl,
        ...                 column_names=["col1", "col2", "col3", "coltext"],
        ...             )
        ...     assert isinstance(outpd2, pd.DataFrame)
        >>> outnp
        array([[1.0, 2.0, 3.0, 'TEXT1 TEXT23'],
               [4.0, 5.0, 6.0, 'TEXT4 TEXT567'],
               [7.0, 8.0, 9.0, 'TEXT8 TEXT90'],
               [10.0, 11.0, 12.0, 'TEXT123 TEXT456789']], dtype=object)
        >>> outpd
              0     1     2                   3
        0   1.0   2.0   3.0        TEXT1 TEXT23
        1   4.0   5.0   6.0       TEXT4 TEXT567
        2   7.0   8.0   9.0        TEXT8 TEXT90
        3  10.0  11.0  12.0  TEXT123 TEXT456789
        >>> outpd2
           col1  col2  col3             coltext
        0   1.0   2.0   3.0        TEXT1 TEXT23
        1   4.0   5.0   6.0       TEXT4 TEXT567
        2   7.0   8.0   9.0        TEXT8 TEXT90
        3  10.0  11.0  12.0  TEXT123 TEXT456789
        """
        if output_type == "file":  # Already written to file, so return None
            return None

        # Read the virtual file as a GMT dataset and convert to pandas.DataFrame
        result = self.read_virtualfile(vfile, kind="dataset").contents.to_dataframe()
        if output_type == "numpy":  # numpy.ndarray output
            return result.to_numpy()

        # Assign column names
        if column_names is not None:
            result.columns = column_names
        return result  # pandas.DataFrame output

    def extract_region(self):
        """
        Extract the WESN bounding box of the currently active figure.
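
The branch order of `virtualfile_to_dataset` can be sketched without GMT. In the sketch below, the function `convert_rows` and its `rows` argument are hypothetical: plain lists and a dict of columns stand in for `numpy.ndarray` and `pandas.DataFrame`, so this is not the PyGMT implementation. It illustrates two properties of the method as written in the diff: `"file"` returns before the virtual file is read, and `column_names` only affects the `"pandas"` branch.

```python
def convert_rows(output_type, rows=None, column_names=None):
    """Hypothetical stand-in mirroring the branch order of the method above."""
    if output_type == "file":  # data already written elsewhere; nothing is read
        return None

    table = [list(row) for row in rows]  # stand-in for reading the virtual file
    if output_type == "numpy":  # returned before column names are applied
        return table

    # Stand-in for a DataFrame: map each column name (or position) to its values.
    names = column_names if column_names is not None else list(range(len(table[0])))
    return {name: [row[i] for row in table] for i, name in enumerate(names)}


rows = [(1.0, 2.0, "TEXT1"), (4.0, 5.0, "TEXT4")]
print(convert_rows("file"))  # None
print(convert_rows("numpy", rows, column_names=["x", "y", "t"]))
print(convert_rows("pandas", rows, column_names=["x", "y", "t"]))
```

In the `"numpy"` call the `column_names` argument is accepted but has no effect, which matches the early return in the diff.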