docs(python): Improved user guide for cloud functionality #11646

Merged · 1 commit · Oct 10, 2023
61 changes: 59 additions & 2 deletions CONTRIBUTING.md
@@ -151,8 +151,65 @@ The most important components of Polars documentation are the [user guide](https

### User guide

The user guide is maintained in the `docs` folder.
Further contributing information will be added shortly.
The user guide is maintained in the `docs/user-guide` folder. Before creating a PR, first raise an issue to discuss what you feel is missing or could be improved.

#### Building and serving the user guide

The user guide is built using [MkDocs](https://www.mkdocs.org/). Install the dependencies for building the user guide by running `make requirements` in the root of the repo.

Run `mkdocs serve` to build and serve the user guide so you can view it locally and see updates as you make changes.
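
In full, the local workflow from the root of the repo is:

```shell
$ make requirements   # install the dependencies for building the user guide
$ mkdocs serve        # build and serve the user guide locally with live reload
```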

#### Creating a new user guide page

Each user guide page is based on a Markdown (`.md`) file. This file must be listed in `mkdocs.yml` for the page to appear in the rendered guide.
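
As a sketch (the path and nesting here are illustrative; follow the existing entries), pages are registered in the `nav` section of `mkdocs.yml` like this:

```yaml
nav:
  - User guide:
      - IO:
          - Cloud storage: user-guide/io/cloud-storage.md
```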

#### Adding a shell code block

To add a code block with code to be run in a shell with tabs for Python and Rust, use the following format:

````
=== ":fontawesome-brands-python: Python"

```shell
$ pip install fsspec
```

=== ":fontawesome-brands-rust: Rust"

```shell
$ cargo add aws_sdk_s3
```
````

#### Adding a code block

The snippets for Python and Rust code blocks are in the `docs/src/python/` and `docs/src/rust/` directories, respectively. To add a code snippet with Python or Rust code to a `.md` page, use the following format:

```
{{code_block('user-guide/io/cloud-storage','read_parquet',[read_parquet,read_csv])}}
```

- The first argument is the path shared by the snippet files, i.e. `docs/src/python/user-guide/io/cloud-storage.py` and/or `docs/src/rust/user-guide/io/cloud-storage.rs` (either or both may exist).
- The second argument is the name given at the start and end of the snippet in the `.py` or `.rs` file.
- The third argument is a list of links to functions in the API docs. Each element of the list must have a corresponding entry in `docs/_build/API_REFERENCE_LINKS.yml`.

If the corresponding `.py` and `.rs` snippet files both exist, then each snippet named in the second argument to `code_block` must exist in both files or the build will fail. Where a snippet is not needed in one language, add an empty snippet to that `.py` or `.rs` file, as in the example below.
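
An empty snippet consists of just the start and end markers, as in the Rust snippet file in this PR:

```
# --8<-- [start:scan_parquet]
# --8<-- [end:scan_parquet]
```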

Each snippet is formatted as follows:

```python
# --8<-- [start:read_parquet]
import polars as pl

df = pl.read_parquet("file.parquet")
# --8<-- [end:read_parquet]
```

The snippet is delimited by `--8<-- [start:<snippet_name>]` and `--8<-- [end:<snippet_name>]`. The snippet name must match the name given in the second argument to `code_block` above.

#### Linting

Before committing, install `dprint` (see above) and run `dprint fmt` from the `docs` directory to format the markdown files.

### API reference

13 changes: 10 additions & 3 deletions docs/_build/API_REFERENCE_LINKS.yml
@@ -12,8 +12,7 @@ python:
write_csv: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_csv.html
read_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_json.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
-  read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
-  write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
+  read_ipc: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_ipc.html
min: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.min.html
max: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.max.html
value_counts: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.value_counts.html
@@ -65,6 +64,7 @@ python:
write_database:
name: write_database
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_database.html
+  read_database_uri: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_database_uri.html
read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
scan_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html
@@ -73,6 +73,7 @@ python:
write_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_ndjson.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
scan_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_ndjson.html
+  scan_pyarrow_dataset: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html
from_arrow:
name: from_arrow
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_arrow.html
@@ -197,7 +198,7 @@ rust:
feature_flags: ['json']
read_ndjson:
name: JsonLineReader
-    link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson_core/ndjson/struct.JsonLineReader.html
+    link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson/core/struct.JsonLineReader.html
feature_flags: ['json']
write_json:
name: JsonWriter
@@ -223,6 +224,12 @@
name: scan_parquet
link: https://pola-rs.github.io/polars/docs/rust/dev/polars/prelude/struct.LazyFrame.html#method.scan_parquet
feature_flags: ['parquet']
+  read_ipc:
+    name: IpcReader
+    link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/prelude/struct.IpcReader.html
+    feature_flags: ['ipc']
+  scan_pyarrow_dataset: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html
+
min: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.min
max: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.max
struct:
14 changes: 0 additions & 14 deletions docs/src/python/user-guide/io/aws.py

This file was deleted.

63 changes: 63 additions & 0 deletions docs/src/python/user-guide/io/cloud-storage.py
@@ -0,0 +1,63 @@
"""
# --8<-- [start:read_parquet]
import polars as pl

source = "s3://bucket/*.parquet"

df = pl.read_parquet(source)
# --8<-- [end:read_parquet]

# --8<-- [start:scan_parquet]
import polars as pl

source = "s3://bucket/*.parquet"

storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}
df = pl.scan_parquet(source, storage_options=storage_options)
# --8<-- [end:scan_parquet]

# --8<-- [start:scan_parquet_query]
import polars as pl

source = "s3://bucket/*.parquet"


df = (
    pl.scan_parquet(source)
    .filter(pl.col("id") < 100)
    .select("id", "value")
    .collect()
)
# --8<-- [end:scan_parquet_query]

# --8<-- [start:scan_pyarrow_dataset]
import polars as pl
import pyarrow.dataset as ds

dset = ds.dataset("s3://my-partitioned-folder/", format="parquet")
(
    pl.scan_pyarrow_dataset(dset)
    .filter(pl.col("foo") == "a")
    .select(["foo", "bar"])
    .collect()
)
# --8<-- [end:scan_pyarrow_dataset]

# --8<-- [start:write_parquet]

import polars as pl
import s3fs

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"

# write parquet
with fs.open(destination, mode="wb") as f:
    df.write_parquet(f)
# --8<-- [end:write_parquet]

"""
32 changes: 22 additions & 10 deletions docs/src/python/user-guide/io/database.py
@@ -1,32 +1,44 @@
"""
-# --8<-- [start:read]
+# --8<-- [start:read_uri]
import polars as pl

-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"

-pl.read_database(query=query, connection_uri=connection_uri)
-# --8<-- [end:read]
+pl.read_database_uri(query=query, uri=uri)
+# --8<-- [end:read_uri]

+# --8<-- [start:read_cursor]
+import polars as pl
+from sqlalchemy import create_engine
+
+conn = create_engine("sqlite:///test.db")
+
+query = "SELECT * FROM foo"
+
+pl.read_database(query=query, connection=conn.connect())
+# --8<-- [end:read_cursor]


# --8<-- [start:adbc]
-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"

-pl.read_database(query=query, connection_uri=connection_uri, engine="adbc")
+pl.read_database_uri(query=query, uri=uri, engine="adbc")
# --8<-- [end:adbc]

# --8<-- [start:write]
-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})

-df.write_database(table_name="records", connection_uri=connection_uri)
+df.write_database(table_name="records", uri=uri)
# --8<-- [end:write]

# --8<-- [start:write_adbc]
-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})

-df.write_database(table_name="records", connection_uri=connection_uri, engine="adbc")
+df.write_database(table_name="records", uri=uri, engine="adbc")
# --8<-- [end:write_adbc]

"""
24 changes: 24 additions & 0 deletions docs/src/python/user-guide/io/json.py
@@ -0,0 +1,24 @@
# --8<-- [start:setup]
import polars as pl

# --8<-- [end:setup]

"""
# --8<-- [start:read]
df = pl.read_json("docs/data/path.json")
# --8<-- [end:read]

# --8<-- [start:readnd]
df = pl.read_ndjson("docs/data/path.json")
# --8<-- [end:readnd]

"""

# --8<-- [start:write]
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_json("docs/data/path.json")
# --8<-- [end:write]

# --8<-- [start:scan]
df = pl.scan_ndjson("docs/data/path.json")
# --8<-- [end:scan]
docs/src/rust/user-guide/io/cloud-storage.rs
@@ -1,5 +1,5 @@
"""
-# --8<-- [start:bucket]
+# --8<-- [start:read_parquet]
use aws_sdk_s3::Region;

use aws_config::meta::region::RegionProviderChain;
@@ -28,5 +28,18 @@ async fn main() {

println!("{:?}", df);
}
-# --8<-- [end:bucket]
+# --8<-- [end:read_parquet]

+# --8<-- [start:scan_parquet]
+# --8<-- [end:scan_parquet]
+
+# --8<-- [start:scan_parquet_query]
+# --8<-- [end:scan_parquet_query]
+
+# --8<-- [start:scan_pyarrow_dataset]
+# --8<-- [end:scan_pyarrow_dataset]
+
+# --8<-- [start:write_parquet]
+# --8<-- [end:write_parquet]

"""
20 changes: 0 additions & 20 deletions docs/user-guide/io/aws.md

This file was deleted.

51 changes: 51 additions & 0 deletions docs/user-guide/io/cloud-storage.md
@@ -0,0 +1,51 @@
# Cloud storage

Polars can read from and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.

To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:

=== ":fontawesome-brands-python: Python"

```shell
$ pip install fsspec s3fs adlfs gcsfs
```

=== ":fontawesome-brands-rust: Rust"

```shell
$ cargo add aws_sdk_s3 aws_config tokio --features tokio/full
```

## Reading from cloud storage

Polars can read a CSV, IPC or Parquet file in eager mode from cloud storage.

{{code_block('user-guide/io/cloud-storage','read_parquet',['read_parquet','read_csv','read_ipc'])}}

This eager query downloads the file to a buffer in memory and creates a `DataFrame` from there. Polars uses `fsspec` to manage this download internally for all cloud storage providers.
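
As a minimal sketch (hypothetical bucket and file; `fsspec` and `s3fs` installed and credentials configured), an eager read of a CSV file looks like:

```python
import polars as pl

# hypothetical path; the file is downloaded into an in-memory buffer via fsspec
source = "s3://bucket/data.csv"

df = pl.read_csv(source)
```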

## Scanning from cloud storage with query optimisation

Polars can scan a Parquet file in lazy mode from cloud storage. We may need to provide further details beyond the source URL, such as authentication details or the storage region. Polars looks for these in environment variables, but we can also pass them manually as a `dict` via the `storage_options` argument.

{{code_block('user-guide/io/cloud-storage','scan_parquet',['scan_parquet'])}}

This query creates a `LazyFrame` without downloading the file. With the `LazyFrame` we have access to file metadata such as the schema. Polars uses the Rust `object_store` crate internally to manage the interface with the cloud storage providers, so no extra dependencies are required in Python to scan a cloud Parquet file.
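
For AWS, for instance, the standard credential environment variables can be set instead of passing `storage_options` (shown here as an assumed alternative; the variable names are the conventional AWS ones):

```shell
$ export AWS_ACCESS_KEY_ID="<secret>"
$ export AWS_SECRET_ACCESS_KEY="<secret>"
$ export AWS_REGION="us-east-1"
```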

If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md), the query optimiser will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.

{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}

## Scanning with PyArrow

We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets such as those using Hive partitioning.

We first create a PyArrow dataset and then create a `LazyFrame` from the dataset.

{{code_block('user-guide/io/cloud-storage','scan_pyarrow_dataset',['scan_pyarrow_dataset'])}}

## Writing to cloud storage

We can write a `DataFrame` to cloud storage in Python using `s3fs` for S3, `adlfs` for Azure Blob Storage and `gcsfs` for Google Cloud Storage. In this example, we write a Parquet file to S3.

{{code_block('user-guide/io/cloud-storage','write_parquet',['write_parquet'])}}