docs(python): Improved user guide for cloud functionality (#11646)
stinodego authored Oct 10, 2023
1 parent 35f3ff9 commit 8ece165
Showing 18 changed files with 313 additions and 102 deletions.
61 changes: 59 additions & 2 deletions CONTRIBUTING.md
@@ -151,8 +151,65 @@ The most important components of Polars documentation are the [user guide](https

### User guide

The user guide is maintained in the `docs` folder.
Further contributing information will be added shortly.
The user guide is maintained in the `docs/user-guide` folder. Before creating a PR, first raise an issue to discuss what you feel is missing or could be improved.

#### Building and serving the user guide

The user guide is built using [MkDocs](https://www.mkdocs.org/). Install the dependencies for building the user guide by running `make requirements` in the root of the repo.

Run `mkdocs serve` to build and serve the user guide so you can view it locally and see updates as you make changes.
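
For example, from the root of the repo:

```shell
$ make requirements
$ mkdocs serve
```

By default, MkDocs serves the site at `http://127.0.0.1:8000`.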

#### Creating a new user guide page

Each user guide page is based on a `.md` markdown file. This file must be listed in `mkdocs.yml`, as sketched below.
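
A hypothetical sketch of such a listing, assuming the page lives at `docs/user-guide/io/cloud-storage.md` (the exact nesting must match the existing `nav` tree in `mkdocs.yml`):

```yaml
nav:
  - User guide:
      - IO:
          - user-guide/io/cloud-storage.md
```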

#### Adding a shell code block

To add a code block for commands that are run in a shell, with tabs for Python and Rust, use the following format:

````
=== ":fontawesome-brands-python: Python"
```shell
$ pip install fsspec
```
=== ":fontawesome-brands-rust: Rust"
```shell
$ cargo add aws_sdk_s3
```
````

#### Adding a code block

The snippets for Python and Rust code blocks are in the `docs/src/python/` and `docs/src/rust/` directories, respectively. To add a code snippet with Python or Rust code to a `.md` page, use the following format:

```
{{code_block('user-guide/io/cloud-storage','read_parquet',[read_parquet,read_csv])}}
```

- The first argument is the shared path of the snippet files, referring to either or both of `docs/src/python/user-guide/io/cloud-storage.py` and `docs/src/rust/user-guide/io/cloud-storage.rs`.
- The second argument is the name given at the start and end of each snippet in the `.py` or `.rs` file.
- The third argument is a list of links to functions in the API docs. Each element of the list must have a corresponding entry in `docs/_build/API_REFERENCE_LINKS.yml` (see the example entry below).
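
For example, the simplest form of an entry in `API_REFERENCE_LINKS.yml` maps a function name to its API documentation URL:

```yaml
python:
  read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
```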

If both the `.py` and `.rs` snippet files exist, then every snippet named in the second argument to `code_block` above must be present in both files, or the build will fail. If a snippet is not needed in one language, add an empty snippet to the corresponding `.py` or `.rs` file (see the example below).
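
For example, the Rust snippet file for cloud storage (shown further down in this commit) fills in Python-only snippets with empty placeholders:

```
# --8<-- [start:scan_pyarrow_dataset]
# --8<-- [end:scan_pyarrow_dataset]
```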

Each snippet is formatted as follows:

```python
# --8<-- [start:read_parquet]
import polars as pl

df = pl.read_parquet("file.parquet")
# --8<-- [end:read_parquet]
```

The snippet is delimited by `--8<-- [start:<snippet_name>]` and `--8<-- [end:<snippet_name>]`. The snippet name must match the name given in the second argument to `code_block` above.

#### Linting

Before committing, install `dprint` (see above) and run `dprint fmt` from the `docs` directory to lint the markdown files.

### API reference

13 changes: 10 additions & 3 deletions docs/_build/API_REFERENCE_LINKS.yml
@@ -12,8 +12,7 @@ python:
write_csv: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_csv.html
read_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_json.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
read_ipc: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_ipc.html
min: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.min.html
max: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.max.html
value_counts: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.value_counts.html
@@ -65,6 +64,7 @@ python:
write_database:
name: write_database
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_database.html
read_database_uri: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_database_uri.html
read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
scan_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html
@@ -73,6 +73,7 @@ python:
write_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_ndjson.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
scan_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_ndjson.html
scan_pyarrow_dataset: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html
from_arrow:
name: from_arrow
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_arrow.html
@@ -197,7 +198,7 @@ rust:
feature_flags: ['json']
read_ndjson:
name: JsonLineReader
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson_core/ndjson/struct.JsonLineReader.html
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson/core/struct.JsonLineReader.html
feature_flags: ['json']
write_json:
name: JsonWriter
@@ -223,6 +224,12 @@ rust:
name: scan_parquet
link: https://pola-rs.github.io/polars/docs/rust/dev/polars/prelude/struct.LazyFrame.html#method.scan_parquet
feature_flags: ['parquet']
read_ipc:
name: IpcReader
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/prelude/struct.IpcReader.html
feature_flags: ['ipc']
scan_pyarrow_dataset: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html

min: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.min
max: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.max
struct:
14 changes: 0 additions & 14 deletions docs/src/python/user-guide/io/aws.py

This file was deleted.

63 changes: 63 additions & 0 deletions docs/src/python/user-guide/io/cloud-storage.py
@@ -0,0 +1,63 @@
"""
# --8<-- [start:read_parquet]
import polars as pl
source = "s3://bucket/*.parquet"
df = pl.read_parquet(source)
# --8<-- [end:read_parquet]
# --8<-- [start:scan_parquet]
import polars as pl
source = "s3://bucket/*.parquet"
storage_options = {
"aws_access_key_id": "<secret>",
"aws_secret_access_key": "<secret>",
"aws_region": "us-east-1",
}
df = pl.scan_parquet(source, storage_options=storage_options)
# --8<-- [end:scan_parquet]
# --8<-- [start:scan_parquet_query]
import polars as pl
source = "s3://bucket/*.parquet"
df = pl.scan_parquet(source).filter(pl.col("id") < 100).select("id", "value").collect()
# --8<-- [end:scan_parquet_query]
# --8<-- [start:scan_pyarrow_dataset]
import polars as pl
import pyarrow.dataset as ds
dset = ds.dataset("s3://my-partitioned-folder/", format="parquet")
(
pl.scan_pyarrow_dataset(dset)
    .filter(pl.col("foo") == "a")
.select(["foo", "bar"])
.collect()
)
# --8<-- [end:scan_pyarrow_dataset]
# --8<-- [start:write_parquet]
import polars as pl
import s3fs
df = pl.DataFrame({
"foo": ["a", "b", "c", "d", "d"],
"bar": [1, 2, 3, 4, 5],
})
fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"
# write parquet
with fs.open(destination, mode='wb') as f:
df.write_parquet(f)
# --8<-- [end:write_parquet]
"""
32 changes: 22 additions & 10 deletions docs/src/python/user-guide/io/database.py
@@ -1,32 +1,44 @@
"""
# --8<-- [start:read]
# --8<-- [start:read_uri]
import polars as pl
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"
pl.read_database(query=query, connection_uri=connection_uri)
# --8<-- [end:read]
pl.read_database_uri(query=query, uri=uri)
# --8<-- [end:read_uri]
# --8<-- [start:read_cursor]
import polars as pl
from sqlalchemy import create_engine
conn = create_engine("sqlite:///test.db")
query = "SELECT * FROM foo"
pl.read_database(query=query, connection=conn.connect())
# --8<-- [end:read_cursor]
# --8<-- [start:adbc]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"
pl.read_database(query=query, connection_uri=connection_uri, engine="adbc")
pl.read_database_uri(query=query, uri=uri, engine="adbc")
# --8<-- [end:adbc]
# --8<-- [start:write]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})
df.write_database(table_name="records", connection_uri=connection_uri)
df.write_database(table_name="records", uri=uri)
# --8<-- [end:write]
# --8<-- [start:write_adbc]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})
df.write_database(table_name="records", connection_uri=connection_uri, engine="adbc")
df.write_database(table_name="records", uri=uri, engine="adbc")
# --8<-- [end:write_adbc]
"""
24 changes: 24 additions & 0 deletions docs/src/python/user-guide/io/json.py
@@ -0,0 +1,24 @@
# --8<-- [start:setup]
import polars as pl

# --8<-- [end:setup]

"""
# --8<-- [start:read]
df = pl.read_json("docs/data/path.json")
# --8<-- [end:read]
# --8<-- [start:readnd]
df = pl.read_ndjson("docs/data/path.json")
# --8<-- [end:readnd]
"""

# --8<-- [start:write]
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_json("docs/data/path.json")
# --8<-- [end:write]

# --8<-- [start:scan]
df = pl.scan_ndjson("docs/data/path.json")
# --8<-- [end:scan]
File renamed without changes.
@@ -1,5 +1,5 @@
"""
# --8<-- [start:bucket]
# --8<-- [start:read_parquet]
use aws_sdk_s3::Region;

use aws_config::meta::region::RegionProviderChain;
@@ -28,5 +28,18 @@ async fn main() {

println!("{:?}", df);
}
# --8<-- [end:bucket]
# --8<-- [end:read_parquet]

# --8<-- [start:scan_parquet]
# --8<-- [end:scan_parquet]

# --8<-- [start:scan_parquet_query]
# --8<-- [end:scan_parquet_query]

# --8<-- [start:scan_pyarrow_dataset]
# --8<-- [end:scan_pyarrow_dataset]

# --8<-- [start:write_parquet]
# --8<-- [end:write_parquet]

"""
File renamed without changes.
20 changes: 0 additions & 20 deletions docs/user-guide/io/aws.md

This file was deleted.

51 changes: 51 additions & 0 deletions docs/user-guide/io/cloud-storage.md
@@ -0,0 +1,51 @@
# Cloud storage

Polars can read from and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.

To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:

=== ":fontawesome-brands-python: Python"

```shell
$ pip install fsspec s3fs adlfs gcsfs
```

=== ":fontawesome-brands-rust: Rust"

```shell
$ cargo add aws_sdk_s3 aws_config tokio --features tokio/full
```

## Reading from cloud storage

Polars can read a CSV, IPC or Parquet file in eager mode from cloud storage.

{{code_block('user-guide/io/cloud-storage','read_parquet',['read_parquet','read_csv','read_ipc'])}}

This eager query downloads the file to a buffer in memory and creates a `DataFrame` from there. Polars uses `fsspec` to manage this download internally for all cloud storage providers.
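
As a minimal sketch (assuming a hypothetical bucket and credentials available in the environment), the same eager pattern works for a CSV file:

```python
import polars as pl

# hypothetical bucket path; the eager read downloads the whole file
# into an in-memory buffer via fsspec before parsing it
source = "s3://bucket/file.csv"

df = pl.read_csv(source)
```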

## Scanning from cloud storage with query optimisation

Polars can scan a Parquet file in lazy mode from cloud storage. We may need to provide further details beyond the source URL, such as authentication credentials or the storage region. Polars looks for these as environment variables, but we can also set them manually by passing a `dict` as the `storage_options` argument.

{{code_block('user-guide/io/cloud-storage','scan_parquet',['scan_parquet'])}}

This query creates a `LazyFrame` without downloading the file. In the `LazyFrame` we have access to file metadata such as the schema. Polars uses the `object_store.rs` library internally to manage the interface with the cloud storage providers, so no extra dependencies are required in Python to scan a cloud Parquet file.
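
A minimal sketch (again with a hypothetical bucket) showing that metadata such as the schema is available before any data is collected:

```python
import polars as pl

# scanning creates a LazyFrame without downloading the data
lf = pl.scan_parquet("s3://bucket/*.parquet")

# file metadata such as the schema is already accessible
print(lf.schema)
```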

If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md), the query optimiser will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.

{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}

## Scanning with PyArrow

We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets, such as those using Hive partitioning.

We first create a PyArrow dataset and then create a `LazyFrame` from the dataset.

{{code_block('user-guide/io/cloud-storage','scan_pyarrow_dataset',['scan_pyarrow_dataset'])}}

## Writing to cloud storage

We can write a `DataFrame` to cloud storage in Python using `s3fs` for S3, `adlfs` for Azure Blob Storage and `gcsfs` for Google Cloud Storage. In this example, we write a Parquet file to S3.

{{code_block('user-guide/io/cloud-storage','write_parquet',['write_parquet'])}}
