docs(python): Improved user guide for cloud functionality (#11646)
Showing 18 changed files with 313 additions and 102 deletions.
This file was deleted.
@@ -0,0 +1,63 @@

```python
"""
# --8<-- [start:read_parquet]
import polars as pl

source = "s3://bucket/*.parquet"

df = pl.read_parquet(source)
# --8<-- [end:read_parquet]

# --8<-- [start:scan_parquet]
import polars as pl

source = "s3://bucket/*.parquet"
storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}

df = pl.scan_parquet(source, storage_options=storage_options)
# --8<-- [end:scan_parquet]

# --8<-- [start:scan_parquet_query]
import polars as pl

source = "s3://bucket/*.parquet"

df = pl.scan_parquet(source).filter(pl.col("id") < 100).select(["id", "value"]).collect()
# --8<-- [end:scan_parquet_query]

# --8<-- [start:scan_pyarrow_dataset]
import polars as pl
import pyarrow.dataset as ds

dset = ds.dataset("s3://my-partitioned-folder/", format="parquet")
(
    pl.scan_pyarrow_dataset(dset)
    .filter(pl.col("foo") == "a")  # an expression, not a bare Python comparison
    .select(["foo", "bar"])
    .collect()
)
# --8<-- [end:scan_pyarrow_dataset]

# --8<-- [start:write_parquet]
import polars as pl
import s3fs

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"

# write parquet
with fs.open(destination, mode="wb") as f:
    df.write_parquet(f)
# --8<-- [end:write_parquet]
"""
```
@@ -1,32 +1,44 @@

```diff
 """
-# --8<-- [start:read]
+# --8<-- [start:read_uri]
 import polars as pl

-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
 query = "SELECT * FROM foo"

-pl.read_database(query=query, connection_uri=connection_uri)
-# --8<-- [end:read]
+pl.read_database_uri(query=query, uri=uri)
+# --8<-- [end:read_uri]
+
+# --8<-- [start:read_cursor]
+import polars as pl
+from sqlalchemy import create_engine
+
+conn = create_engine("sqlite:///test.db")
+query = "SELECT * FROM foo"
+
+pl.read_database(query=query, connection=conn.connect())
+# --8<-- [end:read_cursor]

 # --8<-- [start:adbc]
-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
 query = "SELECT * FROM foo"

-pl.read_database(query=query, connection_uri=connection_uri, engine="adbc")
+pl.read_database_uri(query=query, uri=uri, engine="adbc")
 # --8<-- [end:adbc]

 # --8<-- [start:write]
-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
 df = pl.DataFrame({"foo": [1, 2, 3]})

-df.write_database(table_name="records", connection_uri=connection_uri)
+df.write_database(table_name="records", uri=uri)
 # --8<-- [end:write]

 # --8<-- [start:write_adbc]
-connection_uri = "postgres://username:password@server:port/database"
+uri = "postgres://username:password@server:port/database"
 df = pl.DataFrame({"foo": [1, 2, 3]})

-df.write_database(table_name="records", connection_uri=connection_uri, engine="adbc")
+df.write_database(table_name="records", uri=uri, engine="adbc")
 # --8<-- [end:write_adbc]
 """
```
@@ -0,0 +1,24 @@

```python
# --8<-- [start:setup]
import polars as pl

# --8<-- [end:setup]

"""
# --8<-- [start:read]
df = pl.read_json("docs/data/path.json")
# --8<-- [end:read]

# --8<-- [start:readnd]
df = pl.read_ndjson("docs/data/path.json")
# --8<-- [end:readnd]
"""

# --8<-- [start:write]
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_json("docs/data/path.json")
# --8<-- [end:write]

# --8<-- [start:scan]
df = pl.scan_ndjson("docs/data/path.json")
# --8<-- [end:scan]
```
File renamed without changes.
File renamed without changes.
File renamed without changes.
This file was deleted.
@@ -0,0 +1,51 @@
# Cloud storage

Polars can read and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.

To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:

=== ":fontawesome-brands-python: Python"

    ```shell
    $ pip install fsspec s3fs adlfs gcsfs
    ```

=== ":fontawesome-brands-rust: Rust"

    ```shell
    $ cargo add aws-sdk-s3 aws-config tokio --features tokio/full
    ```

## Reading from cloud storage

Polars can read a CSV, IPC or Parquet file in eager mode from cloud storage.

{{code_block('user-guide/io/cloud-storage','read_parquet',['read_parquet','read_csv','read_ipc'])}}

This eager query downloads the file to a buffer in memory and creates a `DataFrame` from there. Polars uses `fsspec` to manage this download internally for all cloud storage providers.
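
As a concrete sketch of the eager pattern (the bucket and object names below are hypothetical, and credentials are assumed to be available from the environment), the same call shape works for all three formats:

```python
import polars as pl

# Hypothetical objects; each call downloads the file into memory
# and returns a DataFrame.
df_parquet = pl.read_parquet("s3://my-bucket/data.parquet")
df_csv = pl.read_csv("s3://my-bucket/data.csv")
df_ipc = pl.read_ipc("s3://my-bucket/data.arrow")
```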

## Scanning from cloud storage with query optimisation

Polars can scan a Parquet file in lazy mode from cloud storage. We may need to provide further details beyond the source URL, such as authentication details or the storage region. Polars looks for these as environment variables, but we can also pass them manually as a `dict` in the `storage_options` argument.

{{code_block('user-guide/io/cloud-storage','scan_parquet',['scan_parquet'])}}

This query creates a `LazyFrame` without downloading the file. In the `LazyFrame` we have access to file metadata such as the schema. Polars uses the Rust `object_store` library internally to manage the interface with the cloud storage providers, so no extra dependencies are required in Python to scan a cloud Parquet file.
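
For example, a minimal sketch (with a hypothetical bucket) of inspecting the schema without downloading any data:

```python
import polars as pl

# scan_parquet reads only the file metadata, not the data itself
lf = pl.scan_parquet("s3://my-bucket/*.parquet")

# The schema is available from the metadata of the scanned files
print(lf.schema)
```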

If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md), the query optimiser will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.

{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}
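
To confirm that the pushdowns are applied, we can print the optimised plan before collecting; a sketch with a hypothetical bucket:

```python
import polars as pl

lf = (
    pl.scan_parquet("s3://my-bucket/*.parquet")  # hypothetical source
    .filter(pl.col("id") < 100)
    .select(["id", "value"])
)

# The optimised plan should show the filter applied at the scan and the
# projection limited to the two requested columns.
print(lf.explain())
```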

## Scanning with PyArrow

We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets such as Hive partitions.

We first create a PyArrow dataset and then create a `LazyFrame` from the dataset.

{{code_block('user-guide/io/cloud-storage','scan_pyarrow_dataset',['scan_pyarrow_dataset'])}}
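
For a Hive-partitioned layout, PyArrow can also parse the partition keys from the directory names; a sketch assuming a hypothetical bucket laid out as `year=.../month=.../*.parquet`:

```python
import polars as pl
import pyarrow.dataset as ds

# Hypothetical layout: s3://my-bucket/events/year=2023/month=10/*.parquet
dset = ds.dataset(
    "s3://my-bucket/events/",
    format="parquet",
    partitioning="hive",  # parse year/month from the directory names
)

df = (
    pl.scan_pyarrow_dataset(dset)
    .filter(pl.col("year") == 2023)  # only matching partitions are read
    .collect()
)
```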

## Writing to cloud storage

We can write a `DataFrame` to cloud storage in Python using `s3fs` for S3, `adlfs` for Azure Blob Storage and `gcsfs` for Google Cloud Storage. In this example, we write a Parquet file to S3.

{{code_block('user-guide/io/cloud-storage','write_parquet',['write_parquet'])}}
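
The same open-a-file-object pattern carries over to the other providers; a sketch for Google Cloud Storage with a hypothetical bucket (`adlfs` follows the same shape for Azure):

```python
import gcsfs
import polars as pl

df = pl.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3]})

# Credentials are picked up from the environment by default
fs = gcsfs.GCSFileSystem()

# Hypothetical destination bucket and object name
with fs.open("gs://my-bucket/my_file.parquet", mode="wb") as f:
    df.write_parquet(f)
```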