Readme improvements #40

Merged 5 commits on Jul 12, 2023
README.md: 96 changes (68 additions & 28 deletions)

## Dask-DeltaTable

Reading and writing to Delta Lake using the Dask engine.

### Installation

To install the package:

```
pip install dask-deltatable
```

### Features:

1. Read the parquet files from Delta Lake and parallelize with Dask
2. Write Dask dataframes to Delta Lake (limited support)
3. Supports multiple filesystems (s3, azurefs, gcsfs)
4. Subset of Delta Lake features:
   - Time Travel
   - Schema evolution
   - Parquet filters
   - Row filters
   - Partition filters
5. Query Delta commit info and history
6. API to ``vacuum`` the old / unused parquet files
7. Load different versions of data by timestamp or version.

### Not supported

1. Writing to Delta Lake is still in development.
2. `optimize` API to run a bin-packing operation on a Delta Table.

### Reading from Delta Lake

```python
import dask_deltatable as ddt

# read delta table
ddt.read_delta_table("delta_path")

# with specific version
ddt.read_delta_table("delta_path", version=3)

# with specific datetime
ddt.read_delta_table("delta_path", datetime="2018-12-19T16:39:57-08:00")
```
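
The feature list above also mentions row and partition filters. The sketch below assumes
`read_delta_table` accepts a PyArrow-style DNF `filter` argument and returns a lazy Dask
DataFrame; the parameter name and the `year` column are assumptions, so check the
function's docstring before relying on them:

```python
# filtered read: only load data matching the predicate, then materialize it
df = ddt.read_delta_table(
    "delta_path",
    filter=[("year", "==", 2021)],  # hypothetical column name
)
result = df.compute()  # the reader returns a lazy Dask DataFrame
```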

### Accessing remote file systems

To be able to read from S3, Azure, GCS, and other remote filesystems, ensure that the
credentials are properly configured as environment variables or config files.
For AWS, you may need `~/.aws/credentials`; for gcsfs,
`GOOGLE_APPLICATION_CREDENTIALS`. Refer to your cloud provider documentation
to configure these.

```python
ddt.read_delta_table("s3://bucket_name/delta_path", version=3)
```
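
Credentials can also be passed explicitly. The sketch below assumes `read_delta_table`
forwards an fsspec-style `storage_options` mapping to the underlying filesystem, as many
Dask readers do; the parameter name and keys are assumptions, so verify them against the
function signature:

```python
# passing S3 credentials explicitly instead of relying on environment configuration
ddt.read_delta_table(
    "s3://bucket_name/delta_path",
    version=3,
    storage_options={"key": "<access-key-id>", "secret": "<secret-access-key>"},
)
```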

### Accessing AWS Glue catalog

`dask-deltatable` can connect to the AWS Glue catalog to read the delta table.
The method will look for `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
environment variables, and if those are not available, fall back to
`~/.aws/credentials`.

Example:

```python
ddt.read_delta_table(catalog="glue", database_name="science", table_name="physics")
```
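
As a quick illustration of the environment-variable lookup described above, the
credentials can be set in-process before calling the reader (the values below are
placeholders only):

```python
import os

# picked up by the Glue catalog connection; ~/.aws/credentials is the fallback
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"

ddt.read_delta_table(catalog="glue", database_name="science", table_name="physics")
```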

### Inspecting Delta Table history

One of the features of Delta Lake is preserving the history of changes, which can be useful
for auditing and debugging. `dask-deltatable` provides APIs to read the commit info and history.

```python
# read delta complete history
ddt.read_delta_history("delta_path")

# read delta history up to a given limit
ddt.read_delta_history("delta_path", limit=5)
```
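
To look at only the most recent commit, the same API can be limited to a single entry
(the exact structure of the returned history object is described in the API reference,
so this sketch only prints it):

```python
# most recent commit only
latest = ddt.read_delta_history("delta_path", limit=1)
print(latest)
```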

### Managing Delta Tables

Vacuuming a table will delete any files that have been marked for deletion. This
may invalidate some past versions of the table, so it can break time travel.
However, it will save storage space. Vacuum retains files within a retention
window, by default one week, so time travel still works over shorter ranges.

```python
# delete the old / unused parquet files
ddt.vacuum("delta_path", dry_run=False)
```
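
Before deleting anything, a dry run can be used to see which files would be removed,
assuming the conventional behaviour of a `dry_run` flag (report only, no deletion):

```python
# only report which files would be deleted; nothing is removed
ddt.vacuum("delta_path", dry_run=True)
```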