diff --git a/README.md b/README.md
index 68aa5cb..71e2944 100644
--- a/README.md
+++ b/README.md
@@ -1,58 +1,98 @@
-## Dask Deltatable Reader
+## Dask-DeltaTable
 
-Reads a Delta Table from directory using Dask engine.
+Reading and writing to Delta Lake using the Dask engine.
 
-To Try out the package:
+### Installation
+
+To install the package:
 
 ```
 pip install dask-deltatable
 ```
 
 ### Features:
-1. Reads the parquet files based on delta logs parallely using dask engine
-2. Supports all three filesystem like s3, azurefs, gcsfs
-3. Supports some delta features like
+
+1. Read the parquet files from Delta Lake and parallelize with Dask
+2. Write Dask dataframes to Delta Lake (limited support; see the write sketch at the end of this README)
+3. Supports multiple filesystems (s3, azurefs, gcsfs)
+4. Subset of Delta Lake features:
    - Time Travel
    - Schema evolution
-   - parquet filters
+   - Parquet filters (see the filtering sketch below)
      - row filter
     - partition filter
-4. Query Delta commit info - History
-5. vacuum the old/ unused parquet files
-6. load different versions of data using datetime.
+5. Query Delta commit info and history
+6. API to `vacuum` the old / unused parquet files
+7. Load different versions of data by timestamp or version.
 
-### Usage:
+### Not supported
 
-```
+1. Writing to Delta Lake is still in development.
+2. `optimize` API to run a bin-packing operation on a Delta Table.
+
+### Reading from Delta Lake
+
+```python
 import dask_deltatable as ddt
 
 # read delta table
 ddt.read_delta_table("delta_path")
 
-# read delta table for specific version
-ddt.read_delta_table("delta_path",version=3)
+# with specific version
+ddt.read_delta_table("delta_path", version=3)
+
+# with specific datetime
+ddt.read_delta_table("delta_path", datetime="2018-12-19T16:39:57-08:00")
+```
+
+### Accessing remote file systems
+
+To read from S3, Azure, GCS, and other remote filesystems, ensure the
+credentials are properly configured in environment variables or config files.
+For AWS, you may need `~/.aws/credentials`; for gcsfs, set
+`GOOGLE_APPLICATION_CREDENTIALS`. Refer to your cloud provider's documentation
+to configure these.
+
+```python
+ddt.read_delta_table("s3://bucket_name/delta_path", version=3)
+```
 
-# read delta table for specific datetime
-ddt.read_delta_table("delta_path",datetime="2018-12-19T16:39:57-08:00")
+### Accessing the AWS Glue catalog
 
+`dask-deltatable` can connect to the AWS Glue catalog to read the delta table.
+The method will look for the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
+environment variables and, if those are not available, fall back to
+`~/.aws/credentials`.
+
+Example:
+
+```python
+ddt.read_delta_table(catalog="glue", database_name="science", table_name="physics")
+```
+
+### Inspecting Delta Table history
+
+One of the features of Delta Lake is preserving the history of changes, which can be useful
+for auditing and debugging. `dask-deltatable` provides APIs to read the commit info and history.
+
+```python
 # read delta complete history
 ddt.read_delta_history("delta_path")
 
-# read delta history upto given limit
-ddt.read_delta_history("delta_path",limit=5)
-
-# read delta history to delete the files
-ddt.vacuum("delta_path",dry_run=False)
+# read delta history up to a given limit
+ddt.read_delta_history("delta_path", limit=5)
+```
 
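+### Filtering rows and partitions
+
+The feature list above mentions row and partition filters. Below is a minimal
+sketch of what this might look like, assuming keyword arguments such as
+`filters` are forwarded to `dask.dataframe.read_parquet` (the exact keyword
+name is an assumption; check the API reference):
+
+```python
+# hypothetical: prune partitions / rows with pyarrow-style predicates,
+# assuming the filters= keyword is passed through to dd.read_parquet
+ddt.read_delta_table("delta_path", filters=[("year", "==", 2020)])
+```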
-ddt.read_delta_table("s3://bucket_name/delta_path",version=3) -# please ensure the credentials are properly configured as environment variable or -# configured as in ~/.aws/credential +### Managing Delta Tables -# can connect with AWS Glue catalog and read the complete delta table (currently only AWS catalog available) -# will take expilicit AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from environment -# variables if available otherwise fallback to ~/.aws/credential -ddt.read_delta_table(catalog=glue,database_name="science",table_name="physics") +Vacuuming a table will delete any files that have been marked for deletion. This +may make some past versions of a table invalid, so this can break time travel. +However, it will save storage space. Vacuum will retain files in a certain +window, by default one week, so time travel will still work in shorter ranges. +```python +# read delta history to delete the files +ddt.vacuum("delta_path", dry_run=False) ```