Readme improvements #40

Merged 5 commits on Jul 12, 2023
README.md: 96 changes (68 additions & 28 deletions)

## Dask-DeltaTable

Reading and writing to Delta Lake using the Dask engine.

### Installation

To install the package:

```
pip install dask-deltatable
```

### Features:

1. Read the parquet files from Delta Lake and parallelize with Dask
2. Write Dask dataframes to Delta Lake (limited support)
3. Supports multiple filesystems (s3, azurefs, gcsfs)
4. Subset of Delta Lake features:
   - Time Travel
   - Schema evolution
   - Parquet filters
   - Row filters
   - Partition filters
5. Query Delta commit info and history
6. API to ``vacuum`` the old / unused parquet files
7. Load different versions of data by timestamp or version.

### Not supported

1. Writing to Delta Lake is still in development.
2. `optimize` API to run a bin-packing operation on a Delta Table.

### Reading from Delta Lake

```python
import dask_deltatable as ddt

# read delta table
ddt.read_delta_table("delta_path")

# with specific version
ddt.read_delta_table("delta_path", version=3)

# with specific datetime
ddt.read_delta_table("delta_path", datetime="2018-12-19T16:39:57-08:00")
```
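
The feature list above also mentions row and partition filters. The sketch below assumes
`read_delta_table` accepts a PyArrow-style DNF `filter` argument and returns a lazy Dask
DataFrame; the parameter name and the `year` column are assumptions, so check the
function's docstring before relying on them:

```python
# filtered read: only load data matching the predicate, then materialize it
df = ddt.read_delta_table(
    "delta_path",
    filter=[("year", "==", 2021)],  # hypothetical column name
)
result = df.compute()  # the reader returns a lazy Dask DataFrame
```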

### Accessing remote file systems

To be able to read from S3, Azure, GCS, and other remote filesystems, ensure that the
credentials are properly configured as environment variables or config files.
For AWS, you may need `~/.aws/credentials`; for gcsfs,
`GOOGLE_APPLICATION_CREDENTIALS`. Refer to your cloud provider documentation
to configure these.

```python
ddt.read_delta_table("s3://bucket_name/delta_path", version=3)
```
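
Credentials can also be passed explicitly. The sketch below assumes `read_delta_table`
forwards an fsspec-style `storage_options` mapping to the underlying filesystem, as many
Dask readers do; the parameter name and keys are assumptions, so verify them against the
function signature:

```python
# passing S3 credentials explicitly instead of relying on environment configuration
ddt.read_delta_table(
    "s3://bucket_name/delta_path",
    version=3,
    storage_options={"key": "<access-key-id>", "secret": "<secret-access-key>"},
)
```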

### Accessing AWS Glue catalog

`dask-deltatable` can connect to the AWS Glue catalog to read the delta table.
The method will look for `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
environment variables, and if those are not available, fall back to
`~/.aws/credentials`.

Example:

```python
ddt.read_delta_table(catalog="glue", database_name="science", table_name="physics")
```
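
As a quick illustration of the environment-variable lookup described above, the
credentials can be set in-process before calling the reader (the values below are
placeholders only):

```python
import os

# picked up by the Glue catalog connection; ~/.aws/credentials is the fallback
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"

ddt.read_delta_table(catalog="glue", database_name="science", table_name="physics")
```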

### Inspecting Delta Table history

One of the features of Delta Lake is preserving the history of changes, which can be useful
for auditing and debugging. `dask-deltatable` provides APIs to read the commit info and history.

```python
# read delta complete history
ddt.read_delta_history("delta_path")

# read delta history up to a given limit
ddt.read_delta_history("delta_path", limit=5)
```
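
To look at only the most recent commit, the same API can be limited to a single entry
(the exact structure of the returned history object is described in the API reference,
so this sketch only prints it):

```python
# most recent commit only
latest = ddt.read_delta_history("delta_path", limit=1)
print(latest)
```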

### Managing Delta Tables

Vacuuming a table will delete any files that have been marked for deletion. This
may invalidate some past versions of the table, so it can break time travel.
However, it will save storage space. Vacuum retains files within a retention
window, by default one week, so time travel still works over shorter ranges.

```python
# delete the old / unused parquet files
ddt.vacuum("delta_path", dry_run=False)
```
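
Before deleting anything, a dry run can be used to see which files would be removed,
assuming the conventional behaviour of a `dry_run` flag (report only, no deletion):

```python
# only report which files would be deleted; nothing is removed
ddt.vacuum("delta_path", dry_run=True)
```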