A Python package to enrich Google Cloud Data Catalog Fileset Entries with Data Catalog Tags. The goal of this library is to provide useful statistics regarding the GCS files that match the file pattern on the provided Data Catalog Fileset Entry.
For instructions on how to create Fileset Entries, see the official Google Cloud documentation.
Tags created by the fileset enricher are composed of the following attributes. All stats are a snapshot taken at execution time:
Field | Description | Mandatory |
---|---|---|
execution_time | Execution time when all stats were collected. | Y |
files | Number of files found that match the prefix. | N |
min_file_size | Minimum file size found in bytes. | N |
max_file_size | Maximum file size found in bytes. | N |
avg_file_size | Average file size found in bytes. | N |
total_file_size | Total file size found in bytes. | N |
first_created_date | First time a file was created in the bucket(s). | N |
last_created_date | Last time a file was created in the bucket(s). | N |
last_updated_date | Last time a file was updated in the bucket(s). | N |
created_files_by_day | Number of files created on the same date. | N |
updated_files_by_day | Number of files updated on the same date. | N |
prefix | Prefix used to find the files. | N |
bucket_prefix | When specified at runtime, buckets without this prefix are ignored. | N |
buckets_found | Number of buckets that matched the prefix. | N |
files_by_bucket | Number of files found in each bucket. | N |
files_by_type | Number of files found by file type. | N |
If no fields are specified when running the fileset enricher, all Tag fields will be applied.
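As an illustration of how the fields in the table could be derived, here is a minimal sketch that computes them from a list of file records. The function name `summarize_files`, the record keys, and the use of the file extension as the "file type" are assumptions made for this example, not the package's actual implementation.

```python
from collections import Counter
from datetime import datetime, timezone
import os


def summarize_files(files, execution_time=None):
    """Build a stats dict from records with name, size, created and updated keys.

    Hypothetical helper: the field names follow the table above, the
    logic is only one plausible way to compute them.
    """
    stats = {
        "execution_time": execution_time or datetime.now(timezone.utc),
        "files": len(files),
    }
    if not files:
        # Only the mandatory field and the file count are available.
        return stats

    sizes = [f["size"] for f in files]
    created = [f["created"] for f in files]
    updated = [f["updated"] for f in files]
    stats.update({
        "min_file_size": min(sizes),
        "max_file_size": max(sizes),
        "avg_file_size": sum(sizes) / len(sizes),
        "total_file_size": sum(sizes),
        "first_created_date": min(created),
        "last_created_date": max(created),
        "last_updated_date": max(updated),
        # Group timestamps by calendar day (ISO date string).
        "created_files_by_day": dict(Counter(d.date().isoformat() for d in created)),
        "updated_files_by_day": dict(Counter(d.date().isoformat() for d in updated)),
        # Assumption: "file type" is the file extension.
        "files_by_type": dict(Counter(
            os.path.splitext(f["name"])[1].lstrip(".") or "unknown" for f in files
        )),
    })
    return stats
```

Each record is a plain dict here so the sketch runs without any Google Cloud dependency.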
To generate file statistics and create the Tags, this Python package uses the GCS list_buckets and list_blobs APIs to extract the metadata that matches the file pattern, so their billing policies apply.
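The following sketch shows roughly what walking buckets and blobs with those two calls looks like. It assumes `client` behaves like `google.cloud.storage.Client`, whose `list_buckets` and `list_blobs` methods both accept a `prefix` argument. The function name `collect_blob_metadata` and the shape of the returned records are hypothetical.

```python
def collect_blob_metadata(client, bucket_prefix=None, blob_prefix=None):
    """Return one metadata dict per object found.

    `client` is expected to behave like google.cloud.storage.Client.
    Every bucket and blob listing is a billable GCS operation.
    """
    records = []
    for bucket in client.list_buckets(prefix=bucket_prefix):
        for blob in client.list_blobs(bucket.name, prefix=blob_prefix):
            records.append({
                "bucket": bucket.name,
                "name": blob.name,
                "size": blob.size,
                "created": blob.time_created,
                "updated": blob.updated,
            })
    return records
```

With the real library this would be called as `collect_blob_metadata(storage.Client(), blob_prefix="some/dir/")` after `from google.cloud import storage`.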
git clone https://github.com/mesmacosta/datacatalog-fileset-enricher
cd datacatalog-fileset-enricher
- Data Catalog Tag Editor
- Data Catalog TagTemplate Owner
- Data Catalog Viewer
- Storage Admin, or a custom role with the storage.buckets.list permission
./credentials/datacatalog-fileset-enricher.json
Using virtualenv is optional, but strongly recommended unless you use Docker.
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
pip install --upgrade --editable .
export GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-fileset-enricher.json
Docker may be used as an alternative to run all the scripts. In this case, please disregard the Virtualenv install instructions.
- python
python main.py --project-id my_project \
enrich-gcs-filesets
- docker
docker build --rm --tag datacatalog-fileset-enricher .
docker run --rm --tty -v your_credentials_folder:/data datacatalog-fileset-enricher \
--project-id my_project \
enrich-gcs-filesets
If you are using a different project, make sure the Service Account has the following permissions on that project:
- Data Catalog TagTemplate Creator
- Data Catalog TagTemplate User
python main.py --project-id my_project \
enrich-gcs-filesets \
--tag-template-name projects/my_different_project/locations/us-central1/tagTemplates/fileset_enricher_findings
python main.py --project-id my_project \
enrich-gcs-filesets \
--entry-group-id my_entry_group \
--entry-id my_entry
Users can choose the Tag fields from the list provided in Tags.
python main.py --project-id my_project \
enrich-gcs-filesets \
--entry-group-id my_entry_group \
--entry-id my_entry \
--tag-fields files,prefix
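To make the effect of `--tag-fields` concrete, here is a small sketch of how a comma-separated field list could be applied to the collected stats. The helper `select_tag_fields` is hypothetical. It assumes that the mandatory `execution_time` field is always kept and that an empty selection means all fields, as described above.

```python
# Assumption: execution_time is the only mandatory field, per the Tags table.
MANDATORY_FIELDS = {"execution_time"}


def select_tag_fields(stats, tag_fields=None):
    """Keep only the requested fields, plus the mandatory ones.

    `tag_fields` mirrors the CLI value, e.g. "files,prefix".
    """
    if not tag_fields:
        # No selection: every Tag field is applied.
        return dict(stats)
    wanted = {name.strip() for name in tag_fields.split(",")} | MANDATORY_FIELDS
    return {key: value for key, value in stats.items() if key in wanted}
```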
When bucket_prefix is specified, the list_buckets API calls pass this prefix and avoid scanning buckets that don't match it. This only applies when there is a wildcard in the bucket name; otherwise the get bucket method is called and bucket_prefix is ignored.
python main.py --project-id my_project \
enrich-gcs-filesets \
--bucket-prefix my_bucket
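The branching described for bucket_prefix can be sketched as follows. This is an illustration under assumptions, not the package's code: `resolve_buckets` is a hypothetical name, `client` is expected to behave like `google.cloud.storage.Client` (which provides `get_bucket` and `list_buckets(prefix=...)`), and the wildcard match uses `fnmatch` from the standard library.

```python
import fnmatch


def resolve_buckets(client, bucket_name, bucket_prefix=None):
    """Return the buckets to scan for the bucket part of a file pattern."""
    if "*" not in bucket_name:
        # Exact bucket name: fetch it directly, bucket_prefix is ignored.
        return [client.get_bucket(bucket_name)]
    # Wildcard: let the server filter by prefix first, then match locally.
    candidates = client.list_buckets(prefix=bucket_prefix)
    return [b for b in candidates if fnmatch.fnmatchcase(b.name, bucket_name)]
```

Passing the prefix to the listing call means non-matching buckets are never returned, which is what keeps the scan small.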
Cleans up the Template and Tags from the Fileset Entries. Running the main command recreates them.
python main.py --project-id my_project \
clean-up-templates-and-tags
This is not an officially supported Google product.