
🗃 Fixity metadata tracker for Google Cloud Storage file archives


zefdelgadillo/gcs-fixity-function

Fixity Metadata for GCS 🗃

This script pulls metadata and checksums for file archives in Google Cloud Storage (GCS) and stores them in a manifest file and in BigQuery to track changes over time. Archives are expected to be organized as bags that follow the BagIt specification.

Overview

Each time this Fixity function runs against a bag (a file archive laid out according to the BagIt specification), it creates the following:

  • An MD5 checksum manifest file
  • Records in BigQuery containing the following metadata: bucket, bag, file name, file size, checksum, file modified date, fixity run date.
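The two outputs above can be sketched as follows. This is a minimal illustration, not the function's actual code: the helper names, the record keys, and the `rare-books` bucket name are assumptions made for the example. It relies on two known facts: GCS object metadata exposes the MD5 digest base64-encoded, while a BagIt manifest line pairs a hex checksum with a file path relative to the bag.

```python
import base64
from datetime import datetime, timezone


def md5_base64_to_hex(md5_b64: str) -> str:
    """Convert a base64-encoded MD5 digest (as reported by GCS) to hex."""
    return base64.b64decode(md5_b64).hex()


def manifest_line(md5_b64: str, path_in_bag: str) -> str:
    """Format one manifest entry: '<hex checksum>  <path relative to the bag>'."""
    return f"{md5_base64_to_hex(md5_b64)}  {path_in_bag}"


def fixity_record(bucket, bag, file_name, file_size, md5_b64, modified, run_date):
    """Build one row holding the metadata fields listed above.

    The key names are hypothetical; the real table schema may differ.
    """
    return {
        "bucket": bucket,
        "bag": bag,
        "file_name": file_name,
        "file_size": file_size,
        "checksum": md5_base64_to_hex(md5_b64),
        "file_modified_date": modified.isoformat(),
        "fixity_date": run_date.isoformat(),
    }


# MD5 of an empty file, base64-encoded the way GCS reports it.
EMPTY_MD5_B64 = "1B2M2Y8AsgTpgAmY7PhCfg=="

print(manifest_line(EMPTY_MD5_B64, "data/book1"))
# prints: d41d8cd98f00b204e9800998ecf8427e  data/book1

row = fixity_record(
    "rare-books", "collection-europe/italy/", "data/book1", 0, EMPTY_MD5_B64,
    modified=datetime(2020, 1, 1, tzinfo=timezone.utc),
    run_date=datetime(2020, 1, 2, tzinfo=timezone.utc),
)
```

Storing the checksum in hex in both places keeps the manifest and the BigQuery rows directly comparable.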

Process

Fixity Process Diagram

  • A Google Cloud Function listens for changes to a GCS bucket (file archives, file updates)
  • (or) Google Cloud Scheduler invokes the Cloud Function, either manually or on a predefined schedule
  • The function reads the metadata of the files in each Bag* that has file updates and writes a new manifest file into each of those Bags
  • The function writes records into BigQuery for each Bag with new metadata

* If the function is invoked by a change to a GCS bucket, Fixity runs only for the Bag that had the change. If the function is invoked by Cloud Scheduler, Fixity runs for the entire GCS bucket.
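The scoping rule in the note above can be sketched like this. It is an illustration under assumptions, not the repository's implementation: `bags_to_check` is a hypothetical helper, bags are represented as prefixes ending in `/`, and the event is assumed to be a GCS change notification carrying the changed object's path under the `name` key. A scheduled run is modelled as having no event.

```python
from typing import Iterable, List, Optional


def bags_to_check(event: Optional[dict], known_bags: Iterable[str]) -> List[str]:
    """Pick the bags that a Fixity run should cover.

    event      -- GCS change notification (with a "name" key), or None when
                  the run was started by the scheduler.
    known_bags -- bag prefixes in the bucket, each ending in "/".
    """
    known = sorted(known_bags)
    name = (event or {}).get("name")
    if name is None:
        # Scheduled or manual run: cover every bag in the bucket.
        return known
    # Bucket change: cover only the bag whose data/ directory holds the object.
    return [bag for bag in known if name.startswith(bag + "data/")]


bags = ["collection-europe/italy/", "collection-europe/france/"]

changed = {"bucket": "rare-books", "name": "collection-europe/italy/data/book2"}
print(bags_to_check(changed, bags))  # prints: ['collection-europe/italy/']
print(bags_to_check(None, bags))     # prints both bags, sorted
```

In this sketch an object outside any `data/` directory, such as a newly written manifest file, selects no bag. That is one way to keep the function's own manifest writes from triggering another run, though the source does not say how the real function handles this.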

Buckets

This Fixity function is configured for a single Google Cloud Storage bucket containing any number of Bags.

Bags

Bags should be created using the BagIt Specification (RFC 8493). A Bag is a directory in a GCS bucket that contains a data/ directory containing archived files.

Any number of bags can be created in a GCS bucket, as long as each bag contains a data/ directory. In the following example, this function will recognize 4 bags: collection-europe/italy/, collection-europe/france/, collection-na/1700s/, and uncategorized/.

BUCKET: Rare Books
.
├── collection-europe
│   ├── italy
│   │   └── data
│   │       ├── book1
│   │       ├── book2
│   │       └── book3
│   └── france
│       └── data
│           ├── book1
│           └── book2
├── collection-na
│   └── 1700s
│       └── data
│           ├── book1
│           ├── book2
│           └── book3
└── uncategorized
    └── data
        └── a
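The recognition rule (a bag is any directory that directly contains a data/ directory) can be sketched over a flat list of object names, since GCS stores objects under full paths rather than in real directories. `find_bags` is a hypothetical helper written for this example, not the function's own code.

```python
from typing import Iterable, List


def find_bags(object_names: Iterable[str]) -> List[str]:
    """Return the sorted bag prefixes implied by a list of object paths."""
    bags = set()
    for name in object_names:
        parts = name.split("/")
        # Only directory components count; the last part is the file itself.
        if "data" in parts[:-1]:
            idx = parts.index("data")
            bags.add("/".join(parts[:idx]) + "/")
    return sorted(bags)


# Object paths matching the Rare Books layout shown above.
objects = [
    "collection-europe/italy/data/book1",
    "collection-europe/italy/data/book2",
    "collection-europe/italy/data/book3",
    "collection-europe/france/data/book1",
    "collection-europe/france/data/book2",
    "collection-na/1700s/data/book1",
    "collection-na/1700s/data/book2",
    "collection-na/1700s/data/book3",
    "uncategorized/data/a",
]

for bag in find_bags(objects):
    print(bag)
# prints:
# collection-europe/france/
# collection-europe/italy/
# collection-na/1700s/
# uncategorized/
```

The sketch treats the first `data` path component as the bag boundary, so a file that is merely named `data` and objects outside any data/ directory do not create bags.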

BigQuery

The setup instructions create the following BigQuery views:

  • fixity.current_manifest: A current list of all files in the archive across all Bags.
  • fixity.file_operations: A running list of all file operations (file updated, file changed, file created) across all bags.
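The views themselves are defined in BigQuery by the setup instructions and are not reproduced here. As a rough illustration of what a file operations list captures, the sketch below compares the checksums from two consecutive fixity runs of one bag. The `file_operations` helper and the operation labels `created` and `changed` are assumptions for the example, not the view's actual column values.

```python
from typing import Dict, List, Tuple


def file_operations(previous: Dict[str, str], current: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two fixity runs, each a mapping of file path to hex checksum.

    Returns (path, operation) pairs for files that are new or whose
    checksum differs from the previous run.
    """
    operations = []
    for path, checksum in sorted(current.items()):
        if path not in previous:
            operations.append((path, "created"))
        elif previous[path] != checksum:
            operations.append((path, "changed"))
    return operations


run_1 = {"data/book1": "aaa", "data/book2": "bbb"}
run_2 = {"data/book1": "aaa", "data/book2": "ccc", "data/book3": "ddd"}

print(file_operations(run_1, run_2))
# prints: [('data/book2', 'changed'), ('data/book3', 'created')]
```

Under the same assumptions, the most recent run (`run_2` here) corresponds to what a current manifest would list for the bag.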

Setup

Limitations

Cloud Functions has a default memory limit of 256MB per function invocation. To avoid hitting this limit, distribute bags and objects across multiple buckets. Keeping each bucket under 250,000 objects is recommended.
