In the same directory as the Giftless server, make sure there is a file `credentials.json`, which you can obtain when setting up a service account on Google Cloud Storage. Its content should be of the following form:
```json
{
  "type": "service_account",
  "project_id": "XXXXXXXXXXXXXXXX",
  "private_key_id": "XXXXXXXXXXXXXXXX",
  "private_key": "-----BEGIN PRIVATE KEY-----\nXXXXXXXXXXXXXXXX\nXXXXXXXXXXXXXXXX\n-----END PRIVATE KEY-----\n",
  "client_email": "email@project-name.iam.gserviceaccount.com",
  "client_id": "XXXXXXXXXXXXXXXX",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/email%40project-name.iam.gserviceaccount.com"
}
```
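If you still need to generate this key, one way to do it is with the gcloud CLI; the sketch below assumes the service account already exists and that you are authenticated against the right project:

```bash
# Create a fresh JSON key for the existing service account and save it
# as credentials.json in the current (Giftless) directory.
gcloud iam service-accounts keys create credentials.json \
  --iam-account=email@project-name.iam.gserviceaccount.com
```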
You should have a file `giftless.yaml` at the root directory where the Giftless server is located (the same directory as `credentials.json`). It should contain something like the following for a Google Cloud Storage configuration:
```yaml
TRANSFER_ADAPTERS:
  basic:
    factory: giftless.transfer.basic_streaming:factory
    options:
      storage_class: giftless.storage.google_cloud:GoogleCloudStorage
      storage_options:
        project_name: gift-data
        bucket_name: gift-datasets
        account_key_file: credentials.json
AUTH_PROVIDERS:
  - giftless.auth.allow_anon:read_write
```
For convenience, this file has been added to this repository as giftless.yaml.
Additionally, you will need to export the following environment variable for Giftless to use this configuration file:
```bash
export GIFTLESS_CONFIG_FILE=giftless.yaml
```
With a local setup, you can run the server from a fresh clone of the datopian/giftless repo using uWSGI. More details can be found in the section Google Cloud Platform Support.
```bash
uwsgi -M -T --threads 2 -p 2 --manage-script-name \
  --module giftless.wsgi_entrypoint --callable app --http 127.0.0.1:8080
```
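Once it is running, a quick way to check that Giftless answers is to poke the Git LFS batch endpoint, which it serves per organization and repository. In the sketch below, `gift-data/repo-name` is a placeholder path and the oid is simply the SHA-256 of an empty file:

```bash
# Any JSON response (even an error about an unknown object) shows the
# server is reachable and speaking the LFS batch protocol.
curl -X POST http://127.0.0.1:8080/gift-data/repo-name/objects/batch \
  -H 'Content-Type: application/vnd.git-lfs+json' \
  -d '{"operation": "download", "transfers": ["basic"],
       "objects": [{"oid": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", "size": 0}]}'
```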
This happens in two main steps:

- Download the data package file (`datapackage.json`) for each dataset we are interested in;
- Read those files to determine the URLs of the associated resources, and retrieve them into separate repositories.
The Python script used to perform those actions can be found in the file retrieve_resources.py.
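As a rough illustration of those two steps (this is only a shell sketch, not the actual contents of retrieve_resources.py), assume a hypothetical file datasets.txt listing the `datapackage.json` URLs, and that each resource's URL appears under `resources[].path` in the descriptor:

```bash
#!/usr/bin/env bash
# Sketch only: datasets.txt and the directory layout are assumptions.
while read -r url; do
  name=$(basename "$(dirname "$url")")  # derive a dataset name from the URL
  mkdir -p "$name" && cd "$name"
  curl -sLO "$url"                      # step 1: fetch datapackage.json
  # step 2: read each resource URL from the descriptor and download it
  jq -r '.resources[].path' datapackage.json | while read -r res; do
    curl -sLO "$res"
  done
  cd ..
done < datasets.txt
```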
The data itself is stored in the cloud, but we need a GitHub repository to store the metadata with Git-LFS. To create repositories programmatically on GitHub, you need a personal access token and can use the following syntax:
```bash
curl -H "Authorization: token ACCESS_TOKEN" --data \
  '{"name":"NEW_REPO_NAME"}' https://api.github.com/user/repos
```
The process can be greatly sped up by looping over a list of repository names; a sketch of such a loop follows the commands below. Preparing the repos locally is equally quick:
```bash
mkdir repo-name [...]
git clone git@github.com:gift-data/repo-name.git repo-name && \
git clone [...]
```
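For example, a minimal sketch of such a loop, assuming the names are listed one per line in a hypothetical file repos.txt and that `ACCESS_TOKEN` is exported:

```bash
while read -r repo; do
  # Create the repository on GitHub, then clone it locally.
  curl -H "Authorization: token $ACCESS_TOKEN" \
       --data "{\"name\":\"$repo\"}" https://api.github.com/user/repos
  git clone "git@github.com:gift-data/$repo.git" "$repo"
done < repos.txt
```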
Note: You will experience issues when trying to push a commit larger than 2GB. This is explained in this GitHub Community post.
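One common workaround is to split the work into several smaller commits and push after each one. A minimal sketch, where the batch size of 50 files and the CSV-only layout are assumptions:

```bash
# Commit and push the CSV files in batches so no single push gets
# anywhere near the 2GB limit.
printf '%s\n' *.csv | xargs -n 50 | while read -r batch; do
  git add $batch   # unquoted on purpose: the batch is a space-separated list
  git commit -m "Add a batch of CSV files"
  git push origin master
done
```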
The general idea is to have the Giftless server ready and perform the following actions:
```bash
cd repo-name && \
git lfs track "*.csv" && \
git add . && \
git commit -m "Add tracked CSV files" && \
git lfs push origin master && \
cd ..
```
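Note that these commands assume each repository's LFS endpoint points at the Giftless server rather than at github.com. One way to arrange this (a sketch, with the organization and repository names as placeholders) is a committed `.lfsconfig` file:

```bash
# Point Git LFS at the local Giftless server for this repository.
git config -f .lfsconfig lfs.url http://127.0.0.1:8080/gift-data/repo-name
git add .lfsconfig
```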
The process can be automated, as shown in the Bash script push_lfs.sh.
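A minimal sketch of what such automation could look like (not necessarily the actual contents of push_lfs.sh; repos.txt is the assumed list of repository names from earlier):

```bash
while read -r repo; do
  (
    cd "$repo" || exit 1
    git lfs track "*.csv"
    git add .
    git commit -m "Add tracked CSV files"
    git lfs push origin master
  )
done < repos.txt
```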
Finally, we want to update the remote repositories by pushing our local commits once the files are stored on Google Cloud Storage. This can be done with a command like the following:
```bash
cd repo-name && git push origin && cd ..
cd repo-name-2 && git push origin && cd ..
```
This is automated for the GIFT datasets in the file push_git.sh.
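Again, a minimal sketch of the kind of loop such a script might contain (not necessarily the actual contents of push_git.sh; repos.txt is the same assumed list as above):

```bash
while read -r repo; do
  (cd "$repo" && git push origin)  # subshell, so the cd does not persist
done < repos.txt
```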