GCP load data to BigQuery
This page describes the process to load data from a PostgreSQL database into BigQuery. For this work we will be migrating the mimic-fhir demo data, but the workflow could be extended to the full dataset.
Set up some environment variables first:
export OUTPUT_DIR=<path to output CSVs>
export GS_CSV_PATH=<cloud storage bucket path, e.g. gs://some/path>
- OUTPUT_DIR must include a trailing slash (e.g. /path/to/file/)
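As a concrete example (these paths are purely illustrative; substitute your own local directory and bucket):
export OUTPUT_DIR=/tmp/mimic_fhir_csv/
export GS_CSV_PATH=gs://my-bucket/mimic-fhir-csv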
Then run the CSV generation script:
psql -d mimic_demo --set output_dir=$OUTPUT_DIR -f psql_export_fhir_to_csv.sql
gsutil -m cp -r $OUTPUT_DIR/* $GS_CSV_PATH
- this step will take ~5-10 minutes due to the size of chartevents/labevents
- to speed it up, look into enabling parallel_composite_upload_threshold in the .boto configuration file (see the sketch below)
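A minimal sketch of the relevant .boto entry (the 150M cutoff is only an example value, not a recommendation from this project):
[GSUtil]
parallel_composite_upload_threshold = 150M
Alternatively, the same setting can be passed for a single invocation without editing the file, e.g. gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" -m cp -r $OUTPUT_DIR/* $GS_CSV_PATH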
Create the tables in BigQuery. The UUID and JSONB datatypes from Postgres are mapped to STRING in BigQuery.
Create schema first:
bq mk mimic_fhir
Then generate tables:
bq query --use_legacy_sql=false < bq_create_fhir_tables.sql
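For reference, a rough sketch of what one definition in bq_create_fhir_tables.sql might look like (the patient table and its columns here are illustrative; the actual definitions live in that file). A Postgres uuid id and jsonb fhir column simply become STRING:
bq query --use_legacy_sql=false 'CREATE TABLE mimic_fhir.patient (id STRING, fhir STRING)'
Afterwards, bq ls mimic_fhir can be used to confirm the tables exist.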
Load the data from Cloud Storage into BigQuery.
Use the shell script:
./bq_load_fhir_tables
The script is a series of bq load commands, one for each table. An example load command would be:
bq load --source_format=CSV mimic_fhir.<table> $GS_CSV_PATH/<table>.csv
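If you prefer not to maintain one command per table, a rough equivalent as a loop (assuming each CSV in the bucket is named after its target table) would be:
for f in $(gsutil ls $GS_CSV_PATH/*.csv); do
  table=$(basename $f .csv)
  bq load --source_format=CSV mimic_fhir.$table $f
done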