
GCP load data to BigQuery


This page describes the process to load data from a PostgreSQL database into BigQuery. For this work we will be migrating the mimic-fhir demo data, but the workflow could be extended to the full dataset.

Set up some environment variables first:

export OUTPUT_DIR=<path to output CSVs>
export GS_CSV_PATH=<cloud storage bucket path, e.g. gs://some/path>
  • OUTPUT_DIR must include a trailing slash (e.g. /path/to/file/)
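
For example, with placeholder values (the local path and bucket name below are illustrative, not actual project values):

# Placeholder values only; substitute your own output directory and bucket
export OUTPUT_DIR=/tmp/mimic_fhir_export/
export GS_CSV_PATH=gs://my-mimic-bucket/mimic-fhir-csv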

Steps to load data into BigQuery

1. Copy data out of PostgreSQL as CSVs

Run the CSV generation script:

psql -d mimic_demo --set output_dir=$OUTPUT_DIR -f psql_export_fhir_to_csv.sql
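
The script presumably iterates over the FHIR tables using psql's \copy with the :output_dir variable; a single table could be exported the same way from the shell (the mimic_fhir.patient table name is an assumption about the schema, not confirmed here):

# Hypothetical one-off export of a single table to CSV, mirroring what the script does per table
psql -d mimic_demo -c "\copy (SELECT * FROM mimic_fhir.patient) TO '${OUTPUT_DIR}patient.csv' WITH (FORMAT csv)"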

2. Upload CSVs to Cloud Storage

gsutil -m cp -r $OUTPUT_DIR/* $GS_CSV_PATH
  • This step will take ~5-10 minutes due to the size of chartevents/labevents.
  • To speed up large uploads, consider enabling parallel_composite_upload_threshold (see the sketch after this list).
    • It can be set in the [GSUtil] section of the .boto configuration file, or passed per invocation with -o.
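
A minimal sketch of the per-invocation alternative, passing the setting with gsutil's -o flag (the 150M threshold is just an example value):

# Enable parallel composite uploads for files above 150 MB, for this run only
gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" -m cp -r $OUTPUT_DIR/* $GS_CSV_PATH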

3. Create mimic_fhir schema in BigQuery

Create the tables in BigQuery. The UUID and JSONB data types from Postgres are mapped to STRING in BigQuery.

Create the dataset (schema) first:

bq mk mimic_fhir
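
If the dataset needs an explicit location, it can be set at creation time; a sketch, assuming the bucket lives in the US multi-region (adjust --location to match the bucket, since load jobs expect the dataset and bucket to be colocated):

# Create the dataset with an explicit location; "US" here is an assumption
bq mk --dataset --location=US mimic_fhir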

Then generate tables:

bq query --use_legacy_sql=false < bq_create_fhir_tables.sql
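
For illustration, the statements in that file presumably follow a pattern like the one below, with the Postgres UUID and JSONB columns declared as STRING (the table and column names here are assumptions, not the actual schema):

# Hypothetical single-table equivalent of one statement in bq_create_fhir_tables.sql
bq query --use_legacy_sql=false 'CREATE TABLE IF NOT EXISTS mimic_fhir.patient (id STRING, fhir STRING)'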

4. Load data into BigQuery

Load the data from Cloud Storage into BigQuery.

Use the shell script:

./bq_load_fhir_tables

The script is a series of bq load commands, one for each table. An example load command:

bq load --source_format=CSV mimic_fhir.<table> $GS_CSV_PATH/<table>.csv
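
If the exported CSVs contain a header row, each load would also need --skip_leading_rows=1; whether that applies depends on how the export script writes the files. A quick spot check after loading (the patient table name is again an assumption):

# Verify a table loaded by counting its rows
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM mimic_fhir.patient'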