
Swap dataproc batch_id declaration to model config #804

Merged

14 commits merged into dbt-labs:main on Aug 11, 2023

Conversation

@nickozilla (Contributor) commented Jun 29, 2023

resolves #671

Description

My initial implementation in #727 does allow users to set the batch_id for Dataproc Serverless models in the profile, and it does apply to a Python model, but it wouldn't work for any project with more than one Python model (which rather defeats the point of including it at all), because the batch_id parameter needs to be unique, and uniqueness can't be achieved via a profile-level definition.

I've created this new implementation, which declares batch_id at the model level instead and is the preferred approach; a sketch of the idea follows.
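
For illustration only, a minimal sketch of what a model-level declaration could look like with this change (the file name, upstream ref, and batch_id value are hypothetical, not taken from this PR's code):

```python
# models/my_python_model.py  (hypothetical file name)
def model(dbt, session):
    dbt.config(
        submission_method="serverless",          # run on Dataproc Serverless
        batch_id="analytics-my-model-20230629",  # hypothetical unique ID for this model
    )
    # Hypothetical upstream model; the point is only that batch_id now
    # lives alongside the model it identifies.
    return dbt.ref("upstream_model")
```

Because each model carries its own batch_id, two Python models in the same project no longer collide the way a single profile-level value would.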

Checklist

@cla-bot cla-bot bot added the cla:yes label Jun 29, 2023
@nickozilla nickozilla marked this pull request as ready for review June 30, 2023 08:27
@nickozilla nickozilla requested a review from a team as a code owner June 30, 2023 08:27
@nickozilla (Contributor Author)

@dbeatty10 @colin-rogers-dbt - I realised that the implementation I merged in didn't actually address the desired functionality; this PR fixes it.

@dbeatty10 dbeatty10 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Jul 12, 2023
@dbeatty10 (Contributor)

@nickozilla we're doing release candidates (RC) for dbt-core v1.6 and their associated adapters right now.

So that dbt-bigquery is in a solid state for the RC for 1.6 and to avoid the issue described in #822, we're going to use #826 to revert the original PR in #727.

Then going forward, this PR (#804) can be used as the new implementation of the batch_id feature, with the aim of having it ready for inclusion in dbt v1.7.

@dbeatty10 (Contributor)

@nickozilla could you add at least one test case that covers this new behavior? Ideally, it would include at least two different dbt Python models each with a different custom batch_id.

Let us know if you need any help finding an existing test as a template to follow.

.changes/unreleased/Fixes-20230630-092618.yaml (review thread, resolved)
nickozilla and others added 2 commits July 23, 2023 14:10
Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
@dbeatty10 (Contributor)

@nickozilla could you reply to dbt-labs/docs.getdbt.com#3718 with some code examples that we can include in the documentation?

@nickozilla (Contributor Author) commented Aug 1, 2023

Hi @dbeatty10, sorry for the delay getting back to you. I've set up the dev environment locally and added a test to tests/functional/adapter/test_python_model.py, but the infra we're using doesn't fit well in the test framework, mainly because of the lack of support for defining the params we use in our profile (most of which are not optional):

```yaml
target: dev
outputs:
  dev:
    dataset: dev
    job_execution_timeout_seconds: 2000
    job_retries: 1
    location: EU
    method: oauth
    priority: interactive
    project: "{{ env_var('WAREHOUSE_ANALYTICS_PROJECT_ID') }}"
    threads: 8
    type: bigquery
    dataproc_region: europe-west1
    gcs_bucket: "{{ env_var('PYTHON_DBT_MODELS_BUCKET') }}"
    dataproc_batch:
      environment_config:
        execution_config:
          service_account: "dbt-py@{{ env_var('WAREHOUSE_ANALYTICS_PROJECT_ID') }}.iam.gserviceaccount.com"
          subnetwork_uri: "projects/{{ env_var('NETWORK_PROJECT_ID') }}/regions/europe-west1/subnetworks/dataproc"
          network_tags: ["dataproc-ingress"]
          staging_bucket: "{{ env_var('PYTHON_DBT_STAGING_BUCKET') }}"
      pyspark_batch:
        jar_file_uris:
          ["gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.13-0.29.0.jar"]
      runtime_config:
        container_image: "{{ env_var('PYTHON_DBT_IMAGESTORE') }}:{{ env_var('PYTHON_DBT_TAG') }}"
```

AFAICT the test framework assumes Python models are using a cluster on the default project network, and gives no extra support for the parameters we're setting via the test.env file. Is there a way I can set these parameters for the tests? Otherwise I can try adding some tests and having the CI on this PR run the checks, but that's not ideal.
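
One possible avenue, sketched here as an assumption rather than a confirmed fix: the dbt functional test framework lets a test class redefine the `dbt_profile_target` pytest fixture to supply its own profile dict. Whether every Dataproc key is honoured there is an open question; the env var names below are taken from the profile above:

```python
import os

import pytest


@pytest.fixture(scope="class")
def dbt_profile_target():
    # Sketch only: override the test profile with the Dataproc settings
    # from the profile shown earlier in this thread.
    return {
        "type": "bigquery",
        "method": "oauth",
        "threads": 8,
        "project": os.environ["WAREHOUSE_ANALYTICS_PROJECT_ID"],
        "dataset": "dev",
        "location": "EU",
        "dataproc_region": "europe-west1",
        "gcs_bucket": os.environ["PYTHON_DBT_MODELS_BUCKET"],
    }
```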

I've also included a model yaml configuration example in the issue you've linked, and tested it locally, see below.

[Screenshot, 2023-08-01: local run output showing the model-level batch_id being applied]
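
For readers without the screenshot, the configuration tested was along these lines (the model name and batch_id value here are illustrative, not the exact ones from the linked issue):

```yaml
# models/schema.yml  (illustrative model name and batch_id value)
models:
  - name: my_python_model
    config:
      submission_method: serverless
      batch_id: my-unique-batch-id
```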

@dbeatty10 (Contributor)

@nickozilla no worries!

If you commit your changes to tests/functional/adapter/test_python_model.py, we can try running them against the CI we have set up and see if it works or not 🤞 That might reduce the burden of resolving the mismatch you mentioned.

@nickozilla (Contributor Author)

@dbeatty10 I've added the tests as requested, though I'm not sure I'm using assert len() correctly.
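
As a reference point, a functional test in the style of tests/functional/adapter/test_python_model.py might look roughly like the sketch below; the model bodies, file names, and the expected count of 2 are assumptions for illustration, not the PR's actual test:

```python
import pytest

from dbt.tests.util import run_dbt

# Hypothetical Python models, each declaring its own unique batch_id.
FIRST_MODEL = """
def model(dbt, session):
    dbt.config(submission_method="serverless", batch_id="batch-id-one")
    return session.createDataFrame([(1,)], ["id"])
"""

SECOND_MODEL = """
def model(dbt, session):
    dbt.config(submission_method="serverless", batch_id="batch-id-two")
    return session.createDataFrame([(2,)], ["id"])
"""


class TestPythonModelBatchId:
    @pytest.fixture(scope="class")
    def models(self):
        return {
            "first_model.py": FIRST_MODEL,
            "second_model.py": SECOND_MODEL,
        }

    def test_unique_batch_ids(self, project):
        # run_dbt returns one result per executed node, so two Python
        # models should yield two results.
        results = run_dbt(["run"])
        assert len(results) == 2
```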

@colin-rogers-dbt colin-rogers-dbt enabled auto-merge (squash) August 11, 2023 21:06
@colin-rogers-dbt colin-rogers-dbt merged commit 3510f76 into dbt-labs:main Aug 11, 2023
10 checks passed
Labels
cla:yes · ok to test · ready_for_review (Externally contributed PR has functional approval, ready for code review from Core engineering)
Development

Successfully merging this pull request may close these issues.

[ADAP-462] [Feature] Allow override of CreateBatchRequest for batch_id
4 participants