Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-462] [Feature] Allow override of CreateBatchRequest for batch_id #671

Closed
3 tasks done
nickozilla opened this issue Apr 18, 2023 · 8 comments · Fixed by #727 or #804
Closed
3 tasks done

[ADAP-462] [Feature] Allow override of CreateBatchRequest for batch_id #671

nickozilla opened this issue Apr 18, 2023 · 8 comments · Fixed by #727 or #804
Labels
enhancement New feature or request good_first_issue Good for newcomers

Comments

@nickozilla
Copy link
Contributor

nickozilla commented Apr 18, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently the dataproc_profile supports profile configuration of a batch resource which is used in the CreateBatchRequest & leaves the batch_id undeclared forcing a random string generation. It would be better if we could customise this field of batch_id optionally

Describe alternatives you've considered

Labels mainly, but they can't be used as ui filter on the GCP console

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

@nickozilla nickozilla added enhancement New feature or request triage labels Apr 18, 2023
@github-actions github-actions bot changed the title [Feature] Allow override of CreateBatchRequest for batch_id [ADAP-462] [Feature] Allow override of CreateBatchRequest for batch_id Apr 18, 2023
@dbeatty10
Copy link
Contributor

Thanks for reaching out @nickozilla !

I'm assuming you mean dataproc_batch rather than dataproc_profile?

And I'm guessing some variation on the following doesn't work?

Full disclosure: I haven't used the dataproc_batch config personally yet (and I didn't notice it in our documentation), so I'm learning as I go here!

profiles.yml

myproject:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: "{{ env_var('GCP_PROJECT') }}"
      dataset: tmp
      threads: 4
      gcs_bucket: "my-bucket"
      dataproc_region: "europe-west1"
      submission_method: serverless
      dataproc_batch:
        batch_id: my_custom_batch_id
        environment_config:
          execution_config:
            service_account: "dbt@{{ env_var('GCP_PROJECT') }}.iam.gserviceaccount.com"
            subnetwork_uri: "dataproc"
        labels:
          project: "my-project"
          role: "dev"
        runtime_config:
          properties:
            spark.executor.instances: 3
            spark.driver.memory: "1g"

More info

The dataproc_batch config was originally added in #578.

Some relevant sections of code:

def _submit_dataproc_job(self) -> dataproc_v1.types.jobs.Job:
batch = self._configure_batch()
parent = f"projects/{self.credential.execution_project}/locations/{self.credential.dataproc_region}"
request = dataproc_v1.CreateBatchRequest(
parent=parent,
batch=batch,
)

# Apply configuration from dataproc_batch key, possibly overriding defaults.
if self.credential.dataproc_batch:
self._update_batch_from_config(self.credential.dataproc_batch, batch)
return batch
@classmethod
def _update_batch_from_config(
cls, config_dict: Union[Dict, DataprocBatchConfig], target: dataproc_v1.Batch
):
try:
# updates in place
ParseDict(config_dict, target._pb)
except Exception as e:
docurl = (
"https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1"
"#google.cloud.dataproc.v1.Batch"
)
raise ValueError(
f"Unable to parse dataproc_batch as valid batch specification. See {docurl}. {str(e)}"
) from e

@nickozilla
Copy link
Contributor Author

nickozilla commented Apr 19, 2023

Hi @dbeatty10 - yes you were right I meant dataproc_batch rather than dataproc_profile - also I tried the example you posted out of curiosity and it failed:

08:47:50  Unable to parse dataproc_batch as valid batch specification. See https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.Batch. Message type "google.cloud.dataproc.v1.Batch" has no field named "batch_id" at "Batch".
08:47:50   Available Fields(except extensions): "['name', 'uuid', 'createTime', 'pysparkBatch', 'sparkBatch', 'sparkRBatch', 'sparkSqlBatch', 'runtimeInfo', 'state', 'stateMessage', 'stateTime', 'creator', 'labels', 'runtimeConfig', 'environmentConfig', 'operation', 'stateHistory']"

batch_id is only accessible in CreateBatchRequest

https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.CreateBatchRequest

@dbeatty10
Copy link
Contributor

I see what you are saying @nickozilla 👍

The implementation of the feature you are requesting might be pretty lightweight -- would you be interested in contributing it, by any chance? No pressure tho!

Implementation hints

Caveat: the following is completely untested and potentially way too optimistic 😎

  1. Add batch_id: Optional[str] = None above here:
    https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/adapters/bigquery/connections.py#L137-L142

  2. Then add batch_id=self.credential.batch_id, in here:

    request = dataproc_v1.CreateBatchRequest(
    parent=parent,
    batch=batch,
    )

@nickozilla
Copy link
Contributor Author

Yeah sounds good, I'll give that a go & link back to this issue in the PR 👍

@nickozilla
Copy link
Contributor Author

nickozilla commented May 17, 2023

@dbeatty10 You were right to be optimistic, I made the change & tested locally & it was pulled in succesfully. Raised the PR for it here: #727
Screenshot 2023-05-17 at 13 09 09
Screenshot 2023-05-17 at 13 10 12

@dbeatty10
Copy link
Contributor

That's great news @nickozilla! 🎉

That would be a perfect screenshot to add to the pull request as well.

@dbeatty10 dbeatty10 removed the triage label May 17, 2023
@nickozilla
Copy link
Contributor Author

@dbeatty10 do you have a rough idea of when the PR will be reviewed? Currently been open for >3weeks without any comments/reviews.

@dbeatty10
Copy link
Contributor

Re-opening after the revert in #826

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good_first_issue Good for newcomers
Projects
None yet
2 participants