model executor for s3/gcs/azure to duckdb #6353

Open
wants to merge 4 commits into
base: main

Conversation

k-anshul
Member

@k-anshul k-anshul commented Jan 6, 2025

subtask for https://github.com/rilldata/rill-private-issues/issues/854

Adds a model executor that ingests data from S3/GCS/Azure using DuckDB extensions (note: GCS works via S3 compatibility mode).
It also supports incremental ingestion.
Sample model YAML that ingests from S3:

connector: s3
path: s3://rill-developer.rilldata.io/AdBids.csv.gz

output:
  connector: duckdb

Note: Tested only for GCS and S3.
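
For readers unfamiliar with how a path like the one in the sample YAML can end up as a DuckDB query, here is a minimal Go sketch. It is not the executor from this PR: the helper names `readerFor` and `ingestSQL`, the extension-to-reader mapping, and the `AdBids` table name are assumptions made for illustration. Only the DuckDB table functions `read_csv`, `read_json`, and `read_parquet` are real.

```go
package main

import (
	"fmt"
	"strings"
)

// readerFor picks a DuckDB table function from the file extension.
// Hypothetical helper: the PR's executor may choose the reader differently.
func readerFor(path string) (string, error) {
	p := strings.ToLower(path)
	// Drop a trailing compression suffix so "x.csv.gz" is treated as CSV.
	for _, c := range []string{".gz", ".zst"} {
		p = strings.TrimSuffix(p, c)
	}
	switch {
	case strings.HasSuffix(p, ".csv"):
		return "read_csv", nil
	case strings.HasSuffix(p, ".json"):
		return "read_json", nil
	case strings.HasSuffix(p, ".parquet"):
		return "read_parquet", nil
	}
	return "", fmt.Errorf("unsupported file extension in %q", path)
}

// ingestSQL builds a CREATE TABLE AS statement that reads the object
// directly from the object store. The table name is assumed to be a
// trusted identifier; the path is escaped as a SQL string literal.
func ingestSQL(table, path string) (string, error) {
	fn, err := readerFor(path)
	if err != nil {
		return "", err
	}
	lit := strings.ReplaceAll(path, "'", "''")
	return fmt.Sprintf("CREATE OR REPLACE TABLE %s AS SELECT * FROM %s('%s')", table, fn, lit), nil
}

func main() {
	sql, err := ingestSQL("AdBids", "s3://rill-developer.rilldata.io/AdBids.csv.gz")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sql)
}
```

The point of the sketch is only that no file download step is needed: DuckDB reads the object itself once a matching secret exists.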

@k-anshul k-anshul self-assigned this Jan 6, 2025
runtime/drivers/gcs/gcs.go (outdated, resolved)
runtime/drivers/duckdb/model_executor_objectstore_self.go (outdated, resolved)
Comment on lines 120 to 123
sb.WriteString(" (TYPE GCS")
if s3Config.AllowHostAccess {
	sb.WriteString(", PROVIDER CREDENTIAL_CHAIN")
}
Contributor

For GCS, what does CREDENTIAL_CHAIN do? If it doesn't support GOOGLE_APPLICATION_CREDENTIALS (which I believe it doesn't?), then if we start using this implementation for legacy sources, it will be a regression, right? I'm wondering if we need to keep a fallback to the old implementation where we download files in that case.

Member Author

  1. In DuckDB, GCS is just a convenience wrapper over S3 that sets the endpoint implicitly. So if I set aws_access_key_id and aws_secret_access_key as environment variables and create the following secret in DuckDB, I can query files stored on GCS:
CREATE SECRET gcs_secret (
     TYPE GCS,
     PROVIDER CREDENTIAL_CHAIN
);
SELECT * FROM read_json(['gcs://.../*.json.gz']);
  1. It does not support GOOGLE_APPLICATION_CREDENTIALS.
  2. My understanding was that, since we are so far the only ones using GCS, we would migrate our projects to use S3 compatibility. Given that neither DuckDB nor ClickHouse has native GCS support, I get the impression that either people are comfortable using S3 compatibility mode or nobody is using GCS :)
    In my opinion, it would be ideal to deprecate our homegrown connectors for object stores as well and rely only on DuckDB's connectors, but I will let you make the call.
    If we do end up supporting our homegrown connectors for GCS, then I think it would be best not to support GCS via DuckDB at all; anyone who wants to use DuckDB's connectors can use S3 with the endpoint set.
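
To make the two options in this reply concrete, here is a minimal Go sketch of building the DuckDB secret statements. It is not the code under review: `gcsSecretOptions`, `buildGCSSecret`, and `buildS3CompatSecret` are hypothetical names, and the secret name is assumed to be a trusted identifier. The `KEY_ID`, `SECRET`, and `ENDPOINT` parameters follow DuckDB's secrets syntax, and `storage.googleapis.com` is the GCS endpoint for S3-compatible (HMAC) access.

```go
package main

import (
	"fmt"
	"strings"
)

// gcsSecretOptions is a hypothetical config struct for this sketch.
type gcsSecretOptions struct {
	KeyID           string // HMAC access key ID
	Secret          string // HMAC secret
	AllowHostAccess bool   // use the host's credential chain when no keys are set
}

// escape doubles single quotes so a value can sit inside a SQL string literal.
func escape(s string) string {
	return strings.ReplaceAll(s, "'", "''")
}

// buildGCSSecret returns a CREATE SECRET statement using DuckDB's GCS type.
// Explicit HMAC keys take precedence over the credential chain.
func buildGCSSecret(name string, o gcsSecretOptions) string {
	var sb strings.Builder
	sb.WriteString("CREATE OR REPLACE SECRET ")
	sb.WriteString(name)
	sb.WriteString(" (TYPE GCS")
	switch {
	case o.KeyID != "" && o.Secret != "":
		fmt.Fprintf(&sb, ", KEY_ID '%s', SECRET '%s'", escape(o.KeyID), escape(o.Secret))
	case o.AllowHostAccess:
		sb.WriteString(", PROVIDER CREDENTIAL_CHAIN")
	}
	sb.WriteString(")")
	return sb.String()
}

// buildS3CompatSecret returns the alternative mentioned at the end of the
// reply: a plain S3 secret pointed at the GCS endpoint.
func buildS3CompatSecret(name, keyID, secret string) string {
	return fmt.Sprintf(
		"CREATE OR REPLACE SECRET %s (TYPE S3, KEY_ID '%s', SECRET '%s', ENDPOINT 'storage.googleapis.com')",
		name, escape(keyID), escape(secret),
	)
}

func main() {
	fmt.Println(buildGCSSecret("gcs_secret", gcsSecretOptions{AllowHostAccess: true}))
	fmt.Println(buildS3CompatSecret("gcs_via_s3", "my-hmac-id", "my-hmac-secret"))
}
```

Either way the credentials are HMAC keys, not a service-account JSON file, which is why GOOGLE_APPLICATION_CREDENTIALS does not apply.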

Contributor

@begelundmuller begelundmuller Jan 8, 2025

  1. Thanks, I responded in the other comment :)
  2. I'm definitely on board with removing the homegrown connectors if possible! Can you think of a log search in Datadog or similar that we could run to find out whether anyone other than us is using GCS in cloud? If not, we can make this change, but then can you log two follow-up issues for:
    • Updating the docs/UI to instruct people how to set HMAC keys for GCS
    • Adding HMAC keys to all our Rill Cloud projects that use GCS

Member Author

I can't find an easy way to determine which users use the GCS connector, so I added a log here that we can merge and ship in the upcoming release: #6380

Contributor

You could also consider writing a little script that combines rill sudo project search %% with rill sudo clone to grep for sources that contain gs:// or : gcs...

Member Author

Yeah, valid idea, but I think we can wait for the log to be added. That would be somewhat easier?

2 participants