model executor for s3/gcs/azure to duckdb #6353
base: main
Conversation
sb.WriteString(" (TYPE GCS") | ||
if s3Config.AllowHostAccess { | ||
sb.WriteString(", PROVIDER CREDENTIAL_CHAIN") | ||
} |
For GCS, what does `CREDENTIAL_CHAIN` do? If it doesn't support `GOOGLE_APPLICATION_CREDENTIALS` (which I believe it doesn't?), then if we start using this implementation for legacy sources, it will be a regression, right? I'm wondering if we need to keep a fallback to the old implementation where we download files in that case.
- In case of duckdb, GCS is just a convenient wrapper over s3 which takes care of the endpoint implicitly. So if I set `aws_access_key_id` and `aws_secret_access_key` as environment variables and create the following secret in duckdb, then I can query files stored on GCS (an explicit HMAC-key variant is sketched after this list):

  CREATE SECRET gcs_secret (
    TYPE GCS,
    PROVIDER CREDENTIAL_CHAIN
  );
  SELECT * FROM read_json(['gcs://.../*.json.gz']);

- It does not support `GOOGLE_APPLICATION_CREDENTIALS`.
- My understanding was that, since so far we are the only ones using GCS, we will migrate our projects to use s3 compatibility. Given that neither duckdb nor clickhouse has native GCS support, I get the impression that either people are comfortable using s3 compatibility mode or nobody is using GCS :)
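For reference, DuckDB's GCS secret can also take HMAC keys directly instead of `PROVIDER CREDENTIAL_CHAIN`. A minimal sketch, assuming DuckDB's documented `KEY_ID`/`SECRET` parameters for GCS secrets; the key values and the secret name are placeholders:

```sql
-- Sketch only: explicit HMAC credentials instead of the credential chain.
-- The key id and secret values below are placeholders.
CREATE SECRET gcs_secret (
  TYPE GCS,
  KEY_ID 'my_hmac_key_id',
  SECRET 'my_hmac_secret'
);
SELECT * FROM read_json(['gcs://.../*.json.gz']);
```

This explicit form is what the HMAC-key follow-ups later in this thread would rely on.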
In my opinion it would be ideal to deprecate our homegrown connectors for object stores as well and just rely on duckdb's connectors, but I will let you make the call.
If we do end up keeping our homegrown connector for GCS, then I think it would be best not to support GCS via duckdb at all; anybody who wants to use duckdb's connectors can use s3 with the endpoint set.
- Thanks, I responded in the other comment :)
- I'm definitely on board with removing the homegrown connectors if possible! Can you think of a log search in Datadog or similar we could do to find out if anyone other than us is using GCS in cloud? If not, we can make this change, but then can you log two follow-up issues for:
  - Updating the docs/UI to instruct people how to set HMAC keys for GCS
  - Adding HMAC keys to all our Rill Cloud projects that use GCS
I can't find any easy way to find out which users use the GCS connector, so I added a log here that we can merge and release in the upcoming release: #6380
You could also consider doing a little script that combines `rill sudo project search %%` with `rill sudo clone` to grep for sources that contain `gs://` or `: gcs` ...
Yeah, valid idea, but I think we can wait for the log to be added. That would be somewhat easier?
Subtask for https://github.com/rilldata/rill-private-issues/issues/854.
Adds a model executor that ingests data from s3/gcs/azure using duckdb extensions (note: GCS works via s3 compatibility mode).
Also supports incremental ingestion.
Sample model yaml that ingests from s3 (a rough sketch is included below).
Note: Tested only for GCS and S3.
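A rough, illustrative sketch of what such a model yaml could look like; the field names, bucket path, and file format here are assumptions for illustration, not the exact sample or schema from this PR:

```yaml
# Hypothetical sketch only: field names and the bucket path are illustrative assumptions.
type: model
connector: duckdb
incremental: true
sql: >
  SELECT *
  FROM read_json(['s3://my-bucket/events/*.json.gz'])
```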