Discovery Performance #284

Open
BubbaTLC opened this issue Nov 2, 2023 · 4 comments · Fixed by #528

Comments

@BubbaTLC

BubbaTLC commented Nov 2, 2023

Original Issue in Slack Thread

tap-postgres==v0.0.2

Discovery is taking about 1 second per table. This is extremely slow for large schemas (even 60 tables would take a minute).

Discovery was run against a remote PostgreSQL 11 server over a high-performance internet connection. The schema has 3452 tables.

tap-postgres --config config.json --discover  59.60s user 10.66s system 1% cpu 1:21:01.21 total
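
As a rough sanity check on the timing line above, the figures can be turned into a per-table cost and a CPU share. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and the numbers are copied from the `time` output and the table count in this report.

```python
# Figures copied from the `time` output above.
user_s = 59.60
system_s = 10.66
wall_s = 1 * 3600 + 21 * 60 + 1.21  # 1:21:01.21 as seconds
tables = 3452                       # table count reported for the schema

# Average wall-clock time spent per discovered table.
per_table_s = wall_s / tables

# Share of wall-clock time in which the process was actually on the CPU.
cpu_fraction = (user_s + system_s) / wall_s

print(f"{per_table_s:.2f} s per table")
print(f"{cpu_fraction:.1%} of wall time spent on CPU")
```

That works out to roughly 1.4 seconds per table, with the process on the CPU for well under 2% of the run. This is consistent with most of the time being spent waiting on the database rather than computing, though the numbers alone do not prove it.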

Please try to replicate the issue. It might still be something on our end.

Proposed solutions

  1. Increase discovery efficiency
  2. Only run discover on selected tables
  3. Allow the ability to specify tables to discover.
@florian-ernst-alan

I stumbled upon the same issue. It looks like the root cause is in singer_sdk's SQLConnector class (here). Discovery time is indeed linear in the number of tables, since each table is inspected individually.

@florian-ernst-alan

This would probably require the SDK to move to SQLAlchemy 2.0 in order to use the Inspector.get_multi_* methods (documentation).

@edgarrmondragon
Member

> This would probably require the SDK to move to SQLAlchemy 2.0 in order to use the Inspector.get_multi_* methods (documentation)

@florian-ernst-alan interesting. Is this the feature that would be relevant here?

> The objects can be filtered by passing the names to use to filter_names.

@florian-ernst-alan

I think it would, yes. Basically, instead of getting all the information individually (for each table, get its column names, get its primary key...), get all of it at once and then use it downstream.

Let's say I'm 90% sure.

edgarrmondragon pushed a commit that referenced this issue Nov 5, 2024
Overrides the SDK functions to instead use the `get_multi_*` functions
from SQLAlchemy Inspector. On our database of ~120 tables, this reduces
the discovery runtime from 10-12 minutes to about 30 seconds.

- Closes #284