Discovery Performance #284

BubbaTLC · 2023-11-02T15:34:16Z

Original Issue in Slack Thread

tap-postgres==v0.0.2

Discovery takes about 1 second per table, which is extremely slow for large schemas (even 60 tables take a minute).

Discovery was run against a remote PostgreSQL 11 server over a high-performance internet connection. The schema has 3452 tables.

tap-postgres --config config.json --discover  59.60s user 10.66s system 1% cpu 1:21:01.21 total
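As a rough sanity check on the timing line above, the per-table cost and the CPU share can be worked out directly. This is only a sketch: it assumes the `total` field is in zsh's `H:MM:SS.ss` format, and the helper `parse_elapsed` is a hypothetical name introduced for the example.

```python
def parse_elapsed(elapsed: str) -> float:
    """Convert an 'H:MM:SS.ss' (or 'M:SS.ss') string to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        # Each colon-separated field is worth 60x the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


tables = 3452                       # schema size reported above
wall = parse_elapsed("1:21:01.21")  # wall-clock time reported by `time`
cpu = 59.60 + 10.66                 # user + system CPU seconds

per_table = wall / tables
cpu_share = cpu / wall

print(f"{wall:.2f} s total, {per_table:.2f} s per table, {cpu_share:.1%} CPU")
```

Under that assumption the run works out to roughly 1.4 seconds per table with the process using only about 1.4% CPU, which suggests the time is spent waiting on the database rather than computing locally.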

Please try to replicate the issue; it might still be something on our end.

Proposed solutions

- Increase discovery efficiency.
- Only run discovery on selected tables.
- Allow users to specify which tables to discover.
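The last two proposals could look something like the sketch below, which narrows the table list with glob patterns before any per-table inspection happens. Everything here is hypothetical: `select_tables` and the pattern list are illustrative names and are not existing tap-postgres settings.

```python
from fnmatch import fnmatchcase


def select_tables(all_tables, patterns):
    """Keep only the tables whose name matches at least one glob pattern.

    An empty pattern list means "no restriction" and returns every table.
    """
    if not patterns:
        return list(all_tables)
    return [
        table
        for table in all_tables
        if any(fnmatchcase(table, pattern) for pattern in patterns)
    ]


# Hypothetical schema-qualified table names and user-supplied patterns.
all_tables = ["public.users", "public.orders", "audit.events", "audit.logins"]
wanted = select_tables(all_tables, ["public.*", "audit.events"])
print(wanted)
```

With a filter like this, discovery cost would scale with the number of selected tables rather than with the size of the whole schema.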


florian-ernst-alan · 2023-11-13T16:57:53Z

I stumbled upon the same issue. It looks like the root cause is in singer_sdk's `SQLConnector` class (here). Discovery time is indeed linear in the number of tables.

florian-ernst-alan · 2023-11-30T11:28:54Z

This would probably require the SDK to move to SQLAlchemy 2.0 in order to use all the `Inspector.get_multi_*` methods (documentation).

edgarrmondragon · 2023-12-08T01:24:45Z

> This would probably require the SDK to move to SQLAlchemy 2.0 in order to use all the `Inspector.get_multi_*` methods (documentation).

@florian-ernst-alan interesting. Is this the feature that would be relevant here?

> The objects can be filtered by passing the names to use to `filter_names`.

florian-ernst-alan · 2023-12-08T14:42:30Z

I think it would, yes. Basically, instead of fetching the metadata table by table (for each table, get its column names, get its primary key...), fetch all of it at once and then use it downstream.

Let's say I'm 90% sure.

Overrides the SDK functions to instead use the `get_multi_*` functions from SQLAlchemy Inspector. On our database of ~120 tables, this reduces the discovery runtime from 10-12 minutes to about 30 seconds. - Closes #284

edgarrmondragon mentioned this issue Jan 22, 2024

Discovery performance meltano/sdk#2166

Open

edgarrmondragon mentioned this issue Nov 4, 2024

refactor: Optimise metadata discovery for large databases #528

Merged

edgarrmondragon closed this as completed in #528 Nov 5, 2024

edgarrmondragon reopened this Nov 7, 2024
