Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Targets seem very slow #1181

Closed
Jack-Burnett opened this issue Nov 14, 2022 · 8 comments
Closed

[Bug]: Targets seem very slow #1181

Jack-Burnett opened this issue Nov 14, 2022 · 8 comments
Labels

Comments

@Jack-Burnett
Copy link
Contributor

Singer SDK Version

0.13.1

Python Version

3.10

Bug scope

Targets (data type handling, batching, SQL object generation, etc.)

Operating System

MacOS

Description

We have a target processing 1 million records and it takes a few hours, which seems pretty slow.
I tried creating a target that does nothing (so just measuring how slow the sdk itself is) and it takes around 3-4 seconds per 1000 records.
It spends about 60% of the time doing json schema validation, and another 35% doing _parse_timestamps_in_record.

Firstly, I'm wondering what the point in '_parse_timestamps_in_record' is generally; I can see how it would be useful for some targets but not enough to justify doing it for all targets, since many won't need that (and will then have to parse them back!)

I couldn't see anything obviously wrong with the validation, though I'm on the fence about how necessary it is; I wonder if you can just assume that the tap it outputting valid data 🤔
I heard of a library, but haven't used it myself, called fastjsonschema which is supposed to be faster, so that could be one idea.

Code

No response

@Jack-Burnett Jack-Burnett added kind/Bug Something isn't working valuestream/SDK labels Nov 14, 2022
@edgarrmondragon
Copy link
Collaborator

Hey @Jack-Burnett!

Thanks for reporting!

It spends about 60% of the time doing json schema validation

This is not great. The purpose of this AFAICT is to fail early when records are not valid and would cause errors later when trying to persist them to the target.

I wonder if deserializing and schema validation would be improved by low-level bindings to Rust code with explicit structs for Singer messages, so they're quicker to parse and validate.

and another 35% doing _parse_timestamps_in_record.

Firstly, I'm wondering what the point in '_parse_timestamps_in_record' is generally; I can see how it would be useful for some targets but not enough to justify doing it for all targets, since many won't need that (and will then have to parse them back!)

I agree, perhaps we can make that a toggle at the target class level, so developers can choose whether to keep this behavior in their targets. Wdyt?

@Jack-Burnett
Copy link
Contributor Author

Jack-Burnett commented Nov 15, 2022

I think making _parse_timestamps_in_record toggleable makes sense yep.

For validation speed, I have now locally overiden _validate_and_parse and switched it to use fastjsonschema, and it is much much faster, by an order of magnitude.
So this simple switch might be enough without switching to rust bindings.

So making these changes in the core library too would help a lot I think

@visch
Copy link
Contributor

visch commented Nov 19, 2022

@Jack-Burnett awesome! I can confirm the speed has smelled for a while I have some good production test scenarios for target-mssql if / when we get good traction on this :D

@edgarrmondragon
Copy link
Collaborator

One note from thinking about validation in MeltanoLabs/tap-csv#88 (review):

We probably don't want to change how validation works for the target configuration, just for records.

@stale
Copy link

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@stale stale bot removed the stale label Jul 18, 2023
@edgarrmondragon
Copy link
Collaborator

We should probably refactor this method to make it pluggable so target developers are able to make it a no-op if that makes sense for their destination.

@edgarrmondragon
Copy link
Collaborator

We should probably refactor this method to make it pluggable so target developers are able to make it a no-op if that makes sense for their destination.

This was done in a slightly manner by 631d5df, shipped with https://github.com/meltano/sdk/releases/tag/v0.35.0.

Comment if this is still a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants