-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Targets seem very slow #1181
Comments
Hey @Jack-Burnett! Thanks for reporting!
This is not great. The purpose of this AFAICT is to fail early when records are not valid and would cause errors later when trying to persist them to the target. I wonder if deserializing and schema validation would be improved by low-level bindings to Rust code with explicit structs for Singer messages, so they're quicker to parse and validate.
I agree, perhaps we can make that a toggle at the target class level, so developers can choose whether to keep this behavior in their targets. Wdyt? |
I think making _parse_timestamps_in_record toggleable makes sense yep. For validation speed, I have now locally overiden _validate_and_parse and switched it to use fastjsonschema, and it is much much faster, by an order of magnitude. So making these changes in the core library too would help a lot I think |
@Jack-Burnett awesome! I can confirm the speed has smelled for a while I have some good production test scenarios for target-mssql if / when we get good traction on this :D |
One note from thinking about validation in MeltanoLabs/tap-csv#88 (review): We probably don't want to change how validation works for the target configuration, just for records. |
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the |
We should probably refactor this method to make it pluggable so target developers are able to make it a no-op if that makes sense for their destination. |
This was done in a slightly manner by 631d5df, shipped with https://github.com/meltano/sdk/releases/tag/v0.35.0. Comment if this is still a problem. |
Singer SDK Version
0.13.1
Python Version
3.10
Bug scope
Targets (data type handling, batching, SQL object generation, etc.)
Operating System
MacOS
Description
We have a target processing 1 million records and it takes a few hours, which seems pretty slow.
I tried creating a target that does nothing (so just measuring how slow the sdk itself is) and it takes around 3-4 seconds per 1000 records.
It spends about 60% of the time doing json schema validation, and another 35% doing _parse_timestamps_in_record.
Firstly, I'm wondering what the point in '_parse_timestamps_in_record' is generally; I can see how it would be useful for some targets but not enough to justify doing it for all targets, since many won't need that (and will then have to parse them back!)
I couldn't see anything obviously wrong with the validation, though I'm on the fence about how necessary it is; I wonder if you can just assume that the tap it outputting valid data 🤔
I heard of a library, but haven't used it myself, called fastjsonschema which is supposed to be faster, so that could be one idea.
Code
No response
The text was updated successfully, but these errors were encountered: