Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Setting add_record_metadata should be surfaced as builtin target configuration #1199

Closed
edgarrmondragon opened this issue Nov 16, 2022 · 6 comments · Fixed by #1881
Closed

Comments

@edgarrmondragon
Copy link
Collaborator

Singer SDK Version

0.14.0

Python Version

NA

Bug scope

Targets (data type handling, batching, SQL object generation, etc.)

Operating System

NA

Description

The flag is retrieved from config in

@property
def include_sdc_metadata_properties(self) -> bool:
"""Check if metadata columns should be added.
Returns:
True if metadata columns should be added.
"""
return self.config.get("add_record_metadata", False)

and it's documented in

# [SDK Implementation Details](./index.md) - Record Metadata
The SDK can automatically generate `_sdc_` ("Singer Data Capture") metadata properties when
performing data loads in SDK-based targets.
If `add_record_metadata` is defined as
a config option by the developer, and if the user sets `add_record_metadata=True` within
their own configuration, the following columns will be automatically added to each record:
- `_sdc_extracted_at` - Timestamp indicating when the record was extracted the record from the source.
- `_sdc_received_at` - Timestamp indicating when the record was received by the target for loading.
- `_sdc_batched_at` - Timestamp indicating when the record's batch was initiated.
- `_sdc_deleted_at` - Passed from a Singer tap if DELETE events are able to be tracked. In general, this is populated when the tap is synced LOG_BASED replication. If not sent from the tap, this field will be null.
- `_sdc_sequence` - The epoch (milliseconds) that indicates the order in which the record was queued for loading.
- `_sdc_table_version` - Indicates the version of the table. This column is used to determine when to issue TRUNCATE commands during loading, where applicable.

but the target's config schema is not aware of the setting.

This can be fixed by implementing append_builtin_config in the Target class to include that setting.

Code

No response

@BuzzCutNorman
Copy link
Contributor

@edgarrmondragon I ran into my mssql SDK target asking for the _sdc_table_version to be added to tables in version 0.13.1 and 0.14.0 when I send full extracts from piplinewise tap-postgres.

2022-11-17 12:41:38,703 ALTER TABLE stuff.badges ADD _sdc_table_version INTEGER

I haven't set anything for record metadata in my target to my knowledge. I do know I set the following for tap-postgres

    metadata:
      '*':
        replication-method: FULL_TABLE

@edgarrmondragon
Copy link
Collaborator Author

@BuzzCutNorman yeah there's some overloading of metadata here. In this case it refers to special columns added to tables by the target.

In particular _sdc_table_version is relevant for truncation in full-table replication, as you note.

The metadata you set in meltano.yml is the Singer catalog metadata for the tap.

@pnadolny13
Copy link
Contributor

@edgarrmondragon can you explain the _sdc_table_version attribute more? I've seen multiple users, most recently https://meltano.slack.com/archives/C01UTUSP34M/p1679672516968389?thread_ts=1679475632.154379&cid=C01UTUSP34M, ask for a timestamp or ID that allows them to differentiate records between different syncs. Usually batched_at/received_at/extracted_at are record level timestamps so they arent shared across all records in the sync i.e. you cant use a "group by".

Would _sdc_table_version do what we're looking for in this case? Is this related to activate version? If so, does this only get populated when a tap sends an activate version records?

@aaronsteers
Copy link
Contributor

aaronsteers commented Apr 3, 2023

This is indeed the vehicle that is used for activate version. The spec is flexible on the contents, but generally it's only used in full table sync operations, and all records would have the epoch integer corresponding the stream's sync start time.

@edgarrmondragon
Copy link
Collaborator Author

Usually batched_at/received_at/extracted_at are record level timestamps so they arent shared across all records in the sync i.e. you cant use a "group by".

@pnadolny13 I think we could at least be more precise with extracted_at to make it more useful, and link it to the actual extraction time of a record.

In the generic case that could be when get_records is called. For REST streams, that could be when each page request is made, so all records in the same page would share a extracted_at value.


ask for a timestamp or ID that allows them to differentiate records between different syncs

For this user request, we could also add another _sdc_synced_at (happy to hear better naming ideas! 😅) metadata column that would be set to the start time of the target process. That'd mean all records that were loaded in the same run would share this value, across tables. Do you think that would that work?

@pnadolny13
Copy link
Contributor

@pnadolny13 I think we could at least be more precise with extracted_at to make it more useful, and link it to the actual extraction time of a record.

@edgarrmondragon good point, I hadn't looked into it enough but I would expect it to act the way you described i.e. each record in a page request has the same timestamp.

For this user request, we could also add another _sdc_synced_at (happy to hear better naming ideas! 😅) metadata column that would be set to the start time of the target process.

Yep thats exactly what I was looking for. Naming is tough, maybe _sdc_sync_started_at?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants