[Bug]: Setting `add_record_metadata` should be surfaced as builtin target configuration #1199

edgarrmondragon · 2022-11-16T19:15:56Z

Singer SDK Version

0.14.0

Python Version

NA

Bug scope

Targets (data type handling, batching, SQL object generation, etc.)

Operating System

NA

Description

The flag is retrieved from config in

sdk/singer_sdk/sinks/core.py

Lines 187 to 194 in 253851e

    
    @property
    def include_sdc_metadata_properties(self) -> bool:
        """Check if metadata columns should be added.

        Returns:
            True if metadata columns should be added.
        """
        return self.config.get("add_record_metadata", False)

and it's documented in

sdk/docs/implementation/record_metadata.md

Lines 1 to 15 in 1576b5c

    
    # [SDK Implementation Details](./index.md) - Record Metadata

    The SDK can automatically generate `_sdc_` ("Singer Data Capture") metadata properties when
    performing data loads in SDK-based targets.

    If `add_record_metadata` is defined as
    a config option by the developer, and if the user sets `add_record_metadata=True` within
    their own configuration, the following columns will be automatically added to each record:

    - `_sdc_extracted_at` - Timestamp indicating when the record was extracted the record from the source.
    - `_sdc_received_at` - Timestamp indicating when the record was received by the target for loading.
    - `_sdc_batched_at` - Timestamp indicating when the record's batch was initiated.
    - `_sdc_deleted_at` - Passed from a Singer tap if DELETE events are able to be tracked. In general, this is populated when the tap is synced LOG_BASED replication. If not sent from the tap, this field will be null.
    - `_sdc_sequence` - The epoch (milliseconds) that indicates the order in which the record was queued for loading.
    - `_sdc_table_version` - Indicates the version of the table. This column is used to determine when to issue TRUNCATE commands during loading, where applicable.

but the target's config schema is not aware of the setting.

This can be fixed by implementing `append_builtin_config` in the `Target` class so that the config schema includes that setting.
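A minimal sketch of the proposed fix, written against plain dictionaries rather than the SDK itself. The constant name `ADD_RECORD_METADATA_CONFIG`, the description string, and the standalone function signature are assumptions made for illustration; the actual `append_builtin_config` hook in the SDK may look different.

```python
from copy import deepcopy

# Hypothetical JSON Schema fragment describing the setting.
# The constant name and description text are assumptions, not taken from the SDK.
ADD_RECORD_METADATA_CONFIG = {
    "type": "object",
    "properties": {
        "add_record_metadata": {
            "type": ["boolean", "null"],
            "description": "Add _sdc_ metadata columns to each record.",
        },
    },
}


def append_builtin_config(config_jsonschema: dict) -> dict:
    """Return a copy of the schema with the builtin setting merged in."""
    merged = deepcopy(config_jsonschema)
    properties = merged.setdefault("properties", {})
    for name, schema in ADD_RECORD_METADATA_CONFIG["properties"].items():
        # Keep a developer-defined property of the same name untouched.
        properties.setdefault(name, deepcopy(schema))
    return merged


# Example target schema that only declares its own settings.
target_schema = {
    "type": "object",
    "properties": {"host": {"type": "string"}},
}
print(sorted(append_builtin_config(target_schema)["properties"]))
```

Using `setdefault` means a target that already declares `add_record_metadata` keeps its own definition, which seems like the least surprising behavior for existing targets.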

Code

No response


BuzzCutNorman · 2022-11-17T21:36:00Z

@edgarrmondragon I ran into my MSSQL SDK target asking for `_sdc_table_version` to be added to tables in versions 0.13.1 and 0.14.0 when I send full extracts from pipelinewise tap-postgres.

    2022-11-17 12:41:38,703 ALTER TABLE stuff.badges ADD _sdc_table_version INTEGER

I haven't set anything for record metadata in my target to my knowledge. I do know I set the following for tap-postgres

    metadata:
      '*':
        replication-method: FULL_TABLE

edgarrmondragon · 2022-11-17T22:29:44Z

@BuzzCutNorman yeah there's some overloading of metadata here. In this case it refers to special columns added to tables by the target.

In particular _sdc_table_version is relevant for truncation in full-table replication, as you note.

The metadata you set in meltano.yml is the Singer catalog metadata for the tap.

pnadolny13 · 2023-04-03T14:40:42Z

@edgarrmondragon can you explain the `_sdc_table_version` attribute more? I've seen multiple users, most recently in https://meltano.slack.com/archives/C01UTUSP34M/p1679672516968389?thread_ts=1679475632.154379&cid=C01UTUSP34M, ask for a timestamp or ID that allows them to differentiate records between different syncs. Usually batched_at/received_at/extracted_at are record-level timestamps, so they aren't shared across all records in the sync, i.e. you can't use a "group by".

Would `_sdc_table_version` do what we're looking for in this case? Is this related to activate version? If so, does it only get populated when a tap sends an activate version record?

aaronsteers · 2023-04-03T15:07:56Z

This is indeed the vehicle that is used for activate version. The spec is flexible on the contents, but generally it's only used in full table sync operations, and all records would have the epoch integer corresponding to the stream's sync start time.
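To illustrate the grouping question, here is a rough sketch that assumes every record from one full-table sync carries the same `_sdc_table_version` (the sync start time as an epoch integer). The record values and the helper name `group_by_table_version` are made up for the example.

```python
from collections import defaultdict


def group_by_table_version(records: list[dict]) -> dict:
    """Bucket record ids by their _sdc_table_version value."""
    groups = defaultdict(list)
    for record in records:
        # Records without the column (e.g. incremental syncs) land under None.
        groups[record.get("_sdc_table_version")].append(record["id"])
    return dict(groups)


# Made-up rows from two full-table syncs of the same stream.
rows = [
    {"id": 1, "_sdc_table_version": 1680534000000},
    {"id": 2, "_sdc_table_version": 1680534000000},
    {"id": 1, "_sdc_table_version": 1680620400000},
]
print(group_by_table_version(rows))
```

Since the column is generally only populated in full table syncs, this grouping would not help for incremental streams.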

edgarrmondragon · 2023-04-03T15:53:10Z

> Usually batched_at/received_at/extracted_at are record-level timestamps, so they aren't shared across all records in the sync, i.e. you can't use a "group by".

@pnadolny13 I think we could at least be more precise with extracted_at to make it more useful, and link it to the actual extraction time of a record.

In the generic case that could be when `get_records` is called. For REST streams, that could be when each page request is made, so all records in the same page would share an extracted_at value.
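A small sketch of the per-page idea, assuming pages arrive as lists of record dictionaries. The function name `stamp_pages` is hypothetical, and the timestamp here is taken when each page is processed, which only stands in for the time of the page request.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator


def stamp_pages(pages: Iterable[list[dict]]) -> Iterator[dict]:
    """Yield records, sharing one _sdc_extracted_at value per page."""
    for page in pages:
        # One timestamp per page stands in for "when the page request was made".
        extracted_at = datetime.now(timezone.utc).isoformat()
        for record in page:
            yield {**record, "_sdc_extracted_at": extracted_at}


# Two made-up pages of records.
pages = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
stamped = list(stamp_pages(pages))
```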

> ask for a timestamp or ID that allows them to differentiate records between different syncs

For this user request, we could also add another metadata column, `_sdc_synced_at` (happy to hear better naming ideas! 😅), that would be set to the start time of the target process. That'd mean all records that were loaded in the same run would share this value, across tables. Do you think that would work?
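Roughly, the idea could look like the sketch below. It assumes the timestamp is captured once at process start and copied onto every record; `SYNC_STARTED_AT` and `add_sync_marker` are hypothetical names, not existing SDK members.

```python
from datetime import datetime, timezone

# Captured once when the (hypothetical) target process starts.
SYNC_STARTED_AT = datetime.now(timezone.utc).isoformat()


def add_sync_marker(record: dict, column: str = "_sdc_synced_at") -> dict:
    """Return a copy of the record with the process-wide sync timestamp."""
    return {**record, column: SYNC_STARTED_AT}


# Records destined for two different tables share the same value.
users = [add_sync_marker({"id": 1}), add_sync_marker({"id": 2})]
orders = [add_sync_marker({"order_id": 10})]
```

Because the value is identical across tables, a "group by" on the column would separate one run from the next.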

pnadolny13 · 2023-04-03T16:44:20Z

> @pnadolny13 I think we could at least be more precise with extracted_at to make it more useful, and link it to the actual extraction time of a record.

@edgarrmondragon good point, I hadn't looked into it enough, but I would expect it to act the way you described, i.e. each record in a page request has the same timestamp.

> For this user request, we could also add another _sdc_synced_at (happy to hear better naming ideas! 😅) metadata column that would be set to the start time of the target process.

Yep, that's exactly what I was looking for. Naming is tough, maybe `_sdc_sync_started_at`?

edgarrmondragon added the kind/Bug, Accepting Pull Requests, and valuestream/SDK labels Nov 16, 2022

pnadolny13 mentioned this issue Jul 12, 2023

feat: unique sync ID metadata column #1787

Closed

edgarrmondragon linked a pull request Jul 27, 2023 that will close this issue

fix: Expose add_record_metadata as a builtin target setting #1881

Merged

edgarrmondragon mentioned this issue Jul 27, 2023

fix: Expose add_record_metadata as a builtin target setting #1881

Merged

edgarrmondragon closed this as completed in #1881 Jul 31, 2023
