Managing Identity Columns as Primary Keys in DMS. Feasible ? #590
Comments
To be honest, I consider using sequential numeric values as primary keys in sync scenarios a fundamental design flaw.
Speaking as one who is staying out of this discussion: @gentledepp, how are sequential guids generated on the client? I'm not sure sequential numeric values are necessarily a design flaw in a distributed system. Like most things, it depends. Most systems, including most of the systems discussed here, do not generate more than 10 million rows per second. For those, DateTimeOffset.UtcTicks may be a viable primary key. Or that's what I would have said if I weren't staying out of it.
May I ask what the current status of this is? I am using a different sync framework at the moment, but I am thinking about switching. In our database we use integer-based keys and foreign keys. And before everybody says we should use GUIDs: that is not possible for us. We need to communicate with other platforms/applications that don't allow GUIDs as id/foreign keys. I could add GUIDs as primary keys, but I would still need an extra unique integer-based id field. Our current framework works with id ranges per client, which are updated on each sync. That setup was also mentioned above, but I was wondering whether it is enabled in the current release?
Hello @percramer. Today, using identity columns as primary keys works fine in DMS. Since we do not yet have a proper solution for handling identity columns on the server side, it will not be implemented. Just out of curiosity, which framework are you using today, and does it manage identity columns correctly, or do you have to make trade-offs?
Hi @Mimetis, I am using Sqlite-sync. They made the express version open source, which only works with SQLite on both the server and client side. I am using SQL Server on the server side, but it all works the same way. I need to be able to do inserts from the client side, since the application is also used in locations where there is no network connection. What they do is reserve id ranges per client (and per table, of course) in a table, and those are synced to the client as well. So each client has a unique integer range which is updated (if needed) during the sync, and triggers on the client SQLite tables ensure that the used values come from that range. The server doesn't have to deal with that anymore, because the data from the clients already has unique ids. So that is partially what you proposed as well, but you wanted to keep that logic server-side? Wouldn't the setup of my current framework be a workable solution for this too? It works without any problems at the moment. Sadly, sqlite-sync has some other design flaws that prevent usage with larger amounts of data / changes.
A possible solution could be to add a GUID column while keeping the identity column as a non-primary-key indexed column. You can still index the identity column for performance, but the GUID column would be used to confirm that rows are actually referring to the same data row. This could work in conjunction with cascading updates in SQL Server*** to change the identity of a potential collision before syncing. I see this as "doable", but certainly not elegant or easy to implement. It would be very heavy on the user to make sure all the cascading updates and new GUID columns are set up properly, and I doubt it could be made into a general solution usable by average users. In summary: feasibility of providing a possible solution pathway to people interested and experienced enough to implement it for their own organization: quite feasible. Feasibility of creating an elegant, easy-to-implement general solution that could handle all of the possible individual requirements of all systems: not feasible. *** I'm expert-level in MS SQL Server, but I won't begin to claim I know what fancy features the other supported DBs do or don't support.
I've been using Microsoft Sync Framework since 2011, serving more than 6 billion transaction records on one Azure Hyperscale database synced with 6000+ clients. Sadly, when the DB was designed 12 years ago, we used integer (not even bigint) for the identity (auto increment) column. The way we handle conflicts follows Microsoft's best practice: configure the sync group so that when a conflict happens, either the client data or the server data wins as the final copy. We have set it so that when a conflict happens, the sync agent uses the client side's data to override the server side's data. The key solution is to use composite columns. E.g. a retail group has 200+ shops, each with an (int) column for OrderId; so, in order to prevent conflicts, we used a composite primary key including the auto increment column. We even introduced the terminal ID (as one shop may have more than one terminal) in the composite key. In summary: I think DMS does not need to handle the "creation" conflicts of identity columns as primary keys; it is a design flaw of the SQL DB schema. The current conflict resolution (either client overrides server, or server overrides client) is already sufficient to handle this situation.
Although that is a nice solution to the issue (composite keys), it can't always be applied. We do have situations like yours, but for a few systems we don't have control over the db schema (since they belong to external software applications which have been bought). And those applications sadly only work with a single-column integer key.
Do you have control over the local database? If yes, I am thinking you can create a new column (e.g. ShopId) for those local tables, and set up a stored procedure or a row trigger to automatically fill in the ShopId: 1 for shop A, 2 for shop B, etc. Then, in DMS, set a filter to bi-directionally sync only the rows whose ShopId column equals 1 to shop A, 2 to shop B, etc.
No, that is the main issue: we can't change the schema at all. Our SQL user doesn't have those rights. We were using Sqlite-sync in the past, but for various reasons I am looking for something different. SQLite-sync uses an identity range per client which is actively updated when the user syncs, and then local triggers ensure the id is within the user's range. I have been thinking about implementing that here as well, but time...
For the approach of allocating an identity range per client, the sync server needs to estimate the growth rate of each client, how frequently the client will connect to the server to update the "used range" and take the next batch of the available range, and, in case the identity range uses up all its ids, whether it can overflow (the overflow part could potentially conflict with another client's assigned range). Another approach is to build a middle DB (either local or in the cloud), use DMS to perform a 1:1 sync of the SQLite data to the middle DB, then fill in the shop ID in the middle DB, and perform another data sync from the middle DB to the centralized DB (merged with the other shops' data).
I have been doing some testing with negative primary keys in my client app. So the auto increment step is -1 instead of 1. Then, during the sync, I can easily check whether keys are negative and replace them with a newly generated value. I just need to sync that back to the client then.
Introduction
This issue is an explanation about "why" auto increment primary keys are not handled by DMS, yet.
Some solutions exist to ensure a correct sync between several systems using identity primary keys for some tables.
Theoretically, solutions exist, but in practice (you know.. real life) some really complex challenges can emerge from these solutions.
This post is NOT a solution, but more an open discussion on "how it could be done, if possible".
The challenge of auto increment keys is basically: how do we handle a conflict when 2 records from 2 systems use the same (auto incremented) primary key for a new record?
Basically, deletes and updates do not cause any problem here. The main source of concern comes from insertions.
Is it an update conflict?
Is it two insertions? If so, should we change the client primary key to 43? (And what will happen on the next insertion?)
Solutions
Basically, we have two solutions to avoid insertion conflicts with auto increment primary keys:
1) Assigning ranges
The idea is to work with identity columns as primary keys, plus a range of ids per scope.
For instance, you can define something like this:

Server:   1 – 99 999 999
Client A: 100 000 000 – 100 999 999
Client B: 101 000 000 – 101 999 999
Client C: 102 000 000 – 102 999 999
You can store these values in any configuration table.
When you reach the limit for one client (or server), just change the range for a new one.
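As a sketch, switching a client to a new range is essentially a reseed. (The `Product` table and the new range start value are assumptions for the example, not part of DMS.)

```sql
-- Client A has been granted the range 100 000 000 – 100 999 999.
-- If the last generated value reaches the end of the range,
-- reseed the identity counter at the start of a newly allocated range
-- (103 000 000 here, a value assumed for the example).
DECLARE @last_used BIGINT = IDENT_CURRENT('Product');

IF @last_used >= 100999999
BEGIN
    DBCC CHECKIDENT ('Product', RESEED, 103000000);
END
```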
2) Assigning Seed / Step
In the same concept, another technique is to use the IDENTITY (m, n) formula, where m is the seed and n is the step (Server = 1, first client = 2, second client = 3, etc.).
For example, say we have 4 nodes (n = 4) to sync (1 server + 3 clients): the seed m will be 1 for the Server, and 2, 3, 4 for Client A, Client B and Client C respectively.
According to the formula, the data will be populated as follows:
IDENTITY (1, 4): the column data sequence will be 1, 5, 9, 13 …
IDENTITY (2, 4): the column data sequence will be 2, 6, 10, 14 …
IDENTITY (3, 4): the column data sequence will be 3, 7, 11, 15 …
IDENTITY (4, 4): the column data sequence will be 4, 8, 12, 16 …
You can store the formula values / increments in any configuration table.
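For instance, the Client B table from the example above could be declared like this (the `Product` table itself is illustrative):

```sql
-- Client B: seed = 3, step = 4, so generated ids are 3, 7, 11, 15, ...
CREATE TABLE Product (
    ProductId INT IDENTITY(3, 4) PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);

INSERT INTO Product (Name) VALUES (N'Bike');   -- gets ProductId = 3
INSERT INTO Product (Name) VALUES (N'Helmet'); -- gets ProductId = 7
```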
Solution we will discuss here
Since the major client database used with DMS is SQLite, and since SQL Server merge replication also uses this technique, we will discuss the implementation of Solution 1: Assigning Ranges.
DMS possible Implementation
I've already worked on the primary keys auto increment implementation in DMS, even if it hasn't been published so far.
I'm going to explain what I've done and which problems I have not resolved yet.
Playing with Identity Columns On SQL Server
Firstly, let's see what we have, in terms of queries / language / hints, on SQL Server, when we are working with Identity columns:
(Obviously, we have many other SQL statements for IDENTITY columns that we don't need here.)
SET IDENTITY_INSERT: allows the insertion of a specific value into an identity column.
DBCC CHECKIDENT: manages the seed and step of an identity column for a specific table.
IDENT_CURRENT: returns the last identity value generated for a specific table.
Macro implementation in DMS
For the implementation, we will stick to SQL Server only, for now.
scopes_identity table
Firstly, we need to save and handle the ranges for each client and the server.
Obviously, we can use a simple table (like scope_info) to store these values.
Nothing really complicated here.
Here is a draft of the scope_info_identity table, stored and managed from the server side.
Using this kind of table, we can imagine a sync system where each client asks the server for a dedicated range and reseeds its local identity columns accordingly.
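A possible shape for such a table could be the following sketch (the column names are assumptions, not the actual DMS schema):

```sql
-- One row per (client scope, table): the identity range granted to that scope.
CREATE TABLE scope_info_identity (
    sync_scope_id UNIQUEIDENTIFIER NOT NULL, -- client (or server) scope id
    table_name    NVARCHAR(128)    NOT NULL, -- table owning the identity column
    range_start   BIGINT           NOT NULL, -- first id the scope may use
    range_end     BIGINT           NOT NULL, -- last id the scope may use
    PRIMARY KEY (sync_scope_id, table_name)
);
```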
Merging rows
Here come the problems.
Here is a macro version of the actual Insert/Update stored procedure (on SQL Server, we are using a MERGE statement) used when inserting new rows on any client or server, when dealing with identity primary keys.
Something important to notice about SET IDENTITY_INSERT: it allows us to specify a value for our identity column, and if this value is higher than any stored value, the internal identity counter is incremented as well.
A small sample:
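The following sketch shows the counter being pushed forward by an explicit insert (the `Product` table is illustrative):

```sql
CREATE TABLE Product (
    ProductId INT IDENTITY(1, 1) PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);

INSERT INTO Product (Name) VALUES (N'Bike');  -- gets ProductId = 1

-- Explicitly insert a value far above the current counter.
SET IDENTITY_INSERT Product ON;
INSERT INTO Product (ProductId, Name) VALUES (1000, N'Helmet');
SET IDENTITY_INSERT Product OFF;

-- The internal counter has jumped past the explicit value:
INSERT INTO Product (Name) VALUES (N'Glove'); -- gets ProductId = 1001
```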
Since we need to be sure each client stays within its range, we need to handle this behavior.
Client 1: Range 1 000 000 to 1 999 999
Let's see what happens for Client 1, with range 1 000 000 to 1 999 999.
We need to ensure that all values inserted from Client 1 are always in that range.
But what happens if we have a sync on Client 1 with records coming from Client 3, whose range is 3 000 000 to 3 999 999?
Well, for sure, the internal identity value for that column will be reseeded to something between 3 000 000 and 3 999 999.
And we need to avoid that.
Here is a macro version to be sure we can avoid this situation:
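A sketch of that macro could look like this (the `Product` table, the `@changes` source of incoming rows, and the variable names are illustrative):

```sql
-- @changes: table-valued parameter holding the incoming rows (assumed).

-- 1) Save the current internal identity value before merging.
DECLARE @identity_current BIGINT;
SELECT @identity_current = IDENT_CURRENT('Product');

-- 2) Merge incoming rows, which may carry identity values
--    from other clients' ranges.
SET IDENTITY_INSERT Product ON;
MERGE Product AS base
USING @changes AS changes
    ON base.ProductId = changes.ProductId
WHEN MATCHED THEN
    UPDATE SET base.Name = changes.Name
WHEN NOT MATCHED THEN
    INSERT (ProductId, Name) VALUES (changes.ProductId, changes.Name);
SET IDENTITY_INSERT Product OFF;

-- 3) Put the internal counter back where it was, so local inserts
--    stay inside this client's own range.
DBCC CHECKIDENT ('Product', RESEED, @identity_current);
```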
Using this technique, we RESEED the internal increment value back to its previous value, before any merge (with potentially higher values for the identity column).
This solution leads to several problems...
Multi SQL Sessions insertions
What happens if we have another session between the moment we save the identity value (SELECT @identity_current = IDENT_CURRENT('Product');) and the RESEED?
That other session will insert a new record, and our saved value will no longer be correct.
The RESEED will work, but will be wrong as well.
Since the reseeding will not be correct, we will get an error raised on the next INSERT in the Client 1 database.
On the other hand, what happens if we try to insert something, in another session, just after the MERGE?
Here, we may end up with a new record in the Client 1 database that is outside the authorized range of values...
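The two race windows described above can be summarized as a timeline (in comments; the statements mirror the reseed macro sketched earlier):

```sql
-- Session 1 (sync)                          Session 2 (application)
-- ---------------------------------------   --------------------------------
-- SELECT @identity_current =
--        IDENT_CURRENT('Product');
--                                           INSERT ...  -- counter moves past
--                                                       -- the saved value
-- SET IDENTITY_INSERT Product ON;
-- MERGE ...;
-- SET IDENTITY_INSERT Product OFF;
-- DBCC CHECKIDENT ('Product', RESEED,
--                  @identity_current);      -- counter now too low: the next
--                                           -- local INSERT hits a duplicate key
--                                           INSERT ...  -- or here: picks up an
--                                                       -- id from Client 3's range
```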
Using a Serializable transaction.
Serializable is the most protective transaction isolation level. Can we use it here?
A serializable transaction won't save us, since it does not lock the entire table and will still allow another session to get a new identity value and insert a new record into the table, exactly as in our last 2 samples...
TABLOCK hint
The TABLOCK hint is more restrictive and will lock the entire table during our MERGE statement.
Still, it will not save us, since we can insert before or after the MERGE statement...
Solutions on both sides
So far, we have seen that the problem comes from multiple SQL sessions hitting the table that is being updated during a sync MERGE statement...
Freeze the UI and prevent multiple SQL sessions
Can we say something like:
"Hey, when you are doing a sync on this table, DO NOT TRY TO MAKE ANY INSERTS IN IT"
That could potentially work for client databases: you can easily freeze the UI and prevent this situation from happening.
But what happens on the server side?
The server side needs to handle multiple syncs at the same time, and we can't prevent two syncs from happening simultaneously...
Disabling the internal counter increment
An ideal solution would be to disable the internal increment of the current identity value during the MERGE statement.
I guess this is what happens when using the NOT FOR REPLICATION option with SQL Server merge replication.
Source: https://www.mssqltips.com/sqlservertip/1274/change-not-for-replication-value-for-sql-server-identity-columns/
Conclusion
I'm not sure I've put here all the research and tests I've made so far on the identity columns sync problem, but at least this thread can be a good starting point to discuss it.
If you have any ideas, do not hesitate to share your thoughts here.