
Add backfill script from data_sources_documents.source_url to data_sources_nodes.source_url #10041

Merged — 8 commits into main on Jan 17, 2025

Conversation

@aubin-tchoi (Contributor) commented on Jan 16, 2025

Description

  • Add a backfill script that fills the source_url column for every row in data_sources_nodes by reading the corresponding data_sources_documents.source_url.
  • We have an index on (document).

Risk

  • Corrupting the column, which would force us to redo all the backfills.

Deploy Plan

no deploy

Comment on lines +1 to +5
UPDATE data_sources_nodes dsn
SET source_url = dsd.source_url
FROM data_sources_documents dsd
WHERE dsn.document = dsd.id
AND dsn.document IS NOT NULL;
@aubin-tchoi (author):

we are gonna need some SQL dark magic to make this doable

@aubin-tchoi (author):

maybe we can simply batch it by chunks of 1M rows

@philipperolet:

ahaha let's see what we can do; we can't use it like this though because of all the superseded documents; we need to catch only the latest

@philipperolet:

the single SQL command was likely too ambitious; imo we query (id, source_url) with a WHERE using the index (datasource, status, timestamp), scrolling on timestamp
we do it by datasource, in batches of ~1K
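The scrolling strategy suggested here can be sketched with a small in-memory helper (a hypothetical illustration, not code from the PR — `DocRow` and `nextBatchByTimestamp` are made-up names; the real script would run this filtering as SQL against the index):

```typescript
// Hypothetical sketch of timestamp scrolling: within one data source,
// fetch the next batch of rows strictly after the last seen timestamp.
type DocRow = { id: number; sourceUrl: string; timestamp: number };

function nextBatchByTimestamp(
  rows: DocRow[],
  afterTimestamp: number,
  batchSize: number
): DocRow[] {
  return rows
    .filter((r) => r.timestamp > afterTimestamp)
    .sort((a, b) => a.timestamp - b.timestamp)
    .slice(0, batchSize);
}
```

Looping until a batch comes back shorter than `batchSize` terminates as long as timestamps are strictly increasing; ties on timestamp are exactly the failure mode raised later in this thread.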

@philipperolet:

still not a bad script though

@philipperolet left a review:

Threaded comments :)

@aubin-tchoi force-pushed the backfill-documents-source-url branch 3 times, most recently from 90d8e9e to b47c98a on January 17, 2025 at 09:49
@aubin-tchoi force-pushed the backfill-documents-source-url branch from b47c98a to d435ea0 on January 17, 2025 at 09:55
@aubin-tchoi force-pushed the backfill-documents-source-url branch from c1ba32f to 681de26 on January 17, 2025 at 09:57
for (;;) {
const [updatedRows] = (await (async () => {
// If nextId is null, we only filter by timestamp.
if (nextId === null) {
@aubin-tchoi (author):

this strategy of double filtering anticipates an issue raised when migrating parents: we have huge batches of documents with strictly equal timestamps, which would cause the script to loop infinitely if we relied only on the timestamp
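The double-filtering cursor described here can be modeled in memory (a minimal sketch with illustrative names — the actual script expresses this as a SQL WHERE clause): resume strictly after the (timestamp, id) pair of the last row seen, so runs of equal timestamps cannot stall the scroll.

```typescript
// Hypothetical sketch: keyset pagination on (timestamp, id).
// Rows sharing a timestamp are disambiguated by id, so every call
// makes progress even when all timestamps are equal.
type NodeRow = { id: number; timestamp: number };

function nextBatchByTimestampAndId(
  rows: NodeRow[],
  cursor: { timestamp: number; id: number } | null,
  batchSize: number
): NodeRow[] {
  return rows
    .filter(
      (r) =>
        cursor === null ||
        r.timestamp > cursor.timestamp ||
        (r.timestamp === cursor.timestamp && r.id > cursor.id)
    )
    .sort((a, b) => a.timestamp - b.timestamp || a.id - b.id)
    .slice(0, batchSize);
}
```

With a timestamp-only cursor, a batch of rows all sharing one timestamp would be re-fetched forever; the id tiebreaker guarantees termination.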

@philipperolet left a review:

Great, thanks 🙏 💪
Left minor comments

}),
});
}
} while (frontDataSources.length === DATA_SOURCE_BATCH_SIZE);
@philipperolet:

nit: out of curiosity, why batch here (since it's not parallelized)?
(fine to no-op on this)

@aubin-tchoi (author):

we didn't need to batch here indeed, it's not that much data to load in memory

import { makeScript } from "@app/scripts/helpers";

const DATA_SOURCE_BATCH_SIZE = 16; // putting a larger batch size here doesn't really do anything
const QUERY_BATCH_SIZE = 512; // here it does a lot
@philipperolet:

would go with at least 1024 here

);
}
}
})()) as { id: number; source_url: string; timestamp: number }[][];
@philipperolet:

not familiar with SQL RETURNING; here, the query will return tu.id and tu.timestamp, but also source_url (although not specified)? seems a bit weird although not impossible, just want to confirm

@philipperolet:

also, is source_url used?

@aubin-tchoi (author):

oh thanks, the type is incorrect: there is no source_url, I forgot to remove it
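For reference, a hedged sketch of the corrected shape once source_url is dropped from the cast — in Postgres, a RETURNING clause yields exactly the columns listed, nothing more (the query below is illustrative and ignores the superseded-document filtering discussed earlier in the thread):

```typescript
// Illustrative only: RETURNING yields exactly the listed columns,
// so the TypeScript cast should mirror them one-to-one.
type UpdatedRow = { id: number; timestamp: number };

const query = `
  UPDATE data_sources_nodes dsn
  SET source_url = dsd.source_url
  FROM data_sources_documents dsd
  WHERE dsn.document = dsd.id
  RETURNING dsn.id, dsn.timestamp;
`;
```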

@aubin-tchoi aubin-tchoi merged commit 361a35f into main Jan 17, 2025
3 checks passed
@aubin-tchoi aubin-tchoi deleted the backfill-documents-source-url branch January 17, 2025 11:21