
Add backfill script from data_sources_documents.source_url to data_sources_nodes.source_url #10041

Merged — 8 commits into main on Jan 17, 2025

Conversation

@aubin-tchoi (Contributor) commented on Jan 16, 2025

Description

  • Add a backfill script that fills the source_url column for every row in data_sources_nodes by reading the corresponding data_sources_documents.source_url.
  • We have an index on (document).

Risk

  • Corrupting the column, which would force us to redo all the backfills.

Deploy Plan

no deploy

Comment on lines +1 to +5
UPDATE data_sources_nodes dsn
SET source_url = dsd.source_url
FROM data_sources_documents dsd
WHERE dsn.document = dsd.id
AND dsn.document IS NOT NULL;
@aubin-tchoi (author):

we are gonna need some SQL dark magic to make this doable

@aubin-tchoi (author):

maybe we can simply batch it by chunks of 1M rows

@philipperolet:

ahaha let's see what we can do; we can't use it like this though because of all the superseded documents; we need to catch only the latest

@philipperolet:

the single SQL command was likely too ambitious; imo we query (id, source_url) with a WHERE using the index (datasource, status, timestamp), scrolling on timestamp
we do it by datasource, in batches of ~1K
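The scrolling strategy suggested here can be sketched with a small in-memory helper (a hypothetical illustration, not code from the PR — `DocRow` and `nextBatchByTimestamp` are made-up names; the real script would run this filtering as SQL against the index):

```typescript
// Hypothetical sketch of timestamp scrolling: within one data source,
// fetch the next batch of rows strictly after the last seen timestamp.
type DocRow = { id: number; sourceUrl: string; timestamp: number };

function nextBatchByTimestamp(
  rows: DocRow[],
  afterTimestamp: number,
  batchSize: number
): DocRow[] {
  return rows
    .filter((r) => r.timestamp > afterTimestamp)
    .sort((a, b) => a.timestamp - b.timestamp)
    .slice(0, batchSize);
}
```

Looping until a batch comes back shorter than `batchSize` terminates as long as timestamps are strictly increasing; ties on timestamp are exactly the failure mode raised later in this thread.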

@philipperolet:

still not a bad script though

@philipperolet left a review:

Threaded comments :)

@aubin-tchoi force-pushed the backfill-documents-source-url branch 3 times, most recently from 90d8e9e to b47c98a on January 17, 2025 at 09:49
@aubin-tchoi force-pushed the backfill-documents-source-url branch from b47c98a to d435ea0 on January 17, 2025 at 09:55
@aubin-tchoi force-pushed the backfill-documents-source-url branch from c1ba32f to 681de26 on January 17, 2025 at 09:57
for (;;) {
const [updatedRows] = (await (async () => {
// If nextId is null, we only filter by timestamp.
if (nextId === null) {
@aubin-tchoi (author):

this strategy of double filtering anticipates an issue raised when migrating parents: we have huge batches of documents with strictly equal timestamps, which would cause the script to loop infinitely if we relied only on the timestamp
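The double-filtering cursor described here can be modeled in memory (a minimal sketch with illustrative names — the actual script expresses this as a SQL WHERE clause): resume strictly after the (timestamp, id) pair of the last row seen, so runs of equal timestamps cannot stall the scroll.

```typescript
// Hypothetical sketch: keyset pagination on (timestamp, id).
// Rows sharing a timestamp are disambiguated by id, so every call
// makes progress even when all timestamps are equal.
type NodeRow = { id: number; timestamp: number };

function nextBatchByTimestampAndId(
  rows: NodeRow[],
  cursor: { timestamp: number; id: number } | null,
  batchSize: number
): NodeRow[] {
  return rows
    .filter(
      (r) =>
        cursor === null ||
        r.timestamp > cursor.timestamp ||
        (r.timestamp === cursor.timestamp && r.id > cursor.id)
    )
    .sort((a, b) => a.timestamp - b.timestamp || a.id - b.id)
    .slice(0, batchSize);
}
```

With a timestamp-only cursor, a batch of rows all sharing one timestamp would be re-fetched forever; the id tiebreaker guarantees termination.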

@philipperolet left a review:

Great, thanks 🙏 💪
Left minor comments

}),
});
}
} while (frontDataSources.length === DATA_SOURCE_BATCH_SIZE);
@philipperolet:

nit: out of curiosity, why batch here (since it's not parallelized)?
(fine to no-op on this)

@aubin-tchoi (author):

we didn't need to batch here indeed, it's not that much data to load in memory

import { makeScript } from "@app/scripts/helpers";

const DATA_SOURCE_BATCH_SIZE = 16; // putting a larger batch size here doesn't really do anything
const QUERY_BATCH_SIZE = 512; // here it does a lot
@philipperolet:

would go with at least 1024 here

);
}
}
})()) as { id: number; source_url: string; timestamp: number }[][];
@philipperolet:

not familiar with SQL RETURNING; here, the query will return tu.id and tu.timestamp, but also source_url (although not specified)? seems a bit weird although not impossible, just want to confirm

@philipperolet:

also, is source_url used?

@aubin-tchoi (author):

oh thanks, the type is incorrect: there is no source_url, I forgot to remove it
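For reference, a hedged sketch of the corrected shape once source_url is dropped from the cast — in Postgres, a RETURNING clause yields exactly the columns listed, nothing more (the query below is illustrative and ignores the superseded-document filtering discussed earlier in the thread):

```typescript
// Illustrative only: RETURNING yields exactly the listed columns,
// so the TypeScript cast should mirror them one-to-one.
type UpdatedRow = { id: number; timestamp: number };

const query = `
  UPDATE data_sources_nodes dsn
  SET source_url = dsd.source_url
  FROM data_sources_documents dsd
  WHERE dsn.document = dsd.id
  RETURNING dsn.id, dsn.timestamp;
`;
```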

@aubin-tchoi aubin-tchoi merged commit 361a35f into main Jan 17, 2025
3 checks passed
@aubin-tchoi aubin-tchoi deleted the backfill-documents-source-url branch January 17, 2025 11:21