Add backfill script from data_sources_documents.source_url to data_sources_nodes.source_url #10041
Conversation
UPDATE data_sources_nodes dsn
SET source_url = dsd.source_url
FROM data_sources_documents dsd
WHERE dsn.document = dsd.id
  AND dsn.document IS NOT NULL;
we are gonna need some SQL dark magic to make this doable
maybe we can simply batch it by chunks of 1M rows
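A minimal sketch of what chunked updates could look like, assuming data_sources_nodes has a numeric id primary key and that rows still to backfill can be identified by source_url IS NULL (both assumptions, not confirmed by this diff):

```sql
-- Sketch only: run repeatedly until it updates 0 rows.
UPDATE data_sources_nodes dsn
SET source_url = dsd.source_url
FROM data_sources_documents dsd
WHERE dsn.document = dsd.id
  AND dsn.id IN (
    -- hypothetical filter for rows not yet backfilled
    SELECT id
    FROM data_sources_nodes
    WHERE document IS NOT NULL
      AND source_url IS NULL
    LIMIT 1000000
  );
```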
ahaha let's see what we can do; we can't use it like this though because of all the superseded documents; we need to catch only the latest
the single SQL command was likely too ambitious; imo we query (id, source_url) with a WHERE using the index (datasource, status, timestamp), scrolling on timestamp
we do it by datasource, in batches of ~1K
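A minimal sketch of that per-datasource, timestamp-scrolling query, using named placeholders; the status = 'latest' value and exact column names are assumptions, not taken from the schema:

```sql
-- Sketch only: one data source at a time, ~1K rows per page,
-- keeping only the latest (non-superseded) documents via status.
SELECT id, source_url, timestamp
FROM data_sources_documents
WHERE data_source = :data_source_id
  AND status = 'latest'            -- assumed value for non-superseded documents
  AND timestamp > :last_timestamp  -- cursor from the previous page
ORDER BY timestamp
LIMIT 1000;
```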
still not a bad script though
Threaded comment :)
Force-pushed from 90d8e9e to b47c98a
Force-pushed from b47c98a to d435ea0
Force-pushed from c1ba32f to 681de26
for (;;) {
  const [updatedRows] = (await (async () => {
    // If nextId is null, we only filter by timestamp.
    if (nextId === null) {
this strategy of double filtering anticipates an issue raised when migrating parents: we have huge batches of documents with strictly equal timestamps, which would cause the script to loop infinitely if we only relied on the timestamp
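A minimal sketch of that double filter as a plain query, assuming id is unique and monotonically comparable; the actual script builds the condition dynamically depending on whether nextId is null:

```sql
-- Sketch only: paging on (timestamp, id) so that runs of strictly equal
-- timestamps still make progress from one batch to the next.
SELECT id, source_url, timestamp
FROM data_sources_documents
WHERE data_source = :data_source_id
  AND status = 'latest'  -- assumed value for non-superseded documents
  AND (
    timestamp > :last_timestamp
    OR (timestamp = :last_timestamp AND id > :last_id)
  )
ORDER BY timestamp, id
LIMIT 512;
```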
Great, thanks 🙏 💪
Left minor comments
    }),
  });
}
} while (frontDataSources.length === DATA_SOURCE_BATCH_SIZE);
nit: ooc, why batch here (since it's not parallelized)?
(fine to no-op on this)
we didn't need to batch here indeed, it's not that much data to load in memory
import { makeScript } from "@app/scripts/helpers";

const DATA_SOURCE_BATCH_SIZE = 16; // putting a larger batch size here doesn't really do anything
const QUERY_BATCH_SIZE = 512; // here it does a lot
would go with at least 1024 here
    );
  }
}
})()) as { id: number; source_url: string; timestamp: number }[][];
not familiar with SQL RETURNING; here, the query will return tu.id and tu.timestamp, but also source_url (although not specified)? Seems a bit weird although not impossible, just want to confirm
also, is source_url used?
oh thanks, the type is incorrect, there is no source_url, I forgot to remove it
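For context on RETURNING: in PostgreSQL it yields exactly the columns listed in the clause and nothing else, which is why the row type above should not declare source_url. A minimal sketch, with illustrative aliases and an assumed timestamp column on data_sources_nodes:

```sql
-- Sketch only: rows come back with the shape { id, timestamp }, no source_url.
UPDATE data_sources_nodes dsn
SET source_url = dsd.source_url
FROM data_sources_documents dsd
WHERE dsn.document = dsd.id
RETURNING dsn.id, dsn.timestamp;
```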
Description

Backfills source_url for every document in data_sources_nodes by reading the content of the column data_sources_documents.source_url (joined via the document column).

Risk

Deploy Plan

no deploy