Speed up sourceUrl backfill scripts by leveraging indexes #10039

Merged: 5 commits merged into main from update-backfil-scripts on Jan 16, 2025

Conversation

aubin-tchoi
Contributor

Description

  • This PR speeds up the sourceUrl backfill scripts for Notion and Webcrawler in two ways: it drops the backfill on documents (which will be handled through the data_sources_documents table instead) and it leverages an index.

Risk

n/a

Deploy Plan

no deploy

@aubin-tchoi force-pushed the update-backfil-scripts branch from 33b757c to 80f719e on January 16, 2025 17:47
Contributor

@philipperolet left a comment

Left 2 nits

@@ -44,10 +48,15 @@ async function backfillFolders(
`SELECT id, "internalId", "url"
FROM webcrawler_folders
WHERE id > :lastId
AND "connectorId" = :connectorId -- does not leverage any index, we'll see if too slow or not
Contributor

yep likely not used but still good to have there

lastId = rows[rows.length - 1].id;
} while (rows.length === BATCH_SIZE);
}
) {}
Contributor

nit: you could remove
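
For readers following the hunks above, the `WHERE id > :lastId` filter together with the `do { ... } while (rows.length === BATCH_SIZE)` loop is a keyset-pagination pattern. Below is a minimal sketch of that loop, not the script's actual code: `forEachBatch`, `fetchBatch`, and `handle` are hypothetical names, and the query behind `fetchBatch` is assumed to order by `id` and limit to the batch size (neither clause is visible in the diff excerpt).

```typescript
// Hypothetical row shape; only `id` matters for pagination.
interface RowWithId {
  id: number;
}

// `fetchBatch` stands in for a query of the form
//   SELECT ... WHERE id > :lastId ORDER BY id LIMIT :batchSize
// (ORDER BY and LIMIT are assumptions, not shown in the diff).
async function forEachBatch<T extends RowWithId>(
  fetchBatch: (lastId: number, batchSize: number) => Promise<T[]>,
  batchSize: number,
  handle: (rows: T[]) => Promise<void>
): Promise<number> {
  let lastId = 0;
  let total = 0;
  let rows: T[];
  do {
    rows = await fetchBatch(lastId, batchSize);
    const last = rows[rows.length - 1];
    // An empty batch means the previous batch was exactly full and was the last one.
    if (last === undefined) {
      break;
    }
    await handle(rows);
    total += rows.length;
    // Resume strictly after the highest id processed so far.
    lastId = last.id;
  } while (rows.length === batchSize);
  return total;
}
```

The empty-batch guard matters when the row count is an exact multiple of the batch size: without it, reading `rows[rows.length - 1].id` on an empty array would throw.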

FROM (SELECT unnest(ARRAY [:nodeIds]::text[]) as node_id,
unnest(ARRAY [:urls]::text[]) as url) urls
WHERE data_sources_nodes.node_id = urls.node_id
AND data_sources_nodes.data_source = :dataSourceId;`,
Contributor

pretty sure sql is smart enough to use the index despite the order being backwards, but only ~83% sure so keep an 👁️ 🙂

Contributor Author

🧠
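
A note on the batched update in the hunk above: the two `unnest(ARRAY [...]::text[])` calls only pair each `node_id` with the right `url` if `:nodeIds` and `:urls` are passed as parallel arrays of the same length and order. A minimal sketch of building those parameters follows. `buildUnnestParams` and `FolderRow` are hypothetical names, and using `internalId` as the node id and skipping rows with a null `url` are assumptions rather than something the diff confirms.

```typescript
// Hypothetical shape of a row selected from webcrawler_folders.
interface FolderRow {
  internalId: string;
  url: string | null;
}

// Builds two parallel arrays so that nodeIds[i] corresponds to urls[i].
// Rows without a URL are skipped entirely (an assumption) so the arrays
// never go out of step.
function buildUnnestParams(rows: FolderRow[]): {
  nodeIds: string[];
  urls: string[];
} {
  const nodeIds: string[] = [];
  const urls: string[] = [];
  for (const row of rows) {
    if (row.url === null) {
      continue;
    }
    nodeIds.push(row.internalId);
    urls.push(row.url);
  }
  return { nodeIds, urls };
}
```

Pushing to both arrays in the same loop iteration, rather than mapping and filtering them separately, keeps the alignment explicit.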

@aubin-tchoi merged commit 49b76b3 into main on Jan 16, 2025
3 checks passed
@aubin-tchoi deleted the update-backfil-scripts branch on January 16, 2025 18:24

2 participants