fix: preserve subpage URLs in webcrawler documents #10078
Problem
The webcrawler was storing the base URL (e.g., https://docs.dust.tt) for every document instead of each document's specific subpage URL. As a result, all documents from a crawled site displayed the same root URL rather than their actual locations.
Solution
Modified the URL handling in the webcrawler to preserve the original request URL while keeping the URL validation checks: use request.url instead of validatedUrl.standardized for documentUrl.
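A minimal sketch of the change, with assumed names (validateUrl, ValidatedUrl, documentUrlFor are illustrative, not the actual connector code):

```typescript
// Hypothetical shape of the crawler's URL validation result.
interface ValidatedUrl {
  valid: boolean;
  standardized: string; // normalized form, e.g. collapsed to the site origin
}

// Placeholder validator: assumed to standardize URLs down to their origin,
// which is what caused every document to share the same root URL.
function validateUrl(url: string): ValidatedUrl {
  try {
    const parsed = new URL(url);
    return { valid: true, standardized: parsed.origin };
  } catch {
    return { valid: false, standardized: "" };
  }
}

function documentUrlFor(requestUrl: string): string {
  const validatedUrl = validateUrl(requestUrl);
  if (!validatedUrl.valid) {
    throw new Error(`Invalid URL: ${requestUrl}`);
  }
  // Before: return validatedUrl.standardized — every document got the base URL.
  // After: validation still gates the URL, but the stored documentUrl
  // keeps the original per-subpage request URL.
  return requestUrl;
}
```

With this shape, a crawl of https://docs.dust.tt/guide/page1 would store that full subpage URL as documentUrl rather than https://docs.dust.tt.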
Testing
The changes will be verified through CI checks. The fix applies to future crawls; existing documents will pick up their correct subpage URLs on their next scheduled crawl.
Link to Devin run: https://app.devin.ai/sessions/88fc55c67be2496a954aebc6e1075572