
fix: preserve subpage URLs in webcrawler documents #10078

Closed


devin-ai-integration[bot]

Fix: Preserve subpage URLs in webcrawler documents

Problem

The webcrawler stored the base URL (e.g., https://docs.dust.tt) for every document instead of its specific subpage URL, so all documents showed the same root URL rather than their actual locations.

Solution

Modified the URL handling in the webcrawler to preserve the original request URL while maintaining URL validation checks:

  • Use request.url instead of validatedUrl.standardized for documentUrl
  • Keep URL validation for security purposes
  • Preserve subpage information in stored URLs
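
A minimal sketch of the idea in the list above, not the actual connector code. The `ValidatedUrl` type, the `validateUrl` helper, and the `documentUrlFor` function are hypothetical names introduced for illustration, and the example path `/guides/intro` is made up. The sketch assumes a validator whose standardized form keeps only the origin, which is the lossy behavior described in the Problem section; the real validator may behave differently.

```typescript
// Hypothetical shape of a validation result; not the project's actual type.
type ValidatedUrl =
  | { valid: true; standardized: string }
  | { valid: false; standardized: null };

// Hypothetical validator: accepts only http(s) URLs. For illustration it
// standardizes to the origin, so the path of a subpage is lost.
function validateUrl(raw: string): ValidatedUrl {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    // Not parseable as an absolute URL.
    return { valid: false, standardized: null };
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return { valid: false, standardized: null };
  }
  return { valid: true, standardized: parsed.origin };
}

// Validation still gates the document, but the stored URL is the original
// request URL, so the subpage path survives.
function documentUrlFor(requestUrl: string): string | null {
  const validated = validateUrl(requestUrl);
  if (!validated.valid) {
    return null; // rejected: nothing is stored for this URL
  }
  return requestUrl;
}
```

Under these assumptions, `documentUrlFor("https://docs.dust.tt/guides/intro")` returns the full subpage URL, while `validateUrl` on the same input reports a standardized value of `https://docs.dust.tt`. A non-HTTP or malformed URL is still rejected.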

Testing

CI checks will verify the change. The fix applies to future crawls; existing documents are updated on their next scheduled crawl.

Link to Devin run: https://app.devin.ai/sessions/88fc55c67be2496a954aebc6e1075572

The webcrawler was storing the base URL (e.g., https://docs.dust.tt) for all documents
instead of their specific subpage URLs. This change preserves the original request URL
while still maintaining URL validation checks.

- Use request.url instead of validatedUrl.standardized for documentUrl
- Keep URL validation for security
- Preserve subpage information in stored URLs

Co-Authored-By: henry@dust.tt <henry@dust.tt>

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add "(aside)" to your comment to have me ignore it.
  • Look at CI failures and help fix them


@fontanierh
Contributor

Was actually fixed in #9865 but not backfilled

@fontanierh fontanierh closed this Jan 17, 2025