fix: preserve subpage URLs in webcrawler documents #10078
Problem
The webcrawler was storing the base URL (e.g., https://docs.dust.tt) for every document instead of each document's specific subpage URL. As a result, all documents from a crawled site displayed the same root URL rather than their actual locations.
Solution
Modified the URL handling in the webcrawler to preserve the original request URL while keeping the URL validation checks: use request.url instead of validatedUrl.standardized for documentUrl.
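A minimal sketch of the change, with assumed names (validateUrl, ValidatedUrl, documentUrlFor are illustrative, not the actual connector code):

```typescript
// Hypothetical shape of the crawler's URL validation result.
interface ValidatedUrl {
  valid: boolean;
  standardized: string; // normalized form, e.g. collapsed to the site origin
}

// Placeholder validator: assumed to standardize URLs down to their origin,
// which is what caused every document to share the same root URL.
function validateUrl(url: string): ValidatedUrl {
  try {
    const parsed = new URL(url);
    return { valid: true, standardized: parsed.origin };
  } catch {
    return { valid: false, standardized: "" };
  }
}

function documentUrlFor(requestUrl: string): string {
  const validatedUrl = validateUrl(requestUrl);
  if (!validatedUrl.valid) {
    throw new Error(`Invalid URL: ${requestUrl}`);
  }
  // Before: return validatedUrl.standardized — every document got the base URL.
  // After: validation still gates the URL, but the stored documentUrl
  // keeps the original per-subpage request URL.
  return requestUrl;
}
```

With this shape, a crawl of https://docs.dust.tt/guide/page1 would store that full subpage URL as documentUrl rather than https://docs.dust.tt.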
Testing
The changes will be verified through CI checks. The fix applies to future crawls; existing documents will pick up their correct subpage URLs on their next scheduled crawl.
Link to Devin run: https://app.devin.ai/sessions/88fc55c67be2496a954aebc6e1075572