🏗️ Build spider: WAMPO – Transportation Policy Body #15
tests/test_wicks_wampo_tpb.py
Outdated
"title": "Min",
},
{"href": "https://youtu.be/LsMI1EClvnI", "title": "Re"},
{"href": "https://youtu.be/LsMI1EClvnI", "title": "cording"},
These 2 links should be unified into a single link:
{"href": "https://youtu.be/LsMI1EClvnI", "title": "Recording"}
@SimmonsRitchie I noticed that the page renders a complex HTML structure for a single link, e.g. for the Agenda link. And for the Recording link, it breaks into 2 <a> elements. So I think you should rework the _parse_links function.
Please fix these 2 functions:
test_links()
_parse_links(item)
The current handling doesn't adequately handle the highly nested HTML structure of links on this agency's page.
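One possible way to unify a link that the page splits across several <a> elements, as a minimal sketch only. It assumes the links have already been collected as a list of {"href", "title"} dicts in page order; merge_adjacent_links is a hypothetical helper name and not the spider's actual code.

```python
def merge_adjacent_links(links):
    """Merge consecutive links that share the same href.

    Hypothetical helper: assumes each item is a dict with "href" and
    "title" keys, listed in the order the <a> elements appear on the page.
    """
    merged = []
    for link in links:
        if merged and merged[-1]["href"] == link["href"]:
            # Same target as the previous fragment: append the text
            # instead of emitting a second link.
            merged[-1]["title"] += link["title"]
        else:
            merged.append({"href": link["href"], "title": link["title"]})
    # Collapse stray whitespace left over from nested markup.
    for link in merged:
        link["title"] = " ".join(link["title"].split())
    return merged


fragments = [
    {"href": "https://youtu.be/LsMI1EClvnI", "title": "Re"},
    {"href": "https://youtu.be/LsMI1EClvnI", "title": "cording"},
]
result = merge_adjacent_links(fragments)
```

With the two fragments from the test above, this should yield a single "Recording" link. Links that legitimately repeat the same href with other links in between are left separate, since only adjacent fragments are merged.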
Ah, my bad, @LienDang. That's sloppy work on my part. I'm experimenting with a new automation to build tests quickly and I'll admit I gave the links retrieved a pretty cursory look. I've now fixed the parsing to handle the messiness of this agency's HTML. I think it now works pretty well.
LGTM
What's this PR do?
Adds a spider to scrape meetings from the website of Wichita Area Metropolitan Planning Organization - Transportation Policy Body. The new spider is called wicks_wampo_tpb.
[Note: Much like a separate PR, I accidentally merged this to main before review. Silly me 😒. Here's the original PR #13 and the revert #14.]
Why are we doing this?
Requested by our site partners.
Steps to manually test
Monitor the output and ensure no errors are raised.
Inspect test_output.csv to ensure the data looks valid. Target website is here.
Are there any smells or added technical debt to note?