🏗️ Build spider: WAMPO – Transportation Policy Body #15

Merged
merged 2 commits into from
Mar 20, 2024


@SimmonsRitchie (Contributor) commented Mar 14, 2024

What's this PR do?

Adds a spider that scrapes meetings from the website of the Wichita Area Metropolitan Planning Organization (WAMPO) Transportation Policy Body. The new spider is called wicks_wampo_tpb.

[Note: Much like a separate PR, I accidentally merged this to main before review. Silly me 😒. Here's the original PR #13 and the revert #14.]

Why are we doing this?

Requested by our site partners.

Steps to manually test

  1. Ensure the project is installed:
     pipenv sync --dev
  2. Activate the virtual env and enter the pipenv shell:
     pipenv shell
  3. Run the spider:
     scrapy crawl wicks_wampo_tpb -O test_output.csv
  4. Monitor the output and ensure no errors are raised.
  5. Inspect test_output.csv to ensure the data looks valid. Target website is here.

Are there any smells or added technical debt to note?

  • Page is lacking some key details but has great links for attachments. Certain details are hardcoded.

"title": "Min",
},
{"href": "https://youtu.be/LsMI1EClvnI", "title": "Re"},
{"href": "https://youtu.be/LsMI1EClvnI", "title": "cording"},

These 2 links should be unified into a single link:

{"href": "https://youtu.be/LsMI1EClvnI", "title": "Recording"}
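One way to get the unified link suggested above is a post-processing pass over the parsed links. The following is a minimal sketch, not the spider's actual code: `merge_split_links` is a hypothetical helper name, and it assumes the split fragments are adjacent in the list and share exactly the same href.

```python
def merge_split_links(links):
    """Merge consecutive links that share an href by joining their titles.

    Hypothetical helper for illustration; assumes fragments of one link
    appear next to each other and carry an identical href.
    """
    merged = []
    for link in links:
        if merged and merged[-1]["href"] == link["href"]:
            # Same target as the previous entry: treat it as a continuation.
            merged[-1]["title"] += link["title"]
        else:
            # Copy so the caller's dicts are not mutated.
            merged.append(dict(link))
    return merged


links = [
    {"href": "https://youtu.be/LsMI1EClvnI", "title": "Re"},
    {"href": "https://youtu.be/LsMI1EClvnI", "title": "cording"},
]
print(merge_split_links(links))
# [{'href': 'https://youtu.be/LsMI1EClvnI', 'title': 'Recording'}]
```

Joining titles without a separator suits this case because the page splits a single word across elements; a page that splits on word boundaries would need a space instead.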

@SimmonsRitchie I noticed that the page renders a complex HTML structure for a single link, e.g. the Agenda link (screenshot: WAMPO_Agenda_link). The Recording link is even split into 2 <a> elements. So I think you should rework the _parse_links function.
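To illustrate the nesting problem, here is a rough sketch of collecting the full text of each link regardless of how many elements wrap it. It uses only the standard library's `html.parser` so it runs on its own; the real spider presumably works with Scrapy selectors instead. The class name, the `extract_links` helper, and the sample markup are all hypothetical, and the markup is only a guess at the kind of structure the page produces.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect href and full text for every <a>, however deeply its text is nested."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._parts = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Text inside <span>, <strong>, etc. still arrives here.
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join fragments, then collapse runs of whitespace.
            title = " ".join("".join(self._parts).split())
            self.links.append({"href": self._href, "title": title})
            self._href = None


def extract_links(html):
    collector = LinkTextCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical markup: one link whose label is split across nested elements.
sample = (
    '<p><a href="https://example.com/agenda.pdf">'
    "<span><strong>Agen</strong></span><span>da</span></a></p>"
)
print(extract_links(sample))
# [{'href': 'https://example.com/agenda.pdf', 'title': 'Agenda'}]
```

This only handles text nested inside a single `<a>`. A label split across two separate `<a>` elements, as with the Recording link, would still come out as two entries and need merging by href afterwards.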

@LienDang LienDang left a comment

Please fix these 2 functions:

test_links()

_parse_links(item)

The current implementation doesn't adequately handle the highly nested HTML structure of links on this agency's page.
@SimmonsRitchie
Contributor Author

Please fix these 2 functions:

test_links()

_parse_links(item)

Ah, my bad, @LienDang. That's sloppy work on my part. I'm experimenting with a new automation to build tests quickly and I'll admit I gave the links retrieved a pretty cursory look. I've now fixed the parsing to handle the messiness of this agency's HTML. I think it now works pretty well.

@LienDang LienDang self-requested a review March 20, 2024 01:47
@LienDang LienDang left a comment

LGTM

@SimmonsRitchie SimmonsRitchie merged commit 47a97bd into main Mar 20, 2024
2 checks passed
@SimmonsRitchie SimmonsRitchie deleted the revert-14-revert-13-wampo branch March 20, 2024 15:07