Rework notebooks to use the static self-hosted fake job board #350
indeed.com has tightened its bot protection against web scraping, which is why requests to the site, as described in this course, now return 403 Forbidden status codes. I tried to get around this with fake headers (something that would be explainable in an intro course), but no luck: the 403 prevails. I previously [reworked the written tutorial](https://realpython.com/beautiful-soup-web-scraper-python/#step-1-inspect-your-data-source) to use a self-hosted [fake job board](https://realpython.github.io/fake-jobs/) that I set up just for the purpose of the tutorial. As a quick fix for the video course, I added an explanatory lesson to the course and reworked the Jupyter notebooks. The information and processes that I explain in the rest of the course are still valid and remain a good introduction to scraping a static website.
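For anyone who wants to reproduce the status-code check described above, here is a minimal sketch. It uses the standard library's `urllib` rather than the `requests` package from the course, purely to stay dependency-free. The `fetch_status` helper, its `opener` parameter, and the `User-Agent` value are illustrative assumptions; the description does not say which headers were actually tried.

```python
import urllib.error
import urllib.request

# URL of the self-hosted fake job board linked in the description.
FAKE_JOBS_URL = "https://realpython.github.io/fake-jobs/"


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code of a GET request to `url`.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to `urlopen`.
    """
    # Assumed header value, standing in for the "fake headers" attempt.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses such as 403 Forbidden.
        return error.code
```

Calling `fetch_status(FAKE_JOBS_URL)` performs a real request, so the result depends on the network and on the site being available at the time.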
@martin-martin Great job updating this! I agree with you in removing the output from the notebooks!
I found one tiny bug (`title` -> `title_element`) that's noted as a line comment.
Otherwise, this looks good to me!
We could potentially ask @KateFinegan to have a quick LE glance on the changes as well.
build-a-web-scraper/03_parse.ipynb
Outdated
"source": [
"link_text = title_link.text\n",
"link_text"
"title = title.text\n",
`title` is currently not defined; should we refer to `title_element`?
- "title = title.text\n",
+ "title = title_element.text\n",
Co-authored-by: gahjelle <geirarne@gmail.com>