Releases · micahcochran/recipe-crawler
v0.4.0 - Add Crawl-delay
From PR #11
- Adds support for the `Crawl-delay` directive in robots.txt. (Note: `Request-rate` is not supported.) See the sketch after this list.
- Replaces reppy with the Python standard library `urllib.robotparser`.
- Typing syntax improvements.
- Updates made to support the recipe-scrapers library >= 14.48.0.
- Adds Foodista to the default list of sites to scrape.
- Adds `requirements-dev.txt` for development. This is mainly typing.
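For reference, this is roughly how the standard library exposes the directive; a minimal sketch using `urllib.robotparser` against a hypothetical site, not the crawler's actual code:

```python
import time
import urllib.robotparser

# Read robots.txt and honor Crawl-delay, if the site declares one.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # hypothetical site
rp.read()

url = "https://www.example.com/recipes/"
if rp.can_fetch("*", url):
    delay = rp.crawl_delay("*")  # seconds from Crawl-delay, or None if absent
    time.sleep(delay if delay is not None else 1.0)  # fallback delay is an assumption
    # ... fetch the page here ...
```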
v0.3.0 - CLI Revamped
- Uses argparse to create a more flexible CLI interface.
- Slightly improves the accuracy of the bytes-downloaded metric, which is likely still an overestimate.
- Creates a main() function and moves code into it, putting those variables into a local scope instead of a global scope (see the sketch below).
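As an illustration of the pattern (the option name here is hypothetical, not one of the program's real flags):

```python
import argparse

def main() -> None:
    # Everything defined here is local to main() rather than a module global.
    parser = argparse.ArgumentParser(description="Crawl websites for recipes.")
    parser.add_argument("--limit", type=int, default=20,
                        help="hypothetical option: number of recipes to collect")
    args = parser.parse_args()
    print(f"collecting up to {args.limit} recipes")

if __name__ == "__main__":
    main()
```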
v0.2.1 - Fix: recipe-scrapers returns a list
`recipeInstructions` from recipe-scrapers now returns a list (it was a string); from issue #7.
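A defensive way to absorb that kind of change is to accept both shapes; a hedged sketch, not necessarily the fix applied in #7:

```python
def normalize_instructions(value):
    # recipeInstructions may arrive as a list of steps (newer recipe-scrapers)
    # or as a single string (older versions); normalize to one string.
    if isinstance(value, list):
        return "\n".join(value)
    return value
```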
v0.2.0
- Mostly removed hack code from 0.2.0-pre due to changes in recipe-scrapers.
- Makes the minimum recipe-scrapers library version 13.3.3.
- Logs to a file. Adds more logging around when recipe contents match.
- Writes a license report to a Markdown file (licenses.md) in `Crawler.license_report()`.
- Raises `AnchorListsEmptyError` when the anchor lists are empty (see the sketch after this list). Before this, the code was catching `ValueError` exceptions from other parts of the code when the program should have been exiting.
- Reports the bytes of HTML downloaded.
- Improves documentation of taste.py. Adds a length argument.
- Removes release notes from the code (release notes are already contained in GitHub) -- addresses #4.
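A sketch of the dedicated-exception idea (the real class body and call sites may differ):

```python
class AnchorListsEmptyError(Exception):
    """Raised when the crawler has no anchors left to follow."""

def pick_anchor(anchors: list) -> str:
    if not anchors:
        # A dedicated exception lets the program exit deliberately instead
        # of a bare ValueError being swallowed by unrelated handlers.
        raise AnchorListsEmptyError("anchor lists are empty")
    return anchors[0]
```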
v0.2.0-pre - Some fixes, add taste.py, add scrapers from recipe-scrapers
- Fixed bug where robots.txt parser was disallowing some websites.
- Add command-line option to filter out a website by keyword.
- Add taste.py, a CLI program that lets you examine a field in all of the recipes in the cookbook (see the sketch after this list).
- When the same recipe appears again from the same website, don't add a duplicate copy. Fixes issue #2. (The same recipe on multiple websites could still go undetected.)
- Add RecipeScraperCrawler for a few websites that don't follow the schema.org/Recipe format. (Not quite ready for release.)
- There is an issue in recipe-scrapers which makes this not quite ready for release.
- TODO: Exceptions to be implemented.
- TODO: Debug printing is a little too verbose for a release.
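Roughly, the idea behind taste.py; a hypothetical sketch, since the real script's interface and cookbook format may differ:

```python
import json
import sys

def show_field(cookbook_path: str, field: str) -> None:
    # Print one field (e.g. "name") from every recipe in the cookbook.
    with open(cookbook_path, encoding="utf-8") as f:
        recipes = json.load(f)  # assumes a JSON list of recipe objects
    for recipe in recipes:
        print(recipe.get(field, "<missing>"))

if __name__ == "__main__":
    show_field(sys.argv[1], sys.argv[2])
```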
v0.1.0 - Improved Algorithm
- The functionality of `Crawler._find_random_url()` has been split between `_rank_url()` and `_mine_url()`, which work together (see the sketch after this list).
- `Crawler._rank_url()` ranks the URLs based on the recipe_url defined in the YAML config. URLs that match recipe_url get put in a higher-priority list.
- `_mine_url()` processes all of the anchors of a webpage into lists. `Crawler._download_page()` now picks web pages to download from those lists.
- Add a timeout value to `requests.get()`.
- Replaced the deque (double-ended queue) with a Python list. Python lists are common, and the double-ended queue provided no advantages.
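The two-list idea, sketched with simplified matching; names and the matching rule are illustrative, not the project's exact code:

```python
high_priority: list = []  # URLs matching recipe_url from the YAML config
low_priority: list = []   # everything else mined from a page's anchors

def rank_url(url: str, recipe_url: str) -> None:
    # A plain substring test stands in for the real matching rule.
    (high_priority if recipe_url in url else low_priority).append(url)

def next_url():
    # _download_page()-style selection: drain the high-priority list first.
    for bucket in (high_priority, low_priority):
        if bucket:
            return bucket.pop(0)
    return None
```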
v0.0.2 - Numerous Minor Changes
Fixing a bug seems to have sped up the algorithm (see the `is_same_domain()` fix, the last item below).
- Logs runtime and number of web page requests.
- Add Pendulum library to print out an English runtime message.
- Correct spelling errors.
- Rename `__VERSION__` to `__version__`.
- Add url and license to the schema output.
- Add unit testing for URLTest Class.
- Fixed a bug in `URLTest.is_same_domain()` where the same domain names with different letter cases were returning False. Now, WWW.EXAMPLE.COM and www.example.com will return True from the `is_same_domain()` function (see the sketch below).
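A sketch of the case-insensitive comparison (assumed logic, not the project's exact implementation):

```python
from urllib.parse import urlparse

def is_same_domain(url_a: str, url_b: str) -> bool:
    # urlparse().hostname normalizes the host to lowercase, so
    # "WWW.EXAMPLE.COM" and "www.example.com" compare equal.
    return urlparse(url_a).hostname == urlparse(url_b).hostname
```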
v0.0.1
Program works, but it is not optimal.