Releases · micahcochran/recipe-crawler
v0.4.0 - Add Crawl-delay
From PR #11
- Adds support for the `Crawl-delay` directive in robots.txt. (Note: `Request-rate` is not supported.) See the sketch after this list.
- Replaces reppy with the Python standard library `urllib.robotparser`.
- Typing syntax improvements.
- Updates made to support the recipe-scrapers library >= 14.48.0.
- Adds Foodista to the default list of sites to scrape.
- Adds `requirements-dev.txt` for development. This is mainly typing.
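For reference, this is roughly how the standard library exposes the directive; a minimal sketch using `urllib.robotparser` against a hypothetical site, not the crawler's actual code:

```python
import time
import urllib.robotparser

# Read robots.txt and honor Crawl-delay, if the site declares one.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # hypothetical site
rp.read()

url = "https://www.example.com/recipes/"
if rp.can_fetch("*", url):
    delay = rp.crawl_delay("*")  # seconds from Crawl-delay, or None if absent
    time.sleep(delay if delay is not None else 1.0)  # fallback delay is an assumption
    # ... fetch the page here ...
```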
v0.3.0 - CLI Revamped
- Uses argparse to create a more flexible CLI interface.
- Slightly improves the accuracy of the bytes-downloaded metric, which is likely still an overestimate.
- Creates a main() function and moves code into it, putting those variables into a local scope instead of a global scope (see the sketch below).
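As an illustration of the pattern (the option name here is hypothetical, not one of the program's real flags):

```python
import argparse

def main() -> None:
    # Everything defined here is local to main() rather than a module global.
    parser = argparse.ArgumentParser(description="Crawl websites for recipes.")
    parser.add_argument("--limit", type=int, default=20,
                        help="hypothetical option: number of recipes to collect")
    args = parser.parse_args()
    print(f"collecting up to {args.limit} recipes")

if __name__ == "__main__":
    main()
```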
v0.2.1 - Fix: recipe-scrapers returns a list
`recipeInstructions` from recipe-scrapers now returns a list (it was a string); from issue #7.
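A defensive way to absorb that kind of change is to accept both shapes; a hedged sketch, not necessarily the fix applied in #7:

```python
def normalize_instructions(value):
    # recipeInstructions may arrive as a list of steps (newer recipe-scrapers)
    # or as a single string (older versions); normalize to one string.
    if isinstance(value, list):
        return "\n".join(value)
    return value
```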
v0.2.0
- Mostly removed hack code from 0.2.0-pre due to changes in recipe-scrapers.
- Makes the minimum recipe-scrapers library version 13.3.3.
- Logs to a file. Adds more logging around when recipe contents match.
- Writes a license report to a Markdown file (licenses.md) in `Crawler.license_report()`.
- Raises `AnchorListsEmptyError` when the anchor lists are empty (see the sketch after this list). Before this, the code was catching `ValueError` exceptions from other parts of the code when the program should have been exiting.
- Reports the bytes of HTML downloaded.
- Improves documentation of taste.py. Adds a length argument.
- Removes release notes from the code (release notes are already contained in GitHub) -- addresses #4.
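A sketch of the dedicated-exception idea (the real class body and call sites may differ):

```python
class AnchorListsEmptyError(Exception):
    """Raised when the crawler has no anchors left to follow."""

def pick_anchor(anchors: list) -> str:
    if not anchors:
        # A dedicated exception lets the program exit deliberately instead
        # of a bare ValueError being swallowed by unrelated handlers.
        raise AnchorListsEmptyError("anchor lists are empty")
    return anchors[0]
```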
v0.2.0-pre - Some fixes, add taste.py, add scrapers from recipe-scrapers
- Fixed bug where robots.txt parser was disallowing some websites.
- Add command-line option to filter out a website by keyword.
- Add taste.py, a CLI program that lets you examine a field in all of the recipes in the cookbook (see the sketch after this list).
- When the same recipe appears again from the same website, don't add a duplicate copy. Fixes issue #2. (The same recipe on multiple websites could still go undetected.)
- Add RecipeScraperCrawler for a few websites that don't follow the schema.org/Recipe format. (Not quite ready for release.)
- There is an issue in recipe-scrapers which makes this not quite ready for release.
- TODO: Exceptions to be implemented.
- TODO: Debug printing is a little too verbose for a release.
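Roughly, the idea behind taste.py; a hypothetical sketch, since the real script's interface and cookbook format may differ:

```python
import json
import sys

def show_field(cookbook_path: str, field: str) -> None:
    # Print one field (e.g. "name") from every recipe in the cookbook.
    with open(cookbook_path, encoding="utf-8") as f:
        recipes = json.load(f)  # assumes a JSON list of recipe objects
    for recipe in recipes:
        print(recipe.get(field, "<missing>"))

if __name__ == "__main__":
    show_field(sys.argv[1], sys.argv[2])
```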
v0.1.0 - Improved Algorithm
- The functionality of `Crawler._find_random_url()` has been split between `_rank_url()` and `_mine_url()`, which work together (see the sketch after this list).
- `Crawler._rank_url()` ranks the URLs based on the recipe_url defined in the YAML config. URLs that match recipe_url get put in a higher-priority list.
- `_mine_url()` processes all of the anchors of a webpage into lists. `Crawler._download_page()` now picks web pages to download from those lists.
- Add a timeout value to `requests.get()`.
- Replaced the deque (double-ended queue) with a Python list. Python lists are common, and the double-ended queue provided no advantages.
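The two-list idea, sketched with simplified matching; names and the matching rule are illustrative, not the project's exact code:

```python
high_priority: list = []  # URLs matching recipe_url from the YAML config
low_priority: list = []   # everything else mined from a page's anchors

def rank_url(url: str, recipe_url: str) -> None:
    # A plain substring test stands in for the real matching rule.
    (high_priority if recipe_url in url else low_priority).append(url)

def next_url():
    # _download_page()-style selection: drain the high-priority list first.
    for bucket in (high_priority, low_priority):
        if bucket:
            return bucket.pop(0)
    return None
```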
v0.0.2 - Numerous Minor Changes
Fixing a bug seems to have sped up the algorithm (see the `is_same_domain()` fix, the last item below).
- Logs runtime and number of web page requests.
- Add Pendulum library to print out an English runtime message.
- Correct spelling errors.
- Rename `__VERSION__` to `__version__`.
- Add url and license to the schema output.
- Add unit testing for URLTest Class.
- Fixed a bug in `URLTest.is_same_domain()` where the same domain names with different letter cases were returning False. Now, WWW.EXAMPLE.COM and www.example.com will return True from the `is_same_domain()` function (see the sketch below).
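A sketch of the case-insensitive comparison (assumed logic, not the project's exact implementation):

```python
from urllib.parse import urlparse

def is_same_domain(url_a: str, url_b: str) -> bool:
    # urlparse().hostname normalizes the host to lowercase, so
    # "WWW.EXAMPLE.COM" and "www.example.com" compare equal.
    return urlparse(url_a).hostname == urlparse(url_b).hostname
```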
v0.0.1
Program works, but it is not optimal.