
Releases: micahcochran/recipe-crawler

v0.4.0 - Add Crawl-delay

27 Sep 16:55
674ec47

From PR #11

  • Adds support for the Crawl-delay directive in robots.txt; Request-rate is not supported. See the sketch after this list.
  • Replaces reppy with the Python standard library urllib.robotparser.
  • Typing syntax improvements.
  • Updates to support the recipe-scrapers library >= 14.48.0.
  • Adds Foodista to the default list of sites to scrape.
  • Adds requirements-dev.txt for development dependencies, mainly for typing.
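
A minimal sketch of how a crawler can honor Crawl-delay using the standard library; the user agent string and URLs here are placeholder assumptions, not the project's actual code.

```python
import time
import urllib.robotparser

# Parse robots.txt for a site (placeholder URL).
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

USER_AGENT = "recipe-crawler"  # placeholder user agent
page = "https://example.com/recipes/"

if rp.can_fetch(USER_AGENT, page):
    # crawl_delay() returns None when robots.txt has no Crawl-delay directive.
    delay = rp.crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)
    # ... download the page here
```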

v0.3.0 - CLI Revamped

27 Aug 02:50
95fee3b
  1. Uses argparse to create a more flexible CLI interface (see the sketch after this list).
  2. Slightly improves the accuracy of the bytes-downloaded metric, which is still likely an overestimate.
  3. Creates a main() function and moves the top-level code into it, so those variables live in local scope instead of global scope.
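
A minimal sketch of the argparse structure described above; the option names and defaults are assumptions, not the project's actual flags.

```python
import argparse

def main() -> None:
    # Variables created here are local to main() rather than module globals.
    parser = argparse.ArgumentParser(description="Crawl websites for recipes.")
    parser.add_argument("--config", default="config.yaml",
                        help="YAML config file listing the sites to crawl")
    parser.add_argument("--limit", type=int, default=20,
                        help="number of recipes to collect before stopping")
    args = parser.parse_args()
    # ... create the Crawler here using args.config and args.limit

if __name__ == "__main__":
    main()
```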

v0.2.1 - Fix recipe-scrapers returning a list

13 Aug 02:36
0d0db21

recipeInstructions from recipe-scrapers now returns a list (it was previously a string); from issue #7.
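
A minimal sketch of one way a caller can cope with the change; the helper name is hypothetical and not part of either library.

```python
def normalize_instructions(value) -> list[str]:
    """Accept the old string form or the new list form of recipeInstructions."""
    if isinstance(value, str):
        # Older recipe-scrapers versions returned one newline-joined string.
        return [step.strip() for step in value.splitlines() if step.strip()]
    return [str(step).strip() for step in value]
```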

v0.2.0

28 Jul 21:35
9a45e96
  • Mostly removes the hack code from 0.2.0-pre, due to changes in recipe-scrapers.
  • Makes the minimum recipe-scrapers library version 13.3.3.
  • Logs to a file. Adds more logging around when recipe content matches.
  • Writes a license report to a Markdown file (licenses.md) in Crawler.license_report().
  • Adds AnchorListsEmptyError, raised when the anchor lists are empty (see the sketch after this list). Before this, the code was catching ValueError exceptions from other parts of the code when the program should have been exiting.
  • Reports the bytes of HTML downloaded.
  • Improves documentation for taste.py. Adds a length argument.
  • Removes release notes from the code (they are already contained in GitHub releases); addresses #4.
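
A minimal sketch of the custom-exception idea; the surrounding helper and list names are assumptions, not the project's actual code.

```python
class AnchorListsEmptyError(Exception):
    """Raised when the crawler has no anchor URLs left to follow."""

def next_url(high_priority: list[str], low_priority: list[str]) -> str:
    if high_priority:
        return high_priority.pop()
    if low_priority:
        return low_priority.pop()
    # A distinct exception lets the caller exit deliberately instead of
    # accidentally catching unrelated ValueError exceptions.
    raise AnchorListsEmptyError("no URLs left to crawl")
```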

v0.2.0-pre - Some fixes, add taste.py, add scrapers from recipe-scrapers

08 Jul 01:39
  • Fixes a bug where the robots.txt parser was disallowing some websites.
  • Adds a command-line option to filter out a website by keyword.
  • Adds taste.py, a CLI program that lets you examine a field in all of the recipes in the cookbook.
  • When the same recipe appears again on the same website, don't add a duplicate copy; fixes issue #2 (see the sketch after this list). The same recipe on multiple websites could still go undetected.
  • Adds RecipeScraperCrawler for a few websites that don't follow the schema.org/Recipe format. An issue in recipe-scrapers makes this not quite ready for release.
  • TODO: Exceptions to be implemented.
  • TODO: Debug printing is a little too verbose for a release.
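
A minimal sketch of one way to skip duplicates from the same website; keying on (domain, recipe name) is an assumption about the approach, not the project's actual code.

```python
from urllib.parse import urlparse

seen: set[tuple[str, str]] = set()

def add_recipe(cookbook: list[dict], url: str, recipe: dict) -> bool:
    """Add a recipe unless the same name was already collected from this domain."""
    key = (urlparse(url).hostname or "", recipe.get("name", "").strip().lower())
    if key in seen:
        return False  # duplicate recipe from the same website
    seen.add(key)
    cookbook.append(recipe)
    return True
```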

v0.1.0 - Improved Algorithm

10 Jun 03:22
  • The functionality of Crawler._find_random_url() has been split between _rank_url() and _mine_url(), which work together (see the sketch after this list).
  • Crawler._rank_url() ranks URLs based on the recipe_url defined in the YAML config. URLs that match recipe_url are put in a higher-priority list.
  • _mine_url() processes all of the anchors on a webpage into lists.
  • Crawler._download_page() now picks which web pages to download from those lists.
  • Adds a timeout value to requests.get().
  • Replaces deque (double-ended queue) with a Python list. Python lists are common, and the double-ended queue provided no advantage.
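
A minimal sketch of the rank/mine split described above; the parameter names and the simple substring match are assumptions, not the project's actual code.

```python
from urllib.parse import urljoin

def rank_url(url: str, recipe_url: str,
             high_priority: list[str], low_priority: list[str]) -> None:
    """Put URLs that match the config's recipe_url pattern into the higher-priority list."""
    if recipe_url in url:
        high_priority.append(url)
    else:
        low_priority.append(url)

def mine_urls(base_url: str, hrefs: list[str], recipe_url: str,
              high_priority: list[str], low_priority: list[str]) -> None:
    """Feed every anchor href found on a page through the ranking step."""
    for href in hrefs:
        rank_url(urljoin(base_url, href), recipe_url, high_priority, low_priority)
```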

v0.0.2 - Numerous Minor Changes

07 Jun 00:55

Fixing a bug seems to have sped up the algorithm (item 7).

  1. Logs runtime and number of web page requests.
  2. Adds the Pendulum library to print out an English runtime message.
  3. Corrects spelling errors.
  4. Renames __VERSION__ to __version__.
  5. Adds url and license to the schema output.
  6. Adds unit testing for the URLTest class.
  7. Fixes a bug in URLTest.is_same_domain(): the same domain names with different letter cases were returning False. Now WWW.EXAMPLE.COM and www.example.com return True from is_same_domain() (see the sketch after this list).
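
A minimal sketch of case-insensitive domain comparison; URLTest's real implementation may differ.

```python
from urllib.parse import urlparse

def is_same_domain(url_a: str, url_b: str) -> bool:
    """Compare hostnames case-insensitively, since DNS names are not case sensitive."""
    host_a = (urlparse(url_a).hostname or "").lower()
    host_b = (urlparse(url_b).hostname or "").lower()
    return host_a == host_b

# is_same_domain("http://WWW.EXAMPLE.COM/a", "http://www.example.com/b") -> True
```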

v0.0.1

05 Jun 20:29

The program works, but it is not optimal.