This repository contains the Python code used to scrape the product listings from HisNibs.com and insert them into a MongoDB collection, which will then be used to build a proposed redesign of the site.
In order for this web scraper to function properly, you must download the version of `chromedriver` that matches your version of Google Chrome and place it at the root level of this directory. `chromedriver` can be downloaded from the official ChromeDriver downloads page.
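
To check that the two versions actually match, a small preflight helper along the following lines may be useful. It is only a sketch and not part of this repository: the function names are hypothetical, the `google-chrome` binary name is an assumption that holds on many Linux installs but not on macOS or Windows, and the comparison assumes that a driver supports a browser when the major, minor, and build numbers agree (the last number may differ).

```python
import re
import subprocess
from pathlib import Path

# Matches a four-part version such as "114.0.5735.90".
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def build_prefix(version_output):
    """Return (major, minor, build) parsed from a `--version` output string."""
    match = VERSION_RE.search(version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    major, minor, build, _patch = (int(part) for part in match.groups())
    return (major, minor, build)


def versions_match(chrome_output, driver_output):
    """True when Chrome and chromedriver share major, minor, and build numbers."""
    return build_prefix(chrome_output) == build_prefix(driver_output)


def read_version(executable):
    """Run `<executable> --version` and return what it prints."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def check_local_install(chrome_binary="google-chrome"):
    """Compare the installed Chrome with ./chromedriver.

    Assumes the script is run from the root of this directory and that
    `chrome_binary` is the right name for Chrome on this machine.
    """
    driver_path = Path.cwd() / "chromedriver"
    if not driver_path.is_file():
        raise FileNotFoundError(f"expected chromedriver at {driver_path}")
    return versions_match(read_version(chrome_binary), read_version(str(driver_path)))
```

Calling `check_local_install()` from the root of this directory returns `True` when the versions line up; pass a different `chrome_binary` if Chrome is installed under another name.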
The general plan for this project is to:
- Gather the URLs of each brand's product pages
- Determine keywords that can be used to differentiate actual product listings from the rest of the page's content (the original site uses very few selectors, so we are forced to use text/regex-based keywords)
- Begin scraping
  - For each brand/URL pairing ...
  - Use BeautifulSoup to extract elements that contain one of the keywords
  - Filter the list of found elements by those that contain a "$" (price) in their text
  - Map each of the filtered elements' text to `Pen` objects using Regular Expressions, focusing on Name, Price, and whether or not the product is SoldOut
  - Add the `brand` and `srcURL` (where the pen was found on the current site) attributes to each `Pen` object found on this page, using the known values from the enclosing loop
  - Append each of the new `Pen` objects to a master list
- Insert each of the `Pen`s in the master list (built during the scraping step) into a MongoDB collection
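
The plan above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this repository: the regular expressions, the `name`, `price`, and `sold_out` field names, the helper function names, the `fetch_html` callable, and the MongoDB URI, database, and collection names are all hypothetical. Only `Pen`, `brand`, and `srcURL` come from the description above.

```python
import re
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterable, List, Optional

# Assumed patterns; the real site may need different ones.
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{1,2})?)")
SOLD_OUT_RE = re.compile(r"sold\s*out", re.IGNORECASE)


@dataclass
class Pen:
    name: str
    price: float
    sold_out: bool
    brand: str = ""
    srcURL: str = ""


def parse_pen(text: str) -> Optional[Pen]:
    """Map one element's text to a Pen, or None if it has no price."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Assumption: the product name is whatever precedes the price.
    name = text[: match.start()].strip(" \t\n-:,")
    price = float(match.group(1).replace(",", ""))
    return Pen(name=name, price=price, sold_out=bool(SOLD_OUT_RE.search(text)))


def candidate_texts(html: str, keywords: Iterable[str]) -> List[str]:
    """Text of elements that mention a keyword and contain a "$"."""
    # Imported here so the regex helpers work without beautifulsoup4 installed.
    from bs4 import BeautifulSoup

    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for node in soup.find_all(string=pattern):
        text = node.parent.get_text(" ", strip=True)
        if "$" in text:
            texts.append(text)
    return texts


def scrape(
    brand_urls: Dict[str, str],
    keywords: Iterable[str],
    fetch_html: Callable[[str], str],
) -> List[Pen]:
    """Build the master list of Pen objects across all brand pages."""
    keywords = list(keywords)
    master: List[Pen] = []
    for brand, url in brand_urls.items():
        for text in candidate_texts(fetch_html(url), keywords):
            pen = parse_pen(text)
            if pen is not None:
                # Known values from the enclosing loop.
                pen.brand = brand
                pen.srcURL = url
                master.append(pen)
    return master


def insert_pens(
    pens: List[Pen],
    uri: str = "mongodb://localhost:27017",  # assumed local server
    db_name: str = "hisnibs",  # hypothetical database name
    collection_name: str = "pens",  # hypothetical collection name
) -> int:
    """Insert the master list into MongoDB; returns the number of documents."""
    from pymongo import MongoClient

    docs = [asdict(pen) for pen in pens]
    if docs:  # insert_many rejects an empty list
        MongoClient(uri)[db_name][collection_name].insert_many(docs)
    return len(docs)
```

Here `fetch_html` stands in for however the page source is obtained, for example a function that loads the URL in the Chrome session driven by `chromedriver` and returns its HTML. Keeping `parse_pen` free of any network or database access makes it easy to try the regular expressions against sample listing text before running a full scrape.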