HisNibsScraper

This repository contains the Python code used to scrape product listings from HisNibs.com and insert them into a MongoDB collection. That collection will then be used to build a proposed redesign of the site.

Setup

For this web scraper to function properly, you must download the version of chromedriver that matches your installed version of Google Chrome and place it at the root level of this directory. chromedriver can be downloaded here
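As a quick sanity check for the setup step, the sketch below looks for a chromedriver binary at the root of a directory and fails early with a readable message if it is missing. The function name `find_chromedriver` and the candidate file names are assumptions made for illustration; they are not taken from mainScraper.py.

```python
from pathlib import Path


def find_chromedriver(root: Path) -> Path:
    """Return the path to chromedriver at the root of `root`.

    Hypothetical helper: checks the usual binary names on
    macOS/Linux ("chromedriver") and Windows ("chromedriver.exe").
    """
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = root / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"No chromedriver found in {root}. Download the version that "
        "matches your Google Chrome and place it at the repository root."
    )


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    print(find_chromedriver(Path.cwd()))
```

If the scraper drives Chrome through Selenium 4 (an assumption; the README does not say which library consumes chromedriver), the returned path could be passed along as `webdriver.Chrome(service=Service(str(path)))`.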

Approach

The general plan for this project is to:

1. Gather the URLs of each brand's product pages
2. Determine keywords that can be used to differentiate actual product listings from the rest of the page's content (the original site uses very few selectors, so we are forced to use text- and regex-based keywords)
3. Begin scraping
   1. For each brand/URL pairing ...
   2. Use BeautifulSoup to extract elements that contain one of the keywords
   3. Filter the list of found elements down to those that contain a "$" (price) in their text
   4. Map each filtered element's text to a Pen object using regular expressions, focusing on Name, Price, and whether or not the product is SoldOut
   5. Add the brand and srcURL (where the pen was found on the current site) attributes to each Pen object found on this page, using the known values from the enclosing loop
   6. Append each of the new Pen objects to a master list
4. Insert each of the Pens in the master list (see step 3.6) into a MongoDB collection
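The filtering, mapping, and insertion steps above can be sketched roughly as follows. This is a minimal illustration, not the code in mainScraper.py or Pen.py: the `Pen` dataclass, its snake_case field names, the two regular expressions, and the function names are all assumptions. It also assumes the element texts have already been pulled out of the page (for example with BeautifulSoup's `get_text()`), so the sketch works on plain strings.

```python
import re
from dataclasses import dataclass, asdict
from typing import Iterable, List, Optional


@dataclass
class Pen:
    """Hypothetical stand-in for the Pen class in Pen.py."""
    name: str
    price: float
    sold_out: bool
    brand: str = ""
    src_url: str = ""


# Assumed patterns; real listings may need different ones.
# Note: a price with a thousands separator ("$1,200") is not handled.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{1,2})?)")
SOLD_OUT_RE = re.compile(r"sold\s*out", re.IGNORECASE)


def parse_pen(text: str, brand: str, src_url: str) -> Optional[Pen]:
    """Map one element's text to a Pen, or None if no price is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Treat everything before the price as the name (an assumption).
    name = text[:match.start()].strip(" \t\n-:")
    return Pen(
        name=name,
        price=float(match.group(1)),
        sold_out=bool(SOLD_OUT_RE.search(text)),
        brand=brand,      # known from the enclosing brand/URL loop
        src_url=src_url,  # page on which the pen was found
    )


def scrape_texts(texts: Iterable[str], keywords: Iterable[str],
                 brand: str, src_url: str) -> List[Pen]:
    """Keep texts that contain a keyword and a "$", then parse them."""
    lowered = [k.lower() for k in keywords]
    pens = []
    for text in texts:
        if not any(k in text.lower() for k in lowered):
            continue  # no keyword: not a product listing
        if "$" not in text:
            continue  # no price: not a product listing
        pen = parse_pen(text, brand, src_url)
        if pen is not None:
            pens.append(pen)
    return pens


def insert_pens(collection, pens: Iterable[Pen]) -> List[dict]:
    """Insert pens as documents; `collection` is expected to behave like
    a pymongo Collection (only `insert_many` is used)."""
    docs = [asdict(p) for p in pens]
    if docs:  # pymongo rejects an empty document list
        collection.insert_many(docs)
    return docs
```

In use, the master list would be built by calling `scrape_texts` once per brand/URL pairing and extending a single list with the results, then passing that list to `insert_pens` once at the end.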
