GitHub - techno-verse/walmart-scraper: This repository includes code for scraping walmart catalogue using scrapy, sqlalchemy, and sqlite3 :)

Introduction

This is a basic Scrapy-based ETL framework that scrapes product data from the Walmart website for any given category, based on the provided URL.

You can read more about Scrapy in its official documentation.

Use case: scraping a product department on Walmart Canada's website

The product information is defined by two models (or tables):

Product

The Product model contains basic product information:

Product

Store
Barcodes (a list of UPC/EAN barcodes)
SKU (the product identifier in the store)
Brand
Name
Description
Package
Image URL
Category
URL

BranchProduct

The BranchProduct model contains the attributes of a product that are specific to one branch of a store. The same product can be available or unavailable, or have a different price, at different branches.

BranchProduct

Branch
Product
Stock
Price

Both models are defined in the models.py file.
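
As a rough illustration of the two-table layout described above, here is a minimal sketch using the standard-library sqlite3 module. The table names, column names, types, and uniqueness constraints are assumptions made for illustration; the actual definitions live in models.py and may differ.

```python
import sqlite3

# Hypothetical schema: names and types are assumed, not copied from models.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    id          INTEGER PRIMARY KEY,
    store       TEXT NOT NULL,
    barcodes    TEXT,              -- list of UPC/EAN codes, stored as text
    sku         TEXT NOT NULL,     -- product identifier in the store
    brand       TEXT,
    name        TEXT,
    description TEXT,
    package     TEXT,
    image_url   TEXT,
    category    TEXT,
    url         TEXT,
    UNIQUE (store, sku)
);
CREATE TABLE IF NOT EXISTS branch_products (
    id         INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products (id),
    branch     TEXT NOT NULL,
    stock      INTEGER,
    price      REAL,
    UNIQUE (product_id, branch)    -- one row per product and branch
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Keeping branch-specific values (stock, price) in their own table lets the same product appear once while each branch contributes its own row.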

Use Case Description

Walmart offers a very broad selection of products, from breakfast cereals to gym equipment. We will ingest their product information and store it in our database.

The product information we will scrape is:

Product

Store: Walmart
Barcodes: 60538887928
SKU: 10295446
Brand: Great Value
Name: Spring Water
Description: Convenient and refreshing, Great Value Spring Water is a healthy option...
Package: 24 x 500ml
Image URL: ["https://i5.walmartimages.ca/images/Large/887/928/999999-60538887928.jpg", "https://i5.walmartimages.ca/images/Large/089/6_1/400896_1.jpg", "https://i5.walmartimages.ca/images/Large/88_/nft/605388879288_NFT.jpg"]
Category: Grocery|Pantry, Household & Pets|Drinks›Water|Bottled Water
URL: https://www.walmart.ca/en/ip/great-value-24pk-spring-water/6000143709667

BranchProduct

Product: <product_id>
Branch: 3124
Stock: 426
Price: 2.27
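
To make the sample concrete, the following sketch holds some of the values above in two dataclasses and flattens them into rows. The class names, field names, and the choice to store list-valued fields (barcodes, image URLs) as JSON text are assumptions for illustration, not taken from models.py. The description and category fields are left out for brevity.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ProductRecord:  # hypothetical name, mirrors part of the Product fields
    store: str
    barcodes: List[str]
    sku: str
    brand: str
    name: str
    package: str
    image_urls: List[str]
    url: str


@dataclass
class BranchProductRecord:  # hypothetical name, mirrors BranchProduct
    sku: str
    branch: str
    stock: int
    price: float


def to_row(record) -> dict:
    """Flatten a record into scalar values; lists become JSON strings."""
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in asdict(record).items()
    }


product = ProductRecord(
    store="Walmart",
    barcodes=["60538887928"],
    sku="10295446",
    brand="Great Value",
    name="Spring Water",
    package="24 x 500ml",
    image_urls=[
        "https://i5.walmartimages.ca/images/Large/887/928/999999-60538887928.jpg"
    ],
    url="https://www.walmart.ca/en/ip/great-value-24pk-spring-water/6000143709667",
)
offer = BranchProductRecord(sku=product.sku, branch="3124", stock=426, price=2.27)

product_row = to_row(product)
offer_row = to_row(offer)
print(product_row["barcodes"], offer_row)
```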

For now, we are only ingesting the Fruits category.

To run the scraper please perform the following steps

Set up environment

# Clone the repo
git clone https://github.com/shreyaspatel7/walmart-scraper.git
cd walmart-scraper/

# Set up virtual env
virtualenv venv --python=python3
. venv/bin/activate

# Install dependencies
pip install -r requirements.txt

First, run python database_setup.py to generate the DB models.
Then run the spider with python -m scrapy crawl ca_walmart -a branch=3106, where branch is the ID of the Walmart store you want to scrape. This populates the SQLite database.
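
One way a crawl can end up populating a SQLite database is through a Scrapy item pipeline, which is a plain Python class with open_spider, process_item, and close_spider methods. The sketch below is an assumption about how this could work, not the repository's actual code: the class name, table layout, item keys, and default file name are all hypothetical. It relies on the fact that Scrapy exposes each -a name=value argument as a string attribute on the spider.

```python
import sqlite3
from types import SimpleNamespace


class SQLitePipeline:
    """Hypothetical item pipeline that stores one row per product and branch."""

    def __init__(self, db_path="walmart.db"):  # assumed file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS branch_products ("
            "sku TEXT, branch TEXT, stock INTEGER, price REAL, "
            "PRIMARY KEY (sku, branch))"
        )

    def process_item(self, item, spider):
        # `-a branch=3106` is available as the spider attribute `branch`.
        branch = str(getattr(spider, "branch", ""))
        self.conn.execute(
            "INSERT OR REPLACE INTO branch_products (sku, branch, stock, price) "
            "VALUES (?, ?, ?, ?)",
            (item["sku"], branch, item["stock"], item["price"]),
        )
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.conn.commit()
        self.conn.close()


# Exercise the pipeline with a stand-in object instead of a real spider.
spider = SimpleNamespace(branch="3106")
pipeline = SQLitePipeline(":memory:")
pipeline.open_spider(spider)
pipeline.process_item({"sku": "10295446", "stock": 426, "price": 2.27}, spider)
rows = pipeline.conn.execute(
    "SELECT sku, branch, stock, price FROM branch_products"
).fetchall()
pipeline.close_spider(spider)
print(rows)
```

Using INSERT OR REPLACE keyed on (sku, branch) means re-running the crawl for the same branch updates existing rows instead of duplicating them.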

Code description

Description:

This Scrapy crawler extracts data from Walmart for the branch number passed as an argument.

It includes the data cleaning and filtering rules, as well as the pre-configured cookies that the website requires by default.
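
The specific cleaning and filtering rules are not listed here, so the following is only a sketch of what such rules might look like. The function names, the input formats (a price like "$2.27", a comma-separated barcode string), and the filter condition are all assumptions for illustration.

```python
import re
from typing import List, Optional


def clean_price(raw: str) -> Optional[float]:
    """Extract a decimal price from text such as '$2.27'; None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None


def clean_barcodes(raw: str) -> List[str]:
    """Split a comma-separated string and keep only digit-only entries."""
    parts = (part.strip() for part in raw.split(","))
    return [code for code in parts if code.isdigit()]


def is_valid(item: dict) -> bool:
    """Filter rule: keep items that have a SKU and a parsed price."""
    return bool(item.get("sku")) and item.get("price") is not None


item = {
    "sku": "10295446",
    "price": clean_price("$2.27"),
    "barcodes": clean_barcodes("60538887928, ,abc"),
}
print(item, is_valid(item))
```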

To run the scraping job for multiple stores, run the crawl once per branch ID:

python -m scrapy crawl ca_walmart -a branch=3124
python -m scrapy crawl ca_walmart -a branch=3106
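
If you need to crawl many branches, the per-branch commands can be driven from a small script. This is a minimal sketch: the helper names crawl_command and crawl_branches are hypothetical, and the script assumes it is run from the repository root with the virtual environment active.

```python
import subprocess
import sys
from typing import Iterable, List


def crawl_command(branch: str) -> List[str]:
    """Build the crawl command for a single branch ID."""
    return [
        sys.executable, "-m", "scrapy", "crawl", "ca_walmart",
        "-a", f"branch={branch}",
    ]


def crawl_branches(branches: Iterable[str]) -> None:
    """Run one crawl per branch, stopping at the first failure."""
    for branch in branches:
        subprocess.run(crawl_command(branch), check=True)


# Show the commands that would be executed; call crawl_branches to run them.
for branch_id in ["3124", "3106"]:
    print(" ".join(crawl_command(branch_id)[1:]))
# crawl_branches(["3124", "3106"])
```

The crawls run sequentially, so a failure in one branch (check=True raises CalledProcessError) stops the remaining ones.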
