
Welcome to the web_scraping wiki!

The web_scraping project uses Python's popular Scrapy framework to crawl and scrape web pages.
* Scrape - extract the data
* Crawl - navigate between various data sources

Prerequisites:

  • Python 3 installed
  • Scrapy installed
  • A little knowledge of XPath selectors

Why Scrapy?

Scrapy is a Python framework built on top of Twisted, an asynchronous networking framework, which gives it its speed and asynchronous execution. That's the reason why Scrapy is blazing fast 🥇

Play time with scrapy

Scrapy provides some tools that help you experiment with common scraping tasks.

  • Scrapy View

scrapy view https://github.com/JyothishArumugam/web_scraping/wiki/Home/_edit

This command launches a browser with the URL you give it, showing the page exactly as Scrapy sees it.

  • Scrapy Shell: Scrapy's interactive programming environment, much like an IPython notebook. Experiment and have fun (a sample session follows below).

scrapy shell https://github.com/JyothishArumugam/web_scraping/wiki/Home/_edit
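
For example, once the shell opens you can poke at the response interactively. A minimal sketch, assuming the bikewale listing page used later in this tutorial (the class names come from that page's markup):

scrapy shell https://www.bikewale.com/honda-bikes/
# inside the shell, `response` holds the fetched page
response.status                                    # HTTP status of the fetch
response.xpath("//*[@class='bikeDescWrapper']")    # selector list of bike blocks
response.xpath('//*[@class="modelurl"]/@href').extract_first()  # first model link, if any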

Starting a Scrapy Project

Starting a Scrapy project is as simple as one command:


scrapy startproject project_name
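
This generates a project skeleton along these lines (exact files vary slightly with the Scrapy version):

project_name/
    scrapy.cfg            # deploy configuration file
    project_name/         # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py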

Breakdown of our sample project - bikes

Bikes is a simple project that collects data to help you study the Indian motorcycle market. In this project we will collect bike specs per manufacturer.
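
The spider shown further down imports a Bikes2Item from bikes2.items. Here is a minimal sketch of what that items.py could look like, inferred from the two fields the spider fills (the actual project file may declare more fields):

# bikes2/items.py (sketch)
import scrapy

class Bikes2Item(scrapy.Item):
    name = scrapy.Field()   # bike model name
    price = scrapy.Field()  # listed price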

Scrapy Spiders

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).
All Scrapy spiders are organised in the spiders folder. Create the spider from the default template:


scrapy genspider hondaspy bikewale.com/honda-bikes
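
This drops a bare template into the spiders folder, roughly like the following (the exact stub depends on your Scrapy version); the next section shows it filled in:

# -*- coding: utf-8 -*-
import scrapy

class HondaspySpider(scrapy.Spider):
    name = 'hondaspy'
    allowed_domains = ['bikewale.com/honda-bikes']
    start_urls = ['http://bikewale.com/honda-bikes/']

    def parse(self, response):
        pass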

Breakdown of the "hondaspy.py " spider

# -*- coding: utf-8 -*-
import scrapy
from bikes2.items import Bikes2Item


class HondaspySpider(scrapy.Spider):
    name = 'hondaspy'
    allowed_domains = ['bikewale.com']
    start_urls = ['https://www.bikewale.com/honda-bikes/']

    def parse(self, response):
        # every bike on the listing page sits inside a 'bikeDescWrapper' element
        all_bikes = response.xpath("//*[@class='bikeDescWrapper']")
        for bike in all_bikes:
            # relative link to the bike's detail page
            next_url = bike.xpath('.//*[@class="modelurl"]/@href').extract_first()
            # turn the relative href into an absolute URL
            absolute_url = response.urljoin(next_url)
            yield scrapy.Request(url=absolute_url, callback=self.parse2)

    def parse2(self, response):
        # the detail page response lands here; fill an item with the extracted fields
        item = Bikes2Item()
        item['price'] = response.xpath("//*[@id='new-bike-price']/text()").extract_first()
        item['name'] = response.xpath('//*[@class="breadcrumb-link__label"]/text()').extract()[-1]
        yield item
  • Class HondaspySpider
    This is the class scrapy generated from the default spider template, with the following attributes:

      • name >> name of the spider
      • allowed_domains >> domains the spider is allowed to crawl; restricting this keeps the spider from wandering off into other domains (e.g. ad pages)
      • start_urls >> home page of our crawl
  • parse method
    This is the method that parses the response once the start_urls are hit.
    The XPath selector picks the "bikeDescWrapper" elements and loops through all the bikes listed on the page.
    Each "yield" fires a request for a URL found in a bike description, and the callback forwards the response to the "parse2" method, which parses those detail pages in a different manner to extract the data (see the urljoin sketch below).

Run the spider

Navigate to the spiders folder and run the following command:

scrapy runspider hondaspy.py -o honda.csv

This will generate the honda.csv file after scraping the Honda bikes' names and prices.
Supported output formats are .csv, .json, and .xml.
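
Since hondaspy lives inside a Scrapy project, you can equivalently run it by name from the project root and pick any of those formats:

scrapy crawl hondaspy -o honda.json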

Happy Coding