Home
The web_scraping project uses Python's well-known Scrapy framework to crawl and scrape webpages.
* Scrape - extract the data
* Crawl - navigate between various data sources
- Python 3 installed
- Scrapy installed
- A little knowledge of XPath selectors
Scrapy is a Python framework built on top of Twisted, an asynchronous networking framework, which gives it its speed and asynchronous execution. That's the reason why Scrapy is blazing fast 🥇
Scrapy provides some tools that help you experiment with common scraping tasks:
- Scrapy View
scrapy view https://www.bikewale.com/honda-bikes/
This command launches the browser with the URL you give it, showing the page exactly as Scrapy sees it.
- Scrapy Shell: Scrapy's interactive programming environment, much like an IPython notebook. Experiment and have fun.
scrapy shell https://www.bikewale.com/honda-bikes/
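Once the shell has fetched the page, a ready-made response object is available to query. A minimal sketch, assuming the bikewale markup that the spider below targets:

    # Typed at the Scrapy shell prompt; `response` holds the fetched page
    response.xpath("//*[@class='bikeDescWrapper']")                 # all bike cards
    response.xpath("//*[@class='modelurl']/@href").extract_first()  # first detail link
    fetch('https://www.bikewale.com/honda-bikes/')                  # fetch another URL in place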
Starting a Scrapy project is as simple as one command:
scrapy startproject project_name
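This generates the standard Scrapy skeleton, roughly as follows (exact files can vary slightly by Scrapy version):

    project_name/
        scrapy.cfg            # deploy configuration
        project_name/         # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py
            pipelines.py
            settings.py
            spiders/          # your spiders live here
                __init__.py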
Bikes is a simple project that collects data for studying the Indian motorcycle market. In this project we will be collecting the bike specs per manufacturer.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).
All the Scrapy spiders are organised in the spiders folder.
Create the spider from the default template:
scrapy genspider hondaspy bikewale.com/honda-bikes
Breakdown of the "hondaspy.py" spider
# -*- coding: utf-8 -*-
import scrapy

from bikes2.items import Bikes2Item


class HondaspySpider(scrapy.Spider):
    name = 'hondaspy'
    allowed_domains = ['bikewale.com']
    start_urls = ['https://www.bikewale.com/honda-bikes/']

    def parse(self, response):
        # Select every bike description card on the listing page
        all_bikes = response.xpath("//*[@class='bikeDescWrapper']")
        for bike in all_bikes:
            # Relative link to the bike's detail page
            next_url = bike.xpath('.//*[@class="modelurl"]/@href').extract_first()
            absolute_url = response.urljoin(next_url)
            # Follow the link; the callback parses the detail page
            yield scrapy.Request(url=absolute_url, callback=self.parse2)

    def parse2(self, response):
        item = Bikes2Item()
        item['price'] = response.xpath("//*[@id='new-bike-price']/text()").extract_first()
        item['name'] = response.xpath('//*[@class="breadcrumb-link__label"]/text()').extract()[-1]
        yield item
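The spider imports Bikes2Item from the project's items.py, which is not shown on this page. A minimal sketch consistent with the fields used above (the real file may define more):

    # bikes2/items.py -- a sketch, not the project's actual file
    import scrapy

    class Bikes2Item(scrapy.Item):
        name = scrapy.Field()   # bike model name
        price = scrapy.Field()  # listed price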
Class HondaspySpider
This is the class Scrapy created from the default template, with these attributes:
- name >> name of the spider
- allowed_domains >> domains the spider is allowed to crawl; restricting this keeps the crawl from wandering off into ad pages
- start_urls >> home page of our crawl
parse method
This is the method that parses the response once the start_urls are hit.
The XPath selector picks out the "bikeDescWrapper" class and loops through all the bikes listed on the page.
The "yield" will call all the url found in the bike description and will hit the urls ,and the callback forwards the response to the "parse2" method which will be defining the parsing of the pages in a different manner to extract the data.
Navigate to the spiders folder and run the following command:
scrapy runspider hondaspy.py -o honda.csv
This will generate the honda.csv file after scraping the Honda bike prices and names.
Supported formats are .csv, .json, and .xml.
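Changing the output file's extension switches the exporter, for example:

    scrapy runspider hondaspy.py -o honda.json
    scrapy runspider hondaspy.py -o honda.xml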