This scraper is built on the Scrapy framework and supports pagination. It uses fake user agents and rotating proxies to avoid being blocked.
Steps to run the project:
- Activate the virtual environment:
. env/bin/activate
- Install the requirements:
pip install -r requirements.txt
- Run the spider:
scrapy crawl QuotesScraper
Source Code Link -> GitHub
What are we going to do?
- Setting up the Scrapy project
- Writing a spider to scrape quotes
- Cleaning the data and storing it in an SQLite3 database through a pipeline
- Setting up fake user agents and proxies
Before creating the project, we should know what Scrapy is.
Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler.
To initialize the project, run:
scrapy startproject quotes_tutorial
(Scrapy project names must be valid Python identifiers, so we use an underscore rather than a hyphen.)
This will create a quotes_tutorial directory with the following contents:
quotes_tutorial/
    scrapy.cfg            # deploy configuration file
    quotes_tutorial/      # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
The spider will extract quotes from the Goodreads website.
Before moving ahead, we should be familiar with selectors.
What are selectors/locators?
A CSS Selector is a combination of an element selector and a value which identifies the web element within a web page.
The choice of locator depends largely on the page you are working with.
Id: An element’s id is referenced in XPath using “[@id='example']” and in CSS using “#”. IDs must be unique within the DOM. Examples:
XPath: //div[@id='example']
CSS: #example
Element Type: The previous example showed //div in the XPath. That is the element type, which could be input for a text box or button, img for an image, or “a” for a link. Examples:
XPath: //input
CSS: input
Direct Child: HTML pages are structured like XML, with children nested inside of parents. If you can locate, for example, the first link within a div, you can construct a string to reach it. A direct child in XPath is defined by the use of “/”, while in CSS it is defined using “>”. Examples:
XPath: //div/a
CSS: div > a
Child or Sub-Child: Writing nested divs can get tiring, and it results in brittle code. Sometimes you expect the code to change, or you want to skip layers. If an element could be inside another element or one of its children, it is defined in XPath using “//” and in CSS just by a whitespace. Examples:
XPath: //div//a
CSS: div a
Class: For classes, things are pretty similar in XPath: “[@class='example']”, while in CSS it is just “.”. Examples:
XPath: //div[@class='example']
CSS: .example
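To see these selectors in action outside a spider, here is a minimal sketch using Scrapy's Selector class on a small, made-up HTML snippet (the markup is purely illustrative):
from scrapy import Selector

# a tiny, hypothetical HTML snippet to try the selectors above
html = """
<div id="example" class="example">
    <a href="/first">first link</a>
    <span><a href="/nested">nested link</a></span>
</div>
"""
sel = Selector(text=html)

print(sel.xpath("//div[@id='example']").get() is not None)     # True  (id, XPath)
print(sel.css("#example").get() is not None)                   # True  (id, CSS)
print(sel.css("div.example > a::attr(href)").getall())         # ['/first']            (direct child)
print(sel.css("div.example a::attr(href)").getall())           # ['/first', '/nested'] (any descendant)
print(sel.xpath("//div[@class='example']//a/@href").getall())  # ['/first', '/nested']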
Our spider inherits from Scrapy's Spider class. page_number holds the number of the next page to be scraped, and start_urls provides the entry point(s) when no URL is given explicitly. The parse method is the main callback where scraping starts; it uses the CSS selectors explained above.
import scrapy
from ..items import QuotesItem

class QuotesScraper(scrapy.Spider):
    name = "QuotesScraper"
    # number of the next page to request; page 1 comes from start_urls
    page_number = 2
    start_urls = ["https://www.goodreads.com/quotes/tag/inspirational"]

    def parse(self, response, **kwargs):
        for quote in response.css(".quote"):
            item = QuotesItem()
            title = quote.css(".quoteText::text").extract_first()
            author = quote.css(".authorOrTitle::text").extract_first()
            item["title"] = title
            item["author"] = author
            yield item

        # Uncomment the lines below (and comment out the page-counter block
        # underneath) to follow the "next" link and scrape every page:
        # next_btn = response.css("a.next_page::attr(href)").get()
        # if next_btn is not None:
        #     yield response.follow(next_btn, callback=self.parse)

        next_page = f"https://www.goodreads.com/quotes/tag/inspirational?page={QuotesScraper.page_number}"
        if QuotesScraper.page_number < 3:
            QuotesScraper.page_number += 1
            yield response.follow(next_page, callback=self.parse)
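At this point the spider can already be run on its own. If you just want to inspect the scraped items without the database pipeline, Scrapy's feed export can dump them to a file (quotes.json here is just an example file name):
scrapy crawl QuotesScraper -o quotes.json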
But what is QuotesItem here?
Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, which prevents typos.
import scrapy

class QuotesItem(scrapy.Item):
    # define the fields for your item here:
    title = scrapy.Field()
    author = scrapy.Field()
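As a quick illustration of that protection, here is a small hypothetical snippet (not part of the project code; the import assumes the project module is named quotes_tutorial):
from quotes_tutorial.items import QuotesItem

item = QuotesItem()
item["title"] = "Be yourself; everyone else is already taken."
item["author"] = "Oscar Wilde"
print(item["title"])   # behaves like a normal dict

item["athor"] = "..."  # misspelled field name -> Scrapy raises a KeyError immediately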
The pipeline below (placed in pipelines.py) stores every scraped item in an SQLite3 database:
import sqlite3

class QuotesPipeline:
    def __init__(self):
        # set up the database connection and the table once, when the spider starts
        self.create_connection()
        self.create_table()

    def process_item(self, item, spider):
        # called for every item yielded by the spider
        self.db_store(item)
        return item

    def create_connection(self):
        self.conn = sqlite3.connect("quotes.db")
        self.curr = self.conn.cursor()

    def create_table(self):
        # recreate the table on every run so old rows are discarded
        self.curr.execute("""DROP TABLE IF EXISTS quote_table""")
        self.curr.execute("""CREATE TABLE quote_table (title TEXT, author TEXT)""")

    def db_store(self, item):
        self.curr.execute("""INSERT INTO quote_table VALUES (?, ?)""", (
            item["title"],
            item["author"]
        ))
        self.conn.commit()
The QuotesPipeline class works as follows: __init__ opens a connection to the SQLite3 database through create_connection and creates the quotes table through create_table. process_item is called for every scraped item and stores it in the database via db_store.
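Note that Scrapy only runs pipelines that are enabled in settings.py. A minimal sketch, assuming the project module is named quotes_tutorial:
# settings.py
ITEM_PIPELINES = {
    "quotes_tutorial.pipelines.QuotesPipeline": 300,  # value in 0-1000; lower numbers run first
}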
Since we are scraping at a large scale, we need to avoid getting our IP banned. We use two libraries:
1. scrapy-proxy-pool: it provides a pool of proxies that keeps our real IP hidden. Enable this middleware by adding the following setting to your settings.py:
PROXY_POOL_ENABLED = True
Then add the proxy pool middlewares to your DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}
After this, all requests will be routed through proxies from the pool.
2. scrapy-user-agents
Random User-Agent middleware picks up User-Agent strings based on Python User Agents and MDN.
Turn off the built-in UserAgentMiddleware and add RandomUserAgentMiddleware.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
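Since both libraries hook into the same DOWNLOADER_MIDDLEWARES setting, the two snippets above end up merged into a single dict in settings.py, roughly like this:
# settings.py -- combined sketch of both middleware configurations
PROXY_POOL_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default user agent middleware
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,     # rotate user agents per request
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,            # route requests through the proxy pool
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,         # retry with a new proxy when a ban is detected
}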
Running Scrapy spiders in your local machine is very convenient for the (early) development stage, but not so much when you need to execute long-running spiders or move spiders to run in production continuously. This is where the solutions for deploying Scrapy spiders come in.
Popular choices for deploying Scrapy spiders are Scrapyd (open source) and Zyte Scrapy Cloud (cloud-based).
By Praveen Chaudhary · Images by Binary Beast