This project provides a comprehensive solution for web scraping and data extraction. It combines the power of Scrapy and Django to efficiently scrape data from the ACM and IEEE digital libraries and expose it through a user-friendly API. The Scrapy spiders extract relevant information such as titles, links, abstracts, citation counts, and author details, while the Django project provides a structured way to access and utilize this data through well-defined API endpoints.
This project contains two Scrapy spiders that scrape research paper titles, links, authors, abstracts, and citation counts from the IEEE Xplore and ACM Digital Library websites.
The ACM and IEEE spiders scrape research paper metadata from their respective digital libraries. Both handle dynamic content with scrapy-splash, ensuring that JavaScript-rendered pages load properly.
- Scrapes research paper titles, links, abstracts, citation counts, and author details.
- Handles JavaScript-rendered content using Splash.
- Clone the repository:
  git clone https://github.com/yourusername/your-repo.git
  cd your-repo
- Install the dependencies, Scrapy and scrapy-splash (see the configuration sketch after this list):
  pip install scrapy scrapy-splash
- Install Splash: install Docker, then run the Splash container to handle JavaScript rendering:
  docker run -p 8050:8050 scrapinghub/splash
- Ensure Splash is running: the Splash container started in the previous step must stay up at http://localhost:8050 while the spiders run.
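With Splash running, scrapy-splash has to be enabled in the Scrapy project's settings. The project's settings file isn't reproduced here, so the snippet below is a minimal sketch of the standard scrapy-splash configuration, assuming Splash listens on the default port 8050 mapped by the Docker command above.

```python
# settings.py -- minimal scrapy-splash wiring (sketch; the project's actual
# settings may differ). Values follow the standard scrapy-splash setup.

# Splash instance started by the Docker command above
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash so JavaScript-rendered pages are fully loaded
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Deduplicate requests based on their Splash arguments as well as the URL
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```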
- Description: The IEEE spider scrapes research papers from IEEE Xplore. It extracts information like the paper title, link, abstract, citation count, and authors.
- Command (a structural sketch of such a spider is shown after the ACM section below):
  scrapy crawl ieee_spider -a search_term="Your Search Term" -o results.json
- Description: The ACM spider scrapes research papers from ACM Digital Library. It extracts information like the paper title, link, abstract, citation count, and authors.
- Command:
  scrapy crawl acm_spider -a search_term="Your Search Term" -o results.json
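Neither spider's source is reproduced in this section. The sketch below shows the general shape such a spider can take with scrapy-splash, reusing the `ieee_spider` name and the `search_term` argument from the commands above; the search URL and all CSS selectors are placeholders, not the project's real ones.

```python
import scrapy
from scrapy_splash import SplashRequest
from urllib.parse import quote_plus


class IeeeSpider(scrapy.Spider):
    """Sketch of a Splash-backed spider; the selectors below are placeholders."""
    name = "ieee_spider"

    def __init__(self, search_term="", *args, **kwargs):
        # Populated from: scrapy crawl ieee_spider -a search_term="..."
        super().__init__(*args, **kwargs)
        self.search_term = search_term

    def start_requests(self):
        # Placeholder search URL; the real spider may build it differently
        url = ("https://ieeexplore.ieee.org/search/searchresult.jsp"
               f"?queryText={quote_plus(self.search_term)}")
        # Render the page in Splash and wait briefly for JavaScript to finish
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        for result in response.css(".result-item"):  # placeholder selector
            yield {
                "title": result.css("h3 a::text").get(),
                "link": response.urljoin(result.css("h3 a::attr(href)").get() or ""),
                "abstract": result.css(".abstract::text").get(),
                "authors": result.css(".authors a::text").getall(),
                "citation_count": result.css(".citations::text").get(),
            }
```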
After running a spider, the output is stored in results.json. Each scraped item has the following structure:
{
  "title": "Example Research Paper Title",
  "link": "https://example.com",
  "details": "Example details of publication",
  "abstract": "This is a sample abstract.",
  "authors": ["Author One", "Author Two"],
  "citation_count": 120
}
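Because `-o results.json` uses Scrapy's JSON feed exporter, the file is a JSON array of such items and can be read back with the standard library. A small example, using the field names shown above:

```python
import json

# Load the feed exported by `scrapy crawl ... -o results.json`
with open("results.json", encoding="utf-8") as f:
    papers = json.load(f)  # a list of item dicts like the example above

# e.g. list the most-cited papers first
for paper in sorted(papers, key=lambda p: p.get("citation_count") or 0, reverse=True):
    print(f'{paper.get("citation_count", 0):>5}  {paper.get("title")}')
```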
The Django project acts as a gateway between users and the Scrapy spiders. It handles requests, interacts with the spiders, and processes the extracted data, providing a structured interface for accessing scraped information from ACM and IEEE.
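How the Django views hand requests off to the spiders isn't shown here. One common pattern, sketched below under that assumption, is for a view to launch `scrapy crawl` as a subprocess with the requested search term and return the exported JSON; the view name, query parameter, and file handling are illustrative, not the project's actual code.

```python
# views.py -- hypothetical sketch of a view that delegates to the IEEE spider.
import json
import subprocess
import tempfile
from pathlib import Path

from django.http import JsonResponse


def scrape_ieee(request):
    """Run the IEEE spider for ?q=<search term> and return its results."""
    search_term = request.GET.get("q", "")
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "results.json"
        # Equivalent to: scrapy crawl ieee_spider -a search_term="..." -o results.json
        # (assumes it runs from the Scrapy project directory; pass cwd=... otherwise)
        subprocess.run(
            ["scrapy", "crawl", "ieee_spider",
             "-a", f"search_term={search_term}",
             "-o", str(out_file)],
            check=True,
        )
        items = json.loads(out_file.read_text(encoding="utf-8"))
    return JsonResponse({"results": items})
```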
Install the Django dependencies, then start the development server:
pip install django djangorestframework
python manage.py runserver
The Django API is now running at http://localhost:8000/. You can use a web browser or an API testing tool to interact with the API endpoints.
The current API endpoint is /api/hello. When accessed, it returns a JSON response containing the following message:
{
  "message": "This is a test endpoint for Scraper API."
}
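For reference, here is a minimal sketch of how an endpoint like this is commonly defined with Django REST Framework; the project's actual view and URL configuration may differ.

```python
# views.py
from rest_framework.decorators import api_view
from rest_framework.response import Response


@api_view(["GET"])
def hello(request):
    # Matches the JSON response shown above
    return Response({"message": "This is a test endpoint for Scraper API."})


# urls.py
from django.urls import path
# from scraper_api.views import hello  # import path depends on the app layout (hypothetical)

urlpatterns = [
    path("api/hello", hello),
]
```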