This project is a simple Node.js scraper designed to extract and save content from Wikipedia pages. It uses `axios` for making HTTP requests, `cheerio` for parsing HTML, and `fs-extra` for file operations.

The scraper fetches content from Wikipedia pages and saves the data in JSON format. It demonstrates how to perform web scraping, handle asynchronous operations, and manage file I/O in Node.js.
- Fetch Data: The scraper uses `axios` to send HTTP GET requests to Wikipedia pages.
- Parse HTML: The HTML content of each page is parsed with `cheerio`, which allows easy extraction of the relevant information.
- Extract Information: The title of the page and its main content are extracted from the HTML.
- Save Data: The extracted data is saved as JSON files in a specified directory using `fs-extra` (a minimal sketch of this flow follows the list).
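
The sketch below shows how these steps might fit together. It is a minimal illustration rather than the project's actual `scraper.js`: the CSS selectors (`#firstHeading`, `#mw-content-text p`), the output directory name (`data`), and the function names are assumptions made for the example.

```js
// Minimal sketch of the fetch -> parse -> extract -> save flow.
// Selectors, directory name, and function names are illustrative assumptions.
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs-extra');
const path = require('path');

const PAGES = ['Node.js', 'JavaScript', 'Web_scraping'];
const OUTPUT_DIR = path.join(__dirname, 'data');

async function scrapePage(page) {
  // Fetch Data: send an HTTP GET request to the Wikipedia page.
  const url = `https://en.wikipedia.org/wiki/${page}`;
  const response = await axios.get(url);

  // Parse HTML: load the page into cheerio for querying.
  const $ = cheerio.load(response.data);

  // Extract Information: the page title and the text of the main body paragraphs.
  const title = $('#firstHeading').text().trim();
  const content = $('#mw-content-text p')
    .map((_, el) => $(el).text().trim())
    .get()
    .join('\n');

  // Save Data: write the extracted data as a JSON file in the output directory.
  await fs.ensureDir(OUTPUT_DIR);
  await fs.writeJson(path.join(OUTPUT_DIR, `${page}.json`), { title, content }, { spaces: 2 });
  console.log(`Saved ${page}.json`);
}

async function main() {
  for (const page of PAGES) {
    await scrapePage(page);
  }
}

main().catch(console.error);
```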
- Clone the Repository

  ```bash
  git clone https://github.com/your-username/wikipedia-scraper.git
  cd wikipedia-scraper
  ```

- Install Dependencies

  ```bash
  npm install
  ```

- Update Scraper Configuration

  Edit the `PAGES` constant in `scraper.js` to list the Wikipedia pages you want to scrape:

  ```js
  const PAGES = ['Node.js', 'JavaScript', 'Web_scraping'];
  ```

- Run the Scraper

  ```bash
  node scraper.js
  ```
After running the scraper, check the data directory for JSON files. Each file will contain the title and content of a specific Wikipedia page.
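
As an illustration, a file such as `Node.js.json` might look like the snippet below; the exact field names depend on the scraper's implementation, and this shape assumes the title/content structure described above:

```json
{
  "title": "Node.js",
  "content": "Node.js is a cross-platform, open-source JavaScript runtime environment..."
}
```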