ScrapeDogg
is an experimental library for scraping websites using OpenAI's GPT.
The library provides a means to scrape structured data from HTML without writing page-specific code.
Before you proceed, here are at least two reasons why you should not use this library:
* It is **very experimental**, no guarantees are made about the stability of the API or the accuracy of the results.
* It relies on the OpenAI API, which is quite slow and can be expensive.
**Use at your own risk.**
Step 1) Obtain an OpenAI API key (https://platform.openai.com) and set an environment variable in the .env file:
OPENAI_API_KEY=sk-...
Step 2) Install the library however you like:
npm install scrapedogg
or
yarn add scrapedogg
Step3) Initialize a URL and a Schema by indicating the data you wish to extract:
let url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html";
let schema = {
title: "string",
book_cover: "url",
url: "url",
price: "string",
stock: "int",
rating: "int/int stars",
description: "string",
};
Step 4) Passing the scraper a URL to the resulting scraper will return the scraped data in the console:
(async () => {
console.log(await scrape(url, schema))
})();
{
"title": "A Light in the Attic",
"book_cover": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg",
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"price": "£51.77",
"stock": 22,
"rating": "3/5 stars",
"description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. ...more"
}
That's it! 🎉
This project requires NodeJS (version 8 or later) and NPM. Node and NPM are really easy to install. To make sure you have them available on your machine, try running the following command.
$ npm -v && node -v
6.4.1
v8.16.0
npm install ScrapeDogg
npm run test
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Add your changes:
git add .
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request 😎
- scrapeghost
- Huge Thanks to scrapeghost for inspiring me to make this idea in JS, Go check out their python version, Thanks!
Give a ⭐️ if this project helped you!