Skip to content

GPTScrape: A tool for web scraping that uses spaCy for NLP and GPT4All for converting scraped text into structured JSON.

License

Notifications You must be signed in to change notification settings

trikztr/gptscrape

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GPTScrape

GPTScrape is a tool for scraping web content and converting it into structured JSON using NLP and LLM models. It uses spaCy for natural language processing and GPT4All for text-to-JSON conversion.

Setup

1. Install Required Packages

First, ensure you have the necessary Python packages installed. Use the requirements.txt file provided in the repository.

pip install -r requirements.txt

2. Download spaCy Model

The application uses the nb_core_news_lg (Norwegian Bokmål) spaCy model by default. Download it with the following command:

python -m spacy download nb_core_news_lg

3. Configuration

You might want to adjust settings in the config.ini file.

Usage

1. Start the Application

Run the main script to start the GPTScrape application:

python main.py

2. Application Interface

  • URL: Enter the URL of the web page you want to scrape.
  • Input: Provide the text input or description that you want to use for scraping.
  • Scrape Button: Click this button to start the scraping process.

3. Application Behavior

For now the application will navigate to the provided URL, locate the body of the page, and use the text input to find the best matching element, then create a query selector based on it to match other elements of the same structure. It will then process these elements' text using GPT4All and prompt it to get a structured JSON output.

Troubleshooting

  • Dependencies: Make sure all dependencies in requirements.txt are installed. If you encounter issues, consider creating a virtual environment.
  • Model Download: Ensure that the spaCy model download completes successfully. GPT4All model should get downloaded automatically.

Contributing

If you want to contribute to the project, please fork the repository and create a pull request with your changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

GPTScrape: A tool for web scraping that uses spaCy for NLP and GPT4All for converting scraped text into structured JSON.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages