
RAGScraper is a Python library designed for efficient and intelligent scraping of web documentation and content. Tailored for Retrieval-Augmented Generation systems, RAGScraper extracts and preprocesses text into structured, machine-learning-ready formats. It emphasizes precision, context preservation, and ease of integration with RAG models.


ElapseAI/RAGScraper


RAGScraper

RAGScraper is a simple Python package that scrapes web pages and converts them to Markdown for use in Retrieval-Augmented Generation (RAG) pipelines.

Installation

To install RAGScraper, simply run:

pip install ragscraper

Usage

To use RAGScraper as a command-line tool:

rag-scraper <URL>

To use RAGScraper in a Python script:

from rag_scraper.scraper import Scraper
from rag_scraper.converter import Converter

# Fetch HTML content
url = "https://example.com"
html_content = Scraper.fetch_html(url)

# Convert to Markdown
markdown_content = Converter.html_to_markdown(
    html=html_content,
    base_url=url,  # reuse the URL fetched above
    parser_features='html.parser',
    ignore_links=True
)
print(markdown_content)
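
To give a rough idea of what an HTML-to-Markdown step with an `ignore_links` option can look like, here is a minimal standard-library sketch. It is not RAGScraper's implementation: the names `MiniMarkdownParser` and `mini_html_to_markdown` are hypothetical, and the sketch only handles headings, paragraphs, list items, and links.

```python
import re
from html.parser import HTMLParser


class MiniMarkdownParser(HTMLParser):
    """Hypothetical stdlib-only converter; not part of RAGScraper."""

    def __init__(self, ignore_links=True):
        super().__init__()
        self.ignore_links = ignore_links
        self.parts = []    # output fragments, joined at the end
        self._href = None  # href of the currently open <a>, if kept

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            # <h2> becomes "## ", and so on
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in ("p", "ul", "ol"):
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a" and not self.ignore_links:
            self._href = dict(attrs).get("href")
            if self._href:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.parts.append("](" + self._href + ")")
            self._href = None

    def handle_data(self, data):
        # Collapse whitespace runs but keep single spaces between fragments
        self.parts.append(re.sub(r"\s+", " ", data))


def mini_html_to_markdown(html, ignore_links=True):
    parser = MiniMarkdownParser(ignore_links=ignore_links)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Trim each line, squeeze repeated spaces, and cap blank lines at one
    lines = [re.sub(r" {2,}", " ", line.strip()) for line in text.split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


sample = (
    '<h1>Title</h1><p>See <a href="https://example.com/docs">the docs</a> '
    "now.</p><ul><li>one</li><li>two</li></ul>"
)
print(mini_html_to_markdown(sample))
print(mini_html_to_markdown(sample, ignore_links=False))
```

With `ignore_links=True` the link text is kept and the target URL is dropped, which tends to reduce noise in text destined for embedding; with `ignore_links=False` the sketch emits standard `[text](url)` Markdown links.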

Development

To run the tests for RAGScraper, navigate to the package directory and run:

python -m unittest discover tests

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
