sitemapExport

sitemapExport is a Go-based CLI tool that crawls a sitemap or RSS feed, extracts content from web pages using CSS selectors, and compiles the data into various formats such as txt, json, jsonl, md, and pdf.

The primary use case is to extract content into a single file that can be used as contextual data for AI. For example, you can export your docs site as a simple PDF and use it to power an AI support chatbot (tutorial here).

Features

- Crawl a sitemap or RSS feed to extract content from pages.
- Extract page content using a specified CSS selector.
- Generate a structured list of pages with:
  - Page title
  - URL
  - Meta description (if available)
  - Meta tags (if available)
  - Extracted content
- Output formats supported:
  - Plain text (txt)
  - JSON (json)
  - JSON Lines (jsonl)
  - Markdown (md)
  - PDF (pdf)

Installation

Easy : Run the command

Download the prebuilt sitemapExport file from the repo, make it executable, and run it.

That's it. The executable is rebuilt each time the repo is updated, but you can always build it from source if you prefer.

Fun : Build from source

To build sitemapExport, you'll need Go installed.

Clone the repository:

git clone https://github.com/rlnorthcutt/sitemapExport.git
cd sitemapExport

Build the CLI tool:
```
go build
```
For a smaller binary, you can use:
```
go build -ldflags="-s -w"
```
This will generate the sitemapExport binary.

Usage

Once built, you can run the tool from the command line. The tool supports both interactive prompts and command-line flags.

Interactive Usage

./sitemapExport

Example Interactive Prompts

$ ./sitemapExport
Enter the Sitemap or RSS feed URL (required): https://example.com/sitemap.xml
Enter the CSS selector to extract content (default: body):
Enter the output filename (default: output): output
Enter the output file type (txt, json, jsonl, md, pdf) (default: txt): jsonl
Enter the content format (html, md, txt) (default: txt): md
Successfully saved output to output.jsonl

This will crawl the provided sitemap, extract content from each page using the CSS selector, and save the output in the chosen format (jsonl in this case).
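
To make the first step of that pipeline concrete, here is a minimal sketch of reading page URLs out of a sitemap using only the standard library. It assumes the standard sitemap layout (`<urlset>` containing `<url><loc>` entries); the names `urlSet` and `parseSitemap` are hypothetical and are not taken from the tool's source.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// urlSet mirrors the <urlset> root element of a standard XML sitemap.
// Hypothetical type for illustration only.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseSitemap returns every non-empty <loc> value found in a sitemap document.
func parseSitemap(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, fmt.Errorf("parse sitemap: %w", err)
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		// Sitemaps are often pretty-printed, so trim surrounding whitespace.
		if loc := strings.TrimSpace(u.Loc); loc != "" {
			locs = append(locs, loc)
		}
	}
	return locs, nil
}

func main() {
	sample := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`)

	locs, err := parseSitemap(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, loc := range locs {
		fmt.Println(loc)
	}
}
```

Each returned URL would then be fetched and passed to the CSS-selector extraction step.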

Command-Line Options (Non-Interactive)

If you prefer to pass flags instead of interactive prompts, you can run:

./sitemapExport --url="https://example.com/sitemap.xml" --css="body" --filename="output" --type="txt" --format="txt"

Or, use the short flags:

./sitemapExport --u="https://example.com/sitemap.xml" --c="body" --n="output" --t="txt" --f="txt"
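
As a rough illustration of how a long flag and its short alias can feed the same option, the sketch below uses Go's standard `flag` package. This is a swapped-in technique: the tool itself lists cobra as its CLI dependency, and nothing here is taken from its source. The `options` type and `parseOptions` function are hypothetical, and the defaults are copied from the interactive prompts above.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the five settings the README documents.
// Hypothetical type for illustration only.
type options struct {
	URL, CSS, Filename, Type, Format string
}

// parseOptions registers each long name and its short alias against the
// same variable, so either spelling sets the same option.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("sitemapExport", flag.ContinueOnError)

	fs.StringVar(&o.URL, "url", "", "sitemap or RSS feed URL (required)")
	fs.StringVar(&o.URL, "u", "", "short for url")
	fs.StringVar(&o.CSS, "css", "body", "CSS selector to extract content")
	fs.StringVar(&o.CSS, "c", "body", "short for css")
	fs.StringVar(&o.Filename, "filename", "output", "output filename")
	fs.StringVar(&o.Filename, "n", "output", "short for filename")
	fs.StringVar(&o.Type, "type", "txt", "output file type")
	fs.StringVar(&o.Type, "t", "txt", "short for type")
	fs.StringVar(&o.Format, "format", "txt", "content format")
	fs.StringVar(&o.Format, "f", "txt", "short for format")

	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.URL == "" {
		return o, fmt.Errorf("a sitemap or RSS feed URL is required")
	}
	return o, nil
}

func main() {
	o, err := parseOptions([]string{"--url=https://example.com/sitemap.xml", "--t=jsonl"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```

The standard `flag` package accepts one or two leading dashes for any flag name, which is why both spellings parse in this sketch; cobra may treat short flags differently.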

Supported Formats

txt: Plain text format
json: JSON with pretty-printing
jsonl: JSON Lines format (one JSON object per line)
md: Markdown format
pdf: PDF format

Example Output

JSON Output (output.json):

[
  {
    "Title": "Home",
    "URL": "https://example.com",
    "Description": "Welcome to our homepage",
    "MetaTags": ["description: Welcome to our site"],
    "Content": "<div>Welcome to our site!</div>"
  },
  {
    "Title": "About Us",
    "URL": "https://example.com/about",
    "Description": "Learn more about our company",
    "MetaTags": ["description: About Us"],
    "Content": "<p>We are a company...</p>"
  }
]

Markdown Output (output.md):

# Home

URL: https://example.com

Description: Welcome to our homepage

<div>Welcome to our site!</div>

---

# About Us

URL: https://example.com/about

Description: Learn more about our company

<p>We are a company...</p>

---
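
The Markdown layout above can be reproduced with a few formatted writes. The following is a minimal sketch, not the tool's actual formatter: the `Page` type and `renderMarkdown` function are hypothetical, and skipping the Description line when it is empty is an assumption based on the "if available" wording in the feature list.

```go
package main

import (
	"fmt"
	"strings"
)

// Page carries the fields that appear in the Markdown sample.
// Hypothetical type for illustration only.
type Page struct {
	Title       string
	URL         string
	Description string
	Content     string
}

// renderMarkdown writes one section per page, separated by horizontal rules.
func renderMarkdown(pages []Page) string {
	var b strings.Builder
	for _, p := range pages {
		fmt.Fprintf(&b, "# %s\n\n", p.Title)
		fmt.Fprintf(&b, "URL: %s\n\n", p.URL)
		// Assumption: the description line is omitted when none was found.
		if p.Description != "" {
			fmt.Fprintf(&b, "Description: %s\n\n", p.Description)
		}
		fmt.Fprintf(&b, "%s\n\n---\n\n", p.Content)
	}
	return b.String()
}

func main() {
	pages := []Page{
		{
			Title:       "Home",
			URL:         "https://example.com",
			Description: "Welcome to our homepage",
			Content:     "<div>Welcome to our site!</div>",
		},
	}
	fmt.Print(renderMarkdown(pages))
}
```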

JSON Lines Output (output.jsonl):

{"Title":"Home","URL":"https://example.com/","Description":"Welcome to our homepage","MetaTags":["description: Welcome"],"Content":"<div>Welcome to our site!</div>"}
{"Title":"About Us","URL":"https://example.com/about","Description":"Learn more about our company","MetaTags":["description: About Us"],"Content":"<p>We are a company...</p>"}
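
For readers who want to produce or consume this shape themselves, here is a minimal sketch of writing records as JSON Lines with `encoding/json`. The `Page` type and `toJSONLines` function are hypothetical; the field names are taken from the sample output above. Go's encoder escapes `<` and `>` by default, so the sketch disables HTML escaping to keep tags readable as in the sample. Whether the tool does the same is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Page mirrors the fields shown in the example output.
// Hypothetical type for illustration only.
type Page struct {
	Title       string
	URL         string
	Description string
	MetaTags    []string
	Content     string
}

// toJSONLines encodes each page as one compact JSON object per line.
func toJSONLines(pages []Page) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep "<p>" instead of "\u003cp\u003e"
	for _, p := range pages {
		// Encode appends a newline after every value, which is exactly
		// the record separator JSON Lines expects.
		if err := enc.Encode(p); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	pages := []Page{
		{
			Title:       "About Us",
			URL:         "https://example.com/about",
			Description: "Learn more about our company",
			MetaTags:    []string{"description: About Us"},
			Content:     "<p>We are a company...</p>",
		},
	}
	out, err := toJSONLines(pages)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Because every line is a complete JSON object, a consumer can process the file with a line scanner and `json.Unmarshal` without loading the whole export into memory.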

Project Structure

sitemapExport/
├── main.go           # CLI entry point
├── crawler/          # Handles sitemap and RSS crawling and page extraction
│   └── crawler.go
├── formatter/        # Formats extracted content into different file formats
│   └── formatter.go
├── writer/           # Writes formatted content to files (txt, json, jsonl, md, pdf)
│   └── writer.go
├── feed/             # Detects feed type and handles feed-related tasks
│   └── feed.go
├── go.mod            # Go module file with dependencies
├── go.sum            # Go module dependency checksum
└── README.md         # Project documentation
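
The feed package is described as detecting the feed type. One plausible way to do that, sketched here under assumptions, is to look at the root element of the XML document: `urlset` or `sitemapindex` for a sitemap, `rss` for an RSS feed. The function name `detectFeedType` and its return values are hypothetical and may not match what feed.go actually does.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// detectFeedType reports "sitemap" or "rss" based on the document's root element.
// Hypothetical helper for illustration only.
func detectFeedType(data []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(data))
	for {
		tok, err := dec.Token()
		if err != nil {
			// Reaching EOF here means no element was ever opened.
			return "", fmt.Errorf("no root element found: %w", err)
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue // skip the XML declaration, comments, and whitespace
		}
		switch start.Name.Local {
		case "urlset", "sitemapindex":
			return "sitemap", nil
		case "rss":
			return "rss", nil
		default:
			return "", fmt.Errorf("unrecognized root element %q", start.Name.Local)
		}
	}
}

func main() {
	samples := [][]byte{
		[]byte(`<?xml version="1.0"?><urlset><url><loc>https://example.com/</loc></url></urlset>`),
		[]byte(`<?xml version="1.0"?><rss version="2.0"><channel></channel></rss>`),
	}
	for _, s := range samples {
		kind, err := detectFeedType(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(kind)
	}
}
```

Only the first start tag is read, so the check stays cheap even for very large sitemaps.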

Dependencies

sitemapExport uses the following Go packages:

github.com/PuerkitoBio/goquery - For parsing and manipulating HTML documents.
github.com/kennygrant/sanitize - For sanitizing HTML content.
github.com/spf13/cobra - For CLI command management.
github.com/jung-kurt/gofpdf - For PDF generation.
github.com/JohannesKaufmann/html-to-markdown - For converting HTML to Markdown.
github.com/schollz/progressbar/v3 - For showing progress bars during sitemap and RSS crawling.

Contributing

Feel free to open issues or submit pull requests for new features, bug fixes, or general improvements.

License

This project is licensed under the MIT License.
