Webspot is an intelligent web service that automatically detects web content and extracts information from it.
Make sure you have installed Docker and Docker Compose.
```bash
# clone git repo
git clone https://github.com/crawlab-team/webspot

# start docker containers
docker-compose up -d
```
Then you can access the web UI at http://localhost:9999.
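As a quick sanity check, you can confirm the service is reachable from Python (the URL is the one above; the only assumption here is that the root page responds once the containers are healthy):

```python
import requests

# Sanity check: the web UI from the Docker setup above should be reachable.
resp = requests.get("http://localhost:9999", timeout=10)
print(resp.status_code)  # expect 200 once the containers are up
```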
Once you have started Webspot, you can go to http://localhost:9999/redoc to view the API reference.
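As a quick illustration, the sketch below calls the HTTP API from Python. The endpoint path and payload fields here are assumptions made for this example; consult the ReDoc page above for the actual routes and schemas.

```python
import requests

# Base URL of the local Webspot instance (from the Docker setup above).
BASE_URL = "http://localhost:9999"

# NOTE: the endpoint path and payload fields below are assumptions for
# illustration only; check http://localhost:9999/redoc for the real schema.
payload = {"url": "https://quotes.toscrape.com"}

resp = requests.post(f"{BASE_URL}/api/requests", json=payload, timeout=60)
resp.raise_for_status()

# Print the detection results returned by the service.
print(resp.json())
```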
The overall process of how Webspot detects meaningful elements in HTML documents or web pages is shown in the figure below.
```mermaid
graph LR
hr[HtmlRequester]
gl[GraphLoader]
d[Detector]
r[Results]
hr --"html + json"--> gl --"graph"--> d --"output"--> r
```
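To make the data flow concrete, here is a minimal, hypothetical sketch of the four stages in Python. The class and method names simply mirror the diagram's labels; they are not Webspot's actual internal API.

```python
# A minimal, hypothetical sketch of the pipeline above. The class and method
# names mirror the diagram's labels; they are NOT Webspot's actual internal API.

class HtmlRequester:
    def request(self, url: str) -> dict:
        """Fetch the page and return its HTML plus request metadata (JSON)."""
        ...

class GraphLoader:
    def load(self, html_and_json: dict) -> object:
        """Convert the HTML DOM into a graph of nodes and edges."""
        ...

class Detector:
    def detect(self, graph: object) -> list:
        """Run detection on the graph and return detected elements."""
        ...

def run_pipeline(url: str) -> list:
    html_and_json = HtmlRequester().request(url)  # "html + json"
    graph = GraphLoader().load(html_and_json)     # "graph"
    return Detector().detect(graph)               # "output"
```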
You can follow the guidance below to get started with development.
- Python >=3.8 and <=3.10
- Go 1.16 or higher
- MongoDB 4.2 or higher
```bash
# install dependencies
pip install -r requirements.txt
```
Database configuration is located in the `.env` file. You can copy the example file and modify it:

```bash
cp .env.example .env
```
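For illustration, the snippet below shows one way the MongoDB settings from `.env` could be read in Python using python-dotenv. The variable names (`MONGO_HOST`, `MONGO_PORT`) are assumptions; the actual keys are listed in `.env.example`.

```python
import os

from dotenv import load_dotenv  # from the python-dotenv package

# Load the variables defined in .env into the process environment.
load_dotenv()

# NOTE: these variable names are assumptions for illustration; the actual
# keys are listed in .env.example.
mongo_host = os.getenv("MONGO_HOST", "localhost")
mongo_port = int(os.getenv("MONGO_PORT", "27017"))

print(f"MongoDB at {mongo_host}:{mongo_port}")
```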
```bash
# start development server
python main.py web
```
The core code is located in the `webspot` directory. The `main.py` file is the entry point of the web server.
```
webspot
├── cmd # command line tools
├── crawler # web crawler
├── data # data files (html, json, etc.)
├── db # database
├── detect # web content detection
├── graph # graph module
├── models # models
├── request # request helper
├── test # test cases
├── utils # utilities
└── web # web server
```
Webspot aims to automate the process of web content detection and extraction, but it is far from ready for production use. The following features are planned:
- Table detection
- Nested list detection
- Export to spiders
- Advanced browser request
Please follow local laws and regulations when using Webspot. The author is not responsible for any legal issues caused by its use. Please read the Disclaimer for details.
If you are interested in Webspot, please add the author's WeChat account "tikazyq1" with the note "Webspot" to join the discussion group.