Skip to content

A tool that automates web search, traversal, and extraction of information on unstructured web pages.

License

Notifications You must be signed in to change notification settings

noahshinn/web-browser

Repository files navigation

Web Browser

A tool that automates web search, traversal, and extraction of information on unstructured web pages.

Requirements

You need to have the following installed:

To run

make server

This will start the search server and the searxng instance.

You can test the server by running the following command:

curl -X POST http://localhost:8095/v1/agent_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is sequence parallelism",
    "search_strategy": "human"
  }'

This will return a JSON object with the search results.

Options

Search strategies

Several search strategies are supported. You can specify the strategy in the JSON body:

curl -X POST http://localhost:8095/v1/agent_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is sequence parallelism",
    "search_strategy": "parallel"
  }'

The following strategies are supported:

  • human: (default) Searches the web like a human (one result at a time) by choosing the most relevant webpage to visit at each step and terminating when the query is comprehensively answered.
  • parallel: (fast) Searches the web in parallel by visiting all of the results at once and aggregating the results at the end.
  • sequential: (slow) Searches the web in sequential by visiting the results one at a time.
  • parallel_tree: (hybrid) Builds a dependency tree of the results and auto-optimizes the traversal to process all of the results in parallel while respecting dependencies.

Query strategies

You can specify the query strategy in the JSON body:

curl -X POST http://localhost:8095/v1/agent_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is sequence parallelism",
    "query_strategy": "single"
  }'

The following query strategies are supported:

  • verbatim: (default) Uses the original query.
  • single: (fast) Synthesizes a single query to search.
  • parallel: (fast) Synthesizes one or more queries to search; visits the results in parallel.
  • sequential: (slow) Synthesizes one or more queries to search; visits the results sequentially.

Number of results to visit

You can specify the number of results to visit with the max_results_to_visit field in the JSON body (default is 10).

Whitelisting and blacklisting base URLs

You can specify the whitelisted and blacklisted base URLs with the whitelisted_base_urls and blacklisted_base_urls fields in the JSON body:

  • whitelisted_base_urls: Only the results from the whitelisted base URLs will be visited.
  • blacklisted_base_urls: The results from the blacklisted base URLs will not be visited.

For example, to whitelist github.com, you can run the following command:

curl -X POST http://localhost:8095/v1/agent_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is sequence parallelism",
    "whitelisted_base_urls": ["github.com"]
  }'

Result format

You can specify the result format with the result_format field in the JSON body. The following formats are supported:

  • answer: (default) Formats the result as an answer.
  • research_summary: Formats the result as a research summary.
  • faq_article: Formats the result as a FAQ article.
  • news_article: Formats the result as a news article.
  • webpage: Formats the result as a webpage.
  • custom: Formats the result as a custom format according to the custom format description.

For example, to format the result as a research summary, you can run the following command:

curl -X POST http://localhost:8095/v1/agent_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is sequence parallelism",
    "result_format": "research_summary"
  }'

To format the result as a custom format (such as a markdown table), you can run the following command:

curl -X POST http://localhost:8095/v1/agent_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is the founding date of each of the top 10 market cap companies in the world",
    "result_format": "custom",
    "custom_result_format_description": "Format the results as a markdown table with the following columns: Company Name, Founding Date"
  }'

Other features

Scraping a website

This feature allows you to scrape all of the pages in a site (by base URL) and format the result as cleaned HTML or markdown. Traditional web scraping tools perform this operation by visiting the starting page and following links to other pages. This tool finds all of the pages that have a common base URL, even if they are "orphan" pages without a link to them from any page.

curl -X POST http://localhost:8095/v1/scrape_site \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": <base_url>,
  }'

For example, to scrape Olukai's help center, support.olukai.com, you can run the following command:

curl -X POST http://localhost:8095/v1/scrape_site \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "support.olukai.com",
  }'

You can also specify the maximum number of pages to visit with the max_num_pages_to_visit field in the JSON body (default is 2000).

curl -X POST http://localhost:8095/v1/scrape_site \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "support.olukai.com",
    "max_num_pages_to_visit": 500
  }'

You can also specify the result format with the result_format field in the JSON body. The following formats are supported:

  • html: (default) Formats the result as cleaned HTML.
  • md: Formats the result as markdown. Uses a language model to transform the cleaned HTML into markdown.

For example, to scrape Olukai's help center and format the result as markdown, you can run the following command:

curl -X POST http://localhost:8095/v1/scrape_site \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "support.olukai.com",
    "max_num_pages_to_visit": 5,
    "result_format": "md"
  }'

You can pass a max concurrency to the scrape site endpoint with the max_concurrency field in the JSON body (default is 10).

curl -X POST http://localhost:8095/v1/scrape_site \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "support.olukai.com",
    "max_num_pages_to_visit": 5,
    "result_format": "md",
    "max_concurrency": 5
  }'

Development

You can run the server with the following command:

make dev-server

You can also run the server with the following steps:

First, clone the submodules:

git submodule update --init --recursive

Make sure that the following environment variables are set:

export SEARX_PORT=8096

Then, run the searxng-docker container:

docker compose -f searxng-docker/docker-compose.yaml up -d

A searxng instance will be running on port 8096.

You can test the searxng instance by navigating to http://localhost:8096/ in your browser or by running the following command:

curl http://localhost:8096/search?q=what+is+sequence+parallelism

Then, open a new shell and set the following environment variables:

export SEARX_HOST=localhost
export SEARX_PORT=8096
export ANTHROPIC_API_KEY=...
export WEB_SEARCH_SERVER_PORT=8095

By default, LLM calls are routed to claude-3-5-sonnet-20241022 from Anthropic. You can change the model and provider by setting the following environment variables before starting the server:

export DEFAULT_LLM_MODEL=...
export DEFAULT_LLM_PROVIDER=...

Then, run the server:

cd server && cargo run -- --port ${WEB_SEARCH_SERVER_PORT:-8095}

About

A tool that automates web search, traversal, and extraction of information on unstructured web pages.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published