Getting Started with Crawl4AI

Welcome to Crawl4AI, an open-source, LLM-friendly web crawler and scraper. In this tutorial, you’ll:

  1. Install Crawl4AI (both via pip and Docker, with notes on platform challenges).
  2. Run your first crawl using minimal configuration.
  3. Generate Markdown output (and learn how it’s influenced by content filters).
  4. Experiment with a simple CSS-based extraction strategy.
  5. See a glimpse of LLM-based extraction (including open-source and closed-source model options).

1. Introduction

Crawl4AI provides:

  • An asynchronous crawler, AsyncWebCrawler.
  • Configurable browser and run settings via BrowserConfig and CrawlerRunConfig.
  • Automatic HTML-to-Markdown conversion via DefaultMarkdownGenerator (supports additional filters).
  • Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).

By the end of this guide, you’ll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.


2. Installation

2.1 Python + Playwright

Basic Pip Installation

pip install crawl4ai
crawl4ai-setup

# Verify your installation
crawl4ai-doctor

If you encounter any browser-related issues, you can install them manually:

python -m playwright install --with-deps chrome chromium

  • crawl4ai-setup installs and configures Playwright (Chromium by default).

We cover advanced installation and Docker in the Installation section.


3. Your First Crawl

Here’s a minimal Python script that creates an AsyncWebCrawler, fetches a webpage, and prints the first 300 characters of its Markdown output:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())

What’s happening?

  • AsyncWebCrawler launches a headless browser (Chromium by default).
  • It fetches https://example.com.
  • Crawl4AI automatically converts the HTML into Markdown.

You now have a simple, working crawl!


4. Basic Configuration (Light Introduction)

Crawl4AI’s crawler can be heavily customized using two main classes:

  1. BrowserConfig: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
  2. CrawlerRunConfig: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).

Below is an example with minimal usage:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.


5. Generating Markdown Output

By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a markdown generator or content filter.

  • result.markdown.raw_markdown:
    The direct HTML-to-Markdown conversion.
  • result.markdown.fit_markdown:
    The same content after applying any configured content filter (e.g., PruningContentFilter).

Example: Using a Filter with DefaultMarkdownGenerator

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(markdown_generator=md_generator)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com", config=config)
    print("Raw Markdown length:", len(result.markdown.raw_markdown))
    print("Fit Markdown length:", len(result.markdown.fit_markdown))

Note: If you do not specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated Markdown Generation tutorial.
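The pruning idea can be illustrated with a small standalone sketch. This is a toy illustration of threshold-based filtering, not the library’s actual scoring algorithm: each block of the page gets a score, and blocks below the threshold are dropped.

```python
# Toy illustration of threshold-based pruning (NOT PruningContentFilter's
# real algorithm): score each block by the fraction of its characters that
# are plain text rather than link text, and keep blocks above a threshold.

def toy_prune(blocks, threshold=0.4):
    """blocks: list of (text_chars, link_chars) tuples; returns kept blocks."""
    kept = []
    for text_chars, link_chars in blocks:
        total = text_chars + link_chars
        score = text_chars / total if total else 0.0
        if score >= threshold:
            kept.append((text_chars, link_chars))
    return kept

blocks = [
    (900, 100),  # article paragraph: mostly text  -> kept (score 0.9)
    (20, 380),   # navigation bar: mostly links    -> pruned (score 0.05)
    (50, 50),    # mixed sidebar block             -> kept (score 0.5)
]
print(len(toy_prune(blocks)))  # -> 2
```

A fixed `threshold_type` compares each score against the same cutoff, which is why `threshold=0.4` above keeps the two text-heavy blocks and discards the link-heavy one.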


6. Simple Data Extraction (CSS-based)

Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/items",
            config=CrawlerRunConfig(
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())

Why is this helpful?

  • Great for repetitive page structures (e.g., item listings, articles).
  • No AI usage or costs.
  • The crawler returns a JSON string you can parse or store.
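Because extracted_content is a plain JSON string, you can post-process it with standard tools. Below is a stdlib-only sketch that parses a hypothetical result shaped like the schema above (the real string depends on the crawled page):

```python
import json

# Hypothetical value of result.extracted_content for the schema above;
# the actual output depends on the page you crawl.
extracted_content = '''[
    {"title": "First Item", "link": "/items/1"},
    {"title": "Second Item", "link": "/items/2"}
]'''

data = json.loads(extracted_content)       # one dict per baseSelector match
titles = [item["title"] for item in data]  # pull out a single field
print(titles)  # -> ['First Item', 'Second Item']
```

Each element of the list corresponds to one `div.item` matched by `baseSelector`, with keys taken from the schema’s field names.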

7. Simple Data Extraction (LLM-based)

For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports open-source or closed-source providers:

  • Open-Source Models (e.g., ollama/llama3.3, no_token)
  • OpenAI Models (e.g., openai/gpt-4, requires api_token)
  • Or any provider supported by the underlying library

Below is an example that configures both styles, an open-source model (no token) and a closed-source one:

import os
import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class PricingInfo(BaseModel):
    model_name: str = Field(..., description="Name of the AI model")
    input_fee: str = Field(..., description="Fee for input tokens")
    output_fee: str = Field(..., description="Fee for output tokens")

async def main():
    # 1) Open-Source usage: no token required
    llm_strategy_open_source = LLMExtractionStrategy(
        provider="ollama/llama3.3",  # or "any-other-local-model"
        api_token="no_token",       # for local models, no API key is typically required
        schema=PricingInfo.schema(),
        extraction_type="schema",
        instruction="""
            From this page, extract all AI model pricing details in JSON format.
            Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        temperature=0
    )

    # 2) Closed-Source usage: API key for OpenAI, for example
    openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
    llm_strategy_openai = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=openai_token,
        schema=PricingInfo.schema(),
        extraction_type="schema",
        instruction="""
            From this page, extract all AI model pricing details in JSON format.
            Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        temperature=0
    )

    # We'll demo the open-source approach here
    config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/pricing",
            config=config
        )
        print("LLM-based extraction JSON:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

What’s happening?

  • We define a Pydantic schema (PricingInfo) describing the fields we want.
  • The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
  • Depending on the provider and api_token, you can use local models or a remote API.

8. Next Steps

Congratulations! You have:

  1. Installed Crawl4AI (via pip, with Docker as an option).
  2. Performed a simple crawl and printed Markdown.
  3. Seen how adding a markdown generator + content filter can produce “fit” Markdown.
  4. Experimented with CSS-based extraction for repetitive data.
  5. Learned the basics of LLM-based extraction (open-source and closed-source).

If you are ready for more, check out:

  • Installation: Learn more on how to install Crawl4AI and set up Playwright.
  • Focus on Configuration: Learn to customize browser settings, caching modes, advanced timeouts, etc.
  • Markdown Generation Basics: Dive deeper into content filtering and “fit markdown” usage.
  • Dynamic Pages & Hooks: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
  • Deployment: Run Crawl4AI in Docker containers and scale across multiple nodes.
  • Explanations & How-To Guides: Explore browser contexts, identity-based crawling, hooking, performance, and more.

Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!