Welcome to Crawl4AI, an open-source, LLM-friendly web crawler and scraper. In this tutorial, you’ll:
- Install Crawl4AI (both via pip and Docker, with notes on platform challenges).
- Run your first crawl using minimal configuration.
- Generate Markdown output (and learn how it’s influenced by content filters).
- Experiment with a simple CSS-based extraction strategy.
- See a glimpse of LLM-based extraction (including open-source and closed-source model options).
Crawl4AI provides:
- An asynchronous crawler, AsyncWebCrawler.
- Configurable browser and run settings via BrowserConfig and CrawlerRunConfig.
- Automatic HTML-to-Markdown conversion via DefaultMarkdownGenerator (supports additional filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).
By the end of this guide, you’ll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.
pip install crawl4ai
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
If you encounter any browser-related issues, you can install the browsers manually:
python -m playwright install --with-deps chrome chromium
crawl4ai-setup installs and configures Playwright (Chromium by default).
We cover advanced installation and Docker in the Installation section.
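As a complement to crawl4ai-doctor, the snippet below is a small sketch of how you might confirm from Python that the package is visible in the current environment. It uses only the standard library; the helper name installed_version is ours, and it assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the name used with pip ("crawl4ai").
version = installed_version("crawl4ai")
if version is None:
    print("crawl4ai is not installed in this environment")
else:
    print(f"crawl4ai {version} is installed")
```

This only checks that the package is installed; it does not verify the browser setup, which is what crawl4ai-doctor is for.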
Here’s a minimal Python script that creates an AsyncWebCrawler, fetches a webpage, and prints the first 300 characters of its Markdown output:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
What’s happening?
- AsyncWebCrawler launches a headless browser (Chromium by default).
- It fetches https://example.com.
- Crawl4AI automatically converts the HTML into Markdown.
You now have a simple, working crawl!
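A crawl can fail (network errors, timeouts, blocked pages), so it is often worth checking the result before printing it. The sketch below assumes the result object exposes success and error_message attributes alongside markdown; the preview helper and the stand-in objects are hypothetical and exist only so the snippet runs without a browser.

```python
from types import SimpleNamespace


def preview(result, limit: int = 300) -> str:
    """Return a short Markdown preview, or an error line if the crawl failed."""
    if not getattr(result, "success", False):
        return f"Crawl failed: {getattr(result, 'error_message', 'unknown error')}"
    return str(result.markdown)[:limit]


# Stand-in objects for illustration only; in a real script you would pass the
# value returned by `await crawler.arun(...)` instead.
ok = SimpleNamespace(success=True, error_message=None,
                     markdown="# Example Domain\n\nSome text.")
failed = SimpleNamespace(success=False, error_message="example failure",
                         markdown="")

print(preview(ok, limit=16))
print(preview(failed))
```

In the script above, you would call print(preview(result)) in place of print(result.markdown[:300]).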
Crawl4AI’s crawler can be heavily customized using two main classes:
- BrowserConfig: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
- CrawlerRunConfig: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
Below is an example with minimal usage:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    # Use the CacheMode enum rather than a plain string to bypass the cache
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a markdown generator or content filter.
- result.markdown: The direct HTML-to-Markdown conversion.
- result.markdown.fit_markdown: The same content after applying any configured content filter (e.g., PruningContentFilter).
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Attach a pruning filter so that "fit" Markdown is produced alongside the raw output
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
Note: If you do not specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated Markdown Generation tutorial.
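Because the filtered output may be empty when no content filter is configured, downstream code often wants a fallback. The following is a minimal sketch, assuming the Markdown result exposes raw_markdown and fit_markdown attributes as in the example above; the best_markdown helper and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def best_markdown(markdown_result) -> str:
    """Prefer the filtered ("fit") Markdown, falling back to the raw conversion."""
    fit = getattr(markdown_result, "fit_markdown", None)
    if fit:  # None or "" is treated as "no filtered output available"
        return fit
    return markdown_result.raw_markdown


# Stand-ins for `result.markdown`, for illustration only.
filtered = SimpleNamespace(raw_markdown="nav | nav\n\n# Title\n\nBody",
                           fit_markdown="# Title\n\nBody")
unfiltered = SimpleNamespace(raw_markdown="# Title\n\nBody", fit_markdown="")

print(best_markdown(filtered))
print(best_markdown(unfiltered))
```

With a real crawl you would call best_markdown(result.markdown).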
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/items",
            config=CrawlerRunConfig(
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
Why is this helpful?
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
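To illustrate the "parse or store" step, here is a small standard-library sketch that post-processes the JSON string. It assumes extracted_content decodes to a list of objects keyed by the schema's field names; the sample string, the clean_items helper, and the choice of JSON Lines as a storage format are ours and are not part of Crawl4AI.

```python
import json

# Hypothetical stand-in for `result.extracted_content` from the example above.
sample_extracted_content = json.dumps([
    {"title": "First item", "link": "/items/1"},
    {"title": "Second item"},            # link could not be extracted
    {"title": "", "link": "/items/3"},   # empty title
])


def clean_items(extracted_content: str, required=("title", "link")) -> list:
    """Parse the JSON string and keep only entries where every required field is non-empty."""
    items = json.loads(extracted_content)
    return [item for item in items if all(item.get(key) for key in required)]


items = clean_items(sample_extracted_content)

# One JSON object per line (JSON Lines) is convenient for appending to a file.
jsonl = "\n".join(json.dumps(item) for item in items)
print(jsonl)
```

Dropping incomplete entries is only one possible policy; you might prefer to keep them and fill in defaults instead.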
For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports open-source or closed-source providers:
- Open-Source Models (e.g., ollama/llama3.3, no_token)
- OpenAI Models (e.g., openai/gpt-4, requires api_token)
- Or any provider supported by the underlying library
Below is an example that configures both an open-source provider (no token) and a closed-source one:
import os
import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class PricingInfo(BaseModel):
    model_name: str = Field(..., description="Name of the AI model")
    input_fee: str = Field(..., description="Fee for input tokens")
    output_fee: str = Field(..., description="Fee for output tokens")

async def main():
    # 1) Open-Source usage: no token required
    llm_strategy_open_source = LLMExtractionStrategy(
        provider="ollama/llama3.3",  # or "any-other-local-model"
        api_token="no_token",        # for local models, no API key is typically required
        schema=PricingInfo.schema(),
        extraction_type="schema",
        instruction="""
        From this page, extract all AI model pricing details in JSON format.
        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        temperature=0
    )

    # 2) Closed-Source usage: API key for OpenAI, for example
    openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
    llm_strategy_openai = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=openai_token,
        schema=PricingInfo.schema(),
        extraction_type="schema",
        instruction="""
        From this page, extract all AI model pricing details in JSON format.
        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        temperature=0
    )

    # We'll demo the open-source approach here
    config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/pricing",
            config=config
        )
        print("LLM-based extraction JSON:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
What’s happening?
- We define a Pydantic schema (PricingInfo) describing the fields we want.
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
- Depending on the provider and api_token, you can use local models or a remote API.
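LLM output is not guaranteed to match the schema exactly, so it can help to check the returned JSON before using it. The sketch below uses only the standard library and assumes extracted_content decodes to a list of objects (or a single object) with the three PricingInfo fields; the PricingRow dataclass, the parse_pricing helper, and the sample data are hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PricingRow:
    """Plain-Python mirror of the PricingInfo fields, used only for checking."""
    model_name: str
    input_fee: str
    output_fee: str


def parse_pricing(extracted_content: str):
    """Split decoded entries into well-formed rows and rejected entries."""
    data = json.loads(extracted_content)
    if isinstance(data, dict):  # tolerate a single object instead of a list
        data = [data]
    names = [f.name for f in fields(PricingRow)]
    rows, rejected = [], []
    for entry in data:
        if isinstance(entry, dict) and all(isinstance(entry.get(n), str) for n in names):
            rows.append(PricingRow(**{n: entry[n] for n in names}))
        else:
            rejected.append(entry)
    return rows, rejected


# Made-up sample standing in for `result.extracted_content`.
sample = json.dumps([
    {"model_name": "model-a", "input_fee": "$1 / 1M tokens", "output_fee": "$2 / 1M tokens"},
    {"model_name": "model-b", "input_fee": "$3 / 1M tokens"},  # output_fee missing
])

rows, rejected = parse_pricing(sample)
print(len(rows), "valid;", len(rejected), "rejected")
```

If Pydantic is already a dependency, validating each entry with the PricingInfo model itself would serve the same purpose.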
Congratulations! You have:
- Installed Crawl4AI (via pip, with Docker as an option).
- Performed a simple crawl and printed Markdown.
- Seen how adding a markdown generator + content filter can produce “fit” Markdown.
- Experimented with CSS-based extraction for repetitive data.
- Learned the basics of LLM-based extraction (open-source and closed-source).
If you are ready for more, check out:
- Installation: Learn more on how to install Crawl4AI and set up Playwright.
- Focus on Configuration: Learn to customize browser settings, caching modes, advanced timeouts, etc.
- Markdown Generation Basics: Dive deeper into content filtering and “fit markdown” usage.
- Dynamic Pages & Hooks: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
- Deployment: Run Crawl4AI in Docker containers and scale across multiple nodes.
- Explanations & How-To Guides: Explore browser contexts, identity-based crawling, hooking, performance, and more.
Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!