# AsyncWebCrawler Basics

This tutorial builds on the Getting Started section by introducing key configuration parameters and the structure of the crawl result.
In this tutorial, you'll learn how to:

- Create and configure an `AsyncWebCrawler` instance
- Understand the `CrawlResult` object returned by `arun()`
- Use basic `BrowserConfig` and `CrawlerRunConfig` options to tailor your crawl
## Prerequisites
- You’ve already completed the Getting Started tutorial (or have equivalent knowledge).
- You have Crawl4AI installed and configured with Playwright.
## What Is AsyncWebCrawler?

`AsyncWebCrawler` is the central class for running asynchronous crawling operations in Crawl4AI. It manages browser sessions, handles dynamic pages (if needed), and provides you with a structured result object for each crawl. Essentially, it's your high-level interface for collecting page data.
```python
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com")
    print(result)
```
## A Basic Example

Below is a simple code snippet showing how to create and use `AsyncWebCrawler`. This goes one step beyond the minimal example you saw in Getting Started.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig


async def main():
    # 1. Set up configuration objects (optional if you want defaults)
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,
        verbose=True
    )

    crawler_config = CrawlerRunConfig(
        page_timeout=30000,  # 30 seconds
        wait_for_images=True,
        verbose=True
    )

    # 2. Initialize AsyncWebCrawler with your chosen browser config
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # 3. Run a single crawl
        url_to_crawl = "https://example.com"
        result = await crawler.arun(url=url_to_crawl, config=crawler_config)

        # 4. Inspect the result
        if result.success:
            print(f"Successfully crawled: {result.url}")
            print(f"HTML length: {len(result.html)}")
            print(f"Markdown snippet: {result.markdown[:200]}...")
        else:
            print(f"Failed to crawl {result.url}. Error: {result.error_message}")


if __name__ == "__main__":
    asyncio.run(main())
```
- `BrowserConfig` is optional, but it's the place to specify browser-related settings (e.g., `headless`, `browser_type`).
- `CrawlerRunConfig` deals with how you want the crawler to behave for this particular run (timeouts, waiting for images, etc.).
- `arun()` is the main method to crawl a single URL. We'll see how `arun_many()` works in later tutorials; a quick sketch follows below.
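As a preview, here is a minimal sketch of what batch crawling with `arun_many()` might look like. It assumes the method accepts a list of URLs plus a shared `CrawlerRunConfig` and returns one `CrawlResult` per URL; check the later tutorial (or your installed version) for the exact behavior:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def crawl_batch():
    config = CrawlerRunConfig(page_timeout=30000)
    async with AsyncWebCrawler() as crawler:
        # Assumption: arun_many() takes a list of URLs and returns a
        # list of CrawlResult objects, one per URL.
        results = await crawler.arun_many(
            urls=["https://example.com", "https://example.org"],
            config=config,
        )
        for result in results:
            print(result.url, "->", "OK" if result.success else result.error_message)


if __name__ == "__main__":
    asyncio.run(crawl_batch())
```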
## Understanding the CrawlResult Object

When you call `arun()`, you get back a `CrawlResult` object containing all the relevant data from that crawl attempt. Some common fields include:
```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    screenshot: Optional[str] = None  # base64-encoded screenshot if requested
    pdf: Optional[bytes] = None       # binary PDF data if requested
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    markdown_v2: Optional[MarkdownGenerationResult] = None
    error_message: Optional[str] = None
    # ... plus other fields like status_code, ssl_certificate, extracted_content, etc.
```
- `success`: `True` if the crawl succeeded, `False` otherwise.
- `html`: The raw HTML (or final rendered state if JavaScript was executed).
- `markdown` / `markdown_v2`: Contains the automatically generated Markdown representation of the page.
- `media`: A dictionary with lists of extracted images, videos, or audio elements.
- `links`: A dictionary with lists of "internal" and "external" link objects.
- `error_message`: If `success` is `False`, this often contains a description of the error.
Example:
```python
if result.success:
    print("Page Title or snippet of HTML:", result.html[:200])
    if result.markdown:
        print("Markdown snippet:", result.markdown[:200])
    print("Links found:", len(result.links.get("internal", [])), "internal links")
else:
    print("Error crawling:", result.error_message)
```
## Basic Configuration Options

Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might tweak early on. We'll cover more advanced ones (like proxies, PDF, or screenshots) in later tutorials.
**BrowserConfig**

| Parameter | Description | Default |
|---|---|---|
| `browser_type` | Which browser engine to use: `"chromium"`, `"firefox"`, or `"webkit"`. | `"chromium"` |
| `headless` | Run the browser with no UI window. If `False`, you see the browser. | `True` |
| `verbose` | Print extra logs for debugging. | `True` |
| `java_script_enabled` | Toggle JavaScript. When `False`, you might speed up loads but lose dynamic content. | `True` |
**CrawlerRunConfig**

| Parameter | Description | Default |
|---|---|---|
| `page_timeout` | Maximum time in ms to wait for the page to load or scripts. | `30000` (30s) |
| `wait_for_images` | Wait for images to fully load. Good for accurate rendering. | `True` |
| `css_selector` | Target only certain elements for extraction. | `None` |
| `excluded_tags` | Skip certain HTML tags (like `nav`, `footer`, etc.). | `None` |
| `verbose` | Print logs for debugging. | `True` |
> **Tip:** Don't worry if you see lots of parameters. You'll learn them gradually in later tutorials.
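As a concrete (if hypothetical) example of combining the run options above, a config that skips boilerplate markup and focuses on the main content might look like this; the selector and tag names are placeholders for whatever your target site uses:

```python
from crawl4ai import CrawlerRunConfig

# Hypothetical "focused" run config built from the parameters above.
focused_cfg = CrawlerRunConfig(
    page_timeout=30000,               # give slow pages up to 30 seconds
    wait_for_images=True,             # wait for images before capturing
    css_selector="main",              # placeholder: extract only <main> content
    excluded_tags=["nav", "footer"],  # skip navigation and footer markup
)
```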
## A Note for Windows Users

When using `AsyncWebCrawler` on Windows, you might encounter a `NotImplementedError` related to `asyncio.create_subprocess_exec`. This is a known Windows-specific issue: the selector-based event loop used in some environments on Windows doesn't support subprocess operations.

To resolve this, Crawl4AI provides a utility function that configures Windows to use the ProactorEventLoop. Call it before running any async operations:
```python
from crawl4ai.utils import configure_windows_event_loop

# Call this before any async operations if you're on Windows
configure_windows_event_loop()

# Your AsyncWebCrawler code here
```
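If you'd rather not rely on the helper, the standard library exposes the same fix directly (Python 3.8+; the policy class only exists on Windows, hence the platform guard):

```python
import asyncio
import sys

# Equivalent standard-library fix: the proactor loop supports the
# subprocess operations that Playwright needs on Windows.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
```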
## Putting It All Together

Here's a slightly more in-depth example that shows off a few key config parameters at once:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig


async def main():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        java_script_enabled=True,
        verbose=False
    )

    crawler_cfg = CrawlerRunConfig(
        page_timeout=30000,  # wait up to 30 seconds
        wait_for_images=True,
        css_selector=".article-body",  # only extract content under this CSS selector
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://news.example.com", config=crawler_cfg)
        if result.success:
            print("[OK] Crawled:", result.url)
            print("HTML length:", len(result.html))
            # markdown_v2 is Optional in the schema above, so guard before using it
            if result.markdown_v2:
                print("Extracted Markdown:", result.markdown_v2.raw_markdown[:300])
        else:
            print("[ERROR]", result.error_message)


if __name__ == "__main__":
    asyncio.run(main())
```
Key observations:

- `css_selector=".article-body"` ensures we only focus on the main content region.
- `page_timeout=30000` helps if the site is slow.
- We turned off `verbose` logs for the browser but kept them on for the crawler config.
## Next Steps

- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the next tutorial.
- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated Hooks Tutorial.
- **Reference**: For a complete list of every parameter in `BrowserConfig` and `CrawlerRunConfig`, check out the Reference section.
## Summary

You now know the basics of `AsyncWebCrawler`:

- How to create it with optional browser/crawler configs
- How `arun()` works for single-page crawls
- Where to find your crawled data in `CrawlResult`
- A handful of frequently used configuration parameters
From here, you can refine your crawler to handle more advanced scenarios, like focusing on specific content or dealing with dynamic elements. Let’s move on to Smart Crawling Techniques to learn how to handle iframes, advanced caching, and more.
Last updated: 2024-XX-XX
Keep exploring! If you get stuck, remember to check out the How-To Guides for targeted solutions or the Explanations for deeper conceptual background.