Skip to content

Latest commit

 

History

History
335 lines (248 loc) · 13.6 KB

hooks-custom.md

File metadata and controls

335 lines (248 loc) · 13.6 KB

Hooks & Custom Code

Crawl4AI supports a hook system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:

  • Authentication (log in before navigating)
  • Content manipulation (modify HTML, inject scripts, etc.)
  • Session or browser configuration (e.g., adjusting user agents, local storage)
  • Custom data collection (scrape extra details or track state at each stage)

In this tutorial, you’ll learn about:

  1. What hooks are available
  2. How to attach code to each hook
  3. Practical examples (auth flows, user agent changes, content manipulation, etc.)

Prerequisites


1. Overview of Available Hooks

Hook Name Called When / Purpose Context / Objects Provided
on_browser_created Immediately after the browser is launched, but before any page or context is created. Browser object only (no page yet). Use it for broad browser-level config.
on_page_context_created Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. Typically provides page and context.
on_user_agent_updated Whenever the user agent changes. For advanced user agent logic or additional header updates. Typically provides page and updated user agent string.
on_execution_started Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. Typically provides page, possibly context.
before_goto Right before navigating to the URL (i.e., page.goto(...)). Great for setting cookies, altering the URL, or hooking in authentication steps. Typically provides page, context, and goto_params.
after_goto Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. Typically provides page, context, response.
before_retrieve_html Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). Typically provides page or final HTML reference.
before_return_html Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. Typically provides final HTML or a page.

A Note on on_browser_created (the “unbrowser” hook)

  • No page object is available because no page context exists yet. You can, however, set up browser-wide properties.
  • For example, you might control [CDP sessions][cdp] or advanced browser flags here.

2. Registering Hooks

You can attach hooks by calling:

crawler.crawler_strategy.set_hook("hook_name", your_hook_function)

or by passing a hooks dictionary to AsyncWebCrawler or your strategy constructor:

hooks = {
    "before_goto": my_before_goto_hook,
    "after_goto": my_after_goto_hook,
    # ... etc.
}
async with AsyncWebCrawler(hooks=hooks) as crawler:
    ...

Hook Signature

Each hook is a function (async or sync, depending on your usage) that receives certain parameters—most often page, context, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.


3. Real-Life Examples

Below are concrete scenarios where hooks come in handy.


3.1 Authentication Before Navigation

One of the most frequent tasks is logging in or applying authentication before the crawler navigates to a URL (so that the user is recognized immediately).

Using before_goto

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    """
    Example: Set cookies or localStorage to simulate login.
    This hook runs right before page.goto() is called.
    """
    # Example: Insert cookie-based auth or local storage data
    # (You could also do more complex actions, like fill forms if you already have a 'page' open.)
    print("[HOOK] Setting auth data before goto.")
    await context.add_cookies([
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/"
        }
    ])
    # Optionally manipulate goto_params if needed:
    # goto_params["url"] = goto_params["url"] + "?debug=1"

async def main():
    hooks = {
        "before_goto": before_goto_auth_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Logged in and fetched protected page.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Key Points

  • before_goto receives page, context, goto_params so you can add cookies, localStorage, or even change the URL itself.
  • If you need to run a real login flow (submitting forms), consider on_browser_created or on_page_context_created if you want to do it once at the start.

3.2 Setting Up the Browser in on_browser_created

If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use on_browser_created. No page is available yet, but you can set up the browser instance itself.

async def on_browser_created_hook(browser, **kwargs):
    """
    Runs immediately after the browser is created, before any pages.
    'browser' here is a Playwright Browser object.
    """
    print("[HOOK] Browser created. Setting up custom stuff.")
    # Possibly connect to DevTools or create an incognito context
    # Example (pseudo-code):
    # devtools_url = await browser.new_context(devtools=True)

# Usage:
async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
    ...

3.3 Adjusting Page or Context in on_page_context_created

If you’d like to set default timeouts or inject scripts right after a page context is spun up:

async def on_page_context_created_hook(page, context, **kwargs):
    print("[HOOK] Page context created. Setting default timeouts or scripts.")
    await page.set_default_timeout(20000)  # 20 seconds
    # Possibly inject a script or set user locale

# Usage:
hooks = {
    "on_page_context_created": on_page_context_created_hook
}

3.4 Dynamically Updating User Agents

on_user_agent_updated is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:

async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
    print(f"[HOOK] User agent updated to {new_ua}")
    # Maybe add a custom header based on new UA
    await context.set_extra_http_headers({"X-UA-Source": new_ua})

hooks = {
    "on_user_agent_updated": on_user_agent_updated_hook
}

3.5 Initializing Stuff with on_execution_started

on_execution_started runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).

async def on_execution_started_hook(page, context, **kwargs):
    print("[HOOK] Execution started. Setting a start timestamp or logging.")
    context.set_default_navigation_timeout(45000)  # 45s if your site is slow

hooks = {
    "on_execution_started": on_execution_started_hook
}

3.6 Post-Processing with after_goto

After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations—like verifying you’re on the right page, or removing interstitials:

async def after_goto_hook(page, context, response, **kwargs):
    """
    Called right after page.goto() finishes, but before the crawler extracts HTML.
    """
    if response and response.ok:
        print("[HOOK] After goto. Status:", response.status)
        # Maybe remove popups or check if we landed on a login failure page.
        await page.evaluate("""() => {
            const popup = document.querySelector(".annoying-popup");
            if (popup) popup.remove();
        }""")
    else:
        print("[HOOK] Navigation might have failed, status not ok or no response.")

hooks = {
    "after_goto": after_goto_hook
}

3.7 Last-Minute Modifications in before_retrieve_html or before_return_html

Sometimes you need to tweak the page or raw HTML right before it’s captured.

async def before_retrieve_html_hook(page, context, **kwargs):
    """
    Modify the DOM just before the crawler finalizes the HTML.
    """
    print("[HOOK] Removing adverts before capturing HTML.")
    await page.evaluate("""() => {
        const ads = document.querySelectorAll(".ad-banner");
        ads.forEach(ad => ad.remove());
    }""")

async def before_return_html_hook(page, context, html, **kwargs):
    """
    'html' is the near-finished HTML string. Return an updated string if you like.
    """
    # For example, remove personal data or certain tags from the final text
    print("[HOOK] Sanitizing final HTML.")
    sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
    return sanitized_html

hooks = {
    "before_retrieve_html": before_retrieve_html_hook,
    "before_return_html": before_return_html_hook
}

Note: If you want to make last-second changes in before_return_html, you can manipulate the html string directly. Return a new string if you want to override.


4. Putting It All Together

You can combine multiple hooks in a single run. For instance:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def on_browser_created_hook(browser, **kwargs):
    print("[HOOK] Browser is up, no page yet. Good for broad config.")

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    print("[HOOK] Adding cookies for auth.")
    await context.add_cookies([{"name": "session", "value": "abcd1234", "domain": "example.com"}])

async def after_goto_log_hook(page, context, response, **kwargs):
    if response:
        print("[HOOK] after_goto: Status code:", response.status)

async def main():
    hooks = {
        "on_browser_created": on_browser_created_hook,
        "before_goto": before_goto_auth_hook,
        "after_goto": after_goto_log_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Protected page length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

This example:

  1. on_browser_created sets up the brand-new browser instance.
  2. before_goto ensures you inject an auth cookie before accessing the page.
  3. after_goto logs the resulting HTTP status code.

5. Common Pitfalls & Best Practices

  1. Hook Order: If multiple hooks do overlapping tasks (e.g., two before_goto hooks), be mindful of conflicts or repeated logic.
  2. Async vs Sync: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects async, define async def.
  3. Mutating goto_params: goto_params is a dict that eventually goes to Playwright’s page.goto(). Changing the url or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
  4. Browser vs Page vs Context: Not all hooks have both page and context. For example, on_browser_created only has access to browser.
  5. Avoid Overdoing It: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider if a separate “how-to” function with a simpler approach might suffice.

Conclusion & Next Steps

Hooks let you bend Crawl4AI to your will:

  • Authentication (cookies, localStorage) with before_goto
  • Browser-level config with on_browser_created
  • Page or context config with on_page_context_created
  • Content modifications before capturing HTML (before_retrieve_html or before_return_html)

Where to go next:

With the hook system, you have near-complete control over the browser’s lifecycle—whether it’s setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!

Last Updated: 2024-XX-XX