

Last update:

Introduction


Introduction

Crawlab users and developers can integrate their own data into the Crawlab platform. This is achieved through the open APIs that Crawlab provides for data integration.

Please refer to the sections below for more information.
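As a sketch of what such an integration call might look like, here is a minimal client using only the Python standard library. The base URL, endpoint path and token header below are hypothetical placeholders, not documented values; refer to the API Reference for the actual endpoints.

```python
import json
import urllib.request

# Hypothetical values: replace with your actual Crawlab address and API token.
API_BASE = "http://localhost:8080/api"
API_TOKEN = "<your-api-token>"

def get_json(path: str) -> dict:
    """Fetch a JSON payload from a Crawlab open API endpoint (sketch)."""
    req = urllib.request.Request(
        API_BASE + path,
        headers={"Authorization": API_TOKEN},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call (hypothetical endpoint): results = get_json("/results")
```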




Introduction


Introduction

🚧 Under construction


Develop Plugins


Develop Plugins

🚧 Under construction


Quick Tutorial


Quick Tutorial

You have now installed Crawlab and perhaps can't wait to start using it. Before you dive into the details, we suggest going through this quick tutorial, which will walk you through the basics and familiarize you with the main features of Crawlab.

Introduction

In this tutorial, we are going to create a spider that crawls quotes on a mock site provided by Zyte (the company behind Scrapy). We will then upload this spider to Crawlab and run it to extract quotes data; finally, we will view the crawled data visually in Crawlab.

The framework we are going to use is Scrapy, the most popular web crawler framework written in Python, which is easy to use yet packed with powerful features.

Note

We assume you have installed Crawlab on your local machine by following Quick Start. If you haven't, please refer to Quick Start to install it first.

As we are using Scrapy, please make sure you have installed Python (>=3.6) and the package management tool pip before proceeding with any further steps.
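A quick, optional way to verify these prerequisites is to check the interpreter version and that pip responds:

```python
# Sanity-check the prerequisites above: Python >= 3.6 and a working pip.
import subprocess
import sys

assert sys.version_info >= (3, 6), "this tutorial assumes Python 3.6+"

# "python -m pip" reaches pip for the current interpreter, wherever it lives.
completed = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True,
)
print(completed.stdout.strip())
```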

Create Spider

First things first: we are going to generate a Scrapy project. Let's start by installing Scrapy.

pip install scrapy
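Before generating the project, the core extraction the spider will perform can be sketched with the standard library alone. The text and author class names below are assumptions based on the quotes mock site's markup; in the actual Scrapy spider, CSS selectors will replace this hand-rolled parsing.

```python
# Stdlib-only sketch of the extraction logic the Scrapy spider will perform.
from html.parser import HTMLParser

class QuoteParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.quotes = []      # extracted fields, in document order
        self._field = None    # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "text" in classes:
            self._field = "text"
        elif "author" in classes:
            self._field = "author"

    def handle_data(self, data):
        if self._field:
            self.quotes.append({self._field: data.strip()})
            self._field = None

parser = QuoteParser()
parser.feed('<span class="text">To be or not to be</span>'
            '<small class="author">Shakespeare</small>')
print(parser.quotes)
# prints [{'text': 'To be or not to be'}, {'author': 'Shakespeare'}]
```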

CLI


CLI

The CLI tools allow users to easily manage Crawlab and perform common actions such as uploading spiders. They are written in Python and are very easy to install.

Install

The Crawlab CLI tools are integrated into the Crawlab SDK. You can install them by executing the command below.

pip install crawlab-sdk

Data Sources


Data Sources

Crawlab supports data source integration, which means you can use Crawlab to manage your data sources, such as MongoDB, MySQL, PostgreSQL, SQL Server, etc.

Supported Data Sources

  • Non-Relational: MongoDB
  • Non-Relational: ElasticSearch
  • Relational: MySQL
  • Relational: PostgreSQL
  • Relational: SQL Server
  • Relational: CockroachDB
  • Relational: SQLite
  • Streaming: Kafka

Add Data Source

  1. Go to the Data Sources page
    data-sources-menu
  2. Click New Data Source button
    new-data-source-button
  3. Select Type as the data source type, and enter Name and connection fields
    mongo-form
  4. Click Confirm button to save the data source

Now you should be able to see the data source in the Data Sources page.

Use Data Source

  1. Go to the Spider Detail page
  2. Select the data source in the Data Source field
    mongo-data-source
  3. Click on Save button to save the spider
  4. Add related integration code in the code where saving results data (refer to the Spider Code Examples section below)
  5. Run the spider, and you should see the results in the Data tab
    results

Spider Code Examples

General Python Spider

The method save_item in crawlab-sdk can be used to save data to the designated data source.
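As a sketch, a general Python spider script could call save_item as below. The fallback stub exists only so the snippet runs outside a Crawlab task environment, where the real crawlab-sdk import is available.

```python
try:
    from crawlab import save_item  # available inside a Crawlab task runtime
except ImportError:
    # Stub for illustration only: the real save_item writes the item to the
    # data source selected in the spider's Data Source field.
    def save_item(item):
        print('would save:', item)

# A result item is just a dict of the fields your spider extracted.
save_item({'title': 'Example quote', 'author': 'Anonymous'})
```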



Dependencies Management


Dependencies Management

Crawlab allows users to install and manage dependencies for spiders and tasks.

Expand the Dependencies menu in the left sidebar and click the sub-menu items as below.

menu.png

  • Settings: Global dependencies settings
  • Python: Python dependencies management
  • Node.js: Node.js dependencies management

Install Dependencies

  1. Navigate to the dependencies management page (Python/Node.js)
    deps-list.png
  2. Click on Installable button
    installable.png
  3. Enter search keywords and click on Search button
    img.png
  4. Click on Install button
    install.png
  5. Select Mode (which nodes to install on) and Upgrade (whether to upgrade), then click the Confirm button
    install-form.png

Uninstall Dependencies

  1. Navigate to the dependencies management page (Python/Node.js)
    deps-list.png
  2. Click on Uninstall button to uninstall the dependency
    uninstall.png
  3. Select Mode (which nodes to uninstall from) and click the Confirm button
    uninstall-form.png

Settings

  1. Navigate to the dependencies management page (Settings)
    settings-list.png
  2. Click on Configure button
    edit.png
  3. Edit the configuration and click on Confirm button
    settings.png

Settings description:

  • Command: executable command for installing/uninstalling dependencies, e.g. pip, /usr/local/bin/pip39, npm, yarn
  • Proxy: proxy for installing/uninstalling dependencies. E.g. https://registry.npm.taobao.org or https://pypi.tuna.tsinghua.edu.cn/simple

Tasks

  1. Navigate to the dependencies management page (Python/Node.js)
  2. Click on Tasks button
    tasks.png
  3. You can view install/uninstall tasks
    tasks-list.png
  4. Click on Logs button to view logs
    tasks-logs.png
  5. You can view the logs of a given task
    tasks-logs-content.png

Environment Variables


Environment Variables

NOTE

This feature is only available in the Crawlab Pro Edition.

Crawlab allows users to set environment variables during spider runtime.

Setting Environment Variables

  1. Navigate to the Environment Variables page.
  2. Click the Create Environment Variable button.
  3. Fill in the configuration form.

Accessing Environment Variables

Assuming we have set an environment variable with the key FOO and the value BAR, we can access it in a spider script using the following sample code.

import os

foo = os.environ.get('FOO')
print(foo)  # BAR

Introduction


Introduction

If you already know what Crawlab is and what it is used for, you can head straight to Quick Start or Installation to install and start using Crawlab.

If you are not familiar with Crawlab, you can read the sections below to understand more about it.

What is Crawlab?

Crawlab is a powerful Web Crawler Management Platform (WCMP) that can run web crawlers and spiders developed in various programming languages including Python, Go, Node.js, Java and C#, as well as frameworks including Scrapy, Colly, Selenium and Puppeteer. It is used for running, managing and monitoring web crawlers, particularly in production environments where traceability, scalability and stability are the major concerns.

Background and History

The Crawlab project has been under continuous development since it was first published in March 2019, and has gone through a number of major releases. It was initially designed to solve the management problem that arises when a large number of spiders have to be coordinated and executed. With many improvements and newly added features, Crawlab is becoming more and more popular in developer communities, particularly amongst web crawler engineers.

Change Logs

Who can use Crawlab?

  • Web Crawler Engineers. By integrating web crawler programs into Crawlab, you can focus on the crawling and parsing logic instead of wasting time writing common modules such as task queues, storage, logging, notification, etc.
  • Operation Engineers. The main benefit of Crawlab for operation engineers is the convenience of deployment (for both crawler programs and Crawlab itself). Crawlab supports easy installation with Docker and Kubernetes.
  • Data Analysts. Data analysts who can code (e.g. in Python) can develop web crawler programs (e.g. with Scrapy) and upload them to Crawlab, then leave the rest of the dirty work to Crawlab, which will automatically collect data for them.
  • Others. Technically, everyone can enjoy the convenience and ease of automation provided by Crawlab. Though Crawlab is good at running web crawler tasks, it can also be used for other types of tasks such as data processing and automation.

Main Features

  • Node / Node Management: register, manage and control multiple nodes in the distributed system
  • Spider / Spider Deployment: auto-deploy spiders to multiple nodes and auto-sync spider files including scripts and programs
  • Spider / Spider Code Editing: update and edit script code with the online editor on the go
  • Spider / Spider Stats: spider crawling statistics such as average running time and results count
  • Spider / Framework Integration: integration of spider frameworks such as Scrapy
  • Spider / Data Storage Integration: automatic saving of results data in the database without additional configuration
  • Spider / Git Integration: version control through embedded or external remote Git repos
  • Task / Task Scheduling: assign and schedule crawling tasks to multiple nodes in the distributed system
  • Task / Task Logging: automatic saving of task logs, which can be viewed in the frontend UI
  • Task / Task Stats: visual display of task stats including results count and running time
  • User / User Management: create, update and delete user accounts
  • Other / Dependency Management: search and install Python and Node.js dependency packages
  • Other / Notification: automatic email or mobile notifications when tasks are triggered or complete

Direct Deploy


Direct Deploy

🚧 Under construction...


Installation: Docker


Installation: Docker

Docker is the most convenient and easiest way to install and deploy Crawlab. If you are not familiar with Docker, you can refer to the Docker Official Site and install it on your local machine. Make sure you have installed Docker before proceeding with any further steps.

Main Process

There are several deployment modes for Docker installation, but the main process is similar.

  1. Install Docker and Docker Compose
  2. Pull Docker image of Crawlab (and MongoDB if you have no external MongoDB instance)
  3. Create docker-compose.yml and make configurations
  4. Start Docker containers

Note

In the following guidance, we assume you have installed Docker and Docker Compose, and have already pulled the Docker images.

Standalone-Node Deployment

Standalone-Node Deployment (SND) is similar to the configuration in Quick Start, and it is normally for demo purposes or for managing a small number of crawlers. In SND, all Docker containers, including Crawlab and MongoDB, run on a single machine, i.e. the Master Node (see diagram above).

Create docker-compose.yml and enter the content below.

version: '3.3'

Installation


Installation

There are multiple methods of installing Crawlab. You can refer to the summary table below to choose the one that is most suitable.

  • Docker
    Recommended environment: Demo / Production (nodes <= 10)
    Recommended users:
    1. Small cluster needed
    2. Familiar with Docker
    3. Minimal maintenance required
  • Kubernetes (to be updated)
    Recommended environment: Production (nodes > 10)
    Recommended users:
    1. Medium or large cluster needed
    2. Scalability is a major concern
    3. Familiar with Kubernetes or orchestration
    4. Professional operation resources available
  • Direct Deploy (to be updated)
    Recommended environment: Demo / Experimental
    Recommended users:
    1. Additional customization needed
    2. Familiar with Vue.js or Go
    3. Willing to work with source code

Kubernetes


Kubernetes

🚧 Under construction...


Monitoring


Monitoring

NOTE

This functionality is for the Pro Edition only.

Crawlab Pro supports performance monitoring, which means you can use Crawlab Pro to monitor the performance of your nodes.

Performance Metrics Overview

  1. Go to the Metrics page
    metrics-menu
  2. You can see the snapshots of the performance metrics of all nodes
    metrics-overview

Performance Metrics Detail

  1. Go to the Metrics Detail page by clicking the View button on the Metrics page
    view-button
  2. You can see the performance metrics of the selected node
    metrics-detail
  3. You can switch the metrics source by selecting the Metrics Source dropdown
    metrics-source
  4. You can select the time range/unit by selecting the Time Range dropdown
    time-range
    and Time Unit
    time-unit
  5. You can check or uncheck metrics on the left panel to show/hide them on the right panel
    metrics-panel

Node


Node

A node is a Crawlab instance that runs crawling tasks or provides other functionalities. You can basically regard a node as a server.

There are two types of nodes, each of which serves different functionalities.

  1. Master Node
  2. Worker Node

Note

Of course, you can set up multiple Crawlab instances (nodes) on one server, but that is NOT recommended, as a single instance (node) per server normally suffices.

Master Node

Master Node is the control center of the whole distributed system in Crawlab. It acts like the brain of a human body. Master Node assigns tasks to Worker Nodes or itself, and manages them. It also deploys and distributes spider files to other nodes. Furthermore, it provides APIs to the frontend application and handles communication between each node.

Note

There is only ONE Master Node in Crawlab.

Worker Node

Worker Node is a Crawlab instance dedicated to running crawling tasks. A single node or server is normally limited in its computing power and resources, including CPU, memory and network I/O. Therefore, the number of Worker Nodes can be increased to scale up the throughput of data collection and improve the overall crawling performance of the distributed system.

Tips

There can be none (SND) or multiple Worker Nodes (MND) in Crawlab.

Topology

Check Node Status

On the Nodes page, you can view whether a node is online or offline.

Enable/Disable

You can enable or disable nodes to run tasks by toggling the switch button of Enabled attribute in Nodes page and node detail page.

Set Max Runners

A node can run multiple tasks at the same time. The number of concurrent tasks is controlled by Max Runners of a node. It can be configured in the node detail page.
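Conceptually, Max Runners behaves like a semaphore that caps concurrency, as in this illustrative sketch (not Crawlab's actual scheduler code, which is written in Go):

```python
# Conceptual sketch of how a Max Runners cap bounds concurrent tasks.
import threading

MAX_RUNNERS = 2  # hypothetical node setting
runner_slots = threading.BoundedSemaphore(MAX_RUNNERS)

def run_task(task_id: int, results: list):
    with runner_slots:           # blocks while MAX_RUNNERS tasks are active
        results.append(task_id)  # stand-in for executing the spider task

results: list = []
threads = [threading.Thread(target=run_task, args=(i, results)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # all 5 tasks finish, at most 2 at a time: [0, 1, 2, 3, 4]
```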

Set Basic Info

Basic info such as node name, IP, MAC address can be set in the node detail page.

Add Node

You can refer to Set up Worker Nodes in Multi-Node Deployment (MND) of Docker Installation to add new nodes.


Notifications


Notifications

NOTE

This functionality is for the Pro Edition only.

Crawlab allows users to receive email or mobile notifications.

Email

  1. Navigate to Notifications page
    notifications-menu.png
  2. Click a notification config of Email type
  3. Fill in the configuration form
    email-config.png
  4. Click on Save button

SMTP configurations:

  • SMTP Server: SMTP server address
  • SMTP Port: SMTP server port
  • SMTP User: SMTP server username
  • SMTP Password: SMTP server password
  • Sender Email: SMTP server sender email
  • Sender Identity: SMTP server sender identity
  • To: Recipient email
  • CC: CC email
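For reference, the fields above map onto a standard SMTP send, sketched here with Python's stdlib smtplib (illustrative only; Crawlab's own sender is internal):

```python
import smtplib
from email.message import EmailMessage

def send_notification(smtp_server, smtp_port, smtp_user, smtp_password,
                      sender_email, to, subject, body):
    """Send one email using settings shaped like Crawlab's SMTP config."""
    msg = EmailMessage()
    msg["From"] = sender_email
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(smtp_server, smtp_port) as smtp:
        smtp.starttls()  # most providers require TLS, typically on port 587
        smtp.login(smtp_user, smtp_password)
        smtp.send_message(msg)
```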

Mobile

  1. Navigate to Notifications page
    notifications-menu.png
  2. Click a notification config of Mobile type
  3. Fill in the configuration form
    mobile-config.png
  4. Click on Save button

Tips

Please refer to related documentation for how to get webhook tokens.

Trigger

  1. Navigate to the Notifications page
    notifications-menu.png
  2. Click on the Trigger tab
  3. Select the event types you want to trigger

Template

  1. Navigate to Notifications page
    notifications-menu.png
  2. Click a notification config of any type
  3. Click on Template tab
    template.png

Tips

To understand the syntax and variables of templates, please refer to template-parser


Permissions Management


Permissions Management

NOTE

This functionality is for the Pro Edition only.

Crawlab Pro supports RBAC-based permissions management, which means you can use Crawlab Pro to manage the permissions of your users via roles.

Permissions

Permissions in Crawlab Pro are the basic unit of user access control.

Types of permissions

Types of permissions are as below:

  • Action: Specific actions that a role can perform, such as View, Edit, Delete, Create, etc.
  • Page: Specific pages that a role can access, such as Spiders, Tasks, Nodes, etc.
  • Data: Specific data records that a role can access, such as Spiders attributed to a specific user.

Permission fields

Fields of permissions are as below:

  • Type: Type of permission: Action, Page, or Data.
  • Target: Regex pattern of the targets that the permission operates on.
  • Allow: Regex pattern of allowed items.
  • Deny: Regex pattern of denied items.
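To make the Allow/Deny semantics concrete, here is an illustrative sketch of how such regex fields could combine into an access decision (an assumption for illustration, not Crawlab Pro's actual evaluation logic):

```python
# Illustrative sketch of combining Allow/Deny regex fields into a decision.
import re

def is_allowed(item: str, allow: str, deny: str) -> bool:
    # In this sketch, Deny takes precedence over Allow.
    if deny and re.search(deny, item):
        return False
    return bool(allow and re.search(allow, item))

print(is_allowed("/spiders/123", allow=r"^/spiders", deny=r"/admin"))  # True
print(is_allowed("/admin/users", allow=r".*", deny=r"^/admin"))        # False
```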

Create a permission

  1. Go to the Permissions page by clicking the Permissions button in the sidebar.
    permissions-menu
  2. Click the New Permission button
    permissions-create
  3. Enter necessary info of the new permission and click Confirm button
    permissions-create-form

Delete a permission

  1. Go to the Permissions page by clicking the Permissions button in the sidebar
    permissions-menu
  2. Click the Delete button of the permission you want to delete
    delete-button

Roles

Roles in Crawlab Pro can be defined by admin users. Roles are associated with a set of permissions, and can be assigned to users.

Create a Role

  1. Go to the Roles page by clicking the navigation button on the left sidebar
    roles-menu
  2. Click the New Role button
    roles-create
  3. Enter necessary info of the new role and click Confirm button
    roles-create-form

Delete a role

  1. Go to the Roles page by clicking the Roles button in the sidebar
    roles-menu
  2. Click the Delete button of the role you want to delete
    delete-button
Link Permissions to a Role

  1. Go to the Permissions tab on the Role Detail page by clicking the View permissions button.
    view-permissions-button
  2. Click on the Link Permissions button.
    link-permissions-button
  3. Select the permissions you want to link/unlink to the role, and click the Confirm button.
    link-permissions-form

Link Users to a Role

  1. Go to the Users tab on the Role Detail page by clicking the View users button.
    view-users-button
  2. Click on the Link Users button.
    link-users-button
  3. Select the users you want to link/unlink to the role, and click the Confirm button.
    link-users-form

Plugin


Plugin

A plugin is an extension that goes beyond existing functionalities and features. In Crawlab, the Plugin Framework is in place for users to customize their web crawler management platform.

Why Plugin

Why don't we just hack the source code of Crawlab when customization is needed? The reason is maintainability. When you change the code of core modules in Crawlab, you risk your project's maintainability, because future upgrades would very likely break your customization.

A well-designed plugin is loosely coupled with Crawlab, so that updates in Crawlab will not significantly affect the plugin. Plugins are pluggable and easy to install or uninstall.

Plugin Framework

Plugin Framework is embedded in Crawlab which manages official and third-party plugins. Crawlab users can develop plugins based on Crawlab Plugin Framework (CPF).

Official Plugins

There are some public official plugins maintained by the Crawlab Team. The GitHub repos of official Crawlab plugins are normally located in the Crawlab Team's repositories, each of which has the prefix plugin-.

  • plugin-notification: sends alerts and notifications such as emails and mobile push notifications (Link)
  • plugin-dependency: installs and manages dependencies and running environments (Link)
  • plugin-spider-assistant: provides advanced web crawler features such as framework support (e.g. Scrapy) (Link)

Install Plugin

Tips

After a plugin is installed, you should refresh the page in your web browser in order for the plugin's UI components to display.

There are several ways of installing plugins in Crawlab.

Install Official Plugins

You can install official plugins by simply entering the plugin name in the Install Plugin dialog.

  1. Navigate to Plugins.
  2. Choose Public.
  3. Click Install button on plugins you would like to install.

Install by Git

If you know the Git URL of a Crawlab plugin, you can install it through that URL.

  1. Navigate to Plugins.
  2. Choose Type as Git.
  3. Enter the url of the plugin in the field Install URL.
  4. Click Confirm.

Install by Local

Note

This method is recommended only when you are developing Crawlab with source code.

  1. Navigate to Plugins.
  2. Choose Type as Local.
  3. Enter local path of the plugin in the field Install Path.
  4. Click Confirm.

Installation Source

Note

Installation Source is only for official plugins.

The default installation source for official plugins is GitHub, but GitHub is not always the fastest source to access. For example, if you are in Mainland China, accessing GitHub can sometimes be slow; you can then choose Gitee as the source for official plugins, which will largely speed up plugin installation.

Uninstall Plugin

You can uninstall a plugin by clicking Delete button on the right in Plugins page.

Start/Stop

You can start or stop a plugin by clicking Start or Stop button on the right in Plugins page.


plugin-dependency


plugin-dependency

plugin-dependency is a plugin that manages dependencies in Crawlab. For example, your Python crawlers may need libraries such as selenium or sqlalchemy in addition to the libraries pre-installed in Crawlab. With plugin-dependency, you can easily install and manage your dependencies and libraries in the Crawlab web UI.

Available Dependency Frameworks

  • Python
  • Node.js

Search and Install Dependencies

You can search and install dependencies on Crawlab Web UI with plugin-dependency, just like in popular IDEs such as JetBrains IDEA and VS Code.

  1. Navigate to the dependency framework page, e.g. Python.
  2. Click Installable button.
  3. Type in keyword for searching in the search input on the top left.
  4. Click search icon button.
  5. Click Install button on the right of the plugins you'd like to install.

Uninstall Dependencies

Uninstalling dependencies is also supported.

  1. Navigate to the dependency framework page, e.g. Python.
  2. Click Installed button.
  3. Type in keyword for searching in the search input on the top left.
  4. Click search icon button.
  5. Click Uninstall button on the right of the plugins you'd like to uninstall.

View Tasks

You may want to check whether your installation or uninstallation succeeded, which can be done by viewing tasks with the steps below.

  1. Navigate to the dependency framework page, e.g. Python.
  2. Click Tasks button.
  3. You can view logs of each task by clicking Logs button.

plugin-notification


plugin-notification

plugin-notification is a Crawlab plugin that allows users to send and receive notifications from Crawlab via email or mobile applications (e.g. WeChat, DingTalk).

Notification Type

There are 2 types of notifications in plugin-notification.

  • Mail: Sending notifications via email.
  • Mobile: Sending notifications via mobile webhooks.

Triggers

plugin-notification allows users to set triggers in order to configure when to send notifications.

You can follow the below steps to configure triggers.

  1. Navigate to Notifications page.
  2. Navigate to notification detail page by clicking the name or View button on the right.
  3. Click Triggers tab.
  4. Select triggers for sending notifications.

Template

plugin-notification allows users to customize notification content.

You can follow the below steps to customize content.

  1. Navigate to Notifications page.
  2. Navigate to notification detail page by clicking the name or View button on the right.
  3. Click Template tab.
  4. Edit template.

plugin-spider-assistant


plugin-spider-assistant

plugin-spider-assistant is a Crawlab plugin that provides assistance in spider management. It allows users to view and manage items in spider frameworks.

Spider Frameworks

  • Scrapy (Python)
  • Colly (Go)
  • WebMagic (Java)
  • DotnetSpider (C#)

How to use

  1. Navigate to spider detail page.
  2. Click Assistant tab.
  3. You are now able to view info of detected spider framework.

Project


Project

A project is a group of spiders that are normally closely related, mostly crawling sites or data in the same category or industry. You can therefore regard projects as a way of grouping spiders so that they can be better managed.

A project has a one-to-many relationship with spiders.

You can link a spider to a project by either,

  1. selecting Project in the spider detail page, or
  2. selecting Project in the create new spider dialog.

View Spiders

Navigate to Spiders tab in the project detail page.


Quick Start


Quick Start

The quickest way to install Crawlab is with Docker. If you are not familiar with Docker, you can refer to the Docker Official Site and install it on your local machine.

Pull Images

Make sure you have installed Docker, and then pull the images of Crawlab and MongoDB.

docker pull crawlabteam/crawlab

Schedule


Schedule

Most of the time, we need to run crawling tasks for a spider periodically. This is where a schedule comes in.

The concept of a schedule in Crawlab is similar to crontab in Linux. It is a long-lived job that runs spider tasks periodically.

Tips

If you would like to configure a web crawler that automatically runs crawling tasks every day/week/month, you should probably set up a schedule. Schedule is the right way to automate things, especially for spiders that crawl incremental content.

Create Schedule

  1. Navigate to Schedules page.
  2. Click New Schedule button on the top left.
  3. Enter basic info including Name, Cron Expression and Spider.
  4. Click Confirm.

A created schedule is enabled by default. Once you have created an enabled schedule, it will trigger tasks on time according to the cron expression you have set.

Tips

You can check whether the schedule module in Crawlab works by creating a new schedule with Cron Expression set to * * * * *, which means "every minute", and verifying that a task is triggered when the next minute starts.

Enable/Disable Schedule

You can enable or disable schedules by toggling the switch button of Enabled attribute in Schedules page and schedule detail page.

Cron Expression

Cron Expression is a simple and standard format to describe the periodicity of tasks. It is the same as the format in Linux crontab.

*    *    *   *    *  Command_to_execute
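The five whitespace-separated fields can be read off programmatically; this small helper just names them in order (minute, hour, day of month, month, day of week):

```python
# Name the five fields of a standard cron expression, in order.
FIELD_NAMES = ["minute", "hour", "day of month", "month", "day of week"]

def describe(expr: str) -> dict:
    fields = expr.split()
    assert len(fields) == 5, "a standard cron expression has 5 fields"
    return dict(zip(FIELD_NAMES, fields))

print(describe("* * * * *"))   # every minute
print(describe("30 2 * * 1"))  # 02:30 every Monday
```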

File Editor


File Editor

Crawlab allows users to edit files in the browser. This is useful for editing files such as settings.py and items.py in the spider.

Open File

  1. Navigate to Files tab in spider detail page.
    files-tab
  2. Double-click the file you want to edit.
    files-sidebar
  3. The file should be opened in the editor.
    file-editor

Edit File

  1. Make changes to the file.

Save File

  1. Press Ctrl + S or click Save button in the nav bar to save the file.
    save-btn

Move File

  1. Drag and drop the file to the folder you want to move to.

Rename File

  1. Right-click the file and select Rename.
    rename

Duplicate File

  1. Right-click the file and select Duplicate.
    duplicate

Delete File

  1. Right-click the file and click Delete in the context menu.
    delete-file