Airflow Custom Service Descriptor (CSD)
This repository allows you to install Apache Airflow as a service manageable by Cloudera Manager.
Requirements:
- A supported operating system.
- MySQL or PostgreSQL database in which to store Airflow metadata.
- RabbitMQ
- Airflow Parcel
Supported Airflow versions:
- Airflow 1.9.0
- Airflow 1.10.3
Supported operating systems:
- CentOS/RHEL 6 & 7
- Debian 8
- Ubuntu 14.04, 16.04, & 18.04
To install the CSD:
- Download the Airflow CSD jar file.
- Copy the jar file to the /opt/cloudera/csd directory on the Cloudera Manager server.
- Restart the Cloudera Manager Server service:
service cloudera-scm-server restart
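A minimal sketch of those steps from a shell on the Cloudera Manager host (the jar file name and the ownership step are assumptions; use the actual artifact you downloaded):
# Copy the downloaded CSD jar into the CSD directory and restart Cloudera Manager.
cp AIRFLOW-1.10.3.jar /opt/cloudera/csd/
chown cloudera-scm:cloudera-scm /opt/cloudera/csd/AIRFLOW-1.10.3.jar
service cloudera-scm-server restart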
To set up the Airflow metadata database:
- A database needs to be created.
- A database user needs to be created along with a password.
- Grant all the privileges on the database to the newly created user.
- Set AIRFLOWDB_PASSWORD to a sufficiently strong value. For example, run the following in your Linux shell session to generate one:
< /dev/urandom tr -dc A-Za-z0-9 | head -c 20;echo
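As a minimal sketch (assuming a bash shell), the generated value can be captured in a variable so it can be reused in the database statements below:
# Generate a 20-character random password and keep it for later use.
export AIRFLOWDB_PASSWORD=$(< /dev/urandom tr -dc A-Za-z0-9 | head -c 20)
echo "${AIRFLOWDB_PASSWORD}"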
Example for MySQL:
- Create a database.
CREATE DATABASE airflow DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
- Create a new user and grant privileges on the database.
GRANT ALL ON airflow.* TO 'airflow'@'localhost' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
GRANT ALL ON airflow.* TO 'airflow'@'%' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
Alternatively, you can use the Airflow/MySQL deployment script to create the MySQL database using:
create_mysql_dbs-airflow.sh --host <host_name> --user <username> --password <password>
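If you prefer to run the statements directly, the following is a minimal sketch using the mysql client (the administrative account and heredoc form are assumptions; replace AIRFLOWDB_PASSWORD with the generated value):
# Create the Airflow metadata database and user as a MySQL administrative user.
mysql -h <host_name> -u root -p <<'SQL'
CREATE DATABASE airflow DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
GRANT ALL ON airflow.* TO 'airflow'@'localhost' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
GRANT ALL ON airflow.* TO 'airflow'@'%' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
SQL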
Example for PostgreSQL:
- Create a role.
CREATE ROLE airflow LOGIN ENCRYPTED PASSWORD 'AIRFLOWDB_PASSWORD' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
ALTER ROLE airflow SET search_path = airflow, "$user", public;
- Create a database.
CREATE DATABASE airflow WITH OWNER = airflow ENCODING = 'UTF8' TABLESPACE = pg_default CONNECTION LIMIT = -1;
Alternatively, you can use the Airflow/PostgreSQL deployment script to create the PostgreSQL database using:
create_postgresql_dbs-airflow.sh --host <host_name> --user <username> --password <password>
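As with MySQL, here is a minimal sketch using the psql client (the postgres superuser and heredoc form are assumptions; replace AIRFLOWDB_PASSWORD with the generated value):
# Create the Airflow role and database as a PostgreSQL superuser.
psql -h <host_name> -U postgres <<'SQL'
CREATE ROLE airflow LOGIN ENCRYPTED PASSWORD 'AIRFLOWDB_PASSWORD' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
ALTER ROLE airflow SET search_path = airflow, "$user", public;
CREATE DATABASE airflow WITH OWNER = airflow ENCODING = 'UTF8' TABLESPACE = pg_default CONNECTION LIMIT = -1;
SQL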
There are six roles available for deployment:
- Webserver
- Scheduler
- Worker
- Flower Webserver
- Kerberos
- Gateway
Webserver: The Airflow Webserver role runs the Airflow Web UI. The Webserver role can be deployed on more than one instance; the additional instances serve the same content and can be used for backup purposes.
Scheduler: The Airflow Scheduler role schedules the Airflow jobs. It is limited to one instance to reduce the risk of duplicate jobs.
Worker: The Airflow Worker role picks up jobs from the Scheduler and executes them. Multiple instances can be deployed.
Flower Webserver: The Flower Webserver role is used to monitor the Celery cluster. Celery allows Worker roles to be scaled out, and only one Flower Webserver instance is needed.
Kerberos: The Airflow Kerberos role enables the Kerberos protocol for the other Airflow roles and for DAGs. This role should exist on each host with an Airflow Worker role.
Gateway: The purpose of the Gateway role is to make the Airflow configuration available to CLI clients.
Here are some examples of Airflow commands:
airflow list_dags
The DAG file has to be copied manually to the dags folder on all of the nodes.
airflow trigger_dag <DAG Name>
For a complete list of Airflow commands, refer to the Airflow Command Line Interface documentation.
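For instance, a short session on a host with the Gateway role might look like the following (the DAG id "tutorial" is only an illustrative assumption):
# List the DAGs that Airflow has picked up from the dags folder.
airflow list_dags
# Unpause the DAG and trigger a run of it.
airflow unpause tutorial
airflow trigger_dag tutorial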
The DAG file has to be copied to the dags_folder directory on all of the nodes. It is important to manually distribute it to every node where the Airflow roles are deployed.
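A minimal sketch of that manual distribution, assuming scp access and a hypothetical host list and dags_folder location (adjust both to your deployment):
# Copy a DAG file to the configured dags_folder on every Airflow host.
for host in airflow-worker1 airflow-worker2 airflow-scheduler1; do
  scp my_dag.py "${host}:/var/lib/airflow/dags/"
done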
In order to enable authentication for the Airflow Web UI, check the "Enable Airflow Authentication" option. You can create Airflow users using one of the two options below.
One way to add Airflow users to the database is through the Airflow Web UI. Users can be added as follows:
- Navigate to Airflow WebUI.
- In the Admin dropdown choose Users.
- Choose Create and enter the username, email, and password you want to create.
Note: Although only the most recently created user shows up in the Airflow configuration, you can still use the previously created users.
Another way to add Airflow users to the database is using the airflow-mkuser script. Users can be added as follows:
airflow-mkuser <username> <email> <password>
For example:
airflow-mkuser admin admin@localdomain password123
To build the CSD jar from source:
git clone https://github.com/teamclairvoyant/apache-airflow-cloudera-csd
cd apache-airflow-cloudera-csd
make dist
Update the version file before running make dist if creating a new release.
Known issues:
- After deploying configurations, there is no alert or warning that specific roles need to be restarted.
- Only 'airflow.contrib.auth.backends.password_auth' mechanism is supported for Airflow user authentication.
To do:
- Build a RabbitMQ parcel.
- Test the database connection.
- Add support for more Airflow user authentication methods.
After many deployments, you may encounter a 'Markup file already exists' error while trying to stop a role, and the process never stops. In that case, stop the process using the "Abort" command, then navigate to /var/run/cloudera-scm-agent/process and delete all of the GracefulRoleStopRunner directories.
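A minimal cleanup sketch for the affected host (run as root; the glob on the directory names is an assumption based on the description above):
# Remove the leftover GracefulRoleStopRunner process directories.
find /var/run/cloudera-scm-agent/process -maxdepth 1 -type d -name '*GracefulRoleStopRunner*' -exec rm -rf {} +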