Installation and deployment instructions (using Postgres as example)

Below are the instructions for connecting a Postgres server. The installation steps should be the same for all kinds of servers, but different servers require different configurations in the .yaml or DAG files. See https://docs.open-metadata.org/integrations/connectors for the configuration of your server.

Goal: To run Postgres metadata ingestion and quality tests with OpenMetadata using the Airflow scheduler.

Note: This procedure does not support Windows, because Windows does not implement "signal.SIGALRM". It is highly recommended to use WSL 2 if you are on Windows.
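
To confirm whether the current interpreter is affected, the attribute can be checked directly. This is a minimal sketch; the function name `supports_sigalrm` is hypothetical and not part of OpenMetadata.

```python
import signal
import sys


def supports_sigalrm() -> bool:
    """Return True when this platform's signal module exposes SIGALRM."""
    return hasattr(signal, "SIGALRM")


if __name__ == "__main__":
    if supports_sigalrm():
        print(f"{sys.platform}: signal.SIGALRM is available")
    else:
        print(f"{sys.platform}: signal.SIGALRM is missing; consider WSL 2")
```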

Requirements:

See the "Requirements" section of https://docs.open-metadata.org/overview/run-openmetadata-with-prefect

Installation:

  1. Clone this GitHub repo: git clone https://github.com/open-metadata/OpenMetadata.git

  2. cd to ~/.../openmetadata/docker/metadata

  3. Start the OpenMetadata containers. This will allow you to run OpenMetadata in Docker: docker compose up -d

  4. Install the OpenMetadata ingestion package.
  • (Optional but highly recommended) Before installing this package, create and activate a virtual environment. To do this, run: python -m venv env and then source env/bin/activate

  • To install the OpenMetadata ingestion package: pip install --upgrade "openmetadata-ingestion[docker]==0.10.3" (specify the release version to ensure compatibility)

  5. Install Airflow:
  • 5A: Install the Airflow Lineage Backend: pip3 install "openmetadata-ingestion[airflow-container]==0.10.3"
  • 5B: Install the Airflow Postgres connector module: pip3 install "openmetadata-ingestion[postgres]==0.10.3"
  • 5C: Install the Airflow APIs: pip3 install "openmetadata-airflow-managed-apis==0.10.3"
  • 5D: Install the necessary Airflow plugins:
      1. Download the latest openmetadata-airflow-apis-plugins release from https://github.com/open-metadata/OpenMetadata/releases
      2. Untar it under your {AIRFLOW_HOME} directory (usually c/Users/Yourname/airflow). This will create and set up a plugins directory under {AIRFLOW_HOME}.
      3. cp -r {AIRFLOW_HOME}/plugins/dag_templates {AIRFLOW_HOME}
      4. mkdir -p {AIRFLOW_HOME}/dag_generated_configs
      5. (Re)start the Airflow webserver and scheduler
  6. Configure Airflow:
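
Because the pip installs above are pinned to one release for compatibility, it may help to confirm that the installed packages really share that version before continuing. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `check_versions` is hypothetical, and the package list is taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED = "0.10.3"
PACKAGES = ["openmetadata-ingestion", "openmetadata-airflow-managed-apis"]


def check_versions(packages, expected, get_version=version):
    """Map each package name to (installed_version_or_None, matches_expected)."""
    report = {}
    for name in packages:
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == expected)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(PACKAGES, EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={EXPECTED} [{status}]")
```

Run it inside the same virtual environment that was used for the installs; otherwise the report reflects a different set of packages.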

To run a metadata ingestion workflow with Airflow ingestion DAGs on Postgres data:

  1. Prepare the ingestion DAG. A more complete tutorial on ingestion DAGs is available at https://docs.open-metadata.org/integrations/connectors/postgres/run-postgres-connector-with-the-airflow-sdk and, in brief, below is my own DAG. Copy and paste the following into a Python file (postgres_demo.py):
import pathlib
import json
from datetime import timedelta
from airflow import DAG

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from metadata.config.common import load_config_file
from metadata.ingestion.api.workflow import Workflow
from airflow.utils.dates import days_ago

default_args = {
    "owner": "user_name",
    "email": ["username@org.com"],
    "email_on_failure": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=60)
}

# Replace username, password, hostPort and database with your own values.
# Each flag under sourceConfig.config can be switched to true or false;
# the values shown here are only examples.
config = """
{
    "source": {
        "type": "postgres",
        "serviceName": "postgres_demo",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "username": "postgres",
                "password": "postgres",
                "hostPort": "192.168.1.55:5432",
                "database": "surveillance_hub"
            }
        },
        "sourceConfig": {
            "config": {
                "enableDataProfiler": false,
                "markDeletedTables": false,
                "includeTables": true,
                "includeViews": true,
                "generateSampleData": false
            }
        }
    },
    "sink": {
        "type": "metadata-rest",
        "config": {}
    },
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "no-auth"
        }
    }
}
"""

def metadata_ingestion_workflow():
    workflow_config = json.loads(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "sample_data",
    default_args=default_args,
    description="An example DAG which runs a OpenMetadata ingestion workflow",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    schedule_interval='*/5 * * * *', 
    catchup=False,
) as dag:
    ingest_task = PythonOperator(
        task_id="ingest_using_recipe",
        python_callable=metadata_ingestion_workflow,
    )

if __name__ == "__main__":
    metadata_ingestion_workflow()
  2. Run the DAG: python postgres_demo.py This executes the workflow once through the file's __main__ block.
  • Alternatively, we can run the ingestion without the Airflow SDK, using OpenMetadata's own CLI: metadata ingest -c /Your_Path_To_Json/.json The JSON configuration is exactly the same as the JSON configuration in the DAG.
  • Or we can run it with a YAML file: metadata ingest -c /Your_Path_To_Yaml/.yaml The YAML configuration is exactly the same, except without the curly brackets and the double quotes.
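
Since the JSON passed to metadata ingest -c has to be valid JSON, a small pre-flight check can catch stray annotations or missing sections before the CLI does. This is a hedged sketch: `missing_sections` and `write_config` are hypothetical helpers, and treating source, sink and workflowConfig as the required top-level sections is an assumption based only on the DAG configuration above.

```python
import json
import tempfile
from pathlib import Path

# Top-level sections used by the DAG configuration in this guide (assumed required).
REQUIRED_SECTIONS = ("source", "sink", "workflowConfig")


def missing_sections(config_text: str) -> list:
    """Parse the JSON text and list any expected top-level section that is absent."""
    parsed = json.loads(config_text)  # raises json.JSONDecodeError on invalid JSON
    return [name for name in REQUIRED_SECTIONS if name not in parsed]


def write_config(config_text: str, path: Path) -> Path:
    """Validate the config text, then write it to `path` as indented JSON."""
    missing = missing_sections(config_text)
    if missing:
        raise ValueError(f"config is missing sections: {missing}")
    path.write_text(json.dumps(json.loads(config_text), indent=2))
    return path


if __name__ == "__main__":
    sample = (
        '{"source": {"type": "postgres"}, '
        '"sink": {"type": "metadata-rest", "config": {}}, '
        '"workflowConfig": {}}'
    )
    target = write_config(sample, Path(tempfile.gettempdir()) / "postgres_demo.json")
    print(f"metadata ingest -c {target}")
```

The script only prints the command; running the ingestion itself is left to the CLI so that the check stays independent of the installed OpenMetadata version.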

To run a profiler workflow on Postgres data:

  1. Prepare the DAG or configure the YAML/JSON:

Example YAML I was using:

source:
  type: postgres
  serviceName: your_service_name
  serviceConnection:
    config:
      type: Postgres
      username: your_username
      password: your_password
      hostPort: your_host:your_port
      database: your_database  
  sourceConfig:
    config:
      type: Profiler

processor:
  type: orm-profiler
  config:
    test_suite:
      name: demo_test
      tests:
        - table: your_table_name  # must be the fully qualified name (FQN)
          column_tests:
            - columnName: id
              testCase:
                columnTestType: columnValuesToBeBetween
                config:
                  minValue: 0
                  maxValue: 10
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth

Note that the table name must be the fully qualified name (FQN) and must match the table path shown in the OpenMetadata UI exactly.
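
As an illustration of what such a name looks like, the sketch below joins the parts with dots. It assumes the table FQN has the form service.database.schema.table, which should be confirmed against the path shown in the UI for your release; `table_fqn` is a hypothetical helper and the "public" schema is only an example value.

```python
def table_fqn(service: str, database: str, schema: str, table: str) -> str:
    """Join the four parts into a dot-separated name (assumed FQN layout)."""
    parts = [service, database, schema, table]
    if any(not part for part in parts):
        raise ValueError("every FQN part must be non-empty")
    return ".".join(parts)


if __name__ == "__main__":
    # Service and database names reuse the ingestion example; schema is hypothetical.
    print(table_fqn("postgres_demo", "surveillance_hub", "public", "your_table_name"))
```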

  2. Run it with metadata profile -c /path_to_yaml/.yaml

Make sure to refresh the OpenMetadata UI and click on the Data Quality tab to see the results.
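
To help interpret a pass or fail for the columnValuesToBeBetween test configured above (minValue 0, maxValue 10), the check can be pictured in plain Python. This is only an illustrative sketch, not OpenMetadata's implementation: the inclusive bounds and the skipping of null values are assumptions, and `values_between` is a hypothetical name.

```python
from typing import Iterable, Optional


def values_between(
    values: Iterable[Optional[float]], min_value: float, max_value: float
) -> bool:
    """Return True when every non-null value lies in [min_value, max_value]."""
    present = [value for value in values if value is not None]
    return all(min_value <= value <= max_value for value in present)


if __name__ == "__main__":
    id_column = [0, 3, 7, 10]  # hypothetical sample of the "id" column
    print(values_between(id_column, min_value=0, max_value=10))
```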