From 2e536daaeab7728dfd8668a8affc89fe61c29b6d Mon Sep 17 00:00:00 2001
From: Alex
Date: Fri, 6 Sep 2024 15:01:16 +0200
Subject: [PATCH 1/9] docs updates. added section observability

---
 docs/Observability/1. overview.md | 9 ++++++++
 docs/Observability/2. prometheus.md | 20 ++++++++++++++++++
 docs/Observability/3. otlp.md | 32 +++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+)
 create mode 100644 docs/Observability/1. overview.md
 create mode 100644 docs/Observability/2. prometheus.md
 create mode 100644 docs/Observability/3. otlp.md

diff --git a/docs/Observability/1. overview.md b/docs/Observability/1. overview.md
new file mode 100644
index 00000000..49315bfd
--- /dev/null
+++ b/docs/Observability/1. overview.md
@@ -0,0 +1,9 @@
+# Overview
+
+Observability is the ability to infer a system's internal state from its external outputs.
+AI DIAL supports Prometheus and OpenTelemetry (OTEL) methods enhance observability by providing powerful metrics collection and tracing capabilities, enabling deeper insights into system performance and behavior.
+
+For DIAL application we have few options to monitoring/observability:
+- Prometheus (pull/scrape model)
+- OTLP (push model)
+- Both (application can be configured to use both approaches)
\ No newline at end of file
diff --git a/docs/Observability/2. prometheus.md b/docs/Observability/2. prometheus.md
new file mode 100644
index 00000000..e04a2b31
--- /dev/null
+++ b/docs/Observability/2. prometheus.md
@@ -0,0 +1,20 @@
+# Prometheus method
+## Introduction
+Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects metrics from configured targets at specified intervals, stores them in a time-series database, and provides powerful querying capabilities. With its flexible architecture, Prometheus is particularly suited for dynamic environments, making it a popular choice for cloud-native applications and microservices. Its intuitive visualization tools help teams gain deep insights into system performance, ensuring timely detection of issues.
+## Configuration
+### AI DIAL Core Settings
+Add the following environment variable to AI DIAL Core configuration. Refer to [AI DIAL Core](https://github.com/epam/ai-dial-helm/tree/main/charts/dial-core) to learn more.
+```json
+containerPorts.metrics
+```
+### AI DIAL Extentions Settings
+Add the following environment variable to AI DIAL Extentions configuration. Refer to [AI DIAL Extentions](https://github.com/epam/ai-dial-helm/tree/main/charts/dial-extension) to learn more.
+```json
+containerPorts.metrics
+metrics.*
+```
+
+## Examples
+```
+TBD?
+```
\ No newline at end of file
diff --git a/docs/Observability/3. otlp.md b/docs/Observability/3. otlp.md
new file mode 100644
index 00000000..c002ec5c
--- /dev/null
+++ b/docs/Observability/3. otlp.md
@@ -0,0 +1,32 @@
+# OpenTelemetry method
+## Introduction
+OpenTelemetry is an open-source observability framework designed to standardize the collection of telemetry data across distributed systems. By providing a unified set of APIs, libraries, and agents, it enables developers to capture traces, metrics, and logs from their applications seamlessly. OpenTelemetry simplifies the monitoring process and enhances visibility into application performance and reliability, making it easier to troubleshoot issues and optimize systems in real-time.
+## Configuration
+### Configuration Guidelines
+ `TBD?`
+### Configure AI DIAL
+Environment Variables for the Configuration of Opentelemetry
+All standard env variables you could find in the official opentelemetry [documentation](https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/logging/logging.html). If no value set for the OTEL_METRICS_EXPORTER then OpenTelemetry Prometheus Metric Exporter will be used. If value set to "otlp" the OpenTelemetry Collector Metrics Exporter for web and node will be used.
+```json +OTEL_PYTHON_LOG_CORRELATION: "true|false" +OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: "" +OTEL_LOGS_EXPORTER: "otlp" +OTEL_METRICS_EXPORTER: "otlp|otlp,prometheus" +OTEL_TRACES_EXPORTER: "otlp" +OTEL_EXPORTER_OTLP_ENDPOINT: "" +OTEL_RESOURCE_ATTRIBUTES: "service.name=" +``` + +Where: +- `OTEL_PYTHON_LOG_CORRELATION` - enable trace context injection +- `OTEL_PYTHON_FASTAPI_EXCLUDED_URLS`: to exclude certain URLs from tracking +- `OTEL_LOGS_EXPORTER` - Logs exporter to be used +- `OTEL_METRICS_EXPORTER` - Metrics exporter to be used +- `OTEL_TRACES_EXPORTER` - Trace exporter to be used +- `OTEL_EXPORTER_OTLP_ENDPOINT` - OTEL endpoint URL +- `OTEL_RESOURCE_ATTRIBUTES` - Key-value pairs to be used as resource attributes + +## Examples +``` +TBD? +``` \ No newline at end of file From ef2ee40821cf099a10b13925ed857e08a18aa398 Mon Sep 17 00:00:00 2001 From: Alex Date: Mon, 9 Sep 2024 13:14:52 +0200 Subject: [PATCH 2/9] Added a lot of improvements --- docs/Observability/2. prometheus.md | 12 ++++++++++-- docs/Observability/3. otlp.md | 6 +++--- docs/Observability/4. Log collections/1. overview.md | 2 ++ docs/Observability/4. Log collections/2. k8salloy.md | 7 +++++++ 4 files changed, 22 insertions(+), 5 deletions(-) create mode 100644 docs/Observability/4. Log collections/1. overview.md create mode 100644 docs/Observability/4. Log collections/2. k8salloy.md diff --git a/docs/Observability/2. prometheus.md b/docs/Observability/2. prometheus.md index e04a2b31..397ee7b8 100644 --- a/docs/Observability/2. prometheus.md +++ b/docs/Observability/2. prometheus.md @@ -1,18 +1,26 @@ # Prometheus method ## Introduction -Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects metrics from configured targets at specified intervals, stores them in a time-series database, and provides powerful querying capabilities. 
With its flexible architecture, Prometheus is particularly suited for dynamic environments, making it a popular choice for cloud-native applications and microservices. Its intuitive visualization tools help teams gain deep insights into system performance, ensuring timely detection of issues. +[Prometheus](https://prometheus.io/) is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects metrics from configured targets at specified intervals, stores them in a time-series database, and provides powerful querying capabilities. With its flexible architecture, Prometheus is particularly suited for dynamic environments, making it a popular choice for cloud-native applications and microservices. Its intuitive visualization tools help teams gain deep insights into system performance, ensuring timely detection of issues. + +[Prometheus operator](https://prometheus-operator.dev/) The Prometheus Operator manages Prometheus clusters atop Kubernetes. ## Configuration ### AI DIAL Core Settings Add the following environment variable to AI DIAL Core configuration. Refer to [AI DIAL Core](https://github.com/epam/ai-dial-helm/tree/main/charts/dial-core) to learn more. ```json containerPorts.metrics +metrics.enabled +metrics.serviceMonitor.enabled ``` +The default port for collecting metrics in most dial applications is 9464. If necessary, you can change the parameter `containerPorts.metrics` + ### AI DIAL Extentions Settings Add the following environment variable to AI DIAL Extentions configuration. Refer to [AI DIAL Extentions](https://github.com/epam/ai-dial-helm/tree/main/charts/dial-extension) to learn more. ```json containerPorts.metrics -metrics.* +metrics.enabled +metrics.serviceMonitor.enabled ``` +The default port for collecting metrics in most dial applications is 9464. If necessary, you can change the parameter `containerPorts.metrics` ## Examples ``` diff --git a/docs/Observability/3. otlp.md b/docs/Observability/3. 
otlp.md index c002ec5c..be45a8bc 100644 --- a/docs/Observability/3. otlp.md +++ b/docs/Observability/3. otlp.md @@ -1,9 +1,7 @@ # OpenTelemetry method ## Introduction -OpenTelemetry is an open-source observability framework designed to standardize the collection of telemetry data across distributed systems. By providing a unified set of APIs, libraries, and agents, it enables developers to capture traces, metrics, and logs from their applications seamlessly. OpenTelemetry simplifies the monitoring process and enhances visibility into application performance and reliability, making it easier to troubleshoot issues and optimize systems in real-time. +[OpenTelemetry](https://opentelemetry.io/) is an open-source observability framework designed to standardize the collection of telemetry data across distributed systems. By providing a unified set of APIs, libraries, and agents, it enables developers to capture traces, metrics, and logs from their applications seamlessly. OpenTelemetry simplifies the monitoring process and enhances visibility into application performance and reliability, making it easier to troubleshoot issues and optimize systems in real-time. ## Configuration -### Configuration Guidelines - `TBD?` ### Configure AI DIAL Environment Variables for the Configuration of Opentelemetry All standard env variables you could find in the official opentelemetry [documentation](https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/logging/logging.html). If no value set for the OTEL_METRICS_EXPORTER then OpenTelemetry Prometheus Metric Exporter will be used. If value set to "otlp" the OpenTelemetry Collector Metrics Exporter for web and node will be used. @@ -26,6 +24,8 @@ Where: - `OTEL_EXPORTER_OTLP_ENDPOINT` - OTEL endpoint URL - `OTEL_RESOURCE_ATTRIBUTES` - Key-value pairs to be used as resource attributes +The DIAL ecosystem includes other programming languages (for example chat use nodejs ) + ## Examples ``` TBD? 
diff --git a/docs/Observability/4. Log collections/1. overview.md b/docs/Observability/4. Log collections/1. overview.md
new file mode 100644
index 00000000..e21edba0
--- /dev/null
+++ b/docs/Observability/4. Log collections/1. overview.md
@@ -0,0 +1,2 @@
+# Overview
+Logs are short messages that capture significant events within a software system, along with associated metadata. Log collection refers to the generation, aggregation, and storage of the historical data represented by the logs.
\ No newline at end of file
diff --git a/docs/Observability/4. Log collections/2. k8salloy.md b/docs/Observability/4. Log collections/2. k8salloy.md
new file mode 100644
index 00000000..017f91ff
--- /dev/null
+++ b/docs/Observability/4. Log collections/2. k8salloy.md
@@ -0,0 +1,7 @@
+# Kubernetes Alloy
+
+If you deploy your application in Kubernetes it’s recommended to deploy Grafana Alloy using the Kubernetes Monitoring helm chart and take advantage of both Application Observability and Kubernetes Monitoring solutions.
+
+Setup
+Follow the [instructions]() provided in the Configure Kubernetes Monitoring with Grafana Kubernetes Monitoring Helm chart documentation.
+https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/configuration/helm-chart-config/
\ No newline at end of file
From 56c84d341a1f1c4832a4a7cc9fbdefa732954cc3 Mon Sep 17 00:00:00 2001
From: Alex
Date: Mon, 9 Sep 2024 13:22:37 +0200
Subject: [PATCH 3/9] Fix issue

---
 docs/Observability/4. Log collections/2. k8salloy.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/Observability/4. Log collections/2. k8salloy.md b/docs/Observability/4. Log collections/2. k8salloy.md
index 017f91ff..263326f2 100644
--- a/docs/Observability/4. Log collections/2. k8salloy.md
+++ b/docs/Observability/4. Log collections/2. k8salloy.md
@@ -3,5 +3,4 @@
 If you deploy your application in Kubernetes it’s recommended to deploy Grafana Alloy using the Kubernetes Monitoring helm chart and take advantage of both Application Observability and Kubernetes Monitoring solutions.
 
 Setup
-Follow the [instructions]() provided in the Configure Kubernetes Monitoring with Grafana Kubernetes Monitoring Helm chart documentation.
-https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/configuration/helm-chart-config/
\ No newline at end of file
+Follow the [instructions](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/configuration/helm-chart-config/) provided in the Configure Kubernetes Monitoring with Grafana Kubernetes Monitoring Helm chart documentation.
\ No newline at end of file
From e8d94b66d0bcdd7bcc1411417056791d399b5791 Mon Sep 17 00:00:00 2001
From: Aleksey
Date: Wed, 2 Oct 2024 11:41:28 +0200
Subject: [PATCH 4/9] chore: review observability (#199)

* chore: review observability

---------

Co-authored-by: sr-remsha
Co-authored-by: Aleksandr Eroshkin
---
 docs/Observability/1. overview.md | 9 -
 docs/Observability/2. prometheus.md | 28 ---
 docs/Observability/3. otlp.md | 32 ----
 .../4. Log collections/1. overview.md | 2 -
 .../4. Log collections/2. k8salloy.md | 6 -
 docs/Observability/Observability.md | 162 ++++++++++++++++++
 sidebars.js | 7 +
 7 files changed, 169 insertions(+), 77 deletions(-)
 delete mode 100644 docs/Observability/1. overview.md
 delete mode 100644 docs/Observability/2. prometheus.md
 delete mode 100644 docs/Observability/3. otlp.md
 delete mode 100644 docs/Observability/4. Log collections/1. overview.md
 delete mode 100644 docs/Observability/4. Log collections/2. k8salloy.md
 create mode 100644 docs/Observability/Observability.md

diff --git a/docs/Observability/1. overview.md b/docs/Observability/1. overview.md
deleted file mode 100644
index 49315bfd..00000000
--- a/docs/Observability/1. overview.md
+++ /dev/null
@@ -1,9 +0,0 @@
-# Overview
-
-Observability is the ability to infer a system's internal state from its external outputs.
-AI DIAL supports Prometheus and OpenTelemetry (OTEL) methods enhance observability by providing powerful metrics collection and tracing capabilities, enabling deeper insights into system performance and behavior.
-
-For DIAL application we have few options to monitoring/observability:
-- Prometheus (pull/scrape model)
-- OTLP (push model)
-- Both (application can be configured to use both approaches)
\ No newline at end of file
diff --git a/docs/Observability/2. prometheus.md b/docs/Observability/2. prometheus.md
deleted file mode 100644
index 397ee7b8..00000000
--- a/docs/Observability/2. prometheus.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# Prometheus method
-## Introduction
-[Prometheus](https://prometheus.io/) is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects metrics from configured targets at specified intervals, stores them in a time-series database, and provides powerful querying capabilities. With its flexible architecture, Prometheus is particularly suited for dynamic environments, making it a popular choice for cloud-native applications and microservices. Its intuitive visualization tools help teams gain deep insights into system performance, ensuring timely detection of issues.
-
-[Prometheus operator](https://prometheus-operator.dev/) The Prometheus Operator manages Prometheus clusters atop Kubernetes.
-## Configuration
-### AI DIAL Core Settings
-Add the following environment variable to AI DIAL Core configuration. Refer to [AI DIAL Core](https://github.com/epam/ai-dial-helm/tree/main/charts/dial-core) to learn more.
-```json
-containerPorts.metrics
-metrics.enabled
-metrics.serviceMonitor.enabled
-```
-The default port for collecting metrics in most dial applications is 9464. If necessary, you can change the parameter `containerPorts.metrics`
-
-### AI DIAL Extentions Settings
-Add the following environment variable to AI DIAL Extentions configuration. Refer to [AI DIAL Extentions](https://github.com/epam/ai-dial-helm/tree/main/charts/dial-extension) to learn more.
-```json
-containerPorts.metrics
-metrics.enabled
-metrics.serviceMonitor.enabled
-```
-The default port for collecting metrics in most dial applications is 9464. If necessary, you can change the parameter `containerPorts.metrics`
-
-## Examples
-```
-TBD?
-```
\ No newline at end of file
diff --git a/docs/Observability/3. otlp.md b/docs/Observability/3. otlp.md
deleted file mode 100644
index be45a8bc..00000000
--- a/docs/Observability/3. otlp.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# OpenTelemetry method
-## Introduction
-[OpenTelemetry](https://opentelemetry.io/) is an open-source observability framework designed to standardize the collection of telemetry data across distributed systems. By providing a unified set of APIs, libraries, and agents, it enables developers to capture traces, metrics, and logs from their applications seamlessly. OpenTelemetry simplifies the monitoring process and enhances visibility into application performance and reliability, making it easier to troubleshoot issues and optimize systems in real-time.
-## Configuration
-### Configure AI DIAL
-Environment Variables for the Configuration of Opentelemetry
-All standard env variables you could find in the official opentelemetry [documentation](https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/logging/logging.html). If no value set for the OTEL_METRICS_EXPORTER then OpenTelemetry Prometheus Metric Exporter will be used. If value set to "otlp" the OpenTelemetry Collector Metrics Exporter for web and node will be used.
-```json
-OTEL_PYTHON_LOG_CORRELATION: "true|false"
-OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: ""
-OTEL_LOGS_EXPORTER: "otlp"
-OTEL_METRICS_EXPORTER: "otlp|otlp,prometheus"
-OTEL_TRACES_EXPORTER: "otlp"
-OTEL_EXPORTER_OTLP_ENDPOINT: ""
-OTEL_RESOURCE_ATTRIBUTES: "service.name="
-```
-
-Where:
-- `OTEL_PYTHON_LOG_CORRELATION` - enable trace context injection
-- `OTEL_PYTHON_FASTAPI_EXCLUDED_URLS`: to exclude certain URLs from tracking
-- `OTEL_LOGS_EXPORTER` - Logs exporter to be used
-- `OTEL_METRICS_EXPORTER` - Metrics exporter to be used
-- `OTEL_TRACES_EXPORTER` - Trace exporter to be used
-- `OTEL_EXPORTER_OTLP_ENDPOINT` - OTEL endpoint URL
-- `OTEL_RESOURCE_ATTRIBUTES` - Key-value pairs to be used as resource attributes
-
-The DIAL ecosystem includes other programming languages (for example chat use nodejs )
-
-## Examples
-```
-TBD?
-```
\ No newline at end of file
diff --git a/docs/Observability/4. Log collections/1. overview.md b/docs/Observability/4. Log collections/1. overview.md
deleted file mode 100644
index e21edba0..00000000
--- a/docs/Observability/4. Log collections/1. overview.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Overview
-Logs are short messages that capture significant events within a software system, along with associated metadata. Log collection refers to the generation, aggregation, and storage of the historical data represented by the logs.
\ No newline at end of file
diff --git a/docs/Observability/4. Log collections/2. k8salloy.md b/docs/Observability/4. Log collections/2. k8salloy.md
deleted file mode 100644
index 263326f2..00000000
--- a/docs/Observability/4. Log collections/2. k8salloy.md
+++ /dev/null
@@ -1,6 +0,0 @@
-# Kubernetes Alloy
-
-If you deploy your application in Kubernetes it’s recommended to deploy Grafana Alloy using the Kubernetes Monitoring helm chart and take advantage of both Application Observability and Kubernetes Monitoring solutions.
- -Setup -Follow the [instructions](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/configuration/helm-chart-config/) provided in the Configure Kubernetes Monitoring with Grafana Kubernetes Monitoring Helm chart documentation. \ No newline at end of file diff --git a/docs/Observability/Observability.md b/docs/Observability/Observability.md new file mode 100644 index 00000000..04f67636 --- /dev/null +++ b/docs/Observability/Observability.md @@ -0,0 +1,162 @@ +# Overview + +AI DIAL components provide the following types of monitoring/observability: +- Logs ([Container logs](#container-logs) or [OTel](#opentelemetry)) +- Metrics ([Prometheus](#prometheus) or [OTel](#opentelemetry)) +- Traces ([OTel](#opentelemetry)) + +
+ + +# Table of Contents +- [Overview](#overview) +- [Container Logs](#container-logs) + - [Configuration AI DIAL](#configuration-ai-dial) + - [Python Components](#python-components) + - [AI DIAL Chat](#ai-dial-chat) + - [AI DIAL Core](#ai-dial-core) + - [AI DIAL Bedrock Adapter](#ai-dial-bedrock-adapter) + - [AI DIAL Vertex Adapter](#ai-dial-vertex-adapter) + - [AI DIAL OpenAI Adapter](#ai-dial-openai-adapter) + - [AI DIAL Adapter](#ai-dial-adapter) +- [Prometheus](#prometheus) + - [Configure AI DIAL Components](#configure-ai-dial-components) + - [Configure DIAL Helm Charts](#configure-dial-helm-charts) +- [OpenTelemetry](#opentelemetry) + - [Configure AI DIAL](#configure-ai-dial) + - [Python Components](#python-components-1) + - [Node.js Components](#nodejs-components) + +
+ +# Container Logs + +Unix and Linux commands typically open three I/O streams when they run, called STDIN, STDOUT, and STDERR. + +* STDIN is the command's input stream, which may include input from the keyboard or input from another command. +* STDOUT is usually a command's normal output. +* STDERR is typically used to output error messages. + +AI DIAL components by default use this approach for outputting system logs. + +## Configuration AI DIAL + +### AI DIAL Chat + +AI DIAL supports OpenTelemetry (OTEL) methods to enhance observability by providing powerful metrics for collection and tracing capabilities, enabling deeper insights into system performance and behavior. + +All environment variables you can find in the official OpenTelemetry Collector Logs Exporter for web and node with HTTP [documentation](https://www.npmjs.com/package/@opentelemetry/exporter-logs-otlp-http). + +```yaml +OTEL_EXPORTER_OTLP_LOGS_ENDPOINT: #The endpoint to send logs to. By default https://localhost:4318/v1/logs will be used. v1/logs will not be appended automatically and has to be added explicitly. +OTEL_EXPORTER_OTLP_LOGS_TIMEOUT: #The maximum waiting time, in milliseconds, allowed to send each OTLP log batch. Default is 10000. +``` + +AI DIAL supports OpenTelemetry SDK for Node.js. + +All environment variables you can find in the official OpenTelemetry SDK for Node.js [documentation](https://www.npmjs.com/package/@opentelemetry/sdk-node). + +```yaml +OTEL_SDK_DISABLED: #Disable the SDK by setting the OTEL_SDK_DISABLED environment variable to `true` +OTEL_LOG_LEVEL: #Log level used by the SDK logger.` Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. +``` + +### AI DIAL Core + +The nex environment variables you can for the configuring logging: +```yaml +AIDIAL_LOG_FILE: #Place when the log file should be stored. 
+AIDIAL_LOG_LEVEL: #The logging levels used are ERROR, WARN, INFO, DEBUG, and TRACE. +``` + +### AI DIAL Bedrock Adapter + +```yaml +LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. +AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`. +``` + +### AI DIAL Vertex Adapter + +```yaml +LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. +AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`. +``` + +### AI DIAL OpenAI Adapter + +```yaml +LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. +``` + +### AI DIAL Adapter + +```yaml +LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. +``` + +# Prometheus + +[Prometheus](https://prometheus.io/) is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects metrics from configured targets at specified intervals, stores them in a time-series database, and provides powerful querying capabilities. With its flexible architecture, Prometheus is particularly suited for dynamic environments, making it a popular choice for cloud-native applications and microservices. 
Its intuitive visualization tools help to gain deep insights into system performance, ensuring timely detection of issues. + +[Prometheus Operator](https://prometheus-operator.dev/) manages Prometheus clusters atop Kubernetes. + +## Configure AI DIAL Components + +By default, AI DIAL components have metrics enabled in Prometheus format on port 9464. + +## Configure DIAL Helm Charts + +Add the following helm values to AI DIAL Helm. Refer to [AI DIAL](https://github.com/epam/ai-dial-helm/tree/main/charts/dial) to learn more. + + ```yaml + : + metrics: + enabled: true + serviceMonitor: + enabled: true # when using the Prometheus Operator + ``` +The default port for collecting metrics in AI DIAL components is 9464. You can change the parameter `.containerPorts.metrics` to change the default port. + +# OpenTelemetry + +[OpenTelemetry](https://opentelemetry.io/) is an open-source observability framework designed to standardize the collection of telemetry data across distributed systems. By providing a unified set of APIs, libraries, and agents, it enables developers to capture traces, metrics, and logs from their applications seamlessly. OpenTelemetry simplifies the monitoring process and enhances visibility into application performance and reliability, making it easier to troubleshoot issues and optimize systems in real-time. + +AI DIAL supports OpenTelemetry (OTEL) methods to enhance observability by providing powerful metrics for collection and tracing capabilities, enabling deeper insights into system performance and behavior. + +## Configure AI DIAL + +All environment variables you can find in the official OpenTelemetry [documentation](https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/). + +### Python Components + +All standard python environment variables you can find in the official OpenTelemetry [documentation](https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/logging/logging.html). 
+ +* If the value for **OTEL_METRICS_EXPORTER** is not set, the [OpenTelemetry Prometheus Metric Exporter](https://www.npmjs.com/package/@opentelemetry/exporter-prometheus) will be used. +* If its value is set to `"otlp"`, the [OpenTelemetry Collector Metrics Exporter for Web and Node](https://www.npmjs.com/package/@opentelemetry/exporter-metrics-otlp-http) will be used. + +Example configuration of OpenTelemetry: + +```yaml + OTEL_RESOURCE_ATTRIBUTES: "service.name=" # Key-value pairs to be used as resource attributes + OTEL_EXPORTER_OTLP_ENDPOINT: "" # OTEL endpoint URL + OTEL_LOGS_EXPORTER: "otlp" # logs exporter to be used + OTEL_METRICS_EXPORTER: "otlp|otlp,prometheus" # metrics exporter to be used + OTEL_TRACES_EXPORTER: "otlp" # trace exporter to be used + OTEL_PYTHON_LOG_CORRELATION: "true|false" # enable trace context injection + OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: "" # to exclude certain URLs from tracking +``` + +### Node.js Components + +* If the value for **OTEL_METRICS_EXPORTER** is not set, the [OpenTelemetry Prometheus Metric Exporter](https://www.npmjs.com/package/@opentelemetry/exporter-prometheus) will be used. +* If its value is set to `"otlp"`, the [OpenTelemetry Collector Metrics Exporter for Web and Node](https://www.npmjs.com/package/@opentelemetry/exporter-metrics-otlp-http) will be used. 
+ +Example configuration of OpenTelemetry: + +```yaml + OTEL_SERVICE_NAME: "" # Key-value pairs to be used as resource attributes + OTEL_EXPORTER_OTLP_ENDPOINT: "" # OTEL endpoint URL + OTEL_LOGS_EXPORTER: "otlp" # logs exporter to be used + OTEL_METRICS_EXPORTER: "otlp|otlp,prometheus" # metrics exporter to be used +``` \ No newline at end of file diff --git a/sidebars.js b/sidebars.js index 84d90389..7df60156 100644 --- a/sidebars.js +++ b/sidebars.js @@ -169,6 +169,13 @@ const sidebars = { "Cookbook/dial-cookbook/examples/how_to_call_image_to_text_applications", ], }, + { + type: 'category', + label: 'Observability', + items: [ + "Observability/Observability", + ], + }, { type: 'link', label: 'API Reference', From e69272ea251a284ff54a6c96937fbc614f30d2e5 Mon Sep 17 00:00:00 2001 From: sr-remsha Date: Wed, 2 Oct 2024 13:42:53 +0200 Subject: [PATCH 5/9] review --- docs/Observability/Observability.md | 41 ++++++++++++++--------------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/docs/Observability/Observability.md b/docs/Observability/Observability.md index 04f67636..92a77def 100644 --- a/docs/Observability/Observability.md +++ b/docs/Observability/Observability.md @@ -29,7 +29,7 @@ AI DIAL components provide the following types of monitoring/observability: -# Container Logs +## Container Logs Unix and Linux commands typically open three I/O streams when they run, called STDIN, STDOUT, and STDERR. @@ -39,9 +39,9 @@ Unix and Linux commands typically open three I/O streams when they run, called S AI DIAL components by default use this approach for outputting system logs. -## Configuration AI DIAL +### AI DIAL Configuration -### AI DIAL Chat +#### Chat AI DIAL supports OpenTelemetry (OTEL) methods to enhance observability by providing powerful metrics for collection and tracing capabilities, enabling deeper insights into system performance and behavior. @@ -52,62 +52,61 @@ OTEL_EXPORTER_OTLP_LOGS_ENDPOINT: #The endpoint to send logs to. 
By default http OTEL_EXPORTER_OTLP_LOGS_TIMEOUT: #The maximum waiting time, in milliseconds, allowed to send each OTLP log batch. Default is 10000. ``` -AI DIAL supports OpenTelemetry SDK for Node.js. - -All environment variables you can find in the official OpenTelemetry SDK for Node.js [documentation](https://www.npmjs.com/package/@opentelemetry/sdk-node). +AI DIAL supports OpenTelemetry SDK for Node.js. All environment variables you can find in the official OpenTelemetry SDK for Node.js [documentation](https://www.npmjs.com/package/@opentelemetry/sdk-node). ```yaml OTEL_SDK_DISABLED: #Disable the SDK by setting the OTEL_SDK_DISABLED environment variable to `true` OTEL_LOG_LEVEL: #Log level used by the SDK logger.` Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. ``` -### AI DIAL Core +#### Core + +These environment variables you can use to configure logging: -The nex environment variables you can for the configuring logging: ```yaml AIDIAL_LOG_FILE: #Place when the log file should be stored. AIDIAL_LOG_LEVEL: #The logging levels used are ERROR, WARN, INFO, DEBUG, and TRACE. ``` -### AI DIAL Bedrock Adapter +#### Bedrock Adapter ```yaml LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`. ``` -### AI DIAL Vertex Adapter +#### Vertex Adapter ```yaml LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes. 
 AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`.
 ```
-### AI DIAL OpenAI Adapter
+#### OpenAI Adapter
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 ```
-### AI DIAL Adapter
+#### DIAL Adapter
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 ```
-# Prometheus
+## Prometheus
 [Prometheus](https://prometheus.io/) is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects metrics from configured targets at specified intervals, stores them in a time-series database, and provides powerful querying capabilities. With its flexible architecture, Prometheus is particularly suited for dynamic environments, making it a popular choice for cloud-native applications and microservices. Its intuitive visualization tools help to gain deep insights into system performance, ensuring timely detection of issues.
 [Prometheus Operator](https://prometheus-operator.dev/) manages Prometheus clusters atop Kubernetes.
-## Configure AI DIAL Components
+### Configure AI DIAL Components
-By default, AI DIAL components have metrics enabled in Prometheus format on port 9464.
+By default, AI DIAL components have metrics enabled in Prometheus format on port **9464**.
-## Configure DIAL Helm Charts
+### Configure DIAL Helm Charts
-Add the following helm values to AI DIAL Helm. Refer to [AI DIAL](https://github.com/epam/ai-dial-helm/tree/main/charts/dial) to learn more.
+Add the following helm values to AI DIAL Helm. Refer to [AI DIAL Helm](https://github.com/epam/ai-dial-helm/tree/main/charts/dial) to learn more.
 ```yaml
 :
@@ -118,17 +117,17 @@ Add the following helm values to AI DIAL Helm. Refer to [AI DIAL](https://github
 ```
 The default port for collecting metrics in AI DIAL components is 9464. You can change the parameter `.containerPorts.metrics` to change the default port.
-# OpenTelemetry
+## OpenTelemetry
 [OpenTelemetry](https://opentelemetry.io/) is an open-source observability framework designed to standardize the collection of telemetry data across distributed systems. By providing a unified set of APIs, libraries, and agents, it enables developers to capture traces, metrics, and logs from their applications seamlessly. OpenTelemetry simplifies the monitoring process and enhances visibility into application performance and reliability, making it easier to troubleshoot issues and optimize systems in real-time.
 AI DIAL supports OpenTelemetry (OTEL) methods to enhance observability by providing powerful metrics for collection and tracing capabilities, enabling deeper insights into system performance and behavior.
-## Configure AI DIAL
+### AI DIAL Configuration
 All environment variables you can find in the official OpenTelemetry [documentation](https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/).
-### Python Components
+#### Python Components
 All standard python environment variables you can find in the official OpenTelemetry [documentation](https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/logging/logging.html).
@@ -147,7 +146,7 @@ Example configuration of OpenTelemetry:
 OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: "" # to exclude certain URLs from tracking
 ```
-### Node.js Components
+#### Node.js Components
 * If the value for **OTEL_METRICS_EXPORTER** is not set, the [OpenTelemetry Prometheus Metric Exporter](https://www.npmjs.com/package/@opentelemetry/exporter-prometheus) will be used.
 * If its value is set to `"otlp"`, the [OpenTelemetry Collector Metrics Exporter for Web and Node](https://www.npmjs.com/package/@opentelemetry/exporter-metrics-otlp-http) will be used.

From 7826131e2c02e847d07b178a9e3e0b6d237e46bc Mon Sep 17 00:00:00 2001
From: Aleksandr Eroshkin
Date: Wed, 2 Oct 2024 14:58:29 +0200
Subject: [PATCH 6/9] Some fixes

---
 docs/Observability/Observability.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/docs/Observability/Observability.md b/docs/Observability/Observability.md
index 92a77def..2a9116ae 100644
--- a/docs/Observability/Observability.md
+++ b/docs/Observability/Observability.md
@@ -60,36 +60,44 @@ OTEL_LOG_LEVEL: #Log level used by the SDK logger.` Values: `TRACE, DEBUG, INFO,
 ```
 #### Core
+The main component of AI DIAL, which provides unified API to different chat completion and embedding models, assistants, and applications.
-These environment variables you can use to configure logging:
-
+The next environment variables you can for the configuring logging:
 ```yaml
 AIDIAL_LOG_FILE: #Place when the log file should be stored.
 AIDIAL_LOG_LEVEL: #The logging levels used are ERROR, WARN, INFO, DEBUG, and TRACE.
 ```
 #### Bedrock Adapter
+The project implements AI DIAL API for language models from AWS Bedrock
+The next environment variables you can for the configuring logging:
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`.
 ```
 #### Vertex Adapter
+The project implements AI DIAL API for language models and embeddings from Vertex AI
+The next environment variables you can for the configuring logging:
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`.
 ```
 #### OpenAI Adapter
+The project implements AI DIAL API for language models from Azure OpenAI
+The next environment variables you can for the configuring logging:
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 ```
 #### DIAL Adapter
+The project implements application which adapts calls from one DIAL Core to calls to another DIAL Core.
+The next environment variables you can for the configuring logging:
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 ```

From 92049edf37b130ef630f0a0df8a9aa9fb55a68fd Mon Sep 17 00:00:00 2001
From: sr-remsha
Date: Wed, 2 Oct 2024 15:16:09 +0200
Subject: [PATCH 7/9] review

---
 docs/Observability/Observability.md | 45 ++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/docs/Observability/Observability.md b/docs/Observability/Observability.md
index 2a9116ae..908173c6 100644
--- a/docs/Observability/Observability.md
+++ b/docs/Observability/Observability.md
@@ -1,6 +1,7 @@
 # Overview
 AI DIAL components provide the following types of monitoring/observability:
+
 - Logs ([Container logs](#container-logs) or [OTel](#opentelemetry))
 - Metrics ([Prometheus](#prometheus) or [OTel](#opentelemetry))
 - Traces ([OTel](#opentelemetry))
@@ -60,36 +61,54 @@ OTEL_LOG_LEVEL: #Log level used by the SDK logger.` Values: `TRACE, DEBUG, INFO,
 ```
 #### Core
-The main component of AI DIAL, which provides unified API to different chat completion and embedding models, assistants, and applications.
-The next environment variables you can for the configuring logging:
+[DIAL Core](https://github.com/epam/ai-dial-core) is the main component of AI DIAL, which provides [Unified API](https://epam-rail.com/dial_api) to different chat completion and embedding models, assistants, and applications.
+
+These environment variables you can use to configure logging:
+
 ```yaml
 AIDIAL_LOG_FILE: #Place when the log file should be stored.
 AIDIAL_LOG_LEVEL: #The logging levels used are ERROR, WARN, INFO, DEBUG, and TRACE.
 ```
 #### Bedrock Adapter
-The project implements AI DIAL API for language models from AWS Bedrock
-The next environment variables you can for the configuring logging:
+[AI DIAL Bedrock Adapter](https://github.com/epam/ai-dial-adapter-bedrock) implements AI DIAL API for language models from AWS Bedrock.
+
+These environment variables you can use to configure logging:
+
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`.
 ```
 #### Vertex Adapter
-The project implements AI DIAL API for language models and embeddings from Vertex AI
-The next environment variables you can for the configuring logging:
+[AI DIAL Vertex AI Adapter](https://github.com/epam/ai-dial-adapter-vertexai) implements AI DIAL API for language models and embeddings from Vertex AI.
+
+These environment variables you can use to configure logging:
+
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 AIDIAL_LOG_LEVEL: #AI DIAL SDK Level filter for the LLM and response logging. Values: `TRACE, DEBUG, INFO, WARNING, ERROR, FATAL`.
 ```
 #### OpenAI Adapter
-The project implements AI DIAL API for language models from Azure OpenAI
-The next environment variables you can for the configuring logging:
+[AI DIAL OpenAI Adapter](https://github.com/epam/ai-dial-adapter-openai) implements AI DIAL API for language models from Azure OpenAI.
+
+These environment variables you can use to configure logging:
+
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 ```
 #### DIAL Adapter
-The project implements application which adapts calls from one DIAL Core to calls to another DIAL Core.
-The next environment variables you can for the configuring logging:
+[DIAL Adapter](https://github.com/epam/ai-dial-adapter-dial) adapts calls from one DIAL Core to calls to another DIAL Core.
+
+These environment variables you can use to configure logging:
+
 ```yaml
 LOG_LEVEL: #Level filter for the Adapter logger. Values: `TRACE, DEBUG, INFO, WARN, ERROR, FATAL`. Use `DEBUG` for dev purposes and INFO in prod. It is strongly recommended not to use the logging level `DEBUG` for prod purposes.
 ```
@@ -116,13 +127,13 @@ By default, AI DIAL components have metrics enabled in Prometheus format on port
 Add the following helm values to AI DIAL Helm. Refer to [AI DIAL Helm](https://github.com/epam/ai-dial-helm/tree/main/charts/dial) to learn more.
- ```yaml
- :
- metrics:
- enabled: true
- serviceMonitor:
- enabled: true # when using the Prometheus Operator
- ```
+```yaml
+:
+ metrics:
+ enabled: true
+ serviceMonitor:
+ enabled: true # when using the Prometheus Operator
+```
 The default port for collecting metrics in AI DIAL components is 9464. You can change the parameter `.containerPorts.metrics` to change the default port.
 ## OpenTelemetry

From b390df949ac356b61ffb56154193c52cbf532b85 Mon Sep 17 00:00:00 2001
From: sr-remsha
Date: Wed, 2 Oct 2024 15:22:45 +0200
Subject: [PATCH 8/9] edit sidebar and review

---
 docs/Observability/Observability.md | 2 +-
 docs/architecture.md | 2 ++
 sidebars.js | 6 ++----
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/Observability/Observability.md b/docs/Observability/Observability.md
index 908173c6..faa42e2b 100644
--- a/docs/Observability/Observability.md
+++ b/docs/Observability/Observability.md
@@ -1,4 +1,4 @@
-# Overview
+# Observability and Monitoring
 AI DIAL components provide the following types of monitoring/observability:
diff --git a/docs/architecture.md b/docs/architecture.md
index c0cbfd11..638e2f73 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -217,6 +217,8 @@ Metrics are gathered for the entire system and/or for individual system componen
 You can use any OTLE Collector such as Prometheus, Jaeger, Fluentd, Zipkin and other.
+> Refer to [Observability](./Observability/Observability.md) to learn more.
+
 ## Key Vault
 All sensitive information is stored according to the best practices of the selected cloud platform, utilizing systems like GCP Cloud Key Management Service, AWS Secrets Manager, Azure Key Vault, and Vault by Hashicorp.
diff --git a/sidebars.js b/sidebars.js
index 7df60156..d0b599c9 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -170,11 +170,9 @@ const sidebars = {
 ],
 },
 {
- type: 'category',
+ type: 'doc',
+ id: 'Observability/Observability',
 label: 'Observability',
- items: [
- "Observability/Observability",
- ],
 },
 {
 type: 'link',

From f42159c61c6dc8f6dd1590188296e976a67a7dc6 Mon Sep 17 00:00:00 2001
From: sr-remsha
Date: Thu, 3 Oct 2024 09:59:19 +0200
Subject: [PATCH 9/9] added link to the new doc in architecture

---
 docs/architecture.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/architecture.md b/docs/architecture.md
index 638e2f73..e125f7aa 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -115,6 +115,8 @@ AI DIAL Core uses [Vector](https://vector.dev/docs/reference/configuration/sinks
 You can gather standard logs (which do not contain user messages) from components using the ELK stack (Elasticsearch, Logstash, Kibana) or other log collection system.
+> Refer to [Observability](./Observability/Observability.md) to learn more.
+
 #### Entitlements
 In AI DIAL Core, user roles are defined and configured in the application config file. This allows administrators to specify which users or user groups are authorized to access specific resources or features within the application. These user roles match the once created in your IDP.