Open Telemetry Integration #569

Open · wants to merge 14 commits into base: main
Conversation

@wcalderipe (Collaborator) commented Oct 28, 2024

This PR introduces OpenTelemetry (OTEL) instrumentation for all applications.


The OTEL integration in our stack is implemented through two separate code paths:

  • The @narval/open-telemetry library handles Node.js instrumentation registration and exporter configuration.
  • The open-telemetry module in @narval/nestjs-shared exposes NestJS services for traces and metrics, leveraging dependency injection for improved testability (a usage sketch follows this list).
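A rough usage sketch of those services inside a NestJS provider, assuming they mirror the OTEL tracer and meter APIs. The TraceService and MetricService class names and the import path are assumptions for illustration; the startSpan and createCounter calls mirror the snippets reviewed further down in this PR.

import { Counter } from '@opentelemetry/api'
import { Injectable } from '@nestjs/common'
import { MetricService, TraceService } from '@narval/nestjs-shared' // assumed names

@Injectable()
export class ExampleService {
  private readonly getCounter: Counter

  constructor(
    private readonly traceService: TraceService,
    private readonly metricService: MetricService
  ) {
    // Counters are created once and incremented per call.
    this.getCounter = this.metricService.createCounter('example_get_count')
  }

  get(clientId: string) {
    // Span names follow the `${ClassName.name}.method` convention used in this PR.
    const span = this.traceService.startSpan(`${ExampleService.name}.get`, {
      attributes: { clientId }
    })

    try {
      this.getCounter.add(1)
      // ...actual work goes here...
    } finally {
      span.end()
    }
  }
}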

OTEL auto-instruments many popular packages (full list here). In addition to these default instrumentations, this PR adds Prisma instrumentation for database visibility. Note that the Prisma instrumentation could alternatively be implemented at the infrastructure level through database instrumentation.
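For reference, wiring the Prisma instrumentation into the Node SDK looks roughly like the sketch below, using the public OTEL packages; the actual registration lives inside @narval/open-telemetry and may differ.

import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { NodeSDK } from '@opentelemetry/sdk-node'
import { PrismaInstrumentation } from '@prisma/instrumentation'

const sdk = new NodeSDK({
  serviceName: 'auth',
  // Default auto-instrumentations (http, express, winston, ...) plus Prisma.
  instrumentations: [getNodeAutoInstrumentations(), new PrismaInstrumentation()]
})

sdk.start()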

Outgoing and Incoming HTTP Requests

For HTTP request tracing:

  • Incoming requests: @opentelemetry/instrumentation-http sets trace context on active spans
  • Outgoing requests: W3CTraceContextPropagator and W3CBaggagePropagator automatically set trace context headers

This enables continuous tracing across distributed processes.
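Concretely, the propagators read and write the W3C headers on each request. With the trace from the log example below, the outgoing request headers would look like this (values illustrative):

traceparent: 00-e29f3d185fa0e3e4ff547c8475a72baa-21ccbfd64f5d16a2-01
baggage: clientId=example-client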


Logs

Winston logs automatically include trace context when a span context exists, thanks to auto-instrumentation of @opentelemetry/instrumentation-winston.

2024-10-30T14:23:23.449Z [INFO]: GET /v1/data/entities 304
{
  "trace_id": "e29f3d185fa0e3e4ff547c8475a72baa",
  "span_id": "21ccbfd64f5d16a2",
  "trace_flags": "01",
  "timestamp": "2024-10-30T14:23:23.449Z"
}

The logs are transported to OTEL's log provider using @opentelemetry/winston-transport. The SDK wires the transport up automatically when the winston, @opentelemetry/instrumentation-winston, and @opentelemetry/winston-transport packages are installed.
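For reference, attaching the transport by hand (instead of relying on the auto-instrumentation) would look roughly like this sketch:

import { OpenTelemetryTransportV3 } from '@opentelemetry/winston-transport'
import * as winston from 'winston'

// Every winston record is forwarded to the OTEL Logs SDK and exported via OTLP.
const logger = winston.createLogger({
  transports: [new winston.transports.Console(), new OpenTelemetryTransportV3()]
})

logger.info('GET /v1/data/entities 304')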

Testing this PR

Important

In development, OTEL's collector is disabled by default. If you want to enable it, see the OTEL section in the README.

Start Jaeger and OTEL's collector containers.

make docker/otel/up

Add the following to the .env of each application.

OTEL_SDK_DISABLED=false
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_LOGS_EXPORTER=otlp

Note

You can see the list of the SDK's environment variables here.

Alternatively, override your local .env with the new defaults.

make armory/copy-default-env
make policy-engine/copy-default-env
make vault/copy-default-env

Restart the applications and make some requests to the API.

Now, head over to Jaeger's UI at http://localhost:16686.

Debug OTEL SDK

To debug what's going on within OTEL's SDK and all its magical auto-instrumentation, set the diagnostic log level in the instrumentOpenTelemetry call in each application's main.ts:

import { instrumentOpenTelemetry } from '@narval/open-telemetry'
import { DiagLogLevel } from '@opentelemetry/api'

instrumentOpenTelemetry({ 
  serviceName: 'auth',
  diagLogLevel: DiagLogLevel.DEBUG // OR DiagLogLevel.INFO
})
  • DiagLogLevel.DEBUG: Logs everything verbosely - setup, instrumentation loading, exported spans and metrics
  • DiagLogLevel.INFO: Shows minimal logs for setup, instrumentation loading, and errors

Missing Prisma Traces

If Prisma traces aren't appearing, regenerate the database clients:

make armory/db/generate-types
make policy-engine/db/generate-types
make vault/db/generate-types

This is required because the PR adds two preview features to the Prisma client: tracing and interactiveTransactions, which need client regeneration to take effect.
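The corresponding generator block in each schema.prisma looks roughly like this (a sketch; the exact contents per application may differ):

generator client {
  provider        = "prisma-client-js"
  previewFeatures = ["tracing", "interactiveTransactions"]
}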

See more at https://www.prisma.io/docs/orm/prisma-client/observability-and-logging/opentelemetry-tracing

Possible Improvements

Use decorators to reduce boilerplate when creating OpenTelemetry spans (a rough sketch of one such decorator follows the list):

  • @Counter(name, options?) - Increments a metric when method is called
  • @ExecutionTime - Measures method execution time with a gauge metric
  • @SpanAttribute(name) - Adds method argument as span attribute (similar to NestJS @Query)
  • @Span(name, options?) - Creates span around method execution, using class name, method name and arguments in span name (e.g. FooService.baz)
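A minimal sketch of what the @Span decorator could look like. This is hypothetical and not part of this PR; it uses the global tracer from @opentelemetry/api rather than the injected TraceService, and it ends the span synchronously (an async-aware version would await the result first).

import { SpanStatusCode, trace } from '@opentelemetry/api'

// Wraps the decorated method in a span named `ClassName.methodName` by default.
export function Span(name?: string) {
  return (target: object, propertyKey: string | symbol, descriptor: PropertyDescriptor) => {
    const original = descriptor.value
    const spanName = name ?? `${target.constructor.name}.${String(propertyKey)}`

    descriptor.value = function (this: unknown, ...args: unknown[]) {
      return trace.getTracer('default').startActiveSpan(spanName, (span) => {
        try {
          return original.apply(this, args)
        } catch (error) {
          span.setStatus({ code: SpanStatusCode.ERROR })
          throw error
        } finally {
          span.end()
        }
      })
    }

    return descriptor
  }
}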

Reference

@wcalderipe self-assigned this Oct 28, 2024

Add docker/otel/up command.
@wcalderipe force-pushed the feat/open-telemetry-integration branch from 7172de3 to 39efad1 on October 28, 2024, 17:40
@mattschoch (Contributor) left a comment

Looks great!
Agree w/ the comment that using decorators might be nice to reduce the boilerplate, but seems totally fine for now.

Comment on lines +32 to +34
const span = this.traceService.startSpan(`${EntityDataStoreService.name}.getEntities`, {
attributes: { clientId }
})
Contributor:

I'm questioning the naming here -- is there a benefit to using ${EntityDataStoreService.name} instead of just "EntityDataStoreService" as a string? Using a raw string instead of interpolation makes it easier to do a "find all" in the codebase when you're looking for the line a log came from.

Obviously it tells you the service & method name, so maybe my workflow is what doesn't make sense. But I tend to find something in logs and want to just copy/paste into "find" to get to the file. wdyt?

I'm fine either way really, just throwing this out in case we all do this..

@wcalderipe (Collaborator, Author) replied Oct 30, 2024:

The reason I opted for ${EntityDataStoreService.name} is to make the span name refactoring-proof. That said, the side effects are:

  • Refactoring the class name will affect the data
  • You can't simply copy the span name and look at the whole code base for it

I'm on the fence now.. let me think a bit.

Comment on lines +24 to +25
this.getCounter = this.metricService.createCounter('policy_data_store_get_count')
this.setCounter = this.metricService.createCounter('policy_data_store_set_count')
Contributor:

Should we use constants for the metric names so that we have one place to look to find all the possible metrics we have?

@wcalderipe (Collaborator, Author) replied:

I would push the metrics discoverability problem to the observability stack rather than trying to solve it in the code.

Comment on lines +1 to +31
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlp]
    metrics:
      receivers: [otlp]
      exporters: [debug]
    logs:
      receivers: [otlp]
      exporters: [debug]
Contributor:

When/how does this get changed?
Will these need to become env configs?

@wcalderipe (Collaborator, Author) replied:

This is the local OTEL collector container configuration. For deployed environments, we'll have to configure it differently.

Comment on lines +33 to +52
jaeger:
  image: 'jaegertracing/all-in-one'
  ports:
    - '16686:16686' # Web UI
    - '14250:14250' # gRPC
  networks:
    - armory_stack

otel_collector:
  image: 'otel/opentelemetry-collector-contrib'
  command: ['--config=/etc/otel-collector-config.yaml']
  volumes:
    - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
  ports:
    - '4318:4318' # HTTP
    - '4317:4317' # gRPC
  networks:
    - armory_stack
  depends_on:
    - jaeger
Contributor:

Can we use LGTM locally? If we can use Tempo instead of Jaeger in the local config, that'd be cool. Not blocking, but might be nice.

@wcalderipe (Collaborator, Author) replied:

Probably. I opted for Jaeger and the OTEL collector for simplicity in local development. Other solutions require you to run 3 to 6 extra containers, which increases the cognitive load to understand the system.

Comment on lines +38 to +41
# OpenTelemetry configuration.
OTEL_SDK_DISABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_LOGS_EXPORTER=otlp
Contributor:

Should it be enabled by default w/ a flag to disable? Or should it be disabled by default & use the flag to enable?

@wcalderipe (Collaborator, Author) replied:

I'd rather keep it disabled by default because, realistically speaking, we don't need an observability stack to debug locally. That said, log querying and aggregation for development would be nice.

Lock OTEL packages version.