Open Telemetry Integration #569
Conversation
Add docker/otel/up command.
Instrument Prisma.
Looks great!
Agree w/ the comment that using decorators might be nice to reduce the boilerplate, but seems totally fine for now.
```typescript
const span = this.traceService.startSpan(`${EntityDataStoreService.name}.getEntities`, {
  attributes: { clientId }
})
```
I'm questioning the span naming: is there a benefit to using `${EntityDataStoreService.name}` instead of just "EntityDataStoreService" as a string? Using a raw string instead of interpolation makes it easier to do a "find all" in the codebase when you're looking for the line a log came from.
Obviously it tells you the service and method name, so maybe my workflow is what doesn't make sense. But I tend to find something in the logs and want to just copy/paste it into "find" to get to the file. wdyt?
I'm fine either way really, just throwing this out in case we all do this.
The reason I opted for `${EntityDataStoreService.name}` is to make the span name refactoring-proof. The side effects are:
- Refactoring the class name will affect the data
- You can't simply copy the span name and search the whole codebase for it

I'm on the fence now... let me think a bit.
```typescript
this.getCounter = this.metricService.createCounter('policy_data_store_get_count')
this.setCounter = this.metricService.createCounter('policy_data_store_set_count')
```
Should we use constants for the metric names so that we have one place to look to find all the possible metrics we have?
I would push the metrics discoverability problem to the observability stack rather than trying to solve it in the code.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlp]
    metrics:
      receivers: [otlp]
      exporters: [debug]
    logs:
      receivers: [otlp]
      exporters: [debug]
```
When/how does this get changed?
Will these need to become env configs?
This is the local OTEL collector container configuration. For deploy environments, we'll have to configure it differently.
```yaml
jaeger:
  image: 'jaegertracing/all-in-one'
  ports:
    - '16686:16686' # Web UI
    - '14250:14250' # gRPC
  networks:
    - armory_stack

otel_collector:
  image: 'otel/opentelemetry-collector-contrib'
  command: ['--config=/etc/otel-collector-config.yaml']
  volumes:
    - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
  ports:
    - '4318:4318' # HTTP
    - '4317:4317' # gRPC
  networks:
    - armory_stack
  depends_on:
    - jaeger
```
Can we use LGTM locally? If we can use tempo instead of jaeger in the local config, that'd be cool. Not blocking, but might be cool.
Probably. The reason I opted for Jaeger and the OTEL collector was simplicity for local development. Other solutions require you to run 3 to 6 extra containers, which increases the cognitive load needed to understand the system.
```
# OpenTelemetry configuration.
OTEL_SDK_DISABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_LOGS_EXPORTER=otlp
```
Should it be enabled by default w/ a flag to disable? Or should it be disabled by default & use the flag to enable?
I'd rather keep it disabled by default because, realistically speaking, we don't need an observability stack to debug locally. That said, log querying and aggregation for development would be nice.
Lock OTEL packages version.
This PR introduces OpenTelemetry (OTEL) instrumentation for all applications.
The OTEL integration in our stack is implemented through two separate code paths:
- The `@narval/open-telemetry` library handles Node.js instrumentation registration and exporter configuration.
- The `open-telemetry` module in `@narval/nestjs-shared` exposes NestJS services for traces and metrics, leveraging dependency injection for improved testability.

OTEL auto-instruments many popular packages (full list here). In addition to these default instrumentations, this PR adds Prisma instrumentation for database visibility. Note that the Prisma instrumentation could alternatively be implemented at the infrastructure level through database instrumentation.
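As a rough sketch of the second code path, usage inside a NestJS provider looks something like the snippet below. The import path, service names and counter name are assumptions inferred from the diff snippets quoted in this review, not verified exports.

```typescript
import { Injectable } from '@nestjs/common'
import { Counter } from '@opentelemetry/api'
// Assumed import path and service names, based on the snippets quoted in this review.
import { MetricService, TraceService } from '@narval/nestjs-shared'

@Injectable()
export class EntityDataStoreService {
  // Counter created once in the constructor, mirroring the policy_data_store_*_count counters.
  private readonly getCounter: Counter

  constructor(
    private readonly traceService: TraceService,
    private readonly metricService: MetricService
  ) {
    this.getCounter = this.metricService.createCounter('entity_data_store_get_count')
  }

  async getEntities(clientId: string) {
    // Span naming follows the `${Class.name}.method` convention discussed in this review.
    const span = this.traceService.startSpan(`${EntityDataStoreService.name}.getEntities`, {
      attributes: { clientId }
    })
    this.getCounter.add(1, { clientId })

    try {
      return await this.findEntities(clientId)
    } finally {
      span.end()
    }
  }

  private async findEntities(clientId: string): Promise<unknown[]> {
    return [] // placeholder for the actual data store lookup
  }
}
```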
Outgoing and Incoming HTTP Requests
For HTTP request tracing, `W3CTraceContextPropagator` and `W3CBaggagePropagator` automatically set trace context headers. This enables continuous tracing across distributed processes.
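For reference, the manual equivalent of that propagator setup with the OTEL JS API looks like this. The PR presumably wires this up inside `@narval/open-telemetry`, so this is illustrative only:

```typescript
import { propagation } from '@opentelemetry/api'
import {
  CompositePropagator,
  W3CBaggagePropagator,
  W3CTraceContextPropagator
} from '@opentelemetry/core'

// Registers the W3C propagators globally so outgoing HTTP requests carry
// `traceparent` and `baggage` headers, and incoming requests have them extracted.
propagation.setGlobalPropagator(
  new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()]
  })
)
```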
Logs
Winston logs automatically include trace context when a span context exists, thanks to the auto-instrumentation of `@opentelemetry/instrumentation-winston`.
The logs are transported to OTEL's log provider using `@opentelemetry/winston-transport`. The transport is auto-instrumented by the SDK when the `winston`, `@opentelemetry/instrumentation-winston` and `@opentelemetry/winston-transport` packages are installed.
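No application code is needed for this, since the SDK attaches the transport automatically. A rough, manual equivalent of what that wiring does (an illustrative sketch, not code from this PR):

```typescript
import { createLogger, format, transports } from 'winston'
import { OpenTelemetryTransportV3 } from '@opentelemetry/winston-transport'

// A winston logger whose records are also forwarded to the OTEL logs provider.
const logger = createLogger({
  level: 'info',
  format: format.json(),
  transports: [new transports.Console(), new OpenTelemetryTransportV3()]
})

logger.info('hello', { clientId: 'example' })
```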
Testing this PR

Important
In development, OTEL's collector is disabled by default. If you want to enable it, see the OTEL section in the README.
Start Jaeger and OTEL's collector containers.
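This PR adds a `docker/otel/up` command for that step; whether it is a make target or an npm script isn't shown here, so the invocation below is an assumption:

```sh
# Assumed invocation of the docker/otel/up command added in this PR.
make docker/otel/up
```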
Add the following to the .env of each application.
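The exact snippet isn't reproduced above; based on the defaults added in this PR, enabling the SDK locally presumably means flipping `OTEL_SDK_DISABLED`:

```
# OpenTelemetry configuration (assumed values, based on the defaults in this PR).
OTEL_SDK_DISABLED=false
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_LOGS_EXPORTER=otlp
```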
Note
You can see the list of the SDK's environment variables here.
Or, alternatively, override your local .env with the new defaults.
Restart the applications and make some requests to the API.
Now, head over to Jaeger's UI at http://localhost:16686.
Debug OTEL SDK
To debug what's going on within OTEL's SDK and all its magical auto-instrumentation, set the diagnostic log level in the `instrumentOpenTelemetry` function in each `main.ts` file (see the sketch after this list):
- `DiagLogLevel.DEBUG`: Logs everything verbosely - setup, instrumentation loading, exported spans and metrics
- `DiagLogLevel.INFO`: Shows minimal logs for setup, instrumentation loading, and errors
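Under the hood this maps to the standard OTEL diagnostics API; how `instrumentOpenTelemetry` accepts the level is not shown here, so this is only the bare equivalent:

```typescript
import { DiagConsoleLogger, DiagLogLevel, diag } from '@opentelemetry/api'

// Enables verbose OTEL SDK diagnostics; switch to DiagLogLevel.INFO for minimal output.
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG)
```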
Missing Prisma Traces
If Prisma traces aren't appearing, regenerate the database clients (see the command sketch below). This is required because the PR adds two preview features to the Prisma client, `tracing` and `interactiveTransactions`, which need client regeneration to take effect.
See more: https://www.prisma.io/docs/orm/prisma-client/observability-and-logging/opentelemetry-tracing
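The repo-specific regeneration command isn't reproduced above; the generic Prisma CLI equivalent is:

```sh
# Regenerate the Prisma client so the new previewFeatures take effect.
npx prisma generate
```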
Possible Improvements
Use decorators to reduce boilerplate when creating OpenTelemetry spans (see the sketch after this list):
- `@Counter(name, options?)`: Increments a metric when the method is called
- `@ExecutionTime`: Measures method execution time with a gauge metric
- `@SpanAttribute(name)`: Adds a method argument as a span attribute (similar to NestJS `@Query`)
- `@Span(name, options?)`: Creates a span around method execution, using class name, method name and arguments in the span name (e.g. `FooService.baz`)
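A minimal sketch of how the proposed `@Span` decorator could look, using the raw OTEL API rather than the injected TraceService; the names and details here are assumptions, not code from this PR:

```typescript
import { SpanStatusCode, trace } from '@opentelemetry/api'

// Hypothetical @Span() decorator: wraps the decorated method in a span named
// `${ClassName}.${methodName}`, e.g. FooService.baz. Assumes async methods.
export function Span(name?: string) {
  return (target: object, propertyKey: string | symbol, descriptor: PropertyDescriptor) => {
    const original = descriptor.value

    descriptor.value = async function (...args: unknown[]) {
      const spanName = name ?? `${target.constructor.name}.${String(propertyKey)}`

      // startActiveSpan uses whatever tracer provider the OTEL SDK registered at startup.
      return trace.getTracer('default').startActiveSpan(spanName, async (span) => {
        try {
          return await original.apply(this, args)
        } catch (error) {
          span.setStatus({ code: SpanStatusCode.ERROR })
          throw error
        } finally {
          span.end()
        }
      })
    }

    return descriptor
  }
}
```

Usage would then be `@Span()` on a service method instead of manually calling `traceService.startSpan` and `span.end()`.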
Reference