Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement native OpenTelemetry infrastructure #1461

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

aaronmondal
Copy link
Member

@aaronmondal aaronmondal commented Nov 5, 2024

This commit adds the OTLP exports to Nativelink and extends the nativelink deployments in the operator with OpenTelemetryCollector sidecars. The exposed traces, metrics and logs are published through Kafka to NATS Jetstream.


This change is Reviewable

This commit adds the OTLP exports to Nativelink and extends the
`nativelink` deployments in the operator with OpenTelemetryCollector
sidecars. The exposed traces, metrics and logs are published through
Kafka to NATS Jetstream.
@aaronmondal
Copy link
Member Author

aaronmondal commented Nov 5, 2024

cc @allada @SchahinRohani You might want to play around with this while it's in preview.

@allada One thing that we'll need to figure out is where to put the OtlpServer in the nativelink-config. At the moment it requires a dummy listener which might not play well with workers.

@SchahinRohani You might want to look into OTLP, Kafka topics and NATS Jetstream. This initial implementation doesn't add structure, but at least it provides a central point to aggregate the logs, traces and metrics of nativelink deployments.

I'll polish this a bit, but for now we have the following (assuming the pod for the nativelink-cas is e.g. kubectl port-forward nativelink-cas-ff6544bb8-v4w86)

# Telemetry (CPU usage etc)
kubectl port-forward nativelink-cas-ff6544bb8-v4w86 8888
curl localhost:8888/metrics

# The info that was previously in the experimental_prometheus endpoint
kubectl port-forward nativelink-cas-ff6544bb8-v4w86 8888
curl localhost:8889/metrics

# The aggregated stream across cas, scheduler and worker deployments
kubectl exec -n nats-system nats-box-5d4d987f5b-thbfs -- nats stream info TELEMETRY

(also I still have a small bug in the kustomization. You'll need to apply it twice. The one I use is this variant of the deploy/dev operator:

native up

# Then modify kubernetes/deploy/dev to this and run `kubectl apply -k deploy/dev`:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

components:
- ../../kubernetes/components/operator

# Change this value to deploy custom overlays.
patches:
- patch: |-
    - op: replace
      path: /spec/path
      value: ./kubernetes/overlays/lre
  target:
    kind: Kustomization
    name: nativelink

# Modify this value to change the URL of the repository with deployment files.
#
# This is usually only necessary if you change deployment YAML files or
# NativeLink config files. If you only intend to change the Rust sources you can
# leave this as is and need to ensure that the Alerts below are patched to build
# your local sources.
- patch: |-
    - op: replace
      path: /spec/url
      value: https://github.com/aaronmondal/nativelink
    - op: replace
      path: /spec/ref/branch
      value: otel
  target:
    kind: GitRepository
    name: nativelink

# Setting the flake outputs to `./src_root#xxx` causes the Tekton pipelines to
# build nativelink from your local sources.
#
# During development, the following formats might be useful as well:
#
# `github:user/repo#outname` to build an image from an arbitrary flake output.
#
# `github:TraceMachina/nativelink?ref=pull/<PR_NUMBER>/head#<OUT>` to deploy a
# outputs from a Pull request.
- patch: |-
    - op: replace
      path: /spec/eventMetadata/flakeOutput
      value: ./src_root#image
  target:
    kind: Alert
    name: nativelink-image-alert
- patch: |-
    - op: replace
      path: /spec/eventMetadata/flakeOutput
      value: ./src_root#nativelink-worker-init
  target:
    kind: Alert
    name: nativelink-worker-init-alert
- patch: |-
    - op: replace
      path: /spec/eventMetadata/flakeOutput
      value: ./src_root#nativelink-worker-lre-cc
  target:
    kind: Alert
    name: nativelink-worker-alert

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and 0 of 23 files reviewed, and pending CI: Remote / large-ubuntu-22.04, and 1 discussions need to be resolved


nativelink-config/src/cas_server.rs line 184 at r1 (raw file):

#[derive(Deserialize, Debug, Default)]
#[serde(deny_unknown_fields)]
pub struct OtlpConfig {

this should hold the hardcoded timeouts also, I'd take a bet that those will need to be adjusted based on needs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants