improve provider observability #267

Open
troian opened this issue Dec 2, 2024 · 2 comments
Labels
repo/provider Akash provider-services repo issues

Comments

@troian
Member

troian commented Dec 2, 2024

Is your feature request related to a problem? Please describe.

the issue

Inventory operator

Extend the inventory operator to report not only inventory availability but also the health of all key components:

  • egress
  • hardware
    • GPUs (device driver plugins)
    • storage devices
  • cluster
    • cluster status
    • node(s) status
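
One way the component list above could be rolled up into a single provider health value is a "worst status wins" tree. This is only a sketch: the `Health` levels, the `Component` type, and the component names are hypothetical and are not taken from the inventory operator.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Health(IntEnum):
    """Hypothetical health levels, ordered so a higher value is a worse state."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


@dataclass
class Component:
    """One node in the health tree, e.g. 'hardware' or its child 'gpus'."""
    name: str
    status: Health = Health.OK  # status reported by the component itself
    children: List["Component"] = field(default_factory=list)

    def rollup(self) -> Health:
        """Worst status among this component and all of its descendants."""
        return max([self.status] + [child.rollup() for child in self.children])


def example_report() -> Component:
    # Made-up snapshot: everything is healthy except the GPU device plugin.
    return Component("provider", children=[
        Component("egress"),
        Component("hardware", children=[
            Component("gpus", status=Health.DEGRADED),
            Component("storage"),
        ]),
        Component("cluster", children=[
            Component("cluster-status"),
            Component("nodes"),
        ]),
    ])


if __name__ == "__main__":
    print(example_report().rollup().name)
```

With this shape a single degraded leaf surfaces at the top level, while each subtree can still be inspected on its own.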

Ceph

  • does Rook/Ceph have an API to check the state of the Ceph cluster? @chainzero
  • limit the number of mounted persistent volumes (at the moment it is unlimited)
  • limit the total number of PVs on a single Kubernetes cluster
  • test IOPS
    • measure maximum IOPS at the provider provisioning stage
    • periodically measure current IOPS
  • check if there is a way to limit IOPS per workload
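
The two IOPS measurements above could be compared roughly like this. It is a sketch under assumptions: the `IopsBaseline` type, the function names, and the 0.5 threshold are all hypothetical, and how the numbers are actually measured is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IopsBaseline:
    """Maximum IOPS recorded once, at the provider provisioning stage."""
    read: float
    write: float


def iops_ratio(baseline: IopsBaseline, read_now: float, write_now: float) -> float:
    """Return the worse of the read and write ratios (current / baseline)."""
    if baseline.read <= 0 or baseline.write <= 0:
        raise ValueError("baseline IOPS must be positive")
    return min(read_now / baseline.read, write_now / baseline.write)


def is_degraded(baseline: IopsBaseline, read_now: float, write_now: float,
                threshold: float = 0.5) -> bool:
    """Flag storage as degraded when current IOPS fall below a share of the baseline.

    The default threshold is an arbitrary placeholder, not a recommended value.
    """
    return iops_ratio(baseline, read_now, write_now) < threshold


if __name__ == "__main__":
    baseline = IopsBaseline(read=10_000, write=8_000)
    print(is_degraded(baseline, read_now=9_000, write_now=2_000))
```

Taking the minimum of the two ratios means a drop in either read or write performance is enough to raise the flag.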
@troian
Member Author

troian commented Dec 2, 2024

There is a Ceph API.
We still have to check how detailed information can be obtained and whether it supports streaming (WebSockets).

@github-project-automation github-project-automation bot moved this to Backlog (not prioritized) in Core Product and Engineering Roadmap Dec 3, 2024
@chainzero chainzero moved this from Backlog (not prioritized) to In Progress (prioritized) in Core Product and Engineering Roadmap Dec 3, 2024
@chainzero
Collaborator

Along with the Ceph REST API referenced by @troian above, rook-ceph includes Prometheus metrics in its default install.

Example raw output from an OCL managed provider:
https://gist.github.com/chainzero/be7a39f2a9ab8d663420f8c8441a2090

Summary of analytic categories:
https://gist.github.com/chainzero/762dc0aa3b07e03b2f0ec3faf29a09a0
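
As a rough illustration of how a health check could consume these metrics, the sketch below reads `ceph_health_status` from Prometheus text-format output and maps it to a health name. It assumes the metric is exposed without labels and uses the 0/1/2 mapping of the Ceph manager's Prometheus module. The sample text is hand-written for the example and is not copied from the gists above, and the function names are hypothetical.

```python
from typing import Dict, Optional

# Value mapping assumed for ceph_health_status.
HEALTH_NAMES = {0: "HEALTH_OK", 1: "HEALTH_WARN", 2: "HEALTH_ERR"}

# Hand-written sample in Prometheus text exposition format.
SAMPLE = """\
# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0
ceph_osd_up{ceph_daemon="osd.0"} 1.0
"""


def parse_unlabelled_metrics(text: str) -> Dict[str, float]:
    """Parse plain 'name value' samples from Prometheus text output.

    Comment lines (# HELP, # TYPE) and samples with labels are skipped
    to keep the sketch short.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # ignore values that are not numeric
    return samples


def ceph_health(text: str) -> Optional[str]:
    """Return the Ceph health name, or None if the metric is absent."""
    value = parse_unlabelled_metrics(text).get("ceph_health_status")
    if value is None:
        return None
    return HEALTH_NAMES.get(int(value), "UNKNOWN")


if __name__ == "__main__":
    print(ceph_health(SAMPLE))
```

A real implementation would more likely query Prometheus or use an existing exposition-format parser; the hand-rolled parser here only keeps the example dependency-free.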

@chainzero chainzero added repo/provider Akash provider-services repo issues and removed awaiting-triage labels Jan 8, 2025
Projects
Status: In Progress (prioritized)
Development

No branches or pull requests

3 participants