Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add outlier dashboard url to workload creation output #39

Merged
merged 7 commits into from
Dec 8, 2023
Merged

Conversation

SurbhiJainUSC
Copy link
Collaborator

Fixes / Features

  • This PR outputs the link of the monitoring dashboard deployed in Google Cloud project using Terraform resource in https://github.com/google/cloud-tpu-monitoring-debugging. The monitoring dashboard will display the statistics and outlier mode of GKE metrics. The dashboard can be filtered for a particular GKE Custer, job name or a GKE Pod.

Testing / Documentation

Tested for internal projects.

  • [ y ] Tests pass
  • [ n ] Appropriate changes to documentation are included in the PR
    • No change required in README

@SurbhiJainUSC SurbhiJainUSC changed the title Outlier Add outlier dashboard url to workload creation output Dec 7, 2023
xpk.py Outdated
xpk_print(
'Check statistics and outlier mode of GKE metrics for your workload here:'
# pylint: disable=line-too-long
f'http://console.cloud.google.com/monitoring/dashboards/builder/{outlier_dashboard_id}?project={args.project}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to attach the clusterName, jobName, and podName filters to the link here too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, great idea. I can add the cluster name to the link. However, it is not ideal to add a pod name filter as there can be multiple pods in a cluster.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, yeah that makes sense. Thanks Surbhi. So adding clusterName/jobName should be enough. Then users can choose to filter between pods on their own if they choose

Copy link
Collaborator Author

@SurbhiJainUSC SurbhiJainUSC Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added clusterName filter in the url output.

JobName is a user defined label. When I add f.umlabel.jobset.sigs.k8s.io%2Fjobset-name.JobName=<workload_name> to the dashboard url, Google cloud interprets it as f.umlabel.jobset.sigs=<workload_name> that doesn't filter the metrics on the dashboard correctly for the jobName. This is happening because Google Cloud separates variable based on . and that results in failure in extracing variable name out of our user defined label: jobset.sigs.k8s.io/jobset-name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I am able to reproduce this. Even if a select jobName from the dashboard, then copy/paste the link into a separate tab I can see the issue.

I am ok merging this but it would be good to:

  1. Add a snippet to the "Check statistics.." message saying filter to your job name
  2. Check with the Cloud Monitoring team or file a bug with them. This seems like unexpected behavior but it would be good to check if we are missing something here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yeah, I will add a message to filter based on job name.
  2. The resource labels actually contain _ like project_id and cluster_name, so I am assuming adding . to a label is not what cloud monitoring expects. But, will check with the team.

xpk.py Outdated

dashboards = return_value.strip().split('\n')
if len(dashboards) > 1:
xpk_print(f'Multiple dashboards with same {outlier_dashboard_filter} exist in the project:{args.project}.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting edge case, how would someone fix this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would Cloud Monitoring actually let this happen? Or would they block it at creation time?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, Google Cloud allows multiple custom dashboard with same name. If there are multiple dashboards with displayName as GKE - TPU Monitoring Dashboard, then you would have to check the dashboard id in the terraform state files which is stored in GCS: https://github.com/google/cloud-tpu-monitoring-debugging?tab=readme-ov-file#configure-terraform-to-store-state-in-cloud-storage.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks. Not sure we can avoid this for now then. Longer term we will probably move the outlier dashboard to a resource dashboard instead of a Terraform created one. So that would solve this issue since we will know the dashboard id

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a small message saying what the user should do in this case? I am guessing they would need to delete all but one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sure will add that.

@SurbhiJainUSC SurbhiJainUSC merged commit 9d74584 into main Dec 8, 2023
3 checks passed
@SurbhiJainUSC SurbhiJainUSC deleted the outlier branch December 8, 2023 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants