Add outlier dashboard url to workload creation output #39
Conversation
xpk.py
Outdated
xpk_print(
    'Check statistics and outlier mode of GKE metrics for your workload here:'
    # pylint: disable=line-too-long
    f'http://console.cloud.google.com/monitoring/dashboards/builder/{outlier_dashboard_id}?project={args.project}'
Is it possible to attach the clusterName, jobName, and podName filters to the link here too?
Yeah, great idea. I can add the cluster name to the link. However, it is not ideal to add a pod name filter as there can be multiple pods in a cluster.
Got it, yeah that makes sense. Thanks Surbhi. So adding clusterName/jobName should be enough; users can then filter between pods on their own if they want.
I have added the clusterName filter to the URL output. JobName is a user-defined label. When I add `f.umlabel.jobset.sigs.k8s.io%2Fjobset-name.JobName=<workload_name>` to the dashboard URL, Google Cloud interprets it as `f.umlabel.jobset.sigs=<workload_name>`, which doesn't filter the metrics on the dashboard correctly for the jobName. This happens because Google Cloud separates variables based on `.`, which makes it fail to extract the variable name out of our user-defined label: `jobset.sigs.k8s.io/jobset-name`.
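For reference, a sketch of how the cluster filter could be appended to the dashboard link. The `f.rlabel.cluster_name.ClusterName` query parameter and the variable names are assumptions modeled on the label pattern quoted above, not a documented URL format:

```python
from urllib.parse import quote

# Assumed parameter format; verify against a link that the Cloud Monitoring
# console itself generates after applying a cluster_name filter.
dashboard_url = (
    'https://console.cloud.google.com/monitoring/dashboards/builder/'
    f'{outlier_dashboard_id}'
    f'?project={quote(args.project)}'
    f'&f.rlabel.cluster_name.ClusterName={quote(args.cluster)}'
)
xpk_print(
    'Check statistics and outlier mode of GKE metrics for your workload here:'
    f' {dashboard_url}'
)
```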
Interesting. I am able to reproduce this. Even if I select jobName from the dashboard and then copy/paste the link into a separate tab, I can see the issue.
I am ok merging this, but it would be good to:
- Add a snippet to the "Check statistics.." message saying to filter to your job name
- Check with the Cloud Monitoring team or file a bug with them. This seems like unexpected behavior, but it would be good to check whether we are missing something here
- Yeah, I will add a message to filter based on job name.
- The resource labels actually contain `_`, like `project_id` and `cluster_name`, so I am assuming adding `.` to a label is not what Cloud Monitoring expects. But I will check with the team.
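A minimal sketch of that extra message, assuming the workload name is available as `args.workload` (an illustrative variable name, not necessarily the PR's):

```python
# Hypothetical follow-up to the dashboard-link message: the jobName filter
# cannot be pre-populated in the URL, so ask the user to set it manually.
xpk_print(
    'To view metrics for this workload only, set the jobName filter on the'
    f' dashboard to "{args.workload}".'
)
```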
xpk.py
Outdated
dashboards = return_value.strip().split('\n')
if len(dashboards) > 1:
  xpk_print(f'Multiple dashboards with same {outlier_dashboard_filter} exist in the project:{args.project}.')
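For context, `return_value` here would be the output of listing dashboards by display name; a sketch of one way that could be produced (the command wiring is illustrative and may differ from the PR's actual helper functions):

```python
import subprocess

# Illustrative only: list dashboards whose displayName matches the outlier
# dashboard, returning one resource name per line, e.g.
# projects/<project-number>/dashboards/<dashboard-id>.
outlier_dashboard_filter = "displayName:'GKE - TPU Monitoring Dashboard'"
command = (
    'gcloud monitoring dashboards list'
    f' --project={args.project}'
    f' --filter="{outlier_dashboard_filter}"'
    ' --format="value(name)"'
)
return_value = subprocess.run(
    command, shell=True, check=True, capture_output=True, text=True
).stdout
```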
Interesting edge case, how would someone fix this?
Would Cloud Monitoring actually let this happen? Or would they block it at creation time?
Unfortunately, Google Cloud allows multiple custom dashboards with the same name. If there are multiple dashboards with the displayName `GKE - TPU Monitoring Dashboard`, then you would have to check the dashboard id in the Terraform state files, which are stored in GCS: https://github.com/google/cloud-tpu-monitoring-debugging?tab=readme-ov-file#configure-terraform-to-store-state-in-cloud-storage.
Got it, thanks. Not sure we can avoid this for now then. Longer term we will probably move the outlier dashboard to a resource dashboard instead of a Terraform-created one, which would solve this issue since we would know the dashboard id.
Could we add a small message saying what the user should do in this case? I am guessing they would need to delete all but one.
Yes, sure will add that.
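A minimal sketch of what that message could look like, building on the snippet above; the wording is illustrative and pulls from the suggestions in this thread (look up the id in the Terraform state, delete the duplicates):

```python
dashboards = return_value.strip().split('\n')
if len(dashboards) > 1:
  xpk_print(
      f'Multiple dashboards with same {outlier_dashboard_filter} exist in'
      f' the project:{args.project}. Look up the intended dashboard id in'
      ' the Terraform state stored in GCS, and consider deleting the'
      ' duplicate dashboards so the link above points at the right one.'
  )
```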
Fixes / Features
Testing / Documentation
Tested for internal projects.