new rule: nodelocal dns timeout
Change-Id: I57ecd7709bd13d1f31e2d898d2eb93eb06160552
GitOrigin-RevId: 324fbcf
gcpdiag team authored and copybara-github committed Aug 3, 2023
1 parent 8827491 commit 0a9338f
Showing 3 changed files with 101 additions and 0 deletions.
67 changes: 67 additions & 0 deletions gcpdiag/lint/gke/err_2023_010_nodelocal_timeout.py
@@ -0,0 +1,67 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
NodeLocal DNSCache timeout errors.
On clusters with NodeLocal DNSCache enabled sometimes response to a DNS
request was not received from kube-dns in 2 seconds and hence the DNS
timeout errors crop up.
"""
from gcpdiag import lint, models
from gcpdiag.lint.gke import util
from gcpdiag.queries import apis, gke, logs

MATCH_STR_1 = '[ERROR] plugin/errors: 2'
logs_by_project = {}


def prepare_rule(context: models.Context):
  clusters = gke.get_clusters(context)
  for project_id in {c.project_id for c in clusters.values()}:
    logs_by_project[project_id] = logs.query(
        project_id=project_id,
        resource_type='k8s_container',
        log_name='log_id("stdout")',
        filter_str=f'textPayload:"{MATCH_STR_1}"',
    )


def run_rule(context: models.Context, report: lint.LintReportRuleInterface):
  # Skip the entire rule if the logging API is disabled.
  if not apis.is_enabled(context.project_id, 'logging'):
    report.add_skipped(None, 'logging api is disabled')
    return

  # Any work to do?
  clusters = gke.get_clusters(context)
  if not clusters:
    report.add_skipped(None, 'no clusters found')
    return

  # Search the logs.
  def filter_f(log_entry):
    try:
      return MATCH_STR_1 in log_entry['textPayload']
    except KeyError:
      return False

  bad_clusters = util.gke_logs_find_bad_clusters(
      context=context, logs_by_project=logs_by_project, filter_f=filter_f)

  # Create the report.
  for _, c in sorted(clusters.items()):
    if c in bad_clusters:
      report.add_failed(c, logs.format_log_entry(bad_clusters[c]))
    else:
      report.add_ok(c)
3 changes: 3 additions & 0 deletions gcpdiag/lint/gke/snapshots/ERR_2023_010.txt
@@ -0,0 +1,3 @@
* gke/ERR/2023_010: NodeLocal DNSCache timeout errors.
(logging api is disabled) [SKIP]

31 changes: 31 additions & 0 deletions website/content/en/rules/gke/ERR/2023_010.md
@@ -0,0 +1,31 @@
---
title: "gke/ERR/2023_010"
linkTitle: "ERR/2023_010"
weight: 1
type: docs
description: >
NodeLocal DNSCache timeout errors.
---

**Product**: [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine)\
**Rule class**: ERR - Something that is very likely to be wrong


### Description

On clusters with NodeLocal DNSCache enabled, a response to a DNS request is sometimes not received from kube-dns within 2 seconds, which surfaces as DNS timeout errors.

You can use the following filter to find matching log lines:
```
textPayload:"[ERROR] plugin/errors: 2"
resource.type="k8s_container"
```
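The same filter can also be run from the command line. A sketch using `gcloud logging read` (the project ID is a placeholder; `--freshness` and `--limit` values are illustrative):

```shell
# Query recent k8s_container logs for NodeLocal DNSCache timeout errors.
gcloud logging read \
  'resource.type="k8s_container" AND textPayload:"[ERROR] plugin/errors: 2"' \
  --project=PROJECT_ID \
  --freshness=1d \
  --limit=10
```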

### Remediation

Increase the number of kube-dns replicas.
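On GKE, the kube-dns replica count is typically managed by the kube-dns-autoscaler, so the replica count is tuned through its ConfigMap rather than by scaling the Deployment directly. A sketch of inspecting and adjusting it (namespace and ConfigMap name are the GKE defaults; the parameter values you choose are cluster-specific):

```shell
# Inspect the current autoscaler parameters ("linear" mode: the replica
# count grows with the number of nodes and cores in the cluster).
kubectl get configmap kube-dns-autoscaler -n kube-system -o yaml

# Lower nodesPerReplica and/or coresPerReplica in the "linear" key to
# run more kube-dns replicas.
kubectl edit configmap kube-dns-autoscaler -n kube-system
```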


### Further information

- https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache
