Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry pipeline improvements (iteration 1) #3133

Merged
merged 16 commits into from
Oct 12, 2024

Conversation

nagworld9
Copy link
Contributor

@nagworld9 nagworld9 commented May 30, 2024

Description

This pr address below issues to improve data reliability

  1. Sending important events: Implemented as sending this event directly to wireserver and bypass the collecton_and_send channel. Added new parameter immediate_flush to add_event. Callers need to set to true if this is important event and needs to flush immediately. As a fallback, if we fail to send, we will save to disk for collector to pick up as like a normal event.
  2. Throttling issues/retries on telemetrydata API: Today we have aggressive retries on throttling errors. This fix will reduce the number of retries with better delay.
  3. Retry on reading event files: Collector may pick up and read the file while file is modified.

I added necessary comments, still questions/concerns happy to go over a call.

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@@ -106,17 +106,20 @@ def _get_update_handler(iterations=1, test_data=None, protocol=None, autoupdate_
This function returns a mocked version of the UpdateHandler object to be used for testing. It will only run the
main loop [iterations] no of times.
"""
# Added try-finally block to fix false positive pylint warning contextmanager-generator-missing-cleanup
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review for event.py

azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
logger.periodic_error(logger.EVERY_FIFTEEN_MINUTES, "[PERIODIC] {0}".format(ustr(e)))
self.flush_or_save_event(event, message, immediate_flush)

def flush_or_save_event(self, event, message, immediate_flush):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this to report_or_save_event

azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
azurelinuxagent/common/event.py Outdated Show resolved Hide resolved

class EventLogger(object):
def __init__(self):
self.event_dir = None
self.periodic_events = {}
self.protocol = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not a good idea to have this uninitialized property. let's look for alternatives... maybe get the protocol on demand when sending an important event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sending on demand may add other compilations, wherever we send imp event, we need reference for protocol object as well plus we need to add protocol as optional property to add_event method with None.

or if you have something else in mind for on demand, lets have quick call to discuss

azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/ga/collect_telemetry_events.py Outdated Show resolved Hide resolved
Copy link
Contributor

@maddieford maddieford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing comments, changes lgtm! Will give another review after Norberto's comments are addressed

maddieford
maddieford previously approved these changes Oct 3, 2024
(cherry picked from commit 55ea21f)
azurelinuxagent/common/event.py Show resolved Hide resolved
azurelinuxagent/common/protocol/wire.py Outdated Show resolved Hide resolved
azurelinuxagent/common/utils/restutil.py Show resolved Hide resolved
@@ -468,7 +481,10 @@ def http_request(method,
# retry attempts
if _is_throttle_status(resp.status):
was_throttled = True
max_retry = max(max_retry, THROTTLE_RETRIES)
# Today, THROTTLE_RETRIES set to big number as opposite to slow down the retry attempts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow this comment, could you please explain:

Today, THROTTLE_RETRIES set to big number as opposite to slow down the retry attempts

Copy link
Contributor Author

@nagworld9 nagworld9 Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today when any api return throttle error, we make 26 attempts for retying as opposed to back off and make fewer attempts with good delay.

My fix will change this behavior only for telemetry calls and keeping same for non-telemetry calls as I wasn't sure why it was implemented that way.

I'll improve the comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, makes sense

narrieta
narrieta previously approved these changes Oct 11, 2024
Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; one minor comment

azurelinuxagent/common/utils/restutil.py Show resolved Hide resolved
narrieta
narrieta previously approved these changes Oct 11, 2024
maddieford
maddieford previously approved these changes Oct 11, 2024
@nagworld9 nagworld9 dismissed stale reviews from maddieford and narrieta via 1a1130c October 11, 2024 23:38
@nagworld9 nagworld9 merged commit 3402fb9 into Azure:develop Oct 12, 2024
11 checks passed
@nagworld9 nagworld9 deleted the imp-event branch October 12, 2024 02:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants