Integration testing for memory limits #210

Closed
wants to merge 16 commits into from

@jmacd jmacd commented Jun 4, 2024

This contains 3 PRs worth of work and will be refactored.
The sequence is as follows:

  1. While race-testing the new integration test, the race detector identified a bug in the Arrow receiver. The inFlightWG was shared by all streams in the receiver, so it would work in testing (almost always) but was incorrect. This may explain unusually long stream shutdowns.
  2. The Arrow exporter was not consistently using gRPC status errors, which the Receiver had already fixed in "[otel-arrow/receiver] Receiver concurrency fixes; readability improvements & restructuring" (#205). The code was out of line with the design of the top-level directory: whereas the OTel-Arrow exporter had been inserting consumererror.NewPermanent() wrappers, it is the Exporter module one layer up, which supports both standard OTLP and Arrow, that is responsible for permanent error labeling. Returning gRPC status errors is always preferred to fmt.Errorf in gRPC components.
  3. A new memory-limited integration test is added. It exposed the need for correct status error handling and also exposed an Arrow Consumer bug: the underlying Arrow IPC Reader object has an Err() method that was not being checked. Instead we had (3a) a fallback mechanism that checked the expected record count, and (3b) a write to os.Stderr because of the (known) missing error.

jmacd commented Jun 4, 2024

Replaced by all the above mentions.

@jmacd jmacd closed this Jun 4, 2024
jmacd added a commit that referenced this pull request Jun 5, 2024
Part of #210.

This updates memory-limit error handling because, prior to this fix,
there were two fallbacks.

(1) the use of os.Stderr to print the message that we had lost
(2) a test for the expected number of records to match

The first is now removed (i.e. no printing to os.Stderr). The second is
now an internal error.

The consumer is expected to test for the memory limit error explicitly
(e.g., to handle it as ResourceExhausted). A new function,
NewLimitErrorFromError, is added to do this; it parses the message using
a regexp. An upstream issue report and PR will be filed with Arrow-Go to
overcome this in the future.

See apache/arrow#41989.
jmacd added a commit that referenced this pull request Jun 5, 2024
Part of #210.

Mainly, replaces the use of `fmt.Errorf()` and bare
`context.Context.Err()` values with gRPC-Go's `status.Errorf()`, which
wraps the error with a code that gRPC and its consumers recognize. The
code was out of line with the design of the top-level directory: whereas
the OTel-Arrow exporter had been inserting consumererror.NewPermanent()
wrappers, it is the Exporter module one layer up, which supports both
standard OTLP and Arrow, that is responsible for permanent error
labeling. Returning gRPC status errors is always preferred to fmt.Errorf
in gRPC components.


Secondly, re-orders and renames the fields passed to the "arrow stream
error" log statement, so that it matches the Receiver. This is used as
the basis of a test for logging consistency; there was otherwise an
unintentional disagreement ("which" and "where").
jmacd added a commit that referenced this pull request Jun 5, 2024
Part of #210.

Mainly, fixes a race in the use of `inFlightWG`, which is meant to be
per-stream but is presently per-Receiver.

Secondly, re-orders and renames the fields passed to the "arrow stream
error" log statement, so that it matches the Exporter. This is used as
the basis of a test for logging consistency; there was otherwise an
unintentional disagreement ("which" and "where").
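
To illustrate the per-stream versus per-Receiver distinction, here is a minimal sketch in which each stream owns its own `sync.WaitGroup`. The `receiver` and `stream` types and their methods are hypothetical and are not the actual receiver code; only the field name `inFlightWG` comes from the description above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// receiver is a hypothetical stand-in for the Arrow receiver. It holds no
// WaitGroup of its own; each stream owns one.
type receiver struct {
	handled atomic.Int64
}

// stream tracks only its own in-flight requests.
type stream struct {
	recv       *receiver
	inFlightWG sync.WaitGroup
}

func (r *receiver) newStream() *stream { return &stream{recv: r} }

// run dispatches n requests and returns once this stream's requests have
// finished. Because the WaitGroup is per-stream, Wait cannot be delayed by
// another stream's Add calls, and Add never runs concurrently with a Wait
// that belongs to a different stream.
func (s *stream) run(n int) {
	for i := 0; i < n; i++ {
		s.inFlightWG.Add(1)
		go func() {
			defer s.inFlightWG.Done()
			s.recv.handled.Add(1)
		}()
	}
	s.inFlightWG.Wait()
}

func main() {
	r := &receiver{}
	var streams sync.WaitGroup
	for i := 0; i < 4; i++ {
		streams.Add(1)
		go func() {
			defer streams.Done()
			r.newStream().run(10)
		}()
	}
	streams.Wait()
	fmt.Println(r.handled.Load())
}
```

With a single WaitGroup shared across streams, one stream's shutdown would wait on every other stream's in-flight work, which is consistent with the long shutdowns mentioned in the PR description.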
jmacd added a commit that referenced this pull request Jun 5, 2024
This is consistent with gRPC behavior for unary exporters, as well as
for HTTP exporters.
It also helps us test with realistic conditions in the new integration
test of #210.

https://grpc.io/docs/guides/deadlines/ explains why the default behavior
is the way it is for unary gRPC. This is also the case for Golang's
net/http client, which is used for the OTLP/HTTP exporter. Therefore, it
seems the right thing to do for Arrow as well. Further, if we were to
support extending the deadline on purpose, it would belong in the
exporterhelper code (IMO).

---------

Co-authored-by: Laurent Quérel <laurent.querel@gmail.com>
jmacd added a commit that referenced this pull request Jun 5, 2024
The new test ensures that we record memory limit errors.
Replaces #210.
Depends on #211, #212, #213, and particularly #215.
codeboten referenced this pull request in open-telemetry/opentelemetry-collector-contrib Jun 11, 2024
This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [github.com/open-telemetry/otel-arrow](https://togithub.com/open-telemetry/otel-arrow) | `v0.23.0` -> `v0.24.0` | [![age](https://developer.mend.io/api/mc/badges/age/go/github.com%2fopen-telemetry%2fotel-arrow/v0.24.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/go/github.com%2fopen-telemetry%2fotel-arrow/v0.24.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/go/github.com%2fopen-telemetry%2fotel-arrow/v0.23.0/v0.24.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/go/github.com%2fopen-telemetry%2fotel-arrow/v0.23.0/v0.24.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) |

---

> [!WARNING]
> Some dependencies could not be looked up. Check the Dependency
Dashboard for more information.

---

### Release Notes

<details>
<summary>open-telemetry/otel-arrow
(github.com/open-telemetry/otel-arrow)</summary>

### [`v0.24.0`](https://togithub.com/open-telemetry/otel-arrow/releases/tag/v0.24.0)

[Compare Source](https://togithub.com/open-telemetry/otel-arrow/compare/v0.23.0...v0.24.0)

Jitter is applied once per process, not once per stream.
[https://github.com/open-telemetry/otel-arrow/pull/199](https://togithub.com/open-telemetry/otel-arrow/pull/199)
Network statistics tracing instrumentation simplified.
[https://github.com/open-telemetry/otel-arrow/pull/201](https://togithub.com/open-telemetry/otel-arrow/pull/201)
Protocol includes use of more gRPC codes.
[https://github.com/open-telemetry/otel-arrow/pull/202](https://togithub.com/open-telemetry/otel-arrow/pull/202)
Receiver concurrency bugfix.
[https://github.com/open-telemetry/otel-arrow/pull/205](https://togithub.com/open-telemetry/otel-arrow/pull/205)
Concurrent batch processor size==0 bugfix.
[https://github.com/open-telemetry/otel-arrow/pull/208](https://togithub.com/open-telemetry/otel-arrow/pull/208)
New integration testing.
[https://github.com/open-telemetry/otel-arrow/pull/210](https://togithub.com/open-telemetry/otel-arrow/pull/210)
Use gRPC Status codes in the Arrow exporter.
[https://github.com/open-telemetry/otel-arrow/pull/211](https://togithub.com/open-telemetry/otel-arrow/pull/211)
Fix stream-shutdown race in Arrow receiver.
[https://github.com/open-telemetry/otel-arrow/pull/212](https://togithub.com/open-telemetry/otel-arrow/pull/212)
Avoid work for already-canceled requests.
[https://github.com/open-telemetry/otel-arrow/pull/213](https://togithub.com/open-telemetry/otel-arrow/pull/213)
Call IPCReader.Err() after reader loop.
[https://github.com/open-telemetry/otel-arrow/pull/215](https://togithub.com/open-telemetry/otel-arrow/pull/215)
Update to Arrow-Go v16.1.0.
[https://github.com/open-telemetry/otel-arrow/pull/218](https://togithub.com/open-telemetry/otel-arrow/pull/218)
Update to OpenTelemetry Collector v0.102.x.
[https://github.com/open-telemetry/otel-arrow/pull/219](https://togithub.com/open-telemetry/otel-arrow/pull/219)

</details>

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: opentelemetrybot <107717825+opentelemetrybot@users.noreply.github.com>
Co-authored-by: Yang Song <songy23@users.noreply.github.com>