Skip to content

Commit

Permalink
Edits.
Browse files Browse the repository at this point in the history
  • Loading branch information
jmacd committed Oct 10, 2024
1 parent e9a0d56 commit 47b656f
Showing 1 changed file with 34 additions and 59 deletions.
93 changes: 34 additions & 59 deletions text/0257-utf8-handling.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,11 +48,6 @@ of telemetry caused by string-valued fields that are not well-formed.
A number of options are available, all of which are better than doing
nothing about this problem.

This document also proposes to extend the OpenTelemetry API to support
byte-slice valued attributes. Otherwise, users have no good
alterntives to handling this kind of data, which is often how invalid
UTF-8 arises.

## Internal details

Text handling is such a key aspect of programming languages that
Expand Down Expand Up @@ -129,45 +124,17 @@ diferent protocol buffer implementation would likely be the first to
observe the invalid UTF-8, and by that time, a large batch of
telemetry could fail as a result.

### Responsibility to the API user

When a user uses a telemetry SDK to gain observability into their
application, it can lead to quite the opposite result when invalid
UTF-8 causes loss of telemetry. When a user turns to a telemetry
system to obseve their system and makes a UTF-8 mistake, the
consequent total loss of telemetry is a definite bad outcome. While
the above discussion addresses attempts to correct a problem,
OpenTelemetry has not given the API user what they want in this
situation.

#### Byte-slice valued attribute API

The OpenTelemetry API specification does not permit the use of
byte-slice valued attributes. This is one reason why invalid UTF-8
arises, because OpenTelemetry users have not been given an
alternative.

When faced with a desire to record invalid UTF-8 in a stream of
telemetry, users have no good options. They can use a base64
encoding, but they will have to do so manually, and with extra code,
and they are likely to learn this only after first making the mistake
of using a string-value to encode bytes data. When users see
corrupted telemetry data containing � characters, they will return to
OpenTelemetry with a request: how should they report bytes data?

Ironically, in a prototype Collector processor to repair invalid
UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field
constructor to report the invalid data, and this is something an
OpenTelemetry API user cannot do.

This OTEP proposes to add an attribute-value type for byte slices, to
represent an array of bytes with unknown encoding. Examples data
values where this attribute could be useful:

- checksum values (e.g., a sha256 checksum represented as 32 bytes)
- hash values (e.g., a 56-bit hash value represented as 7 bytes)
- register contents (e.g., a memory region ... 64 bytes)
- invalid UTF-8 (e.g., data lost because of UTF-8 validation).
### No byte-slice valued attribute API

As a caveat, the OpenTelemetry project has previously debated and
rejected the potential to support a byte-slice typed attribute. This
potential feature was rejected because it allows API users a way to
record a possibly uninterpretable sequence of bytes. Users with
invalid UTF-8 are left with a few options, for example:

- Base64-encode the invalid data wrapped in human-readable syntax
(e.g., `base64:WNhbHF1ZWVuY29hY2hod`).
- Transmute the byte slice to an integer slice, which is supported.

### Responsibility to the Collector user

Expand All @@ -188,26 +155,34 @@ protect users. There are several places this could be done:
3. Prior to an exporter, with data coming from either an external
source and/or modified by potentially untrustworhty code.

Depending on user preferences, any of these outcomes might be best.
Every mutating processor could force the pipeline to re-check for
validity, in the strictest of configurations.

#### New collector component Capabilities for UTF-8 strictness
Each of these approaches will take significant effort and cost the
user at runtime, therefore:

Each collector component that strongly requires valid UTF-8 will
declare so in an Capabilities field, and this will cause the Collector
to re-validate UTF-8 before invoking that component. By default,
exporters will require valid UTF-8.
- UTF-8 validation SHOULD be enabled by default
- Users SHOULD be able to opt out of UTF-8 validation
- A `receiverhelper` library SHOULD offer a function to correct
invalid UTF-8 in-place, with two configurable outcomes (1) reject
individual items containing invalid UTF-8, (2) repair invalid UTF-8.

Provided that no processors are strict about UTF-8 validation, UTF-8
validation will happen only once per pipeline segment. This means
UTF-8 validation can be deferred until after sampling and filtering
operations have finsihed. Each pipeline segment will have at least
one UTF-8 validation step, which will either automatically correct the
problem (default) or reject the individual item of data (optional).
When an OpenTelemetry collector receives telemetry data in any
protocol, in cases where the underlying RPC or protocol buffer
libraries does not guarantee valid UTF-8, the `receiverhelper` library
SHOULD be called to optionally validate UTF-8 according to the
confiuration described above.

### Alternatives considered

#### Exhaustive validation

The Collector behavior proposed above only validates data after it is
received, not after it is modified by a processor. We consider the
risk of a malfunctioning processor to be acceptable. If this happens,
it will be considered a bug in the Collector component. In other
words, the Collector SHOULD NOT perform internal data validation and
it SHOULD perform external data validation.

#### Permissive trasnport

We observe that some protocol buffer implementations are permissive
with respect to invalid UTF-8, while others are strict. It can be
risky to combine dissimilar protocol buffer implementations in a
Expand Down

0 comments on commit 47b656f

Please sign in to comment.