Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle invalid accept-charset in requests - default to utf-8 #584

Merged
merged 6 commits into from
Sep 6, 2024

Conversation

pjfanning
Copy link
Contributor

@pjfanning pjfanning commented Aug 12, 2024

see #300

I've added #585 to look into the XML case - this PR focuses on text/plain.

@pjfanning pjfanning marked this pull request as draft August 12, 2024 20:50
Comment on lines +286 to +295
"render text with requested encoding if an `Accept-Charset` request header requests a non-UTF-8 encoding" in {
Get() ~> `Accept-Charset`(`ISO-8859-1`) ~> complete(foo) ~> check {
responseEntity shouldEqual HttpEntity(ContentType(`text/plain`, `ISO-8859-1`), foo)
}
}
"render text with UTF-8 encoding if an `Accept-Charset` request header requests an unknown encoding" in {
Get() ~> `Accept-Charset`(HttpCharset("unknown")(Nil)) ~> complete(foo) ~> check {
responseEntity shouldEqual HttpEntity(ContentType(`text/plain`, `UTF-8`), foo)
}
}
Copy link
Member

@raboof raboof Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is a comment on the current behavior, and not on your change, but this seems strange to me: it looks like the default string marshaller will return the string passed to it as text/plain claiming it is in whatever charset that was requested by the user, without doing anything to make sure the string is actually in that charset.

This is only correct if you assume the implementation correctly took into account the charset, which seems extremely unlikely. I wonder if it wouldn't make more sense to leave out the optional charset parameter to the return content type in this case. If people have specific requirements (which seems somewhat unlikely) they can specify the charset explicitly.

This would be a behavior change, but it seems like a reasonable one if we note it in the release notes.

Copy link
Contributor Author

@pjfanning pjfanning Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before the change to the main source code in this PR, this test actually fails - the response is an error response without my main source change

Copy link

@michaelf-crwd michaelf-crwd Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Thanks for having a look at this bug / feature request. I recently sent in a duplicate issue (#583) unaware of this one. I worked on a similar PR the past two days, and have now found and looked through your changes - that would seem to fix it nicely! Great.

For your consideration, I'd like to suggest adding in an extra unit-test in "MarshallingSpec.scala" with something along the lines of:

val invalidAcceptCharsetHeader = `Accept-Charset`(HttpCharsetRange(HttpCharset.custom("invalid")))
val request = HttpRequest().withHeaders(Seq(invalidAcceptCharsetHeader))
marshalToResponse("Hello", request).entity.contentType.charsetOption should be(Some(HttpCharsets.`UTF-8`))

Some refactoring thoughts; Perhaps the try-catch in "PredefinedToEntityMarshallers.scala" could be more subtly placed in "HttpCharset.scala" as:

/** Returns this HttpCharset instance if the provided nio charset is available or a desired default otherwise */
  def withNioCharsetOrElse(default: HttpCharset): HttpCharset = {
    if (_nioCharset.isSuccess) {
      // Return this instance, as the provided nioCharset did not result in a java.nio.charset.UnsupportedCharsetException
      this
    } else {
      default
    }
  }

and then have the "PredefinedToEntityMarshallers.scala" do:

Marshaller.withOpenCharset(mediaType) { (s, cs) => {
  HttpEntity(mediaType.withCharset(cs.withNioCharsetOrElse(HttpCharsets.`UTF-8`)), s)
} }

Looking forward to seeing your changes merged.

Thanks again and best regards,
Michael

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @michaelf-crwd - I have added a new HttpCharsets function similar to what you suggested (c8514e7).
I will look later at adding extra tests like the one you suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raboof would you be able to take a look at this again? I don't think this change breaks anything and makes the pekko-http support for Accept-Charset a bit more tolerant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without doing anything to make sure the string is actually in that charset

@raboof, WDYM with that? Java strings are basically UTF-16 encoded, so they are known to be unicode. What else could go wrong?

Copy link
Member

@raboof raboof Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I meant is, when the route does complete(someUnicodeString), and the marshalling logic results in a Content-Type: text/plain; charset=ISO-8859-1 header, that response seems wrong: the header promises the body will be in ISO-8859-1, but the body is actually in unicode. I think it would make more sense to make sure we produce a Content-Type: text/plain; charset=UTF-8 header unless the developer explicitly marked the response as being in a different charset.

Copy link
Member

@raboof raboof Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or will the marshalling logic actually convert the unicode input into bytes using the charset from the content type? in that case of course it's all good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it does, sorry about the noise 🤦

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@pjfanning pjfanning added this to the 1.1.0-M2 milestone Aug 28, 2024
Copy link

@nvollmar nvollmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@raboof
Copy link
Member

raboof commented Sep 6, 2024

Sorry for not finding time to look into this sooner, will do so now.

Copy link
Contributor

@jrudolph jrudolph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, I agree with the idea to treat the Accept-XYZ headers rather softly, especially Accept-Charset which is losing relevancy these days. As long as the server clearly indicates what representation it uses, it seems fine.

Copy link
Member

@raboof raboof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested various scenario's:

Explicit charset in response

i.e. a route with complete(HttpEntity(ContentTypes.text/html(UTF-8), "Say Hällö to Pekko HTTP")):

  • request: curl -v localhost:8080/hello
    old behavior: response in Content-Type: text/html; charset=UTF-8 💚
    new behavior: same 💚
  • request: curl -v -H "Accept-Charset: blop" localhost:8080/hello
    old behavior: 500 error and stacktrace in the logs 🛑
    new behavior: 406 Not Acceptable, "Resource representation is only available with these types: text/html; charset=UTF-8" 💚
  • request: curl -v -H "Accept-Charset: US-ASCII" localhost:8080/hello
    old behavior: 406 Not Acceptable, "Resource representation is only available with these types: text/html; charset=UTF-8" 💚
    new behavior: same 💚
  • curl -v -H "Accept-Charset: ISO-8859-1" localhost:8080/hello
    old -> response 406 Not Acceptable 💚
    new -> same 💚

No explicit charset in response

i.e. a route with complete("Say Hällö to Pekko HTTP"):

with complete("Say Hällö to Pekko HTTP")

  • request: curl -v localhost:8080/hello
    old behavior: response in text/plain; charset=UTF-8 💚
    new behavior: same 💚
  • request: curl -v -H "Accept-Charset: blop" localhost:8080/hello
    old behavior: 500 error and stacktrace in the logs 🛑
    new behavior: response in text/plain; charset=UTF-8 💚
  • request: curl -v -H "Accept-Charset: US-ASCII" localhost:8080/hello
    old behavior: response with Content-Type: text/plain; charset=US-ASCII header, but UTF-8 body response converted to US-ASCII 🛑 💚
    new behavior: same 🛑 💚
  • curl -v -H "Accept-Charset: ISO-8859-1" localhost:8080/hello
    old -> response with Content-Type: text/plain; charset=ISO-8859 header, but UTF-8 body response converted to US-ASCII 🛑 💚
    new -> same 🛑 💚

In summary, I think this PR is a definite improvement. Some behavior seems wrong to me, but that behavior was already like that, so that doesn't need to delay merging this PR or releasing 1.1.0. I'll put my code where my mouth is and propose a fix for the behavior that looks wrong to me, but that's independent of 1.1.0. Sorry for taking so long before confirming ;) the strings are converted to bytes taking into account the encodings just fine, all good, sorry about the noise 🤦

@jrudolph
Copy link
Contributor

jrudolph commented Sep 6, 2024

the strings are converted to bytes taking into account the encodings just fine, all good, sorry about the noise 🤦

The whole negotiation story is really sophisticated and tries to take that into account, indeed. It's also one of the performance hotspots of the routing layer.

@jrudolph
Copy link
Contributor

jrudolph commented Sep 6, 2024

Anyway thanks for that comprehensive testing session, @raboof 👍

@pjfanning pjfanning merged commit e6f1f60 into apache:main Sep 6, 2024
10 checks passed
@pjfanning pjfanning deleted the accept-charset branch September 6, 2024 16:37
@pjfanning
Copy link
Contributor Author

Thanks everyone. I'll merge as is but if anyone has any requests for changes, I'll look at them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants