handle invalid accept-charset in requests - default to utf-8 #584

pjfanning · 2024-08-12T20:50:14Z

I've added #585 to look into the XML case - this PR focuses on text/plain.

scalafmt

raboof · 2024-08-13T07:44:30Z

.../test/scala/org/apache/pekko/http/scaladsl/server/directives/MarshallingDirectivesSpec.scala

+    "render text with requested encoding if an `Accept-Charset` request header requests a non-UTF-8 encoding" in {
+      Get() ~> `Accept-Charset`(`ISO-8859-1`) ~> complete(foo) ~> check {
+        responseEntity shouldEqual HttpEntity(ContentType(`text/plain`, `ISO-8859-1`), foo)
+      }
+    }
+    "render text with UTF-8 encoding if an `Accept-Charset` request header requests an unknown encoding" in {
+      Get() ~> `Accept-Charset`(HttpCharset("unknown")(Nil)) ~> complete(foo) ~> check {
+        responseEntity shouldEqual HttpEntity(ContentType(`text/plain`, `UTF-8`), foo)
+      }
+    }


I realize this is a comment on the current behavior, not on your change, but this seems strange to me. It looks like the default string marshaller will return the string passed to it as text/plain, claiming it is in whatever charset the user requested, without doing anything to make sure the string is actually in that charset.

This is only correct if you assume the implementation correctly took into account the charset, which seems extremely unlikely. I wonder if it wouldn't make more sense to leave out the optional charset parameter to the return content type in this case. If people have specific requirements (which seems somewhat unlikely) they can specify the charset explicitly.

This would be a behavior change, but it seems like a reasonable one if we note it in the release notes.

Before the change to the main source code in this PR, this test actually fails: the response is an error response.

Hi,

Thanks for having a look at this bug / feature request. I recently sent in a duplicate issue (#583) unaware of this one. I worked on a similar PR the past two days, and have now found and looked through your changes - that would seem to fix it nicely! Great.

For your consideration, I'd like to suggest adding in an extra unit-test in "MarshallingSpec.scala" with something along the lines of:

    val invalidAcceptCharsetHeader = `Accept-Charset`(HttpCharsetRange(HttpCharset.custom("invalid")))
    val request = HttpRequest().withHeaders(Seq(invalidAcceptCharsetHeader))
    marshalToResponse("Hello", request).entity.contentType.charsetOption should be(Some(HttpCharsets.`UTF-8`))

Some refactoring thoughts: perhaps the try-catch in "PredefinedToEntityMarshallers.scala" could be placed more subtly in "HttpCharset.scala" as:

    /** Returns this HttpCharset instance if the provided nio charset is available or a desired default otherwise */
    def withNioCharsetOrElse(default: HttpCharset): HttpCharset = {
      if (_nioCharset.isSuccess) {
        // Return this instance, as the provided nioCharset did not result in a java.nio.charset.UnsupportedCharsetException
        this
      } else {
        default
      }
    }

and then have the "PredefinedToEntityMarshallers.scala" do:

    Marshaller.withOpenCharset(mediaType) { (s, cs) =>
      HttpEntity(mediaType.withCharset(cs.withNioCharsetOrElse(HttpCharsets.`UTF-8`)), s)
    }

Looking forward to seeing your changes merged.

Thanks again and best regards,
Michael

Thanks @michaelf-crwd - I have added a new HttpCharsets function similar to what you suggested (c8514e7).
I will look later at adding extra tests like the one you suggested.
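
For readers without the codebase at hand, the fallback idea discussed above can be sketched with the JDK alone. This is a minimal illustration, not the code merged in this PR: the object name CharsetFallback and the helper charsetOrUtf8 are hypothetical, and it works on plain charset names rather than on pekko-http's HttpCharset.

```scala
import java.nio.charset.{ Charset, StandardCharsets }
import scala.util.Try

// Hypothetical helper, not part of pekko-http.
object CharsetFallback {
  // Charset.forName throws IllegalCharsetNameException or
  // UnsupportedCharsetException for names the JVM cannot resolve.
  // Wrapping the lookup in Try lets us fall back to UTF-8 instead of failing.
  def charsetOrUtf8(name: String): Charset =
    Try(Charset.forName(name)).getOrElse(StandardCharsets.UTF_8)
}
```

With this sketch, a name like "blop" resolves to UTF-8, while a name the JVM knows, such as "ISO-8859-1", is returned unchanged.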

@raboof would you be able to take a look at this again? I don't think this change breaks anything, and it makes the pekko-http support for Accept-Charset a bit more tolerant.

without doing anything to make sure the string is actually in that charset

@raboof, what do you mean by that? Java strings are basically UTF-16 encoded, so they are known to be Unicode. What else could go wrong?

What I meant is: when the route does complete(someUnicodeString) and the marshalling logic results in a Content-Type: text/plain; charset=ISO-8859-1 header, that response seems wrong. The header promises the body will be in ISO-8859-1, but the body is actually in Unicode. I think it would make more sense to produce a Content-Type: text/plain; charset=UTF-8 header unless the developer explicitly marked the response as being in a different charset.

Or will the marshalling logic actually convert the Unicode input into bytes using the charset from the content type? In that case, of course, it's all good.

Indeed it does, sorry about the noise 🤦
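
The point confirmed above, that the same string becomes different bytes depending on the charset, can be illustrated with the JDK alone. This is a minimal sketch under that assumption: the object name EncodingDemo is hypothetical, and pekko-http's own marshalling code is not shown here.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical demo object, not part of pekko-http.
object EncodingDemo {
  // "Hällö", written with escapes so the source file encoding does not matter
  val text = "H\u00e4ll\u00f6"

  // In UTF-8 the two umlauts take two bytes each
  val utf8Bytes: Array[Byte] = text.getBytes(StandardCharsets.UTF_8)

  // In ISO-8859-1 every character of this string fits in one byte
  val latin1Bytes: Array[Byte] = text.getBytes(StandardCharsets.ISO_8859_1)

  // US-ASCII cannot represent the umlauts; String.getBytes(Charset)
  // substitutes the charset's replacement byte for unmappable characters
  val asciiBytes: Array[Byte] = text.getBytes(StandardCharsets.US_ASCII)
}
```

So a Content-Type header naming ISO-8859-1 or US-ASCII is only truthful if the body bytes were produced with that charset, which is what the question above was about.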

nvollmar

lgtm

raboof · 2024-09-06T12:30:26Z

Sorry for not finding time to look into this sooner, will do so now.

jrudolph

lgtm, I agree with the idea to treat the Accept-XYZ headers rather softly, especially Accept-Charset which is losing relevancy these days. As long as the server clearly indicates what representation it uses, it seems fine.

raboof

I tested various scenarios:

Explicit charset in response

i.e. a route with complete(HttpEntity(ContentTypes.`text/html(UTF-8)`, "Say Hällö to Pekko HTTP")):

request: curl -v localhost:8080/hello
  old behavior: response in Content-Type: text/html; charset=UTF-8 💚
  new behavior: same 💚

request: curl -v -H "Accept-Charset: blop" localhost:8080/hello
  old behavior: 500 error and stacktrace in the logs 🛑
  new behavior: 406 Not Acceptable, "Resource representation is only available with these types: text/html; charset=UTF-8" 💚

request: curl -v -H "Accept-Charset: US-ASCII" localhost:8080/hello
  old behavior: 406 Not Acceptable, "Resource representation is only available with these types: text/html; charset=UTF-8" 💚
  new behavior: same 💚

request: curl -v -H "Accept-Charset: ISO-8859-1" localhost:8080/hello
  old behavior: 406 Not Acceptable 💚
  new behavior: same 💚

No explicit charset in response

i.e. a route with complete("Say Hällö to Pekko HTTP"):

request: curl -v localhost:8080/hello
  old behavior: response in text/plain; charset=UTF-8 💚
  new behavior: same 💚

request: curl -v -H "Accept-Charset: blop" localhost:8080/hello
  old behavior: 500 error and stacktrace in the logs 🛑
  new behavior: response in text/plain; charset=UTF-8 💚

request: curl -v -H "Accept-Charset: US-ASCII" localhost:8080/hello
  old behavior: response with Content-Type: text/plain; charset=US-ASCII header, ~~but UTF-8 body~~ response converted to US-ASCII 🛑 💚
  new behavior: same 🛑 💚

request: curl -v -H "Accept-Charset: ISO-8859-1" localhost:8080/hello
  old behavior: response with Content-Type: text/plain; charset=ISO-8859-1 header, ~~but UTF-8 body~~ response converted to ISO-8859-1 🛑 💚
  new behavior: same 🛑 💚

In summary, I think this PR is a definite improvement. Some behavior seems wrong to me, but that behavior was already like that, so that doesn't need to delay merging this PR or releasing 1.1.0. I'll put my code where my mouth is and propose a fix for the behavior that looks wrong to me, but that's independent of 1.1.0. Sorry for taking so long before confirming ;)

Update: the strings are converted to bytes taking into account the encodings just fine, all good, sorry about the noise 🤦

jrudolph · 2024-09-06T14:07:17Z

the strings are converted to bytes taking into account the encodings just fine, all good, sorry about the noise 🤦

The whole negotiation story is really sophisticated and tries to take that into account, indeed. It's also one of the performance hotspots of the routing layer.

jrudolph · 2024-09-06T14:11:17Z

Anyway thanks for that comprehensive testing session, @raboof 👍

pjfanning · 2024-09-06T16:37:49Z

Thanks everyone. I'll merge as is but if anyone has any requests for changes, I'll look at them.

add tests for accept-charset

38cceb3

pjfanning marked this pull request as draft August 12, 2024 20:50

default to utf-8 if charset is invalid

479dfa1

scalafmt

pjfanning force-pushed the accept-charset branch from e7fd945 to 479dfa1 Compare August 12, 2024 20:53

raboof reviewed Aug 13, 2024

pjfanning added 3 commits August 13, 2024 18:03

xml tests (1 broken still)

3ccd053

add charsetWithUtf8Failover function

c8514e7

Update MarshallingSpec.scala

1ed3fd7

pjfanning marked this pull request as ready for review August 16, 2024 15:26

pjfanning mentioned this pull request Aug 16, 2024

handle invalid charsets in accept-charset header (XML case) #585

Open

scala 2.12 compile issue

6c4aaf7

pjfanning added this to the 1.1.0-M2 milestone Aug 28, 2024

nvollmar approved these changes Sep 6, 2024

jrudolph approved these changes Sep 6, 2024

raboof approved these changes Sep 6, 2024

pjfanning merged commit e6f1f60 into apache:main Sep 6, 2024
10 checks passed

pjfanning deleted the accept-charset branch September 6, 2024 16:37
