Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #11811 insensitive header name set #11823

Merged
merged 3 commits into from
May 22, 2024

Conversation

gregw
Copy link
Contributor

@gregw gregw commented May 21, 2024

Fix #11811 insensitive header name set by:

  • Using a EnumSet and TreeSet to ensure no duplicates in the set
  • Using an ArrayList to preserve the ordering (not necessary, but useful).

Fix #11811 insensitive header name set by:
 + Using a EnumSet and TreeSet to ensure no duplicates in the set
 + Using an ArrayList to preserve the ordering (not necessary, but useful).
@gregw gregw requested a review from lachlan-roberts May 21, 2024 07:17
@gregw gregw merged commit 10d0898 into jetty-12.0.x May 22, 2024
9 checks passed
@gregw gregw deleted the fix/jetty-12.0.x/11811/insensitiveHeaderNameSet branch May 22, 2024 20:42
Comment on lines +622 to +635
return new AbstractSet<>()
{
@Override
public Iterator<String> iterator()
{
return list.iterator();
}

@Override
public int size()
{
return list.size();
}
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically Set implementations support random access or O(log(size)) and rarely have linear access. This may cause surprise performance issues in the calling code.
What about replacing the List<String> list variable with a hash set, while retaining all other logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned set will have "random access" via the set API, but it will be implemented by AbstractSet as an O(n) search. Moreover, a user is unlikely to be using the header name set for random access, as they have the direct get methods available to seek and individual field. The only realistic usage of the header set is to iterator over it's values to see all headers. The list is a perfect compromise in this case.

Note also, that due to HTTP headers need to know about ordering, we have already determined with JMH benchmarks that for realistic sizes of HTTP header collections, it is more efficient to do an O(n) search than to have the expense of building a O(log(n)) map. Thus rather than creating an ordered map of HttpFields or a map and a list, we simple create a list of HttpFields. So the get methods are actually O(N) as well. This is something we have carefully considered and will continue to monitor.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on assumptions how the user uses this set. They may iterate over it (then we are good), or they may use contains and then we're less good. We do not need a sorted set here. HashSet would be faster to build than TreeSet and would have same semantics as the set returned here. If ordering of headers returned by this method is somehow important, LinkedHashSet could provide for that.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, from code maintainer perspective, the previous version of the code didn't look like super optimized -- it used streams' along with their overhead (for example in Trino project we area heavy stream users except for data runtime, where we avoid them). The new code doesn't look as super optimized to me because of AbstractSet. If it indeed is carefully optimized now, my personal preference would be to put a code comment informing the reader about the tradeoffs considered (like "this is optimal for requests that have no more than about x headers").

Copy link
Contributor Author

@gregw gregw May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on assumptions how the user uses this set.

Indeed, but they are assumptions based on good experience and real world data. See the detailed benchmarks and discussions in #3682

They may iterate over it (then we are good), or they may use contains and then we're less good.
Our evidence is that if they use contains we are still good. Only if they use contains many times does the list performance approach the cost of building a hashmap or other structure.

We do not need a sorted set here.
It is not needed, but HTTP headers are ordered, so maintaining that order can improve compliance with many SHOULD aspects of the RFCs that do specify some aspects of ordering.

HashSet would be faster to build than TreeSet and would have same semantics as the set returned here.

Except that HashSet is not case insensitive, so additional work is need to change cases etc.

If ordering of headers returned by this method is somehow important, LinkedHashSet could provide for that.

again, not suitable for case insensitive. Also, the order we want is not the natural sort order of the keys.

BTW, from code maintainer perspective, the previous version of the code didn't look like super optimized -- it used streams' along with their overhead (for example in Trino project we area heavy stream users except for data runtime, where we avoid them).

We totally agree that streams are not really optimal.

The new code doesn't look as super optimized to me because of AbstractSet.

I would not say it is super optimized, but it is somewhat optimized as informed by #3682

If it indeed is carefully optimized now, my personal preference would be to put a code comment informing the reader about the tradeoffs considered (like "this is optimal for requests that have no more than about x headers").

We tend not to clutter the code with such comments, as the issues and PRs that have all the detail are both persistent and linked from source code. However, I'm not opposed to making the comment in the HttpFields class generally that it is implemented internally as a list. I'll reopen #11811 to remind myself to do that.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashSet would be faster to build than TreeSet and would have same semantics as the set returned here.

Except that HashSet is not case insensitive, so additional work is need to change cases etc.

The Set<String> returned here is not case insensitive either, by virtue of inheriting from AbstractSet and defining only iterator and size. AbstractSet uses Object.equals semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi said:

The Set returned here is not case insensitive either

Good point! I'm working on fixes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #11846... still a work in progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

getHeaderNames should return header name once also when request has it in different case
5 participants