-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should TimeZoneIdMapper support efficient storage and retrieval of non-canonical IANA IDs? #5610
Comments
Thanks @sffc for starting this conversation. I don't yet have an opinion about the best solution here, but I've got some initial questions/comments below. The main guidance I'd have (based on my experience with time zone canonicalization in ECMAScript) is that a time zone API should enable an efficient way to answer the following questions/usecases:
Does this list sound correct? Does that list of use cases influence the proposed solutions in your OP? Other than the last one in the list, are there others that current ICU4X doesn't meet?
Well, we could choose to make the indexes stable over time if we wanted to (with newly-added zones being given new, larger indexes than existing zones), at the cost of being unable to use the indexes to provide a lexicographic sort order. In other words from the caller's perspective they'd be IDs not indexes and would need to be documented as such. I'm not suggesting that we should make them stable, but we could. If we did do this, then it'd be smart to randomize the order to avoid clients fooling themselves into thinking that the indexes were sorted. If we choose to make an unstable ID visible to callers, then maybe naming it something like
I assume you mean
One variant on this could be for ICU4X to provide some method that dynamically creates a data structure that callers will want, instead of requiring every caller to figure this out on their own. |
1-5 are all supported with current data, although we don't have an API yet for 4. But we already have a list of all canonical BCP-47 IDs in the data. I'm definitely not in favor of returning any stable integer, because that just invites people to store them, and then we've created a new taxonomy. I think if we decided to support this, returning a string reference is about at far as I'd want to go. I'm open to the idea of us using a string interning data structure internally. |
I don't know enough Rust to know if this idea is practical, but could the memory savings be achieved without letting callers access the actual index value by wrapping the index as a private field in a public struct? For example, a
Apologies if my Rust inexperience is showing here, if this is not practical. |
Yeah, we could return a wrapped thing, or a string reference, if we have the data. |
How many bytes does a string reference cost in Rust? Would a wrapped |
Well, for the best user experience, a wrapped |
I was assuming that the list of interned strings would be immutable and global so wouldn't need a reference to it. But this assumption might be wrong? |
We have a policy of no global caches in ICU4X (unlike ICU4C) because there's no one size fits all solution to caching. So the cache would need to be owned by an instance of an object, and the u16 key would need to always be passed into the same instance... Would you be happy enough with a cookbook example in our docs about how to do this in userland? |
Oh, OK. If global caches are out of scope, then yes a cookbook example makes sense for this case, at least as a starting point to see if it's ergonomic enough. Ignoring the u16 vs. interned IANA string vs. BCP-47 TinyAsciiStr which is just an optimization, my main concern would be that if the most ergonomic solution is to canonicalize IDs, then it's easier for the ecosystem to end up in a place where renaming an IANA zone causes lots of code to break. At a minimum, it seems like there should be stable BCP47 IDs for all non-canonical IANA zones, and that ICU4X should not canonicalize by default. So users would have to opt into canonicalizing. IIRC <25% of IANA IDs are Links (aka non-canonical), in case that matters in measuring the impact of not canonicalizing. |
See #5533 (comment)
Currently, TimeZoneIdMapper has two data payloads:
These data structures allow efficient implementation of the following operations:
However, it does not support efficient storage or retrieval of a non-canonical IANA name. For example, it supports mapping from "europe/kiev" to any of the following:
&str
But we don't have a representation that allows the user to return "Europe/Kiev" later: only "Europe/Kyiv". The client would need to store the string on their own.
A few ways to enable this behavior with data:
One concern I have is that any of these approaches involves returning an index of some sort into a structure. However, that structure is ephemeral and could change across ICU4X versions, so I don't want people storing these indices. We currently always return BCP-47 IDs because those have a stability policy.
An alternative solution might be to intern the strings, which could be done externally to ICU4X. If you need to save the non-canonical ID, just stuff it into your own static string interning structure that deduplicates. The total size of the interned string collection is bounded. An example utility for this is elsa::FrozenIndexSet.
Thoughts @justingrant?
The text was updated successfully, but these errors were encountered: