Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should TimeZoneIdMapper support efficient storage and retrieval of non-canonical IANA IDs? #5610

Open
sffc opened this issue Sep 27, 2024 · 9 comments
Labels
C-time-zone Component: Time Zones needs-approval One or more stakeholders need to approve proposal U-temporal User: Temporal

Comments

@sffc
Copy link
Member

sffc commented Sep 27, 2024

See #5533 (comment)

Currently, TimeZoneIdMapper has two data payloads:

  1. Trie: IANA string to BCP-47 ID (does NOT support random access: only key-to-value lookup)
  2. Map: BCP-47 ID to canonical IANA string (random access supported)

These data structures allow efficient implementation of the following operations:

  • IANA to BCP-47
  • BCP-47 to IANA
  • Normalize IANA
  • Canonicalize IANA

However, it does not support efficient storage or retrieval of a non-canonical IANA name. For example, it supports mapping from "europe/kiev" to any of the following:

  • "Europe/Kiev" returned as an owned String
  • "uaiev" returned as a TinyStr
  • "Europe/Kyiv" returned as &str

But we don't have a representation that allows the user to return "Europe/Kiev" later: only "Europe/Kyiv". The client would need to store the string on their own.

A few ways to enable this behavior with data:

  1. Change the map from BCP-47 ID to IANA string to also include non-canonical aliases, and return an opaque integer of some sort to select the correct alias
  2. Switch from using a Trie to using a Map so that random access is supported, and then let the client save the index of their non-canonical time zone in the payload
  3. Add a third payload where this is its only job

One concern I have is that any of these approaches involves returning an index of some sort into a structure. However, that structure is ephemeral and could change across ICU4X versions, so I don't want people storing these indices. We currently always return BCP-47 IDs because those have a stability policy.

An alternative solution might be to intern the strings, which could be done externally to ICU4X. If you need to save the non-canonical ID, just stuff it into your own static string interning structure that deduplicates. The total size of the interned string collection is bounded. An example utility for this is elsa::FrozenIndexSet.

Thoughts @justingrant?

@sffc sffc added needs-approval One or more stakeholders need to approve proposal C-time-zone Component: Time Zones U-temporal User: Temporal labels Sep 27, 2024
@justingrant
Copy link

Thanks @sffc for starting this conversation. I don't yet have an opinion about the best solution here, but I've got some initial questions/comments below.

The main guidance I'd have (based on my experience with time zone canonicalization in ECMAScript) is that a time zone API should enable an efficient way to answer the following questions/usecases:

  1. Is this string a valid IANA ID?
  2. Do these two IANA IDs represent the same time zone, meaning they are aliases of each other in IANA? Note that this comparison should happen via a library method to dissuade users from using a naive string comparison, and because the data underlying this comparison will change across ICU4X versions.
  3. What is the normalized IANA ID for this IANA ID? (IANA strings should always be normalized before use, because they're case-insensitive)
  4. What is the list of time zones that I should use to build a UI picker? The list must include only one entry for each canonical IANA ID.
  5. How should I efficiently store a large number of time zones in memory, assuming that canonicalization *is* OK? Canonicalization is OK when the zone never needs to be converted back into an IANA ID.
  6. How should I efficiently store a large number of time zones in memory when canonicalization is *not* OK, like when a user's choice of IANA ID needs to be persisted and we want to respect the user's initial choice, despite future changes to IANA like Kiev=>Kyiv.

Does this list sound correct?

Does that list of use cases influence the proposed solutions in your OP?

Other than the last one in the list, are there others that current ICU4X doesn't meet?

that structure is ephemeral and could change across ICU4X versions

Well, we could choose to make the indexes stable over time if we wanted to (with newly-added zones being given new, larger indexes than existing zones), at the cost of being unable to use the indexes to provide a lexicographic sort order.

In other words from the caller's perspective they'd be IDs not indexes and would need to be documented as such. I'm not suggesting that we should make them stable, but we could.

If we did do this, then it'd be smart to randomize the order to avoid clients fooling themselves into thinking that the indexes were sorted.

If we choose to make an unstable ID visible to callers, then maybe naming it something like unstableId would be smart.

  • "uaiev" returned as a TinyStr

I assume you mean TinyAsciiStr?

An alternative solution might be to intern the strings, which could be done externally to ICU4X. If you need to save the non-canonical ID, just stuff it into your own static string interning structure that deduplicates. The total size of the interned string collection is bounded. An example utility for this is elsa::FrozenIndexSet.

One variant on this could be for ICU4X to provide some method that dynamically creates a data structure that callers will want, instead of requiring every caller to figure this out on their own.

@sffc
Copy link
Member Author

sffc commented Oct 3, 2024

1-5 are all supported with current data, although we don't have an API yet for 4. But we already have a list of all canonical BCP-47 IDs in the data.

I'm definitely not in favor of returning any stable integer, because that just invites people to store them, and then we've created a new taxonomy. I think if we decided to support this, returning a string reference is about at far as I'd want to go.

I'm open to the idea of us using a string interning data structure internally.

@justingrant
Copy link

I'm definitely not in favor of returning any stable integer

I don't know enough Rust to know if this idea is practical, but could the memory savings be achieved without letting callers access the actual index value by wrapping the index as a private field in a public struct?

For example, a NamedTimeZoneId struct with an index: u16 private field, and with some public methods in the impl like these?

  • from_IANA factory method that accepts a string. Could also have from_BCP47.
  • to_IANA instance method that returns the IANA name as an (interned?) string. Could also have to_BCP47.
  • get_all_IANA static method to return the full list of normalized IANA IDs. Could also have get_all_BCP47.
  • Maybe an instance method that accepts a NamedTimeZoneId and returns true if both the argument and Self resolve to the same canonical ID, false otherwise. Could also be a static method.
  • For serialization/deserialization, use the IANA string.

Apologies if my Rust inexperience is showing here, if this is not practical.

@sffc
Copy link
Member Author

sffc commented Oct 3, 2024

Yeah, we could return a wrapped thing, or a string reference, if we have the data.

@justingrant
Copy link

How many bytes does a string reference cost in Rust? Would a wrapped u16 be cheaper?

@sffc
Copy link
Member Author

sffc commented Oct 4, 2024

Well, for the best user experience, a wrapped u16 would still want to retain a reference to the string interning structure or data payload so that it can be dereferenced to a string, so it ends up being the same size as a string reference. I feel like at that point we're splitting hairs; if you really want to just store a u16 and nothing else, then you should just do it yourself.

@justingrant
Copy link

I was assuming that the list of interned strings would be immutable and global so wouldn't need a reference to it. But this assumption might be wrong?

@sffc
Copy link
Member Author

sffc commented Oct 4, 2024

I was assuming that the list of interned strings would be immutable and global so wouldn't need a reference to it. But this assumption might be wrong?

We have a policy of no global caches in ICU4X (unlike ICU4C) because there's no one size fits all solution to caching. So the cache would need to be owned by an instance of an object, and the u16 key would need to always be passed into the same instance...

Would you be happy enough with a cookbook example in our docs about how to do this in userland?

@justingrant
Copy link

Oh, OK. If global caches are out of scope, then yes a cookbook example makes sense for this case, at least as a starting point to see if it's ergonomic enough.

Ignoring the u16 vs. interned IANA string vs. BCP-47 TinyAsciiStr which is just an optimization, my main concern would be that if the most ergonomic solution is to canonicalize IDs, then it's easier for the ecosystem to end up in a place where renaming an IANA zone causes lots of code to break.

At a minimum, it seems like there should be stable BCP47 IDs for all non-canonical IANA zones, and that ICU4X should not canonicalize by default. So users would have to opt into canonicalizing. IIRC <25% of IANA IDs are Links (aka non-canonical), in case that matters in measuring the impact of not canonicalizing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-time-zone Component: Time Zones needs-approval One or more stakeholders need to approve proposal U-temporal User: Temporal
Projects
None yet
Development

No branches or pull requests

2 participants