Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize YearNames storage #5602

Open
Manishearth opened this issue Sep 27, 2024 · 5 comments
Open

Optimize YearNames storage #5602

Manishearth opened this issue Sep 27, 2024 · 5 comments
Labels
discuss Discuss at a future ICU4X-SC meeting

Comments

@Manishearth
Copy link
Member

Currently YearNames are stored as "eras" (a zeromap) or "cyclic" (a varzerovec).

Most calendars have either a single era, or a small fixed set of eras (gregory, coptic, ethiopic). Only Japanese has many. The map is wasteful in these cases, we have thousands of entries for YearNames that repeat the era names over and over again.

What if we optimized most calendars to have a VarZeroVec of eras?

pub enum YearNamesV1<'data> {
    FixedEras(VarZeroVec<'data, str>),
    VariableEras(ZeroMap<'data, PotentialUtf8, str>),
    Cyclic(VarZeroVec<'data, str>),
}

and then formatting eras can be enum FormattingEra { Index(u8), Code(EraCode) }. We can assign index 0 to the "extended year era" for consistency.

We'd need to be careful in the ethiopic shared-era situation to assign indices in such a way that both calendars share indices.

In fact we could even go one further and use a VarZeroVec for Japanese, but this is fraught in the face of maintaining the index mapping.

cc @zbraniecki @sffc

@Manishearth Manishearth added the discuss Discuss at a future ICU4X-SC meeting label Sep 27, 2024
@sffc
Copy link
Member

sffc commented Sep 27, 2024

I really like this design. It seems consistent with what we did with the LeapLinear design: optimize the data structure based on what calendars actually need. Using era codes instead of numeric indices for Japanese seems like the right call.

@sffc
Copy link
Member

sffc commented Sep 27, 2024

While we're here, we should take a look at the encoding of VariableEras. That field will make YearNamesV1 bigger for everyone. A hack-that-isnt-really-so-bad would be to make everyone use a VarZeroVec<str> and store entries such as the following for Japanese:

  • reiwa:Reiwa
  • reiwa:令和

where you binary-search for the key before the : and then slice out the name following the :.

@Manishearth
Copy link
Member Author

Hmm, not entirely opposed, though that is super hacky.

@sffc
Copy link
Member

sffc commented Sep 27, 2024

It's not super hacky if you think of it as a stringly typed VarVarULE with ":" as the field separator.

@sffc
Copy link
Member

sffc commented Sep 27, 2024

Might need to choose a field separator that is always greater than or less than BCP-47 so that sorting works correctly.

@Manishearth Manishearth added this to the ICU4X 2.0 Beta ⟨P1⟩ milestone Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Discuss at a future ICU4X-SC meeting
Projects
None yet
Development

No branches or pull requests

2 participants