Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): initial normalization 🙀 #9728

Merged
merged 23 commits into from
Oct 20, 2023

Conversation

srl295
Copy link
Member

@srl295 srl295 commented Oct 9, 2023

  • failing tests
  • fix the tests

For: #9468

@keymanapp-test-bot skip

@keymanapp-test-bot
Copy link

keymanapp-test-bot bot commented Oct 9, 2023

@mcdurdin
Copy link
Member

mcdurdin commented Oct 10, 2023

Steven's note: moving this into the issue.

Markers:

  1. 1 e\u{0300}\m{problem_marker}\u{0320} (NFD) reorders 1 e\u{0320}\u{0300} (without markers)
  2. We then re-inject markers:
    • 1 e\m{problem_marker}\u{0320}\u{0300} (tie marker to following char), or
    • 1 e\u{0320}\u{0300}\m{problem_marker} (tie marker to preceding char), or
    • 1 e\m{problem_marker}\u{0320}\u{0300} (insert marker as early as possible in norm. cluster)

Argument for following-char method is that it makes more sense for end-of-context:

  • 1 e\u{0300}\u{0320}\m{problem_marker} ==> 1 e\u{0320}\u{0300}\m{problem_marker} (tie marker to follow char, i.e. end-of-string)
  • To implement this, work backwards from end-of-context to first stable char

Proposed Algorithm:

  1. First, we need to transform compat chars and NFC chars to NFD: break strings at markers, normalize each string to NFD
  2. Concatenate the result, including markers, and re-break at stable characters
  3. Re-normalize each chunk, removing markers and remembering attachment to following char, to get any NFD reordering, then re-inject markers

This algorithm can be applied to any string, including: keyboard source strings, input context from app (cached context), and output in various stages from the processor (i.e. before and after each transform step).


What about repeated characters:

  • 1 e\u{0300}\u{0320}\m{problem_marker}\u{0300} --> 1 e\u{0320}\u{0300}\u{0300} (where does marker go?)

This hinges on:

  • Given two independent normalized-to-NFD strings, when concatenated, can we ever have codepoints change? (it is recognized that reorder is possible.) --> my expectation is that if this is the case, it will be one or two old sequences that we can special-case on, so we should be able to move forward.

Refs:


Alternative solution: markers always move to end of a normalized sequence.

<key to="[\u{0320}\u{0300}]\m{problem_marker}">

  • <transform from="e\u{0320}\m{problem_marker}\u{0300}" to="HELLO">
  • <transform from="e\u{0320}\u{0300}\m{problem_marker}" to="GOODBYE">

@srl295 srl295 changed the title chore(core): Some failing normalization transform tests 🙀 feat(core): proper normalization to-spec 🙀 Oct 10, 2023
@github-actions github-actions bot added the feat label Oct 10, 2023
- 0 steps forward, 4294967294 steps back

For: #9468
- change tests to NFD for now

For: #9468
- split out 'process_key_string'

For: #9468
@srl295 srl295 force-pushed the feat/core/9468-normalization-epic-ldml branch from cb65861 to e64f596 Compare October 12, 2023 15:10
- change more tests to NFD for now

For: #9468
- more progress in the normalization pipeline. Trying to keep it from leaking.
- normalize output to NFC.
- normalize json test data to NFC (both context and expected).
- we do NOT try to normalize the 'embedded' strings currently.
- yet more test data fixes (turns out u+00e0 ≠ u+00e8)

For: #9468
@srl295 srl295 force-pushed the feat/core/9468-normalization-epic-ldml branch from 426668b to a577bd2 Compare October 12, 2023 21:58
@srl295 srl295 marked this pull request as ready for review October 12, 2023 21:59
- fix unnecessary test churn.

For: #9468
@srl295
Copy link
Member Author

srl295 commented Oct 16, 2023

Is this PR getting the tests ready for #9468. I can see that Steven's note: moving this into the issue. has not been implemented yet.

It actually does normalization for the first time. So it is part of the implementation. I moved the notes out because all of those notes haven't been implemented, so they'd be on the issue not the PR

Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this is going, but hope we can simplify the use of the normalization functions a little.

std::u32string old_ctxtstr_nfc;
(void)context_to_string(state, old_ctxtstr_nfc, false);
ldml::normalize_nfc(old_ctxtstr_nfc, status);
assert(U_SUCCESS(status));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to return if !U_SUCCESS(status) when in release builds, because old_ctxtstr_nfc will be in an uncertain state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be in an unchanged state but yes.

core/src/ldml/ldml_processor.cpp Show resolved Hide resolved
// so that we don't have to reconvert it inside the transform code.
std::u32string key_str32 = kmx::u16string_to_u32string(key_str);
// normalize the keystroke to NFD
ldml::normalize_nfd(key_str32, status);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ldml::normalize_nfd(key_str32, status);
ldml::normalize_nfd(key_str32, status);
assert(U_SUCCESS(status));

/** NFC and no markers */
std::u32string ctxtstr_cleanedup = ctxtstr;
// TODO-LDML: remove markers!
ldml::normalize_nfc(ctxtstr_cleanedup, status);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ldml::normalize_nfc(ctxtstr_cleanedup, status);
ldml::normalize_nfc(ctxtstr_cleanedup, status);
assert(U_SUCCESS(status));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a pattern emerging with these assertions which we should potentially cleanup, e.g.

if(!ldml::normalize_nfc_asserted(ctxtstr_cleanedup)) {
  return;
}

and something like:

bool ldml::normalize_nfc_asserted(std::u32string &str) {
  UErrorCode status = U_ZERO_ERROR;  
  ldml::normalize_nfc(str, status);
  assert(U_SUCCESS(status));
  return U_SUCCESS(status);
}

The function name I've come up with could be better!

The fact that two assertions are missing in your PR highlights the need for the refactor 😁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense- thanks

Comment on lines 346 to 348
str.pop_back();
str.pop_back();
str.pop_back();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider asserting the values of the UC_SENTINEL and CODE_DEADKEY?

core/src/ldml/ldml_transforms.cpp Outdated Show resolved Hide resolved
Comment on lines 854 to 858
if (U_SUCCESS(status)) {
// if failure, leave it alone
str = km::kbp::kmx::u16string_to_u32string(rstr);
}
return str;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer the failure cases in each function to be tested and return str just like the following functions do, for consistency.

Also, is there a good reason to return str as well as modify the input parameter? This seems likely to lead to unintentional wrong uses of the functions. Why not return bool success value instead (as suggested earlier in my review)?

core/src/ldml/ldml_transforms.hpp Outdated Show resolved Hide resolved
srl295 and others added 2 commits October 16, 2023 20:22
For: #9468

Co-authored-by: Marc Durdin <marc@durdin.net>
- verify marker string as it's popped off
- add a missing pragma to debuglog.h
- move UErrorCode out of ldml_processor, change normalization functions to return true on success

For: #9468
@srl295
Copy link
Member Author

srl295 commented Oct 18, 2023

here's what happens when UASSERT_SUCCESS() detects failure ( U_CE_NOT_FOUND_ERROR was a bogus error I set for demonstration purposes)

context   :  []
found vkey 65:0
- key action: 0x41/modifier 0x0
579665555	../../../src/ldml/ldml_transforms.cpp:863	normalize_nfd	U_FAILURE(U_CE_NOT_FOUND_ERROR)
----------------------------------- stderr -----------------------------------
Assertion failed: (ldml::normalize_nfd(key_str32)), function process_key_string, file ldml_processor.cpp, line 270.

@srl295 srl295 requested a review from mcdurdin October 18, 2023 22:58
@srl295
Copy link
Member Author

srl295 commented Oct 18, 2023

PTAL

Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty close now. LGTM

Comment on lines 36 to 40
#define UASSERT_SUCCESS(status) \
if (U_FAILURE(status)) { \
DebugLog("U_FAILURE(%s)", u_errorName(status)); \
assert(U_SUCCESS(status)); \
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we do in failure cases in release builds, where the assert won't happen? We'll get different behaviour because it will fall through and continue. Do we need to return false or something in the function?

Also might be better to do this as an inline function because it can be optimized more cleanly and avoids accidental misuse with function calls as the status parameter etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to return false.

Note that the call site already kept the string unchanged if the assert failed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also improved the transform loading code to handle assert failures as validation failures

@srl295
Copy link
Member Author

srl295 commented Oct 19, 2023 via email

@mcdurdin
Copy link
Member

mcdurdin commented Oct 19, 2023 via email

@mcdurdin
Copy link
Member

#define my_assert(x) my_assert_((x),__LINE__)

image

pffft

- move meat of uassert_success() into an inline

For: #9468
- improve assert failure handling- fix a potential heap leak

For: #9468
@srl295 srl295 changed the title feat(core): proper normalization to-spec 🙀 feat(core): initial normalization 🙀 Oct 19, 2023
@@ -456,18 +458,23 @@ transform_entry::transform_entry(
}
}

void
bool
transform_entry::init() {
if (!fFrom.empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally should be a precondition to reduce function depth. Sorry I missed this before.

Suggested change
if (!fFrom.empty()) {
if (fFrom.empty()) {
return false;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, found another … valid shouldn't be overwritten without check. PTAL

- reduce function depth
- catch another gotcha (valid was being overwritten)

For: #9468
@@ -798,7 +815,15 @@ transforms::load(
const std::u32string toStr = kmx::u16string_to_u32string(kplus.strs->get(element->to));
KMX_DWORD mapFrom = element->mapFrom; // copy, because of alignment
KMX_DWORD mapTo = element->mapTo; // copy, because of alignment
newGroup.emplace_back(fromStr, toStr, mapFrom, mapTo, kplus); // creating a transform_entry
assert(!fromStr.empty());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we get an earlier assert

Comment on lines +419 to +420
if (!valid)
return; // exit early
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, if valid came in false we leave with it false.

Only change it from true->false on err, don't set it to true here.

Comment on lines +419 to +420
if (!valid)
return; // exit early
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to the status pattern from ICU. It means more work on the caller side but I guess I am okay with it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sure the c'tor can return failure. Could be a factory model, but the caller is using emplace_back() so the idea is a stack based construction anyway. on failure we're going to bail out of the entire load.

@srl295 srl295 merged commit 16d24bc into master Oct 20, 2023
17 checks passed
@srl295 srl295 deleted the feat/core/9468-normalization-epic-ldml branch October 20, 2023 03:57
@keyman-server
Copy link
Collaborator

Changes in this pull request will be available for download in Keyman version 17.0.196-alpha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants