feat(core): initial normalization 🙀 #9728

srl295 · 2023-10-09T23:46:49Z

failing tests
fix the tests

For: #9468

@keymanapp-test-bot skip

For: #9468

keymanapp-test-bot · 2023-10-09T23:46:53Z

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

mcdurdin · 2023-10-10T02:17:56Z

Steven's note: moving this into the issue.

Markers:

1 e\u{0300}\m{problem_marker}\u{0320} (NFD) reorders 1 e\u{0320}\u{0300} (without markers)
We then re-inject markers:
- 1 e\m{problem_marker}\u{0320}\u{0300} (tie marker to following char), or
- 1 e\u{0320}\u{0300}\m{problem_marker} (tie marker to preceding char), or
- 1 e\m{problem_marker}\u{0320}\u{0300} (insert marker as early as possible in norm. cluster)

Argument for following-char method is that it makes more sense for end-of-context:

1 e\u{0300}\u{0320}\m{problem_marker} ==> 1 e\u{0320}\u{0300}\m{problem_marker} (tie marker to follow char, i.e. end-of-string)
To implement this, work backwards from end-of-context to first stable char

Proposed Algorithm:

First, we need to transform compat chars and NFC chars to NFD: break strings at markers, normalize each string to NFD
Concatenate the result, including markers, and re-break at stable characters
Re-normalize each chunk, removing markers and remembering attachment to following char, to get any NFD reordering, then re-inject markers

This algorithm can be applied to any string, including: keyboard source strings, input context from app (cached context), and output in various stages from the processor (i.e. before and after each transform step).

What about repeated characters:

1 e\u{0300}\u{0320}\m{problem_marker}\u{0300} --> 1 e\u{0320}\u{0300}\u{0300} (where does marker go?)

This hinges on:

Given two independent normalized-to-NFD strings, when concatenated, can we ever have codepoints change? (it is recognized that reorder is possible.) --> my expectation is that if this is the case, it will be one or two old sequences that we can special-case on, so we should be able to move forward.

Refs:

https://unicode.org/reports/tr15/#Concatenation

Alternative solution: markers always move to end of a normalized sequence.

<key to="[\u{0320}\u{0300}]\m{problem_marker}">

<transform from="e\u{0320}\m{problem_marker}\u{0300}" to="HELLO">
<transform from="e\u{0320}\u{0300}\m{problem_marker}" to="GOODBYE">

- colorize ldml test For: #9468

- 0 steps forward, 4294967294 steps back For: #9468

- change tests to NFD for now For: #9468

- split out 'process_key_string' For: #9468

- change more tests to NFD for now For: #9468

- more progress in the normalization pipeline. Trying to keep it from leaking. - normalize output to NFC. - normalize json test data to NFC (both context and expected). - we do NOT try to normalize the 'embedded' strings currently. - yet more test data fixes (turns out u+00e0 ≠ u+00e8) For: #9468

- fix windows build For: #9468

core/subprojects/packagefiles/icu/meson.build

- fix unnecessary test churn. For: #9468

- fix int -> auto For: #9468

- simplify normalize_nfc and normalize_nfd functions For: #9468

srl295 · 2023-10-16T22:25:06Z

Is this PR getting the tests ready for #9468. I can see that Steven's note: moving this into the issue. has not been implemented yet.

It actually does normalization for the first time. So it is part of the implementation. I moved the notes out because all of those notes haven't been implemented, so they'd be on the issue not the PR

mcdurdin

I like where this is going, but hope we can simplify the use of the normalization functions a little.

mcdurdin · 2023-10-16T22:28:46Z

core/src/ldml/ldml_processor.cpp

+  std::u32string old_ctxtstr_nfc;
+  (void)context_to_string(state, old_ctxtstr_nfc, false);
+  ldml::normalize_nfc(old_ctxtstr_nfc, status);
+  assert(U_SUCCESS(status));


Do we need to return if !U_SUCCESS(status) when in release builds, because old_ctxtstr_nfc will be in an uncertain state?

It'll be in an unchanged state but yes.

core/src/ldml/ldml_processor.cpp

mcdurdin · 2023-10-16T22:30:31Z

core/src/ldml/ldml_processor.cpp

+  // so that we don't have to reconvert it inside the transform code.
+  std::u32string key_str32 = kmx::u16string_to_u32string(key_str);
+  // normalize the keystroke to NFD
+  ldml::normalize_nfd(key_str32, status);


Suggested change

ldml::normalize_nfd(key_str32, status);

ldml::normalize_nfd(key_str32, status);

assert(U_SUCCESS(status));

mcdurdin · 2023-10-16T22:31:39Z

core/src/ldml/ldml_processor.cpp

+  /** NFC and no markers */
+  std::u32string ctxtstr_cleanedup = ctxtstr;
+  // TODO-LDML: remove markers!
+  ldml::normalize_nfc(ctxtstr_cleanedup, status);


Suggested change

ldml::normalize_nfc(ctxtstr_cleanedup, status);

ldml::normalize_nfc(ctxtstr_cleanedup, status);

assert(U_SUCCESS(status));

I see a pattern emerging with these assertions which we should potentially cleanup, e.g.

if(!ldml::normalize_nfc_asserted(ctxtstr_cleanedup)) { return; }

and something like:

bool ldml::normalize_nfc_asserted(std::u32string &str) { UErrorCode status = U_ZERO_ERROR; ldml::normalize_nfc(str, status); assert(U_SUCCESS(status)); return U_SUCCESS(status); }

The function name I've come up with could be better!

The fact that two assertions are missing in your PR highlights the need for the refactor 😁

makes sense- thanks

mcdurdin · 2023-10-16T22:36:19Z

core/src/ldml/ldml_processor.cpp

+      str.pop_back();
+      str.pop_back();
+      str.pop_back();


We should consider asserting the values of the UC_SENTINEL and CODE_DEADKEY?

core/src/ldml/ldml_transforms.cpp

mcdurdin · 2023-10-16T22:46:08Z

core/src/ldml/ldml_transforms.cpp

+  if (U_SUCCESS(status)) {
+    // if failure, leave it alone
+    str = km::kbp::kmx::u16string_to_u32string(rstr);
+  }
+  return str;


I think I'd prefer the failure cases in each function to be tested and return str just like the following functions do, for consistency.

Also, is there a good reason to return str as well as modify the input parameter? This seems likely to lead to unintentional wrong uses of the functions. Why not return bool success value instead (as suggested earlier in my review)?

core/src/ldml/ldml_transforms.hpp

For: #9468 Co-authored-by: Marc Durdin <marc@durdin.net>

- verify marker string as it's popped off - add a missing pragma to debuglog.h - move UErrorCode out of ldml_processor, change normalization functions to return true on success For: #9468

srl295 · 2023-10-18T22:49:13Z

here's what happens when UASSERT_SUCCESS() detects failure ( U_CE_NOT_FOUND_ERROR was a bogus error I set for demonstration purposes)

context   :  []
found vkey 65:0
- key action: 0x41/modifier 0x0
579665555	../../../src/ldml/ldml_transforms.cpp:863	normalize_nfd	U_FAILURE(U_CE_NOT_FOUND_ERROR)
----------------------------------- stderr -----------------------------------
Assertion failed: (ldml::normalize_nfd(key_str32)), function process_key_string, file ldml_processor.cpp, line 270.

- refactor internal normalization call For: #9468

- consistency For: #9468

srl295 · 2023-10-18T22:58:41Z

PTAL

mcdurdin

This is pretty close now. LGTM

mcdurdin · 2023-10-19T07:51:48Z

core/src/ldml/ldml_transforms.hpp

+#define UASSERT_SUCCESS(status) \
+  if (U_FAILURE(status)) { \
+    DebugLog("U_FAILURE(%s)", u_errorName(status)); \
+    assert(U_SUCCESS(status)); \
+  }


What do we do in failure cases in release builds, where the assert won't happen? We'll get different behaviour because it will fall through and continue. Do we need to return false or something in the function?

Also might be better to do this as an inline function because it can be optimized more cleanly and avoids accidental misuse with function calls as the status parameter etc.

changed to return false.

Note that the call site already kept the string unchanged if the assert failed.

I also improved the transform loading code to handle assert failures as validation failures

srl295 · 2023-10-19T13:27:03Z

In-line functions won’t get the assert line number

mcdurdin · 2023-10-19T14:09:29Z

``` #define my_assert(x) my_assert_((x),__LINE__) ```

…

In-line functions won’t get the assert line number

mcdurdin · 2023-10-19T14:10:44Z

#define my_assert(x) my_assert_((x),__LINE__)

pffft

- move meat of uassert_success() into an inline For: #9468

- improve assert failure handling- fix a potential heap leak For: #9468

mcdurdin · 2023-10-19T21:53:08Z

core/src/ldml/ldml_transforms.cpp

@@ -456,18 +458,23 @@ transform_entry::transform_entry(
  }
 }

-void
+bool
 transform_entry::init() {
  if (!fFrom.empty()) {


Ideally should be a precondition to reduce function depth. Sorry I missed this before.

Suggested change

if (!fFrom.empty()) {

if (fFrom.empty()) {

return false;

}

nah, found another … valid shouldn't be overwritten without check. PTAL

- reduce function depth - catch another gotcha (valid was being overwritten) For: #9468

srl295 · 2023-10-19T22:13:21Z

core/src/ldml/ldml_transforms.cpp

@@ -798,7 +815,15 @@ transforms::load(
        const std::u32string toStr                      = kmx::u16string_to_u32string(kplus.strs->get(element->to));
        KMX_DWORD mapFrom                               = element->mapFrom; // copy, because of alignment
        KMX_DWORD mapTo                                 = element->mapTo;   // copy, because of alignment
-        newGroup.emplace_back(fromStr, toStr, mapFrom, mapTo, kplus);  // creating a transform_entry
+        assert(!fromStr.empty());


so we get an earlier assert

srl295 · 2023-10-19T22:14:18Z

core/src/ldml/ldml_transforms.cpp

+  if (!valid)
+    return; // exit early


so, if valid came in false we leave with it false.

Only change it from true->false on err, don't set it to true here.

mcdurdin · 2023-10-19T22:14:26Z

core/src/ldml/ldml_transforms.cpp

+  if (!valid)
+    return; // exit early


This is similar to the status pattern from ICU. It means more work on the caller side but I guess I am okay with it!

it makes sure the c'tor can return failure. Could be a factory model, but the caller is using emplace_back() so the idea is a stack based construction anyway. on failure we're going to bail out of the entire load.

keyman-server · 2023-10-20T18:10:14Z

Changes in this pull request will be available for download in Keyman version 17.0.196-alpha

chore(core): Some failing normalization transform tests 🙀

dcb6796

For: #9468

srl295 added work-in-progress requires-design-work DO-NOT-MERGE labels Oct 9, 2023

srl295 added this to the A17S23 milestone Oct 9, 2023

srl295 self-assigned this Oct 9, 2023

github-actions bot added core/ Keyman Core chore epic-ldml labels Oct 9, 2023

srl295 changed the title ~~chore(core): Some failing normalization transform tests 🙀~~ feat(core): proper normalization to-spec 🙀 Oct 10, 2023

Merge branch 'master' into feat/core/9468-normalization-epic-ldml

8c4642a

github-actions bot added the feat label Oct 10, 2023

srl295 added 4 commits October 10, 2023 17:51

chore(core): dx: improve test output 🙀

15e6fb6

- colorize ldml test For: #9468

feat(core): ldml normalization 🙀

daa0523

- 0 steps forward, 4294967294 steps back For: #9468

feat(core): ldml normalization 🙀

522d2ef

- change tests to NFD for now For: #9468

feat(core): ldml normalization 🙀

e64f596

- split out 'process_key_string' For: #9468

srl295 force-pushed the feat/core/9468-normalization-epic-ldml branch from cb65861 to e64f596 Compare October 12, 2023 15:10

srl295 added 2 commits October 12, 2023 15:12

feat(core): ldml normalization 🙀

736af4b

- change more tests to NFD for now For: #9468

srl295 force-pushed the feat/core/9468-normalization-epic-ldml branch from 426668b to a577bd2 Compare October 12, 2023 21:58

srl295 marked this pull request as ready for review October 12, 2023 21:59

srl295 requested review from mcdurdin and rc-swag as code owners October 12, 2023 21:59

srl295 added 2 commits October 12, 2023 16:59

Merge branch 'master' into feat/core/9468-normalization-epic-ldml

84b4836

feat(core): ldml normalization 🙀

b3de1ff

- fix windows build For: #9468

srl295 commented Oct 12, 2023

View reviewed changes

core/subprojects/packagefiles/icu/meson.build Show resolved Hide resolved

srl295 added 2 commits October 12, 2023 17:46

feat(core): ldml normalization 🙀

95ad1c4

- fix unnecessary test churn. For: #9468

feat(core): ldml normalization 🙀

bf15feb

- fix int -> auto For: #9468

srl295 added 2 commits October 16, 2023 17:14

Merge branch 'master' into feat/core/9468-normalization-epic-ldml

79f37f7

fix(core): ldml update to normalization per review 🙀

9ee183c

- simplify normalize_nfc and normalize_nfd functions For: #9468

mcdurdin requested changes Oct 16, 2023

View reviewed changes

srl295 and others added 2 commits October 16, 2023 20:22

fix(core): ldml update to normalization per review 🙀

93e1c18

For: #9468 Co-authored-by: Marc Durdin <marc@durdin.net>

fix(core): ldml more updates to normalization per review 🙀

9dbf70c

- verify marker string as it's popped off - add a missing pragma to debuglog.h - move UErrorCode out of ldml_processor, change normalization functions to return true on success For: #9468

srl295 added 3 commits October 18, 2023 17:55

fix(core): ldml more updates to normalization per review 🙀

f1b12a1

- refactor internal normalization call For: #9468

Merge branch 'master' into feat/core/9468-normalization-epic-ldml

c4c7361

fix(core): ldml more updates to normalization per review 🙀

3723e6b

- consistency For: #9468

srl295 requested a review from mcdurdin October 18, 2023 22:58

mcdurdin approved these changes Oct 19, 2023

View reviewed changes

srl295 added 2 commits October 19, 2023 13:16

fix(core): ldml more updates to normalization per review 🙀

4afe038

- move meat of uassert_success() into an inline For: #9468

fix(core): ldml more updates to normalization per review 🙀

1d9820b

- improve assert failure handling- fix a potential heap leak For: #9468

srl295 changed the title ~~feat(core): proper normalization to-spec 🙀~~ feat(core): initial normalization 🙀 Oct 19, 2023

Merge branch 'master' into feat/core/9468-normalization-epic-ldml

9292410

mcdurdin approved these changes Oct 19, 2023

View reviewed changes

fix(core): ldml more updates to normalization per review 🙀

4cdc2c3

- reduce function depth - catch another gotcha (valid was being overwritten) For: #9468

srl295 commented Oct 19, 2023

View reviewed changes

mcdurdin reviewed Oct 19, 2023

View reviewed changes

mcdurdin approved these changes Oct 19, 2023

View reviewed changes

srl295 merged commit 16d24bc into master Oct 20, 2023
17 checks passed

srl295 deleted the feat/core/9468-normalization-epic-ldml branch October 20, 2023 03:57

mcdurdin added the epic-core/9999-normalization label Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(core): initial normalization 🙀 #9728

feat(core): initial normalization 🙀 #9728

srl295 commented Oct 9, 2023 •

edited

Loading

keymanapp-test-bot bot commented Oct 9, 2023 •

edited

Loading

mcdurdin commented Oct 10, 2023 •

edited by srl295

Loading

srl295 commented Oct 16, 2023

mcdurdin left a comment

mcdurdin Oct 16, 2023

srl295 Oct 16, 2023

mcdurdin Oct 16, 2023

mcdurdin Oct 16, 2023

mcdurdin Oct 16, 2023

srl295 Oct 17, 2023

mcdurdin Oct 16, 2023

mcdurdin Oct 16, 2023

srl295 commented Oct 18, 2023

srl295 commented Oct 18, 2023

mcdurdin left a comment

mcdurdin Oct 19, 2023

srl295 Oct 19, 2023

srl295 Oct 19, 2023

srl295 commented Oct 19, 2023 via email •

edited

Loading

mcdurdin commented Oct 19, 2023 via email •

edited by srl295

Loading

mcdurdin commented Oct 19, 2023

mcdurdin Oct 19, 2023

srl295 Oct 19, 2023

srl295 Oct 19, 2023

srl295 Oct 19, 2023

mcdurdin Oct 19, 2023

srl295 Oct 19, 2023

keyman-server commented Oct 20, 2023

	ldml::normalize_nfd(key_str32, status);
	ldml::normalize_nfd(key_str32, status);
	assert(U_SUCCESS(status));

	ldml::normalize_nfc(ctxtstr_cleanedup, status);
	ldml::normalize_nfc(ctxtstr_cleanedup, status);
	assert(U_SUCCESS(status));

-  if (!fFrom.empty()) {
+  if (fFrom.empty()) {
+    return false;
+  }

feat(core): initial normalization 🙀 #9728

feat(core): initial normalization 🙀 #9728

Conversation

srl295 commented Oct 9, 2023 • edited Loading

keymanapp-test-bot bot commented Oct 9, 2023 • edited Loading

User Test Results

Test Artifacts

mcdurdin commented Oct 10, 2023 • edited by srl295 Loading

Steven's note: moving this into the issue.

srl295 commented Oct 16, 2023

mcdurdin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

srl295 commented Oct 18, 2023

srl295 commented Oct 18, 2023

mcdurdin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

srl295 commented Oct 19, 2023 via email • edited Loading

mcdurdin commented Oct 19, 2023 via email • edited by srl295 Loading

mcdurdin commented Oct 19, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

keyman-server commented Oct 20, 2023

srl295 commented Oct 9, 2023 •

edited

Loading

keymanapp-test-bot bot commented Oct 9, 2023 •

edited

Loading

mcdurdin commented Oct 10, 2023 •

edited by srl295

Loading

srl295 commented Oct 19, 2023 via email •

edited

Loading

mcdurdin commented Oct 19, 2023 via email •

edited by srl295

Loading