-
-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(core): initial normalization 🙀 #9728
Conversation
User Test ResultsTest specification and instructions User tests are not required Test Artifacts
|
Steven's note: moving this into the issue.Markers:
Argument for following-char method is that it makes more sense for end-of-context:
Proposed Algorithm:
This algorithm can be applied to any string, including: keyboard source strings, input context from app (cached context), and output in various stages from the processor (i.e. before and after each transform step). What about repeated characters:
This hinges on:
Refs: Alternative solution: markers always move to end of a normalized sequence.
|
cb65861
to
e64f596
Compare
- change more tests to NFD for now For: #9468
- more progress in the normalization pipeline. Trying to keep it from leaking. - normalize output to NFC. - normalize json test data to NFC (both context and expected). - we do NOT try to normalize the 'embedded' strings currently. - yet more test data fixes (turns out u+00e0 ≠ u+00e8) For: #9468
426668b
to
a577bd2
Compare
- fix windows build For: #9468
- simplify normalize_nfc and normalize_nfd functions For: #9468
It actually does normalization for the first time. So it is part of the implementation. I moved the notes out because all of those notes haven't been implemented, so they'd be on the issue not the PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like where this is going, but hope we can simplify the use of the normalization functions a little.
core/src/ldml/ldml_processor.cpp
Outdated
std::u32string old_ctxtstr_nfc; | ||
(void)context_to_string(state, old_ctxtstr_nfc, false); | ||
ldml::normalize_nfc(old_ctxtstr_nfc, status); | ||
assert(U_SUCCESS(status)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to return
if !U_SUCCESS(status)
when in release builds, because old_ctxtstr_nfc
will be in an uncertain state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be in an unchanged state but yes.
core/src/ldml/ldml_processor.cpp
Outdated
// so that we don't have to reconvert it inside the transform code. | ||
std::u32string key_str32 = kmx::u16string_to_u32string(key_str); | ||
// normalize the keystroke to NFD | ||
ldml::normalize_nfd(key_str32, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ldml::normalize_nfd(key_str32, status); | |
ldml::normalize_nfd(key_str32, status); | |
assert(U_SUCCESS(status)); |
core/src/ldml/ldml_processor.cpp
Outdated
/** NFC and no markers */ | ||
std::u32string ctxtstr_cleanedup = ctxtstr; | ||
// TODO-LDML: remove markers! | ||
ldml::normalize_nfc(ctxtstr_cleanedup, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ldml::normalize_nfc(ctxtstr_cleanedup, status); | |
ldml::normalize_nfc(ctxtstr_cleanedup, status); | |
assert(U_SUCCESS(status)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a pattern emerging with these assertions which we should potentially cleanup, e.g.
if(!ldml::normalize_nfc_asserted(ctxtstr_cleanedup)) {
return;
}
and something like:
bool ldml::normalize_nfc_asserted(std::u32string &str) {
UErrorCode status = U_ZERO_ERROR;
ldml::normalize_nfc(str, status);
assert(U_SUCCESS(status));
return U_SUCCESS(status);
}
The function name I've come up with could be better!
The fact that two assertions are missing in your PR highlights the need for the refactor 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense- thanks
core/src/ldml/ldml_processor.cpp
Outdated
str.pop_back(); | ||
str.pop_back(); | ||
str.pop_back(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider asserting the values of the UC_SENTINEL and CODE_DEADKEY?
core/src/ldml/ldml_transforms.cpp
Outdated
if (U_SUCCESS(status)) { | ||
// if failure, leave it alone | ||
str = km::kbp::kmx::u16string_to_u32string(rstr); | ||
} | ||
return str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer the failure cases in each function to be tested and return str
just like the following functions do, for consistency.
Also, is there a good reason to return str as well as modify the input parameter? This seems likely to lead to unintentional wrong uses of the functions. Why not return bool success value instead (as suggested earlier in my review)?
here's what happens when
|
PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty close now. LGTM
core/src/ldml/ldml_transforms.hpp
Outdated
#define UASSERT_SUCCESS(status) \ | ||
if (U_FAILURE(status)) { \ | ||
DebugLog("U_FAILURE(%s)", u_errorName(status)); \ | ||
assert(U_SUCCESS(status)); \ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we do in failure cases in release builds, where the assert won't happen? We'll get different behaviour because it will fall through and continue. Do we need to return false or something in the function?
Also might be better to do this as an inline function because it can be optimized more cleanly and avoids accidental misuse with function calls as the status parameter etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to return false.
Note that the call site already kept the string unchanged if the assert failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also improved the transform loading code to handle assert failures as validation failures
In-line functions won’t get the assert line number
|
```
#define my_assert(x) my_assert_((x),__LINE__)
```
… In-line functions won’t get the assert line number
|
core/src/ldml/ldml_transforms.cpp
Outdated
@@ -456,18 +458,23 @@ transform_entry::transform_entry( | |||
} | |||
} | |||
|
|||
void | |||
bool | |||
transform_entry::init() { | |||
if (!fFrom.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally should be a precondition to reduce function depth. Sorry I missed this before.
if (!fFrom.empty()) { | |
if (fFrom.empty()) { | |
return false; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, found another … valid
shouldn't be overwritten without check. PTAL
- reduce function depth - catch another gotcha (valid was being overwritten) For: #9468
@@ -798,7 +815,15 @@ transforms::load( | |||
const std::u32string toStr = kmx::u16string_to_u32string(kplus.strs->get(element->to)); | |||
KMX_DWORD mapFrom = element->mapFrom; // copy, because of alignment | |||
KMX_DWORD mapTo = element->mapTo; // copy, because of alignment | |||
newGroup.emplace_back(fromStr, toStr, mapFrom, mapTo, kplus); // creating a transform_entry | |||
assert(!fromStr.empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we get an earlier assert
if (!valid) | ||
return; // exit early |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, if valid came in false we leave with it false.
Only change it from true->false on err, don't set it to true here.
if (!valid) | ||
return; // exit early |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to the status pattern from ICU. It means more work on the caller side but I guess I am okay with it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it makes sure the c'tor can return failure. Could be a factory model, but the caller is using emplace_back()
so the idea is a stack based construction anyway. on failure we're going to bail out of the entire load.
Changes in this pull request will be available for download in Keyman version 17.0.196-alpha |
For: #9468
@keymanapp-test-bot skip