I think one issue throughout the data is that different sources and data-entry passes have handled orthographies with an "apostrophe"-looking character inconsistently.
The data might have:
' (U+0027) Apostrophe
’ (U+2019) Right single quotation mark
ʼ (U+02BC) Modifier letter apostrophe
...and probably more (even combining marks).
While there might be proper canonical information for some orthographies, it seems to me that in most cases the choice of character is arbitrary, depending on the source and on how the data was entered, and should probably be canonicalized or disambiguated in some way. E.g. we might want to unify how such characters are entered in the data, and we might want to treat them as equivalent so that any of several "possible alternatives" satisfies a language check.
It is also unclear whether these should be treated as characters, whether they would make a good case for required punctuation, or what their role actually is.
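To make the two options above concrete, here is a minimal sketch, not a proposal for the actual implementation. The names `APOSTROPHE_LIKE`, `unify_apostrophes` and `satisfies` are hypothetical, the set only contains the three code points listed above, and the choice of U+02BC as the canonical form is an arbitrary assumption for illustration.

```python
# Hypothetical set of interchangeable "apostrophe-looking" characters.
# Only the three code points named in this issue; a real list would be longer.
APOSTROPHE_LIKE = {"\u0027", "\u2019", "\u02BC"}


def unify_apostrophes(text, canonical="\u02BC"):
    """Option 1: rewrite every apostrophe-like character to one canonical form.

    The default canonical character is an assumption, not a recommendation.
    """
    return "".join(canonical if ch in APOSTROPHE_LIKE else ch for ch in text)


def satisfies(required_chars, supported_chars):
    """Option 2: check support while treating the alternatives as equivalent.

    A required apostrophe-like character counts as supported if ANY
    apostrophe-like character is present in supported_chars.
    """
    supported = set(supported_chars)
    has_apostrophe = bool(supported & APOSTROPHE_LIKE)
    for ch in required_chars:
        if ch in APOSTROPHE_LIKE:
            if not has_apostrophe:
                return False
        elif ch not in supported:
            return False
    return True
```

With this sketch, `satisfies("a\u02BC", "a'")` passes even though the data uses U+02BC and the checked character set only has U+0027, while `satisfies("a\u02BC", "a")` fails because no apostrophe-like character is present at all.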
> It is further questionable if those should be treated as character, or if those would make a good case for required punctuation, or what indeed their role is.

This depends on the language. In Kildin Saami (single apostrophe) or Nenets (double apostrophe) they should be regarded as letters, because they represent sounds.
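For reference, Unicode itself already draws a letter/punctuation line between the three characters listed in the issue. The snippet below is only an illustration using Python's standard `unicodedata` module; the helper name `is_letter_like` is made up here, and the Unicode category says nothing about how a particular orthography actually uses the character.

```python
import unicodedata


def is_letter_like(ch):
    """Return True if Unicode assigns the character a letter category (L*)."""
    return unicodedata.category(ch).startswith("L")


# The three code points named in the issue.
for ch in ("\u0027", "\u2019", "\u02BC"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        "letter" if is_letter_like(ch) else "not a letter",
    )
```

U+0027 and U+2019 are classified as punctuation (`Po` and `Pf`), while U+02BC is a modifier letter (`Lm`), so data that uses the first two for a sound-bearing apostrophe will look like punctuation to any check based on Unicode categories.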
Ping @MrBrezina