Transliteration

Transliterators are very powerful. ICU first used them to convert text from one script to another, e.g. from Latin script to Arabic script. But they can do much more, e.g. decomposing compound characters, removing some diacritics, and re-composing characters.

icu:transliterate

icu:transliterate(input as xs:string, transliterator-ID as xs:string) as xs:string

Transliterates the input string with an ICU transliterator.

Arguments:

  • input is your input string to be transliterated
  • transliterator-ID is the ID of a transliterator or a sequence of IDs separated by semicolons, e.g. NFD; [:nonspacing mark:] Remove; NFC for removing diacritics

See ICU documentation.

Bound to Transliterator.getInstance(...).

icu:transliterator-from-rules

icu:transliterator-from-rules(ID as xs:string, rules as xs:string, direction as xs:string) as xs:boolean

Creates a new custom transliterator from the given rules and registers it under the ID for subsequent calls via icu:transliterate.

Arguments:

  • ID is the identifier via which the transliterator can be called in subsequent calls.
  • rules are the custom rules the transliterator is created from.
  • direction is either the string 'FORWARD' or the string 'REVERSE'.

See ICU documentation.

Bound to Transliterator.createFromRules(...).

icu:transliterator-ids

icu:transliterator-ids() as xs:string*

Returns a sequence of IDs of the available transliterators.

Bound to Transliterator.getAvailableIDs().

Examples

Removal of Diacritics

Here's an XSLT stylesheet that outputs plain text and removes diacritics from all characters in the text nodes:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:icu="https://unicode-org.github.io/icu/"
    version="2.0">

   <xsl:output method="text"/>

   <xsl:param name="transliterator" as="xs:string" select="'NFD; [:nonspacing mark:] Remove; NFC'"/>

   <xsl:template match="text()">
      <xsl:value-of select="icu:transliterate(., $transliterator)"/>
   </xsl:template>

</xsl:stylesheet>

Applied to the following XML source file, it outputs Parlez vous francais?, with francais lacking the cedilla:

<t>Parlez vous français?</t>

The important part is the compound transliterator 'NFD; [:nonspacing mark:] Remove; NFC'. It consists of three transliterators.

The first one, NFD, normalizes the input by decomposing compound characters. As a result, the c with cedilla becomes two Unicode characters: the base letter c and a bare cedilla. The bare cedilla is a non-spacing character.

The second one, Remove, is applied only to non-spacing characters. So the result of this step is the string with the cedilla removed.

The third one, NFC, normalizes by (re-)composing base characters and any remaining non-spacing characters. Note that we could have restricted the removal to a subset of non-spacing characters.
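To make the three steps concrete, here is a minimal sketch in plain Java. It is not how this library works internally: it assumes that the JDK's java.text.Normalizer plus a regular expression over the Unicode category Mn (nonspacing mark) is a close enough stand-in for the ICU compound transliterator. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, JDK only; approximates 'NFD; [:nonspacing mark:] Remove; NFC'.
class DiacriticsSketch {

    // Removes every nonspacing mark (general category Mn).
    static String stripDiacritics(String input) {
        return stripMarks(input, "\\p{Mn}");
    }

    // Removes only the marks matched by markPattern (a java.util.regex pattern),
    // which mirrors restricting the removal to a subset of non-spacing characters.
    static String stripMarks(String input, String markPattern) {
        // Step 1 (NFD): decompose, e.g. U+00E7 becomes 'c' followed by U+0327
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2 (Remove): delete the selected combining marks
        String stripped = decomposed.replaceAll(markPattern, "");
        // Step 3 (NFC): recompose whatever marks are left with their base letters
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "Parlez vous fran\u00e7ais?" contains c with cedilla (U+00E7)
        System.out.println(stripDiacritics("Parlez vous fran\u00e7ais?"));
        // Strip only the combining cedilla (U+0327); the acute accents survive
        System.out.println(stripMarks("fran\u00e7ais \u00e9t\u00e9", "[\\u0327]"));
    }
}
```

The second call shows why the final NFC step matters: the accents that were not removed are put back together with their base letters, so the output does not contain stray combining characters.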

Common use cases for stripping diacritics include sorting in lexical order and matching strings against a list of strings in which diacritics are not used consistently. In both cases, stripping diacritics improves the result.

Such a matching case in an upcycling task was the reason why this library was first developed.

A Custom Transliterator

You can define your own transliterator, register it under an ID, and then use it in subsequent calls to transliterate strings. Here is a somewhat contrived example, perhaps useful for an author who wants to write a novel without the letter e. He is not as good as Georges Perec, because he sails around the e quite mechanically...

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:icu="https://unicode-org.github.io/icu/"
    version="2.0">

   <xsl:output method="text"/>

   <xsl:param name="transliterator" as="xs:string" select="'custom'"/>

   <xsl:param name="rules" as="xs:string" required="true"/>

   <xsl:param name="direction" as="xs:string" select="'FORWARD'"/>

   <xsl:template match="/">
      <xsl:variable name="custom-transliterator" select="icu:transliterator-from-rules('custom', unparsed-text($rules), $direction)"/>
      <xsl:apply-templates/>
   </xsl:template>

   <xsl:template match="text()">
      <xsl:value-of select="icu:transliterate(., $transliterator)"/>
   </xsl:template>

</xsl:stylesheet>

He uses an external file named void.txt with rules for the custom transliterator:

e > a;

He transliterates his novel:

<t>Cette été ...</t>

Now, running the following command

./xslt.sh -config:saxon-config.xml -xsl:doc/transliterator-from-external-rules.xsl -s:doc/novel.xml rules=void.txt transliterator="NFD;custom;NFC"

results in

Catta átá ...

What goes on here? Compound characters are first decomposed by NFD. Then e is replaced with a by the custom transliterator. Finally, NFC (re-)composes the base letters and their non-spacing accents.
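The same pipeline can be traced in plain Java. This is only a sketch: it assumes that java.text.Normalizer and String.replace behave like NFD, the one-rule transliterator e > a; and NFC for this input. It does not call ICU or this library, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, JDK only; approximates 'NFD;custom;NFC' with the rule "e > a;".
class VoidRuleSketch {

    static String avoidE(String input) {
        // NFD: U+00E9 (e with acute) becomes 'e' followed by U+0301
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // custom rule: every bare 'e' becomes 'a'; the combining accent stays in place
        String replaced = decomposed.replace('e', 'a');
        // NFC: 'a' followed by U+0301 is recomposed to U+00E1 (a with acute)
        return Normalizer.normalize(replaced, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String novel = "Cette \u00e9t\u00e9 ...";
        // With decomposition the accented letters are rewritten as well
        System.out.println(avoidE(novel));
        // Without NFD the precomposed U+00E9 is a different character and is left alone
        System.out.println(novel.replace('e', 'a'));
    }
}
```

The second line of output keeps its accented e, which is why the command above wraps the custom transliterator in NFD and NFC rather than calling it on its own.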

Note that a custom transliterator defined with icu:transliterator-from-rules(...) has to be created before any call to icu:transliterate(...) that makes use of it. Also be aware of lazy evaluation in XSLT: a call of the function in a global <xsl:variable> is not evaluated until the variable is needed for the output of the transformation. The transliterator is not stored in the variable; it is registered as a side effect in the Java transliterator registry. If $custom-transliterator had been defined as a global variable, the call of icu:transliterate(...) would have resulted in an exception like this:

Error at char 19 in expression in xsl:value-of/@select on line 23 column 69 of transliterator-from-external-rules.xsl:
   illegal transliterator ID for icu:transliterate(..., NFD;custom;NFC)

Listing Transliterators Available in ICU4J

transliterator-ids.html shows how to use icu:transliterator-ids() to get the available transliterators. There are several hundred! The list can be built with the following commands:

mvn package
./xslt.sh -config:saxon-config.xml -xsl:doc/transliterator-ids.html -s:pom.xml

Task

Write a transliterator for Arabic script to Latin script that gets vowels right.