This is a simple converter for files written by LaTeX when the inputenc
package is used with the utf8 option. All Unicode characters with a character
code greater than 127 are encoded as \IeC{LICR code}; this tool can translate
them to UTF-8.
A Lua library and two sample scripts are provided.
Conversion library.
process(string)
Conversion function; all \IeC codes are translated to UTF-8.
load_enc(table with encodings)
Conversion is based on font encodings, so you must specify the font encodings
used in the document. For each of these encodings, a file <encname>enc.dfu
must exist; these files provide the conversion tables.
texlua iec2utf.lua "used fontenc" < filename > newfile
You must specify the font encodings used in the document; the default is T1 T2A T2B T2C T3 T5 LGR,
which covers European languages with Latin and Cyrillic alphabets, and Vietnamese.
iec2utf "used fontenc" filename
Bash wrapper for iec2utf.lua, which uses the T1 T2A T2B T2C T3 T5 LGR
encodings by default and overwrites the original file with the result.
For example, this sample:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[]{makeidx}
\makeindex
\begin{document}
Hello
\index{Příliš}
\index{žluťoučký}
\index{kůň}
\index{úpěl}
\index{ďábelské}
\index{ódy}
\printindex
\end{document}
produces this raw index file:
\indexentry{P\IeC {\v r}\IeC {\'\i }li\IeC {\v s}}{1}
\indexentry{\IeC {\v z}lu\IeC {\v t}ou\IeC {\v c}k\IeC {\'y}}{1}
\indexentry{k\IeC {\r u}\IeC {\v n}}{1}
\indexentry{\IeC {\'u}p\IeC {\v e}l}{1}
\indexentry{\IeC {\v d}\IeC {\'a}belsk\IeC {\'e}}{1}
\indexentry{\IeC {\'o}dy}{1}
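To make the encoding concrete, the following shell sketch maps a handful of the \IeC codes above to UTF-8 by hand with sed. It is only an illustration and not how iec2utf works: the real tool takes its tables from the <encname>enc.dfu files, while the five substitutions and the file names here are made up for this example.

```shell
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)

# Hand-written substitutions for a few \IeC codes from the index above.
# In the sed script, \\ matches one literal backslash.
cat > "$workdir/iec-sample.sed" <<'EOF'
s/\\IeC {\\v r}/ř/g
s/\\IeC {\\'\\i }/í/g
s/\\IeC {\\v s}/š/g
s/\\IeC {\\r u}/ů/g
s/\\IeC {\\v n}/ň/g
EOF

# Two entries of the raw index; the quoted here-document keeps
# the backslashes exactly as LaTeX wrote them.
cat > "$workdir/sample.idx" <<'EOF'
\indexentry{P\IeC {\v r}\IeC {\'\i }li\IeC {\v s}}{1}
\indexentry{k\IeC {\r u}\IeC {\v n}}{1}
EOF

converted=$(sed -f "$workdir/iec-sample.sed" "$workdir/sample.idx")
printf '%s\n' "$converted"
```

For these two entries the output is \indexentry{Příliš}{1} and \indexentry{kůň}{1}, which is the form xindy needs.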
When you try to process this index file with xindy (for correct UTF-8
support in xindy, it is best to load the language module with
-M lang/langname/utf8-lang; in this case the language is czech):
texindy -M lang/czech/utf8-lang filename.idx
the result is incorrect:
\item \IeC {\'o}dy, 1
\item \IeC {\'u}p\IeC {\v e}l, 1
\item \IeC {\v d}\IeC {\'a}belsk\IeC {\'e}, 1
\item \IeC {\v z}lu\IeC {\v t}ou\IeC {\v c}k\IeC {\'y}, 1
\indexspace
\item k\IeC {\r u}\IeC {\v n}, 1
\indexspace
\item P\IeC {\v r}\IeC {\'\i }li\IeC {\v s}, 1
We must convert the .idx file to UTF-8 so that xindy can process it
correctly:
texlua iec2utf.lua < filename.idx > new.idx
mv new.idx filename.idx
texindy -M lang/czech/utf8-lang filename.idx
The result is now correct:
\lettergroup{D}
\item ďábelské, 1
\indexspace
\lettergroup{K}
\item kůň, 1
\indexspace
\lettergroup{O}
\item ódy, 1
\indexspace
\lettergroup{P}
\item Příliš, 1
\indexspace
\lettergroup{U}
\item úpěl, 1
\indexspace
\lettergroup{Ž}
\item žluťoučký, 1
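The conversion step shown earlier (convert to a new file, then mv it over the original) can be wrapped in a small shell function so that the .idx file is only replaced when the converter succeeds. This is a minimal sketch, not part of the package: convert_in_place and the IEC2UTF_CMD variable are names made up for this example, and it assumes the converter reports failure through a non-zero exit status.

```shell
# Converter command. The default matches the commands above and assumes
# iec2utf.lua is in the current directory. IEC2UTF_CMD is a hypothetical
# override used only by this sketch.
IEC2UTF_CMD=${IEC2UTF_CMD:-texlua iec2utf.lua}

# convert_in_place FILE
# Runs the converter on FILE and replaces FILE only if the converter succeeds.
convert_in_place() {
  file=$1
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  # $IEC2UTF_CMD is left unquoted on purpose so that
  # "texlua iec2utf.lua" is split into command and argument.
  if $IEC2UTF_CMD < "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    echo "conversion of $file failed, file left unchanged" >&2
    return 1
  fi
}
```

It could then be used as `convert_in_place filename.idx && texindy -M lang/czech/utf8-lang filename.idx`.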
To simplify the process described above, the script utftexindy.lua is provided.
It loads ieclib with the T1, T2A, T2B, T2C, T3, T5 and LGR font encodings and
passes all command line options to texindy, with the exception of the -L
option for language, which is transformed to -M lang/<langname>/utf8-lang.
texlua utftexindy.lua -L czech filename.idx
and
utftexindy -L czech filename.idx
These are equivalent to
texlua iec2utf.lua < filename.idx | texindy -i -M lang/czech/utf8-lang -o filename.ind
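The handling of the -L option described above can be sketched in plain shell. This is not the code of utftexindy.lua, only an illustration of the rewriting rule; translate_args is a hypothetical helper name, and it prints one resulting argument per line.

```shell
# translate_args ARGS...
# Prints the arguments one per line, replacing "-L <langname>"
# with "-M" and "lang/<langname>/utf8-lang". Everything else is
# passed through unchanged.
translate_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -L)
        # -L needs a language name after it.
        if [ "$#" -lt 2 ]; then
          echo "option -L requires a language name" >&2
          return 1
        fi
        shift
        printf '%s\n' "-M" "lang/$1/utf8-lang"
        ;;
      *)
        printf '%s\n' "$1"
        ;;
    esac
    shift
  done
}

translate_args -L czech filename.idx
```

For the call shown, the printed arguments are -M, lang/czech/utf8-lang and filename.idx, matching the -M option in the equivalent command above.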