XeLaTeX: Unicode font fallback for unsupported characters

cover
This blog post was published 10 years ago and may or may not have aged well. While reading please keep in mind that it may no longer be accurate or even relevant.

When typesetting with LaTeX, it is easy when you wring in a single language. But as soon as you want to typeset Unicode text in multiple languages, you’re quickly out of luck. LaTeX is just not made for Unicode, and you need a lot of helper packages, documentation reading, and complicated configuration in your document to get it all right.

All I wanted was to typeset the following example text (Unicode). It contains regular latin characters, chinese characters, modern greek and polytonic (ancient) greek.

Latin text. Chinese text: 紫薇北斗星  Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.

I thought that this was a simple task. I thought: let’s just use XeLaTeX, which has out-of-the-box Unicode support. In the end, it was a simple task, but only after struggling to solve a particular problem. To show you the problem, I ran the following straightforward code through XeLaTeX…

\\documentclass\[\]{book}

\\usepackage{fontspec}

\\begin{document}
Latin text. Chinese text: 紫薇北斗星 Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια
Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.
\\end{document}

… and the following PDF was produced:

XeLaTeX rendering Computer Modern font with unsupported unicode characters

It turns out that the missing unicode characters are not XeLaTeX’s fault. The problem is that the used font (XeLaTeX by default uses a slightly more encompassing Computer Modern font) has not all unicode characters implemented. To implement all unicode characters in a single font (about 1.1 million characters) is a monumental task, and there are only a small handful of fonts whose maintainers aim to have full support of all characters (one of them is GNU FreeFont, which is already part of the Debian distribution, and therefore available to XeLaTeX).

So, I thought, let’s just use a font which is dedicated to unicode. I selected in my document the pretty Junicode font:

\\setmainfont{Junicode}

The result was:

XeLaTex and Junicode font with chinese and greek characters

Now, Greek worked, but still no Chinese characters. It turned out that even fonts which are dedicated to Unicode do not yet have all possible characters implemented. Because it’s a lot of work to produce high-quality fonts with matching styles for millions of possible characters.

So, how do regular web browsers or office applications do it? They use a mechanism called font fallback. When a particular character is not implemented in the chosen main font, another font is silently used which does have this particular character implemented. XeLaTeX can do the same with a package called ucharclasses, and it gives you full control over the fallback font selection process. The ucharclasses documentation gives an example using the \fontspec  font selection. I decided to use the font IPAexMincho which supports Chinese characters. So I added to my document:

\\usepackage\[CJK\]{ucharclasses}
\\setTransitionsForCJK{\\fontspec{IPAexMincho}}{\\fontspec{Junicode}}

… but when running XeLaTeX with this addition, ucharclasses somehow entered an endless loop with high CPU usage for the TexLive 2014 distribution (part of Debian). It hung at the line:

(./ucharclass.aux) (/usr/share/texmf/tex/latex/tipa/t3cmr.fd)

Endless googling didn’t bring up any useful hints. Something must have changed in the internals, and the ucharclasses documentation needs updating. In any event, it took me 4 hours to find a fix. The solution was to use a font selection other than \fontspec{} — because it doesn’t seem to be compatible with ucharclasses any more. Instead, I used fontspec’s suggested \newfontfamily  mechanism. Here is the final working code:

\\documentclass\[\]{book}
\\usepackage{fontspec}
\\setmainfont{Junicode}
\\newfontfamily\\myregularfont{Junicode}
\\newfontfamily\\mychinesefont{IPAexMincho}

\\usepackage\[CJK\]{ucharclasses}
\\setTransitionsForCJK{\\mychinesefont}{\\myregularfont}

\\begin{document}
Latin text. Chinese text: 紫薇北斗星  Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.
\\end{document}

Here is the result: Mixed latin, chinese, and greek scripts with two different fonts: Junicode and IPAexMincho:

Pretty!

If you found a mistake in this blog post, or would like to suggest an improvement to this blog post, please me an e-mail to michael@franzl.name; as subject please use the prefix "Comment to blog post" and append the post title.
 
Copyright © 2023 Michael Franzl