Chapter 3. Unicode

From the Unicode HOWTO: “People in different countries use different characters to represent the words of their native languages. Nowadays most applications, including email systems and web browsers, are 8-bit clean, i.e. they can operate on and display text correctly provided that it is represented in an 8-bit character set, like ISO-8859-1.” [Bruno Haible: Linux Unicode Howto]

What is Unicode? From: [http://www.unicode.org/standard/WhatIsUnicode.html]


    “Unicode provides a unique number for every character,
      no matter what the platform,
      no matter what the program,
      no matter what the language.”

  

Essentially, Unicode assigns a unique number (a code point) to every character. That number, not a picture of the letter, is what is stored in your document each time you type a character on your keyboard. A font then renders the character on the screen so that you can read it. It's important to understand that Unicode is not a font: it is the standard mapping from characters to numbers that fonts rely on when they draw glyphs on your screen or on a printed page.
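The distinction between a character's number and its rendering can be seen from a short Python sketch. It uses only the standard unicodedata module; the helper name describe is an assumption made for this illustration and is not part of any dictionary tool.

```python
import unicodedata

def describe(text):
    """Return (code point, official Unicode name) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# "a", e-acute and the euro sign, written as escapes so that this file
# itself stays plain ASCII whatever editor is used.
for code_point, name in describe("a\u00e9\u20ac"):
    print(code_point, name)
# U+0061 LATIN SMALL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+20AC EURO SIGN
```

Nothing in this output depends on which font is installed: the code point and name identify the character, and the font only decides how it is drawn.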

Tip

All new dictionaries should be written using the Unicode character set. If you use a Unicode-capable text editor, all should be well.

UTF-8 is a portable and efficient way of encoding every Unicode character as a sequence of one to four bytes; plain ASCII text is left unchanged. Unicode covers the characters of most 8-bit and many 16-bit (2-byte) character sets, so your current character set is probably included. Once your file is actually saved as UTF-8, it may be as simple as putting <?xml version="1.0" encoding="UTF-8" standalone="no"?> as the XML declaration of your final document.
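As a minimal sketch of what moving to UTF-8 involves, the Python below re-encodes bytes that are assumed to be ISO-8859-1 and then shows how many bytes UTF-8 needs for a few characters. The function name latin1_to_utf8 is hypothetical, and the sketch is only correct if the input really is ISO-8859-1; the XML declaration describes the bytes in the file but does not convert them.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes known to be ISO-8859-1 as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")

# "cafe" with an acute accent, as stored in ISO-8859-1: the accented
# letter is the single byte 0xE9.
legacy = b"caf\xe9"
print(latin1_to_utf8(legacy))   # b'caf\xc3\xa9'

# UTF-8 is variable width: ASCII stays at one byte, other characters take more.
for ch in ("a", "\u00e9", "\u20ac"):
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")), "byte(s)")
# U+0061 1 byte(s)
# U+00E9 2 byte(s)
# U+20AC 3 byte(s)
```

Pure ASCII input passes through unchanged, which is why many existing English-only files are already valid UTF-8.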

You should ensure that your dictionary does not rely on any particular font and is equally functional when rendered as simple text. Remember, fonts are just "pretty renderings" of real characters. Most modern text editors (e.g. XEmacs, Emacs, Vim, gedit, KEdit, Notepad) should be fine.

This is not the place for a full explanation of Unicode. Please see Markus Kuhn's excellent summary at http://www.cl.cam.ac.uk/~mgk25/unicode.htm. The Linux Unicode HOWTO is well worth visiting: ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html

UTF-8 Quick notes

Tip

If you have automated some or all of your dictionary construction, please be careful to keep the character encoding consistent throughout the process. C coders should use the wide character type (wchar_t). Most scripting and programming languages now also support UTF-8 (Python, Perl, PHP, Java and Ruby at least). Shell scripts usually adopt the locale settings of the environment they run in. Please check that your gawk and sed handle multibyte characters cleanly; you may need very recent versions.
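One way to keep the encoding consistent in an automated step is to name it explicitly whenever a file is opened, rather than relying on the environment's default. The following Python sketch assumes a hypothetical input file with one headword per line; the function name sort_entries and the file names are illustrative only.

```python
import os
import tempfile

def sort_entries(src_path, dst_path):
    """Read one headword per line, sort, and write the result.

    UTF-8 is stated explicitly on both the read and the write side,
    so the result does not depend on the current locale.
    """
    with open(src_path, encoding="utf-8") as src:
        entries = [line.rstrip("\n") for line in src if line.strip()]
    entries.sort()  # plain code point order, not language-aware collation
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for entry in entries:
            dst.write(entry + "\n")
    return entries

# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "headwords.txt")
    dst = os.path.join(tmp, "sorted.txt")
    with open(src, "w", encoding="utf-8", newline="\n") as f:
        f.write("z\u00e9ro\nabc\n")
    sort_entries(src, dst)
    with open(dst, "rb") as f:
        print(f.read())   # b'abc\nz\xc3\xa9ro\n'
```

Reading the output back as raw bytes is a quick check that the accented letter was written as the two-byte UTF-8 sequence and was not silently converted to another character set along the way.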

More Information. If you are on a Linux (or similar) system, try man 7 unicode. You may also have some Unicode tools installed: try man 1 unicode. You may also need to set your LANG environment variable. Most Linux-type systems let you do this on a per-process basis, that is, you may run programs under a number of different languages and locales concurrently. Examples are given later.