Chapter 14. Supported and Unsupported Platforms

Table of Contents

Dictd Database Format
Using c5 style
Using xmltei2xmldict.pl
Bedic
Testing in an emulated environment
Evolutionary Dictionary
VOK file format

This chapter explains how conversion from the TEI format into other formats works for supported platforms, or would work for unsupported platforms.

Note

This chapter is incomplete. Actually, the whole HOWTO should be restructured to reflect information flow better.

Dictd Database Format

Two approaches are viable for this task. Both support multiple headwords for a single entry, i.e. multiple orth elements inside a single entry. This is useful for giving alternate spellings. Another feature is the generation of an inverse index, which reverses the direction of translation of the dictionary: an English-Arabic dictionary becomes queryable as Arabic-English.

Using c5 style

This process uses an XSLT stylesheet and dictfmt -c5. The stylesheet has to be applied to the TEI file first. Any XSLT processor should do the job. With Sablotron, the command would be:

sabcmd xsl/tei2c5.xsl la1-la2.tei >la1-la2.c5

With xsltproc from libxslt, the command would be:

xsltproc xsl/tei2c5.xsl la1-la2.tei >la1-la2.c5

Next, dictfmt has to be called to create the final dictd database files:

dictfmt -t --headword-separator %%% la1-la2 <la1-la2.c5
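
For orientation, the intermediate file fed to dictfmt can be pictured roughly as follows. This is a minimal Python sketch, not the output of tei2c5.xsl. It assumes the c5 layout as described in the dictfmt manual page (a line of at least five underscores, a blank line, the headword line, then the definition) and joins alternate spellings with the %%% separator used above. The function name and the sample entries are hypothetical.

```python
def to_c5(entries, sep="%%%"):
    """Render (headwords, definition) pairs as c5-style text (assumed layout)."""
    chunks = []
    for headwords, definition in entries:
        # Separator line, blank line, headword line, definition.
        chunks.append("_____\n\n" + sep.join(headwords) + "\n" + definition + "\n")
    return "".join(chunks)


if __name__ == "__main__":
    sample = [
        (["colour", "color"], "Farbe"),   # two orth elements in one entry
        (["house"], "Haus"),
    ]
    print(to_c5(sample), end="")
```

With the separator option given, dictfmt can index the entry under each of the joined headwords, which is how multiple orth elements in one entry become separately searchable.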

Using xmltei2xmldict.pl

The second approach uses xmltei2xmldict.pl, a conversion utility developed by the FreeDict project itself. It converts the TEI file into the dictd database format without producing intermediate files. Its disadvantage is that a successful conversion depends on the FreeDict project keeping up with the dictd database format, which is still subject to change. FreeDict cannot be expected to understand that format as well as dictfmt does, since dictfmt comes from the dictd project itself. For the software dependencies, please read the README.

An example command line would be:

xmltei2xmldict.pl -f la1-la2.tei -t xsl/tei2txt.xsl

The c5 style is faster but needs more memory, since it loads the entire TEI file before applying the XSLT stylesheet. The xmltei2xmldict.pl style instead uses the SAX API and converts the dictionary by applying an XSLT stylesheet entry by entry. For example, converting the file deu-eng.tei, which contains about 81000 entries, on my 550 MHz / 128 MB machine took the following times:

Table 14.1. Timing comparison of c5 and xmltei2xmldict.pl conversion styles

        xmltei2xmldict.pl   c5 style
real    38m8.771s           27m14.123s
user    24m54.358s          1m53.763s
sys     0m37.752s           0m38.618s

Note that my machine does not have enough memory for the c5 style, so it spent about 25 of the 27 minutes swapping.
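
The memory difference between the two styles comes down to streaming versus loading the whole document. The following Python sketch illustrates the entry-by-entry idea with the standard xml.sax module. It is not how xmltei2xmldict.pl is implemented (that is a Perl script which applies an XSLT stylesheet per entry); the handler class and the tiny TEI fragment are made up for illustration.

```python
import xml.sax


class EntryHandler(xml.sax.ContentHandler):
    """Collect the <orth> headwords of one <entry> at a time."""

    def __init__(self, on_entry):
        super().__init__()
        self.on_entry = on_entry   # callback invoked once per finished entry
        self.headwords = []
        self.in_orth = False
        self.text = []

    def startElement(self, name, attrs):
        if name == "entry":
            self.headwords = []
        elif name == "orth":
            self.in_orth = True
            self.text = []

    def characters(self, content):
        if self.in_orth:
            self.text.append(content)

    def endElement(self, name):
        if name == "orth":
            self.headwords.append("".join(self.text))
            self.in_orth = False
        elif name == "entry":
            # Only the data of this one entry is held at this point.
            self.on_entry(self.headwords)


SAMPLE = b"""<TEI><text><body>
<entry><form><orth>colour</orth><orth>color</orth></form></entry>
<entry><form><orth>house</orth></form></entry>
</body></text></TEI>"""

if __name__ == "__main__":
    xml.sax.parseString(SAMPLE, EntryHandler(print))
```

Because the handler forgets each entry once the callback has run, memory use stays roughly constant regardless of the number of entries, at the cost of more per-entry overhead.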

Bedic

The bedic project maintains the library libbedic and the application zbedic, which are designed to run on the Zaurus PDA from Sharp.

The following features of the bedic format 0.9.6 are supported:

  • separate senses (subsenses in bedic terminology)

  • pronunciation

  • POS information

  • examples

  • cross references

  • usage domain indication

These features are only visible if the corresponding information is present in the respective TEI dictionary. At present, the only dictionary using all supported features is Khasi-English.

To convert a dictionary into the bedic format, run the command make release-bedic in the directory of the respective dictionary module. The resulting file should appear in $FREEDICTDIR/release/dic.

The conversion process follows these steps:

  1. The entries are sorted so that entries whose headwords are homographs end up next to each other. These homographs are then merged into a single entry. Two separate XSLT stylesheets, sort.xsl and group-homographs-sorted.xsl, are used for these steps, because XSLT 1.0 offers no access to the current node list, e.g. through an additional axis.

    Because of this grouping, the number shown by zbedic under "Items in dict" is usually less than the number of entries in the original TEI file.

  2. The stylesheet tei2dic.xsl does the main conversion step from TEI into bedic format.

    The "char-precedence" property of the bedic format is currently supported for most languages only by supplying a modified version of the Wikipedia-char-precedence that comes with libbedic, because it includes many accented characters. If the order of the headwords shown by zbedic is incorrect for your language, a language-specific char-precedence could easily be added to the stylesheet.

    The cross references generated in bedic format are untyped, i.e. the label in front of a cross reference just says "See also:" instead of "Synonyms:", "Antonyms:" etc. Such typing could easily be added once a typology has been defined, ideally one used in all FreeDict dictionaries. Typologies are needed in other places as well, e.g. for the content of part-of-speech tags. We might have expected the TEI to provide them, but since it does not, developing typologies may be something FreeDict must add to its agenda.

  3. The resulting file is normalized according to Unicode Normalization Form NFC (Canonical Decomposition followed by Canonical Composition). The tool used for this is Charlint (A Character Normalization Tool). See Makefile.common for details on how to install it so that the conversion process can find it.

  4. The following Perl filter replaces certain escape sequences by NUL and ESC characters. This could not be done by the stylesheets, because XML does not allow those characters to be represented:

    perl -pi -e 's/\\0/\x00/gm; s/\\e/\e/gm;' <in >out

  5. The xerox tool (which comes with libbedic) and dictzip are run. The resulting .dic.dz file is also given the version number.

  6. Optionally, if the command make release-zaurus was given, an ipk package is generated. This is not done by default, since each dictionary consists of only one file.
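
Two of the steps above, the homograph grouping of step 1 and the escape replacement of step 4, can be sketched in a few lines of Python. This is an illustration only, not the project's code: the real conversion uses sort.xsl, group-homographs-sorted.xsl and the Perl filter shown in step 4. The function names and the sample data are hypothetical, and entries are simplified to (headword, sense) pairs.

```python
from itertools import groupby


def group_homographs(entries):
    """Step 1: sort by headword, then merge entries sharing a headword."""
    ordered = sorted(entries, key=lambda e: e[0])   # stable sort
    return [
        (headword, [sense for _, sense in group])
        for headword, group in groupby(ordered, key=lambda e: e[0])
    ]


def expand_escapes(text):
    """Step 4: turn the literal sequences \\0 and \\e into NUL and ESC."""
    return text.replace("\\0", "\x00").replace("\\e", "\x1b")


if __name__ == "__main__":
    entries = [("bank", "river side"), ("tree", "plant"), ("bank", "money house")]
    print(group_homographs(entries))
    print(repr(expand_escapes("id\\0bank\\e")))
```

Three input entries become two grouped entries here, which is the same effect that makes the "Items in dict" count in zbedic smaller than the number of entries in the TEI file.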

Testing in an emulated environment

This is an important step, especially for a developer who does not have a Zaurus device available. Qt/Embedded (qte), Qtopia (qpe) and zbedic can be compiled for the x86 architecture and can run either under the Linux console framebuffer or in the Qt/Embedded Virtual X11 Framebuffer (qvfb).

Free versions of qte, qpe and qvfb are available from Trolltech.

You have to compile qte, qpe, libbedic and zbedic in that order. The build instructions of qpe also mention building Qt2 (for qvfb) and Qt3 (for other tools). For testing zbedic, however, no tools from Qt3 are required, and on my SuSE 9.0 system the qt3-extensions package already included a working qvfb.

When compiling qte, make sure you follow the build instructions given in qpe (not qte), which will make sure that the right customization header file qconfig-qpe.h is used for qte compilation. I will abstain from further qte/qpe building advice, since the instructions coming with qpe/qte are already complicated enough.

To compile zbedic for the x86 architecture, you have to issue a command like ARCH=x86 make in the libbedic and zbedic source directories. Most likely you will also have to edit the Makefiles to point to your qpe source directory.

Since the qpe package manager uses absolute paths, it cannot run from the qpe image directory. You therefore cannot install zbedic from the .ipk file and have to copy its files into the qpe image directory manually. You also have to register the .dic.dz file extension and MIME type using a text editor.

To start qpe or zbedic (and possibly qvfb) after a successful build, you have to set up lots of environment variables correctly, which is explained in the qpe documentation as well. You can give the -qws switch to zbedic to run it without qpe.

The most important information to give here, since it is not mentioned anywhere in the qpe documentation, is the mapping of directories when you run qpe from an image directory. The directories are ~/Applications, ~/Documents and ~/Settings, and their names are largely self-explanatory. The .dic.dz files have to be put into ~/Documents for zbedic to see them.

Combining marks are not shown properly on the Zaurus or in the default qpe build. Most likely this is a limitation of the Qt/Embedded font format .qpf. Although Qt/Embedded can support FreeType, and through it TrueType fonts, this support is not available on the Zaurus. To avoid combining marks as far as possible, the text is normalized to Unicode Normalization Form C, where precomposed characters are preferred over combining sequences. Some combining characters still remain and are not shown properly, namely sequences involving the combining tilde in fra-eng pronunciations.
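
The effect of the normalization can be reproduced with Python's standard unicodedata module. The sketch below is illustrative and unrelated to Charlint; the helper name is hypothetical. It compares e followed by a combining acute accent, for which a precomposed character exists, with an open e (U+025B) followed by a combining tilde, as found in French pronunciations, for which none exists.

```python
import unicodedata


def leftover_combining_marks(text):
    """Return the combining characters that survive NFC normalization."""
    composed = unicodedata.normalize("NFC", text)
    # combining() returns the canonical combining class, 0 for base characters.
    return [ch for ch in composed if unicodedata.combining(ch)]


if __name__ == "__main__":
    print(leftover_combining_marks("e\u0301"))       # e + combining acute
    print(leftover_combining_marks("\u025b\u0303"))  # open e + combining tilde
```

The first call yields an empty list, because the sequence is composed into a single character. The second still contains U+0303, which is the kind of leftover that the .qpf fonts fail to render.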

Evolutionary Dictionary

The FreeDict dictionaries are available for the shareware program Evolutionary Dictionary from http://www.evolutionary.net/dict-info.htm.

The dictionary files for this platform must be created with the Win32 GUI-only application Dictionary.exe, available from http://www.evolutionary.net/dict-info1.htm. It currently does not run under the Windows emulator Wine.

According to its documentation Dictionary.exe can import:

  • VOK files, a format specifically made for Dictionary.exe (see next section)

  • Ergane format, used by the free PC translation program Ergane. The documentation says to make sure you export from Ergane in type 2 format (either 2a or 2b). The program and free modules can be obtained from the web site http://www.travlang.com/Ergane/. However, the downloadable Ergane _cannot_ export anything!

  • "Mr Honey - English-German" (no format description available)

  • WordNet (again, no format description available here)

VOK file format

This is a plain text file format in Windows-1252 encoding (this last fact is not documented in the Online Help of Dictionary.exe!). It looks like this:

[words]
word/translated-word
word2a;word2b/translated-word2a;translated-word2b
    .
    .

[phrases]
>category-name
phrase/translated-phrase
    .
    .
>category-name
phrase/translated-phrase
    .
    .
[notes]
<any-text>

There should be no spaces around the '/' character. In the [words] and [phrases] sections, ';' characters may appear on either side of the '/' to separate multiple translations of a word, e.g.:

ability/Faehigkeit;Begabung
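
A [words] section in this format could be generated along the following lines. This is a minimal Python sketch, not part of FreeDict or Dictionary.exe. The function name and the sample data are hypothetical, and the sketch assumes plain newline line endings, which the format description does not specify.

```python
def format_words_section(entries):
    """Build the [words] section from (source_words, translations) pairs."""
    lines = ["[words]"]
    for sources, targets in entries:
        # ';' separates alternatives, '/' separates the two languages,
        # with no spaces around the '/'.
        lines.append(";".join(sources) + "/" + ";".join(targets))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = format_words_section([(["ability"], ["Fähigkeit", "Begabung"])])
    # The file is expected in Windows-1252; encode() raises an error for
    # characters that this code page cannot represent.
    data = text.encode("cp1252")
    print(data)
```

Encoding explicitly to cp1252 makes unrepresentable characters fail at conversion time instead of showing up as garbage in Dictionary.exe.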

The dictionaries built with this program have certain limits:

Max Word Size - 128 characters
Max Words per entry - 16
Max Entry Size - 1024

Max Words - 1,048,576 - Not entries
Max Word Data - 16Mb characters - Uncompressed (Compressed approx twice that amount)

Max Phrase Categories - 64
Data per Category - 64Kb characters - Uncompressed (Compressed approx twice that amount)

Notes - 64Kb characters - Uncompressed

There is no limit to the number of entries, as such. There is a limit to the number of total words in all entries (i.e. between each semicolon) and the space used for all entries.

Supplemental Entries - 64Kb - Words added or edited on a handheld are stored in a supplemental record. When the dictionary is saved by this program the words are re-indexed and the supplemental record is cleared.
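
Before importing a generated VOK file, the limits on the [words] section could be checked with a small script such as the following Python sketch. It rests on an interpretation of the limits above that is not confirmed by the Dictionary.exe documentation: every item between ';' or '/' is counted as one word, and the entry size is taken to be the length of the whole line in characters. All names are hypothetical.

```python
MAX_WORD_SIZE = 128          # characters per word
MAX_WORDS_PER_ENTRY = 16     # words in one entry (assumed: both sides of '/')
MAX_ENTRY_SIZE = 1024        # characters per entry line (assumed)
MAX_WORDS = 1_048_576        # words over all entries


def check_limits(lines):
    """Return a list of (line_number, problem) for [words] entry lines."""
    problems = []
    total_words = 0
    for number, line in enumerate(lines, start=1):
        # Assumption: every item between ';' or '/' counts as one word.
        words = line.replace("/", ";").split(";")
        total_words += len(words)
        if len(line) > MAX_ENTRY_SIZE:
            problems.append((number, "entry too long"))
        if len(words) > MAX_WORDS_PER_ENTRY:
            problems.append((number, "too many words in entry"))
        if any(len(word) > MAX_WORD_SIZE for word in words):
            problems.append((number, "word too long"))
    if total_words > MAX_WORDS:
        # Line number 0 marks a problem with the file as a whole.
        problems.append((0, "too many words in total"))
    return problems


if __name__ == "__main__":
    print(check_limits(["ability/Faehigkeit;Begabung", "x" * 200 + "/y"]))
```

If the program counts words per side of the '/' instead, only the splitting line needs to change.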