This chapter explains how conversion from the TEI format into other formats works for supported platforms, or would work for unsupported platforms.
Note: This chapter is incomplete. Actually, the whole HOWTO should be restructured to reflect information flow better.
For this task two ways are viable. Both support multiple headwords
for a single entry, i.e. multiple orth elements
inside a single entry. This is useful for giving alternate spellings.
Another feature is the generation of an inverse index. With this, the
direction of translation of the dictionary can be reversed, i.e. an
English-Arabic dictionary would become queryable as Arabic-English.
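To make the two features concrete, here is a minimal sketch in Python. It is not how the FreeDict tools implement them; the in-memory entry structure and the function names build_index and build_inverse_index are assumptions made for this illustration only.

```python
from collections import defaultdict

# Hypothetical in-memory entries: each has one or more headwords (the TEI
# orth elements) and one or more translations.
entries = [
    {"orth": ["colour", "color"], "trans": ["Farbe"]},
    {"orth": ["paint"], "trans": ["Farbe", "Lack"]},
]


def build_index(entries):
    """Map every headword of an entry to that entry's translations."""
    index = defaultdict(list)
    for entry in entries:
        for headword in entry["orth"]:
            index[headword].extend(entry["trans"])
    return dict(index)


def build_inverse_index(entries):
    """Map every translation back to the headwords that carry it."""
    inverse = defaultdict(list)
    for entry in entries:
        for translation in entry["trans"]:
            inverse[translation].extend(entry["orth"])
    return dict(inverse)
```

With these sample entries, both "colour" and "color" lead to the same translation, and the inverse index makes "Farbe" a headword that points back to all three original headwords.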
The first way uses an XSLT stylesheet and dictfmt -c5. The stylesheet needs to be applied to the TEI file; any XSLT processor should do. Using Sablotron, the command would be
sabcmd xsl/tei2c5.xsl la1-la2.tei >la1-la2.c5
Using xsltproc from libxslt, you would say
xsltproc xsl/tei2c5.xsl la1-la2.tei >la1-la2.c5
Next, dictfmt has to be called to create the final dictd database files:
dictfmt -t --headword-separator %%% la1-la2 <la1-la2.c5
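The two steps can also be driven from a script. The following is a minimal sketch in Python, assuming that xsltproc and dictfmt are on the PATH. The function names c5_commands and convert_c5 are made up for this illustration; the command-line arguments are exactly those given in the text.

```python
import subprocess
from pathlib import Path


def c5_commands(basename, stylesheet="xsl/tei2c5.xsl"):
    """Return the argv lists for both steps of the c5 conversion."""
    xslt = ["xsltproc", stylesheet, f"{basename}.tei"]
    dictfmt = ["dictfmt", "-t", "--headword-separator", "%%%", basename]
    return xslt, dictfmt


def convert_c5(basename):
    """Run both steps; the .c5 file is stdout of step one and stdin of step two."""
    xslt, dictfmt = c5_commands(basename)
    c5_path = Path(f"{basename}.c5")
    with c5_path.open("wb") as out:
        subprocess.run(xslt, stdout=out, check=True)
    with c5_path.open("rb") as src:
        subprocess.run(dictfmt, stdin=src, check=True)


# Usage (requires the external tools and the input file):
#   convert_c5("la1-la2")
```

Passing argument lists instead of a shell string avoids quoting problems with the %%% separator, and check=True stops the run if the stylesheet step fails.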
The second way uses xmltei2xmldict.pl, a
conversion utility developed by the FreeDict project itself. It
converts the TEI file into the dictd database format without producing
intermediate files. Its disadvantage is that a successful conversion
depends on the FreeDict project: the dictd database format is still
subject to change, and FreeDict cannot know it as well as
dictfmt, which comes from
the dictd project itself. For the software dependencies, please read
the README.
An example command line would be:
xmltei2xmldict.pl -f la1-la2.tei -t xsl/tei2txt.xsl
The c5 style is faster but needs more memory, since it loads the
entire TEI file before applying the XSLT stylesheet. The
xmltei2xmldict.pl style instead uses the SAX API,
converting the dictionary by applying an XSLT stylesheet entrywise.
For example, converting the file deu-eng.tei,
which contains about 81000 entries, on my 550 MHz / 128 MB machine
took the following times:
Table 14.1. Timing comparison of c5 and xmltei2xmldict.pl conversion styles
| xmltei2xmldict.pl | c5 style |
|---|---|
| | 27 minutes |
It should be noted that my machine does not have enough memory for the c5 style, so it spent 25 of the 27 minutes swapping.
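The memory difference between the two styles can be illustrated with a small sketch. It does not use SAX or XSLT as xmltei2xmldict.pl does; instead it uses iterparse from Python's standard ElementTree module, which has a similar effect. The sample document, the tr element name and the function names are assumptions for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a TEI dictionary; real files hold tens of thousands of entries.
SAMPLE = b"""<TEI.2><text><body>
<entry><form><orth>Haus</orth></form><trans><tr>house</tr></trans></entry>
<entry><form><orth>Baum</orth></form><trans><tr>tree</tr></trans></entry>
</body></text></TEI.2>"""


def format_entry(entry):
    """Render one entry as 'headwords: translations'."""
    orths = [o.text for o in entry.iter("orth")]
    trs = [t.text for t in entry.iter("tr")]
    return f"{'; '.join(orths)}: {', '.join(trs)}"


def convert_whole_tree(source):
    """Whole-file approach: parse the complete document, then walk it."""
    root = ET.parse(source).getroot()
    return [format_entry(e) for e in root.iter("entry")]


def convert_entrywise(source):
    """Streaming approach: handle each entry as soon as it has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entry":
            yield format_entry(elem)
            elem.clear()  # discard the entry's children once it is converted
```

Both functions produce the same output for `io.BytesIO(SAMPLE)`. The first holds every entry in memory at once, while the second only needs the entry it is currently working on.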
The bedic project maintains the library libbedic and the application zbedic, which are designed to run on the Zaurus PDA from Sharp.
The following features of the bedic format 0.9.6 are supported:
separate senses (subsenses in bedic terminology)
pronunciation
POS information
examples
cross references
usage domain indication
A precondition for seeing these features is that the information is present in the respective TEI dictionary. Currently, the only dictionary using all supported features is Khasi-English.
To convert a dictionary into the bedic format, run the
command make release-bedic in the directory of the
respective dictionary module. The resulting file should be found in
$FREEDICTDIR/release/dic.
The conversion process follows these steps:
The entries are sorted so that entries whose headwords
are homographs end up next to each other. These
homographs are then merged into a single entry. Two
separate XSLT stylesheets, sort.xsl and
group-homographs-sorted.xsl, are employed for these steps, because XSLT
1.0 offers no access to the current node list, e.g.
through an additional axis.
Because of this grouping, the number shown by zbedic under "Items in dict" is usually less than the number of entries in the original TEI file.
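The sort-then-group idea can be sketched in a few lines of Python. This is only an illustration of the principle, not a port of the two stylesheets; the sample entries and the function name group_homographs are hypothetical.

```python
from itertools import groupby

# Hypothetical (headword, sense) pairs as they might appear in a TEI file.
entries = [
    ("bank", "financial institution"),
    ("tree", "woody plant"),
    ("bank", "edge of a river"),
]


def group_homographs(entries):
    """Sort by headword, then merge neighbours that share a headword."""
    # Sorting plays the role of sort.xsl: homographs become adjacent.
    ordered = sorted(entries, key=lambda e: e[0])
    # Grouping plays the role of group-homographs-sorted.xsl.
    return [
        (headword, [sense for _, sense in group])
        for headword, group in groupby(ordered, key=lambda e: e[0])
    ]
```

Here three input entries become two grouped entries, which is the same effect that makes the "Items in dict" count smaller than the number of TEI entries. As with the stylesheets, grouping only works because the sort has made equal headwords adjacent.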
The stylesheet tei2dic.xsl does
the main conversion step from TEI into bedic format.
For most languages, the "char-precedence" property of the bedic format is currently supported only by supplying a modified version of the Wikipedia-char-precedence that comes with libbedic; it was chosen because it includes many accented characters. If the order of the headwords as shown by zbedic is incorrect for your language, a specific char-precedence could easily be added to the stylesheet.
The cross references generated in bedic format are typeless, i.e. the label in front of a cross reference just says "See also:" instead of "Synonyms:", "Antonyms:" etc. Such typing could easily be added once we have defined a typology, ideally one to be used in all FreeDict dictionaries. Such typologies are needed in other places as well, e.g. for the content of part of speech tags. We might have expected to get such typologies from the TEI itself, but since the TEI does not offer them, developing typologies might be something FreeDict must add to its agenda.
The resulting file is normalized according to Unicode
Normalization Form NFC (Canonical Decomposition followed by
Canonical Composition). For this Charlint - A
Character Normalization Tool is used. See
Makefile.common for details on how to
install it so that it is found by the conversion process.
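Charlint is the tool the conversion process uses. Purely to illustrate what normalization to NFC does, the following sketch uses the unicodedata module from Python's standard library instead; the function name to_nfc and the sample string are assumptions.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT: five code points in total.
decomposed = "cafe\u0301"
# After NFC the last two code points are replaced by the precomposed U+00E9.
composed = to_nfc(decomposed)
```

Normalizing an already composed string leaves it unchanged, so the step can safely be applied to a file more than once.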
The following Perl filter replaces certain escape sequences by NUL and ESC characters. This could not be done by the stylesheets, because XML does not allow those characters to be represented:
perl -pi -e 's/\\0/\x00/gm; s/\\e/\e/gm;' <in >out
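For readers less familiar with Perl, the same replacement can be sketched in Python. The function name unescape_control is hypothetical; the sketch works on bytes so that the encoding of the rest of the file is left untouched.

```python
def unescape_control(data: bytes) -> bytes:
    """Replace the two-character sequences \\0 and \\e by NUL and ESC."""
    # b"\\0" is a backslash followed by the digit zero; b"\x00" is the NUL byte.
    data = data.replace(b"\\0", b"\x00")
    # b"\\e" is a backslash followed by the letter e; b"\x1b" is the ESC byte.
    return data.replace(b"\\e", b"\x1b")
```

A file could be filtered by reading it in binary mode, passing the contents through this function and writing the result back in binary mode.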
The xerox tool that comes
with libbedic and dictzip are run. Also,
the version number is given to the resulting
.dic.dz file.
Optionally, if the command make release-zaurus was given, an ipk package is generated. This is not done by default, since each dictionary consists of only one file.
This is an important step, especially for a developer who does not have a Zaurus device available. Qt/Embedded (qte), QTopia (qpe) and zbedic can be compiled for the x86 architecture and can run either under the Linux Console Framebuffer or in the Qt/Embedded Virtual X11 Framebuffer (qvfb).
Free versions of qte, qpe and qvfb are available from Trolltech.
You have to compile qte, qpe, libbedic and zbedic in that order. The build instructions of qpe also talk of building Qt2 (for qvfb) and Qt3 (for other tools), though for testing zbedic no tools from Qt3 are required and as far as my SuSE 9.0 system was concerned, the qt3-extensions package already included a working qvfb.
When compiling qte, make sure you follow the build instructions
given in qpe (not qte), which will make sure that the right
customization header file qconfig-qpe.h is used
for qte compilation. I will abstain from further qte/qpe building
advice, since the instructions coming with qpe/qte are already
complicated enough.
To compile zbedic for the x86 architecture, you have to issue a command like ARCH=x86 make in the libbedic and zbedic source directories. Most likely you will also have to edit the Makefiles to point to your qpe source directory.
Since the qpe package manager cannot run from the qpe image
directory because it uses absolute paths, you cannot install zbedic
from the .ipk file and have to install its
files manually into the qpe image directory. Also you have to
register the .dic.dz file ending and mime
type using a text editor.
To start qpe or zbedic (and possibly qvfb) after a successful
build, you have to set up lots of environment variables correctly,
which is explained in the qpe documentation as well. You can give the
-qws switch to zbedic to run it without qpe.
The most important information to give here, since it is not mentioned
anywhere in the qpe documentation, is how directories are mapped
when you run qpe from an image directory. The directories
concerned are
~/Applications,
~/Documents and
~/Settings. The .dic.dz
files have to be put into ~/Documents for
zbedic to see them.
Combining marks are not shown properly on the Zaurus and in the
default qpe build. Most likely this is a limitation of the
Qt/Embedded font format .qpf. Even though
Qt/Embedded can support freetype and through it TrueType fonts, this
support is not available on the Zaurus. To avoid combining marks as far as
possible, the text is normalized to Unicode Normalization Form C,
where precomposed characters are preferred over combining
sequences. Still some combining characters remain, which in turn
are not shown properly, viz. sequences involving the combining
tilde in fra-eng pronunciations.
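Which combining marks survive normalization can be checked with a short sketch based on Python's unicodedata module. The function name residual_combining_marks is hypothetical, and the open e with tilde in the usage note is an assumed example of the kind of sequence meant here.

```python
import unicodedata


def residual_combining_marks(text):
    """Return the combining characters still present after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    # unicodedata.combining() is non-zero for combining characters.
    return [ch for ch in normalized if unicodedata.combining(ch)]
```

For "e" followed by U+0301 the result is empty, because a precomposed character exists. For U+025B (Latin small letter open e) followed by U+0303 (combining tilde) the tilde remains, since Unicode defines no precomposed form for that pair.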
The FreeDict dictionaries are available for the shareware program Evolutionary Dictionary from http://www.evolutionary.net/dict-info.htm.
The dictionary files for this platform must be created with the Win32 GUI-only application Dictionary.exe, available from http://www.evolutionary.net/dict-info1.htm. It currently does not run under the Windows emulator Wine.
According to its documentation Dictionary.exe can import:
VOK files, a format specifically made for Dictionary.exe (see next section)
Ergane format, used by the free PC translation program Ergane. The documentation says to make sure you export from Ergane in type 2 format (either 2a or 2b). The program and free modules can be obtained from the web site http://www.travlang.com/Ergane/. But the downloadable Ergane _cannot_ export anything!
"Mr Honey - English-German" (no format description available)
WordNet (again, no format description available here)
This is a plain text file format in Windows-1252 encoding (this last fact is not documented in the Online Help of Dictionary.exe!). It looks like this:
[words]
word/translated-word
word2a;word2b/translated-word2a;translated-word2b
.
.
[phrases]
>category-name
phrase/translated-phrase
.
.
>category-name
phrase/translated-phrase
.
.
[notes]
<any-text>
There should not be spaces around the '/' character. In the [words] or [phrases] section, you may have ';' characters on either side of the '/' to indicate multiple translations of a word, e.g.
ability/Faehigkeit;Begabung
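A VOK file in this layout could be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the function names format_vok and write_vok and the shape of their arguments are made up, and since the line-ending convention is not documented here, plain newlines are used.

```python
def format_vok(words, phrases=None, notes=""):
    """Render a VOK document as text.

    words:   list of (source_words, target_words), each a list of strings
    phrases: dict mapping a category name to a list of (phrase, translation)
    notes:   free text for the [notes] section
    """
    lines = ["[words]"]
    for source, target in words:
        # ';' separates alternatives, '/' separates the two languages.
        lines.append(";".join(source) + "/" + ";".join(target))
    lines.append("[phrases]")
    for category, pairs in (phrases or {}).items():
        lines.append(">" + category)
        lines.extend(f"{phrase}/{translation}" for phrase, translation in pairs)
    lines.append("[notes]")
    if notes:
        lines.append(notes)
    return "\n".join(lines) + "\n"


def write_vok(path, text):
    """Write the text in Windows-1252; unencodable characters raise an error."""
    with open(path, "w", encoding="cp1252", errors="strict", newline="") as handle:
        handle.write(text)
```

Using errors="strict" is a deliberate choice: a headword containing a character outside Windows-1252 then fails loudly instead of being silently replaced.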
The dictionaries built with this program have certain limits:
Max Word Size - 128 characters
Max Words per entry - 16
Max Entry Size - 1024
Max Words - 1,048,576 - Not entries
Max Word Data - 16Mb characters - Uncompressed (Compressed approx twice that amount)
Max Phrase Categories - 64
Data per Category - 64Kb characters - Uncompressed (Compressed approx twice that amount)
Notes - 64Kb characters - Uncompressed
There is no limit to the number of entries, as such. There is a limit to the number of total words in all entries (i.e. between each semicolon) and the space used for all entries.
Supplemental Entries - 64Kb - Words added or edited on a handheld are stored in a supplemental record. When the dictionary is saved by this program the words are re-indexed and the supplemental record is cleared.
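Before importing a large word list it may help to check it against some of these limits. The sketch below is only an illustration: the function name check_limits is hypothetical, and it assumes that the words of an entry are counted across both sides of the '/' and that the entry size is the length of the whole line in characters. The documentation quoted here does not state either point.

```python
MAX_WORD_SIZE = 128        # characters per word
MAX_WORDS_PER_ENTRY = 16   # assumed to count both sides of the '/'
MAX_ENTRY_SIZE = 1024      # assumed to be characters in the whole line
MAX_WORDS = 1_048_576      # total words across all entries


def check_limits(lines):
    """Return a list of problems found in the given [words] lines."""
    problems = []
    total_words = 0
    for number, line in enumerate(lines, start=1):
        source, _, target = line.partition("/")
        # A word is whatever stands between semicolons on either side.
        words = [w for side in (source, target) for w in side.split(";") if w]
        total_words += len(words)
        if len(line) > MAX_ENTRY_SIZE:
            problems.append(f"entry {number}: longer than {MAX_ENTRY_SIZE} characters")
        if len(words) > MAX_WORDS_PER_ENTRY:
            problems.append(f"entry {number}: more than {MAX_WORDS_PER_ENTRY} words")
        for word in words:
            if len(word) > MAX_WORD_SIZE:
                problems.append(f"entry {number}: word longer than {MAX_WORD_SIZE} characters")
    if total_words > MAX_WORDS:
        problems.append(f"more than {MAX_WORDS} words in total")
    return problems
```

An empty result means none of the checked limits is exceeded; the limits on compressed data size, phrase categories and notes are not covered by this sketch.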