<part id="tools">
  <title>Developer's toolkit</title>
  <partintro>
    <para>Here we outline the tools needed or useful for building, editing and
      maintaining translating dictionaries.</para>

    <para>Most scripts and stylesheets mentioned in the following chapters can
      be found in the <ulink
	url="http://cvs.sourceforge.net/viewcvs.py/freedict/tools/">tools
	module</ulink> in the CVS repository of the FreeDict project.</para>
  </partintro>

  &writing-texteditor;

  <chapter id="validation">
    <title>Validating TEI format files</title>
    <para>Validating TEI files requires an installed version of the TEI DTD and
      a validating XML parser. It is assumed you have
      <application>sp</application> or <application>opensp</application>
      installed. These make the commands <command>nsgmls</command> respective
      <command>onsgmls</command> available.</para>

    <para>Validation should be available as a target in the Makefile of each
      dictionary module. Try <command>make validation</command> for the
      dictionary module you want to validate (<command>cd</command> to it
      first).</para>

    <para>The Makefile target itself is simple. It sets some environment
      variables, so that nsgmls can find the DTDs, tells it to handle XML data
      and then calls it.</para>

    <para>This is what is done:
      <programlisting>
        test -e $(XMLSOC) || (echo "Please set path to xml.soc file!"; exit 1)
        export SP_ENCODING=XML
        export SP_CHARSET_FIXED=YES
        export SGML_CATALOG_FILES="$(XMLSOC):$(CATALOGS)"
        nsgmls -s -E 10 $(LANGUAGE).tei
      </programlisting>
    </para>
  </chapter>

  <chapter id="x2tei">
    <title>Converting into TEI format</title>
    <para>Unfortunately, many translation database/wordlist projects use their
      own format and so require a specialized converter.  Some adhere to
      standards, but mostly their own. A problem with standards is that you
      have too many to choose from. It is an aim of FreeDict to lead to some
      kind of standardisation of data with the TEI format.  Also we hope that
      upstream projects (ie. the projects where actual dictionary data comes
      from) make their data available in TEI XML format or even directly use
      TEI XML as their primary data format.</para>

    <para>
      <itemizedlist>
	<listitem>
	  <para><command>ding2tei.pl</command> - conversion of the <ulink
	      url="http://dict.tu-chemnitz.de/help.html#download">ding
	      database</ulink> (English/German) into TEI format</para>
	</listitem>
	<listitem>
	  <para><command>hd2tei.pl</command> - conversion of the "hd" format
	    (which <application>dictfmt</application> also understands) into
	    TEI format</para>
	</listitem>
	<listitem>
	  <para><command>tab2tei.pl</command> - conversion of tab delimited
	    plain text file into TEI format</para> </listitem>
	<listitem>
	  <para><command>dict2tei.py</command> -  conversion of an already
	    formatted dictd database into TEI format</para>
	</listitem>
      </itemizedlist>
    </para>

    <section id="ergane">
      <title>Technical notes on the Ergane databases</title>
      <para>Ergane 5.0, which was the source for the first FreeDict databases,
	had an export function.  Back then it was easy to use its output and
	convert it into TEI.  But by now <ulink
	  url="http://download.travlang.com/">Ergane 8.0</ulink> is out -
	without any export function.  Then and now Ergane used Microsoft Access
	databases.  Using knowledge about the structure of the Ergabe databases
	it is possible to implement a new export function.</para>

      <para>At the heart of Ergane's approach to dictionary encoding is an
	index of word meanings.  The meanings of every word are encoded as
	numbers.  The word meanings of different languages are mapped to the
	same index numbers.  It is possible to find the translations of the
	words of language <replaceable>la1</replaceable> in language
	<replaceable>la2</replaceable> by this SQL query:</para>

      <para><computeroutput>SELECT * FROM la1, la2 WHERE la1.EspKey = la2.EspKey</computeroutput></para>

      <para>Ergane explains it uses Esperanto as an intermediate language.  But that
	is not really necessary.  The Ergane database just contains some special tables,
	but they are not special to Esperanto.</para>

      <table id="ergane-woordenboek">
	<title>The 'Woordenboek' table structure</title>
	<tgroup cols="3">
	  <thead>
	    <row>
	      <entry>data type</entry>
	      <entry>column name</entry>
	      <entry>description</entry>
	    </row>
	  </thead>
	  <tbody>
	    <row>
	      <entry>longint</entry>
	      <entry>keyno</entry>
	      <entry>primary key</entry>
	    </row>
	    <row>
	      <entry>longint</entry>
	      <entry>EspKey</entry>
	      <entry>meaning number</entry>
	    </row>
	    <row>
	      <entry>Text(510)</entry>
	      <entry>XEntry</entry>
	      <entry>orthography (the headword), encoding?</entry>
	    </row>
	    <row>
	      <entry>byte</entry>
	      <entry>Type</entry>
	      <entry>Part of Speech. For codes see conversion script.</entry>
	    </row>
	    <row>
	      <entry>char(2)</entry>
	      <entry>GType</entry>
	      <entry>Genus type. code?</entry>
	    </row>
	    <row>
	      <entry>byte</entry>
	      <entry>FType</entry>
	      <entry>Flexion type. codes?</entry>
	    </row>
	    <row>
	      <entry>Memo</entry>
	      <entry>Omschr</entry>
	      <entry>transliteration?</entry>
	    </row>
	    <row>
	      <entry>Byte</entry>
	      <entry>Freq</entry>
	      <entry></entry>
	    </row>
	    <row>
	      <entry>longint</entry>
	      <entry>Volgorde</entry>
	      <entry>sorting key</entry>
	    </row>
	    <row>
	      <entry>longint</entry>
	      <entry>Opm</entry>
	      <entry></entry>
	    </row>
	    <row>
	      <entry>longint</entry>
	      <entry>Opm2</entry>
	      <entry></entry>
	    </row>
	    <row>
	      <entry>Text(240)</entry>
	      <entry>Sortkey</entry>
	      <entry></entry>
	    </row>
	    <row>
	      <entry>Text(170)</entry>
	      <entry>Uitspraak</entry>
	      <entry>pronunciation?</entry>
	    </row>
	  </tbody>
	</tgroup>
      </table>

      <para>Some Woordenboek tables of language databases have extra
	columns.</para>

    </section>
  </chapter>

  <chapter id="platforms">
    <title>Supported and Unsupported Platforms</title>

    <para>This chapter explains how conversion from TEI format into different
      other formats works - for supported platforms - or would work - for
      unsupported platforms.</para>

    <note>
      <para>This chapter is incomplete. Actually, the whole HOWTO should be
	restructured to reflect information flow better.</para>
    </note>

    <section id="tei2dict">
      <title>Dictd Database Format</title>
      <para>For this task two ways are viable. Both supports multiple headwords
	for a single entry, ie. multiple <sgmltag>orth</sgmltag> elements
	inside a single entry.  This is useful to give alternate spellings.
	Another feature is generation of an inverse index. With this, the
	direction of translation of the dictionary can be reversed, ie. an
	English-Arabic dictionary would become queryable Arabic-English.</para>

      <section id="tei2dict-c5">
	<title>Using c5 style</title>
	<para>This process uses an XSLT stylesheet and <command>dictfmt -c5</command>.
	  The stylesheet needs to be applied to the TEI file. Any XSLT processor
	  should do the work. Using
	  <ulink url="http://www.gingerall.com/charlie/ga/xml/p_sab.xml">Sablotron</ulink>,
	  the command to give would be</para>

	<para><userinput>sabcmd xsl/tei2c5.xsl
	    <replaceable>la1-la2</replaceable>.tei
	    ><replaceable>la1-la2</replaceable>.c5</userinput></para>

	<para>Using xsltproc from <ulink
	    url="http://xmlsoft.org/XSLT/">libxslt</ulink>, you would
	  say</para>

	<para><userinput>xsltproc xsl/tei2c5.xsl
	    <replaceable>la1-la2</replaceable>.tei
	    ><replaceable>la1-la2</replaceable>.c5</userinput></para>

	<para>Next, <command>dictfmt</command> has to be called to create the
	  final dictd database files:</para>

	<para><userinput>dictfmt -t --headword-separator %%%
	    <replaceable>la1-la2</replaceable>
	    &lt;<replaceable>la1-la2</replaceable>.c5</userinput></para>
      </section>

      <section id="tei2dict-perl">
	<title>Using <command>xmltei2xmldict.pl</command></title>
	<para>Second, using <command>xmltei2xmldict.pl</command>. This is a
	  conversion utility developed by the FreeDict project itself. It
	  converts the TEI file into dictd database format without producing
	  intermediate files.  Its disadvantage is that it depends on the
	  FreeDict project for the conversion to be successful, because the
	  dictd database format is still subject to changes and cannot be as
	  well understood as <command>dictfmt</command> does, which comes from
	  the dictd project itself. For the software dependencies, please read
	  the <ulink
	    url="http://cvs.sourceforge.net/viewcvs.py/freedict/tools/README?rev=1.4&amp;view=markup"><filename>README</filename></ulink>.</para>

	<para>An example command line would be:</para>

	<para><userinput>xmltei2xmldict.pl -f
	    <replaceable>la1-la2</replaceable>.tei -t
	    xsl/tei2txt.xsl</userinput></para>
      </section>

      <para>The c5 style is faster but needs more memory, since it loads the
	entire TEI file before applying the XSLT stylesheet.  The
	<command>xmltei2xmldict.pl</command> style instead uses the SAX API,
	converting the dictionary by applying an XSLT stylesheet entrywise.
	For example to convert the file <filename>deu-eng.tei</filename>,
	which contains about 81000 entries, using my 550 MHz / 128 MB machine
	took the following times:</para>

      <table frame="none" id="tei2dict-timing">
	<title>Timing comparison of c5 and <command>xmltei2xmldict.pl</command> conversion styles</title>
	<tgroup cols="2">
	  <thead>
	    <row>
	      <entry><command>xmltei2xmldict.pl</command></entry>
	      <entry>c5 style</entry>
	    </row>
	  </thead>
	  <tbody>
	    <row>
	      <entry><computeroutput><literallayout>
real    38m8.771s
user    24m54.358s
sys     0m37.752s</literallayout></computeroutput>
	      </entry>
	      <entry><computeroutput><literallayout>
real    27m14.123s
user    1m53.763s
sys     0m38.618s</literallayout></computeroutput>
	      </entry>
	    </row>
	  </tbody>
	</tgroup>
      </table>

      <para>It should be noted that my machine has not enough memory for c5 style,
	so it spent 25 of the 27 minutes in swapping.</para>
    </section>

      <section id="bedic">
	<title>Bedic</title>
	<para>The <ulink url="http://bedic.sf.net/">bedic project</ulink>
	  maintains the library <application>libbedic</application> and the
	  application <application>zbedic</application>, which are designed to
	  run on the Zaurus PDA from Sharp.</para>

	<para>The following features of the bedic format 0.9.6 are supported:
	  <itemizedlist>
	    <listitem><para>separate senses (subsenses in bedic terminology)</para></listitem>
	    <listitem><para>pronunciation</para></listitem>
	    <listitem><para>POS information</para></listitem>
	    <listitem><para>examples</para></listitem>
	    <listitem><para>cross references</para></listitem>
	    <listitem><para>usage domain indication</para></listitem>
	  </itemizedlist>
	</para>

	<para>A precondition to see these features is that the information is
	  present in the respective TEI dictionary. The presently only
	  dictionary using all supported features is Khasi-English.</para>

	<para>To convert dictionaries into the bedic format, you can give the
	  command <command>make release-bedic</command> from the directory of the
	  respective dictionary module. The resulting file should be found in
	  <filename>$FREEDICTDIR/release/dic</filename>.</para>

	<para>The conversion process follows these steps:</para>

	<orderedlist>
	  <listitem>
	    <para>The entries are ordered, so that those whose headwords
	      are homographs come to stand next to each other. Then these
	      homographs are integrated into a single entry. For these steps the two
	      separate XSLT stylesheets <filename>sort.xsl</filename> and
	      <filename>group-homographs-sorted.xsl</filename> are employed, because XSLT
	      1.0 has no notion of giving access to the current node list, eg.
	      through an additional axis.</para>
	    <para>Because of this grouping, the number shown by zbedic under "Items
	      in dict" is usually less than the number of entries in the original TEI
	      file.</para>
	  </listitem>
	  <listitem><para>The stylesheet <filename>tei2dic.xsl</filename> does
	      the main conversion step from TEI into bedic format.</para>
	    <para>The "char-precedence" property of the bedic format is
	      for most languages currently only supported by supplying a
	      modified version of the Wikipedia-char-precedence that comes with
	      libbedic, because it includes many accented characters. If the
	      order of the headwords as shown by zbedic is incorrect for your
	      language, a specific char-precedence could be added easily to
	      the stylesheet.</para>
	    <para>The cross references generated in bedic format are typeless,
	      ie. the label in front of a cross reference just says "See also:"
	      instead of "Synonyms:", "Antonyms:" etc. Such typing could be
	      added easily after we have defined a typology to use in all
	      dictionaries of FreeDict, ideally. In other places such
	      typologies are used as well, eg. for part of speech tag content.
	      We might have expected to get such typologies from the TEI
	      itself, but since they do not offer them, developing typologies
	      might be something FreeDict must add to its agenda.</para>
	  </listitem>
	  <listitem><para>The resulting file is normalized according to Unicode
	      Normalization Form NFC (Canonical Decomposition followed by
	      Canonical Composition). For this <ulink type="http"
		url="http://www.w3.org/International/charlint/">Charlint - A
		Character Normalization Tool</ulink> is used. See
	      <filename>Makefile.common</filename> for details on how to
	      install it so that it is found by the conversion process.</para>
	  </listitem>
	  <listitem><para>The following Perl filter replaces certain escape
	      sequences by NUL and ESC characters.  This could not be done by
	      the stylesheets, because XML does not allow those characters to
	      be represented:</para>
	    <programlisting>perl -pi -e 's/\\0/\x00/gm; s/\\e/\e/gm;' &lt;in >out</programlisting>
	  </listitem>
	  <listitem><para>The <application>xerox</application> tool that comes
	      with libbedic and <application>dictzip</application> are run. Also,
	      the version number is given to the resulting
	      <filename>.dic.dz</filename> file.</para> </listitem>
	  <listitem><para>Optionally, if the command <command>make
		release-zaurus</command> was given, an ipk package is
	      generated. This is not done per default, since each dictionary
	      consists of only one file.</para>
	  </listitem>
	</orderedlist>

	<section id="bedic-testing">
	  <title>Testing in an emulated environment</title>
	  <para>This is an important step, especially for a developer who does
	    not have a Zaurus device available. Qt/Embedded (qte), QTopia (qpe) and zbedic
	    can be compiled on for the x86 architecture and can run either
	    under the Linux Console Framebuffer or in the Qt/Embedded Virtual
	    X11 Framebuffer (qvfb).</para>
	  <para>Free versions of qte, qpe and qvfb are available from <ulink
	      type="http"
	      url="http://www.trolltech.com/products/qtopia/gpl.html">Trolltech</ulink>.</para>
	  <para>You have to compile qte, qpe, libbedic and zbedic in that order.
	    The build instructions of qpe also talk of building Qt2 (for qvfb)
	    and Qt3 (for other tools), though for testing zbedic no tools from
	    Qt3 are required and as far as my SuSE 9.0 system was concerned, the
	    qt3-extensions package already included a working qvfb.</para>
	  <para>When compiling qte, make sure you follow the build instructions
	    given in qpe (not qte), which will make sure that the right
	    customization header file <filename>qconfig-qpe.h</filename> is used
	    for qte compilation. I will abstain from further qte/qpe building
	    advice, since the instructions coming with qpe/qte are already
	    complicated enough.</para>
	  <para>To compile zbedic for the x86 architecture, you have to issue a
	    command like <command>ARCH=x86 make</command> in the libbedic and
	    zbedic source directories. Most likely you will also have to edit the
	    Makefiles to point to your qpe source directory.</para>
	  <para>Since the qpe package manager cannot run from the qpe image
	    directory because it uses absolute paths, you cannot install zbedic
	    from the <filename>.ipk</filename> file and have to install its
	    files manually into the qpe image directory. Also you have to
	    register the <filename>.dic.dz</filename> file ending and mime
	    type using a text editor.</para>
	  <para>To start qpe or zbedic (and possibly qvfb) after a successful
	    build, you have to set up lots of environment variables correctly,
	    which is explained in the qpe documentation as well. You can give the
	    <literal>-qws</literal> switch to zbedic to run it without qpe.</para>
	  <para>The most important info to give here - since this is not mentioned
	    anywhere in the qpe documentation - is the mapping of directories
	    when you run qpe from an image directory. Just a mention of them
	    will make you understand, though.  They are
	    <filename>~/Applications</filename>,
	    <filename>~/Documents</filename> and
	    <filename>~/Settings</filename>. The <filename>.dic.dz</filename>
	    files have to be put into <filename>~/Documents</filename> for
	    zbedic to see them.</para>
	  <para>Combining marks are not shown properly on the Zaurus and in the
	    default qpe build. Most likely this is a limitation of the
	    Qt/Embedded font format <filename>.qpf</filename>. Even though
	    Qt/Embedded can support freetype and through it TrueType fonts, it
	    is not available on the Zaurus. To avoid combining marks as far as
	    possible, the text is normalized to Unicode Normalization Form C,
	    where precomposed characters are preferred over combining
	    sequences. Still some combining characters remain, which in turn
	    are not shown properly, viz. sequences involving the combining
	    tilde in fra-eng pronunciations.</para>
	</section>
      </section>

      <section id="evolutionary">
	<title>Evolutionary Dictionary</title>

	<para>The FreeDict dictionaries are available for the shareware program
	  <application>Evolutionary Dictionary</application> from <ulink
	    url="http://www.evolutionary.net/dict-info.htm">http://www.evolutionary.net/dict-info.htm</ulink>.</para>

	<para>The dictionary files for this platform must be created with the
	  Win32 GUI-only Application <application>Dictionary.exe</application>,
	  available from <ulink url="http://www.evolutionary.net/dict-info1.htm"
	    >http://www.evolutionary.net/dict-info1.htm</ulink>. 
	  It currently does not run in the Windows Emulator <application>wine</application>.</para>

	<para>According to its documentation <application>Dictionary.exe</application>
	  can import:</para>

	<itemizedlist>
	  <listitem>
	    <para>VOK files, a format specifically made for
	      <application>Dictionary.exe</application> (see next section)</para>
	  </listitem>
	  <listitem>
	    <para>Ergane format, used by the free PC translation program
	      <application>Ergane</application>.
	      They say to make sure you export from Ergane in type 2 format (either 2a
	      or 2b).  The program and free modules can be obtained from the web site -
	      <ulink
		url="http://www.travlang.com/Ergane/">http://www.travlang.com/Ergane/</ulink>.
	      But the downloadable Ergane _cannot_ export anything!</para>
	  </listitem>
	  <listitem><para>"Mr Honey - English-German" (no format description available)</para>
	  </listitem>
	  <listitem><para><ulink url="http://www.cogsci.princeton.edu/~wn/">WordNet</ulink>
	    (again, no format description available here)</para>
	  </listitem>
	</itemizedlist>

	<section id="evolutionary-vok">
	  <title>VOK file format</title>

	  <para>This is a plain text file format in Windows-1252 encoding (this last
	    fact is not documented in the Online Help of
	    <application>Dictionary.exe</application>!). It looks like this:</para>

	  <programlisting>
[words]
word/translated-word
word2a;word2b/translated-word2a;translated-word2b
    .
    .

[phrases]
>category-name
phrase/translated-phrase
    .
    .
>category-name
phrase/translated-phrase
    .
    .
[notes]
&lt;any-text&gt;</programlisting>

	  <para>There should not be spaces around the '/' character.  In the
	    [words] or [phrases] section, you may have ';' characters on either
	    side of the '/' to indicate multiple translations of a word.
	    eg.</para>
	  <programlisting>ability/Faehigkeit;Begabung</programlisting>

	  <para>The dictionaries built with this program have certain limits:</para>

	  <programlisting>Max Word Size - 128 characters
Max Words per entry - 16
Max Entry Size - 1024

Max Words - 1,048,576 - Not entries
Max Word Data - 16Mb characters - Uncompressed (Compressed approx twice that amount)

Max Phrase Categories - 64
Data per Category - 64Kb characters - Uncompressed (Compressed approx twice that amount)

Notes - 64Kb characters - Uncompressed</programlisting>

	  <para>There is no limit to the number of entries, as such. There is a
	    limit to the number of total words in all entries (i.e. between
	    each semicolon) and the space used for all entries.</para>

	  <para>Supplemental Entries - 64Kb - Words added or edited on a
	    handheld are stored in a supplemental record. When the dictionary
	    is saved by this program the words are re-indexed and the
	    supplemental record is cleared.</para>

	</section>
      </section>
    </chapter>

    <chapter id="stylesheets">
      <title>Stylesheets</title>
      <para>In FreeDict stylesheets written in the XSLT language are employed
	to describe the transformation of entries into other formats.  XSLT was
	devised for conversion of XML data and as such is very well adapted to
	it.  Using stylesheets we hope to make the conversion process open and
	adjustable by users to their needs, because now "only" the stylesheets
	need to be adjusted in order to adjust the arrangement of information
	inside entries, no more knowledge of converter programming should be
	required.</para>

      <para>In practice though, it turns out that XSLT itself is no more
	easier than a programming language. In fact, because of its different
	approach, it is hard to get into. But we believe the decision to use it
	was a good one, because now the entry layout is independent of the
	converter tool. In previous versions of the converter the Perl code
	became unmaintainable. It was impossible to oversee what a code change
	would do to the layout.</para>

      <para>The reference for XSLT can be found near the place where
	<ulink url="http://www.w3.org/Style/XSL/">XSL</ulink> comes from.
        For learning though, some tutorial might be more useful :)</para>

      <para>TODO include commented tei2txt.xsl</para>
    </chapter>

</part>

<!-- Keep this comment at the end of the file. It is for emacs.
 Local Variables:
 sgml-parent-document: ("freedicthowto.xml" "book" "part")
 End:
 -->

