<?xml version="1.0" encoding="utf-8"?>
<!-- this file is included by freedicthowto.xml -->

<chapter id="writing">
  <title>Writing a FreeDict Dictionary</title>
  <abstract>
    <para>This chapter deals with the process of building a FreeDict format
      file from your gathered sources. We cover TEI DTD installation, SGML and
      XML catalog configuration, some introductory level XML, final formats and
      a couple of shortcuts.</para>
  </abstract>

  <section>
    <title>Introduction</title>
    <para>What we don't deal with here is the actual process of collating a
      translating dictionary. That task is potentially endless and will be very
      particular to your own circumstances. You need to develop your own
      approaches and processes for gathering source materials and checking the
      quality of your entries. Here are some of the "process" things you might
      look out for and check with small sample sets before you get too far
      along.</para>
    <orderedlist numeration="upperalpha">
      <listitem>
	<para>Check that your editor or word processor can output plain UTF-8
	  text: not word processor or browser specific markup, just simple
	  text that can represent the characters of the languages you are
	  writing for. Using different fonts, while helpful in a word
	  processor, generally does not carry over to plain text (UTF-8)
	  format, and almost certainly will not survive in your final output
	  version.
	  <indexterm><primary>word processor</primary></indexterm>
	  <indexterm><primary>fonts</primary></indexterm>
	</para>
      </listitem>
      <listitem>
	<para>If you are importing from a spreadsheet application, try
	  exporting the pages in simple Comma Separated Value (CSV) format.
	  You can often use almost any character or set of characters as the
	  "comma". You may be able to convert the result to Dictd format with
	  a simple script, in which case we have a shortcut for you: see
	  <xref linkend="dictd"/>.
	  <indexterm><primary>spreadsheet</primary></indexterm>
	</para>
      </listitem>
      <listitem>
	<para>If you are starting from scratch and writing your dictionary
	  mostly by hand, please consider using a template, and an XML editor
	  like (X)emacs. These make the process much less error prone and
	  tedious. See the <link linkend="tools">tools section</link>
	  for more information.
	  <indexterm>
	    <primary>XML</primary>
	    <secondary>Editor</secondary>
	    <tertiary>Xemacs and emacs</tertiary>
	  </indexterm>
	  <indexterm>
	    <primary>writing</primary>
	    <secondary>by hand</secondary>
	  </indexterm>
	  <indexterm>
	    <primary>Starting from Scratch</primary>
	  </indexterm>
	  <indexterm>
	    <primary>CSV</primary>
	    <secondary>Comma Separated Value</secondary>
	  </indexterm>
        </para>
      </listitem>
    </orderedlist>
  </section>

  <section>
    <title>The FreeDict Entry Format</title>
    <abstract><para>Though we claim to adhere to TEI P4 XML, Chapter 12 "Print
	Dictionaries", additional rules and restrictions apply.</para>
    </abstract>

    <para>At first sight the TEI guidelines are very complex.  At second sight
      they still are, but it is important to note that they were written
      with the primary aim of encoding as much existing text as possible,
      tagged to a reasonable level of detail. The wide variety of
      existing text makes the TEI tagset very permissive, allowing almost any
      tag to be used inside any other.</para>

    <para>This permissiveness makes it difficult to process "pure TEI" with
      software that reformats TEI into other formats such as TeX, Formatting
      Objects or text.</para>

    <para>Besides being too permissive, the TEI Guidelines are incomplete
      for our needs, because they do not define any typologies. Typologies
      are needed for encoding different things in our dictionaries:</para>

    <itemizedlist>
      <listitem><para>the Part of Speech of headwords, i.e. the contents of
	  <sgmltag>pos</sgmltag> elements. Should verbs be marked as 'v',
	  'verb' or 'Verb'?</para></listitem>
      <listitem><para>the Usage Domain of entry meanings - technology,
	  botany etc.</para></listitem>
      <listitem><para>the type of Cross References - whether the reference
	  points to a synonym, an alternative spelling, a derived word
	  etc.</para></listitem>
    </itemizedlist>
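
    <para>As a rough sketch of where these three typologies surface in an
      entry, consider the fragment below. The element names are the ones used
      in the entry examples elsewhere in this HOWTO. The domain value
      <literal>zool</literal> and the synonym <literal>hound</literal> are
      hypothetical and chosen only for illustration; they are not part of
      any agreed vocabulary.</para>

    <programlisting>&lt;entry&gt;
  &lt;form&gt;&lt;orth&gt;dog&lt;/orth&gt;&lt;/form&gt;
  &lt;!-- part of speech typology: content of pos --&gt;
  &lt;gramGrp&gt;&lt;pos&gt;n&lt;/pos&gt;&lt;/gramGrp&gt;
  &lt;sense&gt;
    &lt;!-- usage domain typology: content of usg (value is hypothetical) --&gt;
    &lt;usg type="dom"&gt;zool&lt;/usg&gt;
    &lt;trans&gt;&lt;tr&gt;Hund&lt;/tr&gt;&lt;/trans&gt;
    &lt;!-- cross reference typology: type attribute of xr --&gt;
    &lt;xr type="syn"&gt;&lt;ref&gt;hound&lt;/ref&gt;&lt;/xr&gt;
  &lt;/sense&gt;
&lt;/entry&gt;</programlisting>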

    <para>Ideally the same typologies are shared by many dictionaries,
      which allows us to keep the processing software simple. If required, they
      can be localized before being presented to a dictionary user.</para>

    <para>For these reasons, it is part of FreeDict's agenda to develop
      language neutral typologies for the things mentioned above.</para>

    <table frame="all" id="typology-pos">
      <title>Part of Speech Typology (recommended contents of the
	<sgmltag>pos</sgmltag> element)</title>
      <tgroup cols="2">
	<thead>
	  <row>
	    <entry>Element Content</entry>
	    <entry>Meaning</entry>
	  </row>
	</thead>
	<tbody>
	  <row><entry>n</entry><entry>noun</entry></row>
	  <row><entry>v</entry><entry>verb (transitivity unknown)</entry></row>
	  <row><entry>vt</entry><entry>transitive verb</entry></row>
	  <row><entry>vi</entry><entry>intransitive verb</entry></row>
	  <row><entry>vti</entry><entry>transitive and intransitive verb</entry></row>
	  <row><entry>adv</entry><entry>adverb</entry></row>
	  <row><entry>adj</entry><entry>adjective</entry></row>
	  <row><entry>conj</entry><entry>conjunction</entry></row>
	  <row><entry>prep</entry><entry>preposition</entry></row>
	  <row><entry>int</entry><entry>interjection</entry></row>
	  <row><entry>pron</entry><entry>pronoun</entry></row>
	  <row><entry>art</entry><entry>article</entry></row>
	  <row><entry>num</entry><entry>numeral</entry></row>
	</tbody>
      </tgroup>
    </table>

    <para>It has been suggested that the TEI DTD be extended with additional
      attributes on entries, such as:</para>

    <programlisting>
      dictionary - which dictionary is the word in (eg.
                   eng-deu - so entries can be distributed on their own)
      author     - who edited the word last - it's nice to know who did the work
      version    - which version of the word
      date       - the time the word was last edited
      quality    - how good do we think that the translation is
                   this would give a hint about what words should be worked on next
      frequency  - how frequent is the word in the language (should also be present in sense?)
    </programlisting>

    <!-- TODO: compare with other terminological DTDs; link to this mail in the archive -->

    <para>Since TEI XML does not currently limit us, its extension is not
      actively pursued.</para>

    <section>
      <title>Best Practices</title>
      <itemizedlist>
	<listitem><para>Avoid using more than one <sgmltag>orth</sgmltag>
	    element per entry.  Instead create separate entries and link them
	    to each other.</para></listitem>
	<listitem><para>Put question marks into the <sgmltag>note</sgmltag>
	    elements of entries that still need to be reviewed. With this
	    convention, other editors will be able to find those entries
	    easily.</para></listitem>
      </itemizedlist>
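
      <para>The two conventions might be combined as in the following sketch.
	It assumes two spelling variants that are kept as separate entries
	and point at each other, with a question mark flagging the second
	entry for review. The cross reference type <literal>syn</literal> is
	used only because it is the one value shown in this HOWTO; a value
	specifically for alternative spellings is an open question, and the
	wording of the note is hypothetical.</para>

      <programlisting>&lt;entry&gt;
  &lt;form&gt;&lt;orth&gt;colour&lt;/orth&gt;&lt;/form&gt;
  &lt;sense&gt;
    &lt;trans&gt;&lt;tr&gt;Farbe&lt;/tr&gt;&lt;/trans&gt;
    &lt;xr type="syn"&gt;&lt;ref&gt;color&lt;/ref&gt;&lt;/xr&gt;
  &lt;/sense&gt;
&lt;/entry&gt;
&lt;entry&gt;
  &lt;form&gt;&lt;orth&gt;color&lt;/orth&gt;&lt;/form&gt;
  &lt;sense&gt;
    &lt;trans&gt;&lt;tr&gt;Farbe&lt;/tr&gt;&lt;/trans&gt;
    &lt;xr type="syn"&gt;&lt;ref&gt;colour&lt;/ref&gt;&lt;/xr&gt;
    &lt;!-- the question mark marks this entry as needing review --&gt;
    &lt;note&gt;? does this spelling need a usage label&lt;/note&gt;
  &lt;/sense&gt;
&lt;/entry&gt;</programlisting>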
    </section>
  </section>

  <section>
    <title>Two Approaches</title>
    <para>There are at least two approaches you might take to building a
      FreeDict format dictionary.  <link linkend="write-tei">Approach
	One</link> is to use the Text Encoding Initiative DTD from the
      beginning. This gives you the most flexibility.</para>
    <para><link linkend="dictd">Approach Two</link> involves producing a
      simply (and accurately) formatted plain file that you then process with
      some command line tools (and will probably have to touch up). This can
      be quicker if you are comfortable with it, but limits your options
      for lexicographic information.</para>
    <para>You may of course combine these or find any number of others. After
      all, it's your dictionary; we just need it in a certain format :)
      <indexterm>
	<primary>lexicographic</primary>
	  <secondary>flexibility</secondary>
	</indexterm>
        <indexterm>
	  <primary>Text Encoding Initiative</primary>
	  <secondary>DTD</secondary>
	</indexterm>
	<indexterm>
	  <primary>File Formats</primary>
	</indexterm>
      </para>
    </section>

</chapter>

<chapter id="write-tei">
  <title>Writing Text Encoding Initiative XML files</title>
  <abstract>
    <formalpara>
      <title>The Text Encoding Initiative</title>
      <para>
	<quote>The TEI (Text Encoding Initiative) is an international research
	  effort established in 1987, intended to produce a community-based
	  standard for encoding and interchange of texts.</quote>
        <citation><ulink url="http://www.tei-c.org/Consortium/TEIcharter.html">http://www.tei-c.org/Consortium/TEIcharter.html</ulink></citation>
      </para>
    </formalpara>
    <para>This section gives examples and explanations of using the TEI format
      to write and deliver your dictionary. We do not assume that you have
      the <link linkend="installTeiDTD">DTDs installed</link>, but it is best
      if you do.
      <indexterm>
        <primary>TEI</primary>
	<secondary>Text Encoding Initiative</secondary>
      </indexterm>
      <indexterm>
	<primary>TEI Consortium</primary>
      </indexterm>
    </para>
  </abstract>

  <section id="overview-tei">
    <title>Overview of the TEI Organisation</title>
    <formalpara>
      <title>In Brief</title>
      <para>
	<quote>In December of 2000, a new consortium was established to sustain
	  and develop the Text Encoding Initiative (TEI). Initially launched in
	  1987, the TEI is an international and interdisciplinary standard that
	  helps libraries, museums, publishers, and individual scholars
	  represent all kinds of literary and linguistic texts for online
	  research and teaching, using an encoding scheme that is maximally
	  expressive and minimally obsolescent.</quote>
	<citation><ulink url="http://www.tei-c.org/Consortium/index.html">http://www.tei-c.org/Consortium/index.html</ulink></citation>
	<indexterm>
	  <primary>interdisciplinary standards</primary>
	</indexterm>
      </para>
    </formalpara>

    <para>For a full explanation of the history, objectives and approaches of
      the Text Encoding Initiative it is best to visit them at
      <ulink url="&TEIUrl;">&TEIUrl;</ulink>. There you will find links to and
      information on all sorts of TEI related material. Of note are the <ulink
	url="&TEIguide;">Guidelines</ulink>.</para>
    <para>The TEI consortium
      provides an approach to transcribing and creating documents in SGML (the
      old versions of the TEI Guidelines, up to P3) and XML (from P4
      on) that is as obsolescence proof as possible and allows free exchange of
      content and ideas. We will only be using a very small subset of this
      system's capacity.
      <indexterm>
	<primary>obsolescence proof</primary>
      </indexterm>
    </para>
  </section>

  <section id="UsingTEI">
    <title>The TEI DTDs</title>
    <subtitle>Using The TEI Dictionary DTD</subtitle>
    <para>In these sections we are going to step through a basic introduction
      to building a marked up dictionary based on the Text Encoding Initiative
      TEI2 P4 Dictionary Document Type Definition. First we deal with some
      technical information. If you are familiar with XML and DTDs, you might
      want to skip the next sections.
    </para>
    <para>The TEI system of markup can be approached in a number of ways.  Here
      we treat it pretty much like any other XML based markup system.  The
      consortium itself has not registered a formal XML name space (as of Feb
      2004), but does have unofficial Public and System identifiers we may use.
      If you don't know what all that means, don't worry.  Cut and paste
      examples are given below, and the following "Walking with" section
      gives some basic explanations of what this all means.  For a deeper
      understanding a visit to the W3C XML pages may help.  OASIS and
      associated XML portals usually have technical and introductory matter as
      well.</para>
    <para><ulink url="&w3c;">&w3c;</ulink></para>
    <para><ulink url="&oasiscover;">&oasiscover;</ulink></para>
    <para><ulink url="http://www.xml.com/">http://www.xml.com</ulink>
      <indexterm>
        <primary>OASIS</primary>
        <secondary>W3C</secondary>
      </indexterm>
    </para>
  </section>

  <section id="xmldecTEI">
    <title>The XML Declaration</title>
    <para><literal>&lt;?xml version='1.0' encoding="UTF-8" ?&gt;</literal></para>
    <para>XML 1.0 does not strictly require the XML declaration, but every
      XML document should begin with one.  For our
      purposes the above is all we need.  Be aware though, that this
      declaration can contain more than we use here.  The
      <literal>encoding="UTF-8"</literal> <emphasis>attribute and value
	pair</emphasis> is optional, as XML processors assume UTF-8 (or
      UTF-16, when they detect a byte order mark) unless told otherwise.  We
      believe it is better to state the encoding explicitly, so that neither
      tools nor people have to guess.</para>
    <para>It is the next section that defines our Text Encoding Initiative
      format in particular.  So far we have just told the parser that
      this document is XML version 1.0, encoded in UTF-8.
      <indexterm><primary>XML Declaration</primary></indexterm>
    </para>
  </section>

    <section id="doctypeTEI">
      <title>The DOCTYPE Declaration</title>
    <para><programlisting>&lt;!DOCTYPE TEI.2 PUBLIC "&TEIsys;"
"http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
 &lt;!ENTITY % TEI.XML          "INCLUDE" &gt;
 &lt;!ENTITY % TEI.dictionaries "INCLUDE" &gt;
      ]&gt;</programlisting></para>
    <para>The above is called the DOCTYPE declaration. The UPPER case words are
      all keywords that are used to set variables and types in parsing and
      validating tools.  The rest of the declaration is very carefully
      structured to provide absolute definitions for a number of things.  We
      step through them here, as they are understood by the protocols involved.
      The entire construct between the opening <literal>&lt;!DOCTYPE</literal>
      and the closing <literal>&gt;</literal> is the document type
      declaration; together with the XML declaration it makes up the document
      prolog.  The part between the square brackets is the internal DTD
      subset.  Functionally the declaration defines or even redefines a
      specific named document type.
      <indexterm>
	<primary>DOCTYPE</primary>
	<secondary>SGML</secondary>
      </indexterm>
      <indexterm><primary>TEI.XML</primary></indexterm>
      <indexterm>
	<primary>TEI.dictionaries</primary>
      </indexterm>
      <indexterm>
	<primary>document prolog</primary>
	<secondary>INCLUDES</secondary>
      </indexterm>
    </para>
    <section id="TEIdoctype">
      <title>&lt;!DOCTYPE TEI.2</title>
      <para>The combined effect of this opening string is to tell a program
	(or a knowledgeable person) that this document claims to comply with a
	<emphasis>defined type of document</emphasis>, in our case
	TEI.2.</para>
      <para>Other document types (e.g. <application>DocBook</application>) use
	this parameter to set the "level" the document sits at, possibly
	within a larger definition.  Text Encoding Initiative documents use
	a different system to achieve much the same end (see the following
	description of ENTITIES).
	<indexterm><primary>DocBook</primary></indexterm>
      </para>

      <para>Practically, another way of thinking about this is to consider
	that the root ELEMENT of the document that follows has to be of the
	type TEI.2.  What that element is allowed to have as attributes or
	child elements is defined in the Document Type Definition for
	this DOCTYPE.
	<indexterm><primary>ELEMENT</primary></indexterm>
      </para>
    </section>
    <section id="TEIFPIid">
      <title>PUBLIC "&TEIsys;"</title>
      <para>The quoted string is called the Formal Public Identifier (FPI). It
	is introduced by the keyword PUBLIC and in this case ends with a
	language identifier.  The quoting is important, as is the string
	<literal>-//</literal> at the start of the identifier.</para>
      <para>The <literal>-//</literal> means that this is an informal, or
	unregistered, identifier. A registered one would start with
	<literal>+//</literal>. Most Formal Public Identifiers are of the
	unregistered type; there is nothing unusual or pejorative about it.
	<indexterm><primary>PUBLIC "&TEIsys;"</primary></indexterm>
	<indexterm>
	  <primary>Formal Public Identifier</primary>
	  <secondary>FPI</secondary>
	</indexterm>
      </para>
      <para>The FPI must be exactly as shown or it means <quote>something
	  else</quote>, and probably not what we need it to. It is also one of
	the ways your <link linkend="catalog-setup">local system</link> knows
	which DTD to compare your document to, and so must be correct.</para>
      <note>
	<para>Text Encoding Initiative documents <emphasis>also use other
	    FPIs</emphasis>!  If you visit their website you will see many
	  other examples. The TEI2 zip file currently has HTML documentation
	  explaining all those options. As that documentation is likely to be
	  up to date, please refer to it rather than to this HOWTO for usages
	  <emphasis>other than FreeDict</emphasis>.  Please see
	  <xref linkend="findingTEI"/> if you haven't already :)</para>
      </note>
    </section>
    <section id="TEIsystemid">
      <title>"http://www.tei-c.org/Guidelines/DTD/tei2.dtd"</title>
      <para>This next string must be a URI (Universal Resource Identifier, RFC
	1630).  In this case it is in URL (Uniform Resource Locator, RFC 1738)
	format.  Here it functions as a portable System Identifier to the DTD.
	By convention it is often started on a new line, but that is not a
	requirement.</para>
      <para>This is just one way of doing it. You may instead put the
	absolute path to the DTD on your system here. If
	you choose to install the DTDs only for your own use, you must set the
	system identifier to the absolute path to your copy of the DTD.
	<indexterm>
	  <primary>Uniform Resource Locator</primary><secondary>URL</secondary>
	</indexterm>
	<indexterm>
	  <primary>Universal Resource Identifier</primary><secondary>URI</secondary>
	</indexterm>
      </para>

      <example id="ownDTD"><title>System Identifiers</title>
	<programlisting>&lt;!DOCTYPE TEI.2 PUBLIC "&TEIsys;"
"/home/your-account/TEI/TEI2.dtd" [
 &lt;!ENTITY % TEI.XML          "INCLUDE" &gt;
 &lt;!ENTITY % TEI.dictionaries "INCLUDE" &gt;
	]&gt;</programlisting>
	<indexterm><primary>System Identifier</primary></indexterm>
      </example>
      <para>Be aware that this method is not portable. It may, however, save
	you a whole lot of bother if you don't plan to use the TEI2 DTD for
	much other than FreeDict.
	<indexterm><primary>Portability</primary></indexterm>
      </para>
    </section>
    <section id="sgml-intro-include-dictionaries-section">
      <title>&lt;!ENTITY % TEI.dictionaries "INCLUDE" &gt; ]&gt;</title>
      <para>The final section of the DOCTYPE declaration is enclosed
	in square brackets.  We use it to include some ENTITIES. In this case
	these entities select files that are a part of the
	TEI2 P4 DTD set.</para>
      <para>The first inclusion <literal>&lt;!ENTITY % TEI.XML          "INCLUDE" &gt;</literal>
	does some very clever "magic", remapping the DTD to XML compliant
	form.  Without it the document cannot be processed as XML.</para>
      <para>The second included ENTITY
	<literal>&lt;!ENTITY % TEI.dictionaries "INCLUDE" &gt;</literal> adds
	the dictionary specific elements to the base DTD, and thereby makes
	this a dictionary as opposed to say a stage script or process design
	document.</para>
      <para>You may define your own extensions to the TEI set within these
	square brackets as well.  A careful study of the DTDs and associated
	files would be of great value if you are interested in gaining a
	greater understanding of SGML and XML.  Functionally this space is
	where a DTD is loaded for any document type.  You may also include
	other files in this place which can be useful for large document sets.
	<indexterm><primary>ENTITIES</primary></indexterm>
      </para>
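      <para>Putting the pieces of the prolog together, a complete file might
	be laid out roughly as in the sketch below. The declarations are the
	ones discussed above and the entry is the minimal entry from the
	examples in this HOWTO. The <sgmltag>text</sgmltag> and
	<sgmltag>body</sgmltag> wrapper elements are our assumption of the
	usual TEI P4 layout, and the header is left as a placeholder, so this
	skeleton will not validate until a real header is filled in.</para>

      <programlisting>&lt;?xml version='1.0' encoding="UTF-8" ?&gt;
&lt;!DOCTYPE TEI.2 PUBLIC "&TEIsys;"
"http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
 &lt;!ENTITY % TEI.XML          "INCLUDE" &gt;
 &lt;!ENTITY % TEI.dictionaries "INCLUDE" &gt;
]&gt;
&lt;TEI.2&gt;
  &lt;teiHeader&gt;
    &lt;!-- placeholder: the header contents go here --&gt;
  &lt;/teiHeader&gt;
  &lt;text&gt;
    &lt;body&gt;
      &lt;entry&gt;
        &lt;form&gt;&lt;orth&gt;dog&lt;/orth&gt;&lt;/form&gt;
        &lt;trans&gt;&lt;tr&gt;Hund&lt;/tr&gt;&lt;/trans&gt;
      &lt;/entry&gt;
      &lt;!-- further entries follow here --&gt;
    &lt;/body&gt;
  &lt;/text&gt;
&lt;/TEI.2&gt;</programlisting>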
    </section>
  </section>
  <section id="headerTEI">
    <title>The TEI Header</title>
    <abstract>
      <para>In this section we step through the structure and intent
	behind the TEI header section.</para>
    </abstract>
    <para>Headers usually give meta information, i.e. information about
      something else.  The TEI header gives general information about the text
      that follows it, such as title, authors, publisher and revision history.
      It is defined
      in <ulink url="http://www.tei-c.org/P4X/HD.html">The TEI Guidelines,
	Chapter 5</ulink>.
    </para>

    <para>The TEI header is used to create the
      <literal>00-database-info</literal> entry in the dictd database files, as
      well as the meta information for other supported dictionary
      formats.  The contents of the title element go into the
      <literal>00-database-short</literal> entry; the content of the sourceDesc
      element goes into the
      <literal>00-database-url</literal> entry.</para>

    <variablelist><title>Supported elements in the TEI header (contents of
	<sgmltag class="element">teiHeader</sgmltag>)</title>
      <para>In the following, an XPath expression is given for elements that
	have a different meaning depending on their parents.</para>
      <varlistentry>
	<term><literal>titleStmt/title</literal></term>
	<listitem>
	  <para>This becomes <literal>00-database-short</literal> when
	    converted to dictd database format and the <literal>id</literal>
	    property when converted to BEDic (see <xref
	      linkend="bedic"/>).</para>
	</listitem>
      </varlistentry>
      <varlistentry>
	<term><literal>titleStmt/respStmt/resp</literal></term>
	<listitem>
	  <para>A responsibility for this database.  Values with special
	    treatment are <literal>Author</literal> and
	    <literal>Maintainer</literal>.  Both should be shown on the
	    FreeDict website.  BEDic has a property for the maintainer, but not
	    the author.</para>
	</listitem>
      </varlistentry>
       <varlistentry>
	 <term><literal>titleStmt/respStmt/name</literal></term>
	<listitem>
	  <para>Name and email address of the person carrying the
	    responsibility named in <literal>../resp</literal>, i.e. the
	    contents should follow the form <literal>FirstName LastName
	      &lt;user@host></literal>.</para>
	</listitem>
      </varlistentry>
       <varlistentry>
	<term><literal>edition</literal></term>
	<listitem>
	  <para>This becomes the release version number. It is shown on
	    the website and used in building filenames for releases.
	    It is recommended to use only numbers and dots. Also,
	    two levels of versioning should be enough, i.e.
	    <literal>0.1</literal> is a good start.</para>
	</listitem>
      </varlistentry>
       <varlistentry>
	<term><literal>extent</literal></term>
	<listitem>
	  <para>Should contain the approximate number of headwords, including the
	    unit "headwords". It is put into <literal>00-database-info</literal>.
	    The headword count for the website is extracted from the .index file
	    of the dictd database format.</para>
	</listitem>
      </varlistentry>
      <varlistentry>
	<term><literal>notesStmt</literal></term>
	<listitem>
	  <para>The notes given here are put into <literal>00-database-info</literal>
	    and should be available on the website as well, but at present
	    a "more info on this dictionary" page doesn't exist.</para>
	</listitem>
      </varlistentry>
      <varlistentry>
	<term><literal>sourceDesc/xptr</literal></term>
	<listitem>
	  <para>This becomes the source url on the website and is put
	    into <literal>00-database-url</literal> when converting to
	    dictd database format.</para>
	</listitem>
      </varlistentry>
      <varlistentry>
	<term><literal>revisionDesc</literal></term>
	<listitem>
	  <para>Description of the changes this database went through;
	    bugfixes are noted here as well. It would be nice
	    to generate the <filename>Changelog</filename> file of the
	    dictionary module from this.</para>
	</listitem>
      </varlistentry>
      <varlistentry>
	<term><literal>revisionDesc/change/date</literal></term>
	<listitem>
	  <para>These dates should follow the format
	    <replaceable>YYYY-MM-DD</replaceable>.</para>
	</listitem>
      </varlistentry>
    </variablelist>
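
    <para>To show how the XPath expressions above nest, here is a reduced
      sketch of a header. The values are taken from the Khasi-German sample
      output further down. The wrapper elements
      (<sgmltag>fileDesc</sgmltag>, <sgmltag>editionStmt</sgmltag>,
      <sgmltag>publicationStmt</sgmltag>, <sgmltag>p</sgmltag>,
      <sgmltag>item</sgmltag>) and the <literal>url</literal> attribute of
      <sgmltag>xptr</sgmltag> are our understanding of TEI P4 and are not
      spelled out in this HOWTO, so please check them against the TEI
      Guidelines. Sections reduced to a comment must be filled in before the
      header will validate.</para>

    <programlisting>&lt;teiHeader&gt;
  &lt;fileDesc&gt;
    &lt;titleStmt&gt;
      &lt;title&gt;Khasi-German FreeDict Dictionary&lt;/title&gt;
      &lt;respStmt&gt;
        &lt;resp&gt;Maintainer&lt;/resp&gt;
        &lt;name&gt;FirstName LastName &amp;lt;user@host&amp;gt;&lt;/name&gt;
      &lt;/respStmt&gt;
    &lt;/titleStmt&gt;
    &lt;editionStmt&gt;&lt;edition&gt;0.1&lt;/edition&gt;&lt;/editionStmt&gt;
    &lt;extent&gt;about 1000 headwords&lt;/extent&gt;
    &lt;publicationStmt&gt;
      &lt;!-- publisher, date and availability go here --&gt;
    &lt;/publicationStmt&gt;
    &lt;notesStmt&gt;
      &lt;!-- one note element per note --&gt;
    &lt;/notesStmt&gt;
    &lt;sourceDesc&gt;
      &lt;p&gt;Home: &lt;xptr url="http://freedict.org/"/&gt;&lt;/p&gt;
    &lt;/sourceDesc&gt;
  &lt;/fileDesc&gt;
  &lt;revisionDesc&gt;
    &lt;change&gt;
      &lt;date&gt;2002-02-18&lt;/date&gt;
      &lt;respStmt&gt;&lt;name&gt;Michael Bunk&lt;/name&gt;&lt;/respStmt&gt;
      &lt;item&gt;First Draft&lt;/item&gt;
    &lt;/change&gt;
  &lt;/revisionDesc&gt;
&lt;/teiHeader&gt;</programlisting>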

    <para>The following header is taken from the
      <filename>kha-deu.tei</filename> file. You can copy
      it and replace the element contents to fit your database.</para>

    <programlisting>&kha-deu.tei.header;</programlisting>

    <para>That header is transformed to this <literal>00-database-info</literal> text:</para>

    <screen>
Khasi-German FreeDict Dictionary

Maintainer: Michael Bunk &lt;micha@luetzschena.de>

Edition: 0.1
Size: about 1000 headwords

Published by: FreeDict, 2002-2004
at: http://freedict.org/

Availability:

  GNU GENERAL PUBLIC LICENSE

Series: free dictionaries

Notes:

 * Thanks go to the supporters of this project: Karl-Heinz Grüßner,
   University of Tübingen; Rebekah Tham, CIEFL, Shillong; Brother Sngi,
   Sacred Heart College, Shillong; University of Leipzig, Depts of
   Indology and Computer Science; Rik Faith for dictd server &amp; protocol;
   TEI for their wonderful guidelines; Horst Eyermann for his generous
   freedict project; the open source community: thrust for trust &amp;
   development

Source(s):

  Home: http://freedict.org/

The Project:

  This dictionary comes to you through nice people making it available for
  free and for good. It is part of the FreeDict project,
  http://www.freedict.org / http://freedict.de. This project aims to make
  available many translating dictionaries for free. Your contributions are
  welcome!

Changelog:

 * 2007-02-02 $Id: writing-tei.xml,v 1.14 2007-03-16 08:59:20 micha137 Exp $:
   Converted TEI file into text format

 * 2005-Nov-24 Michael Bunk:
   Note that this version is quite final. I concentrate on the
   Khasi-English dictionary now.

 * 2003-05-01 Michael Bunk:
   Updated some things in XML header information

 * 2002-02-18 Michael Bunk:
   First Draft

 * before February 2002 Karl-Heinz Grüßner:
   compilation of wordlists
    </screen>
  </section>

  <section id="entry-examples">
    <title>Entry Examples</title>

    <example id="ex-entry-minimal">
      <title>A minimal entry</title>
      <programlisting>&lt;entry&gt;
  &lt;form&gt;&lt;orth&gt;dog&lt;/orth&gt;&lt;/form&gt;
  &lt;trans&gt;&lt;tr&gt;Hund&lt;/tr&gt;&lt;/trans&gt;
&lt;/entry&gt;</programlisting>

      <para>After formatting this entry might look like:
        <screen>dog
    Hund</screen>
      </para>
    </example>

    <example id="ex-entry-normal">
      <title>A more complete entry</title>
      <programlisting>&lt;entry&gt;
  &lt;form&gt;
    &lt;orth&gt;dog&lt;/orth&gt;
    &lt;pron&gt;dɔg&lt;/pron&gt;
  &lt;/form&gt;
  &lt;gramGrp&gt;&lt;pos&gt;n&lt;/pos&gt;&lt;/gramGrp&gt;
  &lt;sense&gt;
    &lt;trans&gt;
      &lt;tr&gt;Hund&lt;/tr&gt;&lt;gen&gt;m&lt;/gen&gt;
    &lt;/trans&gt;
    &lt;eg&gt;
      &lt;q&gt;The dog is barking.&lt;/q&gt;
      &lt;trans&gt;&lt;tr&gt;Der Hund bellt.&lt;/tr&gt;&lt;/trans&gt;
    &lt;/eg&gt;
    &lt;note>Dogs bite as well.&lt;/note>
  &lt;/sense&gt;
&lt;/entry&gt;</programlisting>

      <para>After formatting it might look like this:
      <screen>dog [dɔg] n.
  Hund m.
  "The dog is barking." = "Der Hund bellt."
  (Dogs bite as well.)</screen>
      </para>
    </example>

    <para><sgmltag>orth</sgmltag> gives the orthography, i.e. how the word is
      correctly written.  If there are several spellings, the DTD lets you use
      multiple <sgmltag>orth</sgmltag> elements, although the FreeDict best
      practice is to create separate entries and link them to each other.
      <sgmltag>pron</sgmltag> gives the pronunciation, i.e. how the word is
      spoken. In FreeDict we use the IPA (International Phonetic Alphabet),
      whose characters are contained in Unicode.  The FreeDict-Editor might
      help you in entering it.  The <sgmltag>gramGrp</sgmltag> element groups
      all grammatical information. You can give the part of speech (here
      <literal>n</literal> for noun), the gender and also the number
      (singular, plural) of the headword (number is not given in this
      example).  The values allowed for part of speech are prescribed in this
      document, so that one editor doesn't write <literal>n</literal>,
      another <literal>noun</literal> and a third <literal>N</literal>.  See
      <xref linkend="typology-pos"/>.</para>

    <para>You can group the different senses of a headword with
      <sgmltag>sense</sgmltag>.  The numbering is optional. Translations are
      given in the <sgmltag>tr</sgmltag> element. Multiple
      <sgmltag>tr</sgmltag> elements may be given. Grammatical information is
      optional for each of them.</para>
    <para>With <sgmltag>eg</sgmltag> (exempli gratia) you can give examples
      of usage and optionally their translation.</para>
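
    <para>As an illustration of several senses and several translation
      equivalents in one entry, consider the sketch below. The headword and
      its German equivalents are our own choice and do not come from a
      FreeDict dictionary. Each equivalent is wrapped in its own
      <sgmltag>trans</sgmltag> so that it can carry its own gender, which is
      one possible arrangement rather than the required one.</para>

    <programlisting>&lt;entry&gt;
  &lt;form&gt;&lt;orth&gt;bank&lt;/orth&gt;&lt;/form&gt;
  &lt;gramGrp&gt;&lt;pos&gt;n&lt;/pos&gt;&lt;/gramGrp&gt;
  &lt;!-- first sense: the financial institution --&gt;
  &lt;sense&gt;
    &lt;trans&gt;&lt;tr&gt;Bank&lt;/tr&gt;&lt;gen&gt;f&lt;/gen&gt;&lt;/trans&gt;
    &lt;eg&gt;
      &lt;q&gt;The bank is closed.&lt;/q&gt;
      &lt;trans&gt;&lt;tr&gt;Die Bank ist geschlossen.&lt;/tr&gt;&lt;/trans&gt;
    &lt;/eg&gt;
  &lt;/sense&gt;
  &lt;!-- second sense: the edge of a river, with two equivalents --&gt;
  &lt;sense&gt;
    &lt;trans&gt;&lt;tr&gt;Ufer&lt;/tr&gt;&lt;gen&gt;n&lt;/gen&gt;&lt;/trans&gt;
    &lt;trans&gt;&lt;tr&gt;Böschung&lt;/tr&gt;&lt;gen&gt;f&lt;/gen&gt;&lt;/trans&gt;
  &lt;/sense&gt;
&lt;/entry&gt;</programlisting>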

    <example id="ex-entry-eg">
      <title>Entry with a definition as well as a translation and an example
	sentence</title>
    <programlisting>
    &lt;entry>
      &lt;form>
	&lt;orth>ban&lt;/orth>
      &lt;/form>
      &lt;gramGrp>
	&lt;pos>prep&lt;/pos>
      &lt;/gramGrp>
      &lt;sense>
	&lt;trans>&lt;tr>to&lt;/tr>&lt;/trans>
	&lt;def>denotes infinitive of the following verb&lt;/def>
	&lt;eg>&lt;q>U nang ban thoh.&lt;/q>&lt;trans>&lt;tr>Come here!&lt;/tr>&lt;/trans>&lt;/eg>
      &lt;/sense>
    &lt;/entry&gt;</programlisting>
    </example>

    <example id="ex-entry-xref">
      <title>Entry with a cross reference to a synonym</title>
    <programlisting>
     &lt;entry>
      &lt;form>
	&lt;orth>pynhiar&lt;/orth>
      &lt;/form>
      &lt;gramGrp>
	&lt;pos>v&lt;/pos>
      &lt;/gramGrp>
      &lt;sense>
	&lt;trans>&lt;tr>abase&lt;/tr>&lt;/trans>
	&lt;xr type="syn">&lt;ref>pynrit&lt;/ref>&lt;/xr>
      &lt;/sense>
    &lt;/entry></programlisting>
    </example>



    <example>
      <title>Entry giving the usage domain of a translation</title>
    <programlisting>
    &lt;entry>
      &lt;form>
	&lt;orth>pungkjat&lt;/orth>
      &lt;/form>
      &lt;gramGrp>
	&lt;pos>n&lt;/pos>&lt;gen>f&lt;/gen>
      &lt;/gramGrp>
      &lt;sense>
	&lt;usg type="dom">bio&lt;/usg>
	&lt;trans>&lt;tr>leg&lt;/tr>&lt;/trans>
      &lt;/sense>
    &lt;/entry></programlisting>
    </example>

    <variablelist><title>Supported elements in entries (children
	 of <sgmltag>entry</sgmltag>)</title>
      <varlistentry>
	<term><literal>form</literal></term>
	<listitem>
	  <para>for grouping of <sgmltag>orth</sgmltag> and
	    <sgmltag>pron</sgmltag> elements</para>
	</listitem>
      </varlistentry>
      <varlistentry>
	<term><literal>orth</literal></term>
	<listitem>
	  <para>becomes a headword</para>
	</listitem>
      </varlistentry>

      <varlistentry>
	<term><literal>pron</literal></term>
	<listitem>
	  <para>pronunciation; optional</para>
	</listitem>
      </varlistentry>

      <varlistentry>
	<term><literal>gramGrp</literal></term>
	<listitem>
	  <para>grouping of <sgmltag>pos</sgmltag>, <sgmltag>num</sgmltag>
	    and <sgmltag>gen</sgmltag>, which give part of speech, number
	    and gender; giving grammatical information is optional;
	    <sgmltag>pos</sgmltag>, <sgmltag>num</sgmltag>
            and <sgmltag>gen</sgmltag> are allowed after <sgmltag>tr</sgmltag>
            as well</para>
	</listitem>
      </varlistentry>

     <varlistentry>
	<term><literal>sense</literal></term>
	<listitem>
	  <para>becomes a numbered sense; optional</para>
	</listitem>
      </varlistentry>

     <varlistentry>
	<term><literal>usg type="dom"</literal></term>
	<listitem>
	  <para>domain to which a sense applies
	    <note>
	      <para>To make dictionaries easier for machines to process, the
		contents of this element should be standardized across
		dictionaries. A table could translate the content to a specific
		language, if required.</para>
	    </note>
	  </para>
	</listitem>
      </varlistentry>

     <varlistentry>
	<term><literal>trans</literal></term>
	<listitem>
	  <para>groups information relating to a translation equivalent; the
	    translation equivalent itself is given in a <sgmltag>tr</sgmltag>
	    element; other information regarding a single translation
	    equivalent would be grammatical information, especially
	    <sgmltag>gen</sgmltag> (the part of speech is assumed to be the
	    same as that of the headword); optional by DTD; usage recommended
	    <warning>
	      <para>This element was used incorrectly in connection with the
		<sgmltag>tr</sgmltag> element in FreeDict. Do not copy those
		mistakes!</para>
	    </warning>
	  </para>
	</listitem>
      </varlistentry>

      <varlistentry>
	<term><literal>tr</literal></term>
	<listitem>
	  <para>contains a literal translation equivalent; when building a
	    reverse index, the contents of these elements will be used instead
	    of those of <sgmltag>orth</sgmltag> elements for indexed
	    headwords</para>
	</listitem>
      </varlistentry>

      <varlistentry>
	<term><literal>def</literal></term>
	<listitem>
	  <para>contains a definition; is taken over verbatim; in a bilingual
	    dictionary, grammatical particles sometimes cannot be translated,
	    so a definition of their function is more appropriate</para>
	</listitem>
      </varlistentry>

      <varlistentry>
	<term><literal>eg/q</literal></term>
	<listitem>
	  <para>"exempli gratia"; example sentence; optional; the example itself
	    must be contained in a <sgmltag>q</sgmltag> element to mark it as a
	    quote. A translation of the example can also be given.</para>
	</listitem>
      </varlistentry>

      <varlistentry>
	<term><literal>xr</literal></term>
	<listitem>
	  <para>a cross reference; optional; the target is given in a
	    <sgmltag>ref</sgmltag> element; transformed to
	    {} in dictd database format and {sa} in BEDic format</para>
	</listitem>
      </varlistentry>

      <varlistentry>
	<term><literal>note</literal></term>
	<listitem>
	  <para>free notes; the attribute "resp" is reserved to contain the
	    translator string (id or name or email), only used in
	    <command>xdf2tei.pl</command>; normally notes are rendered inside
	    parentheses, so don't include parentheses here</para>
	</listitem>
      </varlistentry>
    </variablelist>
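
    <para>As a rough sketch of how several of these elements can combine,
      the following two entries were made up for illustration (an
      English-German entry and a Japanese-English particle entry); they are
      not taken from any FreeDict dictionary. The placement of
      <sgmltag>trans</sgmltag> inside <sgmltag>eg</sgmltag> is an assumption
      based on the descriptions above, so validate against the DTD before
      relying on it.</para>

    <example>
      <title>Sketch of entries using pron, gen after tr, eg/q, note and def</title>
    <programlisting>
    &lt;entry>
      &lt;form>
	&lt;orth>house&lt;/orth>
	&lt;pron>haʊs&lt;/pron>
      &lt;/form>
      &lt;gramGrp>
	&lt;pos>n&lt;/pos>
      &lt;/gramGrp>
      &lt;sense>
	&lt;trans>&lt;tr>Haus&lt;/tr>&lt;gen>n&lt;/gen>&lt;/trans>
	&lt;eg>
	  &lt;q>The house is big.&lt;/q>
	  &lt;trans>&lt;tr>Das Haus ist groß.&lt;/tr>&lt;/trans>
	&lt;/eg>
	&lt;note>hypothetical entry for illustration&lt;/note>
      &lt;/sense>
    &lt;/entry>

    &lt;entry>
      &lt;form>
	&lt;orth>は&lt;/orth>
	&lt;pron>wa&lt;/pron>
      &lt;/form>
      &lt;sense>
	&lt;def>particle marking the topic of a sentence&lt;/def>
      &lt;/sense>
    &lt;/entry></programlisting>
    </example>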

    <para>Please refer to the TEI Guidelines and take a look at the XML markup
      of a dictionary for more examples.</para>

  </section>

  <section id="template">
    <title>TEI Dictionary Template</title>

    <para>You can use the <firstterm>FreeDict Dictionary TEI XML
	Template</firstterm> as a starting point for your own dictionaries. You
      can download it from <ulink url="la1-la2.template.tei">here</ulink>.</para>

    <programlisting>&la1-la2.template.tei;</programlisting>
  </section>

  <section id="quality">
    <title>Dictionary Quality</title>

    <para>As dictionaries are used as authoritative sources for people looking
      up the spelling of words or learning foreign languages, it is important
      that they maintain a high quality standard.</para>

    <para>The following quality criteria are important:</para>

    <variablelist><title>Quality Criteria for Dictionaries</title>
      <varlistentry>
	<term>Correctness</term>
        <listitem>
	  <para>Dictionaries are consulted as authoritative sources, as noted
	    above, so incorrect entries mislead their users.</para>
        </listitem>
      </varlistentry>

       <varlistentry>
	<term>Headword Count</term>
        <listitem>
	  <para>It is frustrating not to find a word in a dictionary.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
	<term>Usability</term>
        <listitem>
	  <para>This mainly concerns the platforms that our dictionaries
	    are provided for. Dictionaries should be easy to install, and word
	    lookup should be easy as well. Electronic dictionaries have a great
	    advantage in the speed of entry lookups and can provide lookup
	    strategies that paper dictionaries can't. Paper dictionaries have
	    the advantage of being quite portable, up to a number of kg that
	    depends on the capacity of the bearer. Nowadays a PDA can replace a
	    whole stack of dictionaries while remaining portable.</para>
        </listitem>
      </varlistentry>
    </variablelist>

    <section id="improving-quality">
      <title>Improving Quality</title>

      <para>So what can be done to meet the above criteria? Indirectly, you
	can always support the people working on dictionaries by providing
	resources for them. To improve quality directly, the following can be
	done:</para>

      <variablelist>
	<title>Means to Improve Dictionary Quality</title>

	<varlistentry>
	  <term>Revise entries manually</term>
	  <listitem>
	    <para>The lack of a comfortable editor currently makes this a bit
	      hard. It requires profound knowledge of the languages of the
	      dictionary, so this task is reserved for experts.</para>
	  </listitem>
	</varlistentry>

        <varlistentry>
          <term>Spellcheck the dictionary</term>
          <listitem>
            <para>When you write a dictionary for a language for which no
	      spellchecker wordlist exists yet, you of course cannot do this
	      for that language. But most dictionaries will translate from some
	      language into English, and the English part of the dictionary
	      can be spellchecked! This requires a suitable tool.</para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term>Check sanity/completeness of entries</term>
          <listitem>
            <para>This can help to spot entries without part of speech
	      information or entries carrying editorial marks that request
	      clarification. It requires a supporting tool.</para>
	  </listitem>
        </varlistentry>

        <varlistentry>
          <term>Report and fix bugs</term>
          <listitem>
            <para>This is a natural activity for Open Source Software.</para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term>Grow the dictionary</term>
          <listitem>
	    <para>Provided there is a way to enter or submit new entries, the
	      question arises of where to get new entries or information
	      to extend existing entries. This is quite a complex topic, so it
	      might get its own section one day when it has grown up (are
	      there parallels to "Growing a Language" by Guy
	      Steele?).</para>

	    <para>Having a miss during word lookup creates a likely candidate
	      to be added to the dictionary if the query was not misspelled.
	      Usually the translation can be found in another dictionary (that
	      is what I do when I have a miss). This combination of headword
	      and translation can be added to the dictionary you want to
	      grow.</para>

	    <para>Doing it systematically can be called copying a dictionary.
	      Watch out for author's rights here. Nobody can own words, but
	      compiling a dictionary can be quite some work, so acknowledge
	      this!</para>

	    <para>Often, partial information is enough to get you going.
	      Wordlists of the headwords or translation equivalents of one of
	      the languages of the dictionary can help to grow a dictionary.
	      With a wordlist you only have to answer slightly easier questions
	      like "What is the translation of word
	      <replaceable>XXX</replaceable> in the other language?" or "What
	      is the part of speech of word
	      <replaceable>XXX</replaceable>?".</para>

	    <para>Sometimes wordlists are quite easy to get. You can for
	      example extract them from existing dictionaries or spellchecker
	      databases. The tool <command>index2wordlist.pl</command> in the
	      <filename>tools/testing</filename> directory can make a word
	      list out of a dictd database index file. The <command>aspell
		dump</command> command can give you the wordlist of a database
	      of the aspell spellchecker.</para>

	    <para>For languages where you cannot reuse existing word lists,
	      e.g. when you are spearheading the development of the first-ever
	      dictionary of a minority language, the situation is slightly more
	      difficult.</para>

	    <para>If electronic documents - preferably websites -
	      in that language exist, you can use a Natural Language Processing
	      technique that employs <firstterm>seed</firstterm> words as
	      input. The seeds you supply should be specific to your
	      language, i.e. they should not be used in other languages. Then
	      you can identify the electronic documents containing those seeds.
	      They are likely to be in your language. From the documents in
	      your language you can extract additional words which you can
	      reuse to find more documents in your language and more words in
	      turn. In hypertexts you can exploit a locality feature of
	      documents in a certain language: The links are likely to lead to
	      documents in the same language. So you can get more words from
	      there as well.</para>

	     <para>Prof. Kevin Scannell implemented this technique in <ulink
		 url="http://borel.slu.edu/crubadan/">An Crúbadán</ulink>, a
	       crawler that uses the <ulink url="http://www.google.com/apis/">Google
		 API(s)</ulink> to find websites.</para>

          </listitem>
        </varlistentry>


      </variablelist>
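
      <para>The completeness check described above could be sketched as an
	XSLT 1.0 stylesheet. This is only an illustration, not an existing
	FreeDict tool: the file name <filename>missing-pos.xsl</filename> is
	hypothetical, and the stylesheet assumes that the TEI elements are in
	no namespace. It prints the first headword of every entry that has no
	<sgmltag>pos</sgmltag> element in a <sgmltag>gramGrp</sgmltag> child,
	so an entry that gives <sgmltag>pos</sgmltag> only after a
	<sgmltag>tr</sgmltag> would be reported as well.</para>

      <programlisting>&lt;?xml version="1.0" encoding="utf-8"?>
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  &lt;!-- plain text output: one headword per line -->
  &lt;xsl:output method="text" encoding="utf-8"/>

  &lt;xsl:template match="/">
    &lt;!-- select every entry lacking part of speech information -->
    &lt;xsl:for-each select="//entry[not(gramGrp/pos)]">
      &lt;!-- first orth of the entry, with surrounding whitespace removed -->
      &lt;xsl:value-of select="normalize-space(form/orth)"/>
      &lt;xsl:text>&amp;#10;&lt;/xsl:text>
    &lt;/xsl:for-each>
  &lt;/xsl:template>

&lt;/xsl:stylesheet></programlisting>

      <para>With an XSLT processor such as <command>xsltproc</command> this
	could be run as <command>xsltproc missing-pos.xsl
	  <replaceable>la1-la2</replaceable>.tei</command>. Selecting
	<literal>//tr</literal> and printing <literal>normalize-space(.)</literal>
	for each selected element would list the translation equivalents
	instead, which could then be passed to a spellchecker.</para>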
    </section>
  </section>

</chapter>

<!-- Keep this comment at the end of the file. It is for emacs.

 Why emacs' psgml extension cannot validate using the parent document?

 Local Variables:
 mode: xml
 sgml-doctype: "freedicthowto.xml"
 sgml-parent-document: ("freedicthowto.xml" "book" "part" "chapter")
 sgml-trace-entity-lookup: t
 End:
 -->


