Chapter 2. XML Markup

FreeDict dictionaries are marked up using the XML version of the Text Encoding Initiative (TEI) DTD, Chapter 12 (Dictionaries).

This may seem a little daunting at first, but please read on as we are working quickly to make this much easier. Full instructions are given in the "Writing a FreeDict Dictionary" section. We are also developing and testing tools that may be suitable for automating most of this.

There are many advantages to using a standard, content-based approach like this:

Advantages of using the TEI XML markup format

  1. Inherits most of the advantages of XML including:

    • content-based rather than layout-based

    • application independent

    • platform independent

    • further processing readily possible across the entire FreeDict collection

    • enables full use of existing or customised XML technologies

  2. Standardises input and output formats.

  3. Protects against obsolescence.

  4. TEI has comprehensive DTDs available.

    • The Dictionary DTD is just one of a very wide conceptual set.

    • Elements already exist for lexicographic, etymological, phonetic and other particularities of dictionaries.

    • The TEI XML combination allows processing, development and use beyond the immediate scope of the FreeDict translating dictionaries.

  5. TEI technologies are reasonably well understood and used in academic circles.
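As a rough illustration of the "content-based" and "further processing" points above, the following minimal sketch reads one dictionary entry with a standard XML parser. The sample entry and the helper name summarize_entry are assumptions made for this example: the element names follow TEI dictionary markup, but the exact structure FreeDict requires is not defined here.

```python
import xml.etree.ElementTree as ET

# Illustrative entry only. The element names follow TEI dictionary markup,
# but the exact structure FreeDict requires is an assumption here.
SAMPLE = """
<entry>
  <form><orth>Haus</orth></form>
  <gramGrp><pos>n</pos></gramGrp>
  <trans><tr>house</tr></trans>
</entry>
"""


def summarize_entry(xml_text):
    """Return (headword, part of speech, translations) for one entry."""
    entry = ET.fromstring(xml_text)
    headword = entry.findtext("form/orth")
    pos = entry.findtext("gramGrp/pos")
    translations = [tr.text for tr in entry.findall("trans/tr")]
    return headword, pos, translations


print(summarize_entry(SAMPLE))
```

Because the markup names what each piece of text is (headword, part of speech, translation) rather than how it should look, the same few lines could in principle be run over every dictionary in the collection.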

Like anything else, TEI XML also has disadvantages:

Disadvantages of using the TEI XML markup format

  1. High memory requirements, both for storage and for processing.

  2. The TEI DTD is too permissive. Because it was written to capture as many existing texts as possible, it allows overly complex content models for its elements. Since almost all elements are allowed inside others, writing software to further process TEI data becomes complex. FreeDict therefore uses its own subset of the TEI DTD. This subset will be formally defined in this Howto once it is stable; until then, it is only described.

  3. TEI is missing typologies. For example, TEI does not prescribe whether the part of speech of a noun is encoded as "noun", "n" or anything else. Another missing typology is one for cross-references, e.g. "synonym", "hypernym" etc. FreeDict has to define these itself.

  4. XML data requires more than a plain text editor for easy maintenance because of its verbosity. For example, you cannot enter entries quickly when you have to type every tag manually. To ease this, FreeDict is developing the (as yet unreleased) FreeDict-Editor.
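The missing part-of-speech typology described in item 3 can be made concrete with a small sketch. The alias table POS_ALIASES and the function normalize_pos below are hypothetical and exist only for this example; they are not a typology defined by FreeDict or by TEI. The sketch shows why a fixed value list is needed: without one, "Noun", "noun" and "n." are all valid TEI, and every tool has to guess what they mean.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping, for illustration only; not a FreeDict-defined typology.
POS_ALIASES = {
    "noun": "n", "n.": "n", "n": "n",
    "verb": "v", "v.": "v", "v": "v",
}


def normalize_pos(entry):
    """Rewrite each <pos> value in place; return the values not recognised."""
    unknown = []
    for pos in entry.iter("pos"):
        raw = (pos.text or "").strip()
        key = raw.lower()
        if key in POS_ALIASES:
            pos.text = POS_ALIASES[key]
        else:
            unknown.append(raw)
    return unknown


entry = ET.fromstring("<entry><gramGrp><pos>Noun</pos></gramGrp></entry>")
unknown = normalize_pos(entry)
print(entry.findtext("gramGrp/pos"), unknown)
```

Returning the unrecognised values rather than raising an error is a design choice for this sketch: it lets a maintainer collect every unexpected value in a dictionary in one pass.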