Chapter 5. Writing a FreeDict Dictionary

Table of Contents

Introduction
The FreeDict Entry Format
Best Practices
Two Approaches

Abstract

This chapter deals with the process of building a FreeDict format file from your gathered sources. We cover TEI DTD installation, SGML and XML catalog configuration, some introductory-level XML, final formats, and a couple of shortcuts.

Introduction

What we don't deal with here is the actual process of collating a translating dictionary. That task is potentially endless and will be very particular to your own circumstances. You need to develop your own approaches and processes for gathering source materials and checking the quality of your entries. Here are some of the "process" things you might look out for and check with small sample sets before you get too far along.

  1. Your editor or word processor can output plain text in UTF-8: not word-processor or browser-specific markup, just simple text that can represent the characters of the languages you are writing for. Using different fonts, while helpful in a word processor, generally won't survive in plain text (or UTF-8) format, and almost certainly won't in your final output version.

  2. If you are importing from a spreadsheet application, try exporting the pages in simple Comma Separated Values (CSV) format. You can often use almost any character or set of characters as the "comma". You may be able to convert the result to Dictd format with a simple script, in which case we have a shortcut for you: see Chapter 7, The Dictd Approach.

  3. If you are starting from scratch and writing your dictionary mostly by hand, please consider using a template and an XML editor like (X)Emacs. These make the process much less error-prone and less tedious. See the tools section for more information.
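
The spreadsheet route in item 2 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a FreeDict tool: the function name csv_to_plain, the two-column input (headword, translation), the semicolon as the "comma", and the output layout (headword in column 0, translations indented below it) are all hypothetical choices made for illustration. Check what your downstream tools actually expect before relying on it.

```python
import csv
import io


def csv_to_plain(csv_text, delimiter=";"):
    """Turn 'headword;translation' rows into a simple plain-text listing.

    Assumed (hypothetical) output layout: the headword starts in
    column 0 and each translation follows on its own line, indented
    by eight spaces.
    """
    entries = {}
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        headword, translation = row[0].strip(), row[1].strip()
        if not headword or not translation:
            continue  # skip rows with an empty cell
        # Collect every translation of the same headword in one entry.
        entries.setdefault(headword, []).append(translation)

    lines = []
    for headword in sorted(entries):
        lines.append(headword)
        lines.extend("        " + t for t in entries[headword])
    return "".join(line + "\n" for line in lines)
```

When saving the result, open the output file with an explicit encoding, for example open(path, "w", encoding="utf-8"), so that the file is UTF-8 regardless of the system locale.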

The FreeDict Entry Format

Abstract

Though we claim to adhere to TEI P4 XML, Chapter 12 "Print Dictionaries", additional rules and restrictions apply.

At first sight the TEI guidelines are very complex. At second sight they still are, but it is important to note that they were written with the primary aim of encoding as much existing text as possible by marking it up to a reasonable level of detail. The wide variety of existing text makes the TEI tagset very permissive, allowing almost any tag to be used inside any other.

This permissiveness makes it difficult to write software that converts "pure TEI" into other formats such as TeX, Formatting Objects or text.

Besides being too permissive, the TEI Guidelines are incomplete for our needs, because they do not define any typologies. Typologies are needed for encoding different things in our dictionaries:

  • the Part of Speech of headwords, i.e. the contents of pos elements. Should verbs be marked as 'v', 'verb' or 'Verb'?

  • the Usage Domain of entry meanings - technology, botany etc.

  • the type of Cross References - whether the reference points to a synonym, an alternative spelling, a derived word etc.

Of course, the same typologies should be shared across many dictionaries, allowing us to keep the processing software simple. If required, they can be localized before being presented to a dictionary user.

For these reasons, it is part of FreeDict's agenda to develop language-neutral typologies for the above-mentioned things.

Table 5.1. Part of Speech Typology (recommended contents of the pos element)

Element Content   Meaning
n                 noun
v                 verb (transitivity unknown)
vt                transitive verb
vi                intransitive verb
vti               transitive and intransitive verb
adv               adverb
adj               adjective
conj              conjunction
prep              preposition
int               interjection
pron              pronoun
art               article
num               numeral
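
A shared typology is easy to check mechanically. The following is a minimal sketch, assuming entries without XML namespaces and pos elements that contain plain text. The function name unknown_pos_values and the sample markup are hypothetical and only illustrate the idea; the allowed values are the ones listed in Table 5.1.

```python
import xml.etree.ElementTree as ET

# Recommended contents of the pos element, taken from Table 5.1.
POS_TYPOLOGY = {
    "n", "v", "vt", "vi", "vti", "adv", "adj",
    "conj", "prep", "int", "pron", "art", "num",
}


def unknown_pos_values(tei_text):
    """Return the pos contents in tei_text that are not in Table 5.1."""
    root = ET.fromstring(tei_text)
    found = []
    for pos in root.iter("pos"):
        value = (pos.text or "").strip()
        if value not in POS_TYPOLOGY:
            found.append(value)
    return found


# Hypothetical sample; the element nesting is an assumption made
# for illustration, not a normative FreeDict entry.
SAMPLE = """
<body>
  <entry>
    <form><orth>house</orth></form>
    <gramGrp><pos>n</pos></gramGrp>
  </entry>
  <entry>
    <form><orth>run</orth></form>
    <gramGrp><pos>Verb</pos></gramGrp>
  </entry>
</body>
"""
```

Calling unknown_pos_values(SAMPLE) reports the value 'Verb', which answers the question above for this typology: verbs are marked 'v', 'vt', 'vi' or 'vti', not 'verb' or 'Verb'.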

It has been suggested that the TEI DTD be extended with additional attributes on entries, such as:

      dictionary - which dictionary is the word in (eg.
                   eng-deu - so entries can be distributed on their own)
      author     - who edited the word last - it's nice to know who did the work
      version    - which version of the word
      date       - the time the word was last edited
      quality    - how good do we think that the translation is
                   this would give a hint about what words should be worked on next
      frequency  - how frequent is the word in the language (should also be present in sense?)
    

XXX compare with other terminological DTDs, link to this mail in archive

Since TEI XML does not currently limit us, its extension is not actively pursued.

Best Practices

  • Avoid using more than one orth element per entry. Instead, create separate entries and link them to each other.

  • Put question marks into the note elements of entries that still need review. Using this convention, other editors will be able to find those entries easily.
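
Both conventions lend themselves to a quick automated check. The sketch below is illustrative only: the function name review_entries and the sample markup are hypothetical, and it assumes namespace-free XML in which note elements appear somewhere inside the entry they belong to.

```python
import xml.etree.ElementTree as ET


def review_entries(tei_text):
    """Return (multi_orth, flagged) lists of headwords.

    multi_orth: entries containing more than one orth element.
    flagged:    entries with a question mark inside a note element.
    """
    root = ET.fromstring(tei_text)
    multi_orth, flagged = [], []
    for entry in root.iter("entry"):
        orths = [(o.text or "").strip() for o in entry.iter("orth")]
        # Use the first orth as a label when reporting the entry.
        label = orths[0] if orths else "(no orth)"
        if len(orths) > 1:
            multi_orth.append(label)
        if any("?" in "".join(n.itertext()) for n in entry.iter("note")):
            flagged.append(label)
    return multi_orth, flagged


# Hypothetical sample entries, for illustration only.
SAMPLE = """
<body>
  <entry>
    <form><orth>colour</orth><orth>color</orth></form>
  </entry>
  <entry>
    <form><orth>tree</orth></form>
    <note>?</note>
  </entry>
</body>
"""
```

For the sample above, review_entries(SAMPLE) lists 'colour' as an entry with more than one orth element and 'tree' as an entry still marked for review.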

Two Approaches

There are at least two approaches you might take to building a FreeDict format dictionary. Approach One is to use the Text Encoding Initiative DTD from the beginning. This gives you the most flexibility.

Approach Two involves producing a simply (and accurately) formatted plain text file that you then process with some command line tools (and will probably have to touch up). This can be quicker if you are comfortable with it, but limits your options for lexicographic information.

You may of course combine these or find any number of others. After all, it's your dictionary; we just need it in a certain format :)