Chapter 7. The Dictd Approach

Table of Contents

The Dictd Database Format
Converting dictd database format files into TEI
Sorting a .index file

Abstract

This section explains how to build most of a TEI-format dictionary from a file laid out in the same way as a dictd-compliant .dict file. Beware that the .index file is ignored!

The approach relies on your dictionary being laid out in a very simple format. You will still have to construct the TEI header, but most of the work on the actual content of the dictionary will be done for you.

The Dictd Database Format

If you have any dictionaries in dictd database format installed, you may open one of the dictionary-name.dict.dz files to have a look at the format and contents. To decompress a .dz file you need either the dictunzip tool that comes with dictd, or gunzip. The dictzip format extends gzip with extra header data, so gzip can decompress the file as well; it simply discards the extra header data.
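
The same decompression can be done from Python, because the standard gzip module reads a dictzip file as an ordinary gzip stream. The following is only a minimal sketch; the function name decompress_dz and the file names in the comment are hypothetical.

```python
import gzip
import shutil


def decompress_dz(src_path, dest_path):
    """Write the decompressed contents of a .dict.dz file to dest_path."""
    # gzip.open treats the dictzip file as plain gzip and skips the
    # extra header field that dictzip adds.
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


# Hypothetical usage:
# decompress_dz("dictionary-name.dict.dz", "dictionary-name.dict")
```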

Beyond the dictd header section you will notice that the file is a text file with a simple and predictable format.

When a dictd dictionary is built using dictfmt, two files are created. The dictionary-name.dict file, the one we are interested in here, contains the data that is presented to the user when she asks for the translation or definition of a word. The second file, dictionary-name.index, is a listing of the position and length of the definitions in the .dict file. Together they form an indexed database of headwords and definitions.
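
To illustrate how a position and a length identify one definition, here is a minimal sketch that reads a single entry from an uncompressed .dict file. The function name read_entry is hypothetical, and the sketch assumes that the position is a byte offset and that the file is encoded as UTF-8.

```python
def read_entry(dict_path, offset, length):
    """Return the text stored at byte `offset`, `length` bytes long."""
    # Assumption: dict_path is an uncompressed, UTF-8 encoded .dict file.
    with open(dict_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).decode("utf-8")
```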

Here is an annotated snippet from the freedict-eng-lat.dict file. The remarks marked with <comment/> are annotations and are not part of the file.

Example 7.1. A freedict-eng-lat.dict snippet

	  00-database-info     <comment/> A formatted string dictd knows about
   3. Apr. 2000 Database was converted to TEI format and checked
   into CVS 9.Jan.2000Phonetics added (H.Ey) - machine generated
   from MBRODICT( http://tcts.fpms.ac.be/synthesis/mbrdico
   )1.Jan 2000This Database was generated from ergane
   (http://www.travlang.com).- Thanks!Copyright (C) 1999 Horst
   Eyermann (Horst@freedict.de)This program is free software;
   you can redistribute it and/or modify it under the terms of
   the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or(at
   your option) any later version. This program is distributed
   in the hope that it will be useful,but WITHOUT ANY WARRANTY;
   without even the implied warranty of MERCHANTABILITY or
   FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
   License for more details.You should have received a copy of
   the GNU General Public License along with this program; if
   not, write to the Free Software Foundation, Inc., 59 Temple
   Place - Suite 330, Boston, MA 02111-1307, USA.
<comment/> A space (not TAB) indented info block

00-database-short
   English-Latin Freedict Dictionary
<comment/> The dictionary name, usually different than the file name
00-database-url
   http://www.freedict.org
   <comment/> The website of the origin of this dictionary

ABC /eibiːsiː/
     abecedarium

Abyssinian /əbisiniən/
     Æthiops

Academy /əkædəmiː/
     Academia

Achaea /ətʃiə/
     Achaia

Achaia  /ətʃiə/
     Achaia

Acheron /ətʃerən/
     Acheron
     
Actium  /ækʃaim/
     Actium

Adam  /ædəm/
     Adam

Adriatic  /ədriætik/
     Hadria

Adriatic Sea  /ədriætiksiə/
     Hadria

Aeneas  /əniːz/
     Æneas

Aeolus  /iːələs/
     Æolus

	  <comment/> Snipped

zither /ziðər/
     cithara

zone /zoun/
     zona

	  <comment/> Notice the empty lines between entries
	

So an entry has this format: a blank line precedes it, the headword starts at the beginning of a line (column 0), and the translation follows on the next line, indented by at least one space.

Like so:

Headword
    Translation

Headword2
    Translation2
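
The layout rules above can be sketched as a small parser. This is only an illustration under stated assumptions, not the code used by any FreeDict tool: the function name parse_entries is hypothetical, and everything on the headword line (including a pronunciation such as /zoun/) is kept as part of the headword.

```python
def parse_entries(text):
    """Return a list of (headword, translation) tuples."""
    entries = []
    headword = None
    body = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank or whitespace-only line between entries
        if line[0] in " \t":
            # Indented line: belongs to the translation of the current entry.
            if headword is not None:
                body.append(line.strip())
        else:
            # Line starting in column 0: a new headword begins.
            if headword is not None:
                entries.append((headword, "\n".join(body)))
            headword = line.strip()
            body = []
    if headword is not None:
        entries.append((headword, "\n".join(body)))
    return entries
```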
	  


Example 7.2. The dictd .index format

The corresponding .index file is built by the dictfmt tool and looks like this:

00-database-info        Q       QM
00-database-short       Qd      3
00-database-url RV      q
abacus  BZv     3
abbess  Ban     e
abbey   BbG     b
abbot   Bbi     a
abbreviate      Bb9     BB
ABC     SA      i
abdicate        Bc/     q
abdication      Bdq     q
abdomen BeV     BW
abductor        Bfs     k
aberration      BgR     x
abet    BhD     8
abhor   BiA     s


When running the dictd (or Serpento) dictionary server, these files are used for matching queries with headwords.
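
The second and third columns of the .index file are the position and length of the entry in the .dict file. dictd writes these numbers in a base64 notation, most significant digit first. The following minimal sketch decodes one index line; the function names are hypothetical, and the sketch assumes tab-separated columns.

```python
# Digit alphabet of the base64 notation, in order of value (A = 0, / = 63).
B64_DIGITS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_to_int(text):
    """Decode a base64-notated number, most significant digit first."""
    value = 0
    for char in text:
        value = value * 64 + B64_DIGITS.index(char)
    return value


def parse_index_line(line):
    """Split one index line into (headword, offset, length)."""
    # Assumption: the columns are separated by single TAB characters.
    headword, offset, length = line.rstrip("\n").split("\t")[:3]
    return headword, b64_to_int(offset), b64_to_int(length)
```

For the first line of the example above, the position Q decodes to 16 and the length QM to 1036.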

Converting dictd database format files into TEI

We assume that you have your headwords and their translations written in the simple format described above. You might need to convert a spreadsheet or some other document into this format first. As there are many possible starting points, we cannot describe how to do that here.

Alternatively, you may have an existing dictd dictionary file, or you may be starting from scratch. If you are starting from scratch, we recommend using a template, as demonstrated later. If you have a lot of lexicographic, etymological or other information to add to your dictionary, we strongly suggest using a template or a fully fledged XML editor.

Download the dict2tei.py Python script from the tools package on the FreeDict servers at SourceForge.

Follow the instructions included in the package to install the script and run it on your file.

All you need to do is something like: dict2tei.py -f your-dict-format.dict -o same-working-name and the rest should happen automatically.

Now, hopefully, all you have to do is mark up any extra entries and add the TEI header information. Please see the Writing TEI and Installing TEI sections.
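
For orientation, the kind of transformation such a converter performs can be sketched as follows. This is not the dict2tei.py script and may not match its output: the function name entry_to_tei is hypothetical, and the element layout (entry, form, orth, sense, cit, quote) is an assumption based on TEI P5 dictionary markup.

```python
import xml.etree.ElementTree as ET


def entry_to_tei(headword, translation):
    """Return one dictionary entry as a TEI-style XML string."""
    # Assumed layout: <entry><form><orth/></form>
    #                 <sense><cit type="trans"><quote/></cit></sense></entry>
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword
    sense = ET.SubElement(entry, "sense")
    cit = ET.SubElement(sense, "cit", {"type": "trans"})
    ET.SubElement(cit, "quote").text = translation
    return ET.tostring(entry, encoding="unicode")
```

Building the elements with an XML library rather than by string concatenation ensures that characters such as & and < in a headword are escaped correctly.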

Sorting a .index file

Sometimes a match lists headwords that yield no entry when they are looked up. In that case, it is likely that the index is sorted incorrectly. For a word to be found, the way the index is sorted and the way the dict server searches for entries have to be exactly the same.

In that case the index can be sorted again, using a command such as:

LC_ALL=C sort -t $'\t' -k1,1 -bdf broken.index >working.index

Note the LC_ALL=C: leaving it out can produce a broken index, because sort then orders the lines by the collation rules of your current locale.
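
To check whether an index needs sorting at all, the ordering produced by the command above can be approximated in Python. This is only a sketch under stated assumptions: the function names are hypothetical, and the key imitates the -b, -d and -f options in the C locale (skip leading blanks, compare only ASCII letters, digits and blanks, fold lowercase to uppercase) without reproducing every detail of sort.

```python
def index_sort_key(line):
    """Approximate the key of `LC_ALL=C sort -t TAB -k1,1 -bdf`."""
    headword = line.split("\t", 1)[0].encode("utf-8")
    headword = headword.lstrip(b" \t")  # -b: skip leading blanks
    kept = bytes(
        byte for byte in headword
        # -d: keep only ASCII letters, digits and blanks
        if bytes([byte]).isalnum() or byte in b" \t"
    )
    return kept.upper()  # -f: fold lowercase to uppercase


def find_unsorted(lines):
    """Return the position of the first out-of-order line, or None."""
    keys = [index_sort_key(line) for line in lines]
    for position in range(1, len(keys)):
        if keys[position] < keys[position - 1]:
            return position
    return None
```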