Chapter 6. Writing Text Encoding Initiative XML files

Table of Contents

Overview of the TEI Organisation
The TEI DTDs
The XML Declaration
The DOCTYPE Declaration
<!DOCTYPE TEI.2
PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.tei-c.org/Guidelines/DTD/tei2.dtd"
<!ENTITY % TEI.dictionaries "INCLUDE" > ]>
The TEI Header
Entry Examples
TEI Dictionary Template
Dictionary Quality
Improving Quality

Abstract

The Text Encoding Initiative. “The TEI (Text Encoding Initiative) is an international research effort established in 1987, intended to produce a community-based standard for encoding and interchange of texts.” [http://www.tei-c.org/Consortium/TEIcharter.html]

This section gives examples and explanations of using the TEI format to write and deliver your dictionary. While it is not assumed you have the DTDs installed, it may be best if you did.

Overview of the TEI Organisation

In Brief. “In December of 2000, a new consortium was established to sustain and develop the Text Encoding Initiative (TEI). Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.” [http://www.tei-c.org/Consortium/index.html]

For a full explanation of the history, objectives and approaches of the Text Encoding Initiative, visit http://www.tei-c.org/. There you will find links to and information about all aspects of the TEI. Of particular note are the Guidelines.

The TEI consortium provides an approach to transcribing and creating documents in SGML (the old versions of the TEI Guidelines, up to P3) and XML (from P4 on) that is as obsolescence-proof as possible and allows free exchange of content and ideas. We will only be using a very small subset of this system's capacity.

The TEI DTDs

Using The TEI Dictionary DTD

In these sections we are going to step through a basic introduction to building a marked up dictionary based on the Text Encoding Initiative TEI2 P4 Dictionary Document Type Definition. First we deal with some technical information. If you are familiar with XML and DTDs, you might want to skip the next sections.

The TEI system of markup can be approached in a number of ways. Here we treat it pretty much like any other XML-based markup system. The consortium itself has not registered a formal XML namespace (as of Feb 2004), but does have unofficial Public and System identifiers we may use. If you don't know what all that means, don't worry: cut-and-paste examples are given below, and the following "Walking with" section gives some basic explanations of what this all means. For a deeper understanding, a visit to the W3C XML pages may help. OASIS and associated XML portals usually have technical and introductory material as well.

http://www.w3.org/

http://xml.coverpages.org/

http://www.xml.com

The XML Declaration

<?xml version='1.0' encoding="UTF-8" ?>

The XML declaration should appear at the very top of every XML document. XML 1.0 does not strictly require it, but it is strongly recommended. For our purposes the above is all we need. Be aware, though, that this declaration can contain more than we use here. The encoding="UTF-8" pseudo-attribute is optional, because a parser assumes UTF-8 (or UTF-16, if the file starts with a byte order mark) unless told otherwise. I believe it is better to state it explicitly, so that there is no doubt about the encoding of the file.

It is the next section that defines our Text Encoding Initiative format in particular. So far we have just told a parser that this document is XML version 1.0.
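The effect of the encoding declaration can be tried out with a short Python sketch. It is only an illustration and not part of the FreeDict tools. It assumes Python 3 with the standard xml.etree.ElementTree module, and the helper name parses_name is made up for this example. The same small document is parsed three times: as UTF-8 without a declaration, as ISO-8859-1 without a declaration, and as ISO-8859-1 with a matching declaration.

```python
import xml.etree.ElementTree as ET

NAME = "Grüßner"  # a name with non-ASCII letters


def parses_name(data: bytes) -> bool:
    """Return True if the bytes parse as XML and yield NAME as element text."""
    try:
        return ET.fromstring(data).text == NAME
    except ET.ParseError:
        return False


# UTF-8 bytes, no XML declaration: the parser falls back to UTF-8.
utf8_no_decl = f"<name>{NAME}</name>".encode("utf-8")

# ISO-8859-1 bytes, no declaration: read as UTF-8, these bytes are malformed.
latin1_no_decl = f"<name>{NAME}</name>".encode("iso-8859-1")

# ISO-8859-1 bytes with a matching declaration: the parser knows what to expect.
latin1_with_decl = (
    f'<?xml version="1.0" encoding="ISO-8859-1"?><name>{NAME}</name>'
).encode("iso-8859-1")

if __name__ == "__main__":
    for label, data in [
        ("UTF-8, no declaration", utf8_no_decl),
        ("ISO-8859-1, no declaration", latin1_no_decl),
        ("ISO-8859-1, declared", latin1_with_decl),
    ]:
        print(label, "->", parses_name(data))
```

Only the second case fails, which is why a file that is not UTF-8 (or UTF-16) must declare its encoding.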

The DOCTYPE Declaration

<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
 <!ENTITY % TEI.XML          "INCLUDE" >
 <!ENTITY % TEI.dictionaries "INCLUDE" >
      ]>

The above is called the DOCTYPE declaration (formally, the document type declaration). The UPPER case words are all keywords that are used to set variables and types in parsing and validating tools. The rest of the declaration is very carefully structured to provide absolute definitions for a number of things. We step through them here, as they are understood by the protocols involved. Together with the XML declaration it forms the document prolog; the part between the square brackets is called the internal DTD subset. Functionally it defines or even redefines a specific named document type.

<!DOCTYPE TEI.2

The combined effect of this opening string is to tell a program (or a knowledgeable person) that this document claims to conform to a defined type of document, in our case TEI.2.

Other document types (e.g. DocBook) use this parameter to set the "level" the document sits at, possibly within a larger definition. Text Encoding Initiative documents use a different system to achieve much the same end (see the following description of ENTITIES).

Practically, another way of thinking about this is that the root element (the outermost element) of the document that follows has to be of the type TEI.2. What that element is allowed to have as attributes or child elements is defined in the Document Type Definition for this DOCTYPE.

PUBLIC "-//TEI P4//DTD Main Document Type//EN"

This "phrase" is called the Formal Public Identifier. It starts with the word PUBLIC and in this case ends with a language identifier. The quoting is important as is the string -// at the start of the actual definition.

The leading -// means that the owner of this identifier is unregistered. The identifier of a registered owner would start with +//. Most Formal Public Identifiers are of the unregistered type; there is nothing unusual or pejorative about it.

The FPI must be exactly as shown or it means “something else”, and probably not what we need it to. It is also one of the ways your local system knows which DTD to compare your document to, and so must be correct.
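The parts of a Formal Public Identifier described above can be pulled apart with a few lines of Python. This is a simplified sketch, not a complete FPI parser: the function name parse_fpi and the dictionary keys are made up for this example, and it only handles identifiers of the four-part form used here (prefix, owner, text class with description, language).

```python
def parse_fpi(fpi: str) -> dict:
    """Split an FPI of the form PREFIX//OWNER//CLASS DESCRIPTION//LANGUAGE."""
    parts = fpi.split("//")
    if len(parts) != 4:
        raise ValueError(f"not a four-part public identifier: {fpi!r}")
    prefix, owner, text, language = parts
    if prefix not in ("+", "-"):
        raise ValueError(f"unexpected prefix: {prefix!r}")
    # The first word of the third part is the text class (DTD, ENTITIES, ...).
    text_class, _, description = text.partition(" ")
    return {
        "registered": prefix == "+",  # "-" marks an unregistered owner
        "owner": owner,
        "text_class": text_class,
        "description": description,
        "language": language,
    }


if __name__ == "__main__":
    print(parse_fpi("-//TEI P4//DTD Main Document Type//EN"))
```

For the identifier used in this chapter, the owner is TEI P4, the text class is DTD and the language is EN.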

Note

Text Encoding Initiative documents also use other FPIs! If you visit their website you will see many other examples. The TEI2 zip file currently has HTML documentation explaining all those options. As that is likely to be up to date, please refer to it rather than to this document for usages other than FreeDict. Please see the section called “Finding and downloading the DTDs” if you haven't already :)

"http://www.tei-c.org/Guidelines/DTD/tei2.dtd"

This next string must be a URI (Universal Resource Identifier, RFC 1630). In this case it is in URL (Uniform Resource Locator, RFC 1738) format. Here it functions as a portable System Identifier to the DTD. By convention it is often started on a new line, but that is not a requirement.

This is just one way of doing this. You may simply put the absolute path to the DTD on your system here instead, and it should be OK. If you choose to install the DTDs only for your own use, you must set the system identifier to the absolute path of your copy of the DTD.

Example 6.1. System Identifiers

<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"/home/your-account/TEI/TEI2.dtd" [
 <!ENTITY % TEI.XML          "INCLUDE" >
 <!ENTITY % TEI.dictionaries "INCLUDE" >
	]>

Be aware that this method is not portable. It may, however, save you a whole lot of bother if you don't plan to use the TEI2 DTD for much other than FreeDict.

<!ENTITY % TEI.dictionaries "INCLUDE" > ]>

The final section of the DOCTYPE declaration is enclosed in square brackets. We use it to set some parameter ENTITIES. In this case these entities switch on parts of the DTD that are contained in files belonging to the TEI2 P4 DTD set.

The first inclusion <!ENTITY % TEI.XML "INCLUDE" > does some very clever "magic": it switches the DTD to its XML-compliant form. Without it the document will not be valid XML.

The second included ENTITY <!ENTITY % TEI.dictionaries "INCLUDE" > adds the dictionary specific elements to the base DTD, and thereby makes this a dictionary as opposed to say a stage script or process design document.

You may define your own extensions to the TEI set within these square brackets as well. A careful study of the DTDs and associated files would be of great value if you are interested in gaining a greater understanding of SGML and XML. Functionally this space is where a DTD is loaded for any document type. You may also include other files in this place which can be useful for large document sets.
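To see how a parser reads the pieces of the DOCTYPE declaration discussed above, you can inspect them with Python's standard xml.dom.minidom module. This is a small sketch under some assumptions: the function name describe_doctype and the shortened document body are made up for this example. The parser used here is non-validating and does not fetch the DTD, so the sketch shows only how the declaration is read, not whether the document is valid TEI.

```python
from xml.dom import minidom

DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
"http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
 <!ENTITY % TEI.XML          "INCLUDE" >
 <!ENTITY % TEI.dictionaries "INCLUDE" >
]>
<TEI.2><text><body/></text></TEI.2>
"""


def describe_doctype(data: bytes) -> dict:
    """Return the parts of the DOCTYPE declaration as the parser sees them."""
    document = minidom.parseString(data)
    doctype = document.doctype
    return {
        "name": doctype.name,                      # must match the root element
        "public_id": doctype.publicId,             # the Formal Public Identifier
        "system_id": doctype.systemId,             # URL or local path of the DTD
        "internal_subset": doctype.internalSubset, # text between [ and ]
        "root_element": document.documentElement.tagName,
    }


if __name__ == "__main__":
    for key, value in describe_doctype(DOCUMENT).items():
        print(f"{key}: {value!r}")
```

The name reported for the DOCTYPE and the tag name of the root element are both TEI.2, which is the correspondence described in the section on <!DOCTYPE TEI.2.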

The TEI Header

Abstract

In this section we step through the structure and intent behind the TEI header section.

Headers usually give meta information, i.e. information about the data that follows. The TEI header gives general information about the following text, such as title, authors, publisher and revision history. It is defined in the TEI Guidelines, Chapter 5.

The TEI header is used to create the 00-database-info entry in the dictd database files, and it is converted to meta information for the other supported dictionary formats. The contents of the title element go into the 00-database-short entry; the contents of the sourceDesc element go into the 00-database-url entry.

Supported elements in the TEI header (contents of teiHeader)

In the following, an XPath expression is given for elements that have a different meaning depending on their parents.

titleStmt/title

This becomes 00-database-short when converted to dictd database format and the id property when converted to BEDic (see the section called “Bedic”).

titleStmt/respStmt/resp

A responsibility for this database. Values with special treatment are Author and Maintainer. Both should be shown on the FreeDict website. BEDic has a property for the maintainer, but not for the author.

titleStmt/respStmt/name

Name and email address of the person carrying the responsibility named in ../resp, ie. the contents should follow the form FirstName LastName <user@host>.

edition

This becomes the release version number. It is shown on the website and used in building filenames for releases. It is recommended to use only numbers and dots. Also, two levels of versioning should be enough, ie. 0.1 is a good start.

extent

Should contain the approximate number of headwords, including the unit "headwords". It is put into 00-database-info. The headword count for the website is extracted from the .index file of the dictd database format.

notesStmt

The notes in here are put into 00-database-info and should be available on the website as well. At present, however, a "more info on this dictionary" page doesn't exist.

sourceDesc/xptr

This becomes the source url on the website and is put into 00-database-url when converting to dictd database format.

revisionDesc

Description of the changes this database went through; bugfixes are noted here as well. It would be nice to generate the Changelog file of the dictionary module out of this.

revisionDesc/change/date

These dates should follow the format YYYY-MM-DD.
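The conventions listed above for edition, name and date can be checked mechanically. The following Python sketch is only an illustration of such a check, not an existing FreeDict tool. The function names and the exact regular expressions are assumptions made for this example, derived from the wording above (numbers and dots for the edition, FirstName LastName <user@host> for the name, YYYY-MM-DD for dates).

```python
import re
from datetime import datetime


def valid_edition(text: str) -> bool:
    """Only numbers and dots, e.g. 0.1 (more levels are accepted as well)."""
    return re.fullmatch(r"\d+(\.\d+)*", text) is not None


def valid_name(text: str) -> bool:
    """A name followed by an email address in angle brackets."""
    return re.fullmatch(r"\S.* <[^<>@\s]+@[^<>@\s]+>", text) is not None


def valid_change_date(text: str) -> bool:
    """Exactly YYYY-MM-DD, and a date that exists in the calendar."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is None:
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print(valid_edition("0.1"))
    print(valid_name("Michael Bunk <micha@luetzschena.de>"))
    print(valid_change_date("2003-05-01"))
    print(valid_change_date("2005-Nov-24"))
```

A date such as 2005-Nov-24 is rejected by this check, because the month is not given as a two-digit number.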

In the following you find a header taken from the kha-deu.tei file. You can copy it and replace element contents as they fit your database.

  <teiHeader>
    <fileDesc>
      <titleStmt>
	<title>Khasi-German FreeDict Dictionary</title>
	<respStmt>
	  <resp>Maintainer</resp>
	  <name>Michael Bunk &lt;micha@luetzschena.de></name>
	</respStmt>
      </titleStmt>
      <editionStmt>
	<edition>0.1</edition>
      </editionStmt>
      <extent>about 1000 headwords</extent>
      <publicationStmt>
	<publisher>FreeDict</publisher>
	<availability>
	  <p>GNU GENERAL PUBLIC LICENSE</p>
	</availability>
	<date>2002-2004</date>
	<pubPlace>http://freedict.org/</pubPlace>
      </publicationStmt>
      <seriesStmt>
	<title>free dictionaries</title>
      </seriesStmt>
      <notesStmt>
	<note>Thanks go to the supporters of this project:
	  Karl-Heinz Grüßner, University of Tübingen; Rebekah Tham,
	  CIEFL, Shillong; Brother Sngi, Sacred Heart College,
	  Shillong; University of Leipzig, Depts of Indology and
	  Computer Science; Rik Faith for dictd server &amp;
	  protocol; TEI for their wonderful guidelines; Horst
	  Eyermann for his generous freedict project; the open source
	  community: thrust for trust &amp; development</note>
      </notesStmt>
      <sourceDesc>
	<p>Home: <xptr url="http://freedict.org/"/></p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>
	<p>This dictionary comes to you through nice people making it
	  available for free and for good. It is part of the FreeDict project,
	  http://www.freedict.org / http://freedict.de. This
	  project aims to make available many translating dictionaries
	  for free. Your contributions are welcome!</p>
      </projectDesc>
    </encodingDesc>
    <revisionDesc>
      <change>
	<date>2005-Nov-24</date>
	<respStmt>
	  <name>Michael Bunk</name>
	</respStmt>
	<item>Note that this version is quite final. I concentrate on
	  the Khasi-English dictionary now.</item>
      </change>
      <change>
	<date>2003-05-01</date>
	<respStmt>
	  <name>Michael Bunk</name>
	</respStmt>
	<item>Updated some things in XML header information</item>
      </change>
      <change>
	<date>2002-02-18</date>
	<respStmt>
	  <name>Michael Bunk</name>
	</respStmt>
	<item>First Draft</item>
      </change>
      <change>
	<date>before February 2002</date>
	<respStmt>
	  <name>Karl-Heinz Grüßner</name>
	</respStmt>
	<item>compilation of wordlists</item>
      </change>
    </revisionDesc>
  </teiHeader>

That header is transformed to this 00-database-info text:

Khasi-German FreeDict Dictionary

Maintainer: Michael Bunk <micha@luetzschena.de>

Edition: 0.1
Size: about 1000 headwords

Published by: FreeDict, 2002-2004
at: http://freedict.org/

Availability:

  GNU GENERAL PUBLIC LICENSE

Series: free dictionaries

Notes:

 * Thanks go to the supporters of this project: Karl-Heinz Grüßner,
   University of Tübingen; Rebekah Tham, CIEFL, Shillong; Brother Sngi,
   Sacred Heart College, Shillong; University of Leipzig, Depts of
   Indology and Computer Science; Rik Faith for dictd server & protocol;
   TEI for their wonderful guidelines; Horst Eyermann for his generous
   freedict project; the open source community: thrust for trust &
   development

Source(s):

  Home: http://freedict.org/

The Project:

  This dictionary comes to you through nice people making it available for
  free and for good. It is part of the FreeDict project,
  http://www.freedict.org / http://freedict.de. This project aims to make
  available many translating dictionaries for free. Your contributions are
  welcome!

Changelog:

 * 2007-02-02 $Id: teiheader2txt.xsl,v 1.6 2006/05/21 12:57:18 micha137 Exp $:
   Converted TEI file into text format

 * 2005-Nov-24 Michael Bunk:
   Note that this version is quite final. I concentrate on the
   Khasi-English dictionary now.

 * 2003-05-01 Michael Bunk:
   Updated some things in XML header information

 * 2002-02-18 Michael Bunk:
   First Draft

 * before February 2002 Karl-Heinz Grüßner:
   compilation of wordlists
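To make the mapping from header elements to the text above more concrete, here is a small Python sketch that extracts a few fields from a shortened copy of the header. It is not the teiheader2txt.xsl stylesheet that FreeDict actually uses, only an illustration of the idea. The function names header_summary and database_entries are made up for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the kha-deu header shown above.
HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Khasi-German FreeDict Dictionary</title>
      <respStmt>
        <resp>Maintainer</resp>
        <name>Michael Bunk &lt;micha@luetzschena.de&gt;</name>
      </respStmt>
    </titleStmt>
    <editionStmt><edition>0.1</edition></editionStmt>
    <extent>about 1000 headwords</extent>
    <sourceDesc><p>Home: <xptr url="http://freedict.org/"/></p></sourceDesc>
  </fileDesc>
</teiHeader>"""


def header_summary(header: ET.Element) -> str:
    """Render title, responsibilities, edition and size as plain text."""
    file_desc = header.find("fileDesc")
    lines = [file_desc.findtext("titleStmt/title"), ""]
    for stmt in file_desc.findall("titleStmt/respStmt"):
        lines.append(f"{stmt.findtext('resp')}: {stmt.findtext('name')}")
    lines.append("")
    lines.append(f"Edition: {file_desc.findtext('editionStmt/edition')}")
    lines.append(f"Size: {file_desc.findtext('extent')}")
    return "\n".join(lines)


def database_entries(header: ET.Element) -> dict:
    """Collect the values that end up in the short and url entries."""
    return {
        "00-database-short": header.findtext("fileDesc/titleStmt/title"),
        "00-database-url": header.find(".//xptr").get("url"),
    }


if __name__ == "__main__":
    root = ET.fromstring(HEADER)
    print(header_summary(root))
    print(database_entries(root))
```

The summary reproduces the first lines of the 00-database-info text shown above; the remaining sections (availability, notes, changelog) would be handled in the same way.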
    

Entry Examples

Example 6.2. A minimal entry

<entry>
  <form><orth>dog</orth></form>
  <trans><tr>Hund</tr></trans>
</entry>

After formatting this entry might look like:

dog
    Hund
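If your word pairs already exist in some other form, a minimal entry like the one in Example 6.2 can also be produced by a program. The following Python sketch shows one way to do this with the standard xml.etree.ElementTree module. The function name make_entry and its parameters are made up for this example, and the choice of one trans element per translation equivalent is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def make_entry(headword, translations, pos=None):
    """Build an entry element with one trans/tr pair per translation."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword
    if pos is not None:
        gram_grp = ET.SubElement(entry, "gramGrp")
        ET.SubElement(gram_grp, "pos").text = pos
    for translation in translations:
        trans = ET.SubElement(entry, "trans")
        ET.SubElement(trans, "tr").text = translation
    return entry


def to_xml(entry):
    """Serialize the entry; special characters are escaped automatically."""
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(make_entry("dog", ["Hund"])))
```

Building the elements through a library rather than by pasting strings together has the advantage that characters such as & and < in the data are escaped for you.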


Example 6.3. A more complete entry

<entry>
  <form>
    <orth>dog</orth>
    <pron>dɔg</pron>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense>
    <trans>
      <tr>Hund</tr><gen>m</gen>
    </trans>
    <eg>
      <q>The dog is barking.</q>
      <trans><tr>Der Hund bellt.</tr></trans>
    </eg>
    <note>Dogs bite as well.</note>
  </sense>
</entry>

After formatting it might look like this:

dog [dɔg] n.
  Hund m.
  "The dog is barking." = "Der Hund bellt."
  (Dogs bite as well.)


orth gives the orthography, i.e. how the word is correctly written. If there are several ways, you can use multiple orth elements. pron gives the pronunciation, i.e. how the word is spoken. In FreeDict we use the IPA (International Phonetic Alphabet), which is also contained in Unicode. The FreeDict-Editor might help you in entering it here. In a gramGrp element, all grammatical information is grouped. You can give the part of speech (here n for noun), the gender and also the number (singular, plural) of the headword (gender and number are not given for the headword in this example). The values allowed for part of speech are prescribed in this document, so that one author doesn't write n, another noun and a third N or whatever. See Table 5.1, “Part of Speech Typology (recommended contents of the pos element)”.

You can group the different senses of homographs with sense. The numbering is optional. Translations are given in tr elements; multiple tr elements may be given. Grammatical information for each of them is optional.

With eg (exempli gratia, "for example") you can give examples of usage and optionally their translation.
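How the elements described above map onto the formatted text can be sketched in a few lines of Python. This is not the stylesheet FreeDict uses for the conversion, only an illustration that reproduces the layout shown for Example 6.3. The function name format_entry and the layout rules are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

ENTRY = """<entry>
  <form>
    <orth>dog</orth>
    <pron>dɔg</pron>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense>
    <trans>
      <tr>Hund</tr><gen>m</gen>
    </trans>
    <eg>
      <q>The dog is barking.</q>
      <trans><tr>Der Hund bellt.</tr></trans>
    </eg>
    <note>Dogs bite as well.</note>
  </sense>
</entry>"""


def format_entry(entry: ET.Element) -> str:
    """Render one entry as indented plain text."""
    head = ", ".join(orth.text for orth in entry.findall("form/orth"))
    pron = entry.findtext("form/pron")
    if pron:
        head += f" [{pron}]"
    pos = entry.findtext("gramGrp/pos")
    if pos:
        head += f" {pos}."
    lines = [head]
    # Entries without sense elements carry their translations directly.
    for sense in entry.findall("sense") or [entry]:
        for trans in sense.findall("trans"):
            text = ", ".join(tr.text for tr in trans.findall("tr"))
            gen = trans.findtext("gen")
            if gen:
                text += f" {gen}."
            lines.append("  " + text)
        for eg in sense.findall("eg"):
            line = f'  "{eg.findtext("q")}"'
            translation = eg.findtext("trans/tr")
            if translation:
                line += f' = "{translation}"'
            lines.append(line)
        for note in sense.findall("note"):
            lines.append(f"  ({note.text})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_entry(ET.fromstring(ENTRY)))
```

Note that the note text is wrapped in parentheses by the formatter, so the note itself should not contain them.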

Example 6.4. Entry with a definition as well as a translation and an example sentence

    <entry>
      <form>
	<orth>ban</orth>
      </form>
      <gramGrp>
	<pos>prep</pos>
      </gramGrp>
      <sense>
	<trans><tr>to</tr></trans>
	<def>denotes infinitive of the following verb</def>
	<eg><q>U nang ban thoh.</q><trans><tr>Come here!</tr></trans></eg>
      </sense>
    </entry>

Example 6.5. Entry with a cross reference to a synonym

     <entry>
      <form>
	<orth>pynhiar</orth>
      </form>
      <gramGrp>
	<pos>v</pos>
      </gramGrp>
      <sense>
	<trans><tr>abase</tr></trans>
	<xr type="syn"><ref>pynrit</ref></xr>
      </sense>
    </entry>

Example 6.6. Entry giving the usage domain of a translation

    <entry>
      <form>
	<orth>pungkjat</orth>
      </form>
      <gramGrp>
	<pos>n</pos><gen>f</gen>
      </gramGrp>
      <sense>
	<usg type="dom">bio</usg>
	<trans><tr>leg</tr></trans>
      </sense>
    </entry>

Supported elements in entries (children of entry)

form

for grouping of orth and pron elements

orth

becomes a headword

pron

pronunciation; optional

gramGrp

grouping of pos, num and gen which give part of speech, number and gender; giving grammatical information is optional; pos, num and gen are allowed after tr as well

sense

becomes a numbered sense; optional

usg type="dom"

domain to which a sense applies

Note

To increase the usability of dictionaries for machines, the contents of this element should be standardized across dictionaries. A table could translate the content to a specific language, if required.

trans

groups information relating to a translation equivalent; the translation equivalent itself is given in a tr element; other information regarding a single translation equivalent would be grammatical information, esp. gen (in my view, the part of speech is always the same as that of the headword); optional by DTD; usage recommended

Warning

This element was used incorrectly in connection with the tr element in FreeDict. Do not copy those mistakes!

tr

contains a literal translation equivalent; when building a reverse index, the contents of these elements will be used instead of those of orth elements for indexed headwords

def

contains a definition; is taken over verbatim; in a bilingual dictionary, grammatical particles sometimes cannot be translated, so a definition of their function is more appropriate

eg/q

"exempli gratia"; example sentence; optional; the example itself must be contained in a q element to mark it as a quote. A translation of the example can also be given.

xref

a cross reference; optional; transformed to {} in dictd database format and {sa} in BEDic format

note

free notes; the attribute "resp" is reserved to contain the translator string (id or name or email), only used in xdf2tei.pl; normally notes are rendered inside parentheses, so don't include parentheses here

Please refer to the TEI Guidelines and take a look at the XML markup of a dictionary for more examples.
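The reverse index mentioned under tr can be illustrated with a short Python sketch. It is not the code FreeDict uses, only a minimal demonstration of the idea: the contents of the tr elements become the keys, and the orth elements of the same entry become the values. The function name reverse_index is made up for this example, and skipping the translations of example sentences is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

BODY = """<body>
  <entry>
    <form><orth>dog</orth></form>
    <sense>
      <trans><tr>Hund</tr></trans>
      <eg>
        <q>The dog is barking.</q>
        <trans><tr>Der Hund bellt.</tr></trans>
      </eg>
    </sense>
  </entry>
  <entry>
    <form><orth>pynhiar</orth></form>
    <sense><trans><tr>abase</tr></trans></sense>
  </entry>
</body>"""


def reverse_index(body: ET.Element) -> dict:
    """Map each translation equivalent to the headwords it translates."""
    index = defaultdict(list)
    for entry in body.iter("entry"):
        headwords = [orth.text for orth in entry.findall("form/orth")]
        # Only translations of the headword, not those of example sentences.
        equivalents = entry.findall("trans/tr") + entry.findall("sense/trans/tr")
        for tr in equivalents:
            index[tr.text].extend(headwords)
    return dict(index)


if __name__ == "__main__":
    print(reverse_index(ET.fromstring(BODY)))
```

This is also the reason why tr should contain only the literal translation equivalent: anything else in it would turn up as a headword in the reversed dictionary.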

TEI Dictionary Template

You can use the FreeDict Dictionary TEI XML Template as a starting point for your own dictionaries. You can download it from here.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2
 PUBLIC "-//TEI P4//DTD Main Document Type//EN"
 "http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
<!ENTITY % TEI.dictionaries "INCLUDE">
<!ENTITY % TEI.XML "INCLUDE">
<!ENTITY % TEI.linking "INCLUDE">
<!ATTLIST xptr url CDATA #IMPLIED>
<!ATTLIST xref url CDATA #IMPLIED>
]>
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
	<title>Language1-Language2 FreeDict Dictionary</title>
	<respStmt>
	  <resp>Author</resp>
	  <name>Someone Somewho &lt;her@email.address&gt;</name>
	</respStmt>
	<respStmt>
	  <resp>Maintainer</resp>
	  <name>Someone Else &lt;or@the-same.example.com&gt;</name>
	</respStmt>
      </titleStmt>
      <editionStmt>
	<edition>0.1</edition>
      </editionStmt>
      <extent>less than 1000 headwords</extent>
      <publicationStmt>
	<publisher>FreeDict</publisher>
	<availability>
	  <p>GNU GENERAL PUBLIC LICENSE</p>
	</availability>
	<date>200X</date>
	<pubPlace>http://freedict.org/</pubPlace>
      </publicationStmt>
      <seriesStmt>
	<title>free dictionaries</title>
      </seriesStmt>
     <notesStmt>
       <note type="status">new dictionary project</note>
       <note>Thanks to ...</note>
     </notesStmt>
     <sourceDesc>
       <p>Home: <xptr url="http://www.somewhere.in.cspace/"/></p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
        <projectDesc>
          <p>This dictionary comes to you through nice people making it
          available for free and for good. It might be part of the FreeDict project,
          http://www.freedict.org / http://freedict.de. This
          project aims to make available many translating dictionaries
          for free. Your contributions are welcome!</p>
      </projectDesc>
    </encodingDesc>
   <revisionDesc>
       <change>
	<date>200x-mm-dd</date>
	<respStmt><name>Someone</name></respStmt>
	<item>Start</item>
      </change>
      <change>
	<date>2005-04-02</date>
	<respStmt><name>Peter Gossner / Michael Bunk</name></respStmt>
	<item>FreeDict Dictionary TEI XML Template V0.1</item>
      </change>
    </revisionDesc>
  </teiHeader>

  <text>
    <body>

      <!-- Sample Entry -->
      <entry>
        <form>
	  <orth>A</orth>
	  <orth>a</orth>
	</form>
	<gramGrp>
	  <pos>n</pos>
	  <gen>m</gen>
	</gramGrp>
        <sense>
	  <def>the first letter of the Language1 Alphabet</def>
	</sense>
      </entry>
      
    </body>
  </text>
</TEI.2>

Dictionary Quality

As dictionaries are used as authoritative sources for people looking up the spelling of words or learning foreign languages, it is important that they maintain a high quality standard.

The following quality criteria are important:

Quality Criteria for Dictionaries

Correctness

The reason for this was given in the introductory paragraph.

Headword Count

It is frustrating not to find a word in a dictionary.

Usability

This is mainly a question for the platforms that our dictionaries are provided for. Dictionaries should be easy to install and word lookup should be easy as well. Electronic dictionaries have a great advantage in the speed of entry lookups and can provide lookup strategies that paper dictionaries can't. Paper dictionaries have the advantage of being quite portable, up to a number of kg that depends on the capacity of the bearer. Nowadays a PDA can replace a whole bunch of dictionaries while maintaining portability.

Improving Quality

So what can be done to meet the above criteria? Indirectly, you can always support the people working on dictionaries by sparing resources for them. To improve quality directly, the following can be done:

Means to Improve Dictionary Quality

Revise entries manually

The present lack of a comfortable editor makes this a bit hard. It also requires profound knowledge of the languages of the dictionary, so this task is reserved for experts.

Spellcheck the dictionary

When you write a dictionary for a language for which no spellchecker wordlist exists yet, you of course can't do this. But most dictionaries will be from some language into English, and the English part of the dictionary can be spellchecked! This requires a suitable tool.

Check sanity/completeness of entries

This can help to spot entries without part of speech information or entries carrying editorial marks requesting clarification. It requires a supporting tool.
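As a rough idea of what such a supporting tool could look like, here is a minimal Python sketch that lists the headwords of all entries lacking part of speech information. It is not an existing FreeDict tool. The function name entries_without_pos is made up for this example, and the sample data reuses entries from the examples in this chapter.

```python
import xml.etree.ElementTree as ET

BODY = """<body>
  <entry>
    <form><orth>dog</orth></form>
    <trans><tr>Hund</tr></trans>
  </entry>
  <entry>
    <form><orth>ban</orth></form>
    <gramGrp><pos>prep</pos></gramGrp>
    <sense><trans><tr>to</tr></trans></sense>
  </entry>
</body>"""


def entries_without_pos(body: ET.Element) -> list:
    """Return the first headword of every entry that has no gramGrp/pos."""
    missing = []
    for entry in body.iter("entry"):
        if entry.find("gramGrp/pos") is None:
            missing.append(entry.findtext("form/orth", default="?"))
    return missing


if __name__ == "__main__":
    for headword in entries_without_pos(ET.fromstring(BODY)):
        print("no part of speech:", headword)
```

Checks for other omissions, for example entries without any tr element, could be added in the same way.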

Report and fix bugs

This is a natural activity for Open Source Software.

Grow the dictionary

Provided there is a way to enter/submit new entries, the question arises from where to get new entries or information to extend existing entries. This is quite a complex topic, so it might go into its own section one day when it has grown up (are there parallels to "How to grow a language" from Guy Steele?).

Having a miss during word lookup creates a likely candidate to be added to the dictionary if the query was not misspelled. Usually the translation can be found in another dictionary (that is what I do when I have a miss). This combination of headword and translation can be added to the dictionary you want to grow.

Doing it systematically can be called copying a dictionary. Watch out for author's rights here. Nobody can own words, but compiling a dictionary can be quite some work, so acknowledge this!

Often, having parts can help get you going. Having wordlists of the headwords or translation equivalents of one of the languages of the dictionary can help to grow a dictionary. With a wordlist you only have to answer slightly easier questions like "What is the translation of word XXX in the other language?" or "What is the Part of Speech of word XXX?".

Sometimes wordlists are quite easy to get. You can for example extract them from existing dictionaries or spellchecker databases. The tool index2wordlist.pl in the tools/testing directory can make a word list out of a dictd database index file. The aspell dump command can give you the wordlist of a database of the aspell spellchecker.

For languages where you cannot reuse existing wordlists, e.g. when you are spearheading the development of the first-ever dictionary of a minority language, the situation is slightly more difficult.

If electronic documents - preferably websites - in that language exist, you can use a Natural Language Processing technique that employs seed words as input. The seeds you have to give should be specific to your language, ie. they should not be used in other languages. Then you can identify the electronic documents containing those seeds. They are likely to be in your language. From the documents in your language you can extract additional words which you can reuse to find more documents in your language and more words in turn. In hypertexts you can exploit a locality feature of documents in a certain language: The links are likely to lead to documents in the same language. So you can get more words from there as well.

An implementation of this technique was done by Prof. Kevin Scannell with An Crúbadán, a crawler that uses the Google API(s) to find websites.
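The seed word technique described above can be sketched in Python on a handful of in-memory documents. This is a toy illustration and not the crawler mentioned above: it does no web access and follows no links. The function names, the sample documents and the min_share threshold (the share of already known words a document needs to be accepted without containing a seed) are all assumptions made for this example.

```python
import re


def tokenize(text: str) -> set:
    """Lower-cased set of the words (runs of letters) in a text."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def grow_wordlist(seeds, documents, min_share=0.5):
    """Grow a wordlist from seed words.

    documents maps a document name to its text. A document is accepted if
    it contains a seed word or if at least min_share of its words are
    already known. Words of accepted documents join the wordlist, which
    may in turn let further documents be accepted.
    """
    seed_set = {seed.lower() for seed in seeds}
    words = set(seed_set)
    accepted = set()
    changed = True
    while changed:
        changed = False
        for name, text in documents.items():
            if name in accepted:
                continue
            tokens = tokenize(text)
            if not tokens:
                continue
            share = len(tokens & words) / len(tokens)
            if tokens & seed_set or share >= min_share:
                accepted.add(name)
                words |= tokens
                changed = True
    return words, accepted


# Toy data: three German documents and one English one.
DOCUMENTS = {
    "a": "Der Hund bellt.",
    "b": "Der Hund bellt laut.",
    "c": "The dog is barking.",
    "d": "Laut bellt der Nachbar.",
}

if __name__ == "__main__":
    words, accepted = grow_wordlist(["Hund"], DOCUMENTS)
    print(sorted(accepted))
    print(sorted(words))
```

Document d contains no seed word, but it is accepted because most of its words were learned from a and b. The English document shares no words with the growing list and stays out. With real data the threshold matters, since words shared between languages can pull in foreign documents.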