<?xml version='1.0' encoding="UTF-8"?>
<!-- This file is included by freedict-howto.xml -->

<part id="requirements">
  <title>Basic Requirements</title>
  <partintro>
    <para>This chapter describes the <indexterm><primary>basic requirements</primary></indexterm>
      basic requirements for inclusion in the FreeDict Project.
      Before we can accept a dictionary, we
      ask that it meets a common set of specifications. This section gives an overview 
      of those requirements. Essentially, a dictionary must meet clear
      copyright and licensing guidelines and must be presented in a particular XML format.</para>
  </partintro>

  <chapter id="Licensing">
    <title>Licensing and <indexterm><primary>Copyright</primary></indexterm>Copyright</title>
    <para>The FreeDict project including all content, documentation and tools is
      <indexterm><primary>Free Software</primary></indexterm>Free Software. Any dictionary
      to be considered for archiving and release by this project must first be Free, Open Source
      Software.</para>
    <para>The developers of a dictionary may always hold the copyright to their particular
      dictionary. There are a number of available licenses that can readily achieve this for you.</para>
    <para>This means you must release the dictionary under the terms and conditions of the GNU
      General Public License (GPL Version 2 or later),  or a License of
      <ulink url="http://www.gnu.org/licenses/license-list.html">compatible type</ulink>.</para>
    <para>Alternatively you may formally release the dictionary to the
      <indexterm><primary>public domain</primary></indexterm>Public Domain. You should understand
      that in doing so you have no easy way of stopping anybody from capturing your content and
      locking it up under their own Copyright and/or License. We don't recommend this.</para>
    <para>We do recommend the <ulink url="http://www.gnu.org/licenses/gpl.html">GPL Version 2</ulink>
      or the <ulink url="http://www.gnu.org/copyleft/fdl.html">GNU Free Documentation License</ulink>.
      Of course it's your choice. Both of these licenses (and those that are compatible with them),
      protect you and your Copyright. Briefly; they say you are releasing the dictionary:<simplelist>
	<member>1/ In the hope that it will be useful,</member>
	<member>2/ With no warranty to it's fitness or otherwise for any use, and</member>
	<member>3/ You allow free use and access to the work, provided the user passes on the same
          right if any substantial part of the work is included in their own.</member></simplelist></para>
   
    <para>Please understand that FreeDict will not agree to any license that does not meet or agree
      with the aims and scope of the GPL2<indexterm><primary>GPL2</primary></indexterm>. For current
      examples of licenses that do agree (and some that don't), please see:
      <ulink url="http://www.gnu.org/licenses/license-list.html">http://www.gnu.org/licenses/license-list.html</ulink>.</para>    
  </chapter>

  <chapter id="tei-xml">
    <title>XML Markup</title>
    <para>FreeDict dictionaries are marked up using the XML version of the
      <ulink url="http://www.tei-c.org/">Text Encoding Initiative</ulink> DTD,
      Chapter 12 (Dictionaries).</para>
    <para>This may seem a little daunting at first, but please read on as we are
      working quickly to make this much easier. Full instructions are given in
      the "<link linkend="writing">Writing a FreeDict Dictionary</link>" section.
      We are also developing and testing <link linkend="tools">tools</link>
      that may be suitable for automating most of this. 
      <indexterm>
	<primary>Text Exchange Initiative</primary>
	<secondary>TEI DTD</secondary>
      </indexterm>
    </para>

    <para>There are many advantages to using a standard content based approach
      like this:</para>

    <orderedlist numeration="upperroman">
      <title>Advantages of using the TEI XML markup format</title>
      <listitem>
	<para>Inherits most of the advantages of XML including:
	 <itemizedlist>
	  <listitem>
	   <para>content based rather than layout based</para>
	  </listitem>
	  <listitem>
	   <para>application independent</para>
	  </listitem>
	  <listitem>
	    <para>platform independent</para>
	  </listitem>
	  <listitem>
	    <para>further processing readily possible across the entire FreeDict collection</para>
	  </listitem>
	  <listitem>
	    <para>enables full use of existing or customised XML technologies</para>
          </listitem>
	 </itemizedlist>
	</para>
      </listitem>
      <listitem>
	<para>Standardises input and output formats.</para>
      </listitem>
      <listitem>
	<para>Protects against obsolescence</para>
      </listitem>
      <listitem>
	<para>TEI has comprehensive DTDs available.
	 <itemizedlist>
	  <listitem>
	    <para>The Dictionary DTD is just one of a very wide conceptual set.</para>
	  </listitem>
	  <listitem>
	    <para>Elements already exist for lexicographic, etymological, phonetic
		and other particularities of dictionaries.</para>
	  </listitem>
	  <listitem>
	    <para>The TEI XML combination allows processing, development and use
		 beyond the immediate scope of the FreeDict translating dictionaries.</para>
	  </listitem>
	 </itemizedlist>
	</para>
      </listitem>
      <listitem>
	<para>TEI technologies are reasonably well understood and used in academic circles.</para>
      </listitem>
    </orderedlist>

    <para>Like for anything else, using TEI XML bears disadvantages:</para>

    <orderedlist numeration="upperroman">
      <title>Disadvantages of using the TEI XML markup format</title>
      <listitem>
	<para>High memory requirements, for storage as well as for processing</para>
      </listitem>
      <listitem>
	<para>The TEI DTD is too permissive. It allows too complex content
	  models for its elements, because it was written to capture as many
	  existing texts as possible. Since almost all elements are allowed
	  inside others, writing software to further process TEI data becomes
	  complex. FreeDict uses its own subset of the TEI DTD. This subset wil
	  be defined in this Howto, once it is stable. Till then it is
	  described only.</para>
      </listitem>
      <listitem>
	<para>TEI is missing typologies. Eg. TEI does not prescribe to encode
	  the Part of Speech of a noun as "noun" or "n" or anything else.
	  Another missing typology is one for Cross References, eg. "synonym",
	  "hypernym" etc. FreeDict has to define them.</para>
      </listitem>
      <listitem>
	<para>XML data requires more than a text editor for easy maintenance
	  due to its verbosity. Eg. you cannot enter entries speedily when you
	  have to enter all tags manually. To ease this, FreeDict develops the
	  (as of yet unreleased)
	  <application>FreeDict-Editor</application>.</para>
      </listitem>
    </orderedlist>      
    </chapter>

    <chapter id="Unicode">
      <title>Unicode</title>
      <formalpara id="Unicode-howto">
	<title>From the Unicode HOWTO</title>
	<para>
	  <quote>People in different countries use different characters to represent
            the words of their native languages. Nowadays most applications, including
            email systems and web browsers, are 8-bit clean, i.e. they can operate on
            and display text correctly provided that it is represented in an 8-bit
            character set, like ISO-8859-1.</quote>
          <citation>Bruno Haible: Linux Unicode Howto</citation>
          <indexterm>
	    <primary>Bruno Haible</primary></indexterm>
          <indexterm>
	    <primary>Unicode</primary></indexterm>
        </para>
      </formalpara>
      <formalpara>
        <title>What is Unicode?</title>
        <para>From: <citation><ulink url="http://www.unicode.org/standard/WhatIsUnicode.html"
          id="link-unicodeOrg">http://www.unicode.org/standard/WhatIsUnicode.html</ulink></citation>
	  <literallayout>
	    <quote>Unicode provides a unique number for every character,
	      no matter what the platform,
	      no matter what the program,
	      no matter what the language.</quote>
	  </literallayout>
        </para>
      </formalpara>
      <para>Essentially, Unicode encoding supplies the unique character number that is
        typed to the page each time you enter a letter on your keyboard. A font-set then
        renders the character onto the screen so that you may read it. It's important to
        understand that Unicode is not a font-set, it is a protocol for the mappings that
        font-sets use to render fonts on your screen or on to a printed page.</para>
      <highlights>
	<tip>
	  <para>All new dictionaries should be written using the Unicode character set.
            If you use a text editor that is unicode capable, all should be well.</para>
	  <para>UTF-8 is a way of wrapping up all real-world characters in a portable
            and efficient way. This includes most 8 bit and many <emphasis>16 bit or 2 byte</emphasis>
            character sets. Your current character sets are probably included, so it may be
            as simple as putting
	    <literal>&lt;?xml version="1.0" encoding="UTF-8" standalone="no"?&gt;</literal>
	    as the XML declaration of your final document.</para>
        </tip>
      </highlights>
      <para>You should ensure that your dictionary does not rely on any particular font set
        and is equally functional when rendered as simple text. Remember, fonts are just
        "pretty renderings" of real characters. Most modern Text Editors (e.g. Xemacs, emacs,
        Vim, GEdit, Kedit, notepad) should be fine.</para>
      <para>This is not the place for a full explanation of Unicode. Please see Markus Kuhn's
        excellent summary at <ulink url="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
        id="link-utf8">http://www.cl.cam.ac.uk/~mgk25/unicode.htm</ulink>. The Linux Unicode HOWTO
        is well worth visiting: <ulink
        url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html">ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html</ulink>
        <indexterm>
	  <primary>encoding</primary>
	  <secondary>UTF-8</secondary>
	</indexterm>
	<indexterm>
	  <primary>Web Browsers</primary>
	  <secondary>Unicode Compliance</secondary>
	</indexterm>
	<indexterm>
	  <primary>Universal Character Set</primary>
	  <secondary>UCS-2</secondary></indexterm>
	<indexterm>
	  <primary>UTF-8</primary>
	</indexterm>
	<indexterm>
	  <primary>ISO IEC</primary>
	</indexterm>
	<indexterm>
	  <primary>ISO 10646</primary>
          <secondary>ISO</secondary>
	</indexterm>
      </para>
     
      <itemizedlist spacing="normal" id="uft8quick">
	<title>UTF-8 Quick notes</title>
	<listitem>
	  <para>The <emphasis>Universal Character Set</emphasis> is defined by
            <emphasis>ISO/IEC standard ISO 10646</emphasis>.
          </para>
        </listitem>
	<listitem>
          <para>Unicode is a parallel standard that functionally defines a few
            "coal face" or real life, protocols.</para>
	</listitem>
	<listitem>
	  <para>Unicode is kept in synchronisation with ISO/IEC 10646.</para>
	</listitem>
	<listitem>
	  <para>UTF-8 is one of four ways to save Unicode into bytes. Another common one is UTF-16.</para>
	</listitem>
	<listitem>
	  <para>UTF-8 is the default standard for XML documents and is an excellent
            choice as it contains the character mappings sets from (almost) all known
            languages, while being fully compliant with current and earlier computer
            standards. Note most web browsers and protocols assume a UTF-8 compliant encoding.</para>
	</listitem>
	<listitem>
	  <para>All ISO type character sets in general use are covered by UTF-8 (including ASCII)
            as the KJC Family (Korean, Japanese and some of the Chinese family of characters).
            If your editor has a UTF-8 option please set it on.
	  </para>
	</listitem>
      </itemizedlist>
      <tip>
	<para>If you have automated some or all of your dictionary construction, please be
          careful to maintain character type compatibility throughout the process. C coders
          should use the <quote>wide character type</quote>. Most scripting languages now
          also support UTF-8 (Python, Perl, PHP, Java and Ruby at least). Shell scripts usually
          adopt the local environment settings. Please check your <application>gawk</application>
	  and <application>sed</application> are mapping cleanly.
          You may need very recent versions. 
	  <indexterm>
	    <primary>Scripting Languages</primary>
	    <secondary>Python, Perl, Java, PHP, Ruby</secondary>
	    <tertiary>Awk Sed Sh ENV</tertiary>
	  </indexterm>
	</para>
      </tip>
      <formalpara id="Unicode-more">
	<title>More Information</title>
	<para>If you are on a Linux (or similar) system, try <command>man 7 unicode</command>.
          You may also have some unicode tools on board: <command>man 1 unicode</command>.
          You may also need to set your <envar>LANG</envar> environment
	  settings, most Linux type systems support doing this on a per instance basis,
	  that is you may run a number of language and locales concurrently. Examples
	  are given later.
          <indexterm><primary>locale</primary></indexterm>
	  <indexterm><primary>LANG</primary></indexterm>
	  <indexterm><primary>Linux Manual Page</primary></indexterm>
	</para>
      </formalpara>
  </chapter>
  
</part>

<!-- Keep this comment at the end of the file. It is for emacs.
 Local Variables:
 sgml-parent-document: ("freedicthowto.xml" "book" "part")
 End:
 -->

