
The TEI Dictionary Schema for XML: its worth and weakness for the lexicographer


Since 1998, a Greek Lexicon project has been underway at Cambridge, and is due for completion in 2016. From the start, we decided to make extensive use of digital resources, both for the linguistic research which we undertook, and for the writing environment itself. Our major digital research tool is a pre-searched text database, which displays all the inflected forms for each Greek headword (this is described in Fraser 2008a).

Our project and its XML requirements

We needed a system for both print and online publication, which would also function as an authorial tool. As we had limited funds, all our resources had to be focussed on the writing team. We therefore needed an authorial tool which would avoid double-handling (writing plus encoding), which accurately reflected our entry structure (and so would already be familiar to our writers), and whose elements had familiar names which would take up minimal physical space in the editing window. Our solution was a "hands-on" approach, which I call tag and type, as described below.

Reasons for rejecting the TEI schema

Our reasons for rejecting the TEI approach can be explained in general or in specific terms. The general reason is that an off-the-peg system needs to be adapted for each project, while a tailor-made one can be designed to fit the lexicographic requirements. The specific reason may be stated in mathematical terms: that a beautiful equation is preferable to an ugly one. Far from being an unnecessary extra, elegance appears to be a sign of efficiency, while messiness signals weak structure and redundancy. I did in fact initially decide to use the TEI system, and my mind was changed only as it became evident that the coding was redundant, cluttered, and over-complex. These weaknesses stem from its dependence on mixed-content elements, all of which have a large number of child elements.

So what's wrong with mixed content?

In XML terms, the second structure corresponds to mixed-content elements, where the children can be repeated any number of times, and their order is not specified. A schema composed purely of mixed-content elements therefore has an inherent weakness.

Unfortunately, this was the structure chosen by the TEI. In 2003, every element was mixed-content, and every one of them had over 60 children. The result may be seen in the element "tr", designed to contain a translation.

This coding signifies that the element "tr" may contain any amount of text inside it (PCDATA) and also 63 child elements, which may appear any number of times, in any order, interspersed in any way with the text.
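The contrast between a sequenced content model and a mixed-content one can be illustrated with a short sketch. It is not the TEI declaration for "tr", and it is not the Cambridge DTD: the element names hw, infl, pos and sense are hypothetical, and each content model is approximated in Python as a regular expression over the names of an element's children. Text between the children is ignored, so the sketch tests only order and repetition.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical element names, chosen for illustration only.
# A DTD sequence model such as (hw, infl?, pos, sense+) fixes order and number.
SEQUENCE_MODEL = re.compile(r"hw (infl )?pos (sense )+")

# A mixed-content model such as (#PCDATA | hw | infl | pos | sense)*
# accepts the same children in any order, any number of times.
MIXED_MODEL = re.compile(r"((hw|infl|pos|sense) )*")


def child_names(xml_text):
    """Return the names of the root's direct children, each followed by a space."""
    root = ET.fromstring(xml_text)
    return "".join(child.tag + " " for child in root)


def conforms(model, xml_text):
    """Check whether the order of the children matches the content model."""
    return model.fullmatch(child_names(xml_text)) is not None


well_ordered = "<entry><hw>logos</hw><pos>noun</pos><sense>word</sense></entry>"
scrambled = (
    "<entry><sense>word</sense>stray text<pos>noun</pos>"
    "<hw>logos</hw><hw>logos</hw></entry>"
)

for label, text in (("well ordered", well_ordered), ("scrambled", scrambled)):
    print(label, conforms(SEQUENCE_MODEL, text), conforms(MIXED_MODEL, text))
```

Under these assumptions the sequence model rejects the scrambled entry, while the mixed-content model accepts both entries alike, which is the sense in which mixed content records no structure.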

In TEI Release 4, the situation had become worse: "tr" now contained 106 children. In the latest P5 Release, considerable reorganisation has taken place, and the "tr" element appears to have been eliminated, but the similar "def" (definition) element has 55 children, still showing the same dependence on unmanageably large mixed-content elements. This is particularly disappointing because it discards at one stroke what is generally thought to be the greatest strength of XML: that it can add structure to a text.

An alternative to TEI: self-design

How easy is it to design one's own XML structure? In our experience, the hard work was less in designing the digital structure than in finalising the lexicographic principles. Once that was done, the structure was relatively simple to create, because the more structure was coded into the elements themselves, the less heavy lifting remained to be done by XML attributes or XSL styling. It proved straightforward to configure a suitable structure, because our lexicographic requirements turned out to be expressible in fewer than 100 elements.

In this system, the writers select and enter the elements as they compose. Each writer has an XML editor set in a 'tags-on' view, which shows only those elements which are available at any given point in the entry. The process of composition is therefore a series of choices, presented in turn. The first choice is to select an entry structure for the part of speech: a noun, name, adjective, verb, adverb or preposition. After selecting the part of speech, the next choice available is the headword, then the inflection, part of speech, and so on, and the writer selects each in turn and enters text into the chosen element. The entry is therefore built up in the editing window in its natural order, with the text visible and automatically formatted, and the tag positions are also visible, so the writer can move about easily in the entry, making additions and corrections. At any time, a PDF view can also be generated.
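The way an editor can offer only the elements available at a given point may be sketched as follows. This is a minimal illustration and not the behaviour of any particular XML editor: the noun-entry model and the names Slot, NOUN_ENTRY and available_next are hypothetical, and a real editor would derive the choices from the DTD itself.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    name: str
    required: bool
    repeatable: bool


# Hypothetical model for a noun entry, equivalent to the
# DTD sequence (hw, infl?, pos, sense+).
NOUN_ENTRY = [
    Slot("hw", required=True, repeatable=False),
    Slot("infl", required=False, repeatable=False),
    Slot("pos", required=True, repeatable=False),
    Slot("sense", required=True, repeatable=True),
]


def available_next(model, entered):
    """Return the element names that may follow those already entered."""
    position = 0    # index of the slot currently being filled
    filled = False  # True once a repeatable slot holds at least one element
    for name in entered:
        # Walk forward to the slot matching this element, skipping optional ones.
        while position < len(model) and model[position].name != name:
            if model[position].required and not filled:
                raise ValueError(
                    f"{name!r} entered before required {model[position].name!r}"
                )
            position += 1
            filled = False
        if position == len(model):
            raise ValueError(f"{name!r} is not allowed at this point")
        if model[position].repeatable:
            filled = True   # stay on the slot so that it can be repeated
        else:
            position += 1   # a single-use slot is finished
            filled = False

    options = []
    for index in range(position, len(model)):
        slot = model[index]
        options.append(slot.name)
        if slot.required and not (filled and index == position):
            break  # nothing beyond an unfilled required slot is offered yet
    return options


print(available_next(NOUN_ENTRY, []))                        # ['hw']
print(available_next(NOUN_ENTRY, ["hw"]))                    # ['infl', 'pos']
print(available_next(NOUN_ENTRY, ["hw", "pos", "sense"]))    # ['sense']
```

Because the model is a sequence, the list of choices at each step is short, which is what makes composition a series of simple selections.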

Another advantage of the tailor-made approach is that the development of the DTD and the associated XSL stylesheet depends on the development of the lexicographic approach. The two processes can be undertaken in parallel, with the writing team fully involved in the digital development, by writing new entries throughout the period, and suggesting improvements to the digital output. Our writers could be fully involved in the XML design process, because that involved configuring a small number of XML elements, each of which was named according to our own preferred terminology (another reason not to use an 'off-the-peg' system).

Using TEI and other existing schemas

Due to the loss of structure and the extra complexities which the TEI schema introduces into the authorial environment, as described above, it does not seem to provide a suitable structure for a new dictionary. However, the inclusivity of its structure does mean that it will be capable of capturing the structure of dictionaries which have already been written, and just need to be encoded into XML. Clearly, many dictionaries do use TEI, and their editors have described in a number of publications how it was possible to adapt the TEI structure to their lexicographic requirements.

One benefit of an off-the-peg system is likely to be cross-compatibility in linking to other material. This was a major consideration for our own project: we were initially concerned that we might have problems with the digital edition of our lexicon, because that will be posted on Perseus, who use TEI-conformant coding. But bilingual lexicography always relies on the possibility of translation, and Perseus tell us that linking will in fact be possible (see note 4). Our choice of a self-designed system appears to be compatible with the digital as well as the print platform.
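The idea of translating one tag vocabulary into another can be illustrated with a very small sketch. It is not the Abbot toolset and not the procedure Perseus will use: the project element names on the left of TAG_MAP are hypothetical, and the pairing with names from the TEI P5 dictionary module (entry, orth, pos, sense, def) is for illustration only. A flat renaming like this would not by itself produce valid TEI, which also requires wrapper elements that the sketch omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical project names (left) paired with TEI P5 element names (right).
TAG_MAP = {
    "entry_noun": "entry",
    "headword": "orth",
    "part_of_speech": "pos",
    "meaning": "sense",
    "translation": "def",
}


def rename_elements(xml_text, tag_map):
    """Rename every element found in tag_map; leave all other elements unchanged."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = (
    "<entry_noun><headword>logos</headword>"
    "<part_of_speech>noun</part_of_speech>"
    "<meaning><translation>word</translation></meaning></entry_noun>"
)
converted = rename_elements(source, TAG_MAP)
print(converted)
```

The point of the sketch is only that a consistently structured source can be mapped mechanically onto another vocabulary; the real transformation is described in note 4.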

The future and the advantage of an author-centred approach

I'm keen to see the end of a perceived division between humanities academics as scholars – in literature, linguistics, epigraphy, archaeology and so on – or as IT specialists. Because we are all involved in publishing, we all need to be comfortable with digital text-formatting, and the XML environment is a very accessible one. I'm deeply grateful to programmers who encouraged me to engage with the technical details of text composition, because the experience has shown me that the best results can be achieved when the technical details are designed to serve the scholarly task in hand: by configuring our IT systems to our academic methodology, rather than the other way round. And the principles of XML are not difficult to grasp, if you know how web pages are tagged, and XSL, though rather more challenging, can be regarded as an exercise in organising conditional clauses.

1 This may help explain why, of the pre-existing structures we examined, the OED's schema came closest to meeting our requirements, because it is also a historical and literary dictionary, albeit a monolingual one.

2 The original name was intended to emphasise that the lexicon writers choose the XML tags themselves, rather than the tags being added to the composed text. The name has been changed to highlight the order in which the writers work: first selecting the appropriate tag, and then composing their text.

3 In the Preface of the Oxford English Dictionary, Second Edition, 1989, the editors Simpson and Weiner described the development of the SGML system in these terms: "The structure designed by Sir James Murray and used by him and all his successors for writing dictionary entries was so regular that it was possible to analyze them as if they were sentences of a language with a definite syntax and grammar. They could therefore be parsed."

4 I'm most grateful to Bridget Almas of Perseus for explaining the technical details: our source XML will be transformed into the Perseus TEI-Analytics schema (a subset of TEI P5) using the Abbot toolset, and scripts will be written to identify cross-reference points between the lexicon and the Perseus material, and to insert the tags for linking.

Bruce Fraser, April 2011