Practical Aspects of Digital Preservation

Ellis Weinberger
CEDARS Project Officer
Cambridge University Library

1 Introduction & outline

1.1 Introduction

As part of its Electronic Libraries programme, the Joint Information Systems Committee is funding the Consortium of University Research Libraries to manage the Exemplars in Digital Archives (CEDARS) project. The project aims to address strategic, methodological, and practical issues relating to the preservation of digital materials. CEDARS will consider collection management policies, and concerns associated with intellectual property rights, technical methods, and preservation-specific metadata. The project will also design a digital preservation archiving system which will be pilot tested at participating institutions.

Work on the project, at both technical and conceptual levels, has been based on the Open Archival Information System (OAIS) model, which is being developed by the International Organization for Standardization (ISO) and the Consultative Committee for Space Data Systems (CCSDS). CEDARS is consulting and co-operating with related projects at the Arts and Humanities Data Service and at the British Library.

1.2 Outline

This paper discusses five steps which are necessary for the long-term preservation and use of any digital object: selection, negotiation of intellectual property rights, creation of metadata, technical preservation, and retrieval.

Digital objects, such as CD-ROMs and online databases, do not keep very well. Good ink on good paper, stored in a cool, dry, dark room, can be read for about a thousand years. Digital objects stored in the same conditions can be used for about five years. The ability to use digital objects is lost because the hardware becomes unavailable, the software is no longer distributed, or the medium bearing the data deteriorates.

In order to explain the concepts of digital preservation, a fictitious example will be used.

2 Example

2.1 Professor Huddy

In this example, Cornwall is independent and rich. Professor Huddy is Professor of Cornish Language and Literature at the University College of Cornwall at Truro. The Professor comes from an old Cornish family, and has many friends in Cornwall. Professor Huddy studied at Exeter and Dublin, and specializes in medieval Cornish drama.

2.2 The Professor's life-work

For the past twenty years the Professor has been collecting transcriptions of Cornish-language manuscripts in digital form. The transcriptions are of works which the Professor believes have made a significant contribution to the development of Cornish literature. They are from both older texts and more recent literature, and have been migrated from one word-processing format to another over the years.

2.3 The digital object: 'Lyenn'

Recently, all the transcriptions have been collected on a CD-ROM entitled 'Lyenn', meaning 'literature' in Cornish, which has been published by the University Press. The CD-ROM has been mounted on the university network for use by members of the university. The texts are currently in HTML, and can be rendered by any browser supporting HTML 3.2. To render is to transform the stored digital object into a form that a person can use.

3 Selection

Dr Angove, head of collection management at the University Library, University College of Cornwall at Truro, has to decide whether to preserve this digital object, containing these transcriptions from Cornish-language manuscripts. Preparing this digital object for preservation, and preserving it for posterity, will be expensive, and will probably cost more than publishing the object did. Dr Angove needs to be sure that this will be a prudent use of resources. The cost of preparing a digital object for preservation may be significantly higher than the cost of preparing a book for use in a library, perhaps by a factor of 10 (Hendley, 1998; Puglia, 1999).

Dr Angove needs to check that the digital object contributes to the existing subject strengths and the aims of the library, and fulfills the needs of the users. In addition, the intellectual property rights issues, and the technical preservation details, need to be addressed.

3.1 Reasons for selection

Dr Angove decides, in consultation with subject specialists, that the library should preserve the digital object, for three reasons. The transcriptions were made by a senior member of the university. The subject covered by the digital object directly supports one of the prime aims of the university, namely the study of the Cornish language. Finally, many members of the university, as well as many visiting scholars, are already using the digital object.

3.2 Preservation capability

Dr Janes, who met Professor Huddy when they were both studying at Exeter, is Director of the computing service at the University College of Cornwall at Truro. Dr Janes has personnel with training and experience, and has the funding, to preserve this type of digital object. Dr Janes is using the CEDARS guide to best practice for a digital archive, and the university's digital store complies with the CEDARS model.

4 Intellectual property rights

4.1 Many holders of copyright

Many of the authors of texts included in 'Lyenn' died over a hundred years ago, but some of them are still alive. Their rights, and the intellectual property rights of Professor Huddy, who transcribed, described, and edited the documents, have to be considered.

The University Press could instead have released a CD-ROM containing both the set of documents and software to view them. In that case, the rights to the documents would have belonged to one group of people, and the rights to the software could have belonged to a separate software publisher. The assistance of the software publisher may be very useful when preserving such a digital object.

Dr Angove has to confirm that all the people who hold rights in the content of the digital object are contacted, and give their consent to preservation of the object.

The legal department of the University Library is able to carry out rights negotiations with the University Press, and with other holders of intellectual property rights.

4.2 Obtain permission if possible

There may be cases where the holders of rights cannot be located, and the digital object is in danger of being lost. In these cases, the librarian may need to consider preserving the object first, and obtaining access permission afterwards.

5 Metadata

In order to preserve a digital object, it is necessary to gather a great deal of detailed information about the object, and to code the information into metadata. The metadata will help enable the archive to preserve the stored digital object and provide access to it for posterity.

5.1 Underlying abstract form

One of the most important pieces of information about any digital object is its underlying abstract form (Holdsworth and Sergeant, 1999). The underlying abstract form relates to the digital object in the same way that the work relates to the book. The underlying abstract form of the object has to be determined, and extracted from its current physical manifestation, so that it can be preserved in a form which is not dependent on a specific hardware format.

The underlying abstract form of this digital object, containing the Cornish-language transcriptions, is a set of text files. The files are in an ISO standard character set, ISO 8859-1, and a standard format, HTML 3.2. The use of these standards increases the likelihood that standard and common software can be used to render the object.

5.2 Rendering

In some cases, the metadata will have to specify that the stored digital object requires an emulator for a specific operating system, and an emulator for a specific release of a specific application software package, in order to be rendered. This will usually be the case when the original digital object was in a proprietary format. When open standards formats are used, digital objects can probably be preserved by migrating them to newer hardware and software.
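
As a minimal sketch of migration for an open standards format, the following Python re-encodes text from ISO 8859-1 to a newer encoding and checks that the text itself, that is, the underlying abstract form, is unchanged. The choice of UTF-8 as the target, the function name, and the sample text are assumptions made for illustration only; they are not part of the CEDARS specification.

```python
def migrate_encoding(data: bytes, source: str = "iso-8859-1", target: str = "utf-8") -> bytes:
    """Re-encode text bytes, keeping the decoded text (the abstract form) unchanged."""
    text = data.decode(source)
    migrated = text.encode(target)
    # The stored bytes may differ, but the text they represent must not.
    if migrated.decode(target) != text:
        raise ValueError("migration altered the text")
    return migrated


# Hypothetical sample: an HTML fragment containing a non-ASCII character.
original = "<p>Café Lyenn</p>".encode("iso-8859-1")
migrated = migrate_encoding(original)
```

The byte sequences before and after migration differ, which is why the check compares the decoded text rather than the stored bytes.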

5.3 Metadata structure

The structure of the metadata is based on the OAIS structure (CCSDS Secretariat, 1999). The Information Package is divided into two parts, Content Information and Preservation Description Information. The rendering information is part of the Representation Information metadata, which is part of the Content Information.

The administration of the CEDARS archive is supported by the Preservation Description Information metadata, which includes Reference Information, for example: title; Context Information, for example: reason for preservation; Provenance Information, for example: rights management; and Fixity Information, for example: checksum algorithm.
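
The structure described above can be sketched as a set of nested records. This is an illustrative outline in Python, not the CEDARS metadata specification: the class and field names, the locator string, and the choice of checksum algorithm are all assumptions, and only the division into Content Information and Preservation Description Information, with their sub-parts, is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class RepresentationInformation:
    file_format: str      # e.g. "HTML 3.2"
    character_set: str    # e.g. "ISO 8859-1"
    rendering_note: str   # how the object can be rendered


@dataclass
class ContentInformation:
    bit_stream_location: str  # pointer to the separately stored object
    representation: RepresentationInformation


@dataclass
class PreservationDescriptionInformation:
    reference: dict = field(default_factory=dict)   # e.g. title
    context: dict = field(default_factory=dict)     # e.g. reason for preservation
    provenance: dict = field(default_factory=dict)  # e.g. rights management
    fixity: dict = field(default_factory=dict)      # e.g. checksum algorithm


@dataclass
class InformationPackage:
    content: ContentInformation
    pdi: PreservationDescriptionInformation


package = InformationPackage(
    content=ContentInformation(
        bit_stream_location="store://example/lyenn",  # hypothetical locator
        representation=RepresentationInformation(
            file_format="HTML 3.2",
            character_set="ISO 8859-1",
            rendering_note="any browser supporting HTML 3.2",
        ),
    ),
    pdi=PreservationDescriptionInformation(
        reference={"title": "Lyenn"},
        context={"reason_for_preservation": "supports study of the Cornish language"},
        provenance={"rights_management": "negotiated with the University Press"},
        fixity={"checksum_algorithm": "sha256"},  # algorithm choice is an assumption
    ),
)
```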

5.4 Subject access

Professor Huddy has already described each document, and these descriptions, together with information from the digital object itself, can be used to prepare resource discovery metadata. This metadata, together with a CEDARS digital object registry number, can be shared with search engines, OPACs, and subject gateways, to enable interested scholars to find this resource.

5.5 Intellectual property rights

The metadata covering rights management contains the history of submission negotiations, details to help identify intellectual property rights holders, descriptions of the types of user who are permitted to use this digital object, and details of the permitted actions with this digital object.

Actions by archive staff which are needed to ensure long-term preservation are assumed to be permitted, since the digital objects are deposited in an archive designed to ensure long-term preservation.
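
One way to picture the rights management metadata is as a record that lists the rights holders and, for each type of user, the actions permitted. The Python below is a sketch under stated assumptions: the user types, the action names, and the `is_permitted` method are hypothetical, chosen only to mirror the description above, including the rule that preservation actions by archive staff are assumed to be permitted.

```python
from dataclasses import dataclass, field


@dataclass
class RightsMetadata:
    rights_holders: list
    # Maps a type of user to the set of actions that user may perform.
    permitted_actions: dict = field(default_factory=dict)

    def is_permitted(self, user_type: str, action: str) -> bool:
        # Preservation actions by archive staff are assumed to be permitted.
        if user_type == "archive_staff" and action == "preserve":
            return True
        return action in self.permitted_actions.get(user_type, set())


# Hypothetical user types and actions for the 'Lyenn' example.
rights = RightsMetadata(
    rights_holders=["Professor Huddy", "University Press"],
    permitted_actions={
        "university_member": {"view_on_site", "view_on_network"},
        "visiting_scholar": {"view_on_site"},
    },
)
```

A user type that does not appear in the record is permitted nothing, so the default is to refuse rather than to allow.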

6 Technical preservation

6.1 The underlying abstract form and the bit stream

In order to help ensure the long-term preservation of the object, the underlying abstract form, in this case the set of text files, will be concatenated and compacted, and stored as a bit stream. The stored object should be kept in at least two copies in at least two locations, using at least two different hardware formats. Dr Angove has one copy on a server in the University Library. Dr Janes has another networked copy on a tape filing system in the climate-controlled basement machine room of the Faculty of Cornish Language and Literature. Other copies of the Archival Information Packages may be stored at CEDARS model archives elsewhere. The digital object will be stored separately from, but refer to, both the metadata which describes it, and the rendering software necessary to use it.
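
The storage step can be sketched as follows. This Python example concatenates a set of files into a single compressed bit stream, records a checksum, and verifies each stored copy against it. The use of tar with gzip compression, SHA-256 as the checksum, and the file and location names are assumptions made for illustration; the text does not say which formats or algorithms a CEDARS archive uses.

```python
import hashlib
import io
import tarfile


def pack(files: dict) -> bytes:
    """Concatenate named files into one gzip-compressed tar bit stream."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


def unpack(bit_stream: bytes) -> dict:
    """Recover the original files from the bit stream."""
    with tarfile.open(fileobj=io.BytesIO(bit_stream), mode="r:gz") as archive:
        return {m.name: archive.extractfile(m).read() for m in archive.getmembers()}


def checksum(bit_stream: bytes) -> str:
    """Fixity value for a stored bit stream."""
    return hashlib.sha256(bit_stream).hexdigest()


# Hypothetical contents standing in for the set of text files.
files = {"play1.html": b"<p>text one</p>", "play2.html": b"<p>text two</p>"}
bit_stream = pack(files)
recorded = checksum(bit_stream)

# Two copies in two locations; each is verified against the recorded checksum.
copies = {"library_server": bit_stream, "tape_store": bytes(bit_stream)}
intact = {place: checksum(copy) == recorded for place, copy in copies.items()}
```

The checksum is taken over the stored bit stream rather than over the individual files, so a copy can be verified without unpacking it.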

6.2 Maintain detailed rendering information

The information in the metadata will be regularly reviewed and revised by the CEDARS model archives, depending on changes in copyright legislation and rendering software. Since the metadata and the rendering software are separate from the stored digital object, advances in rendering and emulation technology can be used by the CEDARS digital preservation archives without the need to alter the original stored object.

7 Retrieval

When a scholar searching for information on Cornish-language manuscripts comes across a reference to Professor Huddy's work in a subject gateway, search engine, or OPAC, the resource discovery metadata will direct the query to a CEDARS archive engine. The engine will look up the registry number, and direct the query to a CEDARS archive storage location. The storage location will use the metadata to choose the appropriate rendering software to supply either the digital object, or information about the object, to the scholar. In this case, the intellectual property rights metadata, as agreed with Professor Huddy, means that the system will tell the scholar to travel to Truro to view the document.
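
The retrieval path described above can be sketched as two lookups followed by a rights check. In this Python outline the registry number, the location name, the record fields, and the wording of the responses are all hypothetical; the sketch shows only the order of the steps, not how a CEDARS archive engine is actually implemented.

```python
# Hypothetical registry: registry number -> storage location.
REGISTRY = {"cedars-0001": "truro-library-server"}

# Hypothetical holdings: the metadata each storage location keeps for its objects.
HOLDINGS = {
    "truro-library-server": {
        "cedars-0001": {
            "title": "Lyenn",
            "rendering": "any browser supporting HTML 3.2",
            "remote_access": False,  # assumed outcome of the rights agreement
        }
    }
}


def retrieve(registry_number: str) -> str:
    """Resolve a registry number and answer according to the rights metadata."""
    location = REGISTRY.get(registry_number)
    if location is None:
        return "unknown registry number"
    record = HOLDINGS[location][registry_number]
    if record["remote_access"]:
        return f"deliver '{record['title']}' for rendering with {record['rendering']}"
    return f"'{record['title']}' may be viewed on site only, at {location}"
```

For the registry number above, the function returns information about the object rather than the object itself, which corresponds to the scholar being told to travel to Truro.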

8 Summary

The CEDARS digital preservation archive, derived from the OAIS model, assumes distributed preservation archives, with digital objects located using resource discovery metadata in subject gateways, OPACs, and search engines. The objects are accessed through distributed name servers, which point to digital objects stored as bit streams. Each bit stream contains the concatenated and compacted files that make up the object's underlying abstract form. The digital objects are preferably produced using open standards formats. The metadata specifies the software needed to render each object, and users view the objects using that software.

Preserving digital objects is a complex and expensive process. Preserving digital objects which are produced by scholars today ensures that future scholars will be able to build on today's work.

I would like to thank Mr Peter Fox, Dr David Holdsworth, Ms Patricia Killiard, Mr Alan Morrison, Professor Stefan Reif, Ms Kelly Russell, Dr Derek Sergeant, Mr Andrew Stone, and Mr John Wells.

