Digital preservation: terminology, techniques, testing and trust

The importance and magnitude of the problem facing society in preserving our digitally encoded intellectual and cultural capital are not in doubt. However, there are a number of fundamental challenges which must be overcome in order to provide adequate solutions. This paper will describe the progress which has been made so far in solving these challenges and the further progress which must be made if we are to safeguard our digital holdings. For example, the terminology used in different disciplines is a barrier to sharing ideas and improving practice; the OAIS reference model has provided a partial solution to this but more can be done. There is a fundamental divide in mind-sets and approaches between those dealing with rendered objects and those dealing with digital objects which are processed rather than rendered, and this causes much confusion; this paper will provide a view which resolves these differences. Claims for the various digital preservation techniques abound, yet which ones actually work? A fundamental approach to providing convincing evidence, which should be used to test any such claims, will be described. Claims that a repository is preserving the digital holdings with which it has been entrusted are hard to test; how can depositors in or funders of repositories know which repositories can be trusted? This paper will describe the mechanism which is being developed to ISO certify repositories and what is left to do.


WHAT IS SPECIAL ABOUT DIGITAL OBJECTS
One might say that digital objects give rise to special concerns because the '1's and '0's which make up binary things are difficult to see. In the case of CD-ROMs one can, with a microscope, see the pits in the surface. However, those little pits on the disk are not the bits.
To get to the bits one needs to unravel the various levels of bit-stuffing, the error correction codes and logical addressing; and above that there are the files and file systems. These things are handled by the electronics of the CD-ROM reader or of the computer hard disk, where one would be looking at magnetic domains rather than pits, and they expose a relatively simple electronic interface that talks to the rest of the computer system in terms of bits. Such electronic interfaces illustrate a type of virtualisation which is widely used to allow equipment from many manufacturers to be used in computers. However, the underlying technology of such disks changes relatively quickly and so do the interfaces; as a result one cannot usually use an old type of disk in a new computer. This applies to the well-known examples of floppy disks and CD-ROMs as well as to internal spinning hard disks.
It may be argued that a better, simpler way, which has been proven to hold information for hundreds of years, is simply to write the '1's and '0's on acid-free paper, carve them in stone or write very tiny characters on Silicon Carbide sheets (Aoki, 2008).
Those techniques would get around some problems, although one might only want to use them for the most important digital objects since they sound as if they could be very expensive. Therefore, they are not solutions for the family photographs, although they may be very good for simple text documents (although in that case one might as well simply print the characters out rather than the '1's and '0's). However, there are some more fundamental problems with these approaches. For example, they are not even the solution for things like spreadsheets, where one needs to know what the columns and cells mean. Similarly scientific data, as we will see, need a great deal of additional information in order to be usable.
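The point that the '1's and '0's alone do not carry their meaning can be illustrated with a small sketch. The four bytes below are a made-up example, not taken from any real dataset, and the function name is purely illustrative; the same bit sequence yields quite different values depending on which encoding rules one assumes.

```python
import struct

# Hypothetical four-byte sequence, chosen only for illustration.
raw = bytes([0x42, 0x28, 0x00, 0x00])


def interpretations(data: bytes) -> dict:
    """Read the same four bytes under several different encoding rules."""
    return {
        "bits": "".join(f"{byte:08b}" for byte in data),
        "uint32_big_endian": struct.unpack(">I", data)[0],
        "uint32_little_endian": struct.unpack("<I", data)[0],
        "float32_big_endian": struct.unpack(">f", data)[0],
        "latin1_text": data.decode("latin-1"),
    }


for rule, value in interpretations(raw).items():
    print(f"{rule}: {value!r}")
```

Without a record of which rule applies, a future reader holding a perfect copy of the bits cannot tell which of these values was intended.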

THREATS TO DIGITAL OBJECTS OF IMPORTANCE TO YOU
Take a moment to think about the digital objects which affect your life. These days at home we have family photographs and videos, letters, emails, bank records, software licences, identity certificates, spreadsheets of budgets and plans, encrypted private data and zip files containing some or all of these things. One might have more complex things such as Word documents with linked-in spreadsheets or databases. Widening the picture now to one's work and leisure, the list might include architectural plans, engineering designs, scientific data from many sources, models and analysis results.
Many will already have had the experience of finding a digital object (let's say for simplicity a file) for which one no longer remembers the details or for which one no longer has the software one used to use. In the case of images or documents there is, at the moment, a reasonable chance of finding some way of viewing them, and that may be perfectly adequate, although one might for example want to know who the people in the photograph are. This would be equivalent to storing a book or photograph on a shelf and then picking it up after many years and still being able to view the symbols or images on the page as before, although the reader may not be able to understand the meaning of those symbols.
On the other hand, many may also have had the experience of finding a spreadsheet, still being able to view all the numbers, text and formulae, and yet being unable to remember what the various cells and columns mean. Thus despite knowing the format of the file and having the appropriate software, the information is essentially lost! Looking yet further afield, consider the digital record of a cultural heritage site such as the Taj Mahal as measured 10 years ago.
In order to know whether visitors have damaged the site one would need to compare those measurements with current day measurements, which may have been made by different instruments or stored in a different way. Thus one needs to be able to combine data of various types.
Based on the comparison one may decide that urgent remedial work is needed and that site visits should be restricted. However, before expending valuable resources there must be confidence that the old data has not been altered and that it is indeed what it is claimed to be.
Other complications may arise. For example many digital objects cannot, or at least should not, be freely distributed. Even photographs taken for some purpose which have some passers-by in the scene perhaps should not be used without the permission of those passers-by - but that may depend upon the different legal systems of the country the photograph was taken in, the country where the photograph is held and the country in which it is being distributed. As time passes, legal systems change. Is it possible to determine the legal position easily?
Thinking about another everyday problem - many Web links no longer work. This will probably get worse over time; yet Web links are often used as an intrinsic part of virtual collections of things. How will we cope with being unable to locate what we need, after even a quite short time?
Of course we may deposit our valuable digital objects in what we consider a safe place. But how do we know that it is indeed safe and can counter the threats noted above? Indeed what happens when the organisation which provides the safe place loses its funding, or is taken over and changes its name, or simply goes out of business? As a case in point the domain name 'casparpreserves.eu', within which the CASPAR (CASPAR, 2006) web site belongs, is owned by the author of this paper; what will happen to that domain name in 50 years' time?
Increasingly one finds research papers, for example in on-line versions of journals, which have links to the data on which the research is based. In such a case some or all of the above issues may threaten their survival.
Another peculiarity of digital data is that it is easy to copy and to change. Therefore, how can one know whether any digital object we have is what it is claimed to be - how can we trust it? A related question is - how was a particular digital object made? A digital object is produced by some process - usually some computer application with certain inputs. In fact it could have been the product of a multitude of processes with a multitude of input data. How can we tell what these processes and inputs were, and whether these processes and inputs were what we believe them to have been? Or alternatively perhaps we want to produce something similar, but using a slightly changed process - for example because the calibration of an instrument has changed - how can this be done?
To answer these kinds of questions for a physical object, such as a vellum parchment or a painting, one can do physical tests to give information about age, chemical composition or surface contaminants. While none of these provide complete answers to the questions, because for example one needs documentation about ownership of paintings, at least they provide a reality check; yet none of these techniques is available for digital objects. Of course these techniques can be applied to the physical carriers of the bits, but those bits can usually be changed without detection. One can think of technologies - for example the carvings on stone - where it might be easy to detect changes in the bits by changes in the physical medium, but things are never so clear-cut, for there one is crossing the boundary between the physical and the digital and blurring the distinction.
The PARSE.Insight project (PARSE.Insight, n.d.) undertook a massive survey which showed a majority view, across continents, disciplines and roles, of the threats to digital objects. Table 1 summarises these threats, supplemented by general statements of what is needed for their solution and some specific solutions taken from the CASPAR project.

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity.
Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object.
CASPAR solution: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future.
Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment.
CASPAR solution: Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.

Threat: Loss of ability to identify the location of data.
Requirement for solution: An ID resolver which is really persistent.
CASPAR solution: Persistent Identifier system: such a system will allow objects to be located over time.

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future.
Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation.
CASPAR solution: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.

Threat: The ones we trust to look after the digital holdings may let us down.
Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term (see RAC).
CASPAR solution: The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.

PRESERVATION TERMINOLOGY
One often hears or reads that the solution is 'metadata', i.e. data about data. There is some truth in that but one needs to ask some pertinent questions, not the least of which are 'what types of "metadata"?' and 'how much "metadata"?'. For example, it is clear that by 'metadata' many people simply refer to ways of classifying or finding something - which is not enough for preservation. Without being able to answer these questions one might as well simply say we need 'extra stuff'.
The answers will be based on the approach provided by the Open Archival Information System Reference Model (ISO 14721, 2002), an international standard which is used as the basis of much, perhaps most, of the work in this area.
Indeed it has been said that it is 'now adopted as the "de facto" standard for building digital archives' (National Science Foundation, 2007).

OAIS terminology
The OAIS Reference Model in fact contains a number of separate models for different aspects of archives, including the Functional Model and Information Model. It also has data flow and context diagrams. The terminology introduced in these models, such as Ingest and Archival Storage, is defined in order to make it easier for the functions of one repository to be compared to another. Archives often show that their functions can be mapped to the OAIS Functional Model. However, being able to describe a repository using OAIS terminology is not proof that the repository is OAIS compliant. Instead it is an illustration of the flexibility of the OAIS terminology.
The most important of these models, and the one which is needed for OAIS conformance, is the Information Model, shown in Figure 2.

Representation Information
The UML diagram (Figure 2) means that:
• an Information Object is made up of a Data Object and Representation Information.
• a Data Object can be either a Physical Object or a Digital Object. An example of the former is a piece of paper or a rock sample.
• a Digital Object is made up of one or more Bits.
• a Data Object is interpreted using Representation Information.
• Representation Information is itself interpreted using further Representation Information.
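The relationships listed above can be sketched as a small set of data structures. This is only an illustrative reading of the model, not an implementation defined by OAIS; the class and function names, and the example table, are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RepresentationInformation:
    description: str
    # Representation Information is itself information, so it may in turn
    # need further Representation Information to be interpreted.
    interpreted_using: List["RepresentationInformation"] = field(default_factory=list)


@dataclass
class DigitalObject:
    bits: bytes


@dataclass
class PhysicalObject:
    label: str


@dataclass
class InformationObject:
    data_object: Union[DigitalObject, PhysicalObject]
    interpreted_using: List[RepresentationInformation]


def representation_network(info: InformationObject) -> List[str]:
    """List every description reachable from the Information Object, depth first."""
    names: List[str] = []

    def visit(rep_info: RepresentationInformation) -> None:
        if rep_info.description in names:  # skip items already listed
            return
        names.append(rep_info.description)
        for further in rep_info.interpreted_using:
            visit(further)

    for rep_info in info.interpreted_using:
        visit(rep_info)
    return names


# Hypothetical example: a small table whose layout description relies on ASCII.
ascii_rules = RepresentationInformation("ASCII character encoding")
csv_layout = RepresentationInformation("comma-separated table layout", [ascii_rules])
record = InformationObject(DigitalObject(b"time,temp\n0,21.5\n"), [csv_layout])
print(representation_network(record))
```

The recursion in `interpreted_using` mirrors the loop in the diagram; where that recursion may reasonably stop is the question taken up in the next section.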

Definition of the Designated Community
It will be noticed that the Representation Information has a loop back to itself. This is a reminder that Representation Information is itself Information, and will itself need its own Representation Information. The realisation of this recursion leads us to the concept of Designated Community and to the answers to the key questions: (1) how much Representation Information is needed and (2) how can preservation be tested?
Designated Community is defined by OAIS as: An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities. A Designated Community is defined by the archive and this definition may change over time.
Why is this a key concept? To answer that question we need to ask another fundamental question, namely 'How can we tell whether a digital object has been successfully preserved?' - a question which can be asked repeatedly as time passes. Clearly we can do the simple things like checking whether the bit sequences are unchanged over time, using one or more standard techniques such as digital digests. However, just having the bits is not enough. The demand for the ability for the object to be 'interpreted, understood and used' is broader than that - and of course it can be tested.
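The simple bit-level check mentioned above can be sketched as follows. This is a minimal illustration using SHA-256 from the Python standard library; the choice of algorithm and the function names are assumptions made here, not something prescribed by OAIS.

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of a bit sequence."""
    return hashlib.new(algorithm, data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Compare the current digest with the one recorded when the object was deposited."""
    return digest(data, algorithm) == recorded_digest


# Hypothetical deposit: record the digest once, then re-check it later.
deposited = b"0101 example content"
recorded = digest(deposited)

print(is_unchanged(deposited, recorded))                # same bits
print(is_unchanged(b"0111 example content", recorded))  # altered bits
```

A matching digest shows only that the bits are unchanged; it says nothing about whether anyone can still interpret them.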
But surely there is another qualification; for is it sensible to demand that anyone can 'interpret, understand and use' the digital object -say a four year old child?
Clearly we need to be more specific. But how can such a group be specified, and indeed who should choose? This seems a daunting task - who could possibly be in a position to do that? The answer that OAIS provides is a subtle one. The people who should be able to 'interpret, understand and use' the digital object, and whom we can use to test the success or otherwise of the 'preservation', are defined by the people who are doing the preservation.
The advantage of this definition is that it leads to something that can be tested. So if an archive claims 'we are preserving this digital object for astronomers' we can then call in an astronomer to test that claim. The disadvantage is that the preserver could choose a definition which made life easy for him/her - what is to stop that? The answer is that there is nothing to prevent that, but who would rely on such an archive?
As long as the archive's definition is made clear then the person depositing the digital objects can decide whether this is acceptable. The success or failure of the archive will be determined by the market. Thus in order to succeed the archive will have to define its Designated Community(ies) appropriately. Different archives, holding the same digital object, may define their Designated Communities differently. This will have implications for the amount and type of 'metadata' which is needed.
By being able to 'understand' a piece of information is meant that one can do something useful with it; it is not intended to mean that one understands all of its ramifications.For example, in a criminal investigation one may have a database with digitally encoded times of telephone calls; here we would be satisfied if we could say 'the telephone call was made at 12:05 pm on 1 January 2009, UK time', but to understand then that this implied that the person who made the call was the murderer is beyond what OAIS means by being able to 'understand' the data.
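As a sketch of what this level of 'understanding' involves, suppose the call time were stored as a count of seconds since 1 January 1970 UTC. That encoding, the stored value and the function name are assumptions made purely for illustration. Decoding it requires knowing the epoch, the unit and the time zone, which is exactly the kind of Representation Information that must accompany the bits.

```python
from datetime import datetime, timezone

# Hypothetical stored value: seconds since 1970-01-01T00:00:00 UTC.
ENCODED_CALL_TIME = 1230811500


def decode_call_time(seconds_since_epoch: int) -> datetime:
    """Decode the stored number using the assumed epoch, unit and time zone."""
    return datetime.fromtimestamp(seconds_since_epoch, tz=timezone.utc)


call_time = decode_call_time(ENCODED_CALL_TIME)
# UK civil time on 1 January coincides with UTC, so no offset is applied here.
print(call_time.isoformat())
```

Drawing conclusions from the decoded time remains outside what the archive is expected to support.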
An important clarification is needed here, namely that the definition of the Designated Community is left to the preserver.The same digital object held in different repositories could be being preserved for different Designated Communities, each of which could consist of many disjoint communities.
The quid pro quo is that those funding or entrusting digital objects to the repository can judge whether the definition of the Designated Community is appropriate for their needs.

a) high level knowledge
b) various types of descriptions, including the way in which complex objects may be viewed as a composite of simpler objects. Some of these objects may be discipline specific whereas others are rather general. For example an image is a fairly general concept - essentially an array of numbers - whereas an astronomical image is an image plus an astronomical co-ordinate system and a way to map to physical measurements.

Details of the simple objects down to the bit level must also be captured.
Note that here, as well as elsewhere, virtualisation techniques can be applied. Further details of this and many other aspects of preservation can be found on the CASPAR web site and in particular the CASPAR Conceptual Model (CASPAR, 2007).
• The digital objects must be stored, indicated here as a Preservation Object Data Store.
Subsequently the process must be reversed, for example:
• Information must be extracted using the Representation Information at various levels
• Access constraints must be understood and respected
It is worth noting that many of these descriptions and extra pieces of information (metadata) will themselves be digitally encoded and will therefore also need to be preserved, using the same techniques.

OAIS conformance
OAIS defines a number of responsibilities by which to judge conformance, which may be summarised as follows (these are the revised versions of these responsibilities). An OAIS must:
• Negotiate for and accept appropriate information from information Producers.
• Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.
• Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided, thereby defining its Knowledge Base.
• Ensure that the information to be preserved is Independently Understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information.
• Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, including the demise of the archive, ensuring that it is never deleted unless allowed as part of an approved strategy - there should be no ad hoc deletions.
• Make the preserved information available to the Designated Community and enable the information to be disseminated as copies of, or as traceable to, the original submitted Data Object, with evidence supporting its Authenticity.

PRESERVATION WORKFLOWS AND TECHNIQUES
A number of other workflows arise from the support components identified by CASPAR and the UK Digital Curation Centre (http://www.dcc.ac.uk), which may be summarized in Figure 4.

Workflows for use of digital objects
The following workflow, extracted from Figure 4, illustrates the way in which digital objects may be used and understood by users.
The basic idea is that Representation Information must be associated with the Data Object. Identifiers (called here Curation Persistent Identifiers - CPID), which can be associated with any data object, point to the appropriate Representation Information in a Registry/Repository, as illustrated in Figure 5. The Representation Information returned by the Registry/Repository is itself a digital object with its own CPID. This is not meant to imply that there must be a single, unique, Registry/Repository, nor even a single definitive piece of Representation Information for any particular piece of digitally encoded information. The Representation Information may be packed with the Data Object or may be otherwise cached locally.
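The lookup described above can be sketched as follows. This is not the CASPAR Registry/Repository interface; the classes, the function names and the identifiers such as `cpid:table-layout` are hypothetical, introduced only to show CPIDs being followed through a local cache and more than one registry.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RegistryEntry:
    cpid: str
    content: str
    # CPIDs of further Representation Information needed to understand this entry.
    rep_info_cpids: List[str]


class Registry:
    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}

    def register(self, entry: RegistryEntry) -> None:
        self._entries[entry.cpid] = entry

    def resolve(self, cpid: str) -> Optional[RegistryEntry]:
        return self._entries.get(cpid)


def lookup(cpid: str, cache: Dict[str, RegistryEntry],
           registries: List[Registry]) -> Optional[RegistryEntry]:
    """Try the local cache first, then each registry in turn."""
    if cpid in cache:
        return cache[cpid]
    for registry in registries:
        entry = registry.resolve(cpid)
        if entry is not None:
            cache[cpid] = entry  # keep a local copy for later use
            return entry
    return None


def collect_rep_info(start_cpid: str, cache: Dict[str, RegistryEntry],
                     registries: List[Registry]) -> Tuple[List[RegistryEntry], List[str]]:
    """Follow CPIDs breadth first; return the entries found and the CPIDs not resolved."""
    found: List[RegistryEntry] = []
    missing: List[str] = []
    seen = set()
    queue = deque([start_cpid])
    while queue:
        cpid = queue.popleft()
        if cpid in seen:
            continue
        seen.add(cpid)
        entry = lookup(cpid, cache, registries)
        if entry is None:
            missing.append(cpid)
            continue
        found.append(entry)
        queue.extend(entry.rep_info_cpids)
    return found, missing


# Hypothetical registries and identifiers.
first, second = Registry(), Registry()
first.register(RegistryEntry("cpid:table-layout", "column meanings", ["cpid:csv", "cpid:units"]))
second.register(RegistryEntry("cpid:csv", "comma-separated text rules", []))

local_cache: Dict[str, RegistryEntry] = {}
entries, unresolved = collect_rep_info("cpid:table-layout", local_cache, [first, second])
print([e.cpid for e in entries], unresolved)
```

Returning the unresolved identifiers separately makes visible any Representation Information that no registry currently supplies.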
The issue which must be considered next is maintaining the Representation Network. This is crucial because it allows the Data Object to remain understandable despite changes in hardware, software, environment and the Knowledge Base of the Designated Community. As a result of these changes 'gaps' will arise between the available Representation Network and the Designated Community's Knowledge Base. The way in which these are filled is addressed in the next section.

Workflows for maintaining the Representation Network
The Registry/Repository is supplemented by the Knowledge Manager - more specifically a Representation Information Gap Manager which identifies gaps which need to be filled, based on information supplied to the Orchestration component. Of course the information on which this is based does not come out of thin air. People (initially) must provide this information and the Orchestration Manager collects this information and distributes it.
Support for automation in identifying such 'gaps', based on information received, is illustrated in Figure 6, which shows users (u1, u2…) with user profiles (p1, p2…, each a description of the user's Knowledge Base) with Representation Information (m1, m2,…) to understand various digital objects (o1, o2…).
Take for example user u1 trying to understand digital object o1. To understand o1, Representation Information m1 is needed. The profile p1 shows that user u1 understands m1 (and therefore its dependencies m2, m3 and m4) and therefore has enough Representation Information to understand o1. When user u2 tries to understand o2 we see that o2 needs Representation Information m3 and m4. Profile p2 shows that u2 understands m2 (and therefore m3). However, there is a gap, namely m4, which is required for u2 to understand o2.
For u2 to understand o1, we can see that Representation Information m1 and m4 need to be supplied. Further details are available in the CASPAR Conceptual Model (CASPAR, 2007) and in Tzitzikas (2007) and Tzitzikas & Flouris (2007). This illustrates one of the areas in which Knowledge Management techniques are being applied within CASPAR, in addition to the capture of Semantic Representation Information.
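The gap calculation in the example above can be sketched as a set difference over dependency closures. Figure 6 is not reproduced here, so the dependency graph below (m1 depending on m2 and m4, m2 depending on m3) is an assumption chosen to be consistent with the worked example; the function names are illustrative and are not part of any CASPAR component.

```python
from typing import Dict, Iterable, Set

# Assumed dependencies between pieces of Representation Information.
DEPENDS_ON: Dict[str, Set[str]] = {
    "m1": {"m2", "m4"},
    "m2": {"m3"},
    "m3": set(),
    "m4": set(),
}

# What each digital object needs, and what each user profile already covers.
NEEDS: Dict[str, Set[str]] = {"o1": {"m1"}, "o2": {"m3", "m4"}}
PROFILES: Dict[str, Set[str]] = {"p1": {"m1"}, "p2": {"m2"}}


def closure(modules: Iterable[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the given modules together with everything they depend on."""
    seen: Set[str] = set()
    stack = list(modules)
    while stack:
        module = stack.pop()
        if module in seen:
            continue
        seen.add(module)
        stack.extend(depends_on.get(module, ()))
    return seen


def gap(obj: str, profile: str) -> Set[str]:
    """Representation Information required for obj but not covered by the profile."""
    required = closure(NEEDS[obj], DEPENDS_ON)
    known = closure(PROFILES[profile], DEPENDS_ON)
    return required - known


for obj, profile in [("o1", "p1"), ("o2", "p2"), ("o1", "p2")]:
    print(obj, profile, sorted(gap(obj, profile)))
```

Treating a profile as closed under dependencies reflects the statement that a user who understands m1 also understands what m1 depends on.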

TESTING CLAIMS
We can consider some of the things which can change over time and hence against which an archive must safeguard the digitally encoded information.

Hardware and software changes
Use of many digital objects relies on specific software and hardware, for example applications which run on specific versions of Microsoft Windows which in turn runs on Intel processors. Experience shows that, while it may be possible to keep hardware and software available for some time after it has become obsolete, it is not a practical proposition into the indefinite future. However, there are several projects and proposals which aim to emulate hardware systems and hence run software systems.

Environment changes
These include changes to licences or copyright and changes to organisations, affecting the usability of digital objects. External information, ranging from the DNS to DTDs and Schema, vital to the use and understandability, may also become unavailable.

Termination of the Archive
Without permanent funding, any archive will, at some time, end. It is therefore possible for the bits to be lost, and much else besides, including the knowledge of the curators of the information encoded in those bits. Experience shows that much essential knowledge, such as the linkage between holdings, operation of specialised hardware and software and links of data files to events recorded in system logs, is held by such curators but not encoded for exchange or preservation. Bearing these things in mind it is clear that any repository must be prepared to hand over its holdings - together with all these tacit pieces of information - to its successor(s).

Changes in what people know
As described earlier the Knowledge Base of the Designated Community determines the amount of Representation Information which must be available. This Knowledge Base changes over time.
The CASPAR project (2009) undertook a number of scenarios in which accelerated lifetime tests were carried out. A number of archives were involved, including data from the science, contemporary performing arts and cultural heritage domains. Table 2 summarises these scenarios.
The first document lists the metrics against which a digital repository may be judged. It is anticipated that this list will be used for internal metrics or peer review of repositories, as well as for the formal ISO audit process. It must be recognised that the audit process cannot be specified in very fine, rigid, detail. An audit process must depend upon the experience and expertise of the auditors. For this reason the second document sets out the system under which the audit process is carried out; in particular the expertise of the auditors and the qualification which they should have is specified. In this way the document specifies how auditors are accredited and thereby helps to guarantee the consistency of the audit and certification process. For this reason the RAC Working Group refers to accreditation and certification processes.
At the time of writing both documents are in an advanced state of preparation. Because of the close links between the metrics and OAIS concepts and terminology it is important that the two remain consistent, and cross-membership of the working groups ensures this.

SUMMARY AND CONCLUSIONS
This paper has sought to illustrate a number of recent developments in our understanding of the important issues in digital preservation, including threats to the preservation of digitally encoded information and solutions to these threats. One important theme through this paper has been the importance of evidence. Evidence and testability are at the core of the OAIS view of preservation and also at the core of the techniques developed in the CASPAR project. Solid evidence should be demanded before we entrust our valuable, and often irreplaceable, digitally encoded intellectual capital to any preservation solution.
A second theme is the central role of knowledge and information. This is especially clear for digital objects such as scientific data where maintaining the ability to render the object is not sufficient - one must also maintain the understandability and usability of these objects by members of the Designated Community. This requirement introduces significant complications to digital preservation and yet these requirements cannot be ignored. The CASPAR project has gone some way to indicate that solutions are possible.

Figure 4: Preservation Workflows
Figure 4 contains a number of information flows; some sequences of these flows make up workflows important for digital preservation, and two of these are described next.

Figure 5: Use of Registry/Repository of Representation Information

Table 1: Threats and solutions
The Representation Information will include such things as software source code and emulators.

Table 2: CASPAR accelerated lifetime tests
The need for a standard for certification of archives was included in that list and the RLG/NARA work which produced TRAC was the first step in that process. The next step was to bring the output of the RLG/NARA working group back into CCSDS. This has been done and the Digital Repository Audit and Certification (RAC) Working Group has been created (Digital Repository Audit and Certification Wiki, n.d.). Both may be read by anybody but, in order to avoid hackers, only authorised users may add to them. The openness of the development process is particularly important and the latter site contains the notes from the weekly virtual meetings as well as the live working version of the draft standards. Besides developing the metrics, which started from the TRAC document, the working group has also been working on the strategy for creating the accreditation and certification process. On reviewing existing systems which have standard accreditation and certification processes it became clear that there was a need for two documents:
1. Metrics for Audit and Certification of Digital Repositories.
2. Requirements for Bodies Providing Audit and Certification of Digital Repositories.