What's the Problem: Requirements for a Digital Library

Sarr Blumson

Abstract
Reading about digital libraries often feels like listening to the blind men describe the elephant. This essay suggests that these different views result from different visions of the problems that a digital library should solve. We suggest that the digital library community needs a systematic "requirements analysis" to identify these different problems, and then outline a first pass at this analysis.

Introduction - Requirements Analysis

The first stage in an engineering design problem is usually described as requirements analysis [Press]: the process of analyzing what it is exactly that the system being designed is supposed to do. This process begins with two questions: What problem are we trying to solve, and what constituencies are we trying to serve?

This paper is motivated by the belief that the digital library community has not thought carefully enough about these questions. While some of the literature discusses requirements, for example [Gladney], these discussions seem to assume a set of problems and proceed to describe the system requirements for dealing with them. The difficulty with these efforts is that none of them seem to assume the same set of problems, with the result that there is little clear discussion of how these different assumptions interact and conflict. The goal of this paper is to begin that discussion.

We begin with a (by no means exhaustive) set of problems for which a digital library (by some definition) might be of some use:

Information Overload
Reducing Acquisition Costs
Aiding in Preservation
Allowing Wider Access to Materials
Aiding the handicapped
Facilitating Scholarly Collaboration
Allowing More Varied (eg, Multimedia and Interactive) Documents

The next section of this essay will discuss each of these problem areas in more depth. This will be followed by some exploration of how they conflict and where these conflicts might be reconcilable. Because we are focusing on requirements there will be very little discussion of solutions, except where this is necessary to illustrate potential conflicts.

In this discussion, "digital library" is left as an undefined primitive term. In this context it is, perhaps, really equivalent to the term "library." In a sense this essay is about exactly what is added by the word "digital."

Problem Areas

Information Overload

Perhaps the first vision of a digital library (although he would not have used that word) was that of Vannevar Bush [Bush], who saw technology as a means of coping with the explosion of scholarly (particularly scientific) literature that was already becoming overwhelming. He saw the possibilities both of more compact storage and of fine-grained cross-referencing as ways of coping with a rapidly growing body of knowledge. Bush's vision focused on indexing and locating information, but the possibility of using electronic means, such as scientific visualization tools, to improve the presentation and comprehensibility of information is also relevant.

Reducing Acquisition Costs

A related but differently focused issue is the problem of skyrocketing acquisition costs for libraries [Hawkins]. The same forces that are creating more material than scholars can absorb are creating more material than libraries can afford to buy. When libraries can communicate electronically (that is, instantly) it becomes possible for groups of libraries to share their acquisitions without reducing the availability of materials to their customers. A group of libraries could have one "complete" collection among them, rather than one complete collection each.

There is some question whether these cost savings would actually be realized. To the extent that publishing costs are dominated by preparation costs, reducing the number of copies that a publisher sold would force the publisher to increase the per-copy price in the same proportion, keeping the total cost to the libraries constant. Of course, paper production costs are not, in fact, zero (although CD-ROM production costs are close), so some savings would be possible.
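
The argument above can be made concrete with a small cost model. The sketch below assumes, purely for illustration, a publisher that prices each copy to recover exactly a fixed preparation cost plus a per-copy production cost; the function name and every figure are hypothetical and are not drawn from any of the sources cited here.

```python
def total_library_spend(prep_cost, unit_cost, libraries, group_size):
    """Total paid by all libraries when each group of `group_size` shares one copy.

    Hypothetical pricing assumption: the publisher sets the per-copy price so
    that revenue exactly covers prep_cost + unit_cost * copies_sold.
    """
    if libraries % group_size != 0:
        raise ValueError("libraries must divide evenly into groups")
    copies_sold = libraries // group_size
    price_per_copy = prep_cost / copies_sold + unit_cost
    return price_per_copy * copies_sold


# Hypothetical figures: 1000 libraries, preparation cost of 50,000.
with_production_cost = (
    total_library_spend(50_000, 10, 1000, 1),  # every library buys its own copy
    total_library_spend(50_000, 10, 1000, 5),  # groups of five share one copy
)
zero_production_cost = (
    total_library_spend(50_000, 0, 1000, 1),
    total_library_spend(50_000, 0, 1000, 5),
)
print(with_production_cost, zero_production_cost)
```

Under these assumptions sharing saves only the production cost of the copies that are no longer made; when the production cost is zero, the total paid by the libraries does not change at all.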

Aiding in Preservation

Libraries are, among other things, the conservators of civilization; it was the monastic libraries that held and protected the archives of our civilization through the dark ages until we were again ready to use them. This has made us understand that our view of what is valuable and what is not is transitory. This is why libraries would like to preserve everything.

This creates some difficult problems. The most obvious is that everything is quite a lot, and the volume is ever increasing. Another is that no materials are permanent, and many of the materials on which we record civilization are particularly subject to decay; in this sense the switch from stone tablets to paper was probably not good news to archivists.

Electronic storage offers some hope of a solution to both problems. Insofar as electronic storage can capture materials at all, the electronic copies are much more compact and offer at least the appearance of permanence. Electronic storage also allows unique and fragile documents to be studied without risk of damage or loss.

Both of those qualifications are important. Electronic media offer relatively sophisticated mechanisms for displaying visual and auditory objects, but olfactory objects are not so easy (although the techniques for output are relatively well understood, capture is not) and tactile communication is far beyond current abilities. Vision and sound are enough to capture most library materials with a fairly high degree of satisfaction, but for many objects an electronic copy would not be adequate if it were the only copy in the world. For some, particularly those whose origin is electronic in the first place, it would.

The question of permanence requires some attention as well. All magnetic storage devices are subject to damage by magnetic fields in their environment and by background radiation. Optical devices are more permanent in principle, but are relatively new and our experience with them is limited. All forms of storage are, like paper, subject to physical wear and potential physical damage whenever they are actually read. While electronic storage media permit comparatively rapid bulk copying, the large volume of data involved makes the prospect of copying the entire contents of a large library onto fresh media every few years frightening. This copying would also raise the risk of damaging the original copy as it was being read.

Allowing Wider Access to Materials

Even in the United States, many people live in areas where physical access to a quality library is difficult; in most parts of the world convenient access is rare. For both budgetary and political reasons it is quite likely that in the near term the number of people who have easy physical access to reasonably complete libraries will decline rather than increase.

Electronic libraries offer a possibility for widespread access to libraries even where physical access is impossible. If all access is by telephone then physical location doesn't matter. Of course this ignores many issues of cost and network capacity. We will have more to say about that later.

Aiding the handicapped

For many people physical access to libraries is difficult or impossible because of difficulties in physical mobility, having nothing to do with distance. They could benefit from the possibility of electronic access even if physical libraries were pervasive.

For others, physical limitations make it difficult to deal with conventional library materials. For some people holding a book, especially a large, hard bound reference volume, is difficult. Others can't see, or can't hear. Rubinsky [Rubinsky] discusses the ability to transform documents that electronic media offer. Type fonts can be changed and enlarged and page color and layout changed to accommodate particular vision limitations, or reading aloud of documents could be automated to make documents that are not of widespread interest (and therefore not normally available as sound recordings) available to blind individuals on demand. Similarly, sound recordings could be transformed into visual objects for the deaf.

This power is seen as a threat by some artists and publishers, who view any transformation as damaging to communication [Arnett]. This is clearly true to some extent. Many texts lose much of their aesthetic quality when the original typography and layout are changed, and paintings lose much of their quality when reduced to pixels and a limited color space. In this context, however, transformation allows some communication in situations where it could otherwise not occur at all, which seems like a positive step.

Facilitating Scholarly Collaboration

Electronic documents provide an opportunity to expand scholarly collaboration in several dimensions. For example, they raise the possibility that research might be described completely, by including connections to supporting data that is (as is already common in the sciences) or could be stored electronically. Another possibility is that documents could be annotated by readers at their original source, making the comments of colleagues available to all. If Fermat's famous marginal note had been seen by others prior to his death, someone might have asked him how to prove his theorem. If his notes had been part of a real-time collaboration system, they might have developed the proof together.

Allowing More Varied (eg, Multimedia and Interactive) Documents

This is left to the end of the list because it is, in some sense, circular to say that libraries cannot store electronic media unless they are electronic. Clearly, libraries are going to have to store and make available some forms of digital media, because they are going to be produced for reasons having nothing to do with the needs of libraries. This is important to a general discussion, however, in that it changes the economics underlying libraries if some of the necessary infrastructure will be present independently of any of the other changes we are discussing.

Interactions

In this section we will explore some of the implications of these different requirements, focusing on the ways in which they are in conflict. In an attempt to structure this exploration we will consider several dimensions; these are essentially arbitrary but hopefully will be useful.

Fixed vs Fluid

Levy and Marshall [Levy92] describe one dimension of the difference between conventional and digital libraries by suggesting that documents in a conventional library are fixed and immutable, while documents in a digital library are fluid and changeable. To some extent, as Levy notes in a later essay [Levy94], this is really a question of degree. It has always been possible to annotate materials in a library, although whether this is viewed as history or vandalism depends on who is doing the annotating. Conversely, digital documents can actually be more fixed, since the "real" copies are never in the physical possession of anyone and cannot be altered. Still, digital storage of documents offers the opportunity to choose how fluid documents will be in a way that traditional media do not. Our different constituencies, however, would not make the same choices.

Fluidity is the curse of an archivist, because it blurs the criteria for what to archive. If a document is subject to change, what belongs in the archive? The "original" (assuming it can be identified)? The "final" (perhaps in the sense of most recent)? Or perhaps every version? That last answer would not be pleasing news to the archivist who was hoping that digital storage would reduce the resources required for archiving.

On the other hand, fluidity would seem to fulfill a dream of scholarly collaboration. It offers an opportunity for a document to reflect the shared ideas of a group, and to live and grow as the group's knowledge lives and grows. However, it does raise issues of security (who can make changes) and permanence (remembering what was there before the change) that scholars don't have to deal with in a conventional library.

Fluidity in a slightly different sense, the ability to transform the presentation of a document, is what offers opportunities to the handicapped; it offers the possibility of changing appearance (or even media) in a way that makes things accessible to people with perceptual handicaps. But even this is a potential concern for archivists (which version is the "real" one?) and is also seen as a threat by some authors and publishers who view presentation as an artistic issue. Fluidity also limits schemes that identify each copy by subtle differences in appearance to detect unauthorized copying [Low].

These issues affect the choice of storage medium. Fluid documents require a storage medium that can easily be modified; CD-ROM simply will not work. Storage devices that are easily modified, such as magnetic disks, are prone to catastrophic failures and have lifetimes that are extremely short by library standards: on the order of five to ten years. A library based on these devices would require constant backup and maintenance to ensure the continued availability of its materials.

The rapid rate of change of computer technology is an issue as well. Every scholar who wonders what to do with the boxes of punched cards stacked in her office, holding the archives of years of research, understands that preserving the media is only half the problem.

For our other constituencies fluidity is probably not of great interest one way or another, as long as it is free. Of course, it isn't, so within a finite budget the expense of supporting fluidity means some reduction in other services.

Cost vs Detail

Many other design decisions that will be made in creating a digital library will involve cost tradeoffs. One example is in the level of detail that a digital library should record. High quality photographic images are the most expensive form of document to store and display. On the other hand, they can preserve details that cannot be captured any other way. For archiving some documents that were originally in paper form, this may be crucial. For other documents, especially those that were originally in electronic form, page description languages, such as PostScript (tm), may provide a complete description, at a somewhat reduced cost.

Neither imaging nor page description languages provide much opportunity for transformation (magnification and some limited color manipulation are possible, but font manipulation, for example, is not), and both make text searching difficult or impossible. Markup languages, such as SGML, provide another level of reduction in storage costs, and preserve the text of a document in a way that permits straightforward searching and indexing, which is important to those looking for help in dealing with information overload. They do, however, still require specialized software for display. This expense disappears when documents are stored as plain text. Plain text still permits searching, but loses a great deal of information regarding the structure of the text. It also removes any possibility of including non-textual material.

Michael Hart, of Project Gutenberg, has argued (for example in [Hart]) for the dissemination of plain texts to make materials available to the widest possible audience at the lowest possible cost. For public libraries, at least, this is a strong argument, but the cost, particularly the loss of illustrations, is high. A case can also be made that preserving some structural markup facilitates certain kinds of aids for the handicapped. Telling a reading program to skip to the next chapter, for example, is much easier if the document includes some clear indication where the next chapter is.
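
The chapter-skipping example can be sketched in a few lines. The code below assumes a hypothetical convention in which each chapter begins with a line starting with the marker `<chapter>`; the marker, the function name, and the sample text are illustrative assumptions and are not taken from any real SGML document type.

```python
CHAPTER_TAG = "<chapter>"  # hypothetical marker, not from any real DTD


def next_chapter(lines, position):
    """Return the index of the first chapter marker after `position`, or None."""
    for index in range(position + 1, len(lines)):
        if lines[index].strip().lower().startswith(CHAPTER_TAG):
            return index
    return None


# Hypothetical document: two chapters, each introduced by the marker.
document = [
    "<chapter>Chapter One",
    "First paragraph of chapter one.",
    "Second paragraph of chapter one.",
    "<chapter>Chapter Two",
    "First paragraph of chapter two.",
]

start_of_next = next_chapter(document, 1)  # reader is in the middle of chapter one
print(start_of_next)
```

Without some such marker, a reading program would have to guess at chapter boundaries from the wording or spacing of the plain text, which is far less reliable.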

It should be noted that while the storage and viewing of images is expensive, images provide the cheapest way to transfer existing paper documents to electronic form. Optical character recognition (OCR) has proven error prone [LOC] and manual forms of entry are enormously labor intensive (and still error prone). A question worth investigating is whether OCR is adequate for building a searchable index for documents whose "real" form is an image.
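
One way to picture the arrangement in question is an index built from OCR output that points back to the page images, with the images remaining the documents that readers actually see. The following is a minimal sketch under that assumption; the page identifiers, the sample OCR text (including its deliberate misreading of "libraries"), and the function names are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(ocr_pages):
    """Map each lowercased word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in ocr_pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page ids whose OCR text contains every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical OCR output; "1ibraries" simulates a character recognition error.
ocr_pages = {
    "page-001": "Digital 1ibraries and preservation",
    "page-002": "Preservation of digital media",
}
index = build_index(ocr_pages)
print(search(index, "digital preservation"))
print(search(index, "libraries"))
```

The second query finds nothing even though the first page image plainly contains the word, which is exactly the kind of failure that would have to be measured before relying on OCR for indexing.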

Systems that detect unauthorized copying by making subtle identifying changes in word or line spacing in each authorized copy, as mentioned in the previous section, depend on a storage medium that preserves these fine details of documents while making them difficult to detect without prior knowledge. These schemes require an image representation.
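
To illustrate why such schemes depend on fine layout detail, the sketch below encodes a copy number in small vertical shifts of text lines. It is a deliberately simplified illustration and not the method described in [Low]; the function names, the baseline positions, and the one-unit shift are all assumptions made for the example.

```python
def mark_copy(baselines, copy_id, shift=1):
    """Shift line i down by `shift` if bit i of copy_id is 1, otherwise up."""
    marked = []
    for i, y in enumerate(baselines):
        bit = (copy_id >> i) & 1
        marked.append(y + shift if bit else y - shift)
    return marked


def read_copy_id(original, marked):
    """Recover the copy id by comparing marked baselines with the unmarked original."""
    copy_id = 0
    for i, (y_original, y_marked) in enumerate(zip(original, marked)):
        if y_marked > y_original:
            copy_id |= 1 << i
    return copy_id


# Hypothetical baseline positions of four text lines, in arbitrary page units.
baselines = [100, 140, 180, 220]
marked = mark_copy(baselines, copy_id=5)
print(marked, read_copy_id(baselines, marked))
```

Reading the mark requires the unmarked baselines, which is the "prior knowledge" referred to above, and any representation that reflows the text or discards exact line positions destroys the mark.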

Conclusions

This essay begins, but by no means completes, an examination of some of the different constituencies that might make use of a digital library, the problems they have for which a digital library might offer some help, and the ways in which the needs of these different constituencies may be in conflict. The discussion here is somewhat random and chaotic, which reflects its incompleteness; the examination is an ongoing process which needs to continue for as long as libraries continue to evolve, hopefully forever.

References

[Press]
[Gladney] Gladney, Henry M.; Fox, Edward A.; Ahmed, Zahid; Ashany, Ron; Belkins, Nicholas J.; and Zemankova, Maria; Digital Library, Gross Structure and Requirements: Report from a March '94 Workshop, Digital Libraries '94, Proceedings, Hypermedia Research Library, Department of Computer Science, Texas A&M University, College Station, TX 77843-3112.
[Bush] Bush, Vannevar, As We May Think, Atlantic Monthly, Volume 176, Number 1, 1945 (101-108).
[Hawkins] Hawkins, Brian L., Creating the Library of the Future: Incrementalism Won't Get Us There!, in New Scholarship, New Serials, Proceedings of the North American Serials Interest Group, Gail McMillan and Marilyn L. Norstedt (Editors), Hayworth Press (1994), p. 17.
[Rubinsky] Rubinsky, Yuri, Electronic Texts the Day After Tomorrow, Proceedings of the Second Symposium on Electronic Publishing, Ann L. Okerson (Editor), Association of Research Libraries, Washington, D.C., 1993.
[Levy92] Levy, David and Marshall, Catherine, What Color Was George Washington's White Horse, in Digital Libraries '94.
[Levy94] Levy, David, Fixed or Fluid? Document Stability and New Media, Proceedings of ECHT 94.
[Low] Low, S.H., Maxemchuk, N. F., Brassil, J. T. and O'Gorman, L., Document Marking and Identification Using Both Line and Word Shifting, submitted for publication.
[Arnett] Arnett, Nick. Message on www-talk Mailing List.
[Hart] Hart, Michael. SAVE THE INTERNET!
[LOC] Proceedings of the 1992 Library of Congress Workshop on Electronic Texts.