Web Server Transaction Logs

Project Proposal
David S. Carter
University Library Associate Project
June 5, 1995

Statement of the Problem

Many libraries are creating World-Wide Web documents as information 
services for their patrons. In order to gauge the effectiveness of these 
documents, librarians should perform research on their usage, as has been 
done in the past with other types of information products (e.g., OPACs and 
CD-ROM databases). However, the tools available for analyzing data from 
HTTP (Web) servers are at best crude and at worst useless. In the past, 
librarians have found it helpful to conduct transaction log analyses to 
examine user behavior with information systems. I will write a program 
which creates transaction logs from HTTP server access logs, then 
demonstrate how these transaction logs can be used to analyze usage.



Literature Review

Transaction Logs

Transaction logs have been used by librarians for over a quarter of a century 
to unobtrusively monitor user behavior with information systems (Peters, 
Kaske & Kurth, 1993). A transaction log is the output product of transaction 
monitoring. The transaction monitoring of an information system is defined 
as "the automatic logging of the type, content, or time of transactions made 
by a person from a terminal with that system" (Peters, Kurth, Flaherty, 
Sandore, & Kaske, 1993, p. 38).

To date, most transaction log analysis (TLA) has been done with OPACs and 
CD-ROM databases (Peters, 1993). However, Peters (1993) points out that 
"aggregate usage patterns of new types of IR systems, such as Gopher, are 
useful and enlightening" (p. 46). Peters goes on to say that "studies of 
reference service providers' use of IR systems still need to be undertaken" (p. 
57) and that TLA studies need to move into examining the use of IR systems 
over the Internet.

Why use TLA? Kaske (1993) writes that "The central goal of [TLA] is to 
acquire new knowledge, which will help the library managers, systems 
designers/developers, and researchers better understand how online 
information systems are used by library patrons and staff" (p. 79). Crawford 
(1987) identifies the two main purposes of TLA to be performing statistical 
analysis of system performance and use, and undertaking analysis of 
searching behavior and problems. Sandore (1993) identifies many ways in 
which the results of TLA can be applied to improve information systems. 
These include anticipating the evolution of system use and demands, 
determining user preference for experimental changes, monitoring the use of 
help systems, determining instructional needs, and monitoring user 
searching patterns. Wallace (1993) demonstrated how TLA can identify 
bibliographic instruction needs and point out weaknesses in information 
system design. Young (1992) illustrated the use of TLA as a collection 
management tool. 

The unobtrusive nature of TLA, while in many respects a strength, can also 
be a weakness. Kurth (1993) states: "Transaction log data effectively describe 
what searches patrons enter and when they enter them, but they don't 
reflect, except through inference, who enters the searches, why they enter 
them, and how satisfied they are with their results" (p. 98). Kurth goes on 
to explain that errors in TLA can arise through limitations of the 
online system, the inability to isolate and characterize individual users, and 
decisions and biases of the researcher analyzing the logs. To account for some 
of the shortcomings in TLA, Cochrane & Markey (1983) suggest combining 
TLA with another type of analysis (either questionnaire or protocol) to 
provide a more complete picture which can draw on the strengths of both 
types of studies.



The World-Wide Web

The World-Wide Web (WWW, or simply the Web) is a network hypertext 
protocol which employs hypertext markup language (HTML) to link 
documents to each other (Nickerson, 1992). It was developed in 1989 by 
Tim Berners-Lee, a researcher at the Swiss research facility CERN, but use 
of the WWW took off in 1993 with the introduction of the National Center for 
Supercomputing Applications' (NCSA) Mosaic program, a graphical WWW 
client (or browser) which was available for UNIX, Macintosh, and Windows 
systems and, most importantly, was free (Andreessen & Bina, 1994). 

Polly & Cisler (1994) point out two weakness of the use of the Web as an 
information system: slowness and "chaotic disorganization [sic]" (p. 34). While 
the issue of speed will have to be taken up by computer scientists and 
engineers, the disorganization of the Web is a prime target for librarians to 
tackle. Powell (1994) was one of many to identify the uses to which libraries 
could put the WWW in the creation of library information systems. To date, 
hundreds of libraries around the world have created and mounted various 
documents on the Web, from simple informational 'handouts' to Internet 
resource subject guides to a library (the Internet Public Library) which exists 
solely on the Web (Goldberg, 1995). It is hoped that applying TLA to WWW 
systems will go a step or two towards evaluating and improving library 
WWW information systems.



Archimedes

Before the arrival of the WWW, the Engineering Library at 
the University of Michigan developed a HyperCard stack called Archimedes 
which went into public release in March 1991 (Ottaviani, 1995). Archimedes 
used HyperCard, a Macintosh hypertext authoring program, to present users 
with information about the library and the collection. Importantly, the 
versions of Archimedes which resided on dedicated terminals on the library 
floor kept track of patron use and generated a transaction log and 
summary statistics. In 1994 the staff of the Engineering Library developed a 
set of WWW documents which, among other functions, duplicated the content 
of Archimedes in the new Web environment; in 1995 the public Archimedes 
stations were replaced by a dedicated Netscape station on the library floor 
which defaulted to the library's new WWW home page and information 
system. The desire to gather in the new WWW system the same transaction 
information and statistics that Archimedes had provided was one of the main 
impetuses for this project.



Goals

This project has three goals:

1. Devise a program to extract transaction logs from NCSA HTTP server 
access logs. The transaction log will contain the following information for 
each transaction:

   * accessing host (and any available user information)
   * start and end time of transaction
   * test for image loading
   * number of requests and length of transaction
   * document path and results of requests

The program will allow the optional selection or exclusion of certain hosts 
from the transaction log (e.g., to exclude staff workstations). The program 
will use host information combined with a time gap of inactivity (user-
definable, with a default setting) to determine the start and end points of a 
transaction.
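
As a rough illustration of this approach, the following Perl sketch reads 
NCSA common log format entries, skips excluded hosts, and closes a host's 
current transaction whenever the gap since that host's last request exceeds 
the threshold. The excluded host name, the 30-minute default gap, and the 
output layout are illustrative assumptions only; the finished program's 
options and output format may well differ.

    #!/usr/bin/perl
    # Sketch only: group NCSA common-log-format requests into transactions
    # by host, using a gap of inactivity to mark transaction boundaries.
    use strict;
    use warnings;
    use Time::Local;

    my $gap     = 30 * 60;    # assumed default: 30 minutes of inactivity
    my %exclude = map { $_ => 1 } qw(staff-ws.engin.umich.edu);  # hypothetical host
    my %month   = (Jan=>0, Feb=>1, Mar=>2, Apr=>3,  May=>4,  Jun=>5,
                   Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11);
    my %open;                 # one in-progress transaction per host

    while (my $line = <>) {
        # Common log format: host ident authuser [date] "request" status bytes
        next unless $line =~
            m{^(\S+) \S+ \S+ \[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)[^\]]*\] "([^"]*)" (\d{3}) \S+};
        my ($host, $day, $mon, $yr, $hh, $mm, $ss, $req, $status) =
            ($1, $2, $3, $4, $5, $6, $7, $8, $9);
        next if $exclude{$host};

        # Timezone offset in the log is ignored here for simplicity.
        my $time = timegm($ss, $mm, $hh, $day, $month{$mon}, $yr);

        # A long enough gap ends the previous transaction for this host.
        if (!$open{$host} || $time - $open{$host}{last} > $gap) {
            finish($host) if $open{$host};
            $open{$host} = { start => $time, requests => [], images => 0 };
        }
        $open{$host}{last} = $time;
        push @{ $open{$host}{requests} }, "$req ($status)";
        $open{$host}{images}++ if $req =~ /\.(gif|jpe?g|xbm)\b/i;  # image-loading test
    }
    finish($_) for keys %open;    # flush transactions still open at end of log

    sub finish {
        my ($host) = @_;
        my $t = delete $open{$host};
        printf "%s\t%s\t%s\t%ds\t%d requests\t%s\n",
            $host, scalar gmtime($t->{start}), scalar gmtime($t->{last}),
            $t->{last} - $t->{start}, scalar @{ $t->{requests} },
            $t->{images} ? "images loaded" : "no images";
        print "\t$_\n" for @{ $t->{requests} };
    }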



2. Devise one or more programs which use the transaction log to analyze the 
information contained therein. Ideas include:

   * exploratory statistics (raw numbers, averages, distributions)
   * entry and exit point analysis
   * compare public 'on-floor' usage with general usage
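
As one example, the entry and exit point analysis listed above could be a 
short Perl script that counts which document began and which document ended 
each transaction. The sketch below assumes the illustrative transaction log 
layout from the previous example (a summary line per transaction followed by 
tab-indented request lines); the real analysis programs will depend on the 
final transaction log format.

    #!/usr/bin/perl
    # Sketch only: tally entry (first) and exit (last) documents per transaction.
    use strict;
    use warnings;

    my (%entry, %exit, $first, $last);

    while (my $line = <>) {
        if ($line =~ /^\t(?:GET|POST|HEAD)\s+(\S+)/) {
            # a request line: remember the first and most recent document
            $first = $1 unless defined $first;
            $last  = $1;
        } elsif ($line !~ /^\t/) {
            # a new summary line begins the next transaction
            record();
        }
    }
    record();    # the last transaction in the file

    sub record {
        $entry{$first}++ if defined $first;
        $exit{$last}++   if defined $last;
        undef $first;
        undef $last;
    }

    print "Entry points:\n";
    printf "  %5d  %s\n", $entry{$_}, $_
        for sort { $entry{$b} <=> $entry{$a} } keys %entry;
    print "\nExit points:\n";
    printf "  %5d  %s\n", $exit{$_}, $_
        for sort { $exit{$b} <=> $exit{$a} } keys %exit;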


3. Develop documentation so that others can use the programs, write analysis 
programs of their own, and modify the transaction log generating program as 
their own research needs require.



Methodology

All programming will be done in Perl, a widely used, relatively easy, cross-
platform, free language. The program will be designed to use the access logs 
generated by NCSA's HTTP server, one of the most widely used 
Web servers. I will test the program on access logs from the Engineering 
Library's Web documents and the Internet Public Library's server. The 
program will be run on both a Macintosh using MacPerl and a Sun 
SPARCstation using standard UNIX Perl.
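
For reference, each line of an NCSA server access log records a single 
request in the common log format: the requesting host, two identity fields, 
the date and time of the request, the request line itself, the HTTP status 
code, and the number of bytes returned. The host, path, and date in the line 
below are made up for illustration:

    sparky.engin.umich.edu - - [19/Aug/1995:14:02:37 -0400] "GET /libraries/engin/index.html HTTP/1.0" 200 4821

The extraction program described under Goals will draw chiefly on the host, 
timestamp, request path, and status code fields of each such line.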




Limitations

Besides the limitations inherent in transaction logs generally, the 
transaction logs I develop will be hindered by two additional limitations 
that arise from the nature of WWW systems:

1. Caching: Most browsers cache text and images, so when a user returns 
to a previously viewed page, the server is often not accessed; thus, the 
server access log contains no record of the return to the cached 
document and this action cannot be registered in the transaction log. 
This means that the transaction logs will, by necessity, be incomplete.

2. Links to other servers: It is in the nature of WWW documents that links 
are often made to other documents which do not reside on the same 
server. When a user follows such a link, no indication is made to the 
server, and thus the user's behavior outside of the particular document 
space cannot be monitored or studied.

It remains to be seen how detrimental an impact either or both of these 
factors will have on Web server transaction log analysis. It is hoped that 
this project will help to illuminate the possible effects of these 
limitations.



Cited References


Andreessen, M. & Bina, E. (1994). NCSA Mosaic: a global hypermedia 
system. Internet Research, 4(1), 7-17.

Cochrane, P. A. & Markey, K. (1983). Catalog use studiesÑsince the 
introduction of online interactive catalogs: impact on design for subject 
access. Library & Information Science Research, 5(4), 337-363.

Crawford, W. (1987). Patron access: issues for online catalogs. Boston: G. K. 
Hall.

Goldberg, B. (1995). Virtual patrons flock into the Internet Public Library. 
American Libraries, 26, 387-388.

Kaske, N. K. (1993). Research methodologies and transaction log analysis: 
issues, questions, and a proposed model. Library Hi Tech, 11(2), 79-85.

Kurth, M. (1993). The Limits and limitations of transaction log analysis. 
Library Hi Tech, 11(2), 98-104.

Nickerson, G. (1992). World Wide Web: hypertext from CERN. Computers in 
Libraries, 12(11), 75-77.

Ottaviani, J. S. (1995). Archimedes: analysis of a HyperCard reference tool. 
College & Research Libraries, 56(2), 171-182.

Peters, T. A. (1993). The History and development of transaction log analysis. 
Library Hi Tech, 11(2), 41-66.

Peters, T. A., Kurth, M., Flaherty, P., Sandore, B., & Kaske, N. K. (1993). An 
Introduction to the special section on transaction log analysis. Library Hi 
Tech, 11(2), 38-40.

Peters, T. A., Kaske, N. K., & Kurth, M. (1993). Transaction log analysis. 
Library Hi Tech Bibliography, 8, 151-183.

Polly, J. A. & Cisler, S. (1994). What's wrong with Mosaic? Library Journal, 
119(7), 32-34.

Powell, J. (1994). Adventures with the World Wide Web: creating a hypertext 
library information system. Database, 17(2), 59-66.

Sandore, B. (1993). Applying the results of transaction log analysis. Library 
Hi Tech, 11(2), 87-97.

Wallace, P. M. (1993). How do patrons search the online catalog when no 
one's looking? Transaction log analysis and implications for bibliographic 
instruction and system design. RQ, 33(2), 239-252.

Young, I. R. (1992). The Use of a general periodicals bibliographic database 
transaction log as a serials collection management tool. Serials Review, 
18(4), 49-60.

rev. Aug 19, 1995

Copyright 1995 David S. Carter, All rights reserved

superman@umich.edu
http://www.sils.umich.edu/~superman/