Clark

A system for WWW server transaction log analysis

Don't Use Clark!

Use Clark 2.0 instead!

See the main page for details.

What is it?

Clark is a Perl script which takes an access log (aka transaction log, event log, etc.) from an HTTP server in common log format and transforms it into a transaction log, with events broken down by user.
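To make the input format concrete, here is a minimal sketch of parsing one common log format line. It is written in Python purely for illustration and is not Clark's own Perl code; the function name parse_clf_line, the field names, and the -0400 timezone in the usage note are assumptions.

```python
import re
from datetime import datetime

# Common log format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    # Timestamps look like 01/Jul/1995:11:21:02 -0400
    event["when"] = datetime.strptime(event["when"], "%d/%b/%Y:%H:%M:%S %z")
    return event
```

For example, parse_clf_line('gk-east.usps.gov - - [01/Jul/1995:11:21:02 -0400] "GET / HTTP/1.0" 200 655') yields a dict whose "host" is gk-east.usps.gov and whose "request" is GET / HTTP/1.0. A line that does not match yields None, which is one way to catch invalid input.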

Clark was developed in conjunction with the University of Michigan Engineering Library, the University of Michigan School of Information and Library Studies, and the Internet Public Library. It may be used free of charge, and you may feel free to modify it for your personal use. However, you may not redistribute it. Please see the licensing agreement for detailed information.

Why Clark?

For the long story, please read my research report. The short answer is that transaction log analysis (TLA) has been used effectively in the library and information science fields for three decades, and I thought that it would be a good thing to be able to subject information systems on Web servers to TLA. Also, it would be nice to have a better indication of how a website is being used than just saying "our server had 10,000 hits last month."

How to Use Clark

Setting Up

First you will need to get a copy of the most recent version of Clark and read the licensing agreement. Although Clark can be used on any platform, it is recommended that you run it on a Unix workstation (like a Sun Sparcstation), especially if your log files are large.

For Unix users: place the file in the same directory as your access log files. Set the file to executable (chmod u+x clark.pl). Next, you need to find where your Perl interpreter is. Type which perl at the command prompt; Unix will spit back a pathname at you (something along the lines of /usr/bin/perl). Use your favorite text editor to go into the clark.pl file and change the first line to #!pathname where pathname is what you determined above.

For non-Unix users: check with your Perl documentation for how to set up a script to run.

Finally, make sure that you have plenty of space available for the output file; the output file will take up approximately half the space of your input file.

Running the script

A typical Clark session will look something like this (the text after each prompt is entered by the user):

compsun4% clark.pl
Input Filename: july01.1995
Output Filename: july01.out
Reading july01.1995 
Read 9126 lines of data
Gap size, in minutes [30]: 30
Would you like to Exclude hosts, Only Include certain hosts, or neither (e/i/n)[ n]? n
Processing data -- please be patient
Writing transaction log to july01.out 
9126 requests were processed
9126 met the given criteria
874 transactions were logged, using a gap size of 30 minutes
compsun4%

Explanations:

The first two lines ask you for file names. The Input Filename is the access log file generated by your HTTP server; the Output Filename is where you want the transaction log generated by Clark to be put. Warning: The current version of Clark does not check to see if your input file is valid; it will go along merrily parsing the heck out of garbage if you let it. However, Clark will ask if it is okay to overwrite your output file if the output filename already exists.

Clark will then take a few seconds to read in your access log file.

Next, Clark asks you for the Gap Size. Clark uses the Gap Size to determine when to start a new transaction. For example, if the Gap Size is set to 30 minutes and more than 30 minutes elapse after a request from a particular host, Clark assumes that the next request from that same host is the beginning of a new transaction. Preliminary tests have shown that changing the Gap Size from 30 to 15 minutes can change the number of transactions registered by as much as 10%. The default Gap Size is 30 minutes.
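The Gap Size rule can be sketched roughly as follows. This is a Python illustration under stated assumptions, not Clark's actual Perl code: the function name group_into_transactions and the dict keys are hypothetical, the gap is measured from the host's most recent request, and a gap of exactly gap_minutes does not start a new transaction.

```python
from datetime import datetime, timedelta

def group_into_transactions(events, gap_minutes=30):
    """Split (host, timestamp, request) tuples into per-host transactions.

    A new transaction starts when more than gap_minutes have passed
    since the previous request from the same host (an assumed reading
    of the Gap Size rule).
    """
    gap = timedelta(minutes=gap_minutes)
    open_transactions = {}   # host -> transaction currently being built
    finished = []
    for host, when, request in sorted(events, key=lambda e: e[1]):
        current = open_transactions.get(host)
        if current is not None and when - current["end"] > gap:
            # The host was silent for longer than the gap: close it out.
            finished.append(current)
            current = None
        if current is None:
            current = {"host": host, "start": when, "end": when, "requests": []}
            open_transactions[host] = current
        current["end"] = when
        current["requests"].append(request)
    finished.extend(open_transactions.values())
    return sorted(finished, key=lambda t: t["start"])
```

With this sketch, requests from one host at 11:00, 11:10, and 11:45 form two transactions at a 30-minute gap (the 35-minute silence splits them) but only one at a 60-minute gap, which shows why the transaction count is sensitive to the Gap Size.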

Clark then asks if you would like to exclude hosts or only include certain hosts. To exclude certain hosts from analysis, enter e then enter the hostname and/or IP address of the machine you wish to exclude from analysis; enter a null line to terminate entry. Clark will thereafter ignore any events from the host(s) you specified. (You would want to exclude hosts if, for example, you have certain machines that are for staff or development use that you do not want to be used in your TLA.) To include only certain hosts in the transaction log, enter i then enter the hostname(s) and/or IP address(es) similarly. Clark will thereafter ignore any events from any hosts that you did not specify. To include all events in the transaction log, enter n or just enter a null line.
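The three filtering choices amount to a simple membership test, sketched here in Python for illustration only (not Clark's Perl code). The function name keep_event is hypothetical, and comparing host names case-insensitively is an assumption.

```python
def keep_event(host, mode="n", hosts=()):
    """Decide whether an event from host stays in the analysis.

    mode "e" excludes the listed hosts, "i" includes only the listed
    hosts, and "n" (or anything else, such as an empty answer) keeps
    every event.
    """
    # Assumption: host names are compared without regard to case.
    listed = host.lower() in {h.lower() for h in hosts}
    if mode == "e":
        return not listed
    if mode == "i":
        return listed
    return True
```

For instance, keep_event("staff1.example.edu", "e", ["staff1.example.edu"]) returns False, so events from that (made-up) staff machine would be dropped before transactions are built.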

Now Clark will set about building the transaction log. This can take a long time, and processing time grows faster than linearly with the size of your access log. For example, on a Sun Sparc20, an access log of about 10,000 events takes about 40 min. of processor time, while an access log of about 1,000 events takes only about 2 min. of processor time; actual running time will depend upon the load on your processor. Times on other computers will of course vary; it is recommended that you do not use a personal computer to process access logs of more than 1,000 entries.
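The two timings quoted above give a rough feel for how run time scales. The Python sketch below assumes a power law (time proportional to events raised to some exponent), which two data points cannot confirm, so treat any extrapolation as a ballpark figure. The function names are hypothetical.

```python
import math

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Exponent k in time ~ events**k implied by two timing measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

def estimate_minutes(n_events, exponent, n_ref=1_000, t_ref=2.0):
    """Extrapolate from the reference point (1,000 events, 2 minutes)."""
    return t_ref * (n_events / n_ref) ** exponent

# 1,000 events in about 2 min. and 10,000 events in about 40 min.
k = scaling_exponent(1_000, 2.0, 10_000, 40.0)
print(f"exponent: {k:.2f}")  # prints "exponent: 1.30"
print(f"{estimate_minutes(5_000, k):.0f} minutes for 5,000 events")
```

An exponent of about 1.3 means that ten times as many events costs about twenty times as much processor time, so splitting a large access log into smaller pieces may be worth considering.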

When Clark finishes, it writes the transaction log to the specified file, then gives some brief summary statistics: how many events were registered, how many events met your criteria, and how many transactions were logged.

Anatomy of a Clark Transaction Log

Here's a typical entry from a Clark Transaction Log:

*303
gk-east.usps.gov
- 
- 
12
01/Jul/1995
01/Jul/1995
11:21:02
11:28:06
424
1
"GET / HTTP/1.0" 200 655
"GET /images/rad.logo.gif HTTP/1.0" 200 17592
"GET /images/newmarbledirectory.gif HTTP/1.0" 200 39245
"GET /cgi-bin/sils_imagemap/images/newmarbledirectory.map?221,89 HTTP/1.0" 302 0
"GET /ref/ HTTP/1.0" 200 1304
"GET /images/ipl.logo.small.gif HTTP/1.0" 200 963
"GET /images/refpict.gif HTTP/1.0" 200 53248
"GET /cgi-bin/sils_imagemap/images/refpict1.map?193,22 HTTP/1.0" 302 0
"GET /ref/RR/GEN/ HTTP/1.0" 200 1503
"GET /ref/RR/GEN/Dict-rr.html HTTP/1.0" 200 9667
"GET /ref/RR/GEN/Enc-rr.html HTTP/1.0" 200 3793
"GET /ref/RR/GEN/Atlas-rr.html HTTP/1.0" 200 4070

Explanations:

Line 1: Sequential transaction number, preceded by an asterisk (*)
Line 2: Host name or IP address
Line 3: identd information (- if none)
Line 4: username, if authenticated (- if not)
Line 5: Number of events in transaction
Line 6: Start Date
Line 7: End Date
Line 8: Start Time
Line 9: End Time
Line 10: Transaction length, in seconds
Line 11: Image-loading flag: 1 if .gif files were requested, 0 if not
Lines 12 - end: Events from that host in the transaction, in chronological order.

A blank line separates transactions.
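If you want to write your own analysis program, the layout above can be read back with something like the following Python sketch. It is only an illustration: the function names and dictionary keys are hypothetical, and it assumes every entry has the eleven header lines described above followed by its events.

```python
import re

def parse_transaction(block):
    """Turn one blank-line-delimited Clark entry into a dict."""
    lines = [line.strip() for line in block.strip().splitlines()]
    return {
        "number": int(lines[0].lstrip("*")),    # Line 1, asterisk removed
        "host": lines[1],
        "ident": lines[2],
        "user": lines[3],
        "event_count": int(lines[4]),
        "start_date": lines[5],
        "end_date": lines[6],
        "start_time": lines[7],
        "end_time": lines[8],
        "seconds": int(lines[9]),
        "images_requested": lines[10] == "1",
        "events": lines[11:],                   # Lines 12 to end
    }

def parse_transaction_log(text):
    """Parse a whole transaction log; entries are separated by blank lines."""
    blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
    return [parse_transaction(b) for b in blocks]
```

Applied to the sample entry above, this would give a dict with number 303, host gk-east.usps.gov, seconds 424, and a list of 12 events.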

Now What?

Now you have a real honest to goodness transaction log for your webserver. You'll probably spend some time just looking at it, because it can be rather interesting, but eventually you'll want to do something else with it. You can use one of my homecooked analysis programs (such as statanal, others coming soon) or write something of your own.

So go and do good research!

Known bugs/problems

Hey, it's only a beta release--and it's free so stop whining. But here are some of the problems I know about that I will try to fix soon:

Doesn't monitor image loading properly; only checks for .gif and not .GIF
Should refer to events rather than requests
Hangs up and/or crashes mercilessly when processing large files in MacPerl
Doesn't check for valid input files; assumes that the user knows what s/he is doing
Slow as all get out; there's got to be a better way to speed things up, but every trick I've tried has ended up increasing processing time rather than decreasing it.
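As an illustration of the first problem in the list, a case-insensitive image check could look like the Python sketch below. This is not a patch to Clark (which is written in Perl); the function name requests_image and the default extension list are assumptions.

```python
def requests_image(request_line, extensions=(".gif",)):
    """True if the requested path ends in an image extension, ignoring case."""
    parts = request_line.split()
    if len(parts) < 2:
        return False  # not a "METHOD path PROTOCOL" request line
    # Drop any query string, such as the coordinates sent to an imagemap.
    path = parts[1].split("?", 1)[0]
    return path.lower().endswith(tuple(extensions))
```

With this check, both GET /images/refpict.gif HTTP/1.0 and GET /images/REFPICT.GIF HTTP/1.0 count as image requests, while the imagemap request for refpict1.map?193,22 does not.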

rev. April 3, 1996

superman@umich.edu
http://www.sils.umich.edu/~superman/