statanal

Exploratory statistics from Clark transaction logs


What is it?

statanal is a Perl script which takes a transaction log generated by Clark and derives exploratory statistical values for the number of events in a transaction and transaction length.

statanal was developed in conjunction with the University of Michigan Engineering Library, the University of Michigan School of Information and Library Studies, and the Internet Public Library. It may be used free of charge, and you may feel free to modify it for your personal use. However, you may not redistribute it. Please see the licensing agreement for detailed information.

Why statanal?

For the long story, please read my research report. The short answer is that I needed to show that Clark could be used for more than making interesting logs. statanal is the first of many scripts I plan on developing to make use of Clark logs.

How to Use statanal

Setting Up

First you will need to get a copy of the most recent version of statanal and read the licensing agreement. Although statanal can be used on any platform, it is recommended that you run it on a Unix workstation (like a Sun Sparcstation), especially if your log files are large.

For Unix users: place the file in the same directory as your access log files and make it executable (chmod u+x statanal.pl). Next, find out where your Perl interpreter is installed: type which perl at the command prompt, and Unix will print a pathname (something along the lines of /usr/bin/perl). Use your favorite text editor to open statanal.pl and change the first line to #!pathname, where pathname is the path you just found.

For non-Unix users: check with your Perl documentation for how to set up a script to run.

Running the script

A typical statanal session will look something like this (the command, the three filenames, and the final n are entered by the user):
compsun4% statanal.pl
Input Filename: july01.out
Output Filename for events: july01.events
Output Filename for time: july01.time
Reading july01.out 
Read 19614 lines of data
Would you like to Exclude hosts, Only Include certain hosts, or neither (e/i/n)[n]? n
Running the Statistics

S T A T I S T I C A L   S U M M A R Y

875 transactions were processed
517 transactions loaded images

Event Statistics
  Min: 1
   Max: 122
   Median: 7   Q1: 3   Q3: 13   Quartile Skewness: 1.6000000000000000888 

  Mean: 10.441647597254004154 
  Std. Dev.: 11.863040118587521832 
  Pearson's Second Skewness: 0.87034543325740321151 
  Writing ordered data set to july01.events 

Time Statistics (in seconds)
  Min: 0
   Max: 20548
   Median: 134.5   Q1: 33.5   Q3: 512   Quartile Skewness: 1.1400208986415882872 


  Mean: 634.80549199084668999 
  Std. Dev.: 1425.2695240449124867 
  Pearson's Second Skewness: 1.0530755416090997745 
  Writing ordered data set to july01.time 


 Analysis Completed
compsun4% 
Explanations:

The first three lines ask you for file names. The Input Filename is the transaction log file generated by Clark; the Output Filename for events is where you want the ordered dataset for number of events in each transaction stored; likewise for Output Filename for time. Warning: The current version of statanal does not check to see if your input file is valid; it will go along merrily parsing the heck out of garbage if you let it. However, statanal will ask if it is okay to overwrite your output files if the output filenames already exist.
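
The overwrite check described above can be sketched in a few lines of Perl. This is only an illustration, not the code statanal actually uses: the helper name ok_to_write, the wording of the prompt, and the rule that anything other than y or yes means "no" are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if it is safe to write to $file, else 0.
# If the file already exists, ask the user and read a y/n answer from $in
# (STDIN by default). Anything other than y/yes counts as "no".
sub ok_to_write {
    my ($file, $in) = @_;
    $in ||= \*STDIN;
    return 1 unless -e $file;    # nothing to overwrite
    print "$file already exists. Overwrite (y/n)[n]? ";
    my $answer = <$in>;
    return 0 unless defined $answer;    # end of input: play it safe
    return $answer =~ /^\s*y(es)?\s*$/i ? 1 : 0;
}

# Example: perl overwrite_check.pl july01.events
if (@ARGV) {
    print ok_to_write($ARGV[0])
        ? "OK to write $ARGV[0]\n"
        : "Leaving $ARGV[0] alone\n";
}
```

Passing the input filehandle as a second argument makes the helper easy to try out without typing at a prompt.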

statanal will then take a few seconds to read in your transaction log file.

statanal then asks if you would like to exclude hosts or only include certain hosts. To exclude certain hosts from analysis, enter e, then enter the hostname and/or IP address of each machine you wish to exclude; enter a null line to terminate entry. statanal will thereafter ignore any transactions from the host(s) you specified. (You would want to exclude hosts if, for example, you have certain machines for staff or development use that you do not want counted in your transaction log analysis.) To include only certain hosts in the analysis, enter i, then enter the hostname(s) and/or IP address(es) in the same way. statanal will thereafter ignore transactions from any hosts that you did not specify. To include all transactions in the analysis, enter n or just enter a null line.
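
The exclude/include choice boils down to a simple filter. The Perl sketch below shows the idea under some loud assumptions: the sub name filter_by_host is made up, each transaction is assumed to be a hash reference with a host key (Clark's real log layout may differ), and hostnames are compared case-insensitively.

```perl
use strict;
use warnings;

# Hypothetical sketch: keep or drop transactions by host.
# $mode is 'e' (exclude the listed hosts), 'i' (include only the listed
# hosts), or 'n' / empty (no filtering). Each transaction is assumed to
# be a hash reference with a 'host' key.
sub filter_by_host {
    my ($mode, $hosts, $transactions) = @_;
    return @$transactions if $mode eq 'n' || $mode eq '';

    # Lookup table of the hosts the user typed in (case-insensitive).
    my %listed = map { ( lc($_) => 1 ) } @$hosts;

    return grep {
        my $is_listed = exists $listed{ lc($_->{host}) };
        $mode eq 'e' ? !$is_listed : $is_listed;
    } @$transactions;
}

# Made-up example data, not taken from a real log.
my @transactions = (
    { host => 'staff1.example.edu',  events => 4 },
    { host => '192.0.2.15',          events => 9 },
    { host => 'dialup7.example.net', events => 2 },
);

my @kept = filter_by_host('e', ['staff1.example.edu'], \@transactions);
print scalar(@kept), " transactions kept\n";
```

A hash lookup keeps the check fast even when the list of hosts is long.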

Now statanal will proceed with calculating the statistics. Unlike Clark, statanal runs darn quick. After it has finished, statanal prints a report to your standard output and saves an ordered dataset for number of events and for transaction length to the files you specified. You can then use these ordered datasets in the statistical software package of your choice.

Explanation of the statistics

The first group of statistics is the quartile statistics. Min, max and median should be self-explanatory. Q1 and Q3 are the first and third quartiles, and Q2 is the median. Quartile skewness = (Q3 - 2*Q2 + Q1)/(Q3 - Q1).
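
Here is a minimal Perl sketch of the quartile statistics, written from the formula above rather than from statanal's source. The sub names are made up, and the quartile rule is an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd. Other packages use other rules and may give slightly different quartiles. Note also that plugging the Event Statistics from the sample session (Q1 = 3, median = 7, Q3 = 13) into this formula gives 0.2, so the 1.6 printed there seems to come from a different calculation; the sketch follows the formula as written.

```perl
use strict;
use warnings;

# Median of an already sorted list of numbers.
sub median_sorted {
    my @s = @_;
    my $n = @s;
    die "empty data set\n" unless $n;
    my $mid = int($n / 2);
    return $n % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
}

# Returns (min, Q1, median, Q3, max, quartile skewness).
# Assumption: Q1 and Q3 are the medians of the lower and upper halves,
# leaving out the middle value when the count is odd.
sub quartile_stats {
    my @s = sort { $a <=> $b } @_;
    my $n = @s;
    die "need at least two values\n" if $n < 2;
    my $half = int($n / 2);
    my $q1 = median_sorted(@s[0 .. $half - 1]);
    my $q2 = median_sorted(@s);
    my $q3 = median_sorted(@s[$n - $half .. $n - 1]);
    # Quartile skewness as defined in the text; undefined when Q3 == Q1.
    my $skew = $q3 == $q1 ? undef : ($q3 - 2 * $q2 + $q1) / ($q3 - $q1);
    return ($s[0], $q1, $q2, $q3, $s[-1], $skew);
}

# Made-up data set with one extreme value.
my ($min, $q1, $q2, $q3, $max, $skew) = quartile_stats(1, 2, 3, 4, 5, 8, 13, 40);
printf "Min: %s  Q1: %s  Median: %s  Q3: %s  Max: %s  Quartile Skewness: %.4f\n",
    $min, $q1, $q2, $q3, $max, $skew;
```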

The second group of statistics is the traditional mean and standard deviation, along with Pearson's second skewness = 3*(Mean - Median)/Std dev.
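
The second group can be sketched the same way. Again this is an illustration with made-up sub names, not statanal's own code. One assumption matters here: the standard deviation below is the sample version (dividing by n - 1), and the script may well divide by n instead. As a sanity check on the formula, the Event Statistics in the sample session give 3*(10.4416 - 7)/11.8630, or about 0.8703, which agrees with the value reported there.

```perl
use strict;
use warnings;

sub mean {
    my @x = @_;
    die "empty data set\n" unless @x;
    my $sum = 0;
    $sum += $_ for @x;
    return $sum / @x;
}

# Sample standard deviation (divides by n - 1). This is an assumption;
# a population standard deviation would divide by n.
sub std_dev {
    my @x = @_;
    die "need at least two values\n" if @x < 2;
    my $m  = mean(@x);
    my $ss = 0;
    $ss += ($_ - $m) ** 2 for @x;
    return sqrt($ss / (@x - 1));
}

sub median {
    my @s = sort { $a <=> $b } @_;
    die "empty data set\n" unless @s;
    my $mid = int(@s / 2);
    return @s % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
}

# Pearson's second skewness = 3 * (mean - median) / std dev.
# Returns undef when the standard deviation is zero.
sub pearson_second_skewness {
    my @x  = @_;
    my $sd = std_dev(@x);
    return undef if $sd == 0;
    return 3 * (mean(@x) - median(@x)) / $sd;
}

# Made-up data set.
my @data = (2, 4, 4, 4, 5, 5, 7, 9);
printf "Mean: %.4f  Std. Dev.: %.4f  Pearson's Second Skewness: %.4f\n",
    mean(@data), std_dev(@data), pearson_second_skewness(@data);
```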

The quartile statistics are much less sensitive to extreme values than the mean and standard deviation.

Known bugs/problems

Hey, it's only a beta release--and it's free so stop whining. But here are some of the problems I know about that I will try to fix soon:
rev. Aug 19, 1995

Copyright 1995 David S. Carter, All rights reserved

superman@umich.edu
http://www.sils.umich.edu/~superman/