Cluster Analysis

Problem Statement

Cluster documents you compared in hw4. You may use to generate proximity matrix and R to generate a dendrogram of the result. Comment on the dendrogram and tell us where you might cut it and why that would give a meaningful result.

Abstract

I clustered documents based on the terms they contain to see whether they form recognizable patterns of similarity. A dissimilarity matrix is created using dissim, and the resulting matrix is plotted as a dendrogram and a heatmap for visual inspection of the patterns.

Data

I am still using the same set of documents as before: those authored by administration staff and individual authors. However, I have randomly selected only a subset (84 documents) to keep the dendrogram readable.

Steps

First I generated a list of document IDs and term IDs. (See the file.) Then I used a C program (dissim) provided by my professor to create a dissimilarity matrix.


    ~mcq/mdscaler/dissim -r 84 -c 1404 < docids-termids > matrix.txt

The command above produces the following matrix file.
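For readers without access to dissim, the idea of turning document and term IDs into a dissimilarity matrix can be sketched in R. This is only an illustration under stated assumptions: the (document id, term id) pair layout is hypothetical, the toy data are invented, and the Jaccard measure used here is not necessarily the measure that dissim computes.

```r
# Hypothetical (document id, term id) pairs; the real docids-termids
# file format is an assumption here.
pairs <- data.frame(
  doc  = c(1, 1, 1, 2, 2, 3, 3, 3),
  term = c(1, 2, 3, 2, 3, 4, 5, 1)
)

# Binary document-by-term incidence matrix (1 = term occurs in document).
incidence <- (unclass(table(pairs$doc, pairs$term)) > 0) * 1

# Jaccard dissimilarity between documents: dist(method = "binary") is the
# share of terms found in only one of two documents among the terms
# found in at least one of them.
d <- as.matrix(dist(incidence, method = "binary"))
print(round(d, 2))
```

Documents 1 and 2 share two of their three distinct terms, so they end up close together, while documents 2 and 3 share no terms and get the maximum dissimilarity of 1.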

Next I downloaded the matrix file to work with R installed on my PC. The following series of commands was run in R.


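    > setwd("C:/Program Files/R/R-2.6.0/hw6")
    > # Read the matrix; the file has no header row.
    > table <- read.table("matrix.txt", header=FALSE)
    > matr <- as.matrix(table)
    > # Standardize each row (center and scale).
    > scalematr <- t(scale(t(matr)))
    > # Cluster rows by 1 - Pearson correlation, complete linkage.
    > hr <- hclust(as.dist(1-cor(t(scalematr),method="pearson")),method="complete")
    > library(lattice)
    > library(stats)
    > # Use a small font size so the leaf labels stay legible.
    > par("ps"=8)
    > plot(hr)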

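The commands above produce the dendrogram. Adding the following commands produces the heatmap. Note that my.colorFct() is not part of base R; it is a user-defined color palette function that must be loaded before heatmap() is called.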


    > as.dendrogram(hr) 

    > hc <- hclust(as.dist(1-cor(t(scalematr),method="spearman")),method="complete") 

    > heatmap(matr,Rowv=as.dendrogram(hr),Colv=as.dendrogram(hc),col=my.colorFct(),scale="row")

The output of these commands is a dendrogram and a heatmap.
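Since the assignment asks where the dendrogram might be cut, here is a minimal sketch of how a cut can be made in R with cutree(). It runs on invented toy data rather than the real matrix, so the names toy, hr_toy, groups_k and groups_h, the number of clusters, and the cut height are all assumptions chosen for illustration.

```r
set.seed(42)  # toy data only; it stands in for the real matrix

# Two artificial groups of rows with opposite profiles.
base <- seq(-1, 1, length.out = 10)
toy <- rbind(
  t(replicate(4,  base + rnorm(10, sd = 0.1))),
  t(replicate(4, -base + rnorm(10, sd = 0.1)))
)
rownames(toy) <- paste0("doc", 1:8)

# Same distance (1 - Pearson correlation) and linkage as used for hr.
hr_toy <- hclust(as.dist(1 - cor(t(toy), method = "pearson")),
                 method = "complete")

# Cut into a fixed number of clusters ...
groups_k <- cutree(hr_toy, k = 2)
# ... or cut at a height on the dendrogram's vertical axis.
groups_h <- cutree(hr_toy, h = 1)

print(groups_k)
plot(hr_toy)
rect.hclust(hr_toy, k = 2)  # outline the two clusters on the plot
```

With this distance, a height of 1 corresponds to a correlation of 0, so cutting there separates rows that are positively correlated from those that are not. Whether that threshold is meaningful for the real documents would need to be checked against the actual dendrogram.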

Comments

The plots are not very useful as they stand because it is difficult to recognize the documents by their IDs alone. I still have to figure out how to label the documents by author type (admin or student). It would then be much easier to see clustering patterns, if any exist.
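One possible way to do the labeling is the labels argument of plot() for hclust objects. The sketch below is hedged: the document IDs, the author vector, and the toy matrix are all hypothetical stand-ins, and in practice the author types would have to be looked up for each of the 84 documents in the same order as the matrix rows.

```r
# Hypothetical document IDs and author types, for illustration only.
doc_ids <- c("d01", "d02", "d03", "d04", "d05", "d06")
author  <- c("admin", "admin", "admin", "student", "student", "student")

set.seed(1)  # toy matrix standing in for the real one
toy <- matrix(rnorm(6 * 12), nrow = 6)

hc_toy <- hclust(as.dist(1 - cor(t(toy), method = "pearson")),
                 method = "complete")

# Leaf labels that combine author type and document ID.
leaf_labels <- paste(author, doc_ids, sep = ":")
plot(hc_toy, labels = leaf_labels)

# Cross-tabulate cluster membership against author type.
membership <- table(cluster = cutree(hc_toy, k = 2), author = author)
print(membership)
```

The cross-tabulation gives a quick numeric check of whether the clusters line up with author type, which complements reading the labeled dendrogram by eye.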