Messeret Gebre-Kristos
Exploratory Data Analysis

Cluster Analysis

Problem Statement

Cluster the documents you compared in hw4. You may use dissim to generate a proximity matrix and R to generate a dendrogram of the result. Comment on the dendrogram and tell us where you might cut it and why that would give a meaningful result.

Abstract

I clustered documents based on the terms they contain to see whether they fall into recognizable patterns of similarity. A dissimilarity matrix is created using dissim, and the resulting matrix is plotted as a dendrogram and a heatmap for visual inspection of the patterns.

Data

I am still using the same set of documents as before: those authored by administration staff and individual authors. However, I have randomly selected only a subset (84 documents) to keep the dendrogram readable.

Steps

First I generated a list of document IDs and term IDs (see the file). Then I used a C program (dissim) provided by my professor to create a dissimilarity matrix.

~mcq/mdscaler/dissim -r 84 -c 1404 < docids-termids > matrix.txt

The command above produces the matrix file, matrix.txt.
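
The input format of docids-termids and the exact measure that dissim computes are not shown here, so the following is only a sketch of the general idea under stated assumptions: the input is taken to be (document ID, term ID) pairs, and the dissimilarity is taken to be the Jaccard (binary) distance between the term sets of two documents. The names pairs, incidence, and d are illustrative and do not come from dissim.

```r
# Hypothetical (document ID, term ID) pairs; the real file layout is an assumption.
pairs <- data.frame(
  doc  = c(1, 1, 1, 2, 2, 3, 3, 3),
  term = c(10, 11, 12, 10, 11, 12, 13, 14)
)

# Document-by-term incidence matrix: 1 if the document contains the term, else 0.
tab <- table(pairs$doc, pairs$term)
incidence <- matrix(as.integer(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))

# Jaccard dissimilarity between documents (an assumed stand-in for dissim's measure):
# the share of terms found in only one of two documents, among terms found in either.
d <- dist(incidence, method = "binary")

print(round(as.matrix(d), 2))
```

With 84 documents this approach would give an 84 x 84 symmetric matrix with zeros on the diagonal, which is the general shape that hclust() expects after conversion with as.dist().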

Next I downloaded the matrix file to work with it in R, which is installed on my PC. The following series of commands was run in R.

> setwd("C:/Program Files/R/R-2.6.0/hw6")
> table <- read.table("matrix.txt", header=FALSE)
> matr <- as.matrix(table)
> scalematr<- t(scale(t(matr)))
> hr<-hclust(as.dist(1-cor(t(scalematr),method="pearson")),method="complete")
> library(lattice)
> library(stats)
> par("ps"=8)
> plot(hr)
The commands above produce the dendrogram. Adding the next commands produces the heatmap.
> as.dendrogram(hr)
> hc <- hclust(as.dist(1-cor(t(scalematr),method="spearman")),method="complete")
> heatmap(matr,Rowv=as.dendrogram(hr),Colv=as.dendrogram(hc),col=my.colorFct(),scale="row")
The output of these commands is a dendrogram and a heatmap. Note that my.colorFct() is not part of base R; it has to be defined or sourced before heatmap() is called, or replaced with a built-in palette such as heat.colors(75).
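
The problem statement asks where the dendrogram might be cut. One possible way to turn a dendrogram into explicit groups is cutree(), sketched below on toy data rather than on the real matrix. The toy matrix, the choice of Euclidean distance (instead of the correlation-based distance used above), and k = 2 are all assumptions made for illustration.

```r
set.seed(1)

# Toy data standing in for the real matrix (hypothetical): two well-separated
# groups of five rows each.
toy <- rbind(
  matrix(rnorm(5 * 6, mean = 0), nrow = 5),
  matrix(rnorm(5 * 6, mean = 5), nrow = 5)
)
rownames(toy) <- paste("doc", 1:10, sep = "")

# Complete-linkage clustering on Euclidean distances (an assumed choice).
hc_toy <- hclust(dist(toy), method = "complete")

# Cut the tree into k groups; cutree(hc_toy, h = some_height) cuts by height instead.
groups <- cutree(hc_toy, k = 2)
print(table(groups))

# Draw the dendrogram and outline the two groups.
plot(hc_toy)
rect.hclust(hc_toy, k = 2, border = "red")
```

A cut is usually placed where the merge heights jump sharply, since a large jump means that two fairly distinct groups are being joined. Applying the same calls to hr would give a group number for each of the 84 documents.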

Comments

The plots are not very useful as they stand because it is difficult to recognize the documents by their IDs alone. I still have to figure out how to label the documents and identify their authors (admin or student). Clustering patterns, if any exist, would then be much easier to see.
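
One possible way to label the leaves is to overwrite the labels stored in the hclust object before plotting. The sketch below uses toy data and a hypothetical lookup table named authors; the document IDs and author types in it are invented for illustration and are not taken from the real data set.

```r
set.seed(2)

# Toy matrix standing in for the real data (hypothetical).
toy <- matrix(rnorm(6 * 4), nrow = 6)

# Hypothetical lookup table from document ID to author type.
authors <- data.frame(
  id   = c(101, 102, 103, 104, 105, 106),
  type = c("admin", "student", "admin", "student", "student", "admin"),
  stringsAsFactors = FALSE
)

# Row names become the leaf labels of the dendrogram.
rownames(toy) <- authors$id
hc_toy <- hclust(dist(toy), method = "complete")

# Replace each bare ID with "type-id" so the author type is visible on the plot.
idx <- match(hc_toy$labels, as.character(authors$id))
hc_toy$labels <- paste(authors$type[idx], hc_toy$labels, sep = "-")

plot(hc_toy, cex = 0.8)
```

With labels of this form, a branch made up mostly of admin or mostly of student documents can be spotted directly on the dendrogram.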