1. The Girvan-Newman betweenness clustering algorithm (50pts)
Select a network of up to 200 nodes, preferably one where you suspect there might be interesting structure. For 80% credit, you can use the poliblogrecip.gdf file for the top 20 conservative and 20 liberal blogs, but for full credit, you need to use a file of your own (which can be shared with a few other people). Your network should be undirected. Load the data file into GUESS and run the script betweennessclustering.py. The GUESS tool has one button that removes the edge with the highest betweenness (marked red) at each click, and another button that keeps removing edges until the next community is broken off (or all edges are removed). You need to decide when to stop removing edges. Note that this is an educational tool implementing a slow algorithm, so it cannot cluster large data sets. Please answer the following:
A. Does the algorithm allow you to identify underlying communities in the network?
B. Is the removal of a leaf node a good stopping criterion for the algorithm?
C. Which nodes are not very embedded in their communities? How does the algorithm reveal this fact?
D. Are any nodes 'misclassified' by being placed in a cluster where you think they may not belong according to a node attribute? How can such a misclassification be informative?
E. Turn in one image that shows community structure in this network.
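For intuition about what the edge-removal button does, the procedure can be sketched in plain Python. This is a minimal illustration, not the code in betweennessclustering.py: the function names (edge_betweenness, components, split_once) and the toy graph are hypothetical, the graph is assumed to be unweighted, undirected and free of self-loops, and ties between edges are broken arbitrarily.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style accumulation)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path counts back from the farthest nodes toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every node pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove highest-betweenness edges until one more component breaks off."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    removed = []
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)  # recomputed after every removal
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(tuple(sorted((u, v))))
    return components(adj), removed

# Toy graph: two triangles joined by the single bridge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities, removed_edges = split_once(toy)
```

On the toy graph the bridge carries all nine shortest paths between the two triangles, so it is removed first and the triangles separate. Recomputing betweenness after each removal is what makes the method slow on larger networks.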
2. Resilience (50pts)
For this task, you can use your own (or shared) data for full credit or the gnutella network gnutella2.gdf (analyzed in Pajek in assignment 2) for 80% credit. The GUESS toolbars, downloadable as 'resilience_degree.py' and 'resilience_betweenness.py' from cTools, will work on modestly sized networks (~1000 nodes) that are undirected. If your network is directed, you will need to either make an undirected version of it or modify the scripts. The resilience toolbars let you specify the % of nodes to be removed and whether the removal is random failure (nodes are selected at random) or targeted attack (the highest-degree nodes or the nodes with highest betweenness are removed). They also compute the size of the largest component and display the network after the nodes are removed. You may also do this assignment in Pajek or any other software, but there you are on your own.
Please answer the following about the network (turn in 1 image of the original network, and 1 image of the network at less than 1/2 of its original size according to one of the attack strategies).
A. What percentage of the nodes needs to be removed to shrink the giant component to 1/2 of its size under degree-targeted attack vs. betweenness-targeted attack vs. random failure? Comment on this result with respect to the degree distribution and community structure (or lack thereof) in your network.
B. Now construct a random network with the same number of nodes and edges. You can do this by selecting 'Empty' when starting up GUESS and then typing:
>>> makeSimpleRandom(numberofnodes,numberofedges)
What percentage of nodes must be removed under intentional attack and under random failure to reduce the size of the largest component in this network by 1/2? Compare this with your answer to 2A: how does the resilience of your network (observed in A) compare to that of this equivalent random graph?
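If you work outside GUESS, the comparison in 2A and 2B can be approximated with a short standalone script. The sketch below makes several assumptions: simple_random_graph is a hypothetical stand-in for makeSimpleRandom (its exact behavior is not documented here), the targeted attack ranks nodes once by their initial degree rather than recomputing after each removal, and betweenness-targeted attack is omitted for brevity.

```python
import random

def simple_random_graph(n, m, rng):
    """Hypothetical stand-in: n nodes, m distinct undirected edges chosen uniformly."""
    if m > n * (n - 1) // 2:
        raise ValueError("too many edges for a simple graph")
    adj = {v: set() for v in range(n)}
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        e = (min(u, v), max(u, v))
        if e not in edges:
            edges.add(e)
            adj[u].add(v)
            adj[v].add(u)
    return adj

def largest_component(adj, removed):
    """Size of the largest connected component, ignoring removed nodes."""
    seen, best = set(removed), 0
    for s in adj:
        if s in seen:
            continue
        size, stack = 0, [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

def degree_order(adj):
    """Nodes sorted by initial degree, highest first (assumed attack order)."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)

def random_order(adj, rng):
    """Nodes in uniformly random order (random failure)."""
    nodes = list(adj)
    rng.shuffle(nodes)
    return nodes

def fraction_to_halve(adj, order):
    """Fraction of nodes removed, in the given order, before the largest
    component drops to at most half of its original size."""
    target = largest_component(adj, set()) / 2.0
    removed = set()
    for i, v in enumerate(order, start=1):
        removed.add(v)
        if largest_component(adj, removed) <= target:
            return i / len(adj)
    return 1.0

rng = random.Random(0)                    # fixed seed for repeatability
g = simple_random_graph(200, 400, rng)    # illustrative sizes only
attack = fraction_to_halve(g, degree_order(g))
failure = fraction_to_halve(g, random_order(g, rng))
```

Random failure varies from run to run, so averaging fraction_to_halve over several random orders gives a steadier number to compare against the targeted attack.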