Archive for September, 2011

Update 27 Sep 2011

September 27th, 2011

In our last meeting I took down these actions:

  • Resurrect the NGMAST paper and use Enron + MIT-NOV – argue that MIT is not big enough to base assumptions on

I have started another paper in the Clique SVN here: https://erdos.ucd.ie/svn/clique/papers/BubbleH – at the moment it is just the same as the NGMAST paper.

  • Write thesis outline and get started on background chapters – message dissemination, and the social networks side (from Milgram onwards, etc.)

I created a thesis document in the repository too: https://erdos.ucd.ie/svn/clique/papers/MattStabelerThesis – I have started to compile some notes into background chapters, based on my transfer report. I have also started a rough outline of chapters. The latest version in PDF is here.

  • Speak to Conrad about finding a non-overlapping Hierarchical CFA

Since the feedback at the CLIQUE mini-workshop, I have asked Fergal, Conrad and Aaron again about community finding algorithms. They mentioned the Blondel one and the InfoMap one again.

  • Get a decent sub-set of STUDIVZ dataset

To get a sub-set of nodes in the STUDIVZ dataset, I started by picking N nodes at random and including their immediate neighbours. I repeat this step L times, so the sample includes nodes up to L hops away from the seeds. Using 10 random seed nodes with a depth of 2 yields a network of around 3,500 nodes (12% of all nodes). When reduced to 5 seed nodes, we get ~1,000 nodes (~4%). Going the other way, 100 seed nodes with a depth of 1 gives 14,571 nodes, covering ~50% of the network. These figures change depending on which nodes are selected at random initially.
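The sampling described above amounts to a breadth-first (snowball) expansion from random seeds. A minimal sketch, assuming the network is held as an adjacency-list dict (function and variable names here are illustrative, not from the actual experiment code):

```python
import random

def snowball_sample(adjacency, n_seeds, depth, rng=random):
    """Pick n_seeds nodes at random, then repeatedly add the
    immediate neighbours of the current frontier, depth times."""
    seeds = rng.sample(list(adjacency), n_seeds)
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        # neighbours of the current frontier that are not yet selected
        frontier = {nbr for node in frontier
                    for nbr in adjacency.get(node, [])} - selected
        selected |= frontier
    return selected
```

Because the seeds are drawn at random, repeated calls give samples of different sizes, which matches the variation in node counts reported above.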

Currently, I’m testing the setup with 5 seed nodes and 2 levels of network, in the hope that there will be some overlap.

Conrad suggests that we choose them non-randomly – by first reducing our set of nodes to those with high activity (either number of posts, or total length of posts), and then using the network L hops from the remaining nodes.
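Conrad's variant would only change the seed-selection step: rank nodes by activity and keep the top few as seeds for the same L-hop expansion. A hypothetical sketch, assuming posts are available per node (names and data layout are assumptions):

```python
def active_seeds(posts_by_node, n_seeds, by_total_length=False):
    """Keep the n_seeds most active nodes as seeds, scoring either
    by number of posts or by total length of posts."""
    def score(node):
        posts = posts_by_node[node]
        return sum(len(p) for p in posts) if by_total_length else len(posts)
    return sorted(posts_by_node, key=score, reverse=True)[:n_seeds]
```

This should make the sample size more repeatable than the purely random selection, since the seed set no longer changes between runs.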

Enron Dataset Analysis

September 2nd, 2011

Comparing BubbleH and BubbleRAP for the Enron dataset produced the following results. The plot shows the results for the best run of each parameter set, where “best” is determined, in this case, by delivery ratio: I selected the run with the highest delivery ratio for each CFA and each routing type, and used these as the basis for the plot. This was done automatically, and this initial version does not take secondary ordering into account – when two runs tie on the best delivery ratio, the first one encountered is picked, ignoring secondary data such as cost or latency.
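The selection step described above could be sketched as follows, including the secondary ordering (lower cost, then lower latency) that the current version skips. This is a hypothetical illustration; the field names are assumptions, not the actual result schema:

```python
from collections import defaultdict

def best_runs(runs):
    """For each (CFA, routing) pair, pick the run with the highest
    delivery ratio, breaking ties by lower cost, then lower latency."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run['cfa'], run['routing'])].append(run)
    # negate cost and latency so that max() prefers lower values on ties
    key = lambda r: (r['delivery_ratio'], -r['cost'], -r['latency'])
    return {pair: max(group, key=key) for pair, group in groups.items()}
```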

Delivery Ratio, Cost, Latency and Delivered Hops for InfoMap, HGCE, LinkClustering and KCLIQUE, for both BubbleRAP and BubbleH

We can see that in terms of delivery ratio, BubbleH outperforms BubbleRAP when there is overlap (LinkClustering, HGCE), and it performs better still when there is hierarchical data (HGCE). When there is little or no overlap, BubbleH and BubbleRAP perform identically, as we know/expect. I wonder if we can explicitly test the effect of hierarchy by finding an algorithm that partitions into a hierarchy (i.e. without overlap)?

[TABLE=4]

Next to do: test hypothesis – does HGCE deep hierarchy beat HGCE flat hierarchy?

[TABLE=9]

The table above shows the statistics for Enron, HGCE, BubbleH, for DATASET_EXPLORE2, which is a new run using the original, simple datasets (without multiple runs over concatenated datasets). It still needs depth and width data adding.

MIT-NOV Hierarchy Analysis

September 1st, 2011

I re-ran the simulations with a simplified set of parameters for MIT-NOV and Cambridge. To make better sense of the data, I ranked each value of parameter M to HGCE (the values correspond to different hierarchical structures) for each metric – see table below:

[TABLE=5]

This table shows the relative rankings for Delivery Ratio, Cost, Hops and Latency for each value of parameter M to HGCE (values were derived from the optimum parameters to KCLIQUE, based on the structure of its output communities; the same parameters were used for each CFA). Table columns can be sorted by clicking on the column headings.
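The per-metric ranking behind the table is simple to reproduce; a sketch, assuming a dict mapping each M value to a metric value (whether higher or lower is better depends on the metric – higher for delivery ratio, lower for cost and latency):

```python
def rank_values(metric_by_m, higher_is_better=True):
    """Rank each parameter value of M by one metric; rank 1 is best."""
    ordered = sorted(metric_by_m, key=metric_by_m.get,
                     reverse=higher_is_better)
    return {m: rank for rank, m in enumerate(ordered, start=1)}
```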

Below is an image of the associated Community Hierarchies

MIT-NOV, HGCE Communities for different threshold values (parameter M to HGCE)

Shown below are the four metrics in bar chart form, for comparison of actual values:

Delivery Ratio vs Cost, and Hops vs Latency for MIT-NOV, HGCE, BubbleH

From the above we can see that, generally, delivery ratio improves with more depth to the hierarchy; however, in this instance a shallow, broad structure does best overall. When ranking by latency, it also appears that broad, shallow structures perform best.

Thought: should we run HGCE multiple times on the same data to see what range of different structures it comes up with? Also, we should get another hierarchical CFA (e.g. the other version of link clustering). Did this, and will post results soon.

Thought: is there a way of scoring a hierarchy based on depth, width and population of communities?
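One possible starting point for that score: extract depth and per-level width from the community hierarchy first, then fold in population from community member counts. A hypothetical sketch, assuming the hierarchy is a dict mapping each community to its sub-communities (the representation is an assumption):

```python
def hierarchy_shape(children, root):
    """Return (depth, max_width, widths) of a community hierarchy,
    where widths maps each level to its number of communities."""
    depth, widths = 0, {}
    level = [root]
    while level:
        depth += 1
        widths[depth] = len(level)
        # descend one level: gather all sub-communities of this level
        level = [child for node in level for child in children.get(node, [])]
    return depth, max(widths.values()), widths
```

A score could then, for example, compare depth against maximum width to separate deep, narrow structures from broad, shallow ones.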