Archive

Archive for the ‘what i’ve been doing’ Category

Update 27 Sep 2011

September 27th, 2011 2 comments

In our last meeting I took down these actions:

  • Ressurect NGMAST paper and use Enron + MIT-NOV – argue that MIT is not big enough to make assumptions

I have started another paper in the Clique SVN here: https://erdos.ucd.ie/svn/clique/papers/BubbleH – at the moment its just the same as the NGMAST paper.

  • Write Thesis outline and get started on background chapters – message dissemination, and social networks side (from milgram onwards etc)

I created a thesis document in the repository too: https://erdos.ucd.ie/svn/clique/papers/MattStabelerThesis – I have started to compile some notes into background chapters, based on my transfer report. I have also started a rough outline of chapters. The latest version in PDF is here.

  • Speak to Conrad about finding a non-overlapping Hierarchical CFA

Since the feedback in CLIQUE mini-workshop, I asked Fergal, Conrad and Aaron again about community finding algoriths. They mentioned the Blondel one and the InfoMap one again.

  • Get a decent sub-set of STUDIVZ dataset

To get a sub-set of nodes in the STUDIVZ dataset I started by picking N nodes randomly, and included their immediate neighbours. I do the L more times to get the nodes a depth of L hops away from the source.  Using 10 random nodes, with a depth of 2 yields a network of around 3500 nodes (12% of all nodes).  When reduced to 5 seed nodes, we get ~1000 nodes (~4%). Going the other way, 100 seed nodes, with a depth of 1 gives 14571 nodes covering ~50% of the network. These figures change depending on which nodes are selected at random initially.

Currently, i’m testing the setup with 5 seed nodes and 2 levels of network, with the hope that there will be some overlap.

Conrad suggests that we take them non-randomly – by first reducing our set of nodes to those with high activity (either number of posts, or total length of posts), then using the network L hops from the remaining nodes.

Enron Dataset Analysis

September 2nd, 2011 2 comments

Comparing BubbleH and Bubble RAP for the Enron dataset produced the following results. The plot shows the results for the best run of each parameter, best is determined, in this case, by delivery ratio. i.e. I selected the run with the highest delivery ratio for each CFA and each routing type, and used these as the basis for the plot. This was done automatically, and this initial version does not take into account secondary ordering – meaning that in a run with two identical best delivery ratios, the first to be encountered is picked, ignoring secondary data such as cost or latentcy.

Delivery Ratio, Cost, Latency and Delivered Hops for InfoMap, HGCE, LinkClustering and KCLIQUE,  for both BubbleRAP and BubbleH

Delivery Ratio, Cost, Latency and Delivered Hops for InfoMap, HGCE, LinkClustering and KCLIQUE, for both BubbleRAP and BubbleH

WE can see that in terms of Delivery Ratio, BubbleH outperforms BubbleRAP when there is overlap (LinkClustering, HGCE) and when there is Hierarchical Data, it performs better (HGCE). When there is little or no overlap, BubbleH and BubbleRAP perform identically, as we know/expect. I wonder if we can explicitly test the effect of Hierarchy  by finding an algorithm that partitions into hierarchy (i.e. without overlap)?

[TABLE=4]

Next to do: Test Hypothesis – Does HGCE deep Hierarchy beat HGCE flat HEIRARCHY.

[TABLE=9]

The table above shows the statistics for Enron, HGCE, BubbleH,  for DATASET_EXPLORE2 which is a new run using the original, simple datasets (without multiple runs over concatenated datasets). it still needs depth,width data adding.

MIT-NOV Hierarchy Analysis

September 1st, 2011 2 comments

I re-ran the simulations for a simplified set of paramets for MIT-NOV and Cambridge, to make better sense of the data, I ranked each parameter to HGCE (which correspond to different hierarchichal structures) for each metric – see table below:

[TABLE=5]

This table shows the relative rankings for Delivery Ratio, Cost, Hops and Latency for each parameter of M to HGCE (values were derived from the optimum parameters to KCLIQUE based on structure of its output communities, the same parameters were used for each CFA). (Table columns can be sorted by clicking on column headings).

Below is an image of the associated Community Hierarchies

MIT-NOV, HGCE Communities for different threshold values (parameter M to HGCE)

MIT-NOV, HGCE Communities for different threshold values (parameter M to HGCE)

Show below are the four metrics in barchart form, for comparison of actual values:

Delivery Ratio vs Cost, and Hops vs Latency for MIT-NOV, HGCE, BubbleH

Delivery Ratio vs Cost, and Hops vs Latency for MIT-NOV, HGCE, BubbleH

From the above we can see that generally, delivery ratio improves with more depth to the hierarchy, however, in this instance a shallow, broad structure does best overall. When ranking by latency, it appears that broad shallow structures perform best.

thought: should we run HGCE multiple times on the same data to see what the range of different structures it comes up with are? Also, we should get another hierarchical CFA (e.g. the other version of link clustering): Did this, and will post results soon.

thought: is there a way of scoring heirarchy based on depth, width and population of communities?

Odd results – an update

July 13th, 2011 2 comments

Just before Pádraig went away, we realised that the results we were getting for Bubble and BubbleH were exactly the same in some cases. We also wanted to see whether hierarchy was making any difference to results, and so we set these goals for the period whilst Pádraig was away:

  • work out why bubbleH and bubble are the same in some cases
  • get enron and wall posts dataset
  • pick the most hierarchical looking clustering  in these
  • see whether BubbleH works better than BubbleRap
  • (concentrate on HGCE and KCLIQUE)

The first question: Is Bubble (no global parent) always the same result as BubbleH (global parent)?

[TABLE=3]

* in the case of Social Sensing Study LinkClustering all are same for Delivery Ratio, apart from where the threshold is 0.0. (with Social Sensing we used multiple tuning values, hence multiple results, the others used only one set of parameters)

Answer: Not always.

Is the answer overlap?
I think that these results are down to the structure of the communities. InfoMap ALWAYS produces the same results for BubbleH and BubbleRAP. I wonder if this is this down to the fact that InfoMap partitions the network, and therefore there are no overlaps? Could it be that for the most part, KCLIQUE creates non-overlapping communities and hence the results? HGCE creates a large number of highly overlapping communities, which are also hierarchical. LinkClustering also creates a large number of communities and whilst edges cannot overlap, nodes can belong to multiple communities. Or is it really that the inherent Hierarchy in community structure is causing results to differ?

The question is also, why do BubbleH and BubbleRAP get EXACTLY the same results if there is no overlap? Well, this is because in the end, there is no difference between them when there is no complicated structure to the network. Even if BubbleH is using the global parent community, that is EXACTLY the same as using the Global Ranking in BubbleRAP, so when there is no overlap, each node belongs to EXACTLY one community, and has a local rank in both BubbleH and BubbleRap. The global parent in BubbleH is the same as using the global rank in BubbleRAP. In fact, we could re-write BubbleH to incorporate this explicitly, and do away with a forced global parent, but this is just implementation detail, the end results will be the same.

Second Part: get enron and wall posts dataset

I used Conrad’s version of the Enron Dataset, as he had taken the time to remove irregularities, and in case of future papers, he would have an in depth knowledge of how he processed the data, saving me a lot of time!

The connected time graph is below, showing a decent number of clusters, hopefully with some nice hierarchy!

Connected Time Graph of the Enron dataset.

Connected Time Graph of the Enron dataset.

I explored this dataset in the same way as the previous ones, by experimenting with settings for the different algorithms, InfoMap is easy to visualise, and so below is the InfoMap clustering of the dataset:

Connected Time Graph for Enron Dataset, coloured with InfoMap clustering

Connected Time Graph for Enron Dataset, coloured with InfoMap clustering

I also used Conrads Studivz wall post dataset, (see here), this dataset is huge, and so I haven’t worked out how to run the full set of simulations. I was able to create a connected time edgelist (connected time is based on wall post length, 1000ms per character). Below is the graph of all connections, with node size and colour related to degree, edges removed for clarity.

network_degree_colournetwork_degree

network_degree_colour_section

Studivz connected time graph close-up with edges

Enron Dataset

In order to get any clusters at all out of KCLIQUE and LinkClustering, I had to plug in some very small threshold values, this is probably due to the way in which I created the events (An event occurs when an email is sent, the duration of every event is set to 2 seconds), so for the most part, nodes are not connected, and therefore there are small overall connected times. (Conrad’s dataset did not include the content of the messages, so I was not able to give contact events any length based on message size). To simplify things,  I used the interesting parameters from KCLIQUE and LinkClustering, to inform the parameters for HGCE, specifically the –minCliqueEdgeWeight parameter (more info about HGCE parameters), which excludes edges based on weight, effectively thresholding the graph edges as with KCLIQUE and LinkClustering.

To recap, the threshold means that (in the case of KCLIQUE and LinkClustering and now HGCE) edges are removed where the connected time between individuals is lower than the threshold.

Threhold parameters used for the Enron dataset:

0.0, 0.0000001, 0.0000002, 0.0000003, 0.0000004, 0.0000005, 0.0000006, 0.0000007, 0.0000008, 0.0000009, 0.000001, 0.000002, 0.000003, 0.000004, 0.000005, 0.00001

The plot below shows the results for Delivery Ratio for BubbleRAP using no global parent (NGP), and BubbleH using global parent (GP) (in future, we can safely ignore the global parents part, as BubbleRAP should always be run without a global parent and BubbleH with a global parent).

BubbleRAP vs BubbleH on the Enron Dataset, showing results for multiple parameters (of M). BubbelH beats BubbleRAP in all cases

BubbleRAP vs BubbleH on the Enron Dataset, showing results for multiple parameters (of M). BubbelH beats BubbleRAP in all cases

This is a good result, it shows that in this dataset, for these parameters, BubbleH beats BubbleRAP, but now we need to consider why. I had done an earlier exploration of the enron dataset with default values for M ( 0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.9), and so I looked back the results for that, and was surprised to see that BubbleRAP does much better in some cases. Below is a plot with the new data included (Solid red BubbleRAP line and blue dot dash BubbleH show this data).

BubbleRAP vs BubbleH for Enron Dataset, with extra (default) values for HGCE parameters (for M). Showing stronger results for BubbleRAP

BubbleRAP vs BubbleH for Enron Dataset, with extra (default) values for HGCE parameters (for M). Showing stronger results for BubbleRAP

So, BubbleRAP does better in some situations, (PS. NEED TO CHECK THIS AGAIN TO MAKE SURE BUBBLEH IS PROPERLY PLOTTED).

I started to look at then next step, hoping it will give some answers: pick the most hierarchical looking clustering  in [datasets].

I picked the best results for BubbleH, and mapped the communities to a hierarchical tree structure, shown below:

Community structure for nest run of BubbleH on Enron dataset where M = 0.000001.

Community structure for best run of BubbleH on Enron dataset where M = 0.000001

So far so good it seems to have a broad structure with some specific hierarchical clusters, I also mapped the worst run:

Community structure for best run of BubbleH on Enron dataset where M = 0.0000007

Community structure for best run of BubbleH on Enron dataset where M = 0.0000007

This too has some good structure to it, note however, that this is a plot of the  worst performing run, in a set of best parameters, (we didn’t run a whole range of values – so, this is worst of the best)

The next to show is the best BubbleRAP run, as below:

0.0000005

Community structure for best run of BubbleRAP on Enron dataset where M = 0.0000005

Interestingly, this has broad set of high level communities (as with the previous plots, but they had a global parent), but less broadness lower in the hierarchy (i.e. fewer lower communities).

TODO: plot the worst run of BubbleRAP

UPDATE: I found a better way to show the Dendograms of the Community hierarchy, and have automated the process. This page shows all of the plots for Enron without global parent and this page WITH global parent. To get a better idea of the relationship between hierarchical structure and results, I need to combine the results from runs, with the structure in the same place – so we can compare side by side what happens when there is different types of community.

Social Sensing Dataset

June 15th, 2011 3 comments

I finally got around to extracting out contact events from the social sensing study. I have made some notes on the Social Sensing Dataset page. The main images are shown below:

Social Sensing Dataset activity, 15mins, hourly, daily, 3 daily, and weekly. Lines showing the most active period

Social Sensing Dataset activity, 15mins, hourly, daily, 3 daily, and weekly. Lines showing the most active period

It seems most active from Oct 2008 onwards, and tails off, with the end of Nov 2008 being a good cut-off point. This means there is over a year of contact data.

Force directed connected time graph for SocialSensing dataset, showing mapping of users to nodes, with low weight edges removed. NOTE: Do not share this image - remove names/btmacs first

Force directed connected time graph for SocialSensing dataset, showing mapping of users to nodes, with low weight edges removed. NOTE: Do not share this image - remove names/btmacs first

The connected time graph (i.e. edges weighted by the total length of time connected with the target node) gives a really good sense of the social structure. As I personally know most of these people, (and I am on it myself) it is easy for me to see the relevance of the layout.

I haven’t written code to lookup locations based on cell towers/wifi points yet, but it might be interesting in the future if we end up using location.

Also, I need to work on getting BTMAC addresses for the three other people. The problem is that I know that at least one of these people changed phones during the study – hence why they are removed.

Extending Datasets

May 31st, 2011 2 comments

Pádraig had the idea to try to eliminate Delivery Ratio from our analysis, this way we could then concentrate specifically on Cost and Latency. He suggested that we take the normal dataset and extend it to three times the length, repeating the events in the correct time ordering. Then we could run the simulations 10 times each with a different start time in the original  portion of the dataset, this would give each message a much better chance of being delivered, over three iterations of the dataset.

extendeddatasets

I implemented this by taking a slice of the MIT dataset, processing the events contained within it, and duplicating them, adding the length of the slice to the timestamp times the iteration (i.e. MIT-NOV, MIT-NOV+1 length, MIT-NOV+2 lengths). Any contact events that were active at the start or end time of the slice, were truncated to start/end at the start/end time. I verified the resulting dataset by counting the number of connection every day, and ensuring that there was a distinct repeating pattern. For each dataset, I split the duration between the start and end time of the original dataset time period into 10 equal chunks, and calculated the corresponding start time for each. For MIT and Cambridge this equates to roughly every three days, whereas in the InfoCom datasets there is  a much smaller gap between start times (~14-20 hours).

[expand title=”start times”]

mit-nov 2004-11-10_00:00:00 - 2004-12-11_00:00:00
2004-11-10_00:00:00
2004-11-13_02:24:00
2004-11-16_04:48:00
2004-11-19_07:12:00
2004-11-22_09:36:00
2004-11-25_12:00:00
2004-11-28_14:24:00
2004-12-01_16:48:00
2004-12-04_19:12:00
2004-12-07_21:36:00

infocom-2005 2005-03-06_00:00:00 - 2005-03-12_00:00:00
2005-03-06_00:00:00
2005-03-06_14:24:00
2005-03-07_04:48:00
2005-03-07_19:12:00
2005-03-08_09:36:00
2005-03-09_00:00:00
2005-03-09_14:24:00
2005-03-10_04:48:00
2005-03-10_19:12:00
2005-03-11_09:36:00

infocom-2006 2006-04-24_00:00:00 - 2006-05-02_00:00:00
2006-04-24_00:00:00
2006-04-24_19:12:00
2006-04-25_14:24:00
2006-04-26_09:36:00
2006-04-27_04:48:00
2006-04-28_00:00:00
2006-04-28_19:12:00
2006-04-29_14:24:00
2006-04-30_09:36:00
2006-05-01_04:48:00

cambridge 2005-10-28_00:00:00 - 2005-12-23_00:00:00
2005-10-28_00:00:00
2005-11-02_13:30:00
2005-11-08_04:00:00
2005-11-13_18:30:00
2005-11-19_09:00:00
2005-11-24_23:30:00
2005-11-30_14:00:00
2005-12-06_04:30:00
2005-12-11_19:00:00
2005-12-17_09:30:00

[/expand]

Running Simulations

We decided to remove the overhead of running the CFAs with multiple parameters, by choosing a set of parameters for each, that had performed well in previous tests. The list below shows the ones we chose:

[expand title=”CFA Parameters”]

MIT-NOV
	HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1.3	0	0.25	0.2	4	100	0.7	0.5	0.1

	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  4	1.3	0	0.25	0.3	4	100	0.9	0.9	0.1
			# Communities AverageMembers
			# 21 15
			2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 29 30 31 32 34 35 36 37 38 39 41 42 43 45 46 47 48 50 51 53 54 55 57 58 59 60 61 62 63 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 94 95 97 98 99 100 101 102 103
			6 8 14 16 18 30 41 46 53 59 61 63 78 79 86 88 91 95 101 102
			6 13 17 29 30 47 86 88 95 101
			10 25 62 65 74 83 102
			3 25 30 62 74
			4 23 30 45 57 91
			9 30 42 101
			19 27 35 37 39 81 89 102
			2 41 51 87 88 95 102 103
			15 41 43 51 67 69 87 88 95 100 103
			19 27 41 66 85 98
			15 43 46 58 61 67 69 78 79 83 85 90 95 97 100 103
			2 7 11 15 57 61 67 69 78 79 85 87 88 90 95 97 100 101 103
			6 7 8 10 11 14 15 16 17 21 30 31 34 35 38 41 43 46 55 58 61 69 75 78 79 80 83 85 86 88 90 91 94 95 97 101 102
			8 10 15 16 17 30 38 41 46 53 63 73 80 83 85 95 97 99 101 102
			10 14 18 25 35 46 65 83 86 88 90 102
			10 17 25 35 53 72 85 88
			10 25 53 72 74 78
			27 30 50 95 97 99
			4 12 19 20 23 24 27 30 32 36 37 39 41 45 48 50 54 59 60 68 70 76 77 81 82 84 89 98
			19 27 37 39 41 68 81 98

  KCLIQUE
	  BubbleH
	  K	MCT
	  3	0.01
			# k = 3
			# minimum_connected_ratio = 0.01
			# number_of_communities = 4
			# community_sizes = [5, 3, 45, 24]
			# average_size = 19.25
			103 2 67 100 87
			99 27 69
			6 7 8 10 11 13 14 15 16 17 18 21 29 30 31 34 35 38 41 43 44 46 53 55 58 59 61 63 67 73 75 78 79 80 83 85 86 88 90 91 95 97 100 101 102
			68 70 76 12 77 81 82 19 84 23 24 89 32 98 36 37 39 45 48 50 54 20 59 4 

	  Bubble
	  K	MCT
	  3		0.01	

  LinkClustering
	  BubbleH
	  ET
	  0
			# D_max: 0.591406681046
			# S_max: 0.596774193548
			# Communities: 79
			# Average size : 8
			# Nodes : 95 (seen 635 times)
			3 72
			51 73
			58 30 77 50 102 88 89 97 53 67 82 69 81 86 87 84 85 24 20 21 48 23 46 45 43 41 4 7 6 8 68 11 83 13 99 76 75 38 70 91 90 100 101 95 29 79 78 15 10 39 12 59 14 17 98 19 54 31 16 37 36 35 55 63 32 61 103
			25 72
			62 45
			74 65
			72 35
			77 19 74 20
			19 9 77
			25 84
			103 67 47
			3 30
			25 65
			44 73
			74 73
			9 2
			83 61 88 67 102 90 100 81 95 87 79 85 11 10 101 21 17 16 86 41 2 7 103
			25 74
			27 74
			91 10 17 16 30 88 41 61 62 102 83 101 86 79
			61 88 67 91 83 100 101 86 87 97 85 69 95 57 30 41 7 103 78 31 90
			3 74
			19 98 66 24
			76 64
			4 45 50 72
			102 63 30 33 101
			62 65
			24 20 48 19 32 37 77 98 89 73 54 59
			10 21 61 88 63 90 91 83 101 102 94 97 85 95 25 86 58 17 16 55 18 30 43 53 41 46 69 7 8 103 78
			1 83 30
			9 72
			98 36 77 76 89 70 68 20 84 24 39 12 59 48 23 19 54 82 45 37 50 4 32
			30 42 101
			18 84
			42 72
			90 28 103
			44 27
			25 3
			25 35
			25 62
			102 88 91 90 80 101 95 87 79 85 11 10 81 15 21 14 16 46 86 31 30 61 43 35 41 97 7 6 9 8 18
			60 81 30
			9 52
			52 101 15 16 30 41
			64 73
			39 27 19 37 50 36 98 89 4 68
			99 63 73 102 83 101 95 87 97 85 11 10 38 15 14 17 16 46 31 30 53 41 94 6 8
			73 72
			83 29 88 65 90 102 103 80 86 79 78 10 38 15 21 55 18 31 43 35 46 69 7
			61 63 66 67 102 83 100 101 95 85 10 14 16 46 41 55 69 7 8 78 90
			11 13 58 17 16 47 31 30 29 88 95 83 101 86
			23 34
			93 40
			9 57
			21 16 30 40 41 61 88 6 102 101 86 14
			27 66
			67 43 61 88 63 53 90 69 80 81 86 87 85 21 46 95 29 41 79 7 6 8 83 99 75 91 103 100 101 102 94 97 78 11 10 13 38 15 58 17 16 55 18 31 30 35 34 14
			9 42
			11 13 14 17 30 35 41 61 88 6 93 101 86 87 79
			73 66
			74 72
			3 53
			87 15 95 51 43 41 88 67 102 8 103 69 100
			7 21 43 99 75 64 95 90 101 86 79 78 10 58 55 18 31 30 29 88 6 8
			3 62
			10 88 72 17 69 85 87 53 78
			62 74
			62 28
			9 74
			77 42
			60 98 20 54 77 50
			2 51
			12 2
			91 10 101 14 31 30 53 34 99 61 74 102 83 41 95 78
			57 45 4 23
			9 25
			14 61 44 30 53 41 102 88 7 6 91 83 101 95 79
			11 83 99 61 88 63 90 102 103 100 81 69 94 97 85 91 10 27 15 21 14 16 55 95 31 30 35 41 80 79 101 6 8

	  Bubble
	  ET
		0.003

	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/mit-nov/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:03:51 +0100}
			# number_of_communities = 11
			# avg size = 9
			# nodes = 95
			19 20 24 32 12 54 98 76 89 50 77 36 70 48 68 4 84 45 37 82 59 23 39 81 60
			30 102 86 14 6 91 61 41 16 101 88 79 18 53 35 63 44 9 40 93 52 72 33 42
			15 97 85 100 80 83 38 73 1
			90 95 29 46 58 8 7 75 43 66 64 94
			67 103 87 2 51 57 47
			21 55 78 34
			11 17 13
			31 10
			69 27 99
			25 65 74 3
			62 28

InfoCom2005
  HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.5	0.5	0.1
	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.6	0.9	0.1

			# Communities AverageMembers
			# 2 27
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41

  KCLIQUE
	  BubbleH
	  K	MCT
	  4	0.009

			# k = 4
			# minimum_connected_ratio = 0.009
			# number_of_communities = 1
			# community_sizes = [37]
			# average_size = 37.0
			1 2 3 4 5 6 7 8 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 37 38 39 40 41 

	  Bubble
	  K	MCT
	  2	0.11

  LinkClustering
	  bubbleH
	  ET
	  0.008
			# D_max: 0.398456726708
			# S_max: 0.368421052632
			# Communities: 15
			# Average size : 7
			# Nodes : 40 (seen 99 times)
			9 30
			21 41
			12 23
			11 25 26 35 15 41 13 14 24 10 39 27 20 21 22 16 33 32 37 29 40 34 1 19 3 5 4 7 6 8 17 18
			38 36
			10 39 15 22 33 37 36 34 1 13
			24 27 39 2 14 35 34
			24 25 13 15 9 5 37 29 34
			18 12
			2 23
			11 39 27 32 30 37 40 35 5 4 28
			26 22 23 33 8 6 18 34
			9 12
			39 38 40 22 19 32 35 4

	  bubble
	  ET
	  0.04

	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2005/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:03:59 +0100}
			# number_of_communities = 3
			# avg size = 14
			# nodes = 41
			40 32 39 1 27 22 15 35 29 19 24 17 4 25 8 13 30 10 9 26 14 36 5 34 11 6 16 38 21 3 23 28 41 2 12 31
			33 20 37
			7 18

InfoCom2006
  HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.7	0.5	0.2
			# Communities AverageMembers
			# 2 49
			22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98
			22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98

	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.3	4	100	0.6	0.5	0.2

  KCLIQUE
	  BubbleH
	  K	MCT
	  3	0.02

			# k = 3
			# minimum_connected_ratio = 0.02
			# number_of_communities = 1
			# community_sizes = [58]
			# average_size = 58.0
			22 25 26 27 28 31 32 34 35 36 37 38 39 40 43 44 45 46 47 49 50 51 52 53 54 55 56 57 58 60 61 63 64 66 67 70 72 73 74 75 76 77 78 79 81 82 83 84 85 86 87 88 89 90 91 93 95 98 

	  Bubble
	  K	MCT
	  3	0.02

  LinkClustering
	  BubbleH
	  ET
	  0.01
			# D_max: 0.352182263797
			# S_max: 0.407407407407
			# Communities: 95
			# Average size : 4
			# Nodes : 78 (seen 387 times)
			86 35
			38 40 49 36 35
			44 56
			83 48
			24 62 27
			68 69 21
			22 67
			36 48
			30 59
			46 44
			56 71
			91 39
			92 48
			58 94
			83 40
			56 94 40
			59 43
			80 65
			56 40 48
			24 60 62 51 75 52 77 76 88 89
			46 39 86
			47 56 40
			77 58 44 45 51 43 41 60 75 89 73 71 70 64 80 92 95 93 74
			25 38 58 49 32 57 56 88 40 43 60 75 89 53 52 86
			68 48
			46 26 39
			26 80
			39 74 89 46 54 88 53 52
			26 56
			74 97
			56 29
			54 48
			26 97
			46 85
			60 25 75 22 57 51 78
			24 62
			54 41
			76 58 82 32 44 30 51 52
			38 22 49 56 40
			47 22
			87 65 72
			26 58 53 43
			31 65
			83 73 23
			24 80
			88 57 70 28 50 77 75 89 66 67 82 31 93 81 44 84 85
			42 48
			24 62 48
			69 21
			47 45 28 43 60 88 89 73 79 64 93 81 86 84 78
			54 22
			73 29
			56 73 40 49
			80 59
			45 28
			24 62 36
			68 65
			22 48
			65 96
			42 29
			86 85
			25 44
			71 41
			51 21
			63 38 59
			47 92
			25 67
			31 35 49
			29 43
			89 58 43 37 36 53 34
			39 30
			22 85
			97 85
			92 71
			74 64 73 68 51 45 86 43 52
			76 66 91 90 31 96 93 79 78
			59 52
			77 58 32 33 86 45 51 43
			25 65
			83 23
			46 39 59
			30 48
			33 48
			46 39
			48 51 53 76 89 64 66 82 73 80 84 78
			27 49 55 56 35 61 63 40 38
			25 26 74 54 57 75 52
			26 22
			68 47
			69 65
			61 27 51 43
			28 60 88 89 64 66 67 82 80 81 86 87 84 25 44 45 42 43 34 77 76 75 74 73 72 70 91 90 93 95 79 78 51 58 98 32 57 37 50 53 52 54 31
			27 48
			37 59 34

	  Bubble
	  ET
	  0.09  

	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2006/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:04:18 +0100}
			# number_of_communities = 5
			# avg size = 16
			# nodes = 78
			78 93 76 66 60 74 64 82 77 81 45 72 84 88 28 43 89 90 44 53 51 67 79 87 37 70 75 86 91 56 34 52 73 48 58 40 95 41 32 80 71 50 31 54 26 57 85 47 92 98 36 24 30 33 59 42 62 96 65 23 29 83 94 97 21
			38 35 49 63 55 27 61
			25 22
			39 46
			68 69

Cambridge
  HGCE
	  BubbleH
	  K	A	M	E		P	S	R	ST	MAP	Z
	  5	1.6	0	0.25	0.2	4	100	0.6	0.5	0.1
			# Communities AverageMembers
			# 4 23
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
			1 2 3 4 5 6 7 9 10 11 13 14 15 17 18 19 20 21 23 25 28 29 30 32 33 34
			2 3 5 6 7 8 9 10 11 12 13 16 18 20 21 22 25 27 28 29 31 32 33 34 35 36
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 23 24 26 29 30 31 33 35

	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  5	1	0	0.25	0.3	4	100	0.9	0.9	0.2

  KCLIQUE
	  BubbleH
	  K	MCT
	  2	0.02
			# k = 2
			# minimum_connected_ratio = 0.02
			# number_of_communities = 3
			# community_sizes = [2, 2, 11]
			# average_size = 5.0
			1 4
			24 26
			32 34 3 36 7 16 21 22 25 27 28 

	  Bubble
	  K	MCT
	  2	0.04

  LinkClustering - bubbleH
	  ET
	  0.004
			# D_max: 0.75851421306
			# S_max: 0.5
			# Communities: 12
			# Average size : 5
			# Nodes : 34 (seen 55 times)
			11 25 27 21 22 16 32 28 36 34 3 12 7
			9 5 19 15 33
			2 34
			26 35
			3 6
			13 15 6 19 20
			9 13
			10 20 14 17 23 19 18 30 1 33 2 5 4 6 15
			2 35
			24 26
			11 21 22 29 7

	  Bubble
	  ET
	  0.003

	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/cambridge/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:04:24 +0100}
			# number_of_communities = 3
			# avg size = 12
			# nodes = 36
			34 21 36 22 25 32 28 3 7 16 27 12 11 29 8
			18 33 15 14 5 20 19 4 6 2 23 1 17 30 10 13 9 31
			26 35 24

[/expand]

I originally made the decision to use seperate parameters for Bubble and BubbleH, but in hindsight this was not a good idea, as they cannot easily be compared.

When we discussed the results, we realised the KCLIQUE seems to perform well for latency with BubbleH, despite not having much hierarchical structure. Pádraig suggested I investigate this further, to find out what is happening. In the mean time I have re-run the simulations, using the best parameters for BubbleH – to level the playing field in terms of parameters. As is obvious from the above community structures, only HGCE produces communities with one all encompassing community, and in the case of KCLIQUE for both InfoCom datasets, it produces only one community – this seems odd, as we would expect the best performance from a more complicated structure.

This resulted in the following community structure.

[expand title=”Community Structures”]

MIT-NOV
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    21	        15.80952381     	4	        87
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    4	          20.25	            4	        46

  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    11          8.636363636     	2	        25

  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    78	          8.141025641     2       	67

INFOCOM-2005
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    2	          40     	          40	      40

  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    1	          38     	          38	      38

  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    3	          13.66666667     	2	        36

  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    14          7.071428571     	2	        32

INFOCOM-2006
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    2	          73     	          73	      73
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    1	          59                59       	59
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    5	          15.6            	2       	65
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    94          4.117021277     	2	        43

CAMBRIDGE
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    4	          28.75     	      23	      36

  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    3           6     	          3       	12

  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    3           12                3         18

  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    11          5	                2	        15

[/expand]

Delivery Ratio, Cost, Latency and Delivered Hops for each CFA and for each Dataset, for BubbleRAP and BubbleH.

Delivery Ratio, Cost, Latency and Delivered Hops for each CFA and for each Dataset, for BubbleRAP and BubbleH. (OLD PLOT - SEE BELOW FOR NEW PLOT)

In the results above, there seems to be something odd happening with LinkClustering for BUBBLERap, for Cost and Hops in MIT-NOV and InfoCom2005 it has much higher values than for the other CFAs. Also, there appears to be something odd happening with Latency in the Cambridge dataset, with a maximum of ~1000 hours latency which is about 40 days. The MIT dataset is roughly the same length, but doesn’t get anywhere near the same latency.

The two plots below show Delivery Ratio and Latency for BubbleH, HGCE for the Cambridge dataset.

Delivery Ratio for HGCE-BubbleH, on the Cambridge dataset

Delivery Ratio for HGCE-BubbleH, on the Cambridge dataset

Latency for HGCE-BubbleH, on the Cambridge dataset

Latency for HGCE-BubbleH, on the Cambridge dataset

These seem to show that at the points where the new parts of the dataset are joined, there is a huge jump in deliveries of messages. It also seems that the datasets are truncated at the end. I will double check the start/end times, but this could be due to the original dataset having no events after a certain timestamp, and therefore, in the last section, the simulation stops when there are no events left to process. The huge spike in messages delivered explains the very high and very low extremes – as messages that were sent in the last iteration of message sending, get delivered very quickly. It seems odd however, that the same is not true for the first iteration. It could be, that because the last set have started to be distributed, maybe only one hop away, these nodes are already in a much better position to deliver the nodes at the start of the next section. This could be due to higher activity in the first hours of the dataset collection, perhaps even due to all the nodes being switched on in the same place at the same time. If this is the case, then we need to start the dataset slice slightly later, to avoid this. The activity traces for each dataset is shown below:

MIT-Nov Daily Connections

MIT-Nov Daily Connections

Cambridge Average Daily Connections

Cambridge Daily Connections

Infocom2005 Hourly Connections

Infocom2005 Hourly Connections

Infocom2006 hourly connections

Infocom2006 hourly connections

From these we can see that MIT has a good distribution of connectivy across the dataset, but the other three have only short spikes of activity. This in itself does not make a difference to the comparison of resulst for each datasets, as they are all relative, but it does go some way to explain the very high latency figures for Cambridge.

The issues with KCLIQUE

I looked back at the data for KCLIQUE in the NGMAST simulations, where there were multiple parameters to KCLIQUE, resulting in varying degrees of community structure, for all datasets, the delivery ratio in KCLIQUE seems to increase with member count, whereas community count peaks before delivery ratio really gets going.

commsvsstats-combined

Delivery Ratio, Cost, Community Count and Average Member count.

This makes sense, because for BubbleH, the more we know about members of a the communities, the more messages we can direct using local ranking. And, because we haven’t included an overall community for KCLIQUE (or InfoMap*, or LinkClustering) we have no way to deliver messages to nodes not in a community, until a node holding a message comes into contact with the destination node. I’m not sure if we designed it this way or whether this is just a bug.

The next step is to fix this issue, by either

  1. adjusting the BubbleH algorithm to use global rank to pass messages between nodes at the highest level
  2. Force a global community in the other community finding algorithms (this will have the same effect as 1)

* InfoMap includes every node in the communities list, but does not include a global parent community

Update:

I have now run all of the datasets with a forced global parent community, this seems to have evened the playing field somewhat:

matrixes-global-parent

Results where every CFA is forced to produce a global parent community including all nodes. (NOTE the change in axis labels from previous run)

The interesting thing is that there is very little difference in delivery ratio and cost using BUBBLERap.

Results for NGMAST11/NEMO

May 23rd, 2011 2 comments

We have decided not to submit to NGMAST11 conference as the results we have were not really ready, also I found an error in the config setup for KCLIQUE BubbleH. I have therefore run all of the simulations from scratch, just to make sure it is all correct. In this re-run, I ran the HGCE algorithms far too many times for each dataset, so there is a very large number of results for the HGCE algorihtm (~700 for each dataset). We plan to submit to the Finding NEMO workshop instead.

To visualise the results so far, we have plotted the best  (see note below) results for BubbleRAP and BubbleH:

matrix

BUBBLERap and BubbleH compared for Delivery Ratio, Cost and Latency

To determine the ‘best’ run for algorithms with multiple parameters, we first filtered the results to the best 10 for latency, then ordered by delivery ratio followed by cost, this should mean that a good delivery ratio is still shown for good delivery latency, but avoids results where a abnormally low delivery ratio is chosen with a very poor delivery ratio.

Initial results – HGCE, KCLIQUE, InfoMap and LinkClustering

May 7th, 2011 2 comments

We ran each community finding algorithm over the datasets: MIT-NOV, Cambridge, InfoCom 2005, and InfoCom 2006. We used bubbleRAP, BubbleH, Unlimited flood, hold & wait, and Prophet to measure the effect of the CFA on the dataset. The results below are grouped by dataset. Only the best results for BubbleH/BubbleRAP KCLIQUE and HGCE are shown. Note: The local ranking used for InfoMap is based on the rank given by the InfoMap algorithm, NOT centrality as with BubbleRAP and BubbleH.

MIT-NOV

MIT-NOV

CAMBRIDGE

CAMBRIDGE

IncoCom 2005

IncoCom 2005

InfoCom 2005

InfoCom 2005

My next plan is to link statistics about community structure (number of communities, average members etc.) to the results.

Also, I want to find out if there is an existing measure of overlappiness? Does the amount of overlap in communities affect the results in any way? Also, what about the extent to which there is ‘nesting’ of communities, does this make a difference?

I also want to visualise the communities produced in some way, so that it is clearer what the network structure for the best results is based on.

This excel file this excel file contains the combined results of each CFA for all 4 datasets, so we can compare the effect of community structure (number of communities, average member size) directly with results for each simulation run.

BubbleH and SMW11

April 19th, 2011 2 comments

Our paper for SMW11 was accepted, and now we will be able to include the results from the bug-fixed version of  BubbleH, which performs very well. An addition that Pádraig suggested, was to ignore level information in the hierarchy, and simply use community size. This has the benefit of making the algorithm simpler, but still incorporating hierarchical structure. I ran the two versions side by side and found that they perform almost identically, with some cases ignoring level being slightly better!

MIT-NOV Dataset, with levels included and not included for multiple parameters to H-GCE

MIT-NOV Dataset, with levels included and not included for multiple parameters to H-GCE

This plot is very hard to read, but it is possible to see the similarities at a broad level. The best performing run was with H-GCE parameters of K=3, E=0.15, ST=0.9, MAP=0.9 and Z=2. It’s structure seems to be relatively flat, with a large number of communities:

[expand title=”click to view”]

  0(90)
  |
   - 1(5)
  |
   - 2(5)
  |
   - 3(8)
  |
   - 4(8)
  |
   - 5(8)
  |
   - 6(4)
  |
   - 7(28)
  |
   - 8(27)
  |
   - 9(22)
  |
   - 10(17)
  |
   - 11(13)
  |
   - 12(20)
  |
   - 13(16)
  |
   - 14(26)
  |  |
  |   - 15(12)
  |
   - 16(7)
  |
   - 17(4)
  |
   - 18(4)
  |
   - 19(3)
  |
   - 20(8)
  |
   - 21(8)
  |
   - 22(3)
  |
   - 23(13)
  |
   - 24(8)
  |
   - 25(27)

[/expand]

This gives us the following graphic for the SMW11 paper.

bubbleh-bubble-prophet-flood-ignorelevelstrue-bestrun

MIT-NOV Dataset, for multiple parameters to H-GCE, compared to Bubble, PROPHET and Flood

The latency is shown below, which seems to follow the same trend as the previous version, but with BubbleH actually beating all but Flood at the end.

Latency in MIT-NOV dataset for Bubble, BubbleH, PROPHET and Flood.

Latency in MIT-NOV dataset for Bubble, BubbleH, PROPHET and Flood.

This is a positive result, but whilst doing some work towards the next paper for NGMAST11, I realised that we should be doing runs for multiple parameters to K-Clique, however for this paper, probably don’t need to worry so much. Also, for reference, the average value for BubbleH is included in the plot below.

bubbleh-bubble-prophet-flood-ignorelevelstrue-bestrun-withavg

We need to consider whether to evaluate multiple parameters to BubbleRAP, and see whether this affects the algorithm. Also, we need to consider whether the hierarchy is really making things better or is it the sheer number of communities, because the best performing run has a quite flat structure.

The trouble with too many hops

April 12th, 2011 6 comments

In the recent paper that we submitted to SMW2011, we showed that we could increase the delivery ratio using overlapping hierarchical clustering, however, we also identified that the cost involved was much higher than we expected, up to 245 hops in some cases.  To find out what was happening, we checked the behaviour of the best performing parameters (GCEH-K3-E0.15-ST0.7-MAP0.5-Z0.2) by recording the route of each message along with some hints as to why the message was being passed at that point. Hop Stats for GCEH-K3-E0.15-ST0.7-MAP0.5-Z0.2 (11MB uncompressed).

This is the current algorithm:

[expand title=”click to view algorithm”]

for (BubbleHMessage message : my_buffer) {
  /**
  * Identify smallest community that Self and Destination share =
  * Bridging Community
  * */

  int my_bridge_community = BubbleHierarchyOracle.bridgeCommunity(my_node.getID(), message.dest, my_properties);

  // if encountered node is destination
  if (process.getNode().getID() == message.dest) {
  // pass it on to them.
    destination_met++;
    toSend.add(message);
    message.setHopCode("DM");
  } else {
    int remote_bridge = BubbleHierarchyOracle.bridgeCommunity(process.getNode().getID(), message.dest, my_properties);
    if (BubbleHierarchyOracle.isBetter(remote_bridge, my_bridge_community)) {
      // if P is in community with message.dest, that is smaller
      // than BC
      // pass message
      message.setHopCode(remote_bridge +"-isBetter-" + my_bridge_community+ ":" + BubbleHierarchyOracle.whyIsBetter(remote_bridge, my_bridge_community));
      toSend.add(message);
      smaller_community_found++;
    } else if (process.getCommunities().contains(my_bridge_community)) {// if both nodes are in the bridge community
        // if the rank of the encountered node is higher, pass the message
      if (process.getlocalRank(my_bridge_community) > getlocalRank(my_bridge_community)) {
        // pass message
        toSend.add(message);
        message.setHopCode("RANKBC-" + my_bridge_community);
        bridge_community_ranking_message_passed++;
      }
    } else {
      // process is not destination, is not in a smaller community
      // with the destination, and is not in the bridge community,
      // therefore we do not pass the message to it
    }
  }

}

[/expand]

The isBetter function:

[expand title=”click to view isBetter function”]

public static boolean isBetter(int candidate_bridgeCommunity, int test_bridgeCommunity) {

  if(candidate_bridgeCommunity == -1 && test_bridgeCommunity == -1){
    // neither have a shared community
    return false;
  }else if((candidate_bridgeCommunity == -1) && (test_bridgeCommunity >= 0)){
    // candidate shares a community with dest
    return false;
  }else if((candidate_bridgeCommunity >= 0) && (test_bridgeCommunity == -1)){
    // current shares a community, but candidate does not
    return true;
  }else if(heirarachy.level(candidate_bridgeCommunity) > heirarachy.level(test_bridgeCommunity)){
      // candidate has a higher level
      return true;
  }else if (heirarachy.members(candidate_bridgeCommunity).length < heirarachy.members(test_bridgeCommunity).length){
      // candidate has less members
      return true;
  }else{
    return false;
  }

}

[/expand]

Some information about the run:

Log first ran in simulator: 2004-12-09_23:57:24
Log Filename: [...]0.hop-stats.dat
Mesages: 10506
	Hops: 276313
	Avg: 26
	Max Hop Count: 245
Earliest Hop: Wed, 10 Nov 2004 00:00:00 +0000
Latest Hop: Thu, 09 Dec 2004 23:13:29 +0000
Distribution of path lengths for GCEH-K3-E0.15-ST0.7-MAP0.5-Z0.2 using MIT-NOV dataset
Distribution of path lengths for GCEH-K3-E0.15-ST0.7-MAP0.5-Z0.2 using MIT-NOV dataset

Community Structure:

[expand title=”click view community structure”]

 0(95)
  |
   - 1(95)
    |
     - 2(33)
    |
     - 3(30)
    |
     - 4(33)
    |
     - 5(31)
    |
     - 6(20)
    |  |
    |   - 8(7)
    |
     - 7(12)
    |
     - 9(15)
    |
     - 10(12)
    |  |
    |   - 11(5)
    |  |
    |   - 12(5)
    |
     - 13(4)
    |
     - 14(24)
    |
     - 15(15)
    |
     - 16(11)
    |
     - 17(49)
    |  |
    |   - 18(11)
    |
     - 19(41)
    |  |
    |   - 20(14)
    |
     - 21(4)
    |
     - 22(11)
    |
     - 23(5)
    |
     - 24(45)

[/expand]

Community Membership:

[expand title=”click to view”]

# options
# {'phi': 0.25, 'threads': 5, 'max_times_spoken_for': 1, 'minCliqueSize': 3, 'num_realizations': 100, 'epsilon': 0.14999999999999999, 'perturb_alpha': 0.20000000000000001, 'minCliqueEdgeWeight': 0.0, 'similarity_threshold': 0.69999999999999996, 'outfile': None, 'min_appearance_prop': 0.5, 'intNodeIDs': False, 'file': None, 'alpha': 1.0}
# ForceGlobalParent:true
# community_id-parent_id: members...n
0--1: 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 57 58 59 60 61 62 63 64 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 95 97 98 99 100 101 102 103
1-0: 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 57 58 59 60 61 62 63 64 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 95 97 98 99 100 101 102 103
2-1: 6 8 10 11 14 15 16 17 18 21 29 30 31 38 41 43 46 55 61 63 79 80 83 85 86 88 90 91 94 95 97 101 102
3-1: 6 10 11 14 15 16 17 18 21 30 31 38 41 43 55 61 69 79 80 83 85 86 88 90 91 94 95 97 101 102
4-1: 6 7 8 10 14 15 16 17 18 21 29 30 31 41 46 55 58 61 75 78 79 80 83 85 86 88 90 91 94 95 97 101 102
5-1: 6 10 11 13 14 15 16 17 18 27 30 31 35 41 46 53 61 63 69 78 79 83 85 86 88 91 95 97 99 101 102
6-1: 3 9 10 16 25 30 31 35 41 53 61 65 72 74 79 88 91 95 101 102
7-1: 6 10 14 18 25 31 62 65 83 86 88 102
8-6: 3 25 41 53 65 72 74
9-1: 6 14 16 30 41 44 53 61 73 79 83 91 95 101 102
10-1: 9 15 16 30 41 42 52 61 77 91 101 102
11-10: 9 30 42 77 101
12-10: 9 30 41 42 101
13-1: 1 30 83 101
14-1: 7 15 16 30 35 38 41 46 53 58 61 69 79 80 83 85 90 91 95 97 100 101 102 103
15-1: 10 16 30 41 53 61 73 74 79 83 91 95 99 101 102
16-1: 6 11 13 14 17 30 47 58 86 90 95
17-1: 1 2 3 7 8 10 15 21 25 27 29 31 34 38 42 43 46 47 51 52 55 57 58 61 64 65 66 67 69 72 73 74 75 78 79 80 83 85 87 88 90 93 94 95 97 98 99 100 103
18-17: 8 10 15 27 66 80 83 85 97 98 100
19-1: 1 2 6 7 8 14 15 18 21 29 33 34 38 40 43 46 47 51 55 57 58 64 66 67 72 73 75 78 80 83 85 86 87 88 90 94 95 97 100 102 103
20-19: 2 15 38 51 67 80 85 87 88 95 97 100 102 103
21-1: 28 58 90 103
22-1: 23 30 39 41 59 63 79 81 91 101 102
23-1: 45 69 72 88 99
24-1: 3 4 9 12 19 20 23 24 25 28 32 33 34 36 37 39 42 44 45 48 50 51 52 54 57 59 60 62 63 64 65 66 68 70 72 73 74 76 77 81 82 84 89 94 98

[/expand]

As expected there are a high number of messages that arrive within a small number of hops, which is highly desirable, however there is a long tail that extends up to 245 hops, which was not expected.

An annotated log file excerpt for message pair 85:3 (one of the longest paths) is shown below. Note that this message is live, which means it has not been delivered. Analysis of this shows a large number of hops between the same nodes, and the same communities.

Log file Exerpt

Key:

n-isBetter-n:Reason = the first  (candidate) community is deemed
                      better than the last (test), for the reason given.
RANKBC-n  =           the rank of the two nodes in the given community n is
                      compared, and the encountered node scored higher
DM =                  destination met, so the message was passed

[expand title=”click to view log”]

Message ID: 8060
From 85 to 3
Status: live
Passed to node 85 at 2004-11-10 00:00:00 reason: INITIATED
Passed to node 16 at 2004-11-10 09:22:09 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 102 at 2004-11-10 09:56:10 reason: RANKBC-6
Passed to node 41 at 2004-11-10 12:04:45 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-10 14:36:56 reason: RANKBC-8
Passed to node 90 at 2004-11-10 14:40:55 reason: 17-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 95 at 2004-11-10 15:23:10 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 30 at 2004-11-10 15:26:52 reason: RANKBC-6
Passed to node 102 at 2004-11-10 15:40:22 reason: RANKBC-6
Passed to node 41 at 2004-11-10 18:06:57 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 90 at 2004-11-10 18:19:24 reason: RANKBC-8
Passed to node 95 at 2004-11-11 09:47:52 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 30 at 2004-11-11 11:46:14 reason: RANKBC-6
Passed to node 102 at 2004-11-11 12:00:38 reason: RANKBC-6
Passed to node 41 at 2004-11-11 12:55:31 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-11 14:05:32 reason: RANKBC-8
Passed to node 41 at 2004-11-11 14:38:50 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-11 15:11:56 reason: RANKBC-8
Passed to node 41 at 2004-11-11 16:37:36 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-11 17:30:20 reason: RANKBC-8
Passed to node 41 at 2004-11-11 17:56:34 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-11 18:18:04 reason: RANKBC-8
Passed to node 30 at 2004-11-15 09:20:19 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 101 at 2004-11-15 09:46:13 reason: RANKBC-6
Passed to node 41 at 2004-11-15 09:57:18 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 11 at 2004-11-15 12:47:19 reason: RANKBC-8
Passed to node 88 at 2004-11-15 13:41:22 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 38 at 2004-11-15 13:47:35 reason: RANKBC-6
Passed to node 79 at 2004-11-15 14:41:15 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 91 at 2004-11-15 14:52:48 reason: RANKBC-6
Passed to node 101 at 2004-11-15 14:58:03 reason: RANKBC-6
Passed to node 102 at 2004-11-15 14:58:48 reason: RANKBC-6
Passed to node 41 at 2004-11-15 15:26:31 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-15 16:04:25 reason: RANKBC-8
Passed to node 101 at 2004-11-15 16:08:44 reason: RANKBC-6
Passed to node 53 at 2004-11-15 16:14:16 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 13 at 2004-11-15 16:25:03 reason: RANKBC-8
Passed to node 91 at 2004-11-15 16:29:28 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 53 at 2004-11-15 16:30:40 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-15 16:41:37 reason: RANKBC-8
Passed to node 41 at 2004-11-15 16:51:04 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-15 17:31:42 reason: RANKBC-8
Passed to node 41 at 2004-11-15 18:13:21 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-16 09:18:35 reason: RANKBC-8
Passed to node 41 at 2004-11-16 09:24:56 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-16 11:01:28 reason: RANKBC-8
Passed to node 30 at 2004-11-16 11:01:30 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 101 at 2004-11-16 11:58:26 reason: RANKBC-6
Passed to node 102 at 2004-11-16 12:09:36 reason: RANKBC-6
Passed to node 41 at 2004-11-16 12:15:36 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-16 13:35:34 reason: RANKBC-8
Passed to node 102 at 2004-11-16 14:12:41 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 41 at 2004-11-16 14:47:50 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-16 14:49:36 reason: RANKBC-8
Passed to node 41 at 2004-11-16 14:49:53 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-16 15:06:52 reason: RANKBC-8
Passed to node 101 at 2004-11-16 15:13:25 reason: RANKBC-6
Passed to node 53 at 2004-11-16 15:41:12 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 13 at 2004-11-16 15:50:43 reason: RANKBC-8
Passed to node 102 at 2004-11-16 15:55:57 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 53 at 2004-11-16 16:44:56 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 14 at 2004-11-16 16:46:28 reason: RANKBC-8
Passed to node 79 at 2004-11-16 16:52:06 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 61 at 2004-11-16 16:53:08 reason: RANKBC-6
Passed to node 58 at 2004-11-16 17:20:54 reason: RANKBC-6
Passed to node 31 at 2004-11-16 17:21:37 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 90 at 2004-11-16 18:52:52 reason: RANKBC-6
Passed to node 101 at 2004-11-17 12:32:06 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 41 at 2004-11-17 12:37:00 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-17 12:51:31 reason: RANKBC-8
Passed to node 102 at 2004-11-17 12:53:32 reason: RANKBC-6
Passed to node 41 at 2004-11-17 12:59:28 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 17 at 2004-11-17 13:10:13 reason: RANKBC-8
... etc.

[/expand]

In the implementation, the way that the ‘best’ bridging community is found, involves a choice when there are two or more candidates for the best BC for a node pair. This choice is naive because the choice is made by picking the last one in the list.

There seems to be a number of problems when considering whether a given best bridging community for one node is better than another. When considering this choice, the depth of the community in the structure is compared, this is also a naive decision, as the hierarchy levels  in the structure do not relate to the importance of the community, as the depth is not governed by any metric saying how deep a community is, only that it has a parent, and perhaps some children.  However, the size of the communities is still perhaps a good metric to use.

There will always be a bridging community for every node pair, as there is one global parent community, this means the the global community is treated like a local community. This should not be a big problem, as it pushes messages to the top when there is no better community to be found, these messages should soon find a node with a better community than the global one.

A couple of solutions to this are:

  • compare all best communities for a node/message when comparing two nodes
  • store a best-so-far community with the message, and only pass it on to nodes with a better one.
  • stop using any local/global ranking, and consider only community best-ness

Pádraig suggested checking to see whether it is particular destination nodes causing the high hop counts, the graphic below illustrates the average hop lengths for each node, both when the node is the source and when the node is the destination. The data table is also given, which also includes the standard deviation.

Average hop lengths by node where the node is the source and where the node is the destination

Average hop lengths by node where the node is the source and where the node is the destination

Data table:

[expand title=”click to view data”]

node avg_hops_when_src stddev avg_hops_when_dest stddev
1 5.65 3.06 165.62 56.67
2 36.99 48.48 9.68 12.99
3 24.58 35.35 177.15 90.16
4 31.98 40.29 14.3 18.24
5 1 0 1 0
6 36.5 48.97 8.59 6.47
7 30.65 50 6.27 10.11
8 30.64 51.51 17.74 13.27
9 40.73 49.72 49.09 34.73
10 28.68 50.77 25.77 21.38
11 29.25 51.39 5.19 3.42
12 28.47 37.03 13.25 17.61
13 28.65 48.69 9.17 9.79
14 29.51 48.44 5.63 4.58
15 29.37 49.2 25.79 23.91
16 28.68 50.36 9.35 7.35
17 28.81 50.53 8.09 10.76
18 24.65 35.73 82.75 35.73
19 26.43 49.7 12.36 16.71
20 28.99 47.09 14.4 17.78
21 26.54 49.97 6.3 9.82
22 1 0 1 0
23 31.04 46.51 24.57 30.6
24 24.81 44.58 13.21 17.19
25 22.88 40.02 107.42 63.14
26 1 0 1 0
27 19.05 33.76 18.77 12.32
28 14.76 31.24 45.16 26.61
29 37.24 46.37 6.75 9.76
30 29.38 49.53 5.64 3.51
31 30.43 49.31 18.72 10.58
32 33.48 46 12.8 17.24
33 39.87 50.71 4.82 3.3
34 33.17 40.86 5.78 3.49
35 33.63 45.75 5.06 2.24
36 23.25 40.12 15.01 18.54
37 35.37 45.88 14.89 18.24
38 27.52 43.52 9.3 9.72
39 25.62 39.24 46.45 29.91
40 24.88 36.34 5.75 3.27
41 32.17 52.09 9.75 5.69
42 38.6 48.48 111.61 61.11
43 29.83 51.16 6.34 9.82
44 14.51 16.61 76.79 37.2
45 37.45 43.78 29.45 24.56
46 27.56 48.06 6.54 9.87
47 28.09 43.68 22.54 19.48
48 20.84 33.7 13.55 17.73
49 1 0 1 0
50 18.75 34.44 13.67 17.81
51 28.18 50.56 6.25 4.93
52 41.13 49.35 64.24 34.58
53 30.95 47.38 15.41 13.87
54 31.66 46.03 13.83 17.66
55 28.63 44.23 6.44 9.57
56 1 0 1 0
57 31.55 41.85 5.34 3.2
58 27.45 50.7 22.09 22.22
59 27.46 44.1 26.02 27.82
60 10.72 9.25 30.58 26.47
61 30.2 49.97 13.91 18.41
62 33.79 44.06 109.91 57.24
63 30.76 48.56 14.25 16.67
64 20.71 22.38 5.73 3.49
65 20.37 34.53 111.34 63.53
66 37.36 45.57 127.11 65.15
67 38.19 48.7 9.25 13.14
68 25.67 37.47 15.05 18.31
69 29.79 45.81 7.61 4.27
70 21.41 33.17 16.02 18.61
71 1 0 1 0
72 22.07 27.05 24.52 20.85
73 30.64 49.46 72.68 66.96
74 20.09 30.47 93.49 55.14
75 28.4 50.19 6.65 9.8
76 27.27 37.04 14.08 17.77
77 15.32 29.31 31.28 38.38
78 29.72 47.43 6.76 9.76
79 23.72 42.65 10.85 9.49
80 29.97 52.24 31.66 25.39
81 34.55 49.01 22.03 26.42
82 27.24 40.62 16.61 18.63
83 28.57 32.77 99.87 50.57
84 33.09 45.61 15.61 18.58
85 27.16 48.18 8.73 8
86 24.51 44.49 4.67 3.37
87 27 50.14 9.69 13.32
88 30.65 52.13 28.11 44.69
89 21.62 35.71 14.28 17.76
90 27.7 48 10.19 9.92
91 28.15 47.49 8.95 8.27
92 1 0 1 0
93 27.7 37.74 34.67 35.09
94 36.59 47.23 3.75 1.49
95 29.91 49.99 23.63 20.39
96 1 0 1 0
97 37.44 49.69 9.7 6.36
98 18.9 35.19 17.85 19.89
99 31.82 49.52 32.99 20.77
100 23.79 39.27 34.01 17.46
101 29.15 50.93 9.07 8.7
102 34.7 47.82 6.42 5.27
103 37.67 48.72 60.97 33.82

[/expand]

It seems that in some cases, a node might be very hard to get to, and therefore has a high number of hops, however it could also be the case that because the node is not seen very often, then the messages just keep moving around different nodes. The way to see this would be to take measurements of the the hop count (as above) at time intervals. However, it would be best to first fix the algorithm to see if the behaviour continues, then consider further analysis.

Problem Found/Solution

After a bit of bug tracking and debug output scouring, we were able to find two problems. The first, a simulator problem meant that community file (numbered 0 to n) were being loaded in an incorrect order, which meant the internal numbering of communities did not match the file names:
[expand title=”view ordering”]

community.0.dat
community.1.dat
community.10.dat
community.11.dat
community.12.dat
community.13.dat
community.14.dat
community.15.dat
community.16.dat
community.17.dat
community.18.dat
community.19.dat
community.2.dat
community.20.dat
community.21.dat
community.22.dat
community.23.dat
community.24.dat
community.3.dat
community.4.dat
community.5.dat
community.6.dat
community.7.dat
community.8.dat
community.9.dat

[/expand]

The second problem was a little less subtle, it was caused by the behaviour of the isBetter function when considering the level of a community in the hierarchy and the number of members in a community. When an upper level community had a smaller membership than a lower level one, the message tended to be flip-flop’d between communities.  This was temporarily fixed by only considering the size of a community when the level was the same.

Results after this fix (for both BubbleRAP and BubbleH) are shown below:

However!

We have since decided to test what happens when we ignore the depth of a community, and only consider the member count, this might have the effect of flattening the hierarchy somewhat, but it simplifies the algorithm… watch this space for results.