Archive

Author Archive

Community Clustering part 2

December 8th, 2010 2 comments

I calculated the communities as described  in data-pipelines and from this generated two lists of nodes for every week in October, and for all weeks.  I took the largest list for each, and used that as the community for which to test the LBR algorithm.

updated-plot-comm

Updated plot, showing the delivery ratio for LBR-Community, one (dotted) line for the communities found in each week, and overall.

It seems that, as expected, routing within the community gives an advantage to the algorithm, and in fact, when the community knowledge is derived from the full dataset, then LBR performs better that the average of the rotated dataset. This plot is slightly misleading however, as the data for the previous algorithms is the rotated averages, whereas the data for LBR-Community is for straight runs. Below is the plot against straight runs of each other algorithm.

updated-plot-comm-straight

This plot shows straight runs of all algorithms, and the updated LBR-Community runs.

In this plot, it is obvious that the difference is less profound as with the average plots, however, I am considering running the rotation scheme again for LBR-Community.

LBR Location Analysis – MIT Dataset (Oct)

December 8th, 2010 8 comments

I wanted to see what the real picture was when using cell towers for location, so for a run of LBR on the MIT Dataset for October, I collected details about every message hop. I compared the timestamps for the hops to the cell tower information in the database, to find out which cell tower each node involved in the message hop was recorded as using.

Colocated: (same cell) 2247

Not-Colocated: (different cells) 15688

One sided: (only one sees cell) 13452

No data: (no cells seen) 3211

Count: (total number of hops evaluated) 31387

These results are slightly concerning, as roughly half of the hops between nodes happened when nodes reported different locations (cell towers ID),  10% of hops did not record any cell tower information at all, and in 42% of hops, only one node reported a location. This seems to demonstrate that we cannot rely on cell tower information to give us an accurate idea of the co-location of people. However, it might still prove useful to use the cell tower location for generating general mobility statistics, which could be used as routing metrics. Davide mentioned he had some ideas about these sort of statistics.

Categories: Uncategorized

Community Clustering part 1

December 1st, 2010 2 comments

I have visualised the graph of connections in the MIT Dataset, for the month of October split into 5 parts: Weeks 1 to 4, and the whole month. Each week represents a new set of edges, built after the end of the previous week. So each graph shows the edges that are formed only in that week.

mit-graph

The list of edges and matrix data is in the zip file here: mit-graph.

Categories: Uncategorized

Seminar/Presentation 29 Nov 2010

November 29th, 2010 3 comments

Presentation summary:

Routing in Human Interaction Networks

The goal of this seminar is to get some ideas from the participants, about the feasibility, and possible directions for this research.

We believe that by detecting the underlying patterns of human interaction involving movement and contacts between individuals, we can design a delay tolerant mechanism for passing messages between actors in the system (for example using mobile phones) that does not rely on fixed infrastructure. A system such as this should exhibit low cost (hop count, update overhead, delivery latency) and high efficiency (delivery ratio).

We believe that we can drive a routing policy by calculating and sharing metrics and meta-data at node level; from the contacts they make and the places they visit, without the need for a centralised system. In particular we are interested in using location to drive the messaging scheme.

We will show recent, related results of Persistence Based Routing vs existing schemes, some initial results from our simple schemes, and will ask for suggestions regarding the direction of our research into Location Based Routing.

Present: Pádraig Cunningham, Davide Cellai, Derek Green, Conrad Lee, Fergal Reid, Neil Hurley

Presented some background, and ideas about routing in human interaction networks (Slides – 1.2MB), someone noted tha Delivery Ratio and Latency are directly linked, with the suggested that the requirement for high delivery ratio and low latency, may not be intuitive in some situations. e.g. when someone who will cause a high latency, may be the only one to get the message across.

The presentation seemed to go well, however some parts I might not have delivered properly, meaning that I seemed to be skimming over a few bits.

Some suggestions were made:

  • Consider the use of some sort of altered vector clocks, which keep a history, and can be shared with other nodes
  • Partition the graph in the Reality Mining data, to identify the communities, then test the algorithms for only those communities
  • Strong consensus to start using periodic patterns for predicting routes
  • Neil suggested that I try to generate a dataset that does work well for LBR
  • Ferhgal suggested what we had talked about before: finding out at what places, messages get sent

Reference to chase up:

  1. PNAS 2005/2006 – LIBEN-NOWELL – ?GeoRouting in social networks?
  2. SneakerNet
  3. Talk to Martin Harrigan – who is using Vector Clocks in some clever way

There may have been other things I don’t remember – feel free to comment with additions!

Rotating the data

November 25th, 2010 2 comments

Met with Pádraig and Davide briefly to discuss the poor performance of the LBR algorithm. Pádraig had previously suggested that we run multiple simulations over different time slots, and Davide and I had talked about and planned to do split the runs up as follows:

timerotation

Measuring the delivery ratio, and cost for PBR, LBR, Prophet, Random 1.0, Random 0.2,  Random 0.0 and Flooding, will give us a good picture about the nature of the data, and we will be able to see if the previous plots are representative of the data.

I will also try to adapt the ranking mechanism, to include only the top (most seen) 10% or 20% of cell towers, to see if that makes any difference to the results.

Update:
Using the method above, I ran the simulations for LBR, PBR, Unlimited Flood, Random 1.0,0.0,0.2 and Prophet. Below is the average delivery ratio over 18 runs, for each  protocol in 28 day time windows, at 7 day intervals,  between  27 Aug 2004 and 21 Jan 2005.

Average delivery ratio over 18 runs for each protocol, between Aug 2004 and Jan 2005

There does seem however, to be a few issues with the number of lines returned in the logs, this suggests that somehow, I have mis-calculated, probably just for the end  time period used, the start and end time period, but this seems to only account for the last few hours of each run (out of the month). So until I find the error, I think these results are roughly correct.

LBR, PBR, Flooding and Random comparison

November 21st, 2010 2 comments

This graph shows the delivery ratios from the start of the simulation for MIT-OCT dataset. It would appear that in fact, LBR performs poorly against the random algorithm, where the P is the probability of passing on the message (1 copy) to the met node. Interestingly though, when P = 1, the delivery ratio increases. The graphs below show the delivery cost for the different algorithms, spread across two plots, so that the very high costs of Random and Prophet do not dwarf the lower costs of LBR and PBR. These three plots together show that whilst the delivery ratio of  prophet and random are high, the cost is very high.

Delivery costs for three routing algorithms on the MIT-OCT dataset, spread over two plots for clarity.

Perhaps a combined metric for comparison is  cost over delivery ratio, this would perhaps give a better idea of general quality of the algorithm.


Categories: Uncategorized

Location Based Ranking, effects on routing

November 17th, 2010 2 comments

Using the simple formula for calculating the ranking of users, as follows:

Popularity of user [latex]i = \sum_j \tau_{ij} ( p_j )[/latex]
where [latex]\tau_{ij}[/latex] is the time (or number of times) user [latex]i[/latex] has visited the tower [latex]j[/latex], [latex]p_j[/latex] is the popularity of tower [latex]j[/latex].

Create a list of towers, with an associated number of times the tower is reported in the dataset, call this the set [latex]p[/latex],

SELECT count(celltower_oid) as count, celltower_oid FROM cellspan GROUP by celltower_oid ORDER BY count DESC

then for each user, get a list of the cell towers, with an associated number of reading for that user, and call this set  [latex]\tau[/latex]

SELECT count(celltower_oid) as count, celltower_oid FROM cellspan WHERE person_oid = {USER_ID} GROUP by celltower_oid ORDER BY count DESC

then, for each member of  [latex]\tau[/latex], multiply the number of time the tower is seen in  [latex]\tau[/latex]  by the number of time the tower is seen in total from  [latex]p[/latex].

This yields a number for each user, which can be used when initialising nodes in ContactSim (aka LocationSim). I added new properties into the the configuration files, which reflected the ranking of each node, and created new classes in LocationSim, which reflected a new protocol called LBR. When a node is instantiated, is looks up its ranking metric with an oracle class (which reads the data from the configuration file). During the operation of the simulator, when a node get’s an onConnect() event call, it checks it messages to see:

  1. if it has any message for the node it is connected to, it passes them on
  2. if the node it is connected to has a higher ranking, then it passes all messages on

The initial results show a good delivery ratio compared to PBR

Delivery COST of messages for LBR and PBR using 7 day periodicity, where messaging starts at the first timestep (PBR 7 day Period Start, LBR Start), or after 1 week.

Delivery Ratio of messages for LBR and PBR using 7 day periodicity, where messaging starts at the first timestep (PBR 7 day Period Start, LBR Start), or after 1 week.

This suggest that simply knowing the rank of the users, based on location, significantly improves routing. However, I need to check the algorithm is working before I can say the results are accurate:

  • Does the node remove it from it’s buffer when it passes the message on? Yes it does
  • Is the node correctly recording deliveries?

Currently working on: Graph of all connected users. Taking a long time……

Meeting 12 Nov 2010

November 12th, 2010 2 comments

We discussed progress since last week, and I showed the plots from the MIT-Reality dataset, and decided that it probably is OK for determining whether nodes are co-located or not, and give some extra information about movements. I said that I would generate a graph of people connected during the same time period to the same cell tower.

Pádraig suggested a simple policy for routing, based on ranking of locations and consequently people that visit them, where a persons rank is based on the rank of the towers (places) he visits, and how many different towers he visits, making a person that moves around to different, highly ranked towers, more important for routing that someone that only visits low ranked towers, or perhaps even someone who only visits one highly ranked tower.

Davide said that he might have made some notes about this in the past and would look them up, and Pádraig said that it could be calculated within some probablistic framework. I noted, that this sort of scheme would really be routing towards the center of the ranking network, which Pádraig suggested was because it was routing based on the centrality of the network. But we decided that it is worth using this to at least test this idea out against a random routing protocol.

With regards the actual implementation, Pádraig explained his idea that we could do the ranking (at least to start with) before the simulation starts, and simple gives nodes fixed ranked positions, meaning that we don’t actually have to change any of the simulator code, the rank is just a property of the node itself. So, a routing policy is as follows:

on meeting another node:

foreach message in buffer

   if node is destination node
       pass message
   else if node has a higher rank and visits a 
           location that the destination node does
       pass message
   else
       keep message

The idea being the the protocol uses the existing contact events to drive communications, but uses the fixed location ranking to form the routing policy.

Pádraig also said that we really need to have a conference target in mind, and we couldn’t really think of a suitable one, Davide suggested Ubicomp, but it has a low acceptance rate.

So, for next time, I will have generated a graph of connected individuals based on connecting to cell towers during the same time period, made sure I can generate the same results as Graham, generated  a ranking for individuals based on cell towers, developed a routing scheme based on this, and hopefully identified conference targets.

UPDATE:

Davide suggested the following calculation for determining the popularity of users, which can be used as a metric for routing.

Popularity of user [latex]i = \sum_j \tau_{ij} ( p_j + c )[/latex]
where [latex]\tau_{ij}[/latex] is the time (or number of times) user [latex]i[/latex] has visited the tower [latex]j[/latex], [latex]p_j[/latex] is the popularity of tower [latex]j[/latex] and [latex]c[/latex] is a parameter which tunes the importance of user mobility

Pádraig suggested that [latex]c[/latex] should be 0.

MIT-Reality Mining Dataset – location analyis

November 12th, 2010 2 comments

I have spent some time looking at the reality mining dataset, to see how useful it is for inferring location about individuals. Firstly I generated a matrix, People over visited cell towers, which was huge, due to the large number of cell towers in the dataset (32, 000). Then I used this data to generate a graph  of people in the dataset, where an edge between two people, if they have ever visited the same cell tower, independantly of whether they were there at the same time. This produced the plot below:

This graph shows people connected based on visiting the same cell towers, but not necessarily at the same time. (full size)

The two plots show the same data, but with well connected node 22 removed on the right. This data implies that most people are using the same cell towers at some points in the dataset, which is promising for using cell tower a  location metric. However, this does not show any information about whether nodes were connected at the same time.

To further clarify the data, I processed the information provided for cell towers, which is stored in two columns in the dataset – oid and name, for example:

OID Name
35 40342, T-Mobile
36 40772, T-Mobile
37
38 0,
39 883, AT&T Wirel
40 2353, AT&T Wirel

I seperated out the cell ids and carrier names  into two columns, which enabled me to get the distribution of carrier names as below:

This shows the number of cells for each known carrier reported in the dataset (30 Others refers to 30 distinct reported carrier names, who’s sum totat was 429 reported towers) (full size)

This suggests that many of the towers are provided by different carriers, so this suggests that, we might have distinct clusters of users, that for some reason are connected by a few – perhaps they ‘roamed’ onto a new network, or changed provider at some point. This needs further clarification, and may only be able to be measured using ‘brute force’, by constructing the graph of people connected based on c0-locations of cell towers.

So far, the analysis is inconclusive, which means it will need a little more investigation. However, my feeling is that we WILL have enough overlap in the dataset to make it feasible to infer location from cell towers records.

Categories: Uncategorized

Meeting with Pádraig and Davide

November 5th, 2010 2 comments

Met initially over lunch then in the office. Pádraig had made the following observations:

I think the “Routing with Location” post is very useful for making concrete how location information might help routing. This is the key principle:
“the more locations it visits in common with the destination node, the more likely it will be to pass the message to the destination.”
To be devil’s advocate, the worry would be that the useful information entailed in location is already compiled into the existing contact information. If that is true then the uplift due to the inclusion of location criteria in the routing will not be significant.

Davide and I had been discussing this earlier in the day, and I explained my idea of measuring the importance of a location in terms of how ‘active’ it was; i.e. how many messages exchanges happen there. In this way, it is distinct from contact knowledge alone.

Davide suggested a similar mechanism, which was to measure the popularity of locations, based on, for example the number of different people that visit there. I misssed slightly the meaning when he said an instantaneous value. We talked about concept of maintaining a global table, which tracks the current popularity of locations, and routing to nodes that visit these locations most often.

We discussed the possibility of using the cell tower information contained within the MIT reality mining dataset, and suggested mechanisms for implying location from this. The first, assuming that the nodes are connected to the same network operator, simply being when two nodes are at the same cell tower T1 and T1, they are co-located, else they are not (e.g. they see T1 and T2 respectively)

Another situation to test for however, is when two nodes can ‘see’ each other, but they do not report the same cell tower (T1 ad T2). I suggested a hybrid location, e.g. T12 which is a location where two people have been seen to be connected, but were not connected to the same tower. Initially, we decided to analyse the data to see what it looks like. Pádraig suggested using a simple matrix – People over cell tower IDs with a count of how many times they occur.

One thing we didn’t factor for however, is whether a node can see multiple cell towers at one…. this will become clearer when I have a proper look at the dataset.

Incidentally, this got me thinking about a generic way to represent locations, as 1d, 2d, 3d etc. The MIT dataset cell tower locations, providing all nodes are using the same co-ordindate scheme (i.e. network operator), can be described as a 1d co-ordindate system. The ‘distance’ between points, whilst not meaningful for physical location, could be based on some metric of common movements between locations.

We also talked about ContactSim, and Pádriag agrees that we should probably use this if it is suitable. He asked me to ask Graham for a copy of his chapter that deals with his PBR protocol (Chapter 7) for him to read. We decided that I should try to replicate Graham’s results, to ensure I had a proper handle on what the simulator does.

Pádraid would also like me and Davide(?) to hold a seminar for few people, in which we at least show the results from Grahams work, and discuss the ideas we have about routing with location, I suggested about a year from now, Pádraig was thinking about 2 weeks from now, so in next week’s meeting we will fix the details for the following week.

For next time, I will have replicated Graham’s results, analysed the MIT dataset for applicability for our needs, will have started planning for the seminar, and will have started to adapt Graham’s simulator to use location information.

Categories: Uncategorized