Archive

Archive for the ‘what i’ve been doing’ Category

Location clustering so far.

October 21st, 2010

I have had a chance to look at the code that Derek sent me, namely the CFinder software for finding k-cliques and an implementation of the MDS algorithm. I struggled to get these to make any sense of the data that I have, and started to think about an alternative approach.

Despite the decision we made in a previous meeting to consider location readings globally, regardless of order/time, I decided to use the nature of the data to inform the clustering algorithm; namely, using the time ordering to indicate the relationship between readings. This meant going back to the idea of using readings from individuals to determine the places that are relevant to them. This may of course cause issues when comparing places from different individuals, but no more than when implementing a simulator (or real-world) system where individuals track their own movement patterns.

By taking time ordering into consideration, we can implicitly detect when a node is moving from one place to another, by measuring the distance between the current reading and the centroid of previous readings from a known place.

- take each location reading for a person, and measure the distance
  between it and the centre of the current cluster
  - if the distance from the reading to the centre of the current cluster
    is greater than the mean distance of points in the cluster to the
    centre + alpha, then it is outside the cluster
    - read ahead by X readings (to make sure this is not just an
      erroneous reading)
    - if all X readings are beyond the threshold of the current cluster
      - then we have left the location
        - record the current cluster if it has more than
          N location readings
        - start a new cluster with the current reading
    - else
      - discard the erroneous reading
  - else
    - add the reading to the list of cluster points
    - re-calculate the cluster centre

Here alpha is a tolerance that allows the cluster to grow; it can be fixed (say 50m) or a function of the mean distance to the centre of the cluster. The number of readings to read ahead (X) should probably be fixed at, say, 10, so that if the person goes away and comes back very quickly, they are considered not to have left.

The minimum number of readings required to consider a cluster valid (N) should be relatively large, say 10 or 20, so that only places where the user lingers for some time are considered. Too high a value means we will miss some important places; too low a value means places where users merely pause for a while will be recorded, which may not be important places.

The algorithm still needs improvement so that it also considers the time between readings; for example, rather than considering the number of readings in a cluster, consider the amount of time between the first and most recent reading, and use this to determine whether a place is recorded. This means that we should be able to record important places even if readings were not taken for some reason.
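To pin the steps down, here is a minimal Python sketch of the loop above. It is an illustration under stated assumptions, not the final implementation: readings are taken to be simple (x, y) pairs, distance is flat Euclidean rather than the haversine distance real GPS data would need, and the default values for alpha, X and N are just the ones suggested above.

import math

def centroid(points):
    # mean position of the points currently in the cluster
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist(a, b):
    # flat Euclidean distance; real lat/lon readings would need haversine
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mean_dist(points, c):
    return sum(dist(p, c) for p in points) / len(points)

def find_places(readings, alpha=50.0, X=10, N=20):
    # readings: one person's (x, y) positions, in time order
    places, cluster = [], [readings[0]]
    i = 1
    while i < len(readings):
        c = centroid(cluster)
        threshold = mean_dist(cluster, c) + alpha
        if dist(readings[i], c) > threshold:
            # read ahead by X to make sure this is not an erroneous reading
            if all(dist(r, c) > threshold for r in readings[i:i + X]):
                # we have left the location; keep the cluster if big enough
                # (per the note above, this count could become a duration check)
                if len(cluster) > N:
                    places.append(c)
                cluster = [readings[i]]
            # else: discard the erroneous reading and carry on
        else:
            cluster.append(readings[i])
        i += 1
    if len(cluster) > N:
        places.append(centroid(cluster))
    return places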

Data Analysis – Co-location using WiFi access points

October 19th, 2010

In the last post, I described a method for deciding on co-location based on the spotting of common WiFi access points, which co-located users if they saw the same access point at roughly the same time. We refined this method to also discriminate by approximate distance, based on the signal reading (which could be signal strength or signal quality). Pseudo code for this algorithm:

  • For each consecutive timestep window between start and end:
    • Retrieve all readings within that window for all users.
    • For each user:
      • pick the strongest signal from the list of readings for each access point (as there may be multiple)
      • rank the readings in order of signal strength
      • discard the readings for access points where:
        • signal strength divided by strongest signal strength is less than alpha
      • For each other user:
        • perform the same signal processing as above
        • if the intersection of the two users' remaining lists of access points is not empty, consider the two to be co-located

The variables in this algorithm are the start and end times, the size of the window, and the alpha value.
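As a rough illustration, a minimal Python sketch of this windowed intersection test is below. The tuple format for the readings is my assumption, and it treats larger numbers as stronger signals; as the next paragraph notes, the N95's reported values actually increase with distance, so the comparison may need inverting in practice.

from collections import defaultdict
from itertools import combinations

def strong_aps(readings, alpha):
    # readings: (access_point, signal_strength) pairs for one user in one window
    best = {}
    for ap, strength in readings:
        # keep the strongest signal per access point, as there may be multiple
        best[ap] = max(best.get(ap, strength), strength)
    strongest = max(best.values())
    # discard APs where strength / strongest signal strength < alpha
    return {ap for ap, s in best.items() if s / strongest >= alpha}

def colocations(rows, start, end, window, alpha):
    # rows: (user, timestamp, access_point, signal_strength) tuples
    t = start
    while t < end:
        per_user = defaultdict(list)
        for user, ts, ap, strength in rows:
            if t <= ts < t + window:
                per_user[user].append((ap, strength))
        aps = {u: strong_aps(r, alpha) for u, r in per_user.items()}
        for u, v in combinations(sorted(aps), 2):
            if aps[u] & aps[v]:
                yield (t, u, v)  # u and v are co-located in this window
        t += window

This naive version rescans every row for every window; against the real table it would make more sense to query per window, or to precompute the ratios as described below.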

There were a few factors to consider before we could implement this. Firstly, it was important to find out what the values for signal strength actually meant; the data ranged between values of 11 and 110. I used an N95 phone, with the Campaignr software installed, to record a number of readings at varying distances from known access points, instructing the software to upload to a script that simply recorded the data in a text file. This showed that the ‘signal_strength’ value increases as the phone moves away from the access point. Another factor to consider was the range of values reported in the database. Davide suggested plotting the values of all signal strengths for all readings against the readings that would be selected using the alpha value. The figure below shows the data for alpha values of 0 (all readings), 0.8, and 1.0 (only the strongest readings).

The second figure shows all readings, split into individuals.

The next step is to decide what alpha value to use, and to store the co-locations in the database. To this end, I have written a script to parse all 3.6 million records, calculate the alpha value (strongest signal strength over signal strength) and write the value back into the database. This will allow us to quickly see the effect of using different alpha values on the decision making, without having to repeat the calculations every time. It will also allow us to plot the effect of different alpha values against each other very quickly.
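A sketch of how that write-back might look in Python is below. The table and column names, the PyMySQL driver, and the grouping of "strongest" per user per time window are all assumptions for illustration; the actual script and schema differ.

import pymysql  # assumed driver; table/column names below are hypothetical
from collections import defaultdict

WINDOW = 60  # seconds; assumed window size

conn = pymysql.connect(host="localhost", user="me", password="...", database="research")

with conn.cursor() as cur:
    cur.execute("SELECT id, user_id, unixtime, signal_strength FROM wifi_readings")
    rows = cur.fetchall()

# group readings by (user, window) to find the strongest signal in each group
groups = defaultdict(list)
for rid, user, ts, strength in rows:
    groups[(user, ts // WINDOW)].append((rid, strength))

with conn.cursor() as cur:
    for members in groups.values():
        strongest = max(s for _, s in members)  # may need min(), per the N95 finding
        for rid, s in members:
            # store strongest signal strength over signal strength, as above
            cur.execute("UPDATE wifi_readings SET alpha = %s WHERE id = %s",
                        (strongest / s, rid))
conn.commit()

With the ratio stored per row, trying a different alpha becomes a single WHERE clause rather than a recomputation.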

About Datasets

September 14th, 2010

I have spent a couple of days looking at the quality of the datasets I have parsed (Cabspotting and Geolife) and have had a couple of problems. When I imported the data, I made the decision to use the MySQL datetime format for timestamps. This resulted in very poor performance when querying the database (maybe). So, I decided to convert the timestamps to integers representing Unix time.

ALTER TABLE `entity_locations` ADD `unixtime` INT NULL;

Query OK, 25075904 rows affected (7 min 40.11 sec)

The conversion was a simple case of creating a new column, then updating the new column with converted timestamps:

UPDATE entity_locations SET unixtime = UNIX_TIMESTAMP(timestamp);

Query OK, 25075904 rows affected (11 min 13.38 sec)

Then I added an index (I should have added this when adding the extra column, but it may not have made much difference):

ALTER TABLE `entity_locations` ADD INDEX ( `unixtime` );

Query OK, 25075904 rows affected (15 min 21.79 sec)

All seemed well and fine; however, I had noticed while testing a sample dataset that the MySQL datetime field equated to 23 seconds earlier than the new unixtime field (as parsed by PHP). The amended dataset maintained this difference, meaning it was not a compounding error. This will not be a major problem unless comparing to another dataset at a small granularity.
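For reference, a small Python sketch of the kind of spot check that exposes the offset; the assumption that the datetime strings represent UTC is mine, and the sample row is a placeholder standing in for whatever the query returns.

from datetime import datetime, timezone

def offset_seconds(datetime_str, unixtime):
    # parse the MySQL datetime string, assuming it represents UTC
    dt = datetime.strptime(datetime_str, "%Y-%m-%d %H:%M:%S")
    return unixtime - int(dt.replace(tzinfo=timezone.utc).timestamp())

# rows: (timestamp, unixtime) pairs sampled from entity_locations
rows = [("2008-05-17 12:00:00", 1211025623)]  # placeholder sample row
print({offset_seconds(d, u) for d, u in rows})  # a constant set like {23} means a fixed offset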

One problem I did notice in the Cabspotting dataset, however, was that the timestamp column had ‘on update current timestamp’ set, meaning that when the new unixtime column was updated, the timestamp column was given a new timestamp value. I rectified this by removing the trigger and setting the timestamp based on the unixtime (not accounting for the 23 seconds).

Another issue I noticed with some of the data was that, for some of the Geolife data, the timestamp was set to the time of insert; there were also some records which were zeroed, and some which were in the future. This may mean the need for a full re-import. Luckily I have kept all of the data and the parsing scripts.

I still need to do some thorough analysis of the datasets in other ways, as mentioned before. I also want to convert the social-sensing-type data into the same format as these two datasets. There is also the CenceMe dataset to consider; it may not be suitable for parsing, as the location data could be very sparse. This of course might be useful for comparison with denser datasets.

Geolife dataset

Initial analysis of the datasets to find the concentration of readings shows that a handful of users collected data from around April/May 2008; then, from October 2008, another set of users contributed data in earnest, with fewer of the original users contributing. Readings tail off from April 2009, with very few users contributing after the beginning of August 2009.

Cabspotting

A good amount of data is recorded for most users between 17 May 2008 and 09 Jun 2008, with only a small number of exceptions.

Weekly Update 30 Apr 2010

April 30th, 2010

Last week I set out the following plan for the week:

For the next week I plan to nail down how I can implement vector clocks for location and proximity, and also think about what metrics we can derive from this. At the same time I plan to read through Knox’s thesis, read Simon’s paper, and try to formalise my ideas about location.

Since then, I have read Simon’s paper, which provided a nice discussion of what needs to be thought about when we deal with location. I have made progress with the simulator software. I also wrote code to process all of the data I collected from N95s, so that each reading in the database has a location associated with it. The simulator will now be able to use this to drive the location aspects of nodes.

I have had some thoughts about planning a chapter on deriving location from sensor readings (e.g. WiFi positioning, GPS readings, cell tower positioning and triangulation): how can we measure the accuracy of our position calculations, how accurate does the position need to be, and how should we represent location?

I also started to read up on graph theory, starting with Barabasi’s book ‘Linked’, which gives a really interesting overview of graph theory and how to measure aspects of complex networks. This made me realise that I will also need a chapter on this subject, so I plan to write a review of graph theory based roughly on this book.

I have also been giving more thought to vector clocks and how to use them, but I have not implemented anything yet.
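Although nothing is implemented yet, the core mechanics are small. Below is a minimal Python sketch of a per-node vector clock driven by proximity events; treating an encounter as the event that ticks and merges both clocks is my assumption about how the proximity-based version would behave.

class VectorClock:
    # one clock per node; entries count events seen per node id
    def __init__(self, owner):
        self.owner = owner
        self.clock = {owner: 0}

    def tick(self):
        # a local event: increment our own entry
        self.clock[self.owner] += 1

    def merge(self, other):
        # element-wise max of the two clocks
        for node, count in other.clock.items():
            self.clock[node] = max(self.clock.get(node, 0), count)

def encounter(a, b):
    # a proximity event: both nodes tick, then exchange and merge
    a.tick()
    b.tick()
    a.merge(b)
    b.merge(a)

Comparing two clocks entry-by-entry then gives the usual happened-before partial order, which seems like the natural place for metrics to come from.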

My plan for next week is:

  • to get a better understanding of Graph Theory by finishing the book.
  • Implement vector clock code in the simulator, based on proximity, and work towards a location based one.
  • Read Knox’s thesis.
  • Formalise my thoughts on how to use Vector Clocks of proximity and location, and what metrics can be derived from them
  • Generate a rough outline of chapters for my thesis, and identify the main areas for the background section

Supervisor PhD Meeting 14 Apr 2010

April 15th, 2010

Had a meeting with Paddy to pitch my ideas.

This was the basis of my Pitch_to_Paddy

Mobility is NOT location… see Simon’s paper.

Discussed my ‘Hypothesis’

  • Human mobility patterns are predictable
  • Human proximity patterns are predictable
  • Knowledge of proximity and location makes opportunistic routing more efficient than proximity alone.
  • Proximity and Mobility can be used independently to achieve the same efficiency in opportunistic networking.
  • Mobility or Proximity can be discounted when planning opportunistic networking algorithms.
  • There are low complexity algorithms based on vector clocks that can be used for routing
  • Any given node will only need to communicate with other nodes that they know (friends), or want to know (friends of friends).
    • Paddy suggested this might be a bit like a DNS tree, which hands off knowledge
    • Also, experiments need to establish that a friend’s or a friend-of-a-friend’s knowledge is enough to route
    • Might establish that 3 degrees is more than enough
    • tradeoff = you can get better efficiency if you give more coverage of the network

Local Metrics

Using vector clocks –

Range – how many hops away – build knowledge about the network using the information in the vector clocks – how do you do that? This is a PhD part.

How do we determine the important nodes? – the Aaron Quigley effect.

Nodes in your ball of radius = sphere of interest = friends, +1 hop = friends of friends, +1 = everyone else.

Dig out reviews on DTNs – especially patterns – but Paddy thinks that the notions of location and proximity have never been used, only the patterning structures, e.g. highly important update nodes. So I need to look at ways of discovering special nodes. How do they determine that? The location thing seems to be different – find a review.

Mobility as opposed to location – gives your prediction element.

Limit the range of forward projection.

Datasets

Email GW to see if he can get a public dataset. – sign over under NDA?

Email Barry Smith to see if he knows of any datasets we can use. – Vodaphone dataset?

Email Aaron Quigley – he will know if there are any publicly available – his master’s student has access to a corporate travel dataset.

Also – see Simon’s paper about location.

Look up Intel Placelab dataset

Email Knox to see what datasets he might have.

Progression

Paddy not sure where the Vector clocks fit in

Is it a novel implementation mechanism – i.e. am I going to use vector clocks in this thing?

I want to make a prototype – Paddy likes the idea – there are some hard questions and some novelty in there. This part is about understanding how to frame a solid hypothesis: reading reviews, building the experimental structure, breaking out a few bits – vector clocks, heuristics about making decisions, how it all fits together.

Ideas about locations

Not fully formed so need to think a bit more about it

Future

We can turn the VC thing on its head, and make it useful for proximity and location.

I want to build a prototype.

Need to be careful not to spend too much time comparing to existing things if they are not really related.

The important thing is: does it matter where you are when I pass you a message – as proximity and mobility are the same – do I pass it to you because I know you see that person, or because I know you will be in the same location as that person?

Nodes can predict their own positions and share this information with other nodes – Paddy suggested sharing based on the type of relationship – e.g. friends get detailed info, FOAFs get obfuscated location, others get much broader information.

Requests?

Does a node do calculations about other nodes, or does it ask the other node – can you get this to this person?

A little bit like agent systems?

You might have different approaches depending on who you are dealing with – e.g. a message to a friend goes through friends, a message to a FOAF goes otherwise, and for everyone else – can you get it to somewhere nearer than me – or something…

Then we can say we of course can encrypt this information.

Plan

Paddy felt that vector clocks used to encode location and mobility (e.g. double vector clocks) are a solid piece of work, and if it gets into a good conference, then it is my PhD.

It will need an order-of-computation section with a complexity analysis, i.e. is it N^N, N log(N), N^2, N^3 – don’t go much beyond that – and I need to analyse the algorithms at the end. What the complexity of vector clocks is, is well-travelled ground.

I want to nail down what the metrics in the system are, then implement CAR using these metrics, as well as coming up with my own algorithm.

Need to convince Paddy which algorithms/datasets to compare with; there needs to be a good rationale behind it.

Need to refine the contributions bit – but this will come with time

The hypothesis section is good – but I must refine it and remove the negative ones – keep the positive ones, and prove one thing true or false – pick the one we think is most likely.

Add another hypothesis that low-complexity algorithms based on vector clocks (30:24) can be used.

Don’t go down rabbit holes!!! Give Paddy weekly updates – every Friday – and nag Paddy; if he has not responded by 10am Monday, Paddy will comment back.

Tighten up – look at Knox’s hypothesis section – and write a half-page hypothesis introduction and a one/two-line hypothesis at the end.

Aside:

Can we use Barabasi’s approach to generate a new dataset – almost reverse engineer their predictions, and try to get a dataset based on them?

e.g. generate a random graph of streets for Dublin, randomly place nodes in a simulator, and start to make the locations follow the predictions.

Dig out reviews on DTNs.

Presentation 10th Feb 2010

February 11th, 2010

Gave a presentation to Paddy, Davide, Neil Cowzer and Fergal Reid (Clique) about my quick-and-dirty analysis of the dataset that I have collected already.

Slides

The general consensus was that there were not really enough users, so there were some suggestions about other datasets that might be found: persuade a mobile phone company to give data about user movements, or mine Flickr/Twitter for geo-tagged photos/tweets and try to determine groups of people based on similar locations.

It was also suggested that the GMA is good for visualising data but not greatly interesting; PH is interesting, as is SPD. BD is something that would be useful as an application to gather data, but would need a very large engineering effort.

Paddy suggested that if we could make the data collection process very easy, then we could throw it out to the student population to start collecting data. Fergal said that in J2ME it would be very difficult, but by sticking to C++ it might work (for Nokia phones).

We also talked about getting ground truth for the data. Fergal suggested collecting accelerometer data too (so if someone asks how we verified a GPS trace, we can say that we correlated it with the accelerometer data). I also suggested tagging locations.

Determined the following actions:

  • Look for access to datasets with good location – 1 week
    • WaveLAN Dataset
    • HeaNET – chase paddy – Eduroam
    • Mine location data from Flickr
  • Look at applying analysis to these datasets – specifically
    • Periodicity Hunting
    • Spatial Dependence on the Degree
  • See if we can construct an overlay over these networks
    • e.g. drop nodes
      • Popular locations
      • popular people
      • Other?
      • Vector clocks might be the way to do it
  • Read up about Vector Clocks as suggested in the paper by Klineberg, Watts and ???? at  KDDOA
  • Speak to Graham about whether I can easily integrate this data into his code; if so, do it, otherwise think about implementing it separately (robustly!)

I also planned to meet Paddy again next week to go over these things and try to hammer out a better plan, then meet with this group again in three weeks to show where I have got to.

Davide also talked about churn in proximity patterns – it might be worth thinking about what this means (the example was when a person regularly sees other people, and after a while one of those people drops off the radar – what does this mean?).

Paddy said that in his mind, the long goal is to be able to forward plan using the knowledge of data that has passed (prediction).

Meeting with Davide 14 Dec 2009

December 14th, 2009

Had a meeting with Davide to discuss current research.

We talked about what I had discussed with Paddy; Davide seemed to think this is effectively the same as the E-DTN position paper, and that we should pursue these ideas further.

We decided that, in lieu of getting access to larger datasets, we should look at the ones we have already. I suggested that we use the Tom dataset, collected over a week or so when Tom used the N95 I programmed; as a start this might help us to see what the data looks like. He suggested the following tasks:

  • Contact Eiko Yoneki to see if she has some datasets that we could use
    • UPDATE: She didn’t have any datasets and she is eager to get some herself…
  • Look at Graham’s simulator, understand it, and extend it so that location data can be incorporated
  • Generate two graphs from the Tom data (which he called the TomStalker dataset) – a rough plotting sketch is below:
    • Latitude vs Longitude vs Number of devices seen (3D)
    • Frequency of device spotting vs time
  • Look into taking a statistics course at UCD, to learn the techniques of statistical analysis
  • Include him in correspondence about this
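For the two graphs, a rough matplotlib sketch under placeholder data; the variable names and values are mine, standing in for what would be pulled from the database.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

# placeholder data standing in for the TomStalker readings
lats = [53.34, 53.35, 53.36]
lons = [-6.26, -6.25, -6.27]
devices_seen = [2, 5, 1]   # devices seen at each reading position
times = [0, 600, 1200]     # reading timestamps (seconds)

fig = plt.figure()
ax = fig.add_subplot(121, projection="3d")
ax.scatter(lons, lats, devices_seen)  # latitude vs longitude vs devices seen
ax.set_xlabel("longitude"); ax.set_ylabel("latitude"); ax.set_zlabel("devices seen")

ax2 = fig.add_subplot(122)
ax2.plot(times, devices_seen)         # frequency of device spotting vs time
ax2.set_xlabel("time (s)"); ax2.set_ylabel("devices spotted")
plt.show()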

Meeting with Prag 9 Dec 2009

December 10th, 2009

Had a brief meeting with Prag Sharma, who described to me the sort of things that the Clique group are doing.

I explained to him that I was interested in ways of analysing social networks in terms of movement patterns. He mentioned a few datasets that Clique has some access to: Conrad and Fergal have access to handset data from 6 million nodes of a telephone network (IDIRO), but he did not know what the data includes. Another dataset was the NORON data, which is to do with financial fraud and includes banking transaction data.

He suggested that a good person to talk to was Derek Green, who he thought was doing  similar work.

We saw Derek in his cube, and it seems he is looking more at social clustering, but we thought there might be some interesting overlap, so I will send him my position paper, and he will send me his recent presentation.

Prag suggested I check out the clique website.

Discussion with Graham W

October 26th, 2009

Had a discussion with GW about other things to test within the datasets: for example, what use is it to simulate message passing between users who do not know each other, and if we had some knowledge about this data, could we find a more efficient way to route messages? This led us to talk about:

  • how many nodes can you get to within x hops in a network? (a quick sketch is below)
    • how many paths between friends need to use strangers, and vice versa – this can be used to define privacy rules too 🙂
    • see the Miklas paper about defining friends and strangers
    • the point being that you may never need to send a message to anybody else
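A quick Python sketch of the x-hops question, assuming the contact network is available as a plain adjacency dict; it is just breadth-first search cut off at x hops.

from collections import deque

def nodes_within(graph, source, x):
    # graph: dict mapping node -> iterable of neighbours
    # counts the nodes reachable from source in at most x hops
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == x:
            continue
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return len(seen) - 1  # exclude the source itself

# e.g. nodes_within({"a": ["b"], "b": ["a", "c"], "c": ["b"]}, "a", 2) -> 2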

Graham showed me how to run the simulator he wrote, and we talked about writing a paper together for an upcoming conference – but decided the deadline was too soon, and that we didn’t have any new results to put into it.

ODCSSS 2009

June 16th, 2009

I have taken on mentorship of an ODCSSS project, which we have dubbed CitySense. My student, John Paul Meaney, is currently working on plugging movement models into TOSSIM and TinyOS, and also on implementing simple DTN protocols. We have chosen TinyOS and TOSSIM so that we can easily test this in the real world, to compare real data with simulation data.