Is the President a friend of a friend of yours? Or perhaps a friend of a friend of a friend? How many social links, do you think, separate you from the author of this post? Have you ever played the Kevin Bacon game? Do you know the Erdős number of Albert Einstein? Games aside, computing distances between nodes of social networks is a rather common problem, both for sociological research and for various practical applications.
One interesting practical example is social search. Suppose you are looking for a person named Mary in, say, Facebook or Skype. Obviously, presenting you with a list of the many millions of Marys out there will not be too helpful. If you are instead shown only the Marys that are close to you in the social graph, your chances of finding the right person are much higher. Similarly, if you are looking for a webpage about a “rock concert”, you might prefer to see the pages visited by your friends and friends of friends higher up in the result list. Implementing this kind of ranking, however, requires a method for quickly determining distances between the various nodes in the social graph.
The problem of finding distances in graphs is old and well-studied. Unfortunately, many classical algorithms for solving it are a bit too slow to be useful in certain cases (such as the social search example above). Indeed, the naive shortest path search algorithm will have to traverse nearly the whole graph in order to compute the path between you and the several Marys in your search results. For a modern billion-user social network, this means performing billions of operations — way too much to spend on a single search request.
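To make that cost concrete, here is a minimal sketch of the naive approach: a plain breadth-first search over an adjacency-dict graph (the representation and names here are illustrative, not taken from the paper). It explores the graph level by level and, in the worst case, touches nearly every node before the target turns up.

```python
from collections import deque

def bfs_distance(graph, source, target):
    """Shortest hop count from source to target via breadth-first search.

    In the worst case this visits nearly every node and edge of the
    graph before reaching the target -- billions of operations on a
    billion-user network.
    """
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                if v == target:
                    return dist[u] + 1
                dist[v] = dist[u] + 1
                queue.append(v)
    return float('inf')  # target not reachable from source
```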
Computation time may be significantly reduced if we resort to certain approximate algorithms. A popular family of such algorithms is landmark-based methods. The idea is simple: pick several (say, 100) random “landmark nodes” in the graph, then precompute and store shortest path distances from each landmark node to all other nodes. Now suppose that we want to quickly find the approximate distance between “Joe” and “Mary”. For that we simply check the lengths of all possible paths through the landmark nodes, i.e. [Joe → Landmark1 → Mary], [Joe → Landmark2 → Mary], etc, and choose the shortest one of them. Recall that all the distances [Joe → Landmark#] and [Landmark# → Mary] have been precomputed and stored. We only need to retrieve and add them up, so the whole computation takes around 300 operations — many orders of magnitude less than a billion.
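The scheme above can be sketched in a few lines of Python. This is an illustrative toy (function names, the adjacency-dict representation, and the use of plain BFS are my assumptions, not code from the paper): precompute one distance table per landmark, then estimate any query distance as the minimum over landmarks of d(u, landmark) + d(landmark, v). Note the estimate is always an upper bound on the true distance, by the triangle inequality.

```python
import random
from collections import deque

def bfs_distances(graph, source):
    """Distances (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def precompute_landmarks(graph, k, seed=0):
    """Pick k random landmarks and store a full distance table for each."""
    rng = random.Random(seed)
    landmarks = rng.sample(list(graph), k)
    return [bfs_distances(graph, landmark) for landmark in landmarks]

def estimate_distance(landmark_tables, u, v):
    """Upper bound on d(u, v): min over landmarks of d(u, l) + d(l, v).

    Only table lookups and additions -- about 3 * k operations per query,
    regardless of graph size.
    """
    best = float('inf')
    for dist in landmark_tables:
        if u in dist and v in dist:
            best = min(best, dist[u] + dist[v])
    return best
```

On a real billion-node graph the precomputation would of course use an optimized BFS per landmark, but the query-time logic stays exactly this simple.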
The problem with this approach is that it requires quite a lot of memory to store the precomputed distances. In a graph with a billion nodes there are a billion distances from “Landmark1” to each of the other nodes. And if you have 100 landmarks, you need to store 100 billion distances, eating up hundreds of gigabytes of disk space or RAM.
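A quick back-of-the-envelope check confirms the figure (the 4-bytes-per-distance assumption is mine; even a compact 1-byte encoding, which suffices for hop counts in a social network, would still need 100 GB):

```python
# Rough memory estimate for the naive landmark tables.
nodes = 1_000_000_000          # billion-user social network
landmarks = 100                # number of landmark nodes
bytes_per_distance = 4         # assuming a plain 32-bit integer per entry

total_bytes = nodes * landmarks * bytes_per_distance
print(f"{total_bytes / 10**9:.0f} GB")  # prints "400 GB"
```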
The work by Floreskul, Tretyakov and Dumas, which will be presented at Wednesday’s lightning talk session of ICWSM’14, proposes a simple algorithmic trick that makes it possible to significantly reduce the memory requirements of the landmark-based algorithm. The idea is to limit, in a smart way, the number of nodes for which the distance needs to be stored. If you are interested, you are welcome to check out the paper and visit the poster at the conference.