When Distance Matters

Is the President a friend of a friend of yours? Or perhaps a friend of a friend of a friend? How many social links, do you think, separate you from the author of this post? Have you ever played the Kevin Bacon game? Do you know the Erdős number of Albert Einstein? Games aside, computing distances between nodes of social networks is a rather common problem, both for sociological research and for various practical applications.

Some users are closer to you in the network than others

One interesting practical example is social search. Suppose you are looking for a person named Mary in, say, Facebook or Skype. Obviously, presenting you with a list of the many millions of Marys out there will not be too helpful. If you are instead shown only the Marys that are close to you in the social graph, your chances of finding the right person are much higher. Similarly, if you are looking for a webpage about a “rock concert”, you might prefer to see the pages visited by your friends and friends of friends higher up in the result list. Now, in order to implement such sorting, a method is required for quickly determining distances between the various nodes in the social graph.

The problem of finding distances in graphs is old and well-studied. Unfortunately, many classical algorithms for solving it are a bit too slow to be useful in certain cases (such as the social search example above). Indeed, the naive shortest path search algorithm will have to traverse nearly the whole graph in order to compute the path between you and the several Marys in your search results. For a modern billion-user social network, this means performing billions of operations — way too much to spend on a single search request.
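
To make the cost concrete, here is a minimal sketch of such a naive breadth-first search (in Python; the adjacency-dictionary representation of the graph and the function name are illustrative assumptions, not something from the original post). On a billion-node network it may have to visit a large fraction of the nodes before it reaches the target:

    from collections import deque

    def bfs_distance(graph, source, target):
        # `graph` maps each node to a list of its neighbours (a simplified,
        # assumed representation). BFS explores nodes in rings of increasing
        # distance, so in the worst case it touches nearly the whole graph
        # before it finds `target`.
        if source == target:
            return 0
        visited = {source}
        queue = deque([(source, 0)])
        while queue:
            node, dist = queue.popleft()
            for neighbour in graph.get(node, ()):
                if neighbour == target:
                    return dist + 1
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append((neighbour, dist + 1))
        return float("inf")  # no path between the two nodes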

Red circle denotes a “landmark node” with a table of precomputed distances to all other nodes.

Computation time may be significantly reduced if we resort to certain approximate algorithms. A popular family of such algorithms is landmark-based methods. The idea is simple: pick several (say, 100) random “landmark nodes” in the graph, then precompute and store shortest path distances from each landmark node to all other nodes. Now suppose that we want to quickly find the approximate distance between “Joe” and “Mary”. For that we simply check the lengths of all possible paths through the landmark nodes, i.e. [Joe → Landmark1 → Mary], [Joe → Landmark2 → Mary], etc., and choose the smallest of them. Recall that all the distances [Joe → Landmark#] and [Landmark# → Mary] have been precomputed and stored. We’ll only need to retrieve and add them up, so the whole computation will take roughly 300 operations — many orders of magnitude less than a billion.
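
For illustration, here is a minimal sketch of this landmark scheme in Python. The graph representation (an adjacency dictionary), the choice of 100 random landmarks, and all function names are assumptions made for the example, not the exact implementation from the paper:

    import random
    from collections import deque

    def bfs_all_distances(graph, source):
        # One full BFS from `source`: distances to every reachable node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph.get(node, ()):
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        return dist

    def precompute_landmarks(graph, num_landmarks=100):
        # Expensive offline step: pick random landmark nodes and store
        # the distances from each of them to all other nodes.
        landmarks = random.sample(list(graph), num_landmarks)
        return {lm: bfs_all_distances(graph, lm) for lm in landmarks}

    def estimate_distance(landmark_tables, joe, mary):
        # Cheap online step: the shortest path that passes through some
        # landmark. This is an upper bound on the true distance.
        best = float("inf")
        for table in landmark_tables.values():
            if joe in table and mary in table:
                best = min(best, table[joe] + table[mary])
        return best

With 100 landmark tables, the query loop above performs a couple of hundred dictionary lookups and additions, which is where the rough figure of 300 operations comes from.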

The problem with this approach is that it requires quite a lot of memory to store the precomputed distances. In a graph with a billion nodes there are a billion distances from “Landmark1” to the other nodes. And if you have 100 landmarks, you need to store 100 billion distances, eating up hundreds of gigabytes of disk space or RAM.
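
As a back-of-the-envelope check (assuming, for the sake of the example, one 4-byte integer per stored distance, which the post does not specify):

    nodes = 1_000_000_000        # roughly one billion users
    landmarks = 100
    bytes_per_distance = 4       # assumption: a 32-bit integer per distance

    total_bytes = nodes * landmarks * bytes_per_distance
    print(total_bytes / 10**9)   # 400.0, i.e. about 400 GB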

The work by Floreskul, Tretyakov and Dumas, which will be presented at Wednesday’s lightning talk session of ICWSM’14, proposes a simple algorithmic trick that makes it possible to significantly reduce the memory requirements of the landmark-based algorithm. The idea is to limit, in a smart way, the set of nodes for which the distances need to be stored. If you are interested, you are welcome to check out the paper and visit the poster at the conference.

About the author

Konstantin

As of 2014, I am working as a researcher at the University of Tartu, Estonia, dealing with pretty much anything related to machine learning, data analysis and artificial intelligence. The most recent projects had to do with bioinformatics, social network analysis and robotics. I blog, albeit not too frequently in recent years, at http://fouryears.eu/.


2 Comments

  • Interesting stuff. Reading your post reminded me of approaches in road network route finding algorithms, where huge benefits can be obtained by effectively reducing the network to major highways outside of the local areas around the origin and destination. Could someone create a reasonable “highway” between heavily connected nodes (or landmarks?) and do something similar?

    • I am not perfectly knowledgeable about the state of the art in path finding algorithms for road networks (this seems to be a highly commercialized field), but I always felt those must be slightly different in spirit from the social network case.

      Firstly, road networks are (mostly) planar graphs with a known 2D embedding, which makes it possible to use good heuristics within otherwise standard graph traversal algorithms. In contrast, social networks are typically highly interconnected, with a “small world” structure and no clear “highways” to follow. Secondly, in most practical cases when you are searching for a shortest path in a road network (e.g. GPS navigation) you have the luxury of waiting the seconds needed for the traversal algorithm to complete – you don’t need your answer to come in milliseconds.

      What your question relates to, to my mind, is the problem of choosing “good” landmark nodes (those that often lie on shortest paths and thus guarantee better approximation quality). This problem has been studied somewhat (e.g. we did it here), and so far it seems that picking nodes with high degree is both a simple and reasonable choice for landmarks.