Capturing and exploring the context of social media discussions is critical to understanding the relationships and information we extract from them. Knowing where information comes from helps us interpret it correctly, and knowing how our extracted results change as we condition on context provides insights into the underlying phenomena and can suggest further lines of investigation and action. For instance, the Livehoods project demonstrated how we can calculate a social distance between locations and use it to map out neighborhood boundaries. When we extend this analysis to include contextual factors, such as gender and temporal information, we find that these boundaries can shift, sometimes significantly. Consider the two maps below of neighborhood boundaries extracted from social data gathered on weekdays (Figure 1) and weekends (Figure 2); they show clearly distinct mobility patterns, such as the appearance of a distinct “5th Ave shopping district” on weekends.
Figure 1. Neighborhood boundaries inferred from weekday behaviors
Figure 2. Neighborhood boundaries inferred from weekend behaviors
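To make the role of context concrete, here is a minimal sketch of this kind of computation. It is not the Livehoods or DGT implementation; it assumes check-in-style data given as (user, venue, timestamp) tuples, uses a simple cosine-based social distance between venues, and all names are illustrative.

```python
# A minimal sketch (not the Livehoods or DGT implementation) of computing a
# "social distance" between venues from check-ins, conditioned on weekday vs.
# weekend. Assumes (user, venue, iso_timestamp) tuples; names are illustrative.
from collections import defaultdict
from datetime import datetime
from math import sqrt

def social_distance(checkins):
    """Cosine distance between venues, based on which users visit each venue."""
    visits = defaultdict(lambda: defaultdict(int))  # venue -> user -> visit count
    for user, venue, _ts in checkins:
        visits[venue][user] += 1

    def cosine_distance(a, b):
        dot = sum(count * b.get(user, 0) for user, count in a.items())
        norm_a = sqrt(sum(c * c for c in a.values()))
        norm_b = sqrt(sum(c * c for c in b.values()))
        return 1.0 - (dot / (norm_a * norm_b) if norm_a and norm_b else 0.0)

    venues = sorted(visits)
    return {(v1, v2): cosine_distance(visits[v1], visits[v2])
            for i, v1 in enumerate(venues) for v2 in venues[i + 1:]}

def split_by_day_type(checkins):
    """Condition on temporal context: separate weekday and weekend check-ins."""
    weekday = [c for c in checkins if datetime.fromisoformat(c[2]).weekday() < 5]
    weekend = [c for c in checkins if datetime.fromisoformat(c[2]).weekday() >= 5]
    return social_distance(weekday), social_distance(weekend)
```

Clustering each of the two resulting distance matrices into neighborhoods gives a separate map for weekdays and weekends, which is how the boundaries in the figures above can differ.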
The social sciences have long taken seriously the need to condition on demographic factors such as gender and socioeconomic status. In fact, it’s commonly understood that incorrect conditioning can in some cases completely reverse empirical results (see, for example, Simpson’s Paradox). So why aren’t social media analyses more commonly conditioned on such demographic variables? We believe that much of the blame lies with the practical difficulties of implementing the necessary feature extractors, aggregators, and analysis components. Moreover, without appropriate computational abstractions, much of this work must be adapted or re-created for each new data set and research question.

To address this challenge, we are presenting a paper next week at ICWSM 2014 and releasing software that dramatically simplifies the implementation of co-occurrence analyses, a surprisingly common class of social media analyses. At the core of our software are discussion graphs, a data model for representing and computing upon relationships extracted from social media. Discussion graphs capture both the structural features of relationships inferred from social media and the context from which they are derived, in just two steps. First, feature extraction turns raw social media data into an initial discussion graph, where each node is an extracted feature value and hyper-edges connect all nodes that co-occur in the same tweet. Second, we project this graph onto the target nodes only, aggregating the remaining features as context annotating each relationship.
Figure 3. Discussion graph framework
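The following is a minimal sketch of these two steps, written in Python purely for illustration; it is not DGT’s actual syntax or API, and the extractor and field names (location, gender, weekend) are hypothetical.

```python
# Sketch of the two discussion-graph steps: (1) feature extraction builds one
# hyperedge per tweet over (feature, value) nodes; (2) projection keeps only
# target nodes and aggregates the other features as context on each edge.
# Illustrative Python only; not DGT's actual syntax or API.
from collections import defaultdict
from itertools import combinations

def extract_hyperedge(tweet, extractors):
    """Step 1: each extractor maps a tweet to zero or more feature values.
    The resulting set of (feature, value) nodes forms one hyperedge."""
    nodes = set()
    for feature, extractor in extractors.items():
        for value in extractor(tweet):
            nodes.add((feature, value))
    return nodes

def project(hyperedges, target_features):
    """Step 2: keep pairwise relationships among target-feature nodes only,
    aggregating the remaining nodes as context annotating each relationship."""
    edges = defaultdict(lambda: defaultdict(list))  # (node, node) -> feature -> values
    for nodes in hyperedges:
        targets = sorted(n for n in nodes if n[0] in target_features)
        context = [n for n in nodes if n[0] not in target_features]
        for a, b in combinations(targets, 2):
            for feature, value in context:
                edges[(a, b)][feature].append(value)
    return edges

# Hypothetical usage: project tweets onto co-mentioned locations, keeping
# inferred gender and a weekday/weekend flag as context on each relationship.
extractors = {
    "location": lambda t: t.get("places", []),
    "gender":   lambda t: [t["gender"]] if "gender" in t else [],
    "weekend":  lambda t: ["weekend" if t["weekday"] >= 5 else "weekday"],
}
# hyperedges = [extract_hyperedge(t, extractors) for t in tweets]
# graph = project(hyperedges, target_features={"location"})
```

Conditioning an analysis on context then amounts to slicing the aggregated annotations on each relationship, for example comparing location-location edges whose context is “weekday” against those whose context is “weekend.”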
We have implemented a tool (cleverly named Discussion Graph Tool, or DGT) to easily build and manipulate discussion graphs. DGT enables sharing and re-use of feature extractors (for example, to infer gender from names, or mood from language cues), extraction of relationships and information, and conditioning on context. With DGT, extracting relationships from social media, including the social distances that underlie the neighborhood boundary inference shown above, is a simple 4-5 line script. To learn more about discussion graphs and DGT, read our paper Discussion Graphs: Putting Social Media Analysis in Context, at ICWSM 2014, and look for our upcoming tool release at our project site (http://research.microsoft.com/dgt). This work is a collaboration between Emre Kıcıman, Scott Counts, Michael Gamon, Munmun De Choudhury, and Bo Thiesson.