In the past 7 years, Twitter has grown to become a social media giant. The service now facilitates the exchange of over 400 million “tweets”, or short 140-character messages, per day. The sheer volume of this (mostly) publicly exchanged communication data offers a tremendous opportunity to researchers, companies, and governmental institutions.
Access to the “Firehose”, a feed provided by Twitter that delivers 100% of the publicly available tweets, can be both difficult to manage and prohibitively expensive. As a result, many researchers rely on Twitter’s “Streaming API”, which provides a sample of all tweets matching parameters pre-set by the API user. However, this API has a major drawback: there is little documentation about how much and what kind of data researchers will get. In this paper, we aim to answer the question: To what extent is the sampled data offered through the Streaming API a valid representation of the overall activity on Twitter?
In our analysis, we used a variety of common statistical measures to compare aspects of the data collected through the Streaming API with a random sample drawn from the Twitter Firehose. Our findings include the following:
- When estimating the top n hashtags in a dataset, the Streaming API data may be misleading when n is small (the estimate improves as n increases).
- Similarly, the topical distribution of tweets collected through the Streaming API becomes more representative as the amount of data collected increases.
- By analyzing retweet networks, we were able to show that we can identify, on average, 50-60% of the top 100 “key players” when creating networks based on one day of Streaming API data.
- Surprisingly, the Streaming API returns almost the complete set of geotagged tweets, despite sampling (shown below).
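To make the first finding concrete, here is a minimal sketch of one way to check how well a sample recovers the top n hashtags of a full dataset. It is only an illustration: the overlap fraction used here is a simple stand-in and not necessarily the measure used in the paper, and the function names and the representation of a tweet as a list of hashtag strings are assumptions made for this example.

```python
from collections import Counter


def top_n_hashtags(tweets, n):
    """Return the n most frequent hashtags in a dataset.

    Assumption for this sketch: each tweet is a list of hashtag strings.
    """
    counts = Counter(tag.lower() for tweet in tweets for tag in tweet)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:n]]


def top_n_overlap(sample, full, n):
    """Fraction of the full dataset's top-n hashtags that also appear
    in the sample's top-n list (1.0 means the two lists agree as sets)."""
    top_full = set(top_n_hashtags(full, n))
    if not top_full:
        return 0.0
    top_sample = set(top_n_hashtags(sample, n))
    return len(top_sample & top_full) / len(top_full)


# Toy data: the sample over-represents "b", so it gets the top-1 hashtag
# wrong, but it agrees with the full data once n is large enough.
full = [["a"], ["a"], ["a"], ["b"], ["b"], ["c"]]
sample = [["b"], ["b"], ["a"], ["c"]]
overlap_top1 = top_n_overlap(sample, full, 1)
overlap_top3 = top_n_overlap(sample, full, 3)
```

In this toy case the overlap is 0.0 for n = 1 and 1.0 for n = 3, which mirrors the pattern described above: a small n is more sensitive to sampling bias than a large one.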
To verify our results, we compared our Streaming API dataset with 100 synthetic Streaming API datasets created through random sampling of the Firehose. This way, we were able to show that the bias introduced by the Streaming API is significant.
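The verification idea can be sketched roughly as follows. This is a simplified illustration, not the procedure from the paper: it assumes each synthetic dataset has the same size as the Streaming API dataset, that a single summary statistic is compared, and that "how unusual is the observed value" is judged by its distance from the mean of the synthetic values. All function names are hypothetical.

```python
import random


def synthetic_samples(firehose, sample_size, num_samples=100, seed=0):
    """Draw num_samples uniform random samples (without replacement)
    of size sample_size from the Firehose data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.sample(firehose, sample_size) for _ in range(num_samples)]


def fraction_at_least_as_extreme(observed, synthetic_values):
    """Fraction of synthetic values lying at least as far from the
    synthetic mean as the observed value does. A value near 0 suggests
    the observed dataset is unlike a uniform random sample."""
    mean = sum(synthetic_values) / len(synthetic_values)
    observed_deviation = abs(observed - mean)
    extreme = sum(
        1 for value in synthetic_values
        if abs(value - mean) >= observed_deviation
    )
    return extreme / len(synthetic_values)


def fraction_even(items):
    """Toy statistic standing in for any measure computed on a dataset."""
    return sum(1 for item in items if item % 2 == 0) / len(items)


# Toy "Firehose": the integers 0..999. The "streaming" dataset is
# deliberately biased: it contains only even numbers.
firehose = list(range(1000))
streaming = list(range(0, 200, 2))

samples = synthetic_samples(firehose, len(streaming))
synthetic_stats = [fraction_even(s) for s in samples]
score = fraction_at_least_as_extreme(fraction_even(streaming), synthetic_stats)
```

Because every random sample contains roughly half even numbers while the biased dataset contains only even numbers, none of the synthetic values is as extreme as the observed one, which is the kind of evidence that points to a non-random sample.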
Want to learn more? Read our full paper, “Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose”, accepted at ICWSM 2013.