Real-time information from microblogs like Twitter is useful for different applications such as market research, opinion mining, and crisis management. For many of those messages, location information is required to derive useful insights. Today, however, only around 1% of all tweets are explicitly geotagged.
What challenges arise from this limited geotag information?
Extracting location information can be challenging, as place names (also called toponyms) must be identified from the tweet message and metadata. A toponym can refer to:
- Multiple geographic locations (Geo/Geo disambiguation), e.g., ‘Paris’ could refer to the capital of France, as well as any of 23 cities in the USA.
- A real spatial location, but also a person or a thing (Geo/Non-geo disambiguation), e.g., ‘Metro’ may refer to a city in Indonesia, a train, or a company.
Disambiguating between these multiple meanings is called toponym resolution and represents a primary challenge when dealing with location information in microblogs.
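To make the two kinds of ambiguity concrete, here is a minimal sketch in Python. The tiny gazetteer, its entries, and the `ambiguity` helper are hypothetical illustrations, not part of our system; a real gazetteer holds many more candidates per name.

```python
# Toy gazetteer: every entry is a hypothetical illustration, not real data.
GAZETTEER = {
    "paris": [
        {"kind": "place", "label": "Paris, France"},
        {"kind": "place", "label": "Paris, Texas, USA"},
    ],
    "metro": [
        {"kind": "place", "label": "Metro, Indonesia"},
        {"kind": "thing", "label": "metro (train)"},
        {"kind": "organization", "label": "Metro (company)"},
    ],
}


def ambiguity(token):
    """Return the set of ambiguity types a token shows in the toy gazetteer."""
    senses = GAZETTEER.get(token.lower(), [])
    places = [s for s in senses if s["kind"] == "place"]
    kinds = set()
    if len(places) > 1:
        # Several candidate locations share the same name.
        kinds.add("geo/geo")
    if places and len(places) < len(senses):
        # The name can denote a location or something that is not a location.
        kinds.add("geo/non-geo")
    return kinds


for word in ("Paris", "Metro", "hello"):
    print(word, sorted(ambiguity(word)))
```

A token can show both types at once; toponym resolution then has to pick a single sense, or none at all.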
How can we improve toponym resolution for microblog data?
To disambiguate toponyms, we propose a multi-indicator method for determining (1) the location where a tweet was created, as well as (2) the location of the user’s residence. Our method is based on various weighted indicators, including the names of places that appear in the text message, dedicated location entries, and additional information from the user profile.
As shown in the picture, our method consists of four steps:
- Detection of spatial indicators: Spatial indicators are pieces of location information that allow geolocalization. Our method spots spatial indicators in the text message and in the user profile of a tweeter.
- Geographical interpretation: Each spatial indicator refers to (at least) one geographical area. We determine that area and represent it with a polygon.
- Weighting: As some spatial indicators are more reliable than others, we attribute a variable height to each polygon. The height is computed based on weights determined using an optimization algorithm and the reported uncertainty of the spatial indicator for the currently analyzed case.
- Stacking: By intersecting and stacking the 3D polygons over each other, a height map is built. The highest area in this height map is then used for geolocalization.
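The weighting and stacking steps can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the implementation from the paper: areas are bounding boxes rasterized onto a coarse grid instead of true polygons, the height is assumed to be `weight * confidence`, and all boxes, weights, and confidence values are made up.

```python
import math
from collections import defaultdict


def stack_indicators(indicators, cell_size=1.0):
    """Stack weighted areas on a grid and return (best_cell, height_map).

    Each indicator is ((min_lon, min_lat, max_lon, max_lat), weight, confidence).
    Grid cells stand in for polygons; this is an assumption of the sketch.
    """
    height_map = defaultdict(float)
    for (min_lon, min_lat, max_lon, max_lat), weight, confidence in indicators:
        # Assumed height formula: reliable and certain indicators stack higher.
        height = weight * confidence
        lon_cells = range(math.floor(min_lon / cell_size), math.ceil(max_lon / cell_size))
        lat_cells = range(math.floor(min_lat / cell_size), math.ceil(max_lat / cell_size))
        for i in lon_cells:
            for j in lat_cells:
                height_map[(i, j)] += height
    if not height_map:
        return None, {}
    # The highest cell of the height map is the location estimate.
    best_cell = max(height_map, key=height_map.get)
    return best_cell, dict(height_map)


# Hypothetical indicators for one tweet (boxes and numbers are illustrative).
indicators = [
    ((8.0, 49.0, 9.0, 50.0), 0.5, 0.8),    # toponym found in the message text
    ((6.0, 47.0, 15.0, 55.0), 0.3, 1.0),   # country given in the user profile
    ((0.0, 40.0, 15.0, 60.0), 0.1, 1.0),   # coarse hint such as a time zone
]
best_cell, height_map = stack_indicators(indicators)
print(best_cell, round(height_map[best_cell], 3))
```

The cell covered by all three boxes ends up highest, which shows how several coarse indicators can jointly narrow down a location that none of them pins down alone.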
How accurate is our approach?
We collected more than 1.03 million geotagged tweets as ground truth data and applied our approach. The evaluation shows that our method is capable of locating 92% of all tweets with a median error of less than 30 km, as well as predicting the user’s residence with a median error of less than 5.1 km.
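A median distance of this kind can be computed by comparing each predicted position with the tweet's geotag. The sketch below assumes great-circle (haversine) distance on a spherical Earth; the exact distance measure used in the paper may differ, and the function names and sample coordinates are illustrative only.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted, actual):
    """Median distance between predicted and ground-truth (lat, lon) pairs."""
    return median(haversine_km(*p, *a) for p, a in zip(predicted, actual))


# Made-up coordinates, only to show the call.
predicted = [(49.9, 8.6), (48.1, 11.6), (52.5, 13.4)]
actual = [(49.9, 8.6), (48.2, 11.5), (53.5, 10.0)]
print(round(median_error_km(predicted, actual), 1), "km")
```

The median is a natural choice here because a few badly misplaced tweets would otherwise dominate a mean error.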
For more, see our full paper, A Multi-Indicator Approach for Geolocalization of Tweets (to appear @ ICWSM 2013)
Axel Schulz, SAP Research Darmstadt, Technische Universität Darmstadt
Aristotelis Hadjakos, HFM Detmold
Heiko Paulheim, University of Mannheim
Johannes Nachtwey, SAP Research Darmstadt
Max Mühlhäuser, Technische Universität Darmstadt