Open Domain Event Extraction from Twitter
University of Washington Computer Sci. & Eng. Seattle, WA firstname.lastname@example.org
University of Washington Computer Sci. & Eng. Seattle, WA email@example.com
University of Washington Computer Sci. & Eng. Seattle, WA firstname.lastname@example.org
Decide, Inc. Seattle, WA email@example.com
Tweets are the most up-to-date and inclusive stream of information and commentary on current events, but they are also fragmented and noisy, motivating the need for systems that can extract, aggregate and categorize important events. Previous work on extracting structured representations of events has focused largely on newswire text; Twitter's unique characteristics present new challenges and opportunities for open-domain event extraction. This paper describes TwiCal, the first open-domain event-extraction and categorization system for Twitter. We demonstrate that accurately extracting an open-domain calendar of significant events from Twitter is indeed feasible. In addition, we present a novel approach for discovering important event categories and classifying extracted events based on latent variable models. By leveraging large volumes of unlabeled data, our approach achieves a 14% increase in maximum F1 over a supervised baseline. A continuously updating demonstration of our system can be viewed at http://statuscalendar.com. Our NLP tools are available at http://github.com/aritter/twitter_nlp.
Entity        Event Phrase   Date      Type
Steve Jobs    died           10/6/11   Death
iPhone        announcement   10/4/11   ProductLaunch
GOP           debate         9/7/11    PoliticalEvent
Amanda Knox   verdict        10/3/11   Trial
Table 1: Examples of events extracted by TwiCal.

events. Yet the number of tweets posted daily has recently exceeded two hundred million, many of which are either redundant or of limited interest, leading to information overload.¹ Clearly, we can benefit from more structured representations of events that are synthesized from individual tweets.

Previous work in event extraction [21, 1, 54, 18, 43, 11, 7] has focused largely on news articles, as historically this genre of text has been the best source of information on current events. In the meantime, social networking sites such as Facebook and Twitter have become an important complementary source of such information. While status messages contain a wealth of useful information, they are very disorganized, motivating the need for automatic extraction, aggregation and categorization. Although there has been much interest in tracking trends or memes in social media [26, 29], little work has addressed the challenges arising from extracting structured representations of events from short or informal texts.

Extracting useful structured representations of events from this disorganized corpus of noisy text is a challenging problem. On the other hand, individual tweets are short and self-contained, and therefore lack the complex discourse structure found in texts containing narratives. In this paper we demonstrate that open-domain event extraction from Twitter is indeed feasible; for example, our highest-confidence extracted future events are 90% accurate, as demonstrated in §8.

Twitter has several characteristics that present unique challenges and opportunities for the task of open-domain event extraction.

Challenges: Twitter users frequently mention mundane events in their daily lives (such as what they ate for lunch), which are of interest only to their immediate social network. In contrast, if an event is mentioned in newswire text, it

¹ http://blog.twitter.com/2011/06/200-million-tweets-per-day.html
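As an illustration of the structured representation in Table 1, the following minimal Python sketch models each extracted event as a four-field record (entity, event phrase, date, type) and sorts the records into a calendar. The names `CalendarEvent`, `parse_row` and `build_calendar` are hypothetical and are not taken from TwiCal's implementation; the sketch also assumes that the dates in Table 1 are written as month/day/two-digit year.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class CalendarEvent:
    """One row of Table 1 (hypothetical record type, not TwiCal's own)."""
    entity: str
    event_phrase: str
    when: date
    event_type: str


def parse_row(entity: str, phrase: str, mdy: str, event_type: str) -> CalendarEvent:
    # Assumption: dates are month/day/two-digit-year, e.g. "10/6/11".
    when = datetime.strptime(mdy, "%m/%d/%y").date()
    return CalendarEvent(entity, phrase, when, event_type)


# The four example events from Table 1.
TABLE_1 = [
    parse_row("Steve Jobs", "died", "10/6/11", "Death"),
    parse_row("iPhone", "announcement", "10/4/11", "ProductLaunch"),
    parse_row("GOP", "debate", "9/7/11", "PoliticalEvent"),
    parse_row("Amanda Knox", "verdict", "10/3/11", "Trial"),
]


def build_calendar(events):
    """Return the events ordered by date, earliest first."""
    return sorted(events, key=lambda e: e.when)


if __name__ == "__main__":
    for ev in build_calendar(TABLE_1):
        print(ev.when.isoformat(), ev.entity, ev.event_phrase, ev.event_type)
```

Keeping the date as a `datetime.date` rather than a string is what allows the records to be ordered chronologically; the record itself carries nothing beyond the four columns shown in the table.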
Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Language parsing and understanding; H.2.8 [Database Management]: Database applications—data mining