Open Domain Event Extraction from Twitter

  • Topic: Named entity recognition, Twitter, Natural language processing
  • Published : April 22, 2013
Alan Ritter
University of Washington, Computer Sci. & Eng., Seattle, WA
aritter@cs.washington.edu

Mausam
University of Washington, Computer Sci. & Eng., Seattle, WA
mausam@cs.washington.edu

Oren Etzioni
University of Washington, Computer Sci. & Eng., Seattle, WA
etzioni@cs.washington.edu

Sam Clark∗
Decide, Inc., Seattle, WA
sclark.uw@gmail.com

ABSTRACT
Tweets are the most up-to-date and inclusive stream of information and commentary on current events, but they are also fragmented and noisy, motivating the need for systems that can extract, aggregate and categorize important events. Previous work on extracting structured representations of events has focused largely on newswire text; Twitter’s unique characteristics present new challenges and opportunities for open-domain event extraction. This paper describes TwiCal, the first open-domain event-extraction and categorization system for Twitter. We demonstrate that accurately extracting an open-domain calendar of significant events from Twitter is indeed feasible. In addition, we present a novel approach for discovering important event categories and classifying extracted events based on latent variable models. By leveraging large volumes of unlabeled data, our approach achieves a 14% increase in maximum F1 over a supervised baseline. A continuously updating demonstration of our system can be viewed at http://statuscalendar.com; our NLP tools are available at http://github.com/aritter/twitter_nlp.

Entity        Event Phrase   Date      Type
Steve Jobs    died           10/6/11   Death
iPhone        announcement   10/4/11   ProductLaunch
GOP           debate         9/7/11    PoliticalEvent
Amanda Knox   verdict        10/3/11   Trial

Table 1: Examples of events extracted by TwiCal.

... events. Yet the number of tweets posted daily has recently exceeded two hundred million, many of which are either redundant [57] or of limited interest, leading to information overload.¹ Clearly, we can benefit from more structured representations of events that are synthesized from individual tweets.

Previous work in event extraction [21, 1, 54, 18, 43, 11, 7] has focused largely on news articles, as historically this genre of text has been the best source of information on current events. In the meantime, social networking sites such as Facebook and Twitter have become an important complementary source of such information. While status messages contain a wealth of useful information, they are very disorganized, motivating the need for automatic extraction, aggregation and categorization. Although there has been much interest in tracking trends or memes in social media [26, 29], little work has addressed the challenges arising from extracting structured representations of events from short or informal texts.

Extracting useful structured representations of events from this disorganized corpus of noisy text is a challenging problem. On the other hand, individual tweets are short and self-contained, and therefore lack the complex discourse structure found in narrative texts. In this paper we demonstrate that open-domain event extraction from Twitter is indeed feasible; for example, our highest-confidence extracted future events are 90% accurate, as demonstrated in §8.

Twitter has several characteristics which present unique challenges and opportunities for the task of open-domain event extraction.

Challenges: Twitter users frequently mention mundane events in their daily lives (such as what they ate for lunch) which are only of interest to their immediate social network. In contrast, if an event is mentioned in newswire text, it

¹ http://blog.twitter.com/2011/06/200-million-tweets-per-day.html

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Language parsing and understanding; H.2.8 [Database Management]: Database applications—data mining

General Terms
Algorithms, Experimentation

1. INTRODUCTION
Social networking sites such as Facebook and...