Avirup Sil∗, Temple University, Philadelphia, PA, email@example.com
Yinfei Yang, St. Joseph’s University, Philadelphia, PA, firstname.lastname@example.org
Ernest Cronin∗, St. Joseph’s University, Philadelphia, PA, email@example.com
Penghai Nie, St. Joseph’s University, Philadelphia, PA, firstname.lastname@example.org
Ana-Maria Popescu, Yahoo! Labs, Sunnyvale, CA, email@example.com
Alexander Yates, Temple University, Philadelphia, PA, firstname.lastname@example.org

Abstract

Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.
Named-entity disambiguation (NED) is the task of linking names mentioned in text with an established catalog of entities (Bunescu and Pasca, 2006; Ratinov et al., 2011). It is a vital first step for semantic understanding of text, such as in grounded semantic parsing (Kwiatkowski et al., 2011), as well as for information retrieval tasks like person name search (Chen and Martin, 2007; Mann and Yarowsky, 2003).

NED requires a catalog of symbols, called referents, to which named-entities will be resolved. Most NED systems today use Wikipedia as the catalog of referents, but exclusive focus on Wikipedia as a target for NED systems has significant drawbacks: despite its breadth, Wikipedia still does not contain all or even most real-world entities mentioned in text. As one example, it has poor coverage of entities that are mostly important in a small geographical region, such as hotels and restaurants, which are widely discussed on the Web. 57% of the named-entities in the Text Analysis Conference’s (TAC) 2009 entity linking task refer to an entity that does not appear in Wikipedia (McNamee et al., 2009). Wikipedia is clearly a highly valuable resource, but it should not be thought of as the only one.

Instead of relying solely on Wikipedia, we propose a novel approach to NED, which we refer to as Open-DB NED: the task is to resolve an entity to Wikipedia or to any relational database that meets mild conditions about the format of the data, described below. Leveraging structured, relational data should allow systems to achieve strong accuracy, as with domain-specific or database-specific NED techniques like Hoffart et al.’s NED system for YAGO (Hoffart et al., 2011). And because of the availability of huge numbers of databases on the Web, many for specialized domains, a successful system for this task will cover entities that a Wikipedia NED or database-specific system cannot.

We investigate two complementary learning strategies for Open-DB NED, both of which significantly relax the assumptions of traditional NED systems. The first strategy, a distant supervision approach, uses the relational information in a given database and a large corpus of unlabeled text to learn a database-specific model. The second strategy, a domain adaptation approach, assumes a single source database that has accompanying labeled data. Classifiers in this setting must learn a model that transfers from the source database to any new database, without requiring new training data for the new database. Experiments show that both strategies outperform a state-of-the-art Wikipedia NED system by wide margins without requiring any labeled...