* MoDyCo, UMR 7114, CNRS 200 avenue de la République, 92001 Nanterre ** LaLIC, Université Paris-Sorbonne Maison de la recherche, 28 rue Serpente 75006 Paris E-mail: email@example.com, Philippe.Laublet@paris-sorbonne.fr, firstname.lastname@example.org
This paper presents our work on the detection of temporal information in web pages. The pages examined in this study were taken from the tourism sector, and the temporal information in question is therefore specific to this domain. We highlight the differences between extraction from plain text and extraction from the web. These differences mainly concern the spatial arrangement of the text, the use of punctuation and adherence to traditional syntactic rules. The temporal expressions to be extracted fall into two kinds: temporal information that concerns one particular event, and repetitive temporal information. We adopt a symbolic approach relying on patterns and rules for the detection, extraction and annotation of temporal expressions; our method is based on the use of transducers. First evaluations have shown promising results. Since the visual structure of a web page is very important and often informs users before they have even read the text, a semiotic study is also presented in this paper.
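To make the pattern-based approach concrete, the following is a minimal sketch, not the transducer grammars actually used in this work. It assumes that ordinary regular expressions can stand in for transducers, and it covers only a handful of English expressions. The pattern set, the function names `detect` and `annotate`, the `temporal` element and its `type` values are all hypothetical choices made for illustration.

```python
import re

# Hypothetical lexicons; a real system would need far broader coverage.
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"

# Illustrative patterns only (assumption): each pairs a label with a regex.
# Earlier patterns take priority when two matches overlap.
PATTERNS = [
    ("repetitive", re.compile(rf"\bevery day(?: except {DAY}s?)?\b", re.I)),
    ("repetitive", re.compile(rf"\bevery {DAY}\b", re.I)),
    ("period", re.compile(rf"\bfrom {MONTH} to {MONTH}\b", re.I)),
    ("date", re.compile(rf"\b\d{{1,2}} {MONTH}(?: \d{{4}})?\b", re.I)),
]


def detect(text):
    """Return non-overlapping (start, end, kind, surface) tuples, left to right."""
    found = []
    for kind, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Skip a match that overlaps one accepted from an earlier pattern.
            if any(m.start() < e and s < m.end() for s, e, _, _ in found):
                continue
            found.append((m.start(), m.end(), kind, m.group(0)))
    return sorted(found)


def annotate(text):
    """Wrap each detected expression in a hypothetical <temporal> element."""
    out, pos = [], 0
    for start, end, kind, surface in detect(text):
        out.append(text[pos:start])
        out.append(f'<temporal type="{kind}">{surface}</temporal>')
        pos = end
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    print(annotate("Open every day except Tuesday, from March to July."))
```

On the sample sentence, the sketch tags "every day except Tuesday" as repetitive and "from March to July" as a period. It ignores the layout, punctuation and syntax problems specific to web pages, which are the main concern of this paper.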
With the methods of the Semantic Web, portal applications can be built on ontologies. For these applications, and for many service applications, temporal information is often essential. For example, a tourism web portal needs information about the type of each tourism object and its location in time and space. In addition, the extracted information must be stored in the knowledge base according to the ontology used by the application. In this paper we focus on temporal information in tourism web pages. This information has to be detected, extracted and annotated. The annotation format will probably rely on existing XML tools (Stern 2007).

In performing these tasks, we encountered three main kinds of difficulties. First, we have to deal with complex, imprecise temporal information. Of course, single dates are easy to process, but more complex expressions, such as periods or repetitive information (e.g. "from March to July", "open every day except Tuesday"), must be treated as well. Second, once extracted, the information needs to be linked to the proper tourism object. If the web page concerns only one object, this is straightforward, but some web pages concern many objects, and an analysis is therefore necessary to decide how to link each piece of information to its object. Third, although the web pages we deal with are all of the same type (tourism web pages), they vary considerably: they are made by different people, take different forms and concern different types of tourism objects. We will try to show that a semiotic study of some pages is necessary to take some of their specificities into account.

The work presented in this paper is part of the EIFFEL project, whose main objective is to create a tourism portal offering a range of functionalities. This portal, intended for use by the local tourism sector, will include a specialised search engine. It will allow users to find and collect precise and essential information in context. It will also help the territory, as a French region, to promote its services. This is a wide-ranging project (Noël et al. 2008) based on Semantic Web technologies, knowledge representation, and linguistic methods and expertise. It includes the automatic identification, selection and extraction of various items from the Web according to existing ontologies. The project mainly involves two companies (Mondeca and Antidot) and three laboratories (LIRMM (CNRS), INRIA-Rocquencourt and MoDyCo (CNRS)). Our corpus,...