Abstract. Data mining is concerned with analysing large volumes of (often unstructured) data to automatically discover interesting regularities or relationships which in turn lead to better understanding of the underlying processes. The field of temporal data mining is concerned with such analysis in the case of ordered data streams with temporal interdependencies. Over the last decade many interesting techniques of temporal data mining were proposed and shown to be useful in many applications. Since temporal data mining brings together techniques from different fields such as statistics, machine learning and databases, the literature is scattered among many different sources. In this article, we present an overview of techniques of temporal data mining. We mainly concentrate on algorithms for pattern discovery in sequential data streams. We also describe some recent results regarding statistical analysis of pattern discovery methods.
Keywords. Temporal data mining; ordered data streams; temporal interdependency; pattern discovery.
Srivatsan Laxman and P S Sastry
Data mining can be defined as an activity that extracts some new nontrivial information contained in large databases. The goal is to discover hidden patterns, unexpected trends or other subtle relationships in the data using a combination of techniques from machine learning, statistics and database technologies. This new discipline today finds application in a wide and diverse range of business, scientific and engineering scenarios. For example, large databases of loan applications are available which record different kinds of personal and financial information about the applicants (along with their repayment histories). These databases can be mined for typical patterns leading to defaults, which can help determine whether a future loan application should be accepted or rejected. Several terabytes of remote-sensing image data are gathered from satellites around the globe. Data mining can help reveal potential locations of some (as yet undetected) natural resources or assist in building early warning systems for ecological disasters such as oil slicks. Other situations where data mining can be of use include analysis of medical records of hospitals in a town to predict, for example, potential outbreaks of infectious diseases, and analysis of customer transactions for market research applications. The list of application areas for data mining is large and is bound to grow rapidly in the years to come. There are many recent books that detail generic techniques for data mining and discuss various applications (Witten & Frank 2000; Han & Kamber 2001; Hand et al 2001).
Temporal data mining is concerned with data mining of large sequential data sets. By sequential data, we mean data that is ordered with respect to some index. For example, time series constitute a popular class of sequential data, where records are indexed by time. Other examples of sequential data include text, gene sequences, protein sequences and lists of moves in a chess game. Here, although there is no notion of time as such, the ordering among the records is very important and is central to the data description/modelling.
Time series analysis has quite a long history. Techniques for statistical modelling and spectral analysis of real or complex-valued time series have been in use for more than fifty years (Box et al 1994; Chatfield 1996). Weather forecasting, financial or stock market prediction and automatic process control have been some of the oldest and most studied applications of such time series analysis (Box et al 1994). Time series matching and classification have received much attention since the days when speech recognition research saw heightened activity (Juang & Rabiner 1993; O’Shaughnessy 2000). These applications saw the advent of an increased role for machine learning techniques like Hidden Markov Models and time-delay neural networks in time series analysis....
References: Proc. 2003 IEEE Comput. Soc. Conf. on Computer Vision and Pattern Recognition, pp I–375–I–381, Madison, Wisconsin
sequences. In Proc. 4th IEEE Int. Conf. on Data Mining (ICDM 2004), pp 3–10, Brighton, UK
Baeza-Yates R A 1991 Searching subsequences
Principles of Data Mining and Knowledge Discovery, vol. 2431, pp 51–61
Bettini C, Wang X S, Jajodia S, Lin J L 1998 Discovering frequent event patterns with multiple
Springer-Verlag) vol. 2076, pp 152–165
Frenkel K A 1991 The human genome project and informatics
Proc. 3rd IEEE Int. Conf. on Data Mining (ICDM 2003), pp 67–74
Gwadera R, Atallah M J, Szpankowski W 2005 Markov models for identification of significant
episodes. In Proc. 2005 SIAM Int. Conf. on Data Mining (SDM-05), Newport Beach, California
Han J, Kamber M 2001 Data mining: Concepts and techniques (San Francisco, CA: Morgan Kaufmann)
2001) Washington, DC, vol. 2226, pp 435–441, 25–28
Juang B H, Rabiner L 1993 Fundamentals of speech recognition