Data mining is an analytical process that primarily involves searching through vast amounts of data to spot useful, but previously undiscovered, patterns. The data mining process typically involves three major steps: exploration, model building and validation, and finally deployment.
Data mining is used in numerous applications, particularly business-related endeavors such as market segmentation, customer churn analysis, fraud detection, direct marketing, interactive marketing, market basket analysis and trend analysis. However, since the 1993 World Trade Center bombing and the terrorist attacks of September 11, 2001, data mining has increasingly been used in homeland security efforts.
Two of the earlier homeland security programs were the Total Information Awareness Program (TIA) and the Computer-Assisted Passenger Prescreening System (CAPPS II). Privacy and other concerns led to the eventual demise of these programs.
In addition to efforts by the federal government, state programs are also being implemented. The Texas Fusion Center is a prime example of state agencies mining data in an effort to thwart attacks against our populace.
Data mining is not difficult to implement. To illustrate this, an example of detecting potential subversives using Amazon.com wishlists is presented.
The primary drawbacks of data mining are concerns related to privacy. Prime examples are false positives, whereby individuals are wrongly identified as "terrorists," and inadequate government control over data.
In conclusion, data mining can be enormously beneficial in homeland security efforts. However, until privacy and other concerns are adequately addressed, it will be difficult for the government to win approval from its citizens for many programs.
This technical paper is intended to introduce the reader to the analytical process known as data mining and its growing application in homeland security endeavors. In doing so, some of the more popular techniques and applications will be briefly addressed before highlighting data mining in homeland security and related anti-terrorism initiatives.
The paper will end with a brief discussion of the growing concerns about the negative aspects of the process, such as privacy issues.
Finally, some overall conclusions and prospects for the future are touched upon.
Data Mining Overview: Definition, Techniques, & Applications
Definition and Techniques
Data mining, which is also sometimes referred to as Knowledge Discovery in Databases (KDD), is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables. After a relationship has been discovered, the findings are typically validated by applying the detected patterns to new sets of data. The data mining process usually involves large amounts of data, which are typically business or market related. The ultimate goal of data mining is prediction; predictive data mining is the most common type of data mining and the one with the most direct business applications.
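The cycle described above (find a pattern, validate it on data it has not seen, then use it for prediction) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not a method taken from this paper or its sources: the customer records are invented, the names (clean, fit_threshold, run_pipeline) are hypothetical, and a single-threshold rule stands in for whatever model a real project would use.

```python
from typing import List, Optional, Tuple

# Hypothetical record: (months since last purchase, customer churned?)
Record = Tuple[Optional[float], bool]


def clean(records: List[Record]) -> List[Tuple[float, bool]]:
    """Preparation: drop records whose feature value is missing."""
    return [(x, y) for x, y in records if x is not None]


def accuracy(records: List[Tuple[float, bool]], threshold: float) -> float:
    """Share of records where the rule 'x >= threshold means churn' is right."""
    hits = sum((x >= threshold) == y for x, y in records)
    return hits / len(records)


def fit_threshold(train: List[Tuple[float, bool]]) -> float:
    """Pattern discovery: choose the threshold that best fits the training data."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(train, t))


def run_pipeline(raw: List[Record], new_values: List[float]):
    """Discover a pattern, validate it on held-out data, apply it to new data."""
    cleaned = clean(raw)
    split = int(len(cleaned) * 0.6)          # simple 60/40 split, an assumption
    train, holdout = cleaned[:split], cleaned[split:]
    threshold = fit_threshold(train)         # pattern found on known data
    holdout_accuracy = accuracy(holdout, threshold)  # validated on unseen data
    predictions = [x >= threshold for x in new_values]  # applied to new data
    return threshold, holdout_accuracy, predictions


if __name__ == "__main__":
    # Invented data, for illustration only.
    raw = [(1.0, False), (2.0, False), (None, True), (3.0, False),
           (7.0, True), (8.0, True), (2.5, False), (9.0, True),
           (6.0, True), (1.5, False), (None, False), (10.0, True)]
    threshold, holdout_accuracy, predictions = run_pipeline(raw, [0.5, 12.0])
    print("threshold:", threshold)
    print("hold-out accuracy:", holdout_accuracy)
    print("predictions for new customers:", predictions)
```

On this invented data the rule fits the training records perfectly but misclassifies one of the four held-out records, which is the reason a discovered pattern is validated on new data before it is relied upon for prediction.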
The steps that make up the data mining process differ slightly depending upon the source. However, the common theme consists of three stages as described by StatSoft: "(1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic...