Document Clustering Unstructured Data

  • Topic: Machine learning, Cluster analysis, K-means clustering
  • Published: March 6, 2011
Document Clustering Unstructured Data
An Implementation in Text Mining Tool
M. Poongodi, N. Suganya and M. Umadevi

Department of Computer Science, Anna University, Tamil Nadu, India
Poongodi.me@hotmail.com, suganya.nks@gmail.com, devi_becse04@yahoo.in

Abstract—With the advancement of technology and reduced storage costs, individuals and organizations increasingly use electronic media to store textual information and documents. Retrieving relevant information from an unstructured document collection is time consuming for readers. Finding documents in a large collection is easier and faster when the collection is ordered or classified by group or category, but finding the best such grouping remains an open problem. This paper discusses our implementation of the k-Means clustering algorithm for unstructured text documents, beginning with the representation of unstructured text and ending with the resulting set of clusters. Based on an analysis of the resulting clusters for a sample set of documents, we also propose a technique for representing documents that can further improve the clustering result.

Keywords—Information Extraction (IE); Clustering; k-Means Algorithm; Document Classification; Bag-of-words; Document Matching; Document Ranking; Text Mining

Figure 1. Document Clustering

I. INTRODUCTION

Text mining takes unstructured textual information and examines it in an attempt to discover structure and implicit meanings "hidden" within the text [6]. It is concerned with looking for patterns in unstructured text [7]. A cluster is a group of related documents, and clustering, a form of unsupervised learning, is the operation of grouping documents automatically on the basis of some similarity measure, without having to pre-specify categories [8]. No training data is available from which to build a classifier that has learned to group documents. Without any prior knowledge of the number of groups, the group sizes, or the types of documents, the problem of clustering appears challenging [1]. Given N documents, the clustering algorithm finds k, the number of clusters, and associates each text document with one of those clusters. The problem of clustering therefore involves identifying the number of clusters and assigning each document to one of them such that intra-cluster similarity is high relative to inter-cluster similarity.
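The objective stated above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the text does not name the similarity measure at this point, so cosine similarity is assumed here, and the toy vectors, labels, and function names (`cosine`, `average_similarities`) are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors.

    Assumption: cosine similarity stands in for the unspecified
    similarity measure.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarities(vectors, labels):
    """Return (mean intra-cluster, mean inter-cluster) pairwise similarity."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            intra.append(sim)
        else:
            inter.append(sim)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Hypothetical term-frequency vectors for four short documents.
docs = [
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
    [0, 1, 3, 2],
]

good = [0, 0, 1, 1]  # groups documents that share vocabulary
bad = [0, 1, 0, 1]   # mixes documents that share little vocabulary

good_intra, good_inter = average_similarities(docs, good)
bad_intra, bad_inter = average_similarities(docs, bad)
```

For the first assignment the intra-cluster average exceeds the inter-cluster average, while for the second the relationship is reversed, which is the distinction a clustering algorithm tries to exploit.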

One of the main purposes of clustering documents is to locate relevant documents quickly [1]. In the best case, the clusters correspond to the groups that would otherwise be produced with the extra effort of manual label assignment. In that case, the label is an answer to a useful question. For example, if a company operates a call center where users of its products submit problems, hoping to have their difficulties resolved, the queries are problem statements submitted as text. The company would naturally want to know what types of problems are being submitted, and clustering can help it understand them [1].

There is also considerable interest in research on genes and proteins using public databases. Some tools capture the interactions between cells, molecules and proteins, and others extract biological facts from articles. Thousands of these facts can be analyzed for similarities and relationships [1].

The domain of the input documents used in the analysis of our implementation, discussed in the following sections, is restricted to Computer Science (CS).

II. REPRESENTATION OF UNSTRUCTURED TEXT

Before a clustering algorithm can be applied, it is necessary to give structure to the unstructured textual document. The document is represented as a vector in which the words (also called features) are the dimensions and the frequency of each word in the document is the magnitude along that dimension, i.e., a vector is of the form where t1, t2, ..., tn are the terms/words (dimensions of the vector) and f1, f2, ..., fn are...
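The term-frequency representation described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the tokenizer (lowercasing and splitting on non-alphanumeric characters), the function names `tokenize` and `build_vectors`, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Assumption: a simple regex tokenizer; no stop-word removal or stemming.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vectors(documents):
    """Return (vocabulary, vectors) for a list of document strings.

    Each vector has one component per vocabulary term, holding that
    term's frequency in the corresponding document.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, vectors


# Hypothetical sample documents.
sample = [
    "Clustering groups similar documents",
    "Documents are grouped by clustering documents",
]
vocabulary, vectors = build_vectors(sample)
```

Sorting the vocabulary fixes the order of the dimensions so that every document vector is indexed the same way, which is what allows vectors from different documents to be compared.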