Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Aditya Jadhav, Mahesh Kukreja
Abstract: In the information industry, huge amounts of data are widely available, and there is a pressing need to turn such data into useful information. Data mining fulfills this need through the exploration and analysis, by automatic or semi-automatic means, of large quantities of data. On a single system with a few processors, both the processing speed and the size of the data that can be processed at a time are limited. Both can be improved by carrying out data mining in parallel across coordinated systems connected over a LAN. The problem with this solution is that a LAN is not elastic: the number of systems across which the work is distributed cannot be changed to match the size of the data to be processed. Our main aim is to distribute the data to be analyzed across nodes in a cloud. For optimum data distribution and efficient data mining according to the user's requirements, appropriate algorithms must be implemented.
I. INTRODUCTION

1.1 Cloud

The definition of cloud computing rests on the following five attributes: multitenancy (shared resources), massive scalability, elasticity, pay as you go, and self-provisioning of resources.

1. Multitenancy (shared resources): Cloud computing provides the ability to share resources at the network level, host level, and application level.
2. Massive scalability: Cloud computing offers the ability to scale to tens of thousands of systems, as well as the ability to massively scale bandwidth and storage space.
3. Elasticity: Computing resources can be rapidly increased or decreased as needed, as well as released for other uses when they are no longer required.
4. Pay as you go: Users are billed only for the resources actually used, and only for the time they are used.

1.2 Virtualization

In computing, virtualization is the creation of a virtual (rather than actual) version of something, such as a hardware platform, an operating system, a storage device, or a network resource. Virtualization can be viewed as part of an overall trend in enterprise IT that includes autonomic computing, a scenario in which the IT environment manages itself based on perceived activity, and utility computing, in which computer processing power is treated as a utility that clients pay for only as needed. The usual goal of virtualization is to centralize administrative tasks while improving scalability and workload management.

II. EUCALYPTUS

Eucalyptus (Elastic Utility Computing Architecture for Linking Your Programs to Useful Systems) is an open-source software infrastructure for implementing cloud computing on clusters. Its current interface is compatible with Amazon's EC2 interface, but the infrastructure is designed to support multiple client-side interfaces. Eucalyptus is implemented with commonly available Linux tools and basic Web-service technologies, making it easy to install and maintain. It facilitates the creation of on-premise private clouds, with no requirement to retool the organization's existing IT infrastructure or to introduce specialized hardware.

ISSN (Print): 2278-5140, Volume-1, Issue-2, 2012
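The elasticity and pay-as-you-go attributes described above can be illustrated with a toy cost model: the cluster size scales with the input data, and billing covers only the nodes and hours actually used. The block size per node and the hourly rate below are hypothetical examples, not figures from this paper.

```python
# Toy model of elastic, pay-as-you-go provisioning.
# The capacity per node (64 GB) and hourly rate (0.25) are
# hypothetical illustration values, not real pricing.

def nodes_needed(data_gb, gb_per_node=64):
    """Elasticity: scale the number of worker nodes with the input size."""
    return max(1, -(-data_gb // gb_per_node))  # ceiling division

def cost(data_gb, hours, rate_per_node_hour=0.25):
    """Pay as you go: bill only for the nodes and hours actually used."""
    return nodes_needed(data_gb) * hours * rate_per_node_hour

print(nodes_needed(500))  # 8 nodes for 500 GB at 64 GB per node
print(cost(500, 3))       # 8 nodes * 3 hours * 0.25 = 6.0
```

With a fixed LAN, by contrast, `nodes_needed` would be a constant regardless of `data_gb`, which is exactly the inelasticity the abstract criticizes.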
International Journal on Advanced Computer Engineering and Communication Technology (IJACECT)
III. HADOOP

Hadoop comprises the MapReduce computing engine and HDFS, the Hadoop Distributed File System, with HBase (pre-alpha) for online data access. The Apache Hadoop framework supports the execution of data-intensive applications on clusters built of commodity hardware. Hadoop is derived from Google's MapReduce and the Google File System (GFS); similar to GFS, Hadoop has its own file system, HDFS. Hadoop enables users to store and process large volumes of data and...
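To make the MapReduce model concrete, the following is a minimal sketch (not Hadoop's actual API) that simulates the map, shuffle, and reduce phases in a single process, counting word frequencies. On a real Hadoop cluster, the mapper and reducer would run as separate tasks over HDFS blocks, e.g. via Hadoop Streaming.

```python
from collections import defaultdict

# In-process simulation of the MapReduce phases; on a Hadoop cluster
# these functions would run as distributed tasks over HDFS blocks.

def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the partial counts for one word."""
    return key, sum(values)

records = ["hadoop stores data", "hadoop processes data"]
intermediate = [pair for r in records for pair in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each record is mapped independently and each key is reduced independently, both phases parallelize naturally across cloud nodes, which is what makes the model a fit for the elastic data distribution the paper proposes.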