MapR vs Hadoop

  • Published : April 8, 2013
Performance Evaluation of MapR
October 15, 2012

Abstract

MapR is an alternate distribution of Apache Hadoop developed by MapR Technologies. The MapR distribution claims much better performance than Apache Hadoop: between a factor of 3x and 20x. The aim of this project is to characterise this performance gain and to identify possible reasons for it. In particular, we shall focus on the performance characteristics of MapR’s custom filesystem through experiments designed to evaluate its processor utilization, its disk allocation algorithm, and its interaction with native I/O schedulers.

1

Problem Statement

A clear statement of the problem to be solved.
MapR designers have observed that Hadoop’s layered architecture (Hadoop application to Hadoop Distributed File System (HDFS) to native filesystem to hardware) is a design choice that degrades performance. Consequently, they chose to implement a custom filesystem, written in C++, that interacts directly with the underlying block device. They claim that this design choice improves performance by eliminating the overhead of the native filesystem and its page cache. The purpose of this project is to characterise the performance gain due to MapR’s custom filesystem. Section 4 describes the experiments we would like to conduct to characterise this performance gain.

2

Motivation

Some background and motivation indicating that this is a problem worth solving.
Hadoop has been known to have performance bottlenecks arising out of its layered architecture and its reliance on the underlying OS (and filesystem) [2, 3]. MapR sets out to solve this problem. Among other changes to the Hadoop framework, MapR implements a custom filesystem that interacts directly with the underlying block device, thereby bypassing the native filesystem. MapR has published performance numbers for its architecture that are visibly better than Hadoop’s; however, it isn’t clear where these gains come from. In this project, we seek to characterise the performance benefits arising out of the custom filesystem in MapR. Furthermore, we seek to throw some light on the trade-offs of MapR’s decision to implement a custom filesystem; for instance, what drawbacks, if any, result from this change. This isn’t clear from MapR’s whitepaper, which is available on their website [1].

3

Hypothesis

The solution you propose to the problem, with some
preliminary justification.

4

Method

The method by which the solution is going to be
evaluated. Be specific: state which measurements
you intend to make, what you expect to learn from
them, etc.

a disk seek in all of the above cases. Comparing both the read bandwidth and the average run length before a disk seek for MapR and for the alternate Hadoop implementation in [2], where write requests are interleaved at the application level, would be a useful exercise.
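To make the two quantities named above concrete, the following is a minimal sketch of how they could be computed from a recorded trace of read requests. It is not the tooling used in [2] or in this project. The trace format (a list of `(offset, length)` pairs in issue order), the function names, and the rule that a seek occurs whenever a request does not start where the previous one ended are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical trace format: (offset_bytes, length_bytes), in issue order.
Request = Tuple[int, int]


def run_lengths(trace: Iterable[Request]) -> List[int]:
    """Split a request trace into sequential runs; return each run's size in bytes.

    Assumption: a seek happens whenever a request does not start exactly
    where the previous request ended.
    """
    runs: List[int] = []
    expected_offset = None  # offset at which the next request would be sequential
    for offset, length in trace:
        if offset == expected_offset:
            runs[-1] += length   # contiguous with the previous request: extend the run
        else:
            runs.append(length)  # seek (or first request): start a new run
        expected_offset = offset + length
    return runs


def average_run_length(trace: Iterable[Request]) -> float:
    """Mean number of bytes transferred between consecutive seeks."""
    runs = run_lengths(trace)
    return sum(runs) / len(runs) if runs else 0.0


def read_bandwidth(total_bytes: int, elapsed_seconds: float) -> float:
    """Bytes per second over a measured interval."""
    return total_bytes / elapsed_seconds
```

With this definition, a trace that reads two adjacent 4 KB blocks, seeks, and then reads 12 KB contiguously has two runs of 8192 and 12288 bytes. A larger average run length means fewer seeks per byte read.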

We intend to make the following measurements.
The list can be read as Measurement 1, Measurement 2, and so on. In the next section, we shall enumerate our expectations from these measurements in the same order, i.e., Expectation 1 corresponds to Measurement 1.

(1) Processor utilization for disk I/O: MapR implements a custom filesystem service that interacts directly with the underlying block device. It is also designed to do away with the filesystem page cache. Hadoop, on the other hand, relies on the native filesystem of the nodes in the cluster for disk I/O, and the native filesystem may make use of a page cache.

(3) Effect of I/O schedulers: General-purpose schedulers (used in Hadoop) are tuned for an optimal trade-off between storage latency and storage bandwidth. Thus, in the case of multiple concurrent readers/writers, there is an I/O context switch for every few hundred KB. Typical MapReduce applications favour storage bandwidth over latency, which makes the general-purpose I/O schedulers in commodity OSs unsuitable.
To evaluate the effect of...
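As an illustration of how the processor utilization in Measurement 1 might be sampled on a Linux node, the sketch below compares two readings of the aggregate `cpu` line in `/proc/stat`, taken before and after a workload. This is an assumed approach, not the procedure described by the authors. The function names are hypothetical, and counting `iowait` as non-busy time is a choice made here for illustration.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of Linux /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # older kernels report fewer fields
    return dict(zip(FIELDS, values))


def utilisation(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Fraction of CPU time spent busy and in iowait between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    not_busy = delta["idle"] + delta["iowait"]  # assumption: iowait counts as not busy
    return {
        "busy": (total - not_busy) / total,
        "iowait": delta["iowait"] / total,
    }


def sample() -> Dict[str, int]:
    """Read the current counters (Linux only)."""
    with open("/proc/stat") as f:
        return parse_cpu_line(f.readline())
```

A run would call `sample()` before and after the I/O workload on each filesystem and pass both readings to `utilisation`. Comparing the `busy` fractions for the same amount of data transferred gives a rough indication of the processor cost of each I/O path.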