October 15, 2012
Some background and motivation indicating that this
is a problem worth solving.
Hadoop has been known to have performance bottlenecks arising from its layered architecture and its reliance on the underlying OS (and filesystem) [2, 3].
MapR sets out to solve this problem. Among other
changes to the Hadoop framework, MapR implements
a custom filesystem that interacts directly with the
underlying block device, thereby bypassing the native
filesystem. MapR has published performance numbers for their architecture that are visibly better than Hadoop's; however, it isn't clear
where these gains come from. In this project,
we seek to characterize the performance benefits arising from the custom filesystem in MapR. Furthermore, we seek to shed some light on the trade-offs of MapR's decision to implement a custom filesystem; for instance, what drawbacks, if any, does this change introduce? This isn't clear from the
whitepaper available on MapR's website.
MapR is an alternate distribution of Apache Hadoop
developed by MapR Technologies. The MapR distribution claims a performance improvement over Apache Hadoop of between
3x and 20x. The aim of this project is to characterize
this performance gain and identify its likely sources. In particular, we shall focus on the performance characteristics of MapR's custom filesystem through experiments designed to evaluate its processor utilization, its disk allocation algorithm, and its interaction with native I/O schedulers.
A clear statement of the problem to be solved.
MapR designers have observed that Hadoop's layered architecture (Hadoop application to Hadoop Distributed File System (HDFS) to native filesystem
to hardware) is a design choice that degrades performance. Consequently, they chose to implement a custom filesystem, written in C++, that interacts directly with the underlying block device. They claim that this design choice improves performance by eliminating the overhead of the native filesystem and its page cache. The purpose of this project is to characterize the performance gain due to MapR's custom filesystem. Section 4 describes the experiments we would like to conduct to characterize this performance gain.
The solution you propose to the problem.
The method by which the solution is going to be
evaluated. Be specific: state which measurements
you intend to make, and what you expect to learn from them.
... a disk seek in all of the above cases. Comparing both the read bandwidth and the average run length before a disk seek for MapR and an alternate Hadoop implementation in which write requests are interleaved at the application level
would be a nice exercise.
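One way to make this comparison concrete: given a trace of logical block numbers in the order the requests reach the device (for instance, extracted from `blktrace` output; the plain list-of-integers input format here is our assumption), the average run length before a seek can be computed as:

```python
def avg_run_length(blocks):
    """Average length of maximal runs of consecutive block numbers.

    Every break in contiguity is counted as a seek, so the average
    run length is (total blocks) / (number of runs)."""
    if not blocks:
        return 0.0
    runs = 1
    for prev, cur in zip(blocks, blocks[1:]):
        if cur != prev + 1:  # discontinuity: the disk head must seek
            runs += 1
    return len(blocks) / runs

# Three runs (0-2, 10-11, 20) over six blocks:
print(avg_run_length([0, 1, 2, 10, 11, 20]))  # -> 2.0
```

A longer average run length indicates more sequential device access and, on rotational media, fewer seek penalties.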
We intend to make the following measurements.
One could read the list as Measurement 1, Measurement 2, and so on. In the next section, we enumerate our expectations from these measurements using the same ordering, i.e., Expectation 1 corresponds to Measurement 1.
(1) Processor utilization for disk I/O: MapR implements a custom filesystem service that directly interacts with the underlying block device. It is also designed to do away with the filesystem page cache. On the other hand, Hadoop relies on the native filesystem of the nodes in the cluster for disk I/O, with the native FS possibly making use of a page cache.

(3) Effect of I/O schedulers: General-purpose schedulers (used in Hadoop) are tuned for an optimal trade-off between storage latency and storage bandwidth. Thus, in the case of multiple concurrent readers/writers, there is an I/O context switch every few hundred KB. Typical MapReduce applications favour a bias towards storage bandwidth, which makes general-purpose I/O schedulers in commodity OSs unsuitable.
To evaluate the effect of...
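Measurement (1) could be approximated with a sketch like the following, which charges the user plus system CPU time of a write-then-read pass against the bytes moved. The 64 MB default, the zero-filled payload, and the temporary-file target are arbitrary test parameters of ours, not MapR's benchmark setup.

```python
import os
import tempfile

def cpu_cost_of_io(nbytes=64 << 20, chunk=1 << 20):
    """Sequentially write then read `nbytes` and return the CPU seconds
    (user + system) this process spent per GB of I/O.

    Extra copying through a filesystem page cache tends to show up
    as additional system time."""
    before = os.times()
    with tempfile.NamedTemporaryFile() as f:
        payload = b"\0" * chunk
        for _ in range(nbytes // chunk):  # write phase
            f.write(payload)
        f.flush()
        f.seek(0)
        total = 0
        while True:  # read phase
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    after = os.times()
    cpu = (after.user - before.user) + (after.system - before.system)
    return cpu * (1 << 30) / (total + nbytes)  # CPU seconds per GB moved
```

Running this against files on the native filesystem versus raw block-device access would give a first-order comparison of the CPU overhead each path imposes on disk I/O.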