MapReduce Case Study Solution
MapReduce is a widely used parallel computing framework for large-scale data processing. Its two major performance metrics are job execution time and cluster throughput, and both can be seriously degraded by straggler machines, that is, machines on which tasks take an unusually long time to finish. Speculative execution is a common approach to the straggler problem: slow-running tasks are simply backed up on alternative machines. Multiple speculative execution strategies have been proposed, but they share some pitfalls: i) they use the average progress rate to identify slow tasks, while in reality the progress rate can be unstable and misleading, and ii) they cannot appropriately handle the situation where there is data skew among the tasks, …
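To make the first pitfall concrete, the sketch below flags a task as slow when its average progress rate falls below a fraction of its peers' mean rate. The task records, threshold, and names are hypothetical, not Hadoop's actual scheduler code; the point is that a task working through a skewed, larger-than-average split reports a low fractional progress rate and is misclassified as a straggler even though its machine is healthy.

```python
import time
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    progress: float     # fraction of the assigned split processed, in [0, 1]
    start_time: float   # epoch seconds at which the attempt was launched

def progress_rate(task: Task, now: float) -> float:
    """Average progress rate since launch: fraction of the split per second."""
    return task.progress / max(now - task.start_time, 1e-9)

def find_slow_tasks(tasks, now, slow_factor=0.5):
    """Flag tasks whose average rate is below slow_factor * the mean rate.

    This mirrors the average-progress-rate heuristic criticized above: a task
    assigned a skewed, larger split (or one whose rate fluctuates) reports a
    low fractional rate and is flagged even though its machine is healthy.
    """
    rates = {t.task_id: progress_rate(t, now) for t in tasks}
    mean_rate = sum(rates.values()) / len(rates)
    return [tid for tid, rate in rates.items() if rate < slow_factor * mean_rate]

if __name__ == "__main__":
    now = time.time()
    tasks = [
        Task("map_000", progress=0.90, start_time=now - 60),
        Task("map_001", progress=0.85, start_time=now - 60),
        Task("map_002", progress=0.30, start_time=now - 60),  # 3x larger split
    ]
    print(find_slow_tasks(tasks, now))  # ['map_002'], despite no machine fault
```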
In a typical MapReduce job, the master divides the input files into multiple map tasks and then schedules both map tasks and reduce tasks to worker nodes in a cluster to achieve parallel processing. When a machine takes an unusually long time to complete a task (a so-called straggler machine), it significantly delays the job execution time (the time from job initialization to job completion) and degrades the cluster throughput (the number of jobs completed per second in the cluster). This problem is handled via speculative execution: a slow task is backed up on an alternative machine in the hope that the backup copy can finish faster. Google simply backs up the last few running map or reduce tasks and has observed that speculative execution can decrease job execution time by 44 percent [1]. Because of these significant performance gains, speculative execution is also implemented in Hadoop [2] and Microsoft Dryad [3] to deal with the straggler problem.
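The backup mechanism described above can be summarized as "launch a duplicate attempt and keep whichever copy finishes first." The helper below is a minimal, in-process sketch of that idea using Python threads; run_with_backup, launch_backup, and the timing constants are illustrative assumptions, not Hadoop's or Google's actual implementation. In the real systems the duplicate runs on a different machine and the losing attempt is killed; here it is merely cancelled if it has not yet started.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_with_backup(task_fn, split, pool, launch_backup, poll_interval=1.0):
    """Speculative execution sketch: keep the first attempt that finishes.

    `launch_backup()` decides whether the still-running original attempt is
    slow enough to justify a duplicate attempt (here on another thread; in
    MapReduce, on another machine).
    """
    attempts = {pool.submit(task_fn, split)}
    while True:
        done, pending = wait(attempts, timeout=poll_interval,
                             return_when=FIRST_COMPLETED)
        if done:
            for f in pending:               # losing attempt: cancel if possible
                f.cancel()
            return next(iter(done)).result()
        if len(attempts) == 1 and launch_backup():
            attempts.add(pool.submit(task_fn, split))   # back up the slow task

if __name__ == "__main__":
    def word_count(split):
        if random.random() < 0.5:           # simulate landing on a straggler
            time.sleep(5)
        return len(split.split())

    with ThreadPoolExecutor(max_workers=4) as pool:
        # Back up the task if it has not finished after one polling interval.
        print(run_with_backup(word_count, "the quick brown fox", pool,
                              launch_backup=lambda: True))
```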
It is indeed able to do so, particularly for the map and shuffle/merge phases, which overlap after a brief initialization period. For the correctness of the MapReduce programming model, however, it is necessary to ensure that the reduce phase does not start until the map phase is done for all data splits. As shown in Fig. 1, the pipeline therefore contains an implicit serialization between the shuffle/merge and reduce phases.
Fig. 1. Serialization between the shuffle/merge and reduce phases.
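The barrier in Fig. 1 can be read as a scheduling rule: sorted segments may be fetched and merged as individual map tasks finish, but the reduce function cannot be applied until the output of every map task has arrived, since any map task may still emit pairs for any reduce key. The driver below is a minimal single-process sketch of that rule; the function names and the thread-pool demo are assumptions for illustration, not Hadoop's ReduceTask code.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def run_reduce_task(map_output_futures, reduce_fn):
    """Hypothetical ReduceTask driver illustrating the implicit serialization.

    Sorted segments are fetched (shuffled) and merged as soon as each map
    task finishes, but the reduce function cannot run until the last segment
    has arrived, because any map task may still emit pairs for any key.
    """
    segments = []
    for future in map_output_futures:        # shuffle: fetch each map output
        segments.append(future.result())     # blocks until that map task is done
    merged = list(heapq.merge(*segments))    # merge the sorted segments

    # Barrier: only now, with every map output merged, may reduce begin.
    grouped = {}
    for key, value in merged:
        grouped.setdefault(key, []).append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}

if __name__ == "__main__":
    map_outputs = ([("a", 1), ("b", 1)], [("a", 1), ("c", 1)])
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(sorted, seg) for seg in map_outputs]
        print(run_reduce_task(futures, sum))  # {'a': 2, 'b': 1, 'c': 1}
```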
1.2.2 Repetitive Merges and Disk Access
Hadoop ReduceTasks merge data segments when the number of segments or their total size exceeds a threshold. However, the current merge algorithm in Hadoop often leads to repetitive merges and thus extra disk accesses. Fig. 2 shows a common sequence of merge operations in Hadoop.
Altogether, this means repetitive merges and extra disk accesses, which degrade Hadoop's performance. An alternative merge algorithm is therefore critical for Hadoop to mitigate their impact.
Fig. 2. Repetitive merges in Hadoop.
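To see how threshold-triggered merging inflates disk traffic, the simulation below merges the buffered segments whenever their count reaches a merge factor (loosely analogous to Hadoop's io.sort.factor; the factor of 10 and the segment sizes are illustrative assumptions, not measured values). Bytes merged early are read and written again in later merges, so the total merge I/O can be a multiple of the input size.

```python
def count_merge_io(segment_sizes, merge_factor=10):
    """Simulate threshold-triggered merges and count the bytes re-processed.

    Whenever the number of buffered segments reaches `merge_factor`, they are
    merged into one larger segment and spilled back to disk; data merged early
    is therefore read and written again by later merges.
    """
    segments = []
    bytes_merged = 0
    for size in segment_sizes:
        segments.append(size)
        if len(segments) >= merge_factor:
            merged = sum(segments)      # every byte here hits the disk again
            bytes_merged += merged
            segments = [merged]
    final = sum(segments)               # final merge feeding the reduce phase
    bytes_merged += final
    return bytes_merged

if __name__ == "__main__":
    sizes = [64] * 30                   # thirty 64 MB segments (illustrative)
    print(sum(sizes), count_merge_io(sizes))  # merge I/O far exceeds the input
```

With thirty 64 MB segments and a merge factor of 10, the sketch reports roughly three times the input size in merge I/O, which is exactly the kind of repeated disk access an alternative merge algorithm would need to avoid.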
1.2.3 The Lack of Network
