BigBench in Hadoop Ecosystem

Topics: Data, Hadoop, Apache Mahout Pages: 16 (6193 words) Published: April 23, 2014
A BigBench Implementation on the Hadoop Ecosystem
Badrul Chowdhury, Tilmann Rabl, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
badrul.chowdhury@mail.utoronto.ca, tilmann.rabl@utoronto.ca, jacobsen@eecg.toronto.edu
http://msrg.org

Abstract. BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.

In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the design choices we made and show a performance evaluation.

1 Introduction

Big data analytics is an ever-growing field of research and business. Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. Online stores are a perfect example: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed for simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible.

The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems, big data management systems (BDMSs). Similar to the advent of database management systems, there is a rapidly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this end, we have developed BigBench, the first proposal for an end-to-end big data analytics benchmark [1]. BigBench was designed to cover essential functional and business aspects of big data use cases.

In this paper, we present an alternative implementation of the BigBench workload for the Hadoop ecosystem. We re-implemented all 30 queries and ran proof-of-concept experiments on a 1 GB BigBench installation.

The rest of the paper is organized as follows. In Section 2, we present an overview of the BigBench benchmark. Section 3 introduces the parts of the Hadoop ecosystem that were used in our implementation. We give details on the transformation and implementation of the workload in Section 4. We present a proof of concept evaluation of our implementation in Section 5. Section 6 gives an overview of related work. We conclude with future work in Section 7.

2 BigBench Overview

[Figure 1: schema diagram. Labels recoverable from the figure: Structured Data (Item, Marketprice, Sales, Web Page, Customer), Semi-Structured Data (Web Log), Unstructured Data (Reviews). Legend: Adapted TPC-DS, BigBench Specific.]

Fig. 1. BigBench Schema

BigBench is an end-to-end big data analytics benchmark that was built to resemble modern analytic use cases in retail business. As the basis for the benchmark, the Transaction Processing Performance Council's (TPC) new decision support benchmark TPC-DS was chosen [2]. This choice greatly sped up the development of BigBench and made it possible to start from a solid and proven foundation. A high-level overview of the data model can be seen in Figure 1. The TPC-DS data model is a snowflake schema with 6 fact tables representing 3 sales channels (store sales, catalog sales, and online sales), each with a sales and a returns fact table. For BigBench, the catalog sales were removed, since they have decreasing significance in retail business. As can be seen in Figure 1, additional big data specific dimensions were added. Marketprice is a traditional relational table storing competitors' prices. The Web Log portion...
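The fact-table arithmetic in the paragraph above can be illustrated with a minimal sketch. It assumes table names of the form `<channel>_sales` and `<channel>_returns`, with `web` standing in for the online channel; these names follow the TPC-DS naming convention and are not taken from the text. The helper `fact_tables` is hypothetical and exists only for illustration.

```python
# Sales channels of the TPC-DS data model as described in the text.
# "web" stands in for the online channel (assumed naming).
TPCDS_CHANNELS = ["store", "catalog", "web"]

# BigBench removes the catalog channel.
REMOVED_IN_BIGBENCH = {"catalog"}


def fact_tables(channels):
    """Return one sales and one returns fact table name per channel."""
    return [f"{c}_{kind}" for c in channels for kind in ("sales", "returns")]


tpcds_facts = fact_tables(TPCDS_CHANNELS)
bigbench_channels = [c for c in TPCDS_CHANNELS if c not in REMOVED_IN_BIGBENCH]
bigbench_facts = fact_tables(bigbench_channels)

print(len(tpcds_facts), tpcds_facts)
print(len(bigbench_facts), bigbench_facts)
```

Under these assumptions, three channels yield the six sales and returns fact tables mentioned above, and dropping the catalog channel leaves four of them as the starting point for the BigBench structured data.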


References: Proceedings of the ACM SIGMOD Conference. (2013)
2
Technical report, McKinsey Global Institute (2011) http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation.
5. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM 51(1) (2008) 107–113
8
(2010) 1–10
7
Proceedings of the VLDB Endowment 2(2) (2009) 1626–1629
8
Communications in Computer and Information Science. Springer Berlin Heidelberg
(2012) 220–234
2008. ICPADS ’08. 14th IEEE International Conference on. (Dec 2008) 11–18
11
of the Third and Fourth Workshop on Big Data Benchmarking 2013. (2014) (in print).
International Symposium On High Performance Computer Architecture. HPCA
(2014)