A BigBench Implementation in the Hadoop Ecosystem
Badrul Chowdhury, Tilmann Rabl, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Abstract. BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.
In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the different design choices we made and show a performance evaluation.
Big data analytics is an ever-growing field of research and business. Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. Online stores are a prime example: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed only simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible.
The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems, big data management systems (BDMS). Similar to the advent of database management systems, there is a rapidly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this end, we have developed BigBench, the first proposal for an end-to-end big data analytics benchmark. BigBench was designed to cover essential functional and business aspects of big data use cases.
In this paper, we present an alternative implementation of the BigBench workload for the Hadoop ecosystem. We re-implemented all 30 queries and ran proof-of-concept experiments on a 1 GB BigBench installation.
The rest of the paper is organized as follows. In Section 2, we present an overview of the BigBench benchmark. Section 3 introduces the parts of the Hadoop ecosystem that were used in our implementation. We give details on the transformation and implementation of the workload in Section 4. We present a proof-of-concept evaluation of our implementation in Section 5. Section 6 gives an overview of related work. We conclude with future work in Section 7.
Fig. 1. BigBench Schema
BigBench is an end-to-end big data analytics benchmark that was built to resemble modern analytic use cases in retail business. As the basis for the benchmark, the Transaction Processing Performance Council’s (TPC) new decision support benchmark TPC-DS was chosen. This choice greatly sped up the development of BigBench and made it possible to start from a solid and proven foundation. A high-level overview of the data model can be seen in Figure 1. The TPC-DS data model is a snowflake schema with 6 fact tables representing 3 sales channels (store sales, catalog sales, and online sales), each with a sales and a returns fact table. For BigBench, the catalog sales were removed, since they have decreasing significance in retail business. As can be seen in Figure 1, additional big data specific dimensions were added. Marketprice is a traditional relational table storing competitors’ prices. The Web Log portion...
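The channel structure of the fact tables described above can be illustrated with a minimal sketch. The table names follow TPC-DS naming conventions and are an assumption here, as is the helper function `bigbench_fact_tables`; the text itself only speaks of store, catalog, and online sales, each with a sales and a returns table.

```python
# Fact tables of the TPC-DS snowflake schema, grouped by sales channel.
# Assumption: names follow TPC-DS conventions ("web" is the online channel);
# the text only names the channels, not the tables.
TPCDS_FACT_TABLES = {
    "store": ("store_sales", "store_returns"),
    "catalog": ("catalog_sales", "catalog_returns"),
    "web": ("web_sales", "web_returns"),
}


def bigbench_fact_tables(tpcds=TPCDS_FACT_TABLES, dropped=("catalog",)):
    """Hypothetical helper: keep every channel that is not dropped."""
    return {ch: tables for ch, tables in tpcds.items() if ch not in dropped}


kept = bigbench_fact_tables()
fact_table_count = sum(len(tables) for tables in kept.values())

# Print the remaining channels and the number of sales/returns fact tables.
print(sorted(kept), fact_table_count)
```

Under these assumptions, removing the catalog channel leaves the store and online channels with two fact tables each. The additional big data specific parts of the schema are not modeled in this sketch.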