BigBench in Hadoop Ecosystem

A BigBench Implementation on the Hadoop Ecosystem
Badrul Chowdhury, Tilmann Rabl, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
badrul.chowdhury@mail.utoronto.ca, tilmann.rabl@utoronto.ca, jacobsen@eecg.toronto.edu
http://msrg.org

Abstract. BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.
In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the design choices we made and show a performance evaluation.

1 Introduction

Big data analytics is an ever-growing field of research and business. Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. Online stores are a perfect example: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed for simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible.
The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems, big data management systems (BDMSs). Similar to the advent of database management systems, there is a rapidly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this



