A Comparative Study of Approaches to Cluster-Based Large Scale Data Analysis

Overview

The goal of this research project is to understand the tradeoffs between the MapReduce and parallel database management system approaches to large-scale data analysis over large clusters of computers, and to bring together ideas from both communities. Both MapReduce and parallel database systems provide scalable data processing over hundreds to thousands of nodes. Both provide a stylized, high-level programming environment that allows users to efficiently filter and combine datasets while masking much of the complexity of parallelizing computation over a cluster. But they also differ in substantial ways, such as their approaches to fault tolerance, their data modeling requirements, their query flexibility, and their ability to function in a heterogeneous processing environment. This multi-university team of researchers is investigating the effect of these differences on the performance and scalability of the two approaches. The team is running a set of experiments that compare an open source MapReduce implementation (Hadoop) to two commercial parallel database systems on a benchmark that includes a range of tasks designed to assess the tradeoffs between the two approaches. The team seeks to understand which differences between the two approaches to large-scale data analysis are fundamental tradeoffs and which can be combined inside a single solution, so that ideas from one community can benefit the other. This work is funded by the NSF under grants IIS-0844480, IIS-0844013, and IIS-0843487.

Papers and Technical Reports

Efficient Processing of Data Warehousing Queries in a Split Execution Environment
Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and Erik Paulson. In Proceedings of SIGMOD, 2011. (bibtex)

MapReduce and Parallel DBMSs: Friends or Foes?
Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. CACM, 53(1), January 2010. (bibtex)

HadoopDB in Action: Building Real World Applications
Azza Abouzied, Kamil Bajda-Pawlikowski, Jiewen Huang, Daniel J. Abadi, and Avi Silberschatz. Demonstration. SIGMOD, 2010. (bibtex)

Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database
C. Yang, C. Yen, C. Tan, S. R. Madden, and D. J. Abadi. In Proceedings of ICDE, 2010. (bibtex)

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Azza Abouzied, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and Alexander Rasin. PVLDB, 2(1), August 2009. (bibtex)

A Comparison of Approaches to Large Scale Data Analysis
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel R. Madden, and Michael Stonebraker. In Proceedings of SIGMOD, 2009. (bibtex)