This week’s paper is about handling data at scale. For that, the MapReduce framework was invented: it gives a nice abstraction for working on data without caring about the underlying system that actually processes it.
In the beginning I was unsure whether I should choose this particular paper to introduce the topic, as it already goes one step further: MapReduce in an online (streaming) fashion. The original MapReduce only worked with batch jobs, since it benefits the system to know the input and output data of every step in the processing pipeline. I chose it nevertheless, because the paper gives a nice overview of MapReduce in general and then introduces the newer (the paper is from 2010) concept of doing the processing in an online fashion, where all steps run in parallel.
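To make the abstraction concrete, here is a minimal word-count sketch in the MapReduce style. The function names and the in-process driver are my own illustration, not the paper's code: a real framework would distribute the map, shuffle, and reduce phases across machines, which is exactly the part the user never has to write.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: sum all partial counts for one key.
    return word, sum(counts)

def run_job(documents):
    # Simulated shuffle: group intermediate values by key, as the
    # framework would do between the map and reduce phases.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_job(["to be or not to be"])
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The user only supplies `map_fn` and `reduce_fn`; everything else (partitioning, scheduling, fault tolerance) is the framework's job.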
An easily understandable start to your journey with big data 😉
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see “early returns” from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing.
HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
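The core idea behind those “early returns” can be sketched as follows. This is my own simplified illustration, not HOP's implementation: instead of materializing the full map output before the reducer starts, the reducer consumes pairs as they arrive and periodically emits a snapshot of the running aggregate, followed by the exact final answer.

```python
def mapper(stream):
    # Pipelined map phase: pairs are yielded one at a time instead of
    # being materialized to disk as a whole.
    for record in stream:
        yield record, 1

def online_reduce(pairs, snapshot_every=2):
    # Online aggregation: keep a running total per key and emit an
    # approximate snapshot every few records.
    totals = {}
    for i, (key, value) in enumerate(pairs, start=1):
        totals[key] = totals.get(key, 0) + value
        if i % snapshot_every == 0:
            yield dict(totals)  # early, approximate result
    yield dict(totals)  # final, exact result

snapshots = list(online_reduce(mapper(["a", "b", "a", "c"])))
print(snapshots[0])   # early snapshot after two records
print(snapshots[-1])  # {'a': 2, 'b': 1, 'c': 1}
```

The first snapshot is available long before the input ends, which is what makes event monitoring and stream processing expressible in this model.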