Parallelizing The Data Cube. 

Frank Dehne, Todd Eavis, Susanne Hambrusch, Andrew RauChaplin 

Abstract: This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for topdown and one for bottomup cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual groupby in parallel as is done in previous parallel approaches. In fact, after the initial load distribution phase, each processor can compute its assigned subcube without any communication with the other processors. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottomup partitioning strategy balances the number of single attribute external memory sorts made by each processor. The topdown strategy partitions a weighted tree in which weights reflect algorithm specific cost measures like estimated groupby sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. Experimental results presented show that our partitioning strategies generate a close to optimal load balance between processors. 

paper.pdf 
paper.ps 

