Motivation

In ONE DATA, data is processed in a Spark cluster. This means that the original dataset is split into several partitions, which are distributed across several worker nodes and processed in parallel. This speeds up computations, especially for large datasets.

However, after processing, the data may no longer be evenly distributed across the partitions, which can decrease performance. In such a case, the partitions can be adjusted manually using the processor discussed in this article.


Overview

The Manipulate Partitions processor rearranges the input data into a custom number of partitions, either through coalescing (which can only reduce the number of partitions) or through repartitioning. It is especially useful after joins or other operations that fragment their output.


Input

The processor works with any input dataset.


Configuration

Use Repartition

Specifies whether the input data should be rearranged through repartitioning. If set to off, coalescing is used instead. Coalescing merges existing partitions without a full shuffle, so it is usually cheap, but it does not try to achieve balanced partitions. Repartitioning redistributes the rows across the cluster, which can cost performance, but it provides more control over partitioning constraints. Unlike coalescing, repartitioning can also be used to increase the number of partitions.
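The difference between the two modes can be illustrated with a minimal sketch in plain Python. This is a simplified model, not the processor's or Spark's implementation: partitions are plain lists, the functions `coalesce` and `repartition` are hypothetical stand-ins, and real coalescing picks the partitions to merge based on where they are located rather than by index.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole partitions into n groups; no partition is ever split."""
    if n >= len(partitions):
        # Coalescing can only reduce the number of partitions.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # assumption: merge by index, not by locality
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every single row over n new partitions (full shuffle)."""
    rows = [row for part in partitions for row in part]
    result: List[Partition] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        result[i % n].append(row)
    return result


# Four partitions with 90, 4, 3 and 3 rows (hypothetical example data).
skewed = [list(range(0, 90)), list(range(90, 94)),
          list(range(94, 97)), list(range(97, 100))]

coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]        # [93, 7]
repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]  # [50, 50]
increased_sizes = [len(p) for p in repartition(skewed, 5)]      # [20, 20, 20, 20, 20]
```

In this model, coalescing keeps the imbalance because the large partition is moved as a whole, whereas repartitioning touches every row and ends up balanced.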


Partition by

The column(s) to partition by. If set, partitioning respects the selected column(s) and places rows with the same values in the selected column(s) in the same partition if possible. If no columns are specified, rows are assigned to partitions according to the rows' hash values.
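How partitioning by a column keeps equal values together can be sketched as follows. This is an illustrative model only: the function `partition_by`, the column names, and the use of CRC32 as the hash function are assumptions made for this example, and Spark uses its own hash function internally.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Sequence


def partition_by(rows: List[dict], key_columns: Sequence[str],
                 num_partitions: int) -> Dict[int, List[dict]]:
    """Assign each row to a partition based on a hash of its key columns."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        key = "|".join(str(row[column]) for column in key_columns)
        # Equal keys give equal hashes, so they land in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return dict(partitions)


# Hypothetical example data.
rows = [
    {"country": "DE", "amount": 10},
    {"country": "FR", "amount": 7},
    {"country": "DE", "amount": 3},
    {"country": "IT", "amount": 5},
    {"country": "FR", "amount": 1},
]

partitions = partition_by(rows, ["country"], num_partitions=3)
```

All rows with the same `country` end up in the same partition. Different values may still share a partition, and if one value occurs much more often than the others, the resulting partitions can be unbalanced.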


Output

The processor does not change the data itself; it only manipulates the partitions into which the dataset is split in the background. The output node therefore returns the same dataset that was passed to the input node.



Additional Information

As mentioned above, using the Manipulate Partitions processor can improve the performance of a workflow. Manipulating the partitions manually can be useful in several cases, some of which are listed below:


Small Data Partitions spread over Multiple Nodes

Unevenly distributed Data
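One simple way to quantify an uneven distribution is to compare the largest partition with the average partition size. The sketch below assumes the partition sizes are already known; the function `skew_ratio` and the example sizes are hypothetical. In plain PySpark the sizes could be obtained with `df.rdd.glom().map(len).collect()`, but whether ONE DATA exposes them is not covered here.

```python
from typing import Sequence


def skew_ratio(partition_sizes: Sequence[int]) -> float:
    """Largest partition size divided by the mean size (1.0 means balanced)."""
    total = sum(partition_sizes)
    if total == 0:
        return 1.0  # no rows (or no partitions): treat as balanced
    mean_size = total / len(partition_sizes)
    return max(partition_sizes) / mean_size


balanced_ratio = skew_ratio([25, 25, 25, 25])  # 1.0
skewed_ratio = skew_ratio([90, 4, 3, 3])       # 3.6
```

A ratio well above 1 means that one worker has to process far more rows than the others, so the whole step waits for that worker.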

Grouping certain Values in Partitions

Note that even though repartitioning can improve performance, it is good practice to always check whether it actually reduces the overall execution time. The repartitioning itself can be time-consuming, because data has to be transferred between nodes.
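This trade-off can be made concrete with a toy cost model. All numbers and both functions are hypothetical and only serve as an illustration: the model assumes one worker per partition, a fixed processing cost per row, and a fixed transfer cost per row for the shuffle. Real execution times have to be measured on the actual workflow.

```python
from typing import Sequence


def time_without_repartition(partition_sizes: Sequence[int],
                             cost_per_row: float) -> float:
    """Partitions run in parallel, so the largest one determines the time."""
    return max(partition_sizes) * cost_per_row


def time_with_repartition(partition_sizes: Sequence[int], num_partitions: int,
                          cost_per_row: float,
                          shuffle_cost_per_row: float) -> float:
    """Shuffle every row once, then process evenly sized partitions."""
    total_rows = sum(partition_sizes)
    balanced_size = -(-total_rows // num_partitions)  # ceiling division
    shuffle_time = total_rows * shuffle_cost_per_row  # assumption: linear cost
    return shuffle_time + balanced_size * cost_per_row


sizes = [90, 4, 3, 3]

# Expensive downstream processing: repartitioning pays off.
expensive_before = time_without_repartition(sizes, cost_per_row=10.0)   # 900.0
expensive_after = time_with_repartition(sizes, 4, 10.0, 2.0)            # 450.0

# Cheap downstream processing: the shuffle costs more than it saves.
cheap_before = time_without_repartition(sizes, cost_per_row=1.0)        # 90.0
cheap_after = time_with_repartition(sizes, 4, 1.0, 2.0)                 # 225.0
```

In this model, repartitioning only helps when the work done on the data afterwards is expensive enough to outweigh the cost of moving the rows.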