Statistical analysis is important when working with data, especially with large amounts of data that can bring value to a company.
In many cases it is handy to divide the values into equal ranges and get a general view of where and how the data is distributed; this is what the Bucketing Processor achieves.
One straightforward example: a user wants to analyse the items sold by their company and group them into three categories (low, medium and high sales). This helps the company concentrate on its best-selling items and improve the others.
Given a bucket count as input, this processor assigns a bucket number to every numeric, non-identifier value of a selected column.
Bucketing refers to the process of dividing a set of values into equal intervals and assigning an interval index to each value of this set.
This processor needs some input data in which there is at least one column containing numeric values.
All configuration fields are mandatory: the user has to provide the column on which bucketing will be applied, the number of buckets to generate, and the name of the result column.
The processor uses the "Bucket Count" value to divide the range of values in the column of interest into intervals of equal size, and assigns each value to the corresponding interval (e.g. lower values are assigned lower indexes and vice versa).
The output of this processor is the input dataset with a new column containing the bucket indexes (the name of this column is the one declared in the third field of the processor configuration).
This output, which is also a dataset, can be used by other processors or fed into a save processor.
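The behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the processor's actual implementation: the function name add_bucket_column, the column names "value" and "bucket", the handling of non-numeric values (no bucket) and the rule that a value lying exactly on an interval boundary goes to the higher bucket are all assumptions, since the source does not specify them. The sample values are the ones used in the worked example in this section.

```python
from numbers import Real


def add_bucket_column(rows, column, bucket_count, result_column):
    """Return copies of `rows` with a 1-based bucket index in `result_column`."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")

    def is_numeric(value):
        # bool is a subclass of int in Python, so it is excluded explicitly.
        return isinstance(value, Real) and not isinstance(value, bool)

    numeric = [row[column] for row in rows if is_numeric(row.get(column))]
    if not numeric:
        raise ValueError(f"column {column!r} has no numeric values")

    low, high = min(numeric), max(numeric)
    width = (high - low) / bucket_count  # interval size

    def bucket_of(value):
        if not is_numeric(value):
            return None  # assumption: non-numeric values get no bucket
        if width == 0:
            return 1  # assumption: all values are equal, so one bucket
        index = int((value - low) / width) + 1
        # The maximum value would land one past the end; keep it in the last bucket.
        return min(index, bucket_count)

    return [{**row, result_column: bucket_of(row.get(column))} for row in rows]


rows = [{"value": v} for v in (-1, 0, 1, 2, 3, 9)]
result = add_bucket_column(rows, "value", 3, "bucket")
print([r["bucket"] for r in result])  # [1, 1, 1, 1, 2, 3]
```

The three arguments after rows mirror the three mandatory configuration fields. Because the index is computed with floating-point division, values that sit exactly on an interior boundary may land in either neighbouring bucket.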
Let's assume that the column of interest contains the values -1, 0, 1, 2, 3 and 9.
The bucket count is, for example, equal to 3.
The interval size of each bucket is equal to:
[ Max_Value - Min_Value ] / Bucket_Count
=> so the interval size is (9 - (-1)) / 3 = 10 / 3 ≈ 3.33
- the first interval is [-1, -1 + 3.33] = [-1, 2.33] and it contains four values (-1, 0, 1 and 2)
- the second interval is [2.33, 5.67] and it contains one value (3)
- the last interval is [5.67, 9] and it contains one value as well (9)
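So the result is as follows: the values -1, 0, 1 and 2 are assigned index 1, the value 3 is assigned index 2, and the value 9 is assigned index 3.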
Each value in the bucketed column thus belongs to one interval and is assigned the index of that interval (indexes range from 1 to the bucket count selected by the user).
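The arithmetic of this example can be checked with a short Python sketch. It recomputes the interval size, the interval boundaries and the number of values per interval for the values -1, 0, 1, 2, 3 and 9 with a bucket count of 3. The variable names are illustrative only, and placing the maximum value in the last interval is an assumption about boundary handling.

```python
values = [-1, 0, 1, 2, 3, 9]
bucket_count = 3

low, high = min(values), max(values)
width = (high - low) / bucket_count  # (9 - (-1)) / 3

# Interval boundaries: low, low + width, ..., high.
edges = [low + i * width for i in range(bucket_count)] + [float(high)]

# Count how many values fall into each interval (0-based position here).
counts = [0] * bucket_count
for v in values:
    position = min(int((v - low) / width), bucket_count - 1)
    counts[position] += 1

print(round(width, 2))               # 3.33
print([round(e, 2) for e in edges])  # [-1.0, 2.33, 5.67, 9.0]
print(counts)                        # [4, 1, 1]
```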