The Grouped Decision Tree Processor is used to divide the input into different groups, each of which is used to train a decision tree and generate a forecast.
This processor has two input nodes:
- Left Node: contains the training Dataset (dependent and independent columns)
- Right Node: contains the test set (can contain only independent columns)
- The training and test sets must have the same schema, so it is recommended to create both with the Horizontal Split Processor from a single input Dataset.
- Since a tree is created for each group, it is strongly recommended to use the Foreach Processor to split the input Dataset into groups with respect to a specific column.
- It is also recommended to place the Foreach Processor before the Horizontal Split Processor so that each group is divided into training and test sets. Otherwise the whole input Dataset is split at once, and some groups might be divided unevenly between the training and test sets.
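The recommended order (group first, then split) can be sketched in plain Python. This is only an illustration of the idea, not the platform's implementation; the helper name `split_per_group`, the 80% training fraction and the toy rows are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_per_group(rows, group_key, train_fraction=0.8, seed=0):
    """Group the rows first, then split every group into training and test rows."""
    rng = random.Random(seed)

    # Step 1 (Foreach): collect the rows of each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Step 2 (Horizontal Split): split each group on its own.
    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        cut = int(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy rows: three groups with ten rows each (values are placeholders).
rows = [
    {"variety": variety, "petal_length": float(i)}
    for variety in ("Setosa", "Versicolor", "Virginica")
    for i in range(10)
]
train, test = split_per_group(rows, "variety")
```

Because the split happens inside each group, every group contributes rows to both sets. Splitting the whole Dataset first gives no such guarantee.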
The processor is configured as follows:
Select the dependent column for the creation of the decision tree models. Whether classification or regression is performed depends on the type of the dependent variable: Integer, Text and Date columns are used for classification, Numeric and Double columns for regression.
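The type rule above can be written down as a small lookup. The function name `infer_task` and the lower-case type labels are hypothetical and only mirror the wording of this page; they are not part of the processor's interface.

```python
# Column types as named in this documentation, in lower case (assumed labels).
CLASSIFICATION_TYPES = {"integer", "text", "date"}
REGRESSION_TYPES = {"numeric", "double"}


def infer_task(column_type):
    """Return the kind of tree that the dependent column's type implies."""
    normalized = column_type.strip().lower()
    if normalized in CLASSIFICATION_TYPES:
        return "classification"
    if normalized in REGRESSION_TYPES:
        return "regression"
    raise ValueError(f"Type not covered by the documented rule: {column_type!r}")
```

Note that an Integer dependent column leads to classification, so a count-like target may need to be converted to Numeric or Double if regression is intended.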
Define the independent columns for the creation of the decision tree models.
Name Of The Forecast Column
Name that should be used for the forecast column in the output. It must differ from all existing column names in the forecast Dataset and must not contain whitespace.
The default value is "Forecast".
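A minimal sketch of the two naming constraints, assuming a hypothetical helper `validate_forecast_column` that receives the column names of the forecast Dataset:

```python
def validate_forecast_column(name, existing_columns):
    """Raise ValueError if the name breaks one of the two documented rules."""
    if any(ch.isspace() for ch in name):
        raise ValueError("The forecast column name must not contain whitespace.")
    if name in existing_columns:
        raise ValueError(f"The column {name!r} already exists in the forecast Dataset.")
    return name


columns = ["sepal_length", "sepal_width", "petal_width", "variety"]
checked = validate_forecast_column("Forecast", columns)
```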
Group By Column
Select the Column to group by. A decision tree model will be computed for every group.
Maximal Number Of Leafs In The Decision Trees
Select the maximum number of leaves that each computed decision tree may have.
The default value is 6.
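To make the effect of the leaf limit concrete, here is a small regression tree in plain Python that stops growing once it has `max_leaves` leaves. How the processor enforces the limit is not documented here; the sketch assumes best-first growth (always split the leaf with the largest error reduction), and all function names are hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared errors of the values around their mean."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target, features):
    """Return (gain, feature, threshold, left_rows, right_rows), or None."""
    base = sse([r[target] for r in rows])
    best = None
    for feature in features:
        values = sorted({r[feature] for r in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for r in rows if r[feature] <= threshold]
            right = [r for r in rows if r[feature] > threshold]
            gain = base - sse([r[target] for r in left]) - sse([r[target] for r in right])
            if best is None or gain > best[0]:
                best = (gain, feature, threshold, left, right)
    return best


def fit_tree(rows, target, features, max_leaves=6):
    """Grow a regression tree best-first until it has max_leaves leaves."""
    root = {"rows": rows}
    leaves = [root]
    while len(leaves) < max_leaves:
        # Score every current leaf and keep only splits that reduce the error.
        scored = [(best_split(leaf["rows"], target, features), leaf) for leaf in leaves]
        scored = [(split, leaf) for split, leaf in scored if split and split[0] > 0]
        if not scored:
            break
        split, leaf = max(scored, key=lambda item: item[0][0])
        _, feature, threshold, left, right = split
        leaf.update(feature=feature, threshold=threshold,
                    left={"rows": left}, right={"rows": right})
        leaves = [l for l in leaves if l is not leaf] + [leaf["left"], leaf["right"]]
    for leaf in leaves:
        leaf["value"] = fmean(r[target] for r in leaf["rows"])
    return root


def predict(node, row):
    """Walk from the root to a leaf and return the leaf's mean target."""
    while "feature" in node:
        side = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["value"]


def count_leaves(node):
    if "feature" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Synthetic data with ten distinct target levels; the tree may use only six leaves.
rows = [{"x": float(i), "y": float(i // 2)} for i in range(20)]
tree = fit_tree(rows, "y", ["x"], max_leaves=6)
```

A lower limit yields coarser forecasts, because several target levels have to share one leaf. A higher limit fits the training data more closely. The sketch expects at least one training row.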
Number Of Training Data Partitions
The number of data partitions the training input should be arranged in. If set, the data is re-partitioned by the rows' hash values before the 'group by' operation is performed. This can improve performance.
Number Of Forecast Data Partitions
The number of data partitions the forecast input should be arranged in. If set, the data is re-partitioned by the rows' hash values before the 'group by' operation is performed. This can improve performance.
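Hash-based re-partitioning, as described for both partition settings, can be pictured as follows. The sketch is an assumption about the general technique, not the processor's actual code: the function name `repartition_by_hash` is hypothetical, and `zlib.crc32` merely stands in for whatever hash function the platform uses.

```python
import zlib


def repartition_by_hash(rows, num_partitions):
    """Assign every row to the partition given by its hash modulo num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
        digest = zlib.crc32(repr(row).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


rows = [("Setosa", 1.4), ("Setosa", 1.4), ("Virginica", 5.1), ("Versicolor", 4.5)]
parts = repartition_by_hash(rows, 3)
```

Identical rows always land in the same partition, and distinct rows tend to spread evenly, which balances the work that precedes the 'group by' step.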
This processor returns the input Dataset along with an additional column containing the forecast values.
If this processor is preceded by a Foreach Processor, its output is arranged with respect to the values of the grouping column (defined in the Foreach Processor).
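The shape of the output can be sketched as follows. To keep the example short, the per-group model is replaced by a stand-in (the mean of the group's training targets) where the processor would train a decision tree; the function name `grouped_forecast` and the toy rows are assumptions.

```python
from collections import defaultdict
from statistics import fmean


def grouped_forecast(train_rows, test_rows, group_by, target, forecast_name="Forecast"):
    """Return the test rows with one extra forecast column, arranged by group."""
    # Collect the training targets of every group.
    targets = defaultdict(list)
    for row in train_rows:
        targets[row[group_by]].append(row[target])

    # Stand-in model per group: the mean target (a decision tree in the real processor).
    models = {group: fmean(values) for group, values in targets.items()}

    # Keep all input columns and append the forecast column.
    # A test row whose group has no training rows raises KeyError in this sketch.
    output = [{**row, forecast_name: models[row[group_by]]} for row in test_rows]
    return sorted(output, key=lambda row: row[group_by])


train_rows = [
    {"variety": "A", "y": 1.0},
    {"variety": "A", "y": 3.0},
    {"variety": "B", "y": 10.0},
]
test_rows = [{"variety": "B"}, {"variety": "A"}]
result = grouped_forecast(train_rows, test_rows, "variety", "y")
```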
In this example, the Grouped Decision Tree Processor is applied to the Iris Dataset to study the relation between a flower's petal length (dependent column) and its petal width, sepal length and sepal width (independent columns) within each flower category (the group by column).
The Foreach Processor is used to split the Dataset with respect to the entries of the "variety" column (this column is also used as the grouping column in the Grouped Decision Tree Processor).
The Horizontal Split Processor divides the input Dataset into training (80%) and test (20%) sets.
The Result Table Processor can be used to visualize the outputs.