This processor applies the Principal Component Analysis (PCA) algorithm to an input Dataset, which can help reveal important patterns in high-dimensional Datasets.
The processor works on any input Dataset that contains at least one numerical column to which the PCA algorithm can be applied.
This processor is configured as follows:
NOTE: The caching field (fifth field) consumes resources, mainly memory, so enabling it can noticeably degrade performance on large Datasets.
This processor has two output nodes:
- Enhanced Output (left node): returns the input Data with the score of each selected component (defined in the first configuration field).
- Details Output (right node): returns a table that includes the grouping column (second configuration field) along with the indexes, explained variance, cumulative explained variance, and standard deviation for each entry in the grouping column, as well as the loadings for each selected component.
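The quantities named in the two outputs can be illustrated with a minimal NumPy sketch. It is not the processor's implementation: the function name `pca_details`, the input values, and the choice to center (but not rescale) the columns are assumptions made for illustration, and "loadings" are taken here to be the unit-length principal directions, which may differ from the convention the processor uses.

```python
import numpy as np

def pca_details(X, n_components):
    """Return scores, loadings, standard deviations and explained-variance ratios."""
    X = np.asarray(X, dtype=float)
    # Center each column; PCA is defined on mean-centered data.
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Variance carried by each component (sample variance, n - 1 denominator).
    variances = s**2 / (X.shape[0] - 1)
    explained = variances / variances.sum()
    k = n_components
    return {
        "scores": Xc @ Vt[:k].T,          # one score per row and component
        "loadings": Vt[:k].T,             # one column per component
        "std_dev": np.sqrt(variances[:k]),
        "explained_variance": explained[:k],
        "cumulative_explained_variance": np.cumsum(explained)[:k],
    }

# Hypothetical values, for illustration only.
X = [[2.0, 1.0, 0.5],
     [3.0, 1.5, 0.7],
     [4.0, 2.5, 0.6],
     [5.0, 3.0, 1.1],
     [6.0, 3.5, 1.0]]
details = pca_details(X, n_components=2)
print(details["explained_variance"])
print(details["cumulative_explained_variance"])
```

In this sketch the scores correspond to what the Enhanced Output adds to each input row, while the remaining entries correspond to the columns of the Details Output.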
It is highly recommended to apply a "Foreach" processor to the grouping column to obtain a more understandable PCA result.
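Conceptually, looping over the grouping column amounts to running one independent PCA per group. The sketch below imitates this with pandas and NumPy. It borrows the column names of the example dataset described in this document, but the rows are invented, the helper `explained_variance_ratio` is hypothetical, and the actual behavior of the "Foreach" processor may differ.

```python
import numpy as np
import pandas as pd

def explained_variance_ratio(values):
    """Share of the total variance carried by each principal component."""
    centered = values - values.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2
    return variances / variances.sum()

# Hypothetical rows: the numbers are invented for illustration.
df = pd.DataFrame({
    "Company":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Investment": [10.0, 12.0, 15.0, 19.0, 40.0, 38.0, 45.0, 50.0],
    "Nr-Workers": [100, 110, 125, 150, 300, 310, 305, 330],
    "Profit":     [1.0, 1.4, 1.3, 2.2, 5.0, 4.1, 6.3, 6.0],
})
features = ["Investment", "Nr-Workers", "Profit"]

# One independent PCA per company, mimicking a loop over the grouping column.
per_group = {
    company: explained_variance_ratio(group[features].to_numpy(dtype=float))
    for company, group in df.groupby("Company")
}
for company, ratio in per_group.items():
    print(company, np.round(ratio, 3))
```

Keeping the groups separate means each company's components describe the variation within that company only, rather than being dominated by differences between companies.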
In this example, the PCA and Foreach processors are applied to a Dataset that records the sales, investments, profits and number of workers of three different companies for the period from 2003 to 2011.
This Dataset was created via the Custom Input Table:
The input Dataset is as follows:
The column "Company" will be used in the "Foreach" processor to split the work by each entry in this column.
It will also be used as the grouping column in the PCA processor, while the columns Investment, Nr-Workers and Profit will be used as components: