The Centroid Clustering Processor partitions data points from one or more columns into a predefined number k of clusters, a method also known as k-means clustering. The processor standardizes the data before clustering, and every row is assigned to a cluster.
Clustering is an unsupervised learning method. Different options are available for choosing the starting centers. These centers are then shifted step by step as data points are reassigned to them, until the observations are reasonably well separated.
The processor requires at least one numeric column as a clustering feature.
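The standardization step mentioned above can be illustrated with a short sketch. It assumes a z-score (subtract the mean, divide by the standard deviation); the text only says the data is standardized, so the exact method the processor uses is an assumption here, and the function name and sample values are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1 (z-score, assumed)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Illustrative values only, not taken from a real dataset.
weekly_hours = [20, 40, 38, 10, 42]
scaled_hours = standardize(weekly_hours)
```

Standardizing keeps a column with large values (for example, hours) from dominating the distance calculation over a column with small values (for example, an hourly rate).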
K: K can range from 1 to the total number of observations (n). However, assigning all observations to one cluster yields as little information as putting every observation in its own cluster, so useful values of K are greater than 1 and smaller than n.
Epsilon: If every center moves by less than epsilon in an update step, the iteration stops, even if the maximum number of steps has not been reached yet.
- random: The optimization starts with random centers.
- k-means parallel: An initialization algorithm tries to find more favorable starting centers for the optimization.
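The interplay of random starting centers, reassignment, and the epsilon stopping rule can be sketched as follows. This is a minimal illustration of the general k-means iteration, not the processor's actual implementation; the function name `k_means`, its parameters, and the default values are assumptions chosen for the example. Only the random initialization is shown.

```python
import math
import random

def k_means(points, k, epsilon=1e-4, max_steps=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # "random" initialization: pick k observations as starting centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_steps):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Stop early once every center moved by less than epsilon.
        largest_shift = max(
            math.dist(old, new) for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if largest_shift < epsilon:
            break
    return labels, centers

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels, centers = k_means(points, k=2)
```

A smaller epsilon lets the centers settle more precisely at the cost of more iterations; the maximum number of steps caps the run time either way.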
Silhouette Coefficient: A measure of how closely an observation matches the data within its own cluster and how loosely it matches the data of the neighboring cluster. The coefficient can range from -1 to +1. A higher value indicates a better choice for k.
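As a rough illustration, the standard silhouette definition compares a, the mean distance from an observation to the other members of its own cluster, with b, the mean distance to the members of the nearest other cluster, and reports (b - a) / max(a, b). The sketch below follows that textbook definition with Euclidean distance. Whether the processor uses exactly this variant, and whether it reports one value per observation or an average, is not stated here, so treat the details as assumptions. The function names are hypothetical.

```python
import math

def mean_distance(p, others):
    """Average Euclidean distance from p to a non-empty list of points."""
    return sum(math.dist(p, q) for q in others) / len(others)

def silhouette_values(points, labels):
    """Return one silhouette value per observation."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    values = []
    for i, (p, label) in enumerate(zip(points, labels)):
        own = [q for j, q in enumerate(points) if labels[j] == label and j != i]
        if not own or len(clusters) < 2:
            # Common convention: singleton clusters (or a single cluster) score 0.
            values.append(0.0)
            continue
        a = mean_distance(p, own)
        b = min(
            mean_distance(p, members)
            for other, members in clusters.items()
            if other != label
        )
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
good = silhouette_values(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_values(points, [0, 1, 0, 1])   # each cluster mixes both groups
```

Values close to +1 (as in `good`) indicate observations that sit firmly inside their cluster, while negative values (as in `bad`) indicate observations that would fit a neighboring cluster better.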
The processor forwards the input dataset with two added columns: 'Cluster', which shows the cluster assigned to each observation, and 'K', which shows the value of k used for that assignment.
If the Silhouette Coefficient option is enabled, an additional column 'Silhouette_Coefficient' with all calculated coefficients is added.
Create a dataset using the Custom Input Table Processor, for example a list of employee IDs with their weekly hours and hourly salary. The columns to be used for clustering need to be of type numeric. The aim is to classify the employees into three groups depending on how many hours they work per week and what their hourly salary is.
First Example Configuration
In this configuration of the Centroid Clustering Processor, the columns Weekly_Hours and Salary_per_Hour are selected, so the clustering is based on those columns. Single K is set to 3 to get the aforementioned three groups. The rest of the configuration is left at its default values.
The employees are then clustered into groups 0, 1 or 2 depending on their weekly hours and their hourly salary.
Second Example Configuration
In this configuration of the Processor, K is not set to a single value but to an interval of integer values. In this example the interval contains two values, with a lower bound of 3 and an upper bound of 4. The other configuration fields are still set to their defaults.
The Workflow now loops through all values of K and outputs the cluster result for each K. For a given K, the cluster labels range from 0 to K-1.
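The loop over the K interval can be pictured with the sketch below. It assumes that the results for each K are stacked as separate rows and marked by the 'K' column; the exact output layout is an assumption based on the column description. The helper names, the `Employee_Id` column, and the sample values are hypothetical, and `rank_bins` is only a placeholder for the real clustering step, not k-means.

```python
def cluster_for_k_range(rows, k_min, k_max, cluster_fn):
    """Run cluster_fn once per K and stack the labelled rows."""
    output = []
    for k in range(k_min, k_max + 1):
        labels = cluster_fn(rows, k)
        for row, label in zip(rows, labels):
            # Each input row reappears once per K with its label for that K.
            output.append({**row, "Cluster": label, "K": k})
    return output

def rank_bins(rows, k):
    """Placeholder clusterer (NOT k-means): k rank-based bins on Weekly_Hours."""
    order = sorted(range(len(rows)), key=lambda i: rows[i]["Weekly_Hours"])
    labels = [0] * len(rows)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(rows)
    return labels

# Illustrative values only.
hours = [10, 20, 30, 38, 40, 42, 45, 50]
rows = [{"Employee_Id": i, "Weekly_Hours": h} for i, h in enumerate(hours, start=1)]
result = cluster_for_k_range(rows, 3, 4, rank_bins)
```

With eight input rows and the interval 3 to 4, `result` holds sixteen rows: eight labelled 0 to 2 for K = 3 and eight labelled 0 to 3 for K = 4.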