Bucketing

UUID: 00000000-0000-0000-0026-000000000001

Description

Assigns a bucket number (starting at 1) to every value in a selected column. The number of buckets is set by the user. All buckets have the same size: (maximum column value - minimum column value) / bucket count.
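
The bucket assignment can be illustrated with a minimal Python sketch. It is not the processor's implementation. The function name assign_buckets is hypothetical, and two behaviours are assumptions that the description does not specify: the maximum value is placed in the last bucket, and a column whose values are all equal is placed entirely in bucket 1.

```python
def assign_buckets(values, bucket_count):
    """Return a 1-based bucket number for each value (illustrative sketch)."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    lo, hi = min(values), max(values)
    # Bucket size as given in the description: (max - min) / bucket count.
    size = (hi - lo) / bucket_count
    buckets = []
    for v in values:
        if size == 0:
            # Assumption: a constant column ends up entirely in bucket 1.
            buckets.append(1)
        else:
            # Assumption: the maximum value is clamped into the last bucket.
            buckets.append(min(int((v - lo) / size) + 1, bucket_count))
    return buckets


print(assign_buckets([0, 2.5, 5, 7.5, 10], 4))
```

With a minimum of 0, a maximum of 10 and 4 buckets, each bucket spans 2.5 units, so the value 5 falls into bucket 3.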


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.bucketed - Output


Configurations

Selected Column * [single column selection]

The column whose minimum and maximum values are determined and whose values are each assigned a bucket number.


Bucket Count * [integer]

The number of buckets to create. The minimum is 1.


Bucket Column Name * [column name]

The name of the additional column containing the bucket number.


Column Summary

UUID: 00000000-0000-0000-0145-000000000002


Description

Computes statistical measures for the attributes of the data.


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.metrics - Computed Metrics with Identifier



Correlation

UUID: 00000000-0000-0000-0038-000000000001

Description

Computes the correlation matrix for the given dataset.


For more details refer to the following article.


Input(s)

  • in.input - Input


Output(s)

  • out.output - Correlation Matrix


Configurations

Correlation Method * [single enum selection]

Specifies the correlation method.


Columns for correlation [multiple columns selection]

The columns selected for computing the correlation. If no column is selected here, all suitable (Double, Integer, Numeric) columns in the input are correlated with each other.
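
To illustrate what a correlation matrix over the selected columns contains, here is a minimal Python sketch. It assumes the Pearson method; the methods actually offered by the Correlation Method setting are not listed here. The function names pearson and correlation_matrix are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


data = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(data)
print(matrix["x"]["y"], matrix["x"]["z"])
```

The result is symmetric, and each column correlates perfectly with itself.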


Distinct Summary

UUID: 00000000-0000-0000-0011-000000000002

Description

Creates summaries by grouping the rows by the values of nominally scaled columns and counting the number of rows for each distinct column value.
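
The counting step can be sketched in a few lines of Python. This is only an illustration of grouping and counting per column. The function name distinct_summary and the sample data are hypothetical, and the handling of the maximum number of distinct values is left out because it is not described here.

```python
from collections import Counter


def distinct_summary(rows, columns):
    """Count, per column, how many rows carry each distinct value."""
    return {col: Counter(row[col] for row in rows) for col in columns}


rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "M"},
]
summary = distinct_summary(rows, ["color", "size"])
print(summary["color"])
```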


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.output - Distinct Summaries


Configurations

Maximum number of distinct values (defaults to 500) [integer]


Columns to include [multiple columns selection]

Selected columns that are processed in addition to all columns with nominally and ordinally scaled values.


Enable Failsafe Mode * [boolean]

If you expect your data set to contain a very large number of distinct cell values (more than 100,000), consider enabling this failsafe mode. It triggers a memory-friendly version of the summary computation, which takes longer because the data is grouped multiple times. In short, this toggle trades execution speed for stability.


Distinct Textual Summary

UUID: 00000000-0000-0000-0151-000000000001

Description

Creates summaries for every column to which the statistics are applicable. The statistics include the most frequent values, the most frequent patterns (value formats, e.g. number, uppercase and lowercase combinations), the number of invalid rows (the invalid value can be specified) and valid rows, the number of distinct values, and the minimum, mean and maximum value length (of the textual representation). These statistics are the output of this processor.


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.distincts - Computed Textual Statistics


Configurations

Invalid String (special treatment in summary) [string]

String value that should be treated as invalid and not be used as a distinct value. The number of invalid cells is also tracked per column. Defaults to the empty string.


Distinct values to take [integer]

The number of most frequent distinct values to output in the analysis. The values are separated by a "|" token and can be found in the "Most_Frequent_Distinct_Values" column. Must be positive.


Distinct Cell formats to take [integer]

The number of most frequent distinct formats to output in the analysis. The values are separated by a "|" token and can be found in the "Most_Frequent_Column_Format" column. Must be positive.


Special Characters [string]

Characters that are indicated with an "S" in the "Most_Frequent_Column_Format" column. Can also affect the number of distinct formats recorded in the "Column_Formats" column. Defaults to: /*!@#$%^&*()"{}_[]|\?/<>,
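
The idea of a value format can be sketched as follows. Only the "S" marker for special characters and the "|" separator are taken from the settings described for this processor. The markers "N" (digit), "U" (uppercase) and "L" (lowercase) and both function names are assumptions made for this illustration, so the processor's real format strings may look different.

```python
from collections import Counter

# Default taken from the Special Characters setting.
SPECIAL = set('/*!@#$%^&*()"{}_[]|\\?/<>,')


def cell_format(value, special=SPECIAL):
    """Map each character of a cell value to a format marker."""
    out = []
    for ch in value:
        if ch in special:
            out.append("S")  # documented marker for special characters
        elif ch.isdigit():
            out.append("N")  # hypothetical marker for digits
        elif ch.isupper():
            out.append("U")  # hypothetical marker for uppercase letters
        elif ch.islower():
            out.append("L")  # hypothetical marker for lowercase letters
        else:
            out.append(ch)   # anything else is kept unchanged
    return "".join(out)


def most_frequent_formats(values, take):
    """Join the `take` most frequent formats with the "|" token."""
    counts = Counter(cell_format(v) for v in values)
    return "|".join(fmt for fmt, _ in counts.most_common(take))


print(most_frequent_formats(["Ab1", "Cd2", "x_y"], 2))
```

Adding a character to the special set changes which cells share a format, which is why the setting can change the number of distinct formats.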


Forecast Metrics

UUID: 00000000-0000-0000-0143-000000000001

Description

Calculates different error measures from forecasts.


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.outMetrics - Metrics
  • out.out - Errors


Configurations

Prediction column * [single column selection]

Column with double values containing the predictions.


Original value column * [single column selection]

Column with double values containing the original values.


Grouping columns [multiple columns selection]

Can be used to specify columns over which the performance measures are aggregated.
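
As an illustration of error measures aggregated over grouping columns, the following Python sketch computes the mean absolute error (MAE) and root mean squared error (RMSE) per group. Which measures the processor actually outputs is not listed here, so MAE and RMSE are assumptions, as are the function name forecast_metrics and the sample column names.

```python
import math
from collections import defaultdict


def forecast_metrics(rows, prediction, original, group_by=()):
    """Compute MAE and RMSE per group; with no grouping there is one group."""
    errors = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in group_by)
        errors[key].append(row[prediction] - row[original])
    return {
        key: {
            "mae": sum(abs(e) for e in errs) / len(errs),
            "rmse": math.sqrt(sum(e * e for e in errs) / len(errs)),
        }
        for key, errs in errors.items()
    }


rows = [
    {"store": "A", "forecast": 12.0, "actual": 10.0},
    {"store": "A", "forecast": 8.0, "actual": 10.0},
    {"store": "B", "forecast": 5.0, "actual": 5.0},
]
metrics = forecast_metrics(rows, "forecast", "actual", group_by=["store"])
print(metrics)
```

Leaving group_by empty aggregates the errors over the whole data set under the single key ().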


Forecast Metrics For Foreach

UUID: 00000000-0000-0000-0143-000000000002

Description

Calculates different error measures from forecasts generated in a foreach branch. Outputs a single-row data set with information (the first value of the selected column) about the foreach run it was produced in.


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.out - Input carried through
  • out.metrics - Computed Metrics with Identifier


Configurations

Prediction column * [single column selection]

Column with double values containing the predictions.


Original value column * [single column selection]

Column with double values containing the original values.


Identifier for the foreach run. Select the same column as in the Foreach Distinct Processor preceding this Processor! * [single column selection]

The selected column is assumed to always have the same value in all rows of the input data set. The value of the column is used to identify the foreach-run it has been produced in.


Heuristic Summaries

UUID: 00000000-0000-0000-0004-000000000002

Description

Computes statistical measures for the attributes of the data.


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.output - Summary Results


Configurations

Compression size. [integer]

Can override the logarithmic compression with a positive fixed value. Defaults to 0 which triggers logarithmic compression. Will ignore values <= 0. Override is experimental!


Merge Interval [integer]

Interval for local AVLTreeDigest merges. Defaults to 1000. Override with positive integer values. Override is experimental!


Grouping columns [multiple columns selection]

Columns used to group the dataset before the statistics are computed.


Row Count

UUID: 00000000-0000-0000-0027-000000000001

Description

Counts the (distinct) rows in the dataset.


For more details refer to the following article.


Input(s)

  • in.data - Input


Output(s)

  • out.out - Count Output


Configurations

Create distinct row count (additionally to overall row count) * [boolean]

When toggled, distinct rows are counted, too.
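
The difference between the two counts can be shown with a short Python sketch. The function name row_counts and the result keys are hypothetical; the names of the processor's actual output columns are not given here.

```python
def row_counts(rows, distinct=False):
    """Return the overall row count and, optionally, the distinct row count."""
    result = {"row_count": len(rows)}
    if distinct:
        # Rows are tuples, so identical rows collapse into one set element.
        result["distinct_row_count"] = len(set(rows))
    return result


rows = [(1, "a"), (1, "a"), (2, "b")]
print(row_counts(rows, distinct=True))
```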


Summaries (Deprecated!)

UUID: 00000000-0000-0000-0004-000000000001
Deprecated: This Processor calculates exact values but is rather slow. To get an impression of the data, use the Heuristic Summaries Processor instead. It uses heuristics for median and percentile computation that perform well even on large datasets. Only use this Processor if you need exact values for median and percentiles.
Replaced by: Heuristic Summaries
Removed: true

Description

Computes statistical measures for the attributes of the data.


Input(s)

  • in.data - Input


Output(s)

  • out.output - Summary Results