# Add Constant String

**UUID:** 00000000-0000-0000-0108-000000000001

## Description

Adds a new column with a user specified string value.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.addedConstant*- Output

## Configurations

#### New Column * *[column name]*

Determine the name of the new column.

#### Input String value * *[string]*

The string value written to every row of the new column.

# Add Multiple Column Name Prefix

**UUID:** 00000000-0000-0000-0177-000000000001

## Description

Renames multiple columns by adding a prefix.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Select Columns *[multiple columns selection]*

Columns to be renamed. By default (no selection), all available columns are renamed.

#### Prefix * *[string]*

Specify a prefix for selected columns

# Alphanumeric to Numeric ID

**UUID:** 00000000-0000-0000-0253-000000000001

## Description

This processor converts an alphanumeric ID to a numeric one. It outputs both the original data with an added numeric ID column and a mapping for later rejoining. The result ordering is random. It also supports double and timestamp values.
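The conversion and the accompanying mapping output can be illustrated with a minimal Python sketch. The function name `to_numeric_ids` and the column name `numeric_id` are hypothetical, and the sketch assigns IDs in first-seen order for readability, whereas the processor states that its result ordering is random.

```python
def to_numeric_ids(rows, column, id_column="numeric_id"):
    """Assign one integer per distinct value of `column` (illustrative only)."""
    mapping = {}
    data = []
    for row in rows:
        key = row[column]
        if key not in mapping:
            # First time this value is seen: give it the next free integer.
            mapping[key] = len(mapping)
        data.append({**row, id_column: mapping[key]})
    # Second output: the mapping table that allows rejoining later.
    id_mapping = [{column: k, id_column: v} for k, v in mapping.items()]
    return data, id_mapping


rows = [{"id": "A7"}, {"id": "B2"}, {"id": "A7"}]
output_data, output_id_mapping = to_numeric_ids(rows, "id")
```

Rows that share the same alphanumeric ID receive the same numeric ID, so the mapping table has one entry per distinct value.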

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.outputData*- Data Output

*out.outputIDMapping*- Output ID-Mapping

## Configurations

#### Column to Convert * *[single column selection]*

Select the column to convert to a numeric representation type.

# Auto-Interval Aggregation

**UUID:** 00000000-0000-0000-0012-000000000002

## Description

Aggregates a data set based on a time stamp or ratio scaled column. Based on the number of data points to be computed by the processor the aggregation interval varies automatically.

**For more details refer to the following article.**

## Input(s)

*in.data*- Aggregation Input

## Output(s)

*out.out*- Aggregated Output

## Configurations

#### Cuts * *[integer]*

Number of aggregated cuts the processor should compute (this is also the number of rows in the output).

#### Aggregation column * *[single column selection]*

Column containing values to aggregate over.

#### Aggregation Function * *[single enum selection]*

Aggregation function over all columns except for those selected in Aggregation Column.

# Binarization

**UUID:** 00000000-0000-0000-0100-000000000002

## Description

Generates multiple binarized columns from one nominal scaled column for each unique value in the nominal scaled column. Each binarized column will contain a 1 in a row when the corresponding nominal value is present in it. Otherwise there will be a 0.
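The binarization described above can be sketched in a few lines of Python. The function name `binarize`, the choice to keep the original column, and the naming scheme prefix + value for the new columns are assumptions made for illustration, not confirmed behavior of the processor.

```python
def binarize(rows, column, prefix=""):
    """Add one 0/1 column per distinct value found in `column`."""
    # Collect distinct values in first-seen order so the output is stable.
    distinct = list(dict.fromkeys(row[column] for row in rows))
    result = []
    for row in rows:
        new_row = dict(row)
        for value in distinct:
            # 1 if this row carries the nominal value, otherwise 0.
            new_row[f"{prefix}{value}"] = 1 if row[column] == value else 0
        result.append(new_row)
    return result


rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
binarized = binarize(rows, "color", prefix="is_")
```

Each row ends up with exactly one binarized column set to 1, the one matching its own nominal value.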

**For more details refer to the following article.**

## Input(s)

*in.data*- Binarization Input

## Output(s)

*out.out*- Binarization Output

## Configurations

#### Selected Column Name * *[single column selection]*

Column with nominal scale to binarize from.

#### Prefix *[string]*

Choose a prefix for the new binarized columns.

# Caching

**UUID:** 00000000-0000-0000-0019-000000000003

## Description

Caches the input and forwards the cached data set. Can improve performance on iterative calculations on the same input data set. No configuration needed.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Cached Input

## Configurations

#### Dataframe-based caching * *[boolean]*

Cache the Dataframe instead of the raw RDD. This saves cache space and may improve subsequent and overall query performance due to more optimization options for Spark.

# Columnization

**UUID:** 00000000-0000-0000-0064-000000000001

## Description

Generate new columns based on the respective values in an existing column.

**For more details refer to the following article.**

## Input(s)

*in.input*- Columnization Input

## Output(s)

*out.output*- Multi-Column Columnization Output

## Configurations

#### ID / Grouping Columns *[multiple columns selection]*

The input data is grouped by the given column(s). The remaining columns in each group are then aggregated according to the specified columnization columns. If no grouping columns are selected, each columnized row is treated as its own group and no aggregation is done; only default values are added for those columns where no value can be found after columnization.

#### Columnization Values Columns * *[multiple columns selection]*

The values of these columns are transformed into new columns. The new column names follow this schema: columizedColumnName_columizedColumnValue_duplicatedColumnName, where duplicatedColumnName refers to the columns that are aggregated.

#### Columnize Top K Distinct Values * *[integer]*

The top k distinct values per column that should be used for columnization. Note that at most 1000 distinct values are used across all columnized columns.

#### Text Separator *[string]*

The text separator is used when textual columns are aggregated with the aggregation type append. Default is ", ".

#### Date Aggregation *[enum composed]*

Configure default values and date columns that should be aggregated with a certain aggregation type.

###### Date Aggregation > Aggregation Columns * *[multiple columns selection]*

The date columns that should be aggregated according to the columnization and grouping columns.

###### Date Aggregation > Default Value * *[timestamp]*

This value is used if a columnized column has no datapoints in a group.

#### Number Aggregation *[enum composed]*

Configure default values and number columns that should be aggregated with a certain aggregation type.

###### Number Aggregation > Aggregation Columns * *[multiple columns selection]*

The number columns that should be aggregated according to the columnized and grouping columns.

###### Number Aggregation > Default Value * *[double]*

This value is used if a columnized column has no datapoints in a group.

#### Text Aggregation *[enum composed]*

Configure default values and text columns that should be aggregated with a certain aggregation type.

###### Text Aggregation > Aggregation Columns * *[multiple columns selection]*

The text columns that should be aggregated according to the columnized and grouping columns.

###### Text Aggregation > Default Value * *[string]*

This value is used if a columnized column has no datapoints in a group.

#### Use broadcast joins * *[boolean]*

If multiple columns are selected this processor internally performs columnization on each of them and joins the resulting tables into a single table. A broadcast join can greatly improve performance in situations where a large table (fact) is joined with relatively small tables (dimensions). You are responsible for making sure the broadcasted table does not exceed memory limits of the workers!

# Column Selection

**UUID:** 00000000-0000-0000-0002-000000000001

## Description

Select a subset of columns for further processing.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.data*- Output

## Configurations

#### Column/-s * *[multiple columns selection]*

Columns not selected get excluded.

# Data Filter

**UUID:** 00000000-0000-0000-0022-000000000001

## Description

Processor that matches the input data according to a specified condition within a column. Only rows that match the condition are kept.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Selected column * *[single column selection]*

Column in which the selected value should be matched.

#### Relational Operator * *[single enum selection]*

Operator to use for data selection on specified column.

#### Value * *[string]*

Value that should be matched in the selected column. To match more than one value, separate the values with "," (e.g. for IN or NOT_IN).
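How a comma-separated value could be interpreted for the IN and NOT_IN operators is sketched below in Python. The function name `filter_rows` is hypothetical, only the two operators named above are covered, and trimming whitespace around each value and comparing values as strings are assumptions.

```python
def filter_rows(rows, column, operator, value):
    """Keep only the rows whose `column` matches the condition."""
    # Split the configured value on "," into individual candidates.
    candidates = [v.strip() for v in value.split(",")]
    if operator == "IN":
        return [r for r in rows if str(r[column]) in candidates]
    if operator == "NOT_IN":
        return [r for r in rows if str(r[column]) not in candidates]
    raise ValueError(f"operator not covered by this sketch: {operator}")


rows = [{"country": "DE"}, {"country": "FR"}, {"country": "US"}]
kept_in = filter_rows(rows, "country", "IN", "DE, FR")
kept_not_in = filter_rows(rows, "country", "NOT_IN", "DE, FR")
```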

# Data Replication

**UUID:** 00000000-0000-0000-0009-000000000001

## Description

Takes one data set as input and outputs two independent and identical instances of the input data set. No configuration necessary.

**For more details refer to the following article.**

## Input(s)

*in.data*- data

## Output(s)

*out.out1*- First Replicate

*out.out2*- Second Replicate

# Data Type Conversion

**UUID:** 00000000-0000-0000-0255-000000000002

## Description

Performs a type conversion on one or multiple columns of the input. Columns can be converted to string, datetime, double and integer representation types. This processor replaces Timestamp Parsing and Multiple Timestamp Parsing processors.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Convert to String *[composed]*

Converts the selected columns to String.

###### Convert to String > Columns to convert *[multiple columns selection]*

Specifies the columns to convert to string representation type.

#### Convert to Datetime *[composed]*

Converts the selected columns to timestamp type. String columns are parsed with the given format. Numeric columns are treated as timestamps.

###### Convert to Datetime > Columns to convert * *[multiple columns selection]*

Specifies the columns to convert to timestamp representation type.

###### Convert to Datetime > Fallback value * *[timestamp]*

Fallback value that is used if the original value can't be converted to datetime.

###### Convert to Datetime > Format * *[string]*

Specify the datetime format as it is given in the selected columns. Works with "M/d/y H:m:s.S" (only one M,d,H,m,s,S necessary). The most commonly used constants for format specification are:

| Constant | Datetime component |
|---|---|
| y | Year |
| M | Month |
| d | Day |
| H | Hour in day (0-23) |
| m | Minute |
| s | Second |
| S | Millisecond |

All available constants are explained in detail in the following link.

**Examples:**

| Column format / value | Format in processor | Output datetime |
|---|---|---|
| 14/3/2016 | d/M/y | 2016-03-14 00:00:00.0 |
| 2016-01-20 07:20:05.123 | y-M-d H:m:s.S | 2016-01-20 07:20:05.123 |
| 6:12 | H:m | 1970-01-01 06:12:00.0 |
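The format constants and examples above can be tried out with a rough Python approximation. This is only a sketch: the mapping `DIRECTIVES`, the helpers `to_strptime` and `parse`, and the handling of the 1970-01-01 default date are assumptions for illustration, and the sketch only supports formats that use each constant as a single letter.

```python
from datetime import datetime

# Hypothetical mapping from the constants listed above to strptime directives.
DIRECTIVES = {"y": "%Y", "M": "%m", "d": "%d", "H": "%H",
              "m": "%M", "s": "%S", "S": "%f"}


def to_strptime(fmt):
    """Translate a single-letter format such as 'd/M/y' for strptime."""
    return "".join(DIRECTIVES.get(ch, ch) for ch in fmt)


def parse(value, fmt, fallback):
    """Parse `value` with `fmt`; return `fallback` if it cannot be converted."""
    try:
        parsed = datetime.strptime(value, to_strptime(fmt))
    except ValueError:
        return fallback
    # strptime fills a missing date with 1900-01-01, while the examples
    # above show 1970-01-01, so the year is adjusted when none was given.
    if "y" not in fmt:
        parsed = parsed.replace(year=1970)
    return parsed


FALLBACK = datetime(1970, 1, 1)
day_first = parse("14/3/2016", "d/M/y", FALLBACK)
time_only = parse("6:12", "H:m", FALLBACK)
unparsable = parse("not a date", "d/M/y", FALLBACK)
```

The last call shows the role of the fallback value: a value that does not match the format yields the configured fallback instead of an error.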

#### Convert to Double *[composed]*

Converts the selected columns to double type.

###### Convert to Double > Columns to convert * *[multiple columns selection]*

Specifies the columns to convert to double representation type.

###### Convert to Double > Fallback value * *[double]*

Fallback value that is used if the original value can't be converted to double.

#### Convert to Integer *[composed]*

Converts the selected columns to integer type.

###### Convert to Integer > Columns to convert * *[multiple columns selection]*

Specifies the columns to convert to integer representation type. Strings are parsed and datetime values are treated as UNIX timestamps for this conversion.

###### Convert to Integer > Fallback value * *[integer]*

Fallback value that is used if the original value can't be converted to integer or if the source value is NaN.

###### Convert to Integer > Rounding Strategy * *[single enum selection]*

With this option a rounding mode for double-like values in columns can be chosen. The rounding modes are explained in detail in the following link.

# Distinct Rows

**UUID:** 00000000-0000-0000-0039-000000000001

## Description

This processor returns all distinct rows in a dataset. If the columns are limited to a certain selection, the distinct rows are computed for the selected columns only. Columns that are not selected are no longer part of the output.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Distinct Rows

## Configurations

#### Output Columns *[multiple columns selection]*

Select which columns should be contained in the output. These are also the columns used for the distinct computation. If no column is selected, all columns are used.
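The effect of the column selection on the distinct computation can be illustrated with a short Python sketch. The function name `distinct_rows` is hypothetical, and keeping the first occurrence of each combination is an assumption about ordering.

```python
def distinct_rows(rows, columns=None):
    """Return distinct rows, restricted to `columns` if any are selected."""
    seen = set()
    result = []
    for row in rows:
        # No selection means all columns take part in the comparison.
        selected = columns if columns else list(row)
        key = tuple(row[c] for c in selected)
        if key not in seen:
            seen.add(key)
            # Columns that were not selected are dropped from the output.
            result.append({c: row[c] for c in selected})
    return result


rows = [
    {"city": "Berlin", "year": 2020},
    {"city": "Berlin", "year": 2021},
    {"city": "Paris", "year": 2020},
]
all_columns = distinct_rows(rows)
only_city = distinct_rows(rows, ["city"])
```

With no selection all three rows are distinct, while restricting the output to `city` leaves two rows.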

# Exclude Columns

**UUID:** 00000000-0000-0000-0002-000000000002

## Description

Excludes selected columns from the input dataset.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.data*- Output

## Configurations

#### Exclude these columns * *[multiple columns selection]*

All columns that are selected here will be removed in the output of this Processor.

#### Allow empty output * *[boolean]*

If this option is enabled, the output of this Processor is allowed to have no columns at all. Note that an empty output will almost certainly cause errors in subsequent Processors. Only certain Processors can deal with completely empty input (e.g. the Collect Records Processor).

# Extended Mathematical Operation

**UUID:** 00000000-0000-0000-0049-000000000002

## Description

Processor that executes a SQL query containing a complex operation. The result is saved in a new column, or, if the given name matches an already existing column, the values in the existing column are replaced. If all column names are new, the executed query looks as follows: "SELECT *, [expression] as [new column name] FROM input"
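The query shape can be tried out locally with a small Python sketch. It assumes the query has the form SELECT *, expression AS new column FROM input, and it uses SQLite from the Python standard library as a stand-in, so it does not reproduce Spark SQL semantics. The helper `add_computed_column` and the sample table are hypothetical.

```python
import sqlite3


def add_computed_column(conn, new_column, expression):
    """Run the assumed query shape and return column names and rows."""
    # Plain string formatting is used for illustration only; do not build
    # queries like this from untrusted input.
    query = f'SELECT *, {expression} AS "{new_column}" FROM input'
    cursor = conn.execute(query)
    names = [d[0] for d in cursor.description]
    return names, cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (price REAL, quantity REAL)")
conn.executemany("INSERT INTO input VALUES (?, ?)", [(2.5, 4), (10.0, 3)])
names, rows = add_computed_column(conn, "total", "price * quantity")
```

The input columns are passed through unchanged and the computed column is appended as the last column.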

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.output*- Output with (additional) computed Column

## Configurations

#### New Column Name / Column to Replace * *[column name]*

If a column with the given name is already present, its values will be replaced with the newly computed values. If the column name is unique, a new column is added to the output.

#### Expression to be Computed * *[string]*

The expression that should be used within the SQL statement to be executed. It may only contain valid Spark SQL operations. The column names of the input table can be referenced. The resulting column is always treated as having the type ratio-scaled Double. If the actual type differs from that, you have to use a Stringify Processor followed by the desired conversion processor to get the correct type.

#### Additional Operation *[composed]*

Additional computations that also add new columns to the input table, or alter existing column values.

###### Additional Operation > New Column Name / Column to Replace * *[column name]*

If a column with the given name is already present, its values will be replaced with the newly computed values. If the column name is unique, a new column is added to the output.

###### Additional Operation > Expression to be Computed * *[string]*

The expression that should be used within the SQL statement to be executed. It may only contain valid Spark SQL operations. The column names of the input table can be referenced. The resulting column is always treated as having the type ratio-scaled Double. If the actual type differs from that, you have to use a Stringify Processor followed by the desired conversion processor to get the correct type.

# Extract Regex

**UUID:** 00000000-0000-0000-0044-000000000001

## Description

Extracts characters, digits or complete strings to a new column using a regular expression (Regex).

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Output

## Configurations

#### Selected column * *[single column selection]*

Select column containing characters, digits or strings from which the regex matches should be extracted.

#### Regular Expression * *[string]*

Specify the Regular Expression that matches the desired strings in the selected column. For Syntax see documentation of java.util.regex.Pattern.

#### New column name * *[column name]*

Name the new column that contains the Regex matches.

#### Default value *[string]*

Default value that is used when no regex-match is found in a single cell. Defaults to empty String.
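The interplay of the regular expression, the new column, and the default value can be sketched in Python. Note that the processor refers to java.util.regex.Pattern syntax, while this sketch uses Python's `re` module, which differs in some details. The function name `extract_regex` and the choice to extract the first match in each cell are assumptions.

```python
import re


def extract_regex(rows, column, pattern, new_column, default=""):
    """Write the first regex match of each cell into `new_column`."""
    compiled = re.compile(pattern)
    result = []
    for row in rows:
        match = compiled.search(str(row[column]))
        # Fall back to the default value when the cell has no match.
        extracted = match.group(0) if match else default
        result.append({**row, new_column: extracted})
    return result


rows = [{"code": "order-1234"}, {"code": "no digits"}]
extracted = extract_regex(rows, "code", r"\d+", "number", default="n/a")
```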

# Filter Combination

**UUID:** 00000000-0000-0000-0164-000000000001

## Description

Define multiple filter combinations in the right input table and apply them to the left input table.

**For more details refer to the following article.**

## Input(s)

*in.table1*- Input for inputTable

*in.table2*- Input for queryTable

## Output(s)

*out.data*- Query Results

## Configurations

#### Filter Columns * *[multiple columns selection]*

Select the columns containing the possibilities to be created.

#### Combinations *[integer]*

Size of permutation sets (default value is 4)

#### Invalid Value *[string]*

Value that is ignored in the filter table when checking filters. Will yield a filter match for the value.

# Foreach

**UUID:** 00000000-0000-0000-0033-000000000001

## Description

Metaprocessor that determines all distinct values of a selected column. Afterwards, all following operations are applied to every single element of those distinct values (like a for loop which is only applied to distinct values). Be aware that the following path is run n times (Save new / Append).

**For more details refer to the following article.**

## Input(s)

*in.input*- ForEach distinct input.

## Output(s)

*out.output*- ForEach distinct output.

## Configurations

#### Distinct Column * *[single column selection]*

Select the column that contains the distinct values for which the following operations should be performed separately.

# GeoJSON

**UUID:** 00000000-0000-0000-0082-000000000001

## Description

Converts the input to complete GeoJSON and outputs it as result.

**For more details refer to the following article.**

## Input(s)

*in.geojsonInput*- GeoJSON Input

## Configurations

#### Geo Type * *[single column selection]*

The type of the features (Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon).

#### Coordinates * *[single column selection]*

The coordinates for the features.

#### Properties * *[multiple columns selection]*

The properties for the features.

#### Grouping Column *[single column selection]*

Select a column to group the input by. For each distinct value of this group a result will be output.

# Group By Aggregation

**UUID:** 00000000-0000-0000-0040-000000000001

## Description

Aggregation grouped by distinct values of the selected columns. Selecting more than one column results in aggregations over combinations of column values.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Group by *[multiple columns selection]*

Column/-s containing distinct values to aggregate over.

#### Aggregation Function * *[single enum selection]*

Select the aggregation function that is performed on each column except for the column/-s selected above. Be aware that the same aggregation may not make sense for all of the remaining columns.

# Grouped Peak Elimination

**UUID:** 00000000-0000-0000-0152-000000000002

## Description

Eliminates peaks in input signals. A correction value (either the RMS or the mean of the given window) is computed, and each sample is checked to ensure it does not differ from the correction value by more than X times the correction value. All samples that fall outside the valid range are then replaced by interpolated values, together with the surrounding samples. How many surrounding samples are replaced can be set with the before and after options for elimination. Because all samples of one window have to be collected into RAM, the window size should not be too high. If it is left empty, a sensible default depending on the dataset is chosen.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.smoothed*- Input data without peaks

*out.correctionInfo*- Correction Statistics

## Configurations

#### Column for Sorting *[single column selection]*

The column which is used for sorting the input data. If no column is set, we assume the input dataset is already sorted.

#### Columns in which peaks should be eliminated *[multiple columns selection]*

Columns in which peaks should be eliminated. If no columns are selected, all columns with a valid type (all numeric columns) are used for peak elimination.

#### Strategy for computing the base correction value * *[single enum selection]*

The correction value can either be the mean of the current window, or the root-mean-square value of the current window. It can be further customized by setting a multiplier in the next option.

#### Correction Value Multiplier for downwards directed peaks * *[double]*

A number by which the computed correction value is multiplied. If the distance between the actual value and the correction value is larger than the correction value multiplied by this number, a downwards directed peak has been found. It is eliminated by replacing the value (and possibly some preceding and following samples) with the correction value. Default is 5.

#### Correction Value Multiplier for upwards directed peaks * *[double]*

A number by which the computed correction value is multiplied. If the distance between the actual value and the correction value is larger than the correction value multiplied by this number, an upwards directed peak has been found. It is eliminated by replacing the value (and possibly some preceding and following samples) with the correction value. Default is 5.
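The detection rule for a single window can be sketched in Python as follows. This is a simplified, single-run illustration under several assumptions: the names `correction_value` and `eliminate_peaks` are hypothetical, affected samples are replaced directly by the correction value as the option texts describe, and bucketing, sorting, and repeated elimination runs are left out.

```python
import math


def correction_value(window, strategy="mean"):
    """Mean or root-mean-square of all samples in the window."""
    if strategy == "rms":
        return math.sqrt(sum(x * x for x in window) / len(window))
    return sum(window) / len(window)


def eliminate_peaks(window, up=5.0, down=5.0, before=10, after=10,
                    strategy="mean"):
    """Replace peaks and their neighbours with the correction value."""
    c = correction_value(window, strategy)
    result = list(window)
    for i, x in enumerate(window):
        is_up_peak = x - c > up * c        # too far above the correction value
        is_down_peak = c - x > down * c    # too far below the correction value
        if is_up_peak or is_down_peak:
            start = max(0, i - before)
            stop = min(len(window), i + after + 1)
            for j in range(start, stop):
                result[j] = c
    return result


signal = [1.0] * 20 + [100.0] + [1.0] * 20
smoothed = eliminate_peaks(signal, before=2, after=2)
```

In this example the mean of the window is 140/41, so only the sample of 100.0 exceeds five times that value; it and two samples on each side are replaced.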

#### Amount of samples before a peak that should be eliminated *[integer]*

Number of samples before a peak that will be replaced by the computed correction value. Default is 10.

#### Amount of samples after a peak that should be eliminated *[integer]*

Number of samples after a peak that will be replaced by the computed correction value. Default is 10.

#### Block Size *[integer]*

The block size used for eliminating peaks. Smaller blocks mean that the correction value is computed from fewer samples, and vice versa. If not configured, the window size is chosen by the processor according to the size of the dataset.

#### Maximal Number of Elimination Runs *[integer]*

The maximum number of peak elimination runs performed in a loop on the given windows. The peak elimination stops and returns its result when no more peaks can be eliminated or when this number of runs is reached.

#### Bucketing Column Name *[column name]*

The output data-set will be extended by an additional column containing numeric values that stand for the buckets in which a given sample was put. The default is "PEAK_BUCKET".

# Horizontal Split

**UUID:** 00000000-0000-0000-0052-000000000001

## Description

Splits a data set into two data sets.

**For more details refer to the following article.**

## Input(s)

*in.input*- Horizontal split input

## Output(s)

*out.first*- First Split

*out.second*- Second Split

## Configurations

#### Order by *[single column selection]*

Column by which the rows will be ordered. With an ordering column, the resulting splits have the exact size according to the specified percentage, and the results are reproducible. If no ordering column is chosen, the splits are computed randomly and their sizes are approximate. More on this can be found in the documentation of the percentage option.

#### Ordering *[single enum selection]*

Decide whether you want ascending (from lowest to highest value) or descending order (from highest to lowest value).

#### Percentage * *[double]*

Percentage of rows to be selected for the First Split. If no ordering column is given, the splits will be approximated, e.g. a dataset with 70 lines split by 50 % may then result in one output having 32 lines and the other output having 38 lines.

#### Seed *[integer]*

Value used to seed random splitting, which ensures deterministic behavior: if you run the same workflow twice with HorizontalSplitProcessor, you will see the same output both times. The seed value is only used when splitting randomly, i.e. when no ordering column is given. If an ordering column is given, the seed value is not used. If neither a seed value nor an ordering column is specified, the input is split randomly and non-deterministically, so there is no guarantee that you will see the same output if you run the same workflow multiple times.
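The difference between ordered and random splitting can be sketched in Python. The function names `split_ordered` and `split_random` are hypothetical. The random variant decides per row, which is one plausible way to obtain splits whose sizes are only approximate, and it is an assumption rather than the processor's documented mechanism.

```python
import random


def split_ordered(rows, key, percentage, descending=False):
    """Exact, reproducible split after ordering by `key`."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=descending)
    cut = round(len(ordered) * percentage / 100)
    return ordered[:cut], ordered[cut:]


def split_random(rows, percentage, seed=None):
    """Approximate split: each row goes to the first split with the
    given probability. A fixed seed makes the result repeatable."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        target = first if rng.random() < percentage / 100 else second
        target.append(row)
    return first, second


rows = [{"v": i} for i in range(70)]
ordered_first, ordered_second = split_ordered(rows, "v", 50)
random_first, random_second = split_random(rows, 50, seed=42)
```

The ordered split always yields 35 and 35 rows here, while the random split only guarantees that every row lands in exactly one of the two outputs.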

#### Exact random splitting * *[boolean]*

Guarantees a split at exactly the given percentage (rounded to the nearest integer) when no ordering is specified. Enabling this option may severely impact performance! If an ordering is specified, this option has no effect.

# If Else

**UUID:** 00000000-0000-0000-1138-000000000001

## Description

Evaluates the boolean SQL query and forwards the first processor input to the left output if all result rows are true, or to the right output if at least one result row is false. Please note that Integrated Workflow Execution Processors under this processor will still be expanded, and all parts of the integrated workflow not relying on one of the inputs will still be executed!

**For more details refer to the following article.**

## Input(s)

*in.table1*- Input for firstInputTable

*in.table2*- Input for secondInputTable

## Output(s)

*out.trueOutput*- True Output

*out.falseOutput*- False Output

## Configurations

#### Query * *[string]*

Specify a query that defines the test condition (both processor inputs can be used). This query must return a single boolean or numeric column. If any of the values in this column is false or zero, the assertion fails and the first processor input is forwarded to the second output. If all values in the column are true, the first input is forwarded to the first output.
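The routing rule can be expressed as a small Python sketch. The function name `route` is hypothetical, the query result is represented as a plain list of values, and an output that receives nothing is modelled as `None`, which is an assumption about how the unused branch behaves.

```python
def route(first_input, condition_values):
    """Forward `first_input` to the true or the false output."""
    # False and zero both count as a failed condition.
    all_true = all(bool(v) for v in condition_values)
    true_output = first_input if all_true else None
    false_output = None if all_true else first_input
    return true_output, false_output


data = [{"id": 1}, {"id": 2}]
passed = route(data, [True, 1, True])
failed = route(data, [True, 0, True])
```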

# Indexing

**UUID:** 00000000-0000-0000-0018-000000000001

## Description

Adds an incremental index column in front of the first column of a data set. The index starts at a defined value and increments by 1 for each row.

**For more details refer to the following article.**

## Input(s)

*in.in*- Indexing input

## Output(s)

*out.out*- Indexing output

## Configurations

#### Index column name *[column name]*

Name of the new column containing the index which will be added in front of the first column of the input data set.

#### Start Index * *[integer]*

The start index is set to 0 by default. Can be altered if another value is desired for a starting index.

# Inner Join

**UUID:** 00000000-0000-0000-0015-000000000001

## Description

Inner join on two input data sets by one column match. Columns selected as join partners must have the same data type. All other columns in the two data sets must not have identical names.

**For more details refer to the following article.**

## Input(s)

*in.in1*- First input

*in.in2*- Second input

## Output(s)

*out.out*- Output

## Configurations

#### First join partner * *[single column selection]*

The column in the first input data set that is a join partner. The column won't exist in the output data set anymore.

#### Second join partner * *[single column selection]*

The column in the second input data set that is a join partner. The column will not exist in the output data set anymore.

#### New column name * *[column name]*

The new column name of the matched join partners in the output data set.
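How the two join partners collapse into one newly named column can be sketched in Python. The function name `inner_join` and the sample tables are hypothetical, and placing the merged column first in the output is an assumption.

```python
def inner_join(left, right, left_key, right_key, new_column):
    """Join rows whose key values match; both key columns are replaced
    by a single column named `new_column`."""
    # Index the second input by its join partner for fast lookup.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for l_row in left:
        for r_row in index.get(l_row[left_key], []):
            merged = {new_column: l_row[left_key]}
            merged.update({k: v for k, v in l_row.items() if k != left_key})
            merged.update({k: v for k, v in r_row.items() if k != right_key})
            joined.append(merged)
    return joined


orders = [{"customer": 1, "amount": 20}, {"customer": 3, "amount": 5}]
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Bob"}]
result = inner_join(orders, customers, "customer", "cust_id", "customer_id")
```

Only the order of customer 1 has a partner in both inputs, so the result contains a single row without the original `customer` and `cust_id` columns.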

#### Broadcast right table before join * *[boolean]*

A broadcast join can greatly improve performance in situations where a large table (fact) is joined with relatively small tables (dimensions). You are responsible for making sure the broadcasted table does not exceed memory limits of the workers!

# KPI Alternatives

**UUID:** 00000000-0000-0000-0047-000000000001

## Description

Collects KPI Alternatives from a suitable data set input. Alternatives come with information about the column location (schema, table, column name), distinct values and formats, fill rate of the column and total amount of entries.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### KPI Column * *[single column selection]*

Select the column containing the KPI name

#### Schema Column * *[single column selection]*

Select the column containing the Schema information

#### Table Column * *[single column selection]*

Select the column containing the Table information

#### Column name Column * *[single column selection]*

Select the column containing the Column Name information

#### Filter Column * *[single column selection]*

Select the column containing the SQL WHERE statement

#### Probability Column * *[single column selection]*

Select the column containing the Probability information. (How probable is the current Alternative?)

#### Total Count Column * *[single column selection]*

Select the column containing the Total Value Count information. (How many entries does the current Alternative have?)

#### Fill Rate Column * *[single column selection]*

Select the column containing the Fill Rate information. (What is the percentage of valid entries in the current Alternative?)

#### Column Formats Column * *[single column selection]*

Select the column containing the most common Column Format information. (What are the most common formats of the entries in the current Alternative?)

#### Most frequent Values Column * *[single column selection]*

Select the column containing the information about most common values. (What are the most frequent distinct values in the current Alternative?)

#### Separator *[string]*

Separator used to separate values in the most frequent formats and most frequent distinct values column. Defaults to \|

#### Probability Threshold for Preselection *[double]*

Threshold for the Probability value. The most probable Alternative for a KPI is preselected automatically when its Probability is above the given Threshold. Must be a value in [0.0..1.0]. Defaults to 0.5.

#### Preselections (leave empty) *[string]*

After executing this Processor you can store its result inside its configuration and run it again obtaining all selected Alternatives in the output.

# KPI Alternatives for Projects

**UUID:** 00000000-0000-0000-0047-000000000002

## Description

Collects KPI Alternatives from a suitable data set input. Alternatives come with information about the column location (schema, table, column name), distinct values and formats, fill rate of the column and total amount of entries.

**For more details refer to the following article.**

## Input(s)

*in.input_kpi*- KPI info Input

*in.input_columns*- Column value input

## Output(s)

*out.output*- Output

## Configurations

#### KPI Column * *[single column selection]*

Select the column containing the KPI name

#### Schema Column * *[single column selection]*

Select the column containing the Schema information

#### Table Column * *[single column selection]*

Select the column containing the Table information

#### Column name Column * *[single column selection]*

Select the column containing the Column Name information

#### Filter Column * *[single column selection]*

Select the column containing the SQL WHERE statement

#### Probability Column * *[single column selection]*

Select the column containing the Probability information. (How probable is the current Alternative?)

#### Total Count Column * *[single column selection]*

Select the column containing the Total Value Count information. (How many entries does the current Alternative have?)

#### Fill Rate Column * *[single column selection]*

Select the column containing the Fill Rate information. (What is the percentage of valid entries in the current Alternative?)

#### Column Formats Column * *[single column selection]*

Select the column containing the most common Column Format information. (What are the most common formats of the entries in the current Alternative?)

#### Separator *[string]*

Separator used to separate values in the most frequent formats and most frequent distinct values column. Defaults to \|

#### Most frequent Values Column * *[single column selection]*

Select the column containing the information about most common values. (What are the most frequent distinct values in the current Alternative?)

#### Probability Threshold for Preselection *[double]*

Threshold for the Probability value. The most probable Alternative for a KPI is preselected automatically when its Probability is above the given Threshold. Must be a value in [0.0, 1.0]. Defaults to 0.5.

#### Preselections (leave empty) *[string]*

After executing this Processor you can store its result inside its configuration and run it again obtaining all selected Alternatives in the output.

#### Project Column * *[single column selection]*

Select the column containing the Project name

# Lag/Lead Generation

**UUID:** 00000000-0000-0000-0043-000000000001

## Description

This processor generates lag / lead columns for a given column in a data set. For the given column, additional columns are appended and returned in the output. These additional columns contain values of the chosen column taken from previous / following rows of the data set (depending on the configuration, there can be an arbitrary interval between the current row and the lag / lead value). The processor can also generate lags / leads based on timestamps instead of rows. For example, suppose lag generation is selected, one column contains equidistant timestamps increasing by 1 second, and two lags of 60 seconds are configured. The input dataset then needs to have at least 121 rows; in that case, the output will be exactly one row, containing all the values from the last row plus the lagged values (the value of the 60th row for the 60-second lag, and the value of the 0th row for the 120-second lag).

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Lag / Lead generation input

## Output(s)

*out.output*- Lag generation output

## Configurations

#### Column for lag / lead generation * *[single column selection]*

Column for which the lag / lead values should be generated. This column may have any type, no restrictions apply here.

#### Do simple row-based lag / lead generation without the need for equidistant time-series. * *[boolean]*

When set, the column selected in 'Sorting column' is only used for sorting the input data before lag / lead generation and does not need to be a DATETIME column, but can be any sortable column. If no sorting column is given, we assume the input data is already ordered. Each lag / lead is directly referring to its preceding / succeeding row(s). The Interval Multiplicator setting here defines the distance of a lag / lead in rows.

#### Sorting column *[single column selection]*

This configuration option can be used in three different ways. If time-based lag / lead generation is done (No row-based lag / lead generation, equidistance of time-stamps is mandatory), the chosen column needs to contain values of type DATETIME. If row-based lag / lead generation is done (check the row-based lag / lead generation option) this option may either be not set, then we assume the incoming data is sorted, or the option is set to a column that is used for sorting the dataset before applying the lag / lead generation (doesn't need to be a DATETIME column, but can be any column with scale type interval or ratio).

#### Time Interval *[single enum selection]*

The time interval which should be used for lag / lead generation, if it is not row-based. This interval can be further customized by choosing a multiplicator in the next config option

#### Interval Multiplicator * *[integer]*

When using time-based lag / lead generation, the chosen interval can be further customized. E.g. by using the interval seconds and 2 here, we have a time-lag of 2 seconds. When using row-based lag / lead generation, this option is used as a span between the value and its lagged value, e.g. by setting 2 here and lag generation, the first lag of a row is not from the previous row, but from the row before the previous row. The default value for this option is 1

#### Amount of Lags / Leads * *[integer]*

Choose the amount of lags / leads to create

#### Lag / Lead Generation * *[boolean]*

Switch between lag (toggle is off) or lead (toggle is on) generation:

- lag - values for generated column lags are taken from the previous rows of the dataset and new column names have a "_LAG" suffix,
- lead - values for generated column leads are taken from the next rows of the dataset and new column names have a "_LEAD" suffix.

#### Extrapolation * *[single enum selection]*

Which kind of extrapolation should be done:

- Delete edge rows - Only keep rows in the result, for which the lags can be calculated.
- Pad with NULL - Keep the edge rows and set their lags to NULL.
- Fill with first / last value - Keep the edge rows and set their lags to first lag value.

#### Columns for grouping *[multiple columns selection]*

Columns used for grouping the data before creating lags / leads. Lag / lead generation is done for each group separately. A group is defined as a distinct combination of values present in the selected columns. Selecting a column here implicitly makes it mandatory to select a column in "Sorting column".
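
The row-based mode described above can be illustrated with a minimal Python sketch. It is not the processor's implementation: the function name `add_lags`, the option names `"delete"`, `"pad_null"` and `"fill"`, the exact column naming scheme, and the interpretation of "Fill with first / last value" as reusing the value at the edge of the data are all assumptions made for illustration. Grouping and sorting are left out; the rows are assumed to be sorted already.

```python
from typing import Any, Dict, List


def add_lags(
    rows: List[Dict[str, Any]],
    column: str,
    amount: int = 1,
    multiplicator: int = 1,
    lead: bool = False,
    extrapolation: str = "pad_null",
) -> List[Dict[str, Any]]:
    """Row-based lag / lead generation for rows that are already sorted.

    extrapolation is one of "delete", "pad_null" or "fill" (hypothetical names).
    """
    suffix = "_LEAD" if lead else "_LAG"
    step = multiplicator if lead else -multiplicator
    result = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        complete = True
        for k in range(1, amount + 1):
            j = i + k * step  # index of the row the k-th lag / lead refers to
            name = f"{column}_{k}{suffix}"  # assumed naming scheme
            if 0 <= j < len(rows):
                new_row[name] = rows[j][column]
            else:
                complete = False
                # Assumed meaning of "fill": reuse the value at the data edge.
                edge = rows[-1] if lead else rows[0]
                new_row[name] = edge[column] if extrapolation == "fill" else None
        # "delete" keeps only rows for which every lag / lead exists.
        if complete or extrapolation != "delete":
            result.append(new_row)
    return result
```

With an interval multiplicator of 2, the first lag of a row comes from the row before the previous row, which matches the description of "Interval Multiplicator".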

# Left Join

**UUID:** 00000000-0000-0000-0015-000000000002

## Description

Left join on two input data sets by one column match. Columns selected as join partners must have the same data type. All other columns in the two data sets must not have identical names. The left table is the leading table.

**For more details refer to the following ****article****.**

## Input(s)

*in.in1*- First input

*in.in2*- Second input

## Output(s)

*out.out*- Output

## Configurations

#### First join partner * *[single column selection]*

The column in the first input data set that is a join partner. The column will also exist in the output data set.

#### Second join partner * *[single column selection]*

The column in the second input data set that is a join partner. The column won't exist in the output data set anymore.

#### Broadcast right table before join * *[boolean]*

A broadcast join can greatly improve performance in situations where a large table (fact) is joined with relatively small tables (dimensions). You are responsible for making sure the broadcasted table does not exceed memory limits of the workers!

#### Customized autofill values for nonmatching join IDs *[enum composed]*

Select values for different data types that will be inserted in the columns of the right join partner in those lines where the join ID of the left join partner cannot be found in the right input table.

###### Customized autofill values for nonmatching join IDs > Custom replacement value * *[string]*

Custom replacement value to be used for the given type.
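
As a rough illustration of the join semantics and the autofill option, here is a small Python sketch on lists of dictionaries. The function `left_join` and its `autofill` parameter (a mapping from Python type to replacement value) are hypothetical; the processor itself presumably runs on Spark, and how it resolves data types is not documented here.

```python
from typing import Any, Dict, List, Optional


def left_join(
    left: List[Dict[str, Any]],
    right: List[Dict[str, Any]],
    left_key: str,
    right_key: str,
    autofill: Optional[Dict[type, Any]] = None,
) -> List[Dict[str, Any]]:
    """Left join; unmatched rows get a per-type replacement value (or None)."""
    autofill = autofill or {}
    index: Dict[Any, List[Dict[str, Any]]] = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # The second join partner is dropped; remember the type of every other column.
    first = right[0] if right else {}
    right_types = {c: type(v) for c, v in first.items() if c != right_key}
    joined = []
    for row in left:
        matches = index.get(row[left_key], [])
        if matches:
            for match in matches:
                joined.append({**row, **{c: match[c] for c in right_types}})
        else:
            filler = {c: autofill.get(t) for c, t in right_types.items()}
            joined.append({**row, **filler})
    return joined
```

Every row of the left input appears in the result at least once, which is what makes the left table the leading table.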

# Lexical Binarization

**UUID:** 00000000-0000-0000-0055-000000000001

## Description

Generates multiple binarized columns from one nominal scaled/text column for a limited number of unique values in the nominal scaled/text column. Each binarized column will contain a 1 in a row when the corresponding nominal value is present in it. Otherwise there will be a 0.

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Lexical binarization input

## Output(s)

*out.output*- Lexical binarization output

*out.outputSummary*- Distinct Summary Output

## Configurations

#### Selected Column Name * *[single column selection]*

Column with nominal scale/text to binarize from.

#### Separator *[string]*

Specify pattern that serves as separator.

#### Prefix for new columns *[string]*

Choose a prefix for the new binarized columns.

#### Maximum number of binarized columns *[integer]*

Choose the maximum number of unique values for which binarized columns should be generated. The unique values with the most occurrences in the column are taken. Default value is 20.

#### Case Sensitive * *[boolean]*

Choose whether binarization should be case sensitive, i.e. if it should distinguish between upper and lower case letters.
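
The following Python sketch shows one way to read these options together. It is an illustration under assumptions, not the processor's code: the function `binarize` is hypothetical, the separator is treated as a regular expression, occurrences are counted once per row, and case-insensitive mode is modelled by lowercasing the values.

```python
import re
from collections import Counter
from typing import Any, Dict, List, Optional, Tuple


def binarize(
    rows: List[Dict[str, Any]],
    column: str,
    separator: Optional[str] = None,
    prefix: str = "",
    max_columns: int = 20,
    case_sensitive: bool = True,
) -> Tuple[List[Dict[str, Any]], Dict[str, int]]:
    """Return the binarized rows and a summary of distinct value counts."""

    def tokens(value: Optional[str]) -> List[str]:
        if value is None:
            return []
        text = value if case_sensitive else value.lower()
        parts = re.split(separator, text) if separator else [text]
        # dict.fromkeys removes duplicates while keeping a stable order.
        return list(dict.fromkeys(p for p in parts if p))

    counts: Counter = Counter()
    for row in rows:
        counts.update(tokens(row[column]))
    # Keep only the most frequent values, as described for the maximum setting.
    top = [value for value, _ in counts.most_common(max_columns)]
    output = []
    for row in rows:
        present = set(tokens(row[column]))
        flags = {prefix + value: int(value in present) for value in top}
        output.append({**row, **flags})
    return output, dict(counts)
```

The second return value corresponds loosely to the distinct summary output; its actual layout in the processor may differ.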

# Manipulate Partitions

**UUID:** 00000000-0000-0000-0085-000000000001

## Description

Re-arranges input data in a custom number of partitions either through coalescing (only applicable for reducing the number of partitions) or repartitioning. Can be useful especially after joins or other operations that fragment their output.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.restructured*- Output (restructured input)

## Configurations

#### Number of Resulting Partitions * *[integer]*

The number of data partitions the input should be arranged in.

#### Use Repartition *[composed]*

Specifies whether input data should be re-arranged through repartitioning. If set to off, coalesce will be used instead. Coalesce will never hurt performance but also does not try to achieve balanced partitions. Repartition might cause a performance decrease but provides more control over partitioning constraints. Unlike coalesce, repartitioning can also be used to increase the number of partitions.

###### Use Repartition > Partition by *[single column selection]*

The column(s) to partition by. If set, partitioning will respect the selected column(s) and place rows with the same values in the selected column(s) inside the same partition if possible. If no columns are specified, data is grouped in partitions according to the row's hash values.
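
To make the "Partition by" behavior more concrete, here is a simplified Python sketch of hash partitioning. It only models the idea that rows with the same value end up in the same partition; the function `repartition` and the use of CRC32 as the hash are assumptions for illustration and say nothing about how the underlying engine actually hashes rows.

```python
import zlib
from typing import Any, Dict, List, Optional


def repartition(
    rows: List[Dict[str, Any]],
    num_partitions: int,
    partition_by: Optional[str] = None,
) -> List[List[Dict[str, Any]]]:
    """Distribute rows over partitions by hashing a column or the whole row."""
    partitions: List[List[Dict[str, Any]]] = [[] for _ in range(num_partitions)]
    for row in rows:
        # Without a column, the complete row content determines the partition.
        key = row[partition_by] if partition_by else tuple(sorted(row.items()))
        index = zlib.crc32(repr(key).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions
```

Because the partition index depends only on the key, equal keys always map to the same partition, while the partition sizes can be unbalanced if a few keys dominate the data.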

# Manual Column Specification

**UUID:** 00000000-0000-0000-0074-000000000001

## Description

Auxiliary processor that gives the user control over the column information that is currently present at the client Type Inference (TI). The user can remove columns that are already inside the TI and add new columns with a certain Scale and Representation type. Note that this processor does not trigger the generation of new columns on server side but is only used to aid client TI. Subsequent processors will fail when columns are selected that are generated virtually by this processor but are not present in their physical input dataset!

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output with corrected TI

## Configurations

#### Exclude these columns *[multiple columns selection]*

The selected columns will be excluded from the TI.

#### Add these columns *[manual column specification]*

The columns entered here will be included into the TI.

# Mathematical Column Operation

**UUID:** 00000000-0000-0000-0049-000000000001

## Description

Executes an elementary arithmetic operation between two selected columns and saves the result in a new column.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Output

## Configurations

#### Left hand side of operator * *[single column selection]*

Choose a column with double type as the left-hand operand.

#### Right hand side of operator * *[single column selection]*

Choose a column with double type as the right-hand operand.

#### New Column Name * *[column name]*

Name of the column that contains the result of the arithmetic operation. Will be added to the data set.

#### Arithmetic Operator * *[single enum selection]*

Choose the arithmetic operation that should be performed with the two selected columns.

# Mathematical Operation MC

**UUID:** 00000000-0000-0000-0098-000000000002

## Description

This processor takes an arithmetic operation (calculation) and an operand, and applies these to each value of the selected columns.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Select column/-s *[multiple columns selection]*

Select the columns on which the arithmetic operation should be performed.

#### Operand * *[double]*

The operand to be applied by the selected operator.

#### Arithmetic Operation * *[single enum selection]*

The arithmetic operation to apply in the calculation.

# Mathematical Operation SC

**UUID:** 00000000-0000-0000-0098-000000000001

## Description

This processor takes an arithmetic operation (calculation) and an operand, and applies these to each value of the selected column. The result is saved in a new column with the entered column name.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Output

## Configurations

#### Select column * *[single column selection]*

Select the column on which the arithmetic operation should be performed.

#### Arithmetic Operation *[single enum selection]*

An arithmetic operation to apply in the calculation.

#### Operand * *[double]*

The operand that is used for the arithmetic operation.

#### Column Name for Result * *[column name]*

Select a name for the arithmetic operation result column.

# Mathematical Timestamp Operation

**UUID:** 00000000-0000-0000-0058-000000000001

## Description

Processor to add or subtract a user-specified time value in place on one or several selected columns.

**For more details refer to the following ****article****.**

## Input(s)

*in.inputData*- Input Data

## Output(s)

*out.TimestampArithmeticOutput*- Timestamp Arithmetic output

## Configurations

#### Select column/-s * *[multiple columns selection]*

Select the columns on which the arithmetic operation should be performed.

#### Arithmetic Operation * *[single enum selection]*

Operator to apply (Addition/Subtraction)

#### Time Interval * *[single enum selection]*

Interval (Milliseconds/Seconds/Minutes/Hours/Days/Weeks/Months/Years) to add or subtract from the selected column

#### Operand * *[integer]*

The operand that is used for the arithmetic operation.
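
The interval list mixes fixed-length units (milliseconds to weeks) with calendar units (months, years), which need different arithmetic. The Python sketch below illustrates the difference. The function `shift_timestamp` is hypothetical, and clamping to the last day of a shorter month is an assumption; the processor may handle such dates differently.

```python
import calendar
from datetime import datetime, timedelta

# Units with a fixed length can be expressed directly as a timedelta.
FIXED_UNITS = {
    "Milliseconds": timedelta(milliseconds=1),
    "Seconds": timedelta(seconds=1),
    "Minutes": timedelta(minutes=1),
    "Hours": timedelta(hours=1),
    "Days": timedelta(days=1),
    "Weeks": timedelta(weeks=1),
}


def shift_timestamp(ts: datetime, operand: int, unit: str, subtract: bool = False) -> datetime:
    """Add or subtract `operand` units from a timestamp."""
    amount = -operand if subtract else operand
    if unit in FIXED_UNITS:
        return ts + amount * FIXED_UNITS[unit]
    if unit not in ("Months", "Years"):
        raise ValueError(f"unknown unit: {unit}")
    months = amount * (12 if unit == "Years" else 1)
    year, month_index = divmod(ts.year * 12 + ts.month - 1 + months, 12)
    # Assumption: clamp to the last day of a shorter target month.
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return ts.replace(year=year, month=month_index + 1, day=min(ts.day, last_day))
```

Under this clamping assumption, adding one month to 31 January 2024 yields 29 February 2024.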

# Multiple Column Rename

**UUID:** 00000000-0000-0000-0077-000000000002

## Description

Rename multiple columns

**For more details refer to the following ****article****.**

## Input(s)

*in.input2*- Input

## Output(s)

*out.output2*- Output

## Configurations

#### Rename one or more Columns *[composed]*

Select multiple columns to rename individually.

###### Rename one or more Columns > New Column Name * *[column name]*

Specify a new name for the selected column.

# Multiple Doublify

**UUID:** 00000000-0000-0000-0148-000000000001

## Description

Changes type of multiple columns to Ratio-Scaled/Double

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Selected columns * *[multiple columns selection]*

Select columns for which the type should be changed.

#### Default value *[double]*

Value to use for values that cannot be converted to ratio-scaled/double (Default: 0)

# Multiple Intify

**UUID:** 00000000-0000-0000-0148-000000000002

## Description

Changes type of multiple columns to interval-scaled.

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Selected columns * *[multiple columns selection]*

Select columns for which the type should be changed.

#### Rounding Strategy * *[single enum selection]*

With this option a rounding mode for double-like values in columns can be chosen. The rounding modes are explained in detail here.

#### Default value *[integer]*

Value to use for values that cannot be converted to interval-scaled (Default: 0)

# Ordered Subsetting

**UUID:** 00000000-0000-0000-0051-000000000001

## Description

Selects the first "x" percent of rows ordered by a specified column.

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Order by * *[single column selection]*

Select the column by which the data will be ordered.

#### Ordering * *[single enum selection]*

Decide whether you want ascending (from lowest to highest value) or descending order (from highest to lowest value).

#### Percentage *[double]*

Percentage of rows that should be selected after ordering.
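
A minimal Python sketch of this selection follows. The function `ordered_subset` is hypothetical, and rounding the row count down is an assumption, since the rounding rule for fractional row counts is not stated.

```python
import math
from typing import Any, Dict, List


def ordered_subset(
    rows: List[Dict[str, Any]],
    order_by: str,
    percentage: float,
    descending: bool = False,
) -> List[Dict[str, Any]]:
    """Sort by one column and keep the first `percentage` percent of rows."""
    ordered = sorted(rows, key=lambda row: row[order_by], reverse=descending)
    keep = math.floor(len(ordered) * percentage / 100)  # assumed: round down
    return ordered[:keep]
```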

# Ordering

**UUID:** 00000000-0000-0000-0102-000000000001

## Description

This processor sorts a data set based on up to three columns.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Data to be sorted

## Output(s)

*out.output*- Output

## Configurations

#### First Sort Column * *[single column selection]*

The first column to sort by

#### Second Sort Column *[single column selection]*

The second (optional) column to sort by

#### Third Sort Column *[single column selection]*

The third (optional) column to sort by

# Ordering and Split

**UUID:** 00000000-0000-0000-0102-000000000002

## Description

This processor orders datasets by arbitrary columns, in ascending or descending order. If necessary, the dataset can also be split into two datasets at a given percentage or a fixed absolute row count. If no splitting is selected, the left output contains the ordered dataset.

**For more details refer to the following ****article****.**

## Input(s)

*in.inputData*- Data to be sorted

## Output(s)

*out.outputDataLeft*- Left Split

*out.outputDataRight*- Right Split

## Configurations

#### Order By *[composed]*

The columns that should be used for ordering the dataset.

###### Order By > Ordering * *[single enum selection]*

Decide whether you want ascending (from lowest to highest value) or descending order (from highest to lowest value).

#### Split Data *[enum composed]*

Strategy for splitting the dataset.

###### Split Data > Split Count/Percentage *[integer]*

Fixed number of rows or percentage where the splitting should happen.

###### Split Data > Ordering Strategy *[single enum selection]*

Point in time at which the ordering should be applied, i.e. whether the dataset is split before or after ordering.

###### Split Data > Seed *[integer]*

Value used to seed random splitting, which ensures deterministic behavior. If no seed is selected, nondeterministic behavior is expected.

###### Split Data > Exact sampling * *[boolean]*

Exact sampling guarantees to draw exactly the number of the entered sample size. Disabling this option can speed up calculation, however the selected number of rows will only be approximated.

# Principal Component Analysis

**UUID:** 00000000-0000-0000-0084-000000000001

## Description

This processor is for a principal component analysis (PCA). PCA is an unsupervised method used in exploratory data analysis and for making predictive models. It's a means of revealing the internal structure of the data in a way that best explains the variance in the data. It can be a very useful step especially for visualising and pre-processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible.

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Principal Component Analysis input

## Output(s)

*out.PCAEnhancedOutput*- PCA Enhanced output

*out.PCADetailsOutput*- PCA Details output

## Configurations

#### Input Columns * *[multiple columns selection]*

Columns, i.e. variables, for which principal component analysis should be done. Only numeric columns can be used for PCA.

#### Grouping Column * *[single column selection]*

Select the Column to group by. The PCA will be computed for every group separately.

#### Scale data to unit standard deviation * *[boolean]*

Standardize input data after scaling to unit variance.

#### Center data with mean * *[boolean]*

Standardize input data after mean centering.

#### Cache calculated PCA results * *[boolean]*

Caching results after PCA calculation can, for some input datasets, enhance performance of the workflow as the calculation won't need to happen twice (i.e. one for each output). On the other hand it uses more resources (mainly memory). So in case there's not enough memory to cache all the results, i.e. large input datasets, the cached data will be spilled to the disk which can affect the performance. In case the workflow seems to run slow, try to switch off caching.

#### Number of data partitions *[integer]*

The number of data partitions the data should be arranged in. If set, data is re-arranged through re-partitioning by the rows' hash values before performing the 'group by' operation. Can improve performance.

# Query

**UUID:** 00000000-0000-0000-0003-000000000001

## Description

Write a custom query to perform on the input data. Use Spark SQL functions and syntax. The table to be referenced is called "inputTable".

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input for inputTable

## Output(s)

*out.data*- Query result

## Configurations

#### Query * *[string]*

Define custom query to select data. The table name to use in the "FROM" statement is "inputTable". Example: SELECT * FROM "inputTable".

# Double Input Query

**UUID:** 00000000-0000-0000-0103-000000000001

## Description

Write a custom query to perform on two input tables. Use Spark SQL functions and syntax. The tables to reference are called "firstInputTable" and "secondInputTable".

**For more details refer to the following ****article****.**

## Input(s)

*in.table1*- Input for firstInputTable

*in.table2*- Input for secondInputTable

## Output(s)

*out.data*- Query Result

## Configurations

#### Query * *[string]*

Write a custom query to select data. The table names to use in "FROM" are always "firstInputTable" and "secondInputTable". (Example: Select f.*,s.* From "firstInputTable" as f, "secondInputTable" as s)

# Multi-Input Query

**UUID:** 00000000-0000-0000-0003-000000000003

## Description

**For more details refer to the following ****article****.**

## Input(s)

*in1_table*- Input for firstInputTable

*in2_table*- Input for secondInputTable

*in3_table*- Input for thirdInputTable

## Output(s)

*out.data*- Query Result

## Configurations

#### Query * *[string]*

Write a custom query to select data. The table names to use in "FROM" are always "*in1_table*", "*in2_table*" and "*in3_table*". (Example: Select f.*,s.* From "*in1_table*" as f, "*in3_table*" as s)

# Query Helper

**UUID:** 00000000-0000-0000-0163-000000000001

## Description

Use queries, which are defined in the right input port, on the dataset from the left input port. The query table should contain two columns, the first one containing the column names to select and the second one containing the WHERE clause.

**For more details refer to the following ****article****.**

## Input(s)

*in.table1*- Input for inputTable

*in.table2*- Input for queryTable

## Output(s)

*out.data*- Query Result

## Configurations

#### Select Column * *[single column selection]*

Select the column containing the SQL SELECT Statement

#### Filter Column * *[single column selection]*

Select the column containing the SQL WHERE statement

# Rounding

**UUID:** 00000000-0000-0000-0063-000000000001

## Description

This processor rounds numeric columns to a number of decimal places or significant numbers.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Columns to round *[multiple columns selection]*

The numeric columns that should get rounded.

#### Number of decimal places *[integer]*

The number of decimal places selected columns will be rounded to.

#### Number of significant numbers *[integer]*

The number of significant numbers the selected columns will be rounded to. If selected, decimal places may no longer occur. For example, with Significant Numbers: 2 and Decimal Places: 2, the value 26.023 becomes 26, but 0.26023 becomes 0.26, because only 2 significant numbers are kept.

#### Rounding mode * *[single enum selection]*

Choose whether you want to round up or down
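
The difference between the two rounding targets can be reproduced with Python's `decimal` module. This is a sketch, not the processor's implementation: the function names are hypothetical, values are passed as strings to avoid binary floating-point artifacts, and half-up rounding is used as a placeholder for whatever rounding mode is configured.

```python
from decimal import ROUND_HALF_UP, Decimal


def round_decimal_places(value: str, places: int, mode: str = ROUND_HALF_UP) -> Decimal:
    """Round to a fixed number of digits after the decimal point."""
    return Decimal(value).quantize(Decimal(1).scaleb(-places), rounding=mode)


def round_significant(value: str, digits: int, mode: str = ROUND_HALF_UP) -> Decimal:
    """Round to a fixed number of significant digits."""
    number = Decimal(value)
    if number == 0:
        return number
    # adjusted() is the exponent of the most significant digit.
    exponent = number.adjusted() - digits + 1
    return number.quantize(Decimal(1).scaleb(exponent), rounding=mode)
```

With two significant digits, `round_significant("26.023", 2)` gives 26 and `round_significant("0.26023", 2)` gives 0.26, matching the example in the configuration text.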

# SOAP Request

**UUID:** 00000000-0000-0000-0181-000000000001

## Description

Sends requests to SOAP APIs. Requests are sent sequentially and a delay between the requests can be configured. If no column placeholders are used, only one request is sent, instead of X where X is the number of rows. On big datasets, the memory consumption of the processor might be an issue. If you run into problems here, you can repartition the data into more (smaller) partitions, which should solve these problems.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.soapResponses*- SOAP Responses

## Configurations

#### SOAP Protocol * *[single enum selection]*

Choose the protocol the API supports.

#### Additional header parameters *[composed]*

Parameters that need to be added, e.g. for authentication. For Google Analytics this would look like "Authorization: Bearer randomOath2LoginToken"

###### Additional header parameters > Http Header Key * *[string]*

The key of a header key-value pair. This can be used e.g. for adding authorization information.

###### Additional header parameters > Http Header Value * *[string]*

The value of a header key-value pair. This can be used e.g. for adding authorization information.

#### Endpoint URL * *[string]*

The endpoint of the API

#### SOAP Action * *[string]*

The SOAP action for the request (may be empty)

#### Authentication *[key selection]*

Authentication will be used to provide access to resources

#### Time Between Requests (ms) * *[integer]*

Some APIs do not allow more than X requests in a certain time frame. To avoid missing responses due to this rate limit, you can add a waiting time between individual requests. By default, each request is sent directly after the answer to the previous one has been received. The number given here is interpreted as milliseconds.

#### Request Body * *[string]*

The body of the SOAP request. It may contain "links" to columns: at the specified location, the value of the column is inserted in place of the placeholder. For each row in the input dataset one request is sent (if placeholders are used, otherwise only one request is sent). Column references use the following syntax: "##columnValueFrom####" (without the quotation marks). To find out what the body for the target API looks like, you can use e.g. SOAP UI and import the API description there. SOAP UI can then create sample requests for you, whose body can be copied and used within this processor as a starting point.

# Sample Data Subsetting

**UUID:** 00000000-0000-0000-0007-000000000001

## Description

Draws random samples without replacement from input data.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Sampled Output

## Configurations

#### Seed *[integer]*

Integer for reproducibility of sampling. A default seed will be used if left empty.

#### Sample size *[integer]*

Defines how many observations should be drawn. If the number of rows is less than the specified integer value, the complete data set is returned. Default is set to 400.

#### Exact Sampling * *[boolean]*

Exact sampling guarantees to draw exactly the number of the entered sample size. Disabling this option can speed up calculation, however the selected number of rows will only be approximated.
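
The trade-off between exact and approximate sampling can be sketched in Python as follows. The function `sample_rows` and the constant `DEFAULT_SEED` are hypothetical, and modelling the approximate mode as an independent per-row draw is an assumption about how such sampling typically works, not a statement about this processor's internals.

```python
import random
from typing import Any, List

DEFAULT_SEED = 42  # hypothetical; the processor's actual default seed is not documented here


def sample_rows(
    rows: List[Any],
    sample_size: int = 400,
    seed: int = DEFAULT_SEED,
    exact: bool = True,
) -> List[Any]:
    """Draw rows without replacement; return everything if the input is small."""
    if len(rows) <= sample_size:
        return list(rows)
    rng = random.Random(seed)
    if exact:
        # Exactly sample_size distinct rows.
        return rng.sample(rows, sample_size)
    # Approximate mode: keep each row independently with the target fraction,
    # so the result size only fluctuates around sample_size.
    fraction = sample_size / len(rows)
    return [row for row in rows if rng.random() < fraction]
```

Using the same seed twice returns the same sample, which is the reproducibility the Seed option refers to.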

# Search and Replace

**UUID:** 00000000-0000-0000-0050-000000000001

## Description

Processor replacing all substrings in the given columns matching the given regular expression. If no column is selected the regex replacement will be applied to all textual columns. If numeric or date columns are chosen for replacement these columns have type Text after this transformation. Null-values in to-be-replaced cells are set to be an empty string.

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Select Column/-s *[multiple columns selection]*

Select columns to apply replacement on. Non-text columns will be transformed to text columns automatically.

#### Regular Expression * *[string]*

Enter a Regex command that matches all the substrings that should be replaced in the selected column. For Syntax see documentation of java.util.regex.Pattern.

#### Replacement * *[string]*

String to replace the values that were matched with the regex (default: empty string).
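
The behavior described for this processor can be approximated with Python's `re` module, as in the sketch below. Note that the processor uses `java.util.regex.Pattern` syntax, which differs from Python's in some details (for example, group references in the replacement are written `$1` in Java and `\1` in Python). The function `search_and_replace` is hypothetical.

```python
import re
from typing import Any, Dict, List, Optional


def search_and_replace(
    rows: List[Dict[str, Any]],
    pattern: str,
    replacement: str = "",
    columns: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Replace every regex match in the selected (or all textual) columns."""
    regex = re.compile(pattern)
    if not columns:
        # No selection: use every column that holds text in at least one row.
        columns = sorted({c for row in rows for c, v in row.items() if isinstance(v, str)})
    result = []
    for row in rows:
        new_row = dict(row)
        for column in columns:
            value = row[column]
            # Null values become an empty string; other types are converted to text.
            text = "" if value is None else str(value)
            new_row[column] = regex.sub(replacement, text)
        result.append(new_row)
    return result
```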

# String Concatenation

**UUID:** 00000000-0000-0000-0016-000000000001

## Description

Concatenates the values of two string columns into one single ordinally scaled string. There is no space character between them.

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Input

## Output(s)

*out.out*- Output

## Configurations

#### First part * *[single column selection]*

The first column to generate the concatenated string from.

#### Second part * *[single column selection]*

The second column to generate the concatenated string from.

#### New Column Name * *[column name]*

Name of the column that will contain the result of the string concatenation.

#### Prefix *[string]*

An optional prefix that will be added before the concatenated strings.

#### Infix *[string]*

An optional infix that will be added in between the concatenated strings.

#### Suffix *[string]*

An optional suffix that will be added after the concatenated strings.

# Stringify

**UUID:** 00000000-0000-0000-0079-000000000001

## Description

Changes type of a column to Ordinal/String

**For more details refer to the following ****article****.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Select Column * *[multiple columns selection]*

Select columns for which the type should be changed

# Take First Data

**UUID:** 00000000-0000-0000-0008-000000000001

## Description

Takes the first n rows of the input data.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Output

## Configurations

#### Number of Rows *[integer]*

Defines how many rows "n" should be selected, starting at the first row. Default is set to 400 if no (or invalid) value is set.

# Time Interval Aggregation

**UUID:** 00000000-0000-0000-0013-000000000001

## Description

Aggregates a data set (all remaining columns) by a fixed time interval of a timestamp column and additional columns to group by. The time interval of the timestamp column and the aggregation function to process can be set.

**For more details refer to the following ****article****.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Output

## Configurations

#### Time interval * *[single enum selection]*

Indicates the aggregation interval.

#### Timestamp column * *[single column selection]*

A reference to a column containing the timestamp that should be used to aggregate the data set. The scale type of this column has to be DATETIME.

#### Aggregation function *[single enum selection]*

Indicates the aggregation type.

#### Aggregation Columns *[multiple columns selection]*

Columns that should be aggregated. May not overlap with GroupBy columns. If not configured all columns are used.

#### Additional GroupBy Columns *[multiple columns selection]*

Columns to use for grouping. These columns will not be aggregated but used to perform grouping.
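
As an illustration of how the interval, the aggregation function and the additional GroupBy columns interact, here is a Python sketch that is fixed to an hourly interval and the mean. The function `aggregate_by_hour` is hypothetical, and it aggregates a single value column for brevity, whereas the processor can aggregate several columns.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, List, Sequence, Tuple


def aggregate_by_hour(
    rows: List[Dict[str, Any]],
    timestamp_column: str,
    value_column: str,
    group_by: Sequence[str] = (),
) -> Dict[Tuple[Any, ...], float]:
    """Average one column per hour and per combination of group-by values."""
    buckets: Dict[Tuple[Any, ...], List[float]] = defaultdict(list)
    for row in rows:
        # Truncate the timestamp to the start of its hour.
        hour = row[timestamp_column].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(row[c] for c in group_by)
        buckets[key].append(row[value_column])
    return {key: sum(values) / len(values) for key, values in sorted(buckets.items())}
```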

# Time Sequence

**UUID:** 00000000-0000-0000-0059-000000000001

## Description

The processor generates a continuous and ordered (ascending) time sequence for a given start and end date and a corresponding row index. The unit defines the distance/difference between two successive dates.

**For more details refer to the following ****article****.**

## Input(s)

*in.timeSequenceInputData*- Time Sequence Input Data

## Output(s)

*out.timeSequenceData*- Generated time sequence dataset

## Configurations

#### Start Date * *[single column selection]*

Select the column that contains the date from which the time sequence should start (needs to be a DateTime datatype).

#### End Date * *[single column selection]*

Select the column that contains the date at which the time sequence should end (needs to be a DateTime datatype).

#### Unit * *[single enum selection]*

Choose the distance between two successive dates.

#### Locale * *[fixed values selection]*

Specifies the locale used for interpreting the start day of the week.

#### Sequence Column Name * *[column name]*

Name the column that contains the time sequence.

#### Only Weekdays * *[boolean]*

When checked, only weekdays are included in the generated sequence. This setting is supported only for the following 'Unit' types: Milliseconds, Seconds, Minutes, Hours, Days. Using it with any other unit type generates a warning, and the workflow continues with the setting ignored.
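As a rough sketch of how such a sequence could be generated, consider the following Python function. The function name, the zero-based row index and the inclusive end date are assumptions for illustration; the processor's actual behaviour may differ.

```python
from datetime import datetime, timedelta


def time_sequence(start, end, step, only_weekdays=False):
    """Return ascending (row_index, timestamp) pairs from start to end."""
    result = []
    current = start
    while current <= end:  # assumption: the end date is inclusive
        # weekday() returns 0..4 for Monday..Friday and 5..6 for the weekend
        if not only_weekdays or current.weekday() < 5:
            result.append((len(result), current))
        current += step
    return result


# 2024-01-05 is a Friday, 2024-01-09 is a Tuesday
all_days = time_sequence(datetime(2024, 1, 5), datetime(2024, 1, 9), timedelta(days=1))
weekdays = time_sequence(datetime(2024, 1, 5), datetime(2024, 1, 9), timedelta(days=1),
                         only_weekdays=True)
```

With 'Only Weekdays' enabled, Saturday and Sunday are skipped while the row index stays continuous.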

# Timestamp Difference

**UUID:** 00000000-0000-0000-0057-000000000001

## Description

This processor calculates the difference between two timestamps (located in different columns of the same dataset) and outputs the result in an additional column.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Timestamp Column 1 * *[single column selection]*

Select the column that should be subtracted from the second one.

#### Timestamp Column 2 * *[single column selection]*

Select the column on which the subtraction should be performed.

#### Time Unit * *[single enum selection]*

Select the desired output unit

#### New Column Name * *[column name]*

Name of the column that will contain the result
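The direction of the subtraction (column 2 minus column 1) can be sketched as follows. The unit names in `UNITS` are hypothetical stand-ins for the 'Time Unit' enum, and returning a fractional value is an assumption; the processor may round or truncate.

```python
from datetime import datetime, timedelta

# Hypothetical unit names; the real enum values may differ.
UNITS = {
    "seconds": timedelta(seconds=1),
    "minutes": timedelta(minutes=1),
    "hours": timedelta(hours=1),
    "days": timedelta(days=1),
}


def timestamp_difference(ts1, ts2, unit):
    """Return ts2 - ts1 expressed in the given unit."""
    return (ts2 - ts1) / UNITS[unit]


start = datetime(2024, 1, 1, 8, 0)
end = datetime(2024, 1, 1, 9, 30)
minutes = timestamp_difference(start, end, "minutes")
hours_reversed = timestamp_difference(end, start, "hours")
```

Swapping the two columns flips the sign of the result.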

# Timestamp Extraction

**UUID:** 00000000-0000-0000-0076-000000000001

## Description

Extracts specific information (e.g. day, year, month) from a given timestamp column into one or more new columns.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Output

## Configurations

#### Timestamp column * *[single column selection]*

Select the timestamp column from which the information should be extracted.

#### Extraction methods * *[multiple enum selection]*

Choose the extraction methods for the given timestamps. Each method extracts a certain part of the timestamp, e.g. "Hour of the day" extracts the hour of the timestamp, "Date with/without time in seconds" converts the timestamp into a Unix timestamp (seconds). "Date without time" returns the date (yyyy-MM-dd) in form of a string.

#### Prefix *[string]*

Choose a prefix for the newly extracted columns. The prefix needs to start with a letter and may only contain characters that are valid in column names (e.g. "abc_" would be a valid prefix, and "123" would be invalid). If no prefix is chosen, only the names of the extraction methods will be used as column names.
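The following Python sketch illustrates a few extraction methods and the prefix rule. The method names in `EXTRACTORS` and the exact set of characters allowed in a prefix are assumptions; the processor's enum values and validation may differ.

```python
import re
from datetime import datetime, timezone

# Hypothetical method names standing in for the 'Extraction methods' enum.
EXTRACTORS = {
    "hour_of_day": lambda ts: ts.hour,
    "unix_seconds": lambda ts: int(ts.timestamp()),
    "date_without_time": lambda ts: ts.strftime("%Y-%m-%d"),
}


def is_valid_prefix(prefix):
    """Must start with a letter; letters, digits and '_' may follow (assumption)."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", prefix) is not None


def extract(ts, methods, prefix=""):
    """Return one new column per method, named prefix + method name."""
    if prefix and not is_valid_prefix(prefix):
        raise ValueError(f"Invalid prefix: {prefix!r}")
    return {prefix + name: EXTRACTORS[name](ts) for name in methods}


ts = datetime(2024, 3, 15, 13, 45, tzinfo=timezone.utc)
columns = extract(ts, ["hour_of_day", "date_without_time"], prefix="abc_")
```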

# Train Model

**UUID:** 00000000-0000-0000-0066-000000000001

## Description

Creates a Spark/MLeap pipeline with the configured models, trains it, and saves the trained model.

**For more details refer to the following article.**

## Input(s)

*in.trainingData*- Training Data

*in.forecastData*- Forecast Data

## Output(s)

*out.output*- output

## Configurations

#### Save Model * *[model save]*

Save the generated model either as a new model or as a new version of an already existing model. The saved model can then be used with the Model Application.

#### Drop Temporary Columns * *[boolean]*

With this switch, temporary columns created by the pipeline can be dropped. In situations where it is useful to see the content of these columns, this switch can be set to false.

#### Decision Tree Regression *[composed]*

Decision Tree learning algorithm for regression

###### Decision Tree Regression > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only the following column types: int, double, numeric or datetime.

###### Decision Tree Regression > Independent columns * *[multiple columns selection]*

Select the independent columns for the creation of the model.

###### Decision Tree Regression > Forecast column name * *[column name]*

Name of the column that gets added to the data set and contains the forecast. Must not contain whitespaces!

###### Decision Tree Regression > Maximum number of bins *[integer]*

Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be >= 2 and >= the number of categories in any categorical feature. If not set, the maximum of 32 and the number of distinct values in the categories is chosen. If the chosen number is too small, the validation error will contain a hint about the minimal value that may be entered here.
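The bin rule can be sketched in Python as follows. The function name and error message are hypothetical; the sketch only mirrors the constraints stated above (at least 2, at least the largest categorical cardinality, default of 32) and applies equally to the other tree-based models in this processor.

```python
def effective_max_bins(categorical_cardinalities, configured=None, default=32):
    """Resolve the number of bins from the configuration and the data."""
    # Smallest value that satisfies both constraints
    required = max([2] + list(categorical_cardinalities))
    if configured is None:
        return max(default, required)
    if configured < required:
        raise ValueError(f"Maximum number of bins must be >= {required}")
    return configured


# One categorical feature with 50 distinct values, nothing configured
bins = effective_max_bins([3, 50])
```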

###### Decision Tree Regression > Maximum tree depth *[integer]*

The maximum depth the tree may have. If not set, the default is 5.

###### Decision Tree Regression > Seed *[integer]*

A random number seed for deterministic results. If left empty a completely random seed will be chosen.

###### Decision Tree Regression > Minimum Information Gain *[double]*

Minimum information gain for a split to be considered at a tree node.

###### Decision Tree Regression > Variance column name *[column name]*

Name of the column that gets added to the data set and contains the variance. Must not contain whitespaces!

#### Decision Tree Classification *[composed]*

Decision Tree learning algorithm for classification

###### Decision Tree Classification > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only text columns.

###### Decision Tree Classification > Independent columns * *[multiple columns selection]*

Select the independent columns for the creation of the model.

###### Decision Tree Classification > Forecast column name * *[column name]*

Name of the column that gets added to the data set and contains the forecast. Must not contain whitespaces!

###### Decision Tree Classification > Probability base column name *[column name]*

If the output should contain probabilities for the forecasted data, then a base column name has to be given here. No whitespaces allowed!

###### Decision Tree Classification > Maximum number of bins *[integer]*

Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be >= 2 and >= the number of categories in any categorical feature. If not set, the maximum of 32 and the number of distinct values in the categories is chosen. If the chosen number is too small, the validation error will contain a hint about the minimal value that may be entered here.

###### Decision Tree Classification > Maximum tree depth *[integer]*

The maximum depth the tree may have. If not set, the default is 5.

###### Decision Tree Classification > Seed *[integer]*

A random number seed for deterministic results. If left empty, a completely random seed will be chosen.

###### Decision Tree Classification > Minimum Information Gain *[double]*

Minimum information gain for a split to be considered at a tree node.

###### Decision Tree Classification > Impurity * *[single enum selection]*

Criterion used for information gain calculation. Default GINI.
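To make the interplay between 'Impurity' and 'Minimum Information Gain' more concrete, here is a small sketch of the standard Gini impurity and the resulting gain of a binary split. The function names are illustrative; this is the textbook definition, not code taken from the processor.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of both children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = ["a", "a", "b", "b"]
gain = information_gain(parent, ["a", "a"], ["b", "b"])
```

A candidate split whose gain falls below the configured 'Minimum Information Gain' would not be considered at that node.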

#### Random Forest Regression *[composed]*

Random Forest learning algorithm for regression

###### Random Forest Regression > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only the following column types: int, double, numeric or datetime.

###### Random Forest Regression > Independent columns * *[multiple columns selection]*

Select the independent columns for the creation of the model.

###### Random Forest Regression > Forecast column name * *[column name]*

Name of the column that gets added to the data set and contains the forecast. Must not contain whitespaces!

###### Random Forest Regression > Maximum number of trees *[integer]*

The maximum number of trees to be generated. Default is 10.

###### Random Forest Regression > Maximum number of bins *[integer]*

Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be >= 2 and >= the number of categories in any categorical feature. If not set, the maximum of 32 and the number of distinct values in the categories is chosen. If the chosen number is too small, the validation error will contain a hint about the minimal value that may be entered here.

###### Random Forest Regression > Maximum depth of each tree *[integer]*

The maximum depth each tree may have. If not set, the default is 5.

###### Random Forest Regression > Seed *[integer]*

A random number seed for deterministic results. If left empty, a completely random seed will be chosen.

###### Random Forest Regression > Minimum Information Gain *[double]*

Minimum information gain for a split to be considered at a tree node.

###### Random Forest Regression > Subsampling Rate *[double]*

Fraction of the training data used for learning each decision tree, in range (0, 1].

###### Random Forest Regression > Minimum Number of Instances Per Node *[integer]*

Minimum number of instances each child must have after a split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.

#### Random Forest Classifier *[composed]*

Random Forest learning algorithm for classification.

###### Random Forest Classifier > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only the following column types: text, timestamp and integer.

###### Random Forest Classifier > Independent columns * *[multiple columns selection]*

Select the independent columns for the creation of the model.

###### Random Forest Classifier > Forecast column name * *[column name]*

###### Random Forest Classifier > Probability column base name *[column name]*

Base column name for columns containing predicted class conditional probabilities. Must not contain whitespaces!

###### Random Forest Classifier > Maximum number of trees *[integer]*

The maximum number of trees to be generated. Default is 10.

###### Random Forest Classifier > Maximum number of bins *[integer]*

###### Random Forest Classifier > Maximum depth of each tree *[integer]*

The maximum depth each tree may have. If not set, the default is 5.

###### Random Forest Classifier > Seed *[integer]*

A random number seed for deterministic results. If left empty a completely random seed will be chosen.

###### Random Forest Classifier > Impurity * *[single enum selection]*

Criterion used for information gain calculation. Default is 'gini'.

###### Random Forest Classifier > Minimum Information Gain *[double]*

Minimum information gain for a split to be considered at a tree node.

###### Random Forest Classifier > Subsampling Rate *[double]*

Fraction of the training data used for learning each decision tree, in range (0, 1].

###### Random Forest Classifier > Minimum Number of Instances Per Node *[integer]*

Minimum number of instances each child must have after a split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.

#### Association Rule Recommender *[composed]*

Recommendation based on association rules. Each forecasted row that matches multiple rules is duplicated in the output, with one row per matched rule carrying that rule's recommendation.

Multiple groups warning: If multiple groups (i.e. ARR models) are defined within a processor, the number of output rows will be higher than expected, because the ARR models are applied one after another and not in parallel, i.e. the input for the next model is the dataset created by applying the previous model. Because of how the model works, a single input row can be duplicated in the output by the first model (multiple matching rules), and each of these rows can be duplicated again by the second model, etc.

###### Association Rule Recommender > Left-hand side * *[single column selection]*

The left-hand-side of the recommendation rule.

###### Association Rule Recommender > Right-hand side * *[single column selection]*

The right-hand-side of the recommendation rule.

###### Association Rule Recommender > Prediction column name * *[column name]*

The name of the column which should contain the recommended items. Must not contain whitespaces!

###### Association Rule Recommender > Confidence column *[single column selection]*

Column containing the confidence value for the recommended value of a rule. If present, a column with the rule's confidence is also added to the output (its name is the 'Prediction column name' value followed by _confidence).

###### Association Rule Recommender > Item Separator * *[string]*

The separator of different items in the left-hand-sides of rules. Has to be a single character value!

###### Association Rule Recommender > Rule's left-hand side column name *[column name]*

The name of the column where the left-hand-side of an applied rule in the output should be located. If nothing is given this column will not be created. Must not contain whitespaces!
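The row duplication described in the multiple groups warning can be illustrated with a short Python sketch. The matching criterion (a rule matches when all of its left-hand-side items occur in the row), the dropping of unmatched rows and all names are assumptions made for illustration; they are not confirmed details of the processor.

```python
def apply_rules(rows, rules, item_key, prediction_key, separator=","):
    """Emit one output row per (input row, matching rule) pair."""
    out = []
    for row in rows:
        items = set(row[item_key].split(separator))
        for lhs, rhs, confidence in rules:
            # Assumption: a rule matches when all of its LHS items are present.
            if set(lhs.split(separator)) <= items:
                out.append({
                    **row,
                    prediction_key: rhs,
                    prediction_key + "_confidence": confidence,
                })
    return out


rows = [{"basket": "bread,butter"}]
model_a = [("bread", "jam", 0.8), ("butter", "milk", 0.6)]
model_b = [("bread", "cheese", 0.5), ("bread,butter", "eggs", 0.4)]

after_a = apply_rules(rows, model_a, "basket", "rec_a")
# The second model runs on the output of the first, not on the original input.
after_b = apply_rules(after_a, model_b, "basket", "rec_b")
```

One input row becomes two rows after the first model and four after the second, which is the multiplication effect the warning describes.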

#### Gaussian Mixture *[composed]*

A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability.

###### Gaussian Mixture > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Gaussian Mixture > Cluster column * *[column name]*

Select the name of the column which should contain the clusters.

###### Gaussian Mixture > Gaussians * *[integer]*

Number of independent Gaussians to use in the mixture model.

###### Gaussian Mixture > Probability column base name *[column name]*

Base column name for columns containing predicted cluster probabilities. Must not contain whitespaces!

###### Gaussian Mixture > Convergence tolerance * *[double]*

The largest change in log-likelihood at which convergence is considered to have occurred.

###### Gaussian Mixture > Max iterations *[integer]*

The maximum number of iterations in which log-likelihood changes must be less than value defined for 'Convergence tolerance'.

###### Gaussian Mixture > Seed *[integer]*

Random seed for getting deterministic results. If left empty a completely random seed will be chosen.

#### Generalized Linear Regression *[composed]*

In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

###### Generalized Linear Regression > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only the following column types: int, double, numeric or datetime.

###### Generalized Linear Regression > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Generalized Linear Regression > Forecast column name * *[column name]*

###### Generalized Linear Regression > Family *[single enum selection]*

Description of the error distribution. Supported values are:

- Gaussian - supported link functions: Identity, Logarithm, Inverse
- Binomial - supported link functions: Logit, Probit, cloglog
- Poisson - supported link functions: Logarithm, Identity, Square root
- Gamma - supported link functions: Inverse, Identity, Logarithm
- Tweedie - supported link functions: power link, set through 'Power Link Function'

If no value is selected, 'Gaussian' is used by default.

###### Generalized Linear Regression > Link Function *[single enum selection]*

Provides the relationship between the linear predictor and the mean of the distribution function. The selected function should be supported by the selected 'Family'. Available values are:

- Identity - The identity function.
- Logarithm - Logarithmic relationship.
- Inverse - Inverse relationship.
- Logit - Inverse of the sigmoid logistic function.
- Probit - The quantile function associated with the standard normal distribution.
- cloglog - Complementary log-log relationship.
- Square root - Relationship by the square root.

If not specified, defaults for each error distribution are as follows:

- Gaussian - Identity
- Binomial - Logit
- Poisson - Logarithm
- Gamma - Inverse
- Tweedie - not applicable

###### Generalized Linear Regression > Link prediction column name *[column name]*

Link prediction (linear predictor) column name.

###### Generalized Linear Regression > Power Link Function *[double]*

Index in the power link function. Only applicable to the Tweedie family. Note that link power 0, 1, -1 or 0.5 corresponds to the Logarithm, Identity, Inverse or Square root link, respectively. When not set, this value defaults to 1 - 'Variance Power', which matches the R 'statmod' package.
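The correspondence between the link power and the named link functions can be checked with a minimal sketch of the power link. The function name is illustrative; the definition used here (the mean raised to the link power, with the logarithm as the special case for power 0) is the usual textbook form and is assumed to match the processor.

```python
import math


def power_link(mu, link_power):
    """Power link g(mu) = mu ** link_power; power 0 denotes the logarithm."""
    if link_power == 0:
        return math.log(mu)
    return mu ** link_power


identity = power_link(4.0, 1)       # Identity link
inverse = power_link(4.0, -1)       # Inverse link
square_root = power_link(4.0, 0.5)  # Square root link
logarithm = power_link(1.0, 0)      # Logarithm link
```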

###### Generalized Linear Regression > Variance Power *[double]*

The power in the variance function of the Tweedie distribution which provides the relationship between the variance and mean of the distribution. Only applicable to the Tweedie family. Supported values: 0 and [1, Inf). Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma family, respectively. Default is 0.0.

###### Generalized Linear Regression > Solver algorithm *[single enum selection]*

The solver algorithm used for optimization.

###### Generalized Linear Regression > Weight Column *[single column selection]*

Name of a column with weights. If this is not set or empty, we treat all instance weights as 1.0. With 'Family' set to 'Binomial', weights correspond to number of trials and should be integer. Non-integer weights are rounded to integer in AIC calculation.

###### Generalized Linear Regression > Max iterations *[integer]*

The maximum number of iterations.

###### Generalized Linear Regression > Convergence tolerance *[double]*

The convergence tolerance of iterations.

###### Generalized Linear Regression > Regularization *[double]*

The regularization parameter for L2 regularization. Default is 0.0.

#### Linear Regression *[composed]*

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

###### Linear Regression > Dependent column * *[single column selection]*

###### Linear Regression > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Linear Regression > Forecast column name * *[column name]*

###### Linear Regression > Standardization * *[boolean]*

Whether to standardize the training features before fitting the model.

###### Linear Regression > Solver algorithm *[single enum selection]*

The solver algorithm used for optimization. Supported values are:

- l-bfgs - Limited-memory BFGS, which is a limited-memory quasi-Newton optimization method.
- normal - Normal Equation as an analytical solution to the linear regression problem. This uses weighted least squares.
- auto - Solver algorithm is selected automatically.

If no value is selected, 'normal' is used by default.

###### Linear Regression > ElasticNet mixing parameter *[double]*

For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. For alpha in (0,1), the penalty is a combination of L1 and L2. Default is 0.0 which is an L2 penalty.
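The effect of the mixing parameter can be sketched as below. The formula follows the elastic net convention documented for Spark ML (including the factor 1/2 on the L2 term), which is assumed to apply here because the processor builds a Spark pipeline; the function and parameter names are illustrative.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """reg_param * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


weights = [3.0, -4.0]
pure_l2 = elastic_net_penalty(weights, reg_param=1.0, alpha=0.0)
pure_l1 = elastic_net_penalty(weights, reg_param=1.0, alpha=1.0)
mixed = elastic_net_penalty(weights, reg_param=1.0, alpha=0.5)
```

Here `reg_param` plays the role of the 'Regularization' setting and `alpha` that of the 'ElasticNet mixing parameter'.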

###### Linear Regression > Weight Column *[single column selection]*

Name of a column with weights. If this is not set or empty, we treat all instance weights as 1.0.

###### Linear Regression > Max iterations *[integer]*

The maximum number of iterations.

###### Linear Regression > Convergence tolerance *[double]*

The convergence tolerance of iterations.

###### Linear Regression > Regularization *[double]*

The regularization parameter.

###### Linear Regression > Aggregation Depth *[integer]*

Suggested depth for treeAggregate

###### Linear Regression > Fit an intercept term * *[boolean]*

Flag to indicate whether the intercept term should be fitted.

#### Multilayer Perceptron Classifier *[composed]*

Multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network. MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b and applying an activation function.

###### Multilayer Perceptron Classifier > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only text columns.

###### Multilayer Perceptron Classifier > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Multilayer Perceptron Classifier > Forecast column name * *[column name]*

###### Multilayer Perceptron Classifier > Intermediate layers * *[string]*

Comma-separated list of intermediate layer sizes for the neural network. Each value in the list must be a positive integer.
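A possible validation of this setting is sketched below. The function name, the tolerance for surrounding whitespace and the error message are assumptions; the processor's own validation may be stricter or more lenient.

```python
def parse_layers(text):
    """Parse a comma-separated list of positive integers into layer sizes."""
    layers = []
    for part in text.split(","):
        value = part.strip()
        # Reject empty entries, signs, decimals and zero
        if not (value.isascii() and value.isdigit()) or int(value) <= 0:
            raise ValueError(f"Invalid layer size: {part!r}")
        layers.append(int(value))
    return layers


# Two intermediate layers with 8 and 4 nodes
layers = parse_layers("8, 4")
```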

###### Multilayer Perceptron Classifier > Probability column base name *[column name]*

Base column name for columns containing predicted class conditional probabilities. Must not contain whitespaces!

###### Multilayer Perceptron Classifier > Solver algorithm *[single enum selection]*

The solver algorithm used for optimization. Supported values are:

- LBFGS - Limited-memory BFGS, which is a limited-memory quasi-Newton optimization method.
- GD - Minibatch gradient descent solver.

If no value is selected, 'LBFGS' is used by default.

###### Multilayer Perceptron Classifier > Max iterations *[integer]*

The maximum number of iterations.

###### Multilayer Perceptron Classifier > Convergence tolerance *[double]*

The convergence tolerance of iterations. Smaller value will lead to higher accuracy with the cost of more iterations.

###### Multilayer Perceptron Classifier > Step size *[double]*

Step size to be used for each iteration of optimization (applicable only for solver 'GD').

###### Multilayer Perceptron Classifier > Seed *[integer]*

A seed for having deterministic forecasts with multiple executions. If left empty a completely random seed will be chosen.

#### Binomial Logistic Regression *[composed]*

In statistics, binomial logistic regression is a classification method that generalizes logistic regression to problems where the dependent variable has exactly two outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a binomially distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).

###### Binomial Logistic Regression > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only text columns.

###### Binomial Logistic Regression > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Binomial Logistic Regression > Forecast column name * *[column name]*

###### Binomial Logistic Regression > Probability base column name *[column name]*

If the output should contain probabilities for the forecasted data, then a base column name has to be given here. No whitespaces allowed!

###### Binomial Logistic Regression > Standardization * *[boolean]*

Whether to standardize the training features before fitting the model.

###### Binomial Logistic Regression > ElasticNet mixing parameter *[double]*

For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. For alpha in (0,1), the penalty is a combination of L1 and L2. Default is 0.0 which is an L2 penalty.

###### Binomial Logistic Regression > Weight Column *[single column selection]*

Name of a column with weights. If this is not set or empty, we treat all instance weights as 1.0.

###### Binomial Logistic Regression > Max iterations *[integer]*

The maximum number of iterations.

###### Binomial Logistic Regression > Convergence tolerance *[double]*

The convergence tolerance of iterations.

###### Binomial Logistic Regression > Regularization *[double]*

The regularization parameter.

###### Binomial Logistic Regression > Aggregation Depth *[integer]*

Suggested depth for treeAggregate

###### Binomial Logistic Regression > Fit an intercept term * *[boolean]*

Flag to indicate whether the intercept term should be fitted.

#### Multinomial Logistic Regression *[composed]*

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).

###### Multinomial Logistic Regression > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only text columns.

###### Multinomial Logistic Regression > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Multinomial Logistic Regression > Forecast column name * *[column name]*

###### Multinomial Logistic Regression > Probability base column name *[column name]*

If the output should contain probabilities for the forecasted data, then a base column name has to be given here. No whitespaces allowed!

###### Multinomial Logistic Regression > Standardization * *[boolean]*

Whether to standardize the training features before fitting the model.

###### Multinomial Logistic Regression > ElasticNet mixing parameter *[double]*

For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. For alpha in (0,1), the penalty is a combination of L1 and L2. Default is 0.0 which is an L2 penalty.

###### Multinomial Logistic Regression > Weight Column *[single column selection]*

Name of a column with weights. If this is not set or empty, we treat all instance weights as 1.0.

###### Multinomial Logistic Regression > Max iterations *[integer]*

The maximum number of iterations.

###### Multinomial Logistic Regression > Convergence tolerance *[double]*

The convergence tolerance of iterations.

###### Multinomial Logistic Regression > Regularization *[double]*

The regularization parameter.

###### Multinomial Logistic Regression > Aggregation Depth *[integer]*

Suggested depth for treeAggregate

###### Multinomial Logistic Regression > Fit an intercept term * *[boolean]*

Flag to indicate whether the intercept term should be fitted.

#### Naive Bayes Classifier *[composed]*

Naive Bayes classifiers are a family of simple probabilistic, multiclass classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between every pair of features.

###### Naive Bayes Classifier > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only text columns.

###### Naive Bayes Classifier > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Naive Bayes Classifier > Forecast column name * *[column name]*

###### Naive Bayes Classifier > Model type * *[single enum selection]*

Naive Bayes model type:

- Bernoulli - Naive Bayes models with boolean features,
- Multinomial - Naive Bayes models with discrete features.

###### Naive Bayes Classifier > Probability base column name *[column name]*

###### Naive Bayes Classifier > Weight column *[single column selection]*

Name of a column with weights. If this is not set or empty, we treat all instance weights as 1.0.

###### Naive Bayes Classifier > Additive smoothing *[double]*

In statistics, additive smoothing, also called Laplace smoothing (not to be confused with Laplacian smoothing as used in image processing), or Lidstone smoothing, is a technique used to smooth categorical data. Default is 1.0

###### Naive Bayes Classifier > Thresholds *[string]*

Comma-separated list of double values (with a dot (.) as the decimal separator) used as thresholds in multi-class classification to adjust the probability of predicting each class. The list must contain exactly one value per class; all values must be > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
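The p/t rule can be illustrated with a minimal sketch. The function names are hypothetical, and treating a threshold of 0 as an infinitely large score is an assumption about how the single permitted zero is handled.

```python
def parse_thresholds(text, num_classes):
    """Parse the comma-separated thresholds and check the stated constraints."""
    values = [float(part) for part in text.split(",")]
    if len(values) != num_classes:
        raise ValueError("Exactly one threshold per class is required")
    if any(v < 0 for v in values) or values.count(0.0) > 1:
        raise ValueError("Thresholds must be > 0; at most one may be 0")
    return values


def predict_class(probabilities, thresholds):
    """Return the index of the class with the largest p / t."""
    def score(i):
        p, t = probabilities[i], thresholds[i]
        # Assumption: a zero threshold makes the class win unconditionally.
        return float("inf") if t == 0 else p / t
    return max(range(len(probabilities)), key=score)


thresholds = parse_thresholds("0.9, 0.3", num_classes=2)
predicted = predict_class([0.6, 0.4], thresholds)
```

Although class 0 has the higher raw probability, its larger threshold lowers its score, so class 1 is predicted.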

#### Principal Component Analysis *[composed]*

PCA is an unsupervised method used in exploratory data analysis and for making predictive models. It's a means of revealing the internal structure of the data in a way that best explains the variance in the data. It can be a very useful step especially for visualizing and pre-processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible.

###### Principal Component Analysis > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Principal Component Analysis > Principal components * *[integer]*

Number of principal components.

###### Principal Component Analysis > Base output column name * *[column name]*

Base name for output columns generated by PCA for input columns defined in 'Feature columns'

###### Principal Component Analysis > Output column start index * *[integer]*

Index value from which to start naming output columns. Names of the output columns will start with the value defined in 'Base output column name' followed by '_<startidx>', '_<startidx+1>', ...

###### Principal Component Analysis > Center data with mean * *[boolean]*

Standardize input data after mean centering.

###### Principal Component Analysis > Scale data to unit standard deviation * *[boolean]*

Standardize input data after scaling to unit variance.

#### Survival Regression *[composed]*

Survival regression model is based on Accelerated failure time (AFT) model. In the statistical area of survival analysis, an accelerated failure time model (AFT model) is a parametric model that provides an alternative to the commonly used proportional hazards models. Whereas a proportional hazards model assumes that the effect of a covariate is to multiply the hazard by some constant, an AFT model assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant. This is especially appealing in a technical context where the 'disease' is a result of some mechanical process with a known sequence of intermediary stages.

###### Survival Regression > Dependent column * *[single column selection]*

Select the dependent column for the creation of the model. This model supports only text columns.

###### Survival Regression > Feature columns * *[multiple columns selection]*

Select the feature columns for the creation of the model.

###### Survival Regression > Forecast column name * *[column name]*

###### Survival Regression > Quantile probabilities * *[string]*

Comma-separated list of double values (with a dot (.) as the decimal separator) for quantile probabilities. Each value must lie in the open interval (0,1). If set, the list must contain at least one value.
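The expected input format can be illustrated with a small validation sketch. The function name `parse_quantile_probabilities` is hypothetical, and how the processor itself reports invalid values is not documented here; raising `ValueError` is an assumption.

```python
from typing import List


def parse_quantile_probabilities(raw: str) -> List[float]:
    """Parse a comma-separated list such as '0.25, 0.5, 0.75' (illustrative sketch)."""
    # float() only accepts a dot as decimal separator, matching the documented format.
    values = [float(part.strip()) for part in raw.split(",") if part.strip()]
    if not values:
        raise ValueError("at least one quantile probability is required")
    for value in values:
        # Each probability has to lie strictly between 0 and 1.
        if not 0.0 < value < 1.0:
            raise ValueError(f"quantile probability {value} is outside (0, 1)")
    return values


print(parse_quantile_probabilities("0.25, 0.5,0.75"))  # [0.25, 0.5, 0.75]
```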

###### Survival Regression > Quantiles base column name *[column name]*

If set, the quantiles for the corresponding 'Quantile probabilities' are output in columns whose names start with the specified base column name. Whitespace is not allowed.

###### Survival Regression > Censor column * *[single column selection]*

Select the censor column. Values in this column must be 0 or 1. A value of 1 means the event has occurred (uncensored); otherwise the observation is censored.

###### Survival Regression > Aggregation Depth *[integer]*

Suggested depth for treeAggregate

###### Survival Regression > Max iterations *[integer]*

The maximum number of iterations.

###### Survival Regression > Convergence tolerance *[double]*

The convergence tolerance of iterations.

###### Survival Regression > Fit an intercept term * *[boolean]*

Flag to indicate whether the intercept term should be fitted.

# Transposition

**UUID:** 00000000-0000-0000-0029-000000000001

## Description

Changes the format of the table by moving the values of multiple selected columns into a single value column. A new identifier column records which of the former columns each value came from (it contains the column names), and the values in the remaining columns are replicated so that all columns have the same number of rows. The resulting dataset has more rows than the original. Example: if two columns are selected and their values are moved into one column, the dataset will have twice as many rows, because the two columns are "appended" to each other and the values in the other columns are duplicated.

**For more details refer to the following article.**

## Input(s)

*in.input*- Input to be converted

## Output(s)

*out.output*- Converted output

## Configurations

#### Column selection method * *[multiple enum selection]*

One or multiple methods to select columns. If multiple methods are selected the union of the set of columns matching each method gets transposed. Note: All columns which are transposed but are not specified explicitly will still be included in type inference and thus selectable in subsequent processors.

#### Columns to convert to rows. *[multiple columns selection]*

Select the columns for which the values should be all entered in one of the resulting columns.

#### Regular expression pattern for selecting columns *[string]*

Defines a regular expression pattern for selecting columns to transpose. Matching is case insensitive. The pattern has to adhere to the Java rules for regular expressions (https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). Examples: "^T_" (columns starting with "T_"), "Z$" (columns ending with "Z"), "AA" (columns containing "AA").
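The three example patterns can be tried out with the following sketch. It uses Python's `re` module rather than Java's `java.util.regex`, so it is only an approximation; for simple patterns like these the two behave the same. The function name `select_columns` and the sample column names are hypothetical, and the sketch assumes a column is selected when the pattern matches anywhere in its name.

```python
import re
from typing import List


def select_columns(columns: List[str], pattern: str) -> List[str]:
    """Return the columns whose names match the pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    # search() finds a match anywhere in the name; anchors (^, $) restrict the position.
    return [name for name in columns if regex.search(name)]


columns = ["T_in", "t_out", "sizeZ", "xAAy", "other"]
print(select_columns(columns, "^T_"))  # ['T_in', 't_out']
print(select_columns(columns, "Z$"))   # ['sizeZ']
print(select_columns(columns, "AA"))   # ['xAAy']
```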

#### Identifier column name * *[column name]*

Enter a name for the new column that holds the names of the columns/variables whose values were moved into one column. It identifies which former column/variable each value in the value column belonged to.

#### Value column name * *[column name]*

Enter a name for the new column that holds all the values of the former columns/variables.
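The reshaping performed by this processor resembles a wide-to-long conversion. The following is a minimal sketch on plain dictionaries, not the processor's implementation; the function name `transpose` and the sample data are hypothetical, and the row order of the real output may differ.

```python
from typing import Any, Dict, List


def transpose(rows: List[Dict[str, Any]], columns_to_convert: List[str],
              identifier_name: str, value_name: str) -> List[Dict[str, Any]]:
    """Move the selected columns into one value column plus one identifier column."""
    result = []
    for row in rows:
        # Values of all other columns are replicated for every converted column.
        kept = {key: value for key, value in row.items() if key not in columns_to_convert}
        for column in columns_to_convert:
            new_row = dict(kept)
            new_row[identifier_name] = column   # name of the former column
            new_row[value_name] = row[column]   # value taken from that column
            result.append(new_row)
    return result


rows = [
    {"id": 1, "temp": 20.5, "pressure": 1.0},
    {"id": 2, "temp": 21.0, "pressure": 1.1},
]
long_rows = transpose(rows, ["temp", "pressure"], "variable", "value")
print(len(long_rows))  # 4
print(long_rows[0])    # {'id': 1, 'variable': 'temp', 'value': 20.5}
```

Two input rows and two selected columns yield four output rows, matching the doubling described in the processor description.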

# Union

**UUID:** 00000000-0000-0000-0025-000000000001

## Description

Concatenates two datasets vertically: the rows of the second input are appended to the first input. The second input must contain the same columns (name and representation type) as the first input. Additional columns that appear in the second input but not in the first input will be ignored. No configuration needed.

**For more details refer to the following article.**

## Input(s)

*in.input1*- First Input

*in.input2*- Second Input

## Output(s)

*out.union*- Union
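The behaviour described above (rows of the second input appended to the first, extra columns of the second input dropped) can be sketched as follows. The function name `union` and the error handling are hypothetical; the sketch assumes a non-empty first input and does not check representation types.

```python
from typing import Any, Dict, List


def union(first: List[Dict[str, Any]], second: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Append the rows of `second` to `first`, keeping only the columns of `first`."""
    # Assumption: the first input is non-empty and defines the output columns.
    columns = list(first[0].keys())
    combined = [dict(row) for row in first]
    for row in second:
        missing = [name for name in columns if name not in row]
        if missing:
            raise ValueError(f"second input lacks columns: {missing}")
        # Columns that exist only in the second input are ignored.
        combined.append({name: row[name] for name in columns})
    return combined


first = [{"a": 1, "b": "x"}]
second = [{"a": 2, "b": "y", "c": True}]
print(union(first, second))  # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}]
```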

# Versioning

**UUID:** 00000000-0000-0000-0101-000000000001

## Description

This processor appends versioning information taken from the second input to the first input.

**For more details refer to the following article.**

## Input(s)

*in.data*- data to be versioned

*in.versioningData*- versioning data

## Output(s)

*out.output*- Output

## Configurations

#### Versioning Column * *[single column selection]*

The column containing the version information.

#### Version Column Name * *[column name]*

The name of the column to be added for the version number.

#### Timestamp Column Name * *[column name]*

The name of the column to be added for the timestamp.

# XML / JSON Parsing

**UUID:** 00000000-0000-0000-0182-000000000001

## Description

Creates an output dataset from a single text column in XML or JSON format. The content of the output dataset is given by XPath / JSONPath expressions, where each expression creates one column. An online tool for testing XPath 1.0 expressions can be found here; for JSONPath you can use the following link.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.parsedXml*- Parsed Content

## Configurations

#### XML / JSON Column * *[single column selection]*

The column containing the XML / JSON.

#### Index Column *[single column selection]*

The column given here will be appended to the output. This is useful if you use multiple parsing processors and want to join their outputs. For each converted XML / JSON, the index is exactly the value of the given column, so the mapping is identical for all parsing processors that receive the same input data. Uniqueness of the index column is not enforced. By default, no index column is used or appended to the output.

#### Content Format * *[single enum selection]*

The format of the column content that should be parsed.

#### Common Path Expression Prefix * *[string]*

This prefix will be added to all given path expressions below.

#### XPath / JSONPath Expressions * *[composed]*

By adding XPath / JSONPath expressions you can specify which parts of the XML / JSON should be extracted. Each expression produces one column in the output dataset (all columns are of type string). All expressions must return the same number of results. If some expressions return a different number of results, use a separate parsing processor for them, as this processor does not join the data internally.

###### XPath / JSONPath Expressions > XPath / JSONPath * *[string]*

The XPath / JSONPath expression for extracting content from the XML / JSON. XPath expressions have to comply with the XPath 1.0 standard, and all namespaces have to be ignored.

###### XPath / JSONPath Expressions > Column Name * *[column name]*

The column name for the extracted content.
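The interplay of the common prefix, the path expressions and the column names can be sketched for the XML case as follows. This is not the processor's implementation: it uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath 1.0, and it assumes the prefix is prepended by plain string concatenation. The function name `parse_xml` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def parse_xml(document: str, prefix: str, expressions: Dict[str, str]) -> List[Dict[str, Optional[str]]]:
    """Evaluate one path expression per output column and combine the results row-wise."""
    root = ET.fromstring(document)
    columns = {}
    for column_name, expression in expressions.items():
        # Assumption: the common prefix is simply put in front of every expression.
        columns[column_name] = [element.text for element in root.findall(prefix + expression)]
    # All expressions have to yield the same number of results.
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("all path expressions must yield the same number of results")
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


doc = ("<orders>"
       "<order><id>1</id><item>bolt</item></order>"
       "<order><id>2</id><item>nut</item></order>"
       "</orders>")
print(parse_xml(doc, "./order/", {"order_id": "id", "item": "item"}))
# [{'order_id': '1', 'item': 'bolt'}, {'order_id': '2', 'item': 'nut'}]
```

All extracted values are strings, in line with the note that every output column has type string.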

# Zero Value Correction

**UUID:** 00000000-0000-0000-0099-000000000001

## Description

Corrects certain columns of a data set by delta values. The deltas are created from a second input by computing the average value for each column selected.

**For more details refer to the following article.**

## Input(s)

*in.data*- Input

*in.delta*- Delta Input

## Output(s)

*out.out*- Output

## Configurations

#### Selected columns *[multiple columns selection]*

Columns to correct. Selected columns must be present in both input data sets. If no column is selected all numeric columns which are present in both input datasets are used.
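The correction can be sketched as follows. The description only states that deltas are column averages of the second input; that the delta is subtracted from the data is an assumption made for this illustration, as are the function name `zero_value_correction` and the sample values.

```python
from typing import Dict, List


def zero_value_correction(data: List[Dict[str, float]], delta: List[Dict[str, float]],
                          columns: List[str]) -> List[Dict[str, float]]:
    """Shift the selected columns by the average of the same columns in the delta input."""
    # One delta per selected column: the mean over all rows of the delta input.
    offsets = {name: sum(row[name] for row in delta) / len(delta) for name in columns}
    # Assumption: the correction subtracts the delta; other columns pass through unchanged.
    return [
        {name: (value - offsets[name] if name in offsets else value) for name, value in row.items()}
        for row in data
    ]


data = [{"x": 10.0, "y": 3.0}]
delta = [{"x": 0.5}, {"x": 1.5}]
print(zero_value_correction(data, delta, ["x"]))  # [{'x': 9.0, 'y': 3.0}]
```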

# AI TS SC Aggregation (Deprecated!)

**UUID:** 00000000-0000-0000-0012-000000000001

**Deprecated**: *Please use the Auto-Interval Aggregation processor instead.*

**Replaced by:** *Auto-Interval Aggregation*

**Removed:** *true*

## Description

Aggregates a data set based on an aggregation column. Based on the number of data points to be computed by the processor the aggregation interval varies automatically.

## Input(s)

*in.data*- Aggregation Input

## Output(s)

*out.out*- Aggregated Output

## Configurations

#### Cuts * *[integer]*

Amount of aggregated data points the processor should compute.

#### Aggregation column * *[single column selection]*

Column containing values to aggregate over.

#### Aggregation Function *[single enum selection]*

Aggregation function over all columns except for those selected in Aggregation Column.

# Column Rename (Deprecated!)

**UUID:** 00000000-0000-0000-0077-000000000001

**Deprecated**: *Please use the Multiple Column Rename processor instead.*

**Replaced by:** *Multiple Column Rename*

**Removed:** *true*

## Description

Rename one or multiple columns

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Column Name * *[single column selection]*

Column to rename

#### New Column Name * *[column name]*

Specify new name for selected column.

#### Rename Further Columns *[composed]*

Select multiple columns to rename individually.

###### Rename Further Columns > New Column Name * *[column name]*

Specify new name for selected column.

# Doublify (Deprecated!)

**UUID:** 00000000-0000-0000-0078-000000000001

**Deprecated**: *Please use the Multiple Doublify processor instead.*

**Replaced by:** *Multiple Doublify*

**Removed:** *true*

## Description

Changes type of a column to Ratio-Scaled/Double

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Select Column * *[single column selection]*

Select the column for which the type should be changed.

# Multiple Timestamp Parsing (Deprecated!)

**UUID:** 00000000-0000-0000-0149-000000000001

**Deprecated**: *Please use the Data Type Conversion processor instead.*

**Replaced by:** *Data Type Conversion*

**Removed:** *true*

## Description

Parses multiple string columns to a datetime representation. Sets default value for illegal formats.

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Output

## Configurations

#### Select Column/-s * *[multiple columns selection]*

Select columns to convert into datetime representation.

#### Format *[string]*

Specify the datetime format as it is displayed in the selected columns.

Example: If the format of the selected column looks like 14/3/2016, specify the datetime as dd/M/yyyy.

#### Locale * *[fixed values selection]*

Specifies the locale used for parsing the date. Important for interpreting, e.g., week in year or day of week.

#### Default Value For Timestamp *[string]*

The default value that is used if a given timestamp-column cannot be parsed using the given timestamp-format. If no default value is given, 1970-01-01 is used.
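The fallback behaviour can be sketched as follows. The processor expects Java-style patterns such as `dd/M/yyyy`; this illustration uses Python's `strptime` directives instead (`%d/%m/%Y`) and does not model the locale setting. The function name `parse_or_default` is hypothetical.

```python
from datetime import datetime

# Documented fallback when no default value is configured.
EPOCH_DEFAULT = datetime(1970, 1, 1)


def parse_or_default(value: str, fmt: str, default: datetime = EPOCH_DEFAULT) -> datetime:
    """Parse a timestamp string and fall back to the default if it does not match."""
    try:
        return datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        # Illegal or missing values receive the default instead of failing.
        return default


print(parse_or_default("14/3/2016", "%d/%m/%Y"))   # 2016-03-14 00:00:00
print(parse_or_default("not a date", "%d/%m/%Y"))  # 1970-01-01 00:00:00
```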

# Peak Elimination (Deprecated!)

**UUID:** 00000000-0000-0000-0152-000000000001

**Deprecated**: *Please use the Grouped Peak Elimination processor instead.*

**Replaced by:** *Grouped Peak Elimination*

**Removed:** *true*

## Description

Eliminates peaks in input signals.

## Input(s)

*in.data*- Input

## Output(s)

*out.smoothed*- Input data without peaks

## Configurations

#### Column for Sorting * *[single column selection]*

The column used for sorting the input data.

#### Peak elimination columns *[multiple columns selection]*

Columns in which peaks should be eliminated

#### Smoothing Strategy * *[single enum selection]*

The smoothing strategy that should be used for eliminating peaks.

#### Minimal difference before replacement * *[double]*

Minimal difference between the computed and current value that needs to exist before we replace the current value by the computed one.

#### Relative Index in Windows *[integer]*

Relative index of the current element within the sliding window. With 0, the window consists of the current value and the values after it. With the maximal value allowed by the window size, the window consists of the current value and the values before it. Any index in between yields a window with values both before and after the current one. Default is 0.

#### Sliding Window Size *[integer]*

The size of the sliding window, which is used for computing smoothed values.
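How the relative index, the window size and the minimal difference interact can be sketched as follows. Several points are assumptions made for illustration: the relative index is zero-based with a maximum of window size minus one, windows are clipped at the ends of the data, the median stands in for the unspecified smoothing strategy, and a value is replaced when the difference reaches the minimal difference. The function names are hypothetical.

```python
from statistics import median
from typing import List, Tuple


def window_bounds(i: int, n: int, window_size: int, relative_index: int) -> Tuple[int, int]:
    """Half-open [start, end) range of the window in which element i sits at relative_index."""
    start = max(0, i - relative_index)                # clipped at the beginning (assumption)
    end = min(n, i - relative_index + window_size)    # clipped at the end (assumption)
    return start, end


def eliminate_peaks(values: List[float], window_size: int, relative_index: int,
                    min_difference: float) -> List[float]:
    """Replace values that deviate from the window median by at least min_difference."""
    smoothed = []
    for i, current in enumerate(values):
        start, end = window_bounds(i, len(values), window_size, relative_index)
        candidate = median(values[start:end])  # assumed smoothing strategy
        smoothed.append(candidate if abs(candidate - current) >= min_difference else current)
    return smoothed


print(window_bounds(5, 10, 3, 0))  # (5, 8): current value and values after it
print(window_bounds(5, 10, 3, 2))  # (3, 6): current value and values before it
print(eliminate_peaks([1.0, 1.0, 10.0, 1.0, 1.0], 3, 1, 2.0))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```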

# Timestamp Parsing (Deprecated!)

**UUID:** 00000000-0000-0000-0035-000000000001

**Deprecated**: *Please use the Data Type Conversion processor instead.*

**Replaced by:** *Data Type Conversion*

**Removed:** *true*

## Description

Converts a selected column into a timestamp column given a valid date/time format.

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Output

## Configurations

#### Selected column * *[single column selection]*

Column with date/time values that should be converted to the ONE DATA datetime format.

#### Selected column type * *[single enum selection]*

Select type of chosen column.

#### Format *[string]*

Specify the datetime format as it is given in the selected column.

Example: If the format of the selected column looks like 14/3/2016, specify the format as "d/M/yyyy" or "dd/MM/yyyy".

#### Locale * *[fixed values selection]*

Specifies the locale used for parsing the date. Important for interpreting, e.g., week in year or day of week.