Result Table

UUID: 00000000-0000-0000-0014-000000000001

Description

Collects the data set at the input and returns it as a result.


Input(s)

  • in.data - Input


Configurations

Take Max [integer]

Maximum number of elements to be collected. Must be positive. Defaults to 400 if left empty.


Group by [single column selection]

Optional column to define groups. If selected, the result of this processor will contain several tables, one for each distinct value in the selected column. Note that the selected column will not be included in the results but used as an identifier for each result of a group.


Sort by [single column selection]

Optional column to sort by. If selected, the result in each group will be sorted by this column.


Sorting Order [single enum selection]

Select the order in which the result should be sorted.
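
The interplay of the four configurations above can be sketched in plain Python. This is an illustration only, not the processor's implementation: the function name `result_tables` and its parameters are hypothetical, rows are modelled as dictionaries, and it is an assumption that the "Take Max" limit is applied per group after sorting.

```python
from collections import defaultdict

DEFAULT_TAKE_MAX = 400  # default stated in the "Take Max" description


def result_tables(rows, take_max=None, group_by=None, sort_by=None, descending=False):
    """Return a dict that maps each group identifier to its list of rows."""
    limit = DEFAULT_TAKE_MAX if take_max is None else take_max
    if limit <= 0:
        raise ValueError("Take Max must be positive")

    groups = defaultdict(list)
    for row in rows:
        if group_by is None:
            # Without "Group by" there is a single table, keyed by None here.
            groups[None].append(dict(row))
        else:
            # The grouping column identifies the table and is left out of its rows.
            rest = {k: v for k, v in row.items() if k != group_by}
            groups[row[group_by]].append(rest)

    tables = {}
    for key, members in groups.items():
        if sort_by is not None:
            members = sorted(members, key=lambda r: r[sort_by], reverse=descending)
        # Assumption: the limit is applied per group, after sorting.
        tables[key] = members[:limit]
    return tables
```

For example, grouping three rows by a `region` column yields one table per region, and none of the tables contains the `region` column itself.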


For more details refer to the following article



Data Table Save

UUID: 00000000-0000-0000-0005-000000000004

Description

Stores the input either into a new or into an existing Data Table. The content of an existing Data Table can either be replaced by the input or the input can be appended.


Input(s)

  • in.data - Input


Configurations

Data Table Save Configuration * [data table save]

Fill in the corresponding fields holding information about the data table to save (Select Data Table, Resolve Data Table by ID, Storage Format, Save Procedure, Schema Mismatch Behavior).


For more details refer to the following article



Microservice Output

UUID: 00000000-0000-0000-0096-000000000001


Description

Processor used to export data for exposed Workflows that serve as Microservices. Also produces a Result Table for testing purposes.


Input(s)

  • in.toReturn - Data to return


Configurations

Identifier * [string]

Identifier used in the calls to the microservice to identify this processor. Must be unique for all Microservice Output processors in one workflow.
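
The uniqueness rule can be checked with a few lines of Python before a workflow is exposed. This is a hypothetical helper for illustration; the name `duplicate_identifiers` is an assumption and not part of ONE DATA.

```python
from collections import Counter


def duplicate_identifiers(identifiers):
    """Return the identifiers used by more than one Microservice Output processor."""
    counts = Counter(identifiers)
    return sorted(name for name, n in counts.items() if n > 1)
```

An empty result means every Microservice Output processor in the workflow has its own identifier.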


For more details refer to the following article


Old Dataset Save (Deprecated!)

UUID: 00000000-0000-0000-0005-000000000001
Deprecated: Please use the Data Table Save processor instead.
Replaced by: Data Table Save
Removed: true

Description

Saves the input data set to a .csv file.


Input(s)

  • in.data - Input


Configurations

Directly select Dataset [data selection]

Select an existing dataset to append to or replace. Has no effect when saving new datasets and will be ignored. When appending to or replacing a dataset and having both this configuration and "Name of Dataset" configured, the value of this configuration will be used for selecting the dataset.


Name of Dataset [string]

Mandatory for NEW datasets! Symbolic name for the data set to be saved by this processor. Will be shown in the "data sets" interface. When appending to or replacing datasets, this configuration can also be used to reference an existing dataset by its name rather than selecting it in the "Directly select Dataset" configuration. When appending to or replacing a dataset with both this configuration and "Directly select Dataset" configured, the value of this configuration will be ignored. Instead, the dataset selected in "Directly select Dataset" will be used.
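
The precedence between "Directly select Dataset" and "Name of Dataset" can be summarised as a small decision function. This is a sketch under assumptions: the function name `resolve_target`, the save procedure labels `"new"`, `"append"` and `"replace"`, and the error raised when neither configuration is set are hypothetical and only illustrate the rules described above.

```python
def resolve_target(save_procedure, direct_selection=None, name=None):
    """Return a (kind, value) pair describing which dataset is written."""
    if save_procedure == "new":
        # The direct selection has no effect for new datasets; the name is mandatory.
        if not name:
            raise ValueError('"Name of Dataset" is mandatory for new datasets')
        return ("create", name)

    if save_procedure in ("append", "replace"):
        # If both are configured, the direct selection wins and the name is ignored.
        if direct_selection is not None:
            return ("by_selection", direct_selection)
        if name:
            return ("by_name", name)
        # Assumption: with neither configuration set there is nothing to resolve.
        raise ValueError("no existing dataset referenced")

    raise ValueError(f"unknown save procedure: {save_procedure!r}")
```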


Save procedure * [single enum selection]

Indicates whether to save input as an independent new data set, append it to an existing data set or replace an existing data set.


Count Rows * [boolean]

Trigger a row count prior to saving the data set. This may take a while for large data sets.


Repartition data * [boolean]

Reduce output fragmentation to a minimum. This can cause problems after complex joins, so if you experience performance drops or errors, try disabling this option. If enabled, the row count will always be computed, regardless of the related "Count Rows" configuration.


Dataset Save (Deprecated!)

UUID: 00000000-0000-0000-0005-000000000003
Deprecated: Please use the Data Table Save processor instead.
Replaced by: Data Table Save
Removed: false

Description

Saves input Datasets in different file formats.


Input(s)

  • in.data - Input


Configurations

Dataset Save Configuration * [dataset save]

Configuration value for a ConfigDatasetSave element. Contains the save procedure type and the reference to a dataset, either by name or by ID.


File Format * [single enum selection]

Choose one of the file format options. Read and filtering performance vary depending on the format chosen. Note that replacing a dataset also allows you to switch to a new File Format.


Compute amount of rows written * [boolean]

Computes the number of rows written to the dataset. This has the negative side effect of hiding the visualization of the SQL execution in the Spark UI.


Compute content length (experimental) * [boolean]

Experimental feature to compute the approximate number of tokens in the saved dataset. Can hurt performance and may not be reliable! This also requires "Compute amount of rows written" to be enabled!


Enforce Schema match * [boolean]

This toggle is a safety measure. If it is enabled, an input whose schema differs from that of the dataset selected for appending or replacing results in an error, and the original dataset stays untouched. If it is disabled, appended inputs are cut off or augmented to match the schema of the selected dataset: missing columns are filled with null values and superfluous columns are dropped. When a dataset is replaced while this toggle is disabled, it is replaced even if the replacement has a different schema (column names/types/count). Note that this kind of overwrite can render other workflows useless if they use the saved dataset and expect a specific structure. Whenever the schema does not match while the safety measure is disabled, a warning is generated.
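
The append behavior described above can be sketched as follows. This is a simplified illustration, not the processor's code: the function name `align_to_schema` is hypothetical, rows are modelled as dictionaries, and only column names are compared (column types are left out for brevity).

```python
import warnings


def align_to_schema(rows, target_columns, enforce_schema_match=True):
    """Align input rows to the columns of the dataset that is appended to."""
    target = list(target_columns)
    input_columns = set().union(*(row.keys() for row in rows))

    if input_columns != set(target):
        message = (
            f"schema mismatch: input has {sorted(input_columns)}, "
            f"dataset expects {sorted(target)}"
        )
        if enforce_schema_match:
            # Safety measure enabled: fail and leave the dataset untouched.
            raise ValueError(message)
        # Safety measure disabled: continue, but report the mismatch.
        warnings.warn(message)

    # Missing columns become None (null); superfluous columns are dropped.
    return [{col: row.get(col) for col in target} for row in rows]
```

With the toggle disabled, a row `{"id": 1, "extra": "x"}` appended to a dataset with the columns `id` and `name` loses `extra` and gains a null `name`.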


For more details refer to the following article


Filterable Result Table (Deprecated!)

UUID: 00000000-0000-0000-0014-000000000002
Deprecated: Please use the Result Table processor instead.
Removed: true

Description

Stores the input to a queryable storage for post-execution filtering and sorting.


Input(s)

  • in.toStore - Data to store


Configurations

Name [string]

The name which is used for storing and later on finding the stored dataset. If not set, it will default to something similar to "Filterable Result frtId" where frtId is the UUID of the stored dataset. Beware that only datasets with custom names appear in the datasets overview!


Data Location (Match by direct selection) [data selection]

Please directly select the FRT data set you want to replace! Only one way of selecting a dataset can be used. If both selections have values selected, an error will be generated.


Save procedure * [single enum selection]

Indicates whether to save the input as an independent new FRT or replace an existing one. If you choose replace, be aware that it also affects FRTs inside old jobs (if you ignore schema mismatches, FRTs in old jobs may not work at all)! If you select any of the append or replace modes, a new FRT will be created if no dataset is specified or if the specified one does not exist yet.


Manually set keys [composed]

Use these Configurations to manually define primary and partition keys for the data stored in KUDU. In addition to the selected column(s), ONE DATA will generate an extra ID column to make sure there is at least one unique column, which allows the FRT to be editable. If this configuration is disabled, the generated ID column is used as both primary and partition key. If this configuration is enabled, the settings will be validated and applied if possible. When appending to an existing FRT, these settings will have no effect.


MANUALLY SET KEYS > PRIMARY KEY [MULTIPLE COLUMNS SELECTION]

Select one or more columns that form the Primary Key of the stored table.


MANUALLY SET KEYS > PARTITIONING KEY [MULTIPLE COLUMNS SELECTION]

Select a subset of the chosen primary key's columns to use as partitioning columns. The data stored will be clustered into chunks according to the key range arising from the selected columns. Leave empty to use the whole primary key for partitioning. The partitioning has a major impact on the query performance of the generated result. Partitioning by the columns most frequently used for filtering the generated result is recommended.
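
The relationship between the two key selections can be expressed as a short validation routine. This is a sketch for illustration; the function name `resolve_partition_key` is hypothetical and the checks are an assumption about what validating the settings involves.

```python
def resolve_partition_key(primary_key, partition_key=()):
    """Return the partitioning columns for a given primary key selection."""
    primary = list(primary_key)
    if not primary:
        raise ValueError("select at least one primary key column")

    partition = list(partition_key)
    if not partition:
        # Left empty: the whole primary key is used for partitioning.
        return primary

    # The partitioning columns must be a subset of the primary key columns.
    unknown = [col for col in partition if col not in primary]
    if unknown:
        raise ValueError(f"not part of the primary key: {unknown}")
    return partition
```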


Generate Summaries [composed]

If enabled, ONE DATA will conduct a top-k distinct values analysis for all columns. There is also an option to compute numeric summaries for all numeric columns. Disable this to speed up processing (especially when appending to datasets, where processing time increases significantly). If disabled, the result will not contain preselection values for filters to be applied to columns. Warning: Enabling this may cause leakage of restricted data in the context of Analysis Authorization.


GENERATE SUMMARIES > AMOUNT OF DISTINCT VALUES * [INTEGER]

Determines how many top-k values will be extracted for each column. These values can later be used to select filter values. In addition to the preselectable values, a manual filter specification can be used.
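
A top-k distinct values analysis can be pictured as counting the values of each column and keeping the k most frequent ones. The sketch below assumes that "top-k" refers to frequency; the function name `top_k_values` is hypothetical and the code does not reflect how ONE DATA computes its summaries.

```python
from collections import Counter


def top_k_values(rows, k):
    """Return the k most frequent values of every column."""
    counts_per_column = {}
    for row in rows:
        for column, value in row.items():
            counts_per_column.setdefault(column, Counter())[value] += 1

    # most_common(k) lists values by descending frequency.
    return {
        column: [value for value, _ in counts.most_common(k)]
        for column, counts in counts_per_column.items()
    }
```

The returned values correspond to the preselectable filter values mentioned above; values outside the top k would still be reachable through a manual filter specification.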


GENERATE SUMMARIES > ENABLE NUMERIC SUMMARIES * [BOOLEAN]

If enabled, ONE DATA will also compute numeric summaries (min, max, mean, median, etc.) for suitable columns.


Enable Caching * [boolean]

Triggers pre-execution caching of the input. If enabled, drastically speeds up execution time. Only disable when the input is already cached or should not be cached due to its huge size. It is highly recommended to leave caching enabled!


Enable Fast Random Sampling * [boolean]

Adds an artificial column to enable fast random sampling on the resulting Filterable Result Table. This has only a minimal impact on performance at creation time but makes random sampling faster.


Generate Samples [composed]

Generates Samples while running the workflow. Warning: Enabling this may cause leakage of restricted data in the context of Analysis Authorization because a random sample of the data is stored within the result of the processor. Please also note that the result of the processor is not subject to Analysis Authorization.


GENERATE SAMPLES > NUMBER OF RANDOM SAMPLES * [INTEGER]

Number of random samples to use for fast random sampling.


Storage Type * [fixed values selection]

The storage type to use when storing and loading data frames.


Compression Hint * [single enum selection]

A hint to the processor as to which compression method should be used to store data. Depending on how the data is stored, this will be interpreted differently or even be ignored. Defaults to the compression method that is configured as the default on this ONE DATA instance for the respective data store.


For more details refer to the following article