Data Table Load

UUID: 00000000-0000-0000-0001-000000000004


Description

Loads a data set into the workflow. The user has to specify the data location.


For more details refer to the following article.


Output(s)

  • out.data - Loaded Data Table
  • out.invalidRows - Invalid Rows

Configurations


Data Table Selection (Match by name or direct selection) [data selection][string]

Directly select a data table you want to load or enter its name.


Parser Mode on Invalid Rows * [single enum selection]

Determines how invalid lines will be treated.



Cassandra Load

UUID: 00000000-0000-0000-0001-000000000003


Description

Loads a dataset from a keyspace of an Apache Cassandra database. The loaded data can be preprocessed using a custom SQL query.


For more details refer to the following article.


Output(s)

  • out.output - Output

Configurations

Table * [string]

The table that should be loaded.


Keyspace * [string]

The keyspace of the table that should be loaded.


Query [string]

An optional query to preprocess the loaded data. For example, multiple values in List/Set/Map columns can be exploded into several rows, and the amount of data can be limited with WHERE clauses. In principle, any SQL that works in the Query processor should also work here.


Custom Input Table

UUID: 00000000-0000-0000-0024-000000000001

Description

This processor has a user interface that enables the user to bring self-defined tables into a workflow. These tables are treated like datasets but are not persisted outside the workflow.


For more details refer to the following article.


Output(s)

  • out.output - Table

Configurations

Table Input * [table input]

Adds the user input as a transient dataset to the workflow.


Data Type Recommendation

UUID: 00000000-0000-0000-0051-000000000002

Description

Recommends possible types for each column in a Data Set.


For more details refer to the following article.


Configurations

Data Set * [data selection]

Select a Data Set to determine possible types for.


Convert empty cells to NULL values * [boolean]

Indicates whether or not empty cells will be converted to NULL values.


Allow NULL values * [boolean]

Indicates whether or not there can be null values in a column. If set to TRUE, null values do not influence the recommendations.
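The NULL-handling option above influences the recommendation roughly as follows. This is an illustrative sketch of the idea only, not ONE DATA's actual implementation; the function name and the try-int-then-double order are assumptions:

```python
# Illustrative sketch (NOT ONE DATA's implementation): recommend a
# representation type for a column from its sample values. When allow_null
# is True, None entries are ignored; when False, any None rules out every
# candidate type except STRING.
def recommend_type(values, allow_null=True):
    non_null = [v for v in values if v is not None]
    if not allow_null and len(non_null) < len(values):
        return "STRING"

    def all_parse(parser):
        # True if every non-null sample value can be parsed by `parser`.
        try:
            for v in non_null:
                parser(v)
            return True
        except (ValueError, TypeError):
            return False

    if all_parse(int):
        return "INT"
    if all_parse(float):
        return "DOUBLE"
    return "STRING"
```

Under this sketch, `["1", "2", None]` would be recommended as INT when nulls are allowed, but as STRING when they are not.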


Sample Size [integer]

Defines how many observations should be drawn. If the number of rows is less than the specified integer value, the complete data set is used for analysis.


Top-K Sampling * [boolean]

When a Sample Size is set, this toggles between the Top-K and Random sampling techniques. When enabled, the first rows of the dataset will be used for the recommendation; when disabled, a random sample will be drawn to recommend the types. Please note that random sampling on large datasets has a longer execution time, because the whole dataset has to be loaded for random sampling.
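The difference between the two strategies can be sketched as follows (a simplified illustration of the assumed behavior, not the processor's code):

```python
import random

# Top-K takes the first k rows and can stop reading early; random sampling
# has to materialize the whole dataset first, which is why it is slower on
# large datasets.
def sample(rows, k, top_k=True, seed=None):
    rows = list(rows)
    if k >= len(rows):
        return rows  # fewer rows than the sample size: use the complete data
    if top_k:
        return rows[:k]
    return random.Random(seed).sample(rows, k)
```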


Additional Timestamp Masks [composed]

Define one or more additional timestamp formats to try when parsing cells to timestamps. The format strings have to comply with the specification of SimpleDateFormat.

For more details, check the following link.

ADDITIONAL TIMESTAMP MASKS > VALUE * [STRING]

Value of an additional Timestamp Mask to add.
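To illustrate how additional masks take effect: each candidate mask is tried in order until one matches. ONE DATA masks follow Java's SimpleDateFormat; this sketch uses rough Python strptime equivalents, and the default mask shown is an assumption:

```python
from datetime import datetime

# Assumed default mask, for illustration only.
DEFAULT_MASKS = ["%Y-%m-%d %H:%M:%S"]

def parse_timestamp(cell, additional_masks=()):
    # Additional masks are tried first, then the defaults.
    for mask in list(additional_masks) + DEFAULT_MASKS:
        try:
            return datetime.strptime(cell, mask)
        except ValueError:
            pass
    return None  # the cell cannot be parsed as a timestamp
```

With an additional mask corresponding to `dd.MM.yyyy`, a cell like `31.12.2020` becomes parseable that would otherwise be rejected.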


Database Connection Load

UUID: 00000000-0000-0000-1140-000000000001


Description

Loads a table of a Database Connection into the workflow. The user has to specify the data location.


For more details refer to the following article.


Output(s)

  • out.data - Loaded Data Table


Configurations

Connection * [selectConnection]

A Database Connection to load a table from.

SQL Select Statement * [sql]

A SQL SELECT statement to be executed on the remote relational database.


Auto Cast to String * [boolean]

If true, all returned columns will be automatically cast to STRING. If false, no auto-casting will be applied, so the query might need to contain suitable casts.


Additional Properties [composed]

Define a set of additional key-value pairs to hand over to a DB connection. Has no effect when a non-DB-backed Data Table is used. Note that username and password will already be present implicitly. Property values are stored in clear text; mind this when storing additional credentials or other sensitive information.

Flexible ETL Load

UUID: 00000000-0000-0000-0001-000000000002

 

Description

The processor allows writing a single SELECT query against a database in the corresponding query dialect (Oracle, DB2, …). Joins across the whole database, e.g. over multiple tables, are possible. It is also possible to use views and synonyms.


Example [in Oracle dialect]:

select a.column_a, b.column_b from schema.table_A a left join schema.table_B b on a.column_a = b.column_b


Attention: 

  • The Flexible ETL processor never casts datatypes of read columns on its own. There are two ways to deal with the datatypes:
    • The configuration of the processor offers a toggle "Auto Cast to String". If toggled on, every column is automatically parsed as String.
    • If the toggle is set to false, datatypes have to be handled in the written query in order to make them usable by ONE DATA afterwards. Please note that you need to use datatypes that are recognized by ONE DATA; see the table below.
      Here is an example of how to cast one of your columns to another datatype:
      CAST(colName AS TYPE) colName

      If the cast datatypes are incorrect and can therefore not be used by ONE DATA directly, an exception and an error message in the processor indicate this circumstance.

  • Since there are many distinct "DateTime" types within different databases, all timestamps are automatically cast to Strings.


In particular, ONE DATA handles only these five types:


ONE DATA Representation Type | ONE DATA Datatype (Java Types) | Spark Datatype
INT                          | Long (64 Bit)                  | org.apache.spark.sql.types.DataTypes.LongType
NUMERIC                      | Double (64 Bit)                | org.apache.spark.sql.types.DataTypes.DoubleType
DOUBLE                       | Double (64 Bit)                | org.apache.spark.sql.types.DataTypes.DoubleType
DateTime                     | java.sql.Timestamp (ms)        | org.apache.spark.sql.types.DataTypes.TimestampType
String                       | String                         | org.apache.spark.sql.types.DataTypes.StringType


When using the Flexible ETL Load processor, ONE DATA uses a JDBC driver for the connection, resulting in the following transformation matrix:


Real Type / Is recognized as | JDBC Types (JDBCType.*)                      | ONE DATA Type  | Possible Exceptions
STRING                       | VARCHAR, NVARCHAR, CHAR, NCHAR, LONGVARCHAR, | STRING         | -
                             | LONGNVARCHAR, BLOB, CLOB, NCLOB              |                |
BOOLEAN                      | BOOLEAN, BIT                                 | -              | Exception in processing: IllegalArgumentException: Type BooleanType could not be mapped
NUMERIC                      | NUMERIC, DECIMAL                             | NUMERIC/DOUBLE | -
INTEGER                      | INTEGER, TINYINT, SMALLINT                   | INTEGER        | -
BIGINT                       | BIGINT                                       | INTEGER        | -
DOUBLE                       | DOUBLE, REAL, FLOAT                          | DOUBLE         | -
TIMESTAMP                    | TIMESTAMP                                    | DATETIME       | -
TIME                         | TIME                                         | DATETIME       | -
DATE                         | DATE                                         | DATETIME       | -
Everything else              | -                                            | -              | Exception

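The transformation matrix can be expressed as a simple lookup; this is purely illustrative (the processor's real code is not shown here), but the mapping itself follows the matrix:

```python
# JDBC type name -> ONE DATA type, per the transformation matrix.
# BOOLEAN/BIT and unknown types raise, mirroring the "Possible Exceptions"
# column.
JDBC_TO_ONE_DATA = {
    **dict.fromkeys(["VARCHAR", "NVARCHAR", "CHAR", "NCHAR", "LONGVARCHAR",
                     "LONGNVARCHAR", "BLOB", "CLOB", "NCLOB"], "STRING"),
    **dict.fromkeys(["NUMERIC", "DECIMAL"], "NUMERIC/DOUBLE"),
    **dict.fromkeys(["INTEGER", "TINYINT", "SMALLINT", "BIGINT"], "INTEGER"),
    **dict.fromkeys(["DOUBLE", "REAL", "FLOAT"], "DOUBLE"),
    **dict.fromkeys(["TIMESTAMP", "TIME", "DATE"], "DATETIME"),
}

def one_data_type(jdbc_type):
    if jdbc_type in ("BOOLEAN", "BIT"):
        raise ValueError("Type BooleanType could not be mapped")
    try:
        return JDBC_TO_ONE_DATA[jdbc_type]
    except KeyError:
        raise ValueError("Unsupported JDBC type: " + jdbc_type)
```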

For more details refer to the following article.


Output(s)

  • out.data - Loaded data set


Configurations

Data Location * [data selection]

Please select a data set you want to load!

 

Table Name or custom SQL * [string]

Type in a table name or use a custom SQL statement to be executed on the remote Relational Database.

 

Auto Cast to String * [boolean]

If true, all returned columns will be automatically cast to STRING. If false, no auto-casting will be applied, hence the query might need to contain suitable castings.

 

Schema [string]

Select schema


Commonly known Issues

Be aware that there are database-specific differences in writing a query, which are known to cause common difficulties. Among those are the following examples:

  • Escaping using "…":
    • PostgreSQL: be aware to escape properly, e.g. column or table names
    • CamelCase names need to be escaped
    • MySQL: use `…` instead for the escaping
    • Escape when a column/table name is equal to a keyword
  • Renaming of a selection:
    • Oracle: SELECT * FROM (SELECT * FROM TableName) FOO -- Oracle does not recognize the keyword AS at this location
    • other DBs: SELECT * FROM (SELECT * FROM TableName) AS FOO
  • Limiting the number of resulting rows:
    • Oracle: SELECT * FROM TableName WHERE ROWNUM = 1 -- Oracle does not support the keyword LIMIT at this location (this might be fixed in newer versions)
    • other DBs: SELECT * FROM TableName LIMIT 1
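The dialect differences above can be captured as small query builders; table and alias names here are illustrative, and quoting rules vary further per database:

```python
# Sketch only: builds dialect-aware variants of the two query patterns that
# commonly trip people up.
def limit_query(table, n, dialect):
    if dialect == "oracle":
        # Older Oracle versions reject LIMIT; ROWNUM is used instead.
        return "SELECT * FROM {} WHERE ROWNUM <= {}".format(table, n)
    return "SELECT * FROM {} LIMIT {}".format(table, n)

def rename_subquery(table, alias, dialect):
    # Oracle rejects AS when aliasing a derived table; most other DBs accept it.
    as_kw = "" if dialect == "oracle" else "AS "
    return "SELECT * FROM (SELECT * FROM {}) {}{}".format(table, as_kw, alias)
```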


Note: Many error messages in combination with this processor do not originate from ONE DATA or the processor itself, but are rather related to the database in use. It is therefore recommended to check the provided error codes against the target database's documentation.

Flexible REST API

UUID: 00000000-0000-0000-1139-000000000001

Description

Uses REST connections to execute REST API GET or POST requests (access to the network where the REST API resides is necessary; for distributed computation, the cluster nodes need that access as well). A path through the retrieved JSON can be provided to further specify the content of the created rows. All possible JsonPath expressions are described in this link and a JsonPath expression tester can be found in the following link.


For more details refer to the following article.


Input(s)

  • in.input - Input with urls

Output(s)

  • out.output - Requests with success status
  • out.errorOutput - Requests with a non-success status or other errors that occurred

Configurations

Connection * [connection selection]

A REST Connection with predefined authentication and base url


Authentication [key selection]

This key will be used instead of the key defined in the connection to provide access to resources. If credentials are given in the header section, the referenced authentication is ignored.


URL Addition * [single column selection]

The URL segment that should be added to the base url. When querying Google Analytics this could e.g. be base URL: https://www.googleapis.com/analytics/v3/ and url addition: management/segments


Request Method * [single enum selection]

Choose the method of the request


Additional header parameters [composed]

Parameters that need to be added, e.g. for authentication. For Google Analytics this would look like "Authorization: Bearer randomOAuth2LoginToken".


ADDITIONAL HEADER PARAMETERS > HTTP HEADER KEY * [STRING]

The key of a header key-value pair. This can be used e.g. for adding authorization information.


ADDITIONAL HEADER PARAMETERS > HTTP HEADER VALUE * [STRING]

The value of a header key-value pair. This can be used e.g. for adding authorization information.


Do Rest-API Queries distributed * [boolean]

With this option you can change the way the data is processed. By default, every given URL is processed one after the other on the master node of this ONE DATA instance. By enabling distributed computation, the processing is done on many nodes of the ONE DATA cluster in parallel. While the latter option is faster and consumes considerably less memory on the master, there are many circumstances that can prevent this processor from working in a distributed manner, e.g. the nodes do not have a connection to the network where the REST API resides, or some configuration regarding SSL certificates was only added on the master node.


Delay (in milliseconds) between Rest API Calls * [integer]

If operated in non-distributed mode, this processor waits for the given delay in milliseconds between the Rest API calls.


Request Body [string]

The body of the REST request. May contain "links" to columns, such that the value of the column is inserted at the specified location instead of the placeholder. One request is sent for each row in the input dataset if placeholders are used; otherwise only one request is sent. Column references use the following syntax: ##columnValueFrom##<columnName>## (where <columnName> is replaced by the actual column name, without the angle brackets).
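The placeholder substitution can be sketched as follows; this illustrates the assumed semantics of the syntax described above, not the processor's actual implementation:

```python
import re

# Each token of the form ##columnValueFrom##columnName## in the body template
# is replaced by that column's value from the current input row, producing one
# request body per row.
PLACEHOLDER = re.compile(r"##columnValueFrom##(\w+)##")

def render_body(template, row):
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)
```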


Kafka Consumer

UUID: 00000000-0000-0000-0254-000000000001


Description

This processor can consume data from a Kafka topic. For that purpose it blocks the execution and waits for the specified time interval. All messages that arrive within this time (and also messages that arrived at the Kafka server before but had not yet been read) will be part of the output dataset.
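The consumption semantics can be sketched without a real Kafka client; here a queue stands in for the topic, and the function name is an illustration of the blocking behavior described above, not the processor's code:

```python
import queue
import time

# Block for the configured streaming time and collect every message that
# arrives within it, plus anything already buffered before the call.
def consume_for(source, streaming_time_ms):
    deadline = time.monotonic() + streaming_time_ms / 1000.0
    messages = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            messages.append(source.get(timeout=remaining))
        except queue.Empty:
            break  # the wait timed out: the interval is over
    return messages
```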



Output(s)

  • out.message - Messages


Configurations

Streaming Time in Milliseconds * [integer]

Select the time in milliseconds to consume from the Kafka topic.


Kafka Host IP * [string]

The IP of the Kafka server, including the port (usually 9092), e.g. 192.168.1.2:9092.


Topic * [column name]

The topic that should be consumed from the Kafka server.


Microservice Input

UUID: 00000000-0000-0000-0097-000000000001

Description

Processor used to import data for exposed workflows that serve as microservices. Use the input table for testing purposes.


For more details refer to the following article.


Output(s)

  • out.inputData - Microservice payload or test data

Configurations

Identifier * [string]

Identifier used in the calls to the microservice to identify this processor. Must be unique across all Microservice Input processors in one workflow.


Ignore Schema Mismatch * [boolean]

Setting this option to true will allow the workflow to run even if the columns defined in this processor do not match the columns sent by the microservice call.


Test Input * [table input]

This content will be replaced on microservice calls to the workflow. Use it for testing your workflow.


Multi URL API Load

UUID: 00000000-0000-0000-0028-000000000003

Description

Loads data from REST APIs that can be targeted with GET requests (access to the network where the REST API resides is necessary; for distributed computation, the cluster nodes need that access as well). A path through the retrieved JSON can be provided to further specify the content of the created rows. All possible JsonPath expressions are described in the following link and a JsonPath expression tester can be found in the following link.


For more details refer to the following article.


Input(s)

  • in.input - Input with urls

Output(s)

  • out.output - Parsed Json
  • out.errorOutput - Urls that could not be parsed or other errors occurred

Configurations

The URL that should be used as a query * [single column selection]

The URL that should be used as a query. When querying Google Analytics this could e.g. be the following link


Should the URL be added as an output column * [boolean]

Adds better traceability of where values come from, but leads to data duplication.


Additional header parameters [composed]

Parameters that need to be added, e.g. for authentication. For Google Analytics this would look like "Authorization: Bearer randomOAuth2LoginToken".


ADDITIONAL HEADER PARAMETERS > HTTP HEADER KEY * [STRING]

The key of a header key-value pair. This can be used e.g. for adding authorization information.


ADDITIONAL HEADER PARAMETERS > HTTP HEADER VALUE * [STRING]

The value of a header key-value pair. This can be used e.g. for adding authorization information.


Authentication [key selection]

The authentication will be used to provide access to resources. If credentials are given in the header section, the referenced authentication is ignored.


JsonPath to the rows * [string]

The JsonPath whose associated values should be used as data source. A JsonPath can e.g. be "$.importantValuesKey[*]", where $ stands for the root of the given JSON and importantValuesKey is the key in the top-level dictionary from which we want to retrieve the content (an array that should be converted into rows). JsonPath can do much more, e.g. filter data based on given values or select only a certain subset of array elements with [startIndex:endIndex]. A full documentation of all possibilities can be found in the following link and a JsonPath expression tester can be found in the following link.
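The common "$.importantValuesKey[*]" case can be illustrated with the standard library alone; real JsonPath (as used by the processor) supports far more, and this sketch deliberately handles only that one shape:

```python
import re

# Minimal illustration: extract the array of row objects addressed by a path
# of the exact form "$.key[*]". Anything else is out of scope for this sketch.
def rows_at(doc, json_path):
    m = re.fullmatch(r"\$\.(\w+)\[\*\]", json_path)
    if not m:
        raise ValueError("this sketch only handles $.key[*] paths")
    return doc[m.group(1)]
```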


JsonPath to the Column Names [string]

In some cases proper column names cannot easily be derived from the given JsonPath for the content. You can then specify a different JsonPath here, which will be used to find the correct column names. If the number of found column names does not match the number of columns in the computed rows, all columns are named in the format COLUMN_0, COLUMN_1, ... as a fallback.
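The fallback rule can be sketched as a small helper (illustrative only; the function name is an assumption):

```python
# Use the names found via the JsonPath when their count matches the column
# count; otherwise fall back to COLUMN_0, COLUMN_1, ...
def resolve_column_names(found_names, column_count):
    if found_names and len(found_names) == column_count:
        return list(found_names)
    return ["COLUMN_{}".format(i) for i in range(column_count)]
```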


Default Values for Columns [composed]

With this option, certain column names can be given that will definitely appear in the Parsed Json output. The default value is used in cases where no value can be found in the input or, if specified, also in case an invalid value is found that has to be replaced with something valid.


DEFAULT VALUES FOR COLUMNS > COLUMN NAME AND TYPE [MANUAL COLUMN SPECIFICATION]

The column name that should definitely appear in the output with its respective Scale and Representation Type.


DEFAULT VALUES FOR COLUMNS > DEFAULT VALUE * [STRING]

The default value for the column that should definitely appear in the output. It has to be a valid value for the specified Representation type.


DEFAULT VALUES FOR COLUMNS > ABORT JSON CONVERSION UPON FINDING INVALID VALUES * [BOOLEAN]

If this option is enabled, the whole transformation of (one of) the given URLs is stopped and an error is added to the "Urls that could not be parsed or other errors occurred" output. If this option is disabled, invalid values are replaced with the given default values.


Do Rest-API Queries distributed * [boolean]

With this option you can change the way the data is processed. By default, every given URL is processed one after the other on the master node of this ONE DATA instance. By enabling distributed computation, the processing is done on many nodes of the ONE DATA cluster in parallel. While the latter option is faster and consumes considerably less memory on the master, there are many circumstances that can prevent this processor from working in a distributed manner, e.g. the nodes do not have a connection to the network where the REST API resides, or some configuration regarding SSL certificates was only added on the master node.


Delay (in milliseconds) between Rest API Calls * [integer]

If operated in non-distributed mode, this processor waits for the given delay in milliseconds between the Rest API calls.


All columns as STRING and allow URLs with different columns. * [boolean]

All columns will be of type STRING. Missing values or nulls will be null in the output. Any column that is present in at least one URL will be present in the output. The column order will not correspond to the order in the loaded JSON; instead, columns appear in lexicographic order. Must not be enabled together with default values. Note that even with this option, cases exist in which JSON data cannot be transformed into a tabular representation successfully.
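The flattening behavior described above can be sketched as follows (an illustration of the documented rules, not the processor's code):

```python
# Every key seen in any record becomes a column, values are stringified,
# missing values stay None, and columns are ordered lexicographically.
def flatten_records(records):
    columns = sorted({key for record in records for key in record})
    rows = [[None if record.get(col) is None else str(record.get(col))
             for col in columns]
            for record in records]
    return columns, rows
```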


Random Number Generator

UUID: 00000000-0000-0000-0023-000000000001

Description

Generates random samples in one or more columns.


For more details refer to the following article.


Output(s)

  • out.randomData - Randomly generated dataset

Configurations

Partitions * [integer]

Number of partitions to generate. Each partition holds as many rows as specified in "Rows per partition". Increasing this value also increases the level of parallelism of the data generation.


Rows per partition * [integer]

Number of rows that should be generated per partition (range: [1 .. 1,000,000]). Consider using more partitions when you want to generate more rows.


Random Generators * [random generator]

Adds one or more random value columns with the given distributions.
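The output size follows from the two settings above: the generated dataset has partitions × rows-per-partition rows, with one column per configured generator. A sketch of that relationship (generator names and distributions are illustrative, and the real processor generates partitions in parallel):

```python
import random

def generate(partitions, rows_per_partition, generators, seed=0):
    # One column per named generator; total rows = partitions * rows_per_partition.
    rng = random.Random(seed)
    total = partitions * rows_per_partition
    return [{name: gen(rng) for name, gen in generators.items()}
            for _ in range(total)]
```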


API Load (Deprecated!)

UUID: 00000000-0000-0000-0028-000000000001
Deprecated: Please use the Multi URL API Load processor instead.
Replaced by: Multi URL API Load
Removed: true

Description

Loads data from REST APIs that can be targeted with GET requests.


Output(s)

  • out.output1 - Output for config elements relating to output 1
  • out.output2 - Output for config elements relating to output 2

Configurations

The URL that should be used as a query * [string]

The URL that should be used as a query. When querying Google Analytics this could e.g. be the following link.


Additional header parameters [string]

Parameters that need to be added, e.g. for authentication. For Google Analytics this would look like "Authorization: Bearer randomOAuth2LoginToken".


Only use values associated to the given Json key (optional) at Output 1 * [string]

The (top-level) JSON key whose associated values should be used as data source. When left empty, all top-level JSON keys with a single value (no wrapped arrays / JSON objects) are output. If something deeply nested should be used, a path in the form "path/to/top-level" has to be given.


A path through the retrieved json object for finding column headers at Output 1 [string]

In some APIs that can be queried with this processor, the column header information is not available directly in the columns (e.g. as a dictionary); instead the column values are separated into an array, as are the column names. In order to still have meaningful names, you can specify a path through the JSON object here. The path should have the form "/key1/key2/key3/*/key4", where the keys are field accesses of dictionaries and the wildcard indicates the array over which should be iterated to get the column headers (they might be in another dictionary, which is here referred to as "key4"). If accessing a single array element is necessary (i.e. if all column names are in a nested array of another array), this can be done with "[index]" (the indices are zero-based).
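An illustrative parser for this path form (not the processor's actual code; it implements the three segment kinds described above):

```python
import re

# "/key1/key2/*/key4": plain segments access dictionary fields, "*" iterates
# an array and applies the remaining path to each element, "[i]" indexes one
# array element (zero-based).
def column_headers(doc, path):
    def walk(node, parts):
        if not parts:
            return [node]
        head, rest = parts[0], parts[1:]
        if head == "*":
            return [h for element in node for h in walk(element, rest)]
        index = re.fullmatch(r"\[(\d+)\]", head)
        if index:
            return walk(node[int(index.group(1))], rest)
        return walk(node[head], rest)
    return walk(doc, [p for p in path.split("/") if p])
```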


Only use values associated to the given Json key (optional) at Output 2 [string]

The (top-level) JSON key whose associated values should be used as data source. When not configured, the second output will be an empty dataset. If something deeply nested should be used, a path in the form "path/to/top-level" has to be given.


Improved API Load (Deprecated!)

UUID: 00000000-0000-0000-0028-000000000002
Deprecated: Please use the Multi URL API Load processor instead.
Replaced by: Multi URL API Load
Removed: true

Description

Loads data from REST APIs that can be targeted with GET requests. A path through the retrieved Json can be provided to further specify the content of the created rows. All possible JsonPath expressions are described in the following link and a JsonPath expression tester can be found in the following link.



Output(s)

  • out.output - Output for config elements relating to output 1


Configurations

The URL that should be used as a query * [string]

The URL that should be used as a query. When querying Google Analytics this could e.g. be the following link


Additional header parameters [composed]

Parameters that need to be added, e.g. for authentication. For Google Analytics this would look like "Authorization: Bearer randomOAuth2LoginToken".


ADDITIONAL HEADER PARAMETERS > HTTP HEADER KEY * [STRING]

The key of a header key-value pair. This can be used e.g. for adding authorization information.


ADDITIONAL HEADER PARAMETERS > HTTP HEADER VALUE * [STRING]

The value of a header key-value pair. This can be used e.g. for adding authorization information.


JsonPath to the rows * [string]

The JsonPath whose associated values should be used as data source. A JsonPath can e.g. be "$.importantValuesKey[*]", where $ stands for the root of the given JSON and importantValuesKey is the key in the top-level dictionary from which we want to retrieve the content (an array that should be converted into rows). JsonPath can do much more, e.g. filter data based on given values or select only a certain subset of array elements with [startIndex:endIndex]. A full documentation of all possibilities can be found in the following link and a JsonPath expression tester can be found in the following link.


JsonPath to the Column Names [string]

In some cases proper column names cannot easily be derived from the given JsonPath for the content. You can then specify a different JsonPath here, which will be used to find the correct column names. If the number of found column names does not match the number of columns in the computed rows, all columns are named in the format COLUMN_0, COLUMN_1, ... as a fallback.


Default Values For Columns [composed]

With this option, certain column names can be given that will definitely appear in the "Output for config elements relating to output 1" output. The default value is used in cases where no value can be found in the input or, if specified, also in case an invalid value is found that has to be replaced with something valid.


DEFAULT VALUES FOR COLUMNS > COLUMN FOR DEFAULT VALUE [MANUAL COLUMN SPECIFICATION]

The column name that should definitely appear in the output with its respective Scale and Representation Type.


DEFAULT VALUES FOR COLUMNS > DEFAULT VALUE * [STRING]

The default value for the column that should definitely appear in the output. It has to be a valid value for the specified Representation type.


DEFAULT VALUES FOR COLUMNS > ABORT JSON CONVERSION UPON FINDING INVALID VALUES * [BOOLEAN]

If this option is enabled, the whole transformation of (one of) the given URLs is stopped and an error is created.


Dataset Load (Deprecated!)

UUID: 00000000-0000-0000-0001-000000000001
Deprecated: Please use the Data Table Load processor instead.
Replaced by: Data Table load
Removed: true


Description

Loads a data set into the workflow. The user has to specify the data location. Timestamp formats are currently limited to: "yyyy-[m]m-[d]d hh:mm:ss[.f...]"


For more details refer to the following article.


Output(s)

  • out.data - Loaded data set
  • out.invalidRows - Invalid Rows


Configurations

Name of Dataset to load (Match by name) [string]

Will resolve a Dataset by its name instead of selecting it in the configuration below. Only one way of selecting a dataset can be used; if both selections have values, this configuration will be ignored.


Data Location (Match by direct selection) [data selection]

Please directly select a data set you want to load. Only one way of selecting a dataset can be used; if both selections have values, the value of this configuration is used.


Sanitize Loaded data * [boolean]

Please select if loaded data should be sanitized with default values.


Parser Mode * [single enum selection]

Determines how invalid lines will be treated.


Warn on invalid rows * [boolean]

When activated, a warning will be added if the load processor encounters invalid rows. Disable this feature to speed up execution times.


Relational Database Load (Deprecated!)

UUID: 00000000-0000-0000-0041-000000000001
Deprecated: Please use the Data Table Load processor instead.
Replaced by: Data Table Load
Removed: true

Description

Loads a table from a relational database into the workflow.

Output(s)

  • out.data - data

Configurations

Table Name * [string]

Select table


Password * [string]

Select your password


Username * [string]

Select the username for your database.


Database URL * [string]

Select a relational database url


Selected database type * [single enum selection]

What type of database is used


Schema [string]

Select schema