Overview

The Input-based Random Number Generator Processor creates a data set filled with random values in one or more columns. The configuration of the random generators is done via the Processor's input. This Processor can be useful for generating a large data set for testing in a short amount of time.

Input

The Processor reads its generator configuration from the input dataset. Every row must therefore contain a non-empty, SQL-compatible name, a distribution type and a numeric seed. Invalid configuration entries in the input data result in errors unless the Processor option "Error on invalid configuration" is disabled.


Three types of distributions can be configured: uniform, normal and discrete. A valid input would be the following:

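type     | name    | seed | min | mean | max | standard_deviation | value | probability
uniform  | column1 | 123  | 0   |      | 100 |                    |       |
normal   | column2 | 456  |     | 100  |     | 10.5               |       |
normal   | column3 | 789  |     | 0.50 |     | 0.25               |       |
discrete | column4 | 100  |     |      |     |                    | val1  | 0.75
discrete | column4 | 100  |     |      |     |                    | val2  | 0.25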

Uniform:

  • Must always have a numeric value for their min and max settings (min < max).
  • There should only be one line for each configuration (mapped via name). 
  • Result columns will contain numeric content.

Normal:

  • Must always have a numeric value for their mean and a positive numeric value for their standard_deviation.
  • There should only be one line for each configuration (mapped via name).
  • Result columns will contain numeric content.

Discrete:

  • Must always have a non-null value in their value column and a positive numeric value in their probability column.
  • Usually span multiple lines, one per possible value. The lines are mapped to one configuration via their name.
  • Result columns will always contain strings.
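
To make the three distribution types concrete, the following is a minimal sketch using Python's standard `random` module. It is not the Processor's Spark implementation, and the helper `make_generator` as well as the dictionary layout of the rows are assumptions made for illustration. The weights passed to `random.choices` are normalized by Python; whether the Processor requires probabilities to sum to 1 is not stated here.

```python
import random


def make_generator(config_rows):
    """Build a zero-argument draw function from the rows sharing one name.

    Hypothetical helper: the row layout mirrors the input columns above.
    """
    first = config_rows[0]
    rng = random.Random(first["seed"])  # seeded, so draws are reproducible
    kind = first["type"]
    if kind == "uniform":
        low, high = first["min"], first["max"]
        return lambda: rng.uniform(low, high)
    if kind == "normal":
        mean, sd = first["mean"], first["standard_deviation"]
        return lambda: rng.gauss(mean, sd)
    if kind == "discrete":
        # Discrete configurations span several rows: one per possible value
        values = [row["value"] for row in config_rows]
        weights = [row["probability"] for row in config_rows]
        return lambda: rng.choices(values, weights=weights, k=1)[0]
    raise ValueError(f"unknown distribution type: {kind!r}")


uniform_rows = [{"type": "uniform", "seed": 123, "min": 0, "max": 100}]
discrete_rows = [
    {"type": "discrete", "seed": 100, "value": "val1", "probability": 0.75},
    {"type": "discrete", "seed": 100, "value": "val2", "probability": 0.25},
]

uniform_draw = make_generator(uniform_rows)
discrete_draw = make_generator(discrete_rows)
samples = [uniform_draw() for _ in range(5)]  # numeric content
labels = [discrete_draw() for _ in range(5)]  # string content
```

Because each generator is seeded, building it again from the same rows reproduces the same sequence of draws in this sketch.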


Configuration


Partitions: Specifies the number of run-throughs. The more partitions are configured, the higher the level of parallelism that Spark uses to generate the actual data.


This refers to the partitions of Spark Resilient Distributed Datasets (RDDs). The setting has no effect for small tables, but for very large generated tables it is wise to use more partitions, since each single partition must fit into the working memory of the server at any given time.


Rows per partition: The effect of this configuration depends on "Use long-list format" being enabled or disabled:

  • Enabled: Number of rows that are created for every configured generator. The generators are distributed across the configured partitions.
  • Disabled: Number of rows that are generated per partition (range: 1 to 1,000,000). Consider using more partitions when you want to generate more rows. The total number of generated rows is the number of partitions multiplied by the number of rows per partition.
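
The arithmetic behind the two modes can be sketched as follows. This is an illustration only: `total_rows` and its parameter names are hypothetical, and the long-list branch assumes that the total is the number of generators times the configured row count, which is one reading of the description above.

```python
def total_rows(partitions, rows_per_partition, generators, long_list):
    """Estimate the number of output rows (hypothetical helper)."""
    if long_list:
        # Enabled: the setting counts the rows created for every generator.
        # Assumption: partitions only distribute the work, they do not
        # multiply the row count.
        return generators * rows_per_partition
    # Disabled: the documented range for rows per partition applies
    if not 1 <= rows_per_partition <= 1_000_000:
        raise ValueError("rows per partition must be in 1..1,000,000")
    return partitions * rows_per_partition


broad_total = total_rows(2, 3, generators=4, long_list=False)
long_total = total_rows(2, 3, generators=4, long_list=True)
```

With two partitions, three rows per partition and four generators, this gives 6 rows in the broad format and, under the stated assumption, 12 rows in the long-list format.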


Use Long-list format: As an example, we use three random generators named generator_1, generator_2 and generator_3.

With this option disabled, a broad list is generated: the output contains one column per generator.


generator_1 | generator_2 | generator_3
1.03        | 0.01        | val1
0.34        | 0.32        | val1
2.56        | 0.09        | val2
3.21        | 0.87        | val1
4.20        | 0.56        | val2


With this option enabled, there will be three columns regardless of the number of generators:

  • name is the generator name, which was the column name in the table above
  • drawing corresponds to the row number in the table above
  • value is the drawn value of the generator (the value type being string)

name        | drawing | value
generator_1 | 1       | 1.03
generator_1 | 2       | 0.34
generator_1 | 3       | 2.56
generator_1 | 4       | 3.21
generator_1 | 5       | 4.20
generator_2 | 1       | 0.01
generator_2 | 2       | 0.32
generator_2 | 3       | 0.09
generator_2 | 4       | 0.87
generator_2 | 5       | 0.56
generator_3 | 1       | val1
generator_3 | 2       | val1
generator_3 | 3       | val2
generator_3 | 4       | val1
generator_3 | 5       | val2
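
The relationship between the two formats can be sketched in plain Python. The function `to_long_list` and the dictionary representation of the broad table are assumptions made for illustration; the Processor itself produces the long-list format directly. Values are converted with Python's `str`, so the exact string formatting may differ from the Processor's output.

```python
def to_long_list(broad_columns):
    """Turn {generator name: [drawn values]} into (name, drawing, value) rows."""
    rows = []
    for name, values in broad_columns.items():
        # drawing is the 1-based row number in the broad table
        for drawing, value in enumerate(values, start=1):
            rows.append((name, drawing, str(value)))  # value is always a string
    return rows


broad = {
    "generator_1": [1.03, 0.34, 2.56],
    "generator_2": [0.01, 0.32, 0.09],
    "generator_3": ["val1", "val1", "val2"],
}
long_rows = to_long_list(broad)
```

Three generators with three drawings each yield nine long-list rows, the first being `("generator_1", 1, "1.03")`.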


Error on invalid configuration:

By default, the Processor throws an error and the Workflow execution is stopped when an invalid configuration is encountered in the Processor input. When this setting is disabled, the Processor is more permissive: it ignores invalid configurations and generates a warning for each of them instead of an error.

If all input configurations are invalid or there are problems with the input schema, the Processor will generate an error regardless of this setting.


Disabling this setting can lead to unexpected RNG configurations when executing with varying input.
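
The validation rules from the Input section and the two policies of this setting can be sketched together as follows. This is a simplified illustration, not the Processor's actual logic: `validate_row`, `select_valid`, the messages and the dictionary row layout are all hypothetical, and checks for SQL-compatible names, duplicate rows and the input schema are left out.

```python
def validate_row(row):
    """Return a problem description, or None if the row looks valid."""
    def is_number(x):
        return isinstance(x, (int, float)) and not isinstance(x, bool)

    if not row.get("name"):
        return "missing name"
    if not is_number(row.get("seed")):
        return "seed is not numeric"
    kind = row.get("type")
    if kind == "uniform":
        low, high = row.get("min"), row.get("max")
        if not (is_number(low) and is_number(high) and low < high):
            return "uniform needs numeric min < max"
    elif kind == "normal":
        sd = row.get("standard_deviation")
        if not (is_number(row.get("mean")) and is_number(sd) and sd > 0):
            return "normal needs a numeric mean and a positive standard_deviation"
    elif kind == "discrete":
        p = row.get("probability")
        if row.get("value") is None or not (is_number(p) and p > 0):
            return "discrete needs a value and a positive probability"
    else:
        return f"unknown distribution type: {kind!r}"
    return None


def select_valid(rows, error_on_invalid=True):
    """Apply the strict or the permissive policy to a list of config rows."""
    valid, warnings = [], []
    for row in rows:
        problem = validate_row(row)
        if problem is None:
            valid.append(row)
        elif error_on_invalid:
            raise ValueError(f"invalid configuration {row!r}: {problem}")
        else:
            warnings.append(f"ignored {row!r}: {problem}")
    if not valid:
        # All configurations invalid: an error regardless of the setting
        raise ValueError("no valid configuration in input")
    return valid, warnings


rows = [
    {"type": "uniform", "name": "column1", "seed": 123, "min": 0, "max": 100},
    {"type": "normal", "name": "column2", "seed": 456,
     "mean": 100, "standard_deviation": -1},  # invalid: negative deviation
]
valid, warnings = select_valid(rows, error_on_invalid=False)
```

In permissive mode the second row is dropped with a warning and only column1 is generated, which illustrates how varying input can silently change the set of active generators.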


Output

A data set with random values according to the configuration is generated. The number of rows depends on the number of partitions and rows per partition, as well as on the chosen format, as described for the setting "Use Long-list format".


Example

In this example, we show the results in a Result Table, but you can also use the output as direct input for another Processor or save it as a Data Table.

Workflow


Example Input

The example input from the Input section is used:


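type     | name    | seed | min | mean | max | standard_deviation | value | probability
uniform  | column1 | 123  | 0   |      | 100 |                    |       |
normal   | column2 | 456  |     | 100  |     | 10.5               |       |
normal   | column3 | 789  |     | 0.50 |     | 0.25               |       |
discrete | column4 | 100  |     |      |     |                    | val1  | 0.75
discrete | column4 | 100  |     |      |     |                    | val2  | 0.25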


Example Configuration

The configuration is set as the image shows, with two partitions and three rows per partition. We first run the workflow with "Use Long-list format" disabled, and then enabled.


Result

The result columns will then be filled with values according to the configuration in the input.

  • column1: between 0 and 100
  • column2 and column3: around the mean values
  • column4: either val1 or val2 (in this case only val1)


Having "Use Long-list format" disabled, the result will be the following.

 

Having "Use Long-list format" enabled, the result will be the following. As explained previously, the "value" column is of type string and the "drawing" column shows the row number of the rows in the broad-list format.