The Random Number Generator Processor creates a data set filled with random values in one or more columns.
This processor is useful for quickly generating a large data set for testing.
Partitions: Specify the number of partitions (run-throughs) used to generate the data. The level of parallelism that Spark uses to generate the data increases with the number of partitions.
This refers to Spark Resilient Distributed Dataset (RDD) partitions. The setting has little effect for small data sets, but for very large generated tables it is wise to use more partitions, since each single partition must fit in memory at any given time.
Rows per partition: The effect of this setting depends on whether "Use long-list format" is enabled or disabled:
- Enabled: Number of rows that should be generated for every configured generator. The generators will be distributed across the configured partitions.
- Disabled: Number of rows that should be generated per partition (range: 1 to 1,000,000). Consider using more partitions when you want to generate more rows.
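The row arithmetic described above can be sketched as follows. This is only one reading of the two modes, not the processor's actual implementation: it assumes that the wide format yields partitions times rows per partition, and that the long-list format yields one row per drawing per generator. The function name `total_rows` and its parameters are hypothetical.

```python
def total_rows(partitions: int, rows_setting: int, generators: int, long_list: bool) -> int:
    """Estimate the number of output rows (assumed reading of the settings)."""
    if long_list:
        # Assumption: long-list format emits one row per drawing per generator,
        # independent of how the generators are spread across partitions.
        return rows_setting * generators
    # Assumption: wide format emits rows_setting rows in every partition.
    return partitions * rows_setting


wide_count = total_rows(partitions=4, rows_setting=1000, generators=3, long_list=False)
long_count = total_rows(partitions=4, rows_setting=1000, generators=3, long_list=True)
print(wide_count, long_count)  # prints "4000 3000"
```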
Use long-list format: Toggles the output format between a wide table and a long list. As an example, we use three random generators named generator_1, generator_2 and generator_3.
With this option disabled, output will be as follows (one column per generator):
With this option enabled, there will be three columns regardless of the number of generators:
- name is the generator name (the column name in the table above)
- drawing corresponds to the row number in the table above
- value is the drawn value of the generator (note that the value type will always be STRING)
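The relationship between the two formats can be illustrated with a small plain-Python sketch. The input values are made-up placeholders, not real processor output, and the helper `wide_to_long` is hypothetical. Numbering drawings from 1 is also an assumption.

```python
def wide_to_long(wide_rows):
    """Convert wide rows (one dict per drawing) into name/drawing/value rows."""
    long_rows = []
    for drawing, row in enumerate(wide_rows, start=1):  # assumed 1-based numbering
        for name, value in row.items():
            # In the long-list format the value column is always a string.
            long_rows.append({"name": name, "drawing": drawing, "value": str(value)})
    return long_rows


# Placeholder wide-format rows: one column per generator.
wide_rows = [
    {"generator_1": 0.25, "generator_2": 7, "generator_3": "a"},
    {"generator_1": 0.75, "generator_2": 3, "generator_3": "b"},
]

long_rows = wide_to_long(wide_rows)
for row in long_rows:
    print(row["name"], row["drawing"], row["value"])
```

Two wide rows with three generators become six long rows, each carrying exactly the three columns name, drawing and value.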
Random Generators: Configure one or more random variables/columns that should be generated:
- Select CSV File Button: You can upload a CSV file for generating the random column here, e.g. a pool of text values. (An example CSV file is attached below.)
- Filter Config Rows: You can filter a column to configure.
- Type: Enter the type of distribution which the generated random numbers should follow. Available Options: Uniform (default), Normal and Discrete.
- Name: Enter a name for the column containing the generated random numbers.
- Seed: Enter a number. Using the same seed makes it possible to reproduce the exact same data set.
- Further Options: Depending on the chosen distribution (type), several other parameters are available for configuration, such as mean or standard deviation.
Add Distribution: Click on this button to add another column with random numbers and an additional configuration.
Multiple configurations within the same Random Number Generator are independent.
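How a seed makes independent generator configurations reproducible can be sketched with Python's standard `random` module. This is an illustration only, not the processor's Spark implementation. The dictionary keys (`type`, `seed`, `min`, `max`, `mean`, `std`, `values`, `weights`) and the function `draw` are hypothetical names chosen for the example.

```python
import random


def draw(config, n):
    """Draw n values for one generator configuration using its own seeded RNG."""
    rng = random.Random(config["seed"])  # separate RNG per configuration
    kind = config["type"]
    if kind == "uniform":
        return [rng.uniform(config["min"], config["max"]) for _ in range(n)]
    if kind == "normal":
        return [rng.gauss(config["mean"], config["std"]) for _ in range(n)]
    if kind == "discrete":
        return rng.choices(config["values"], weights=config["weights"], k=n)
    raise ValueError(f"unknown type: {kind}")


# Hypothetical configurations, one per distribution type.
configs = [
    {"name": "u", "type": "uniform", "seed": 1, "min": 0.0, "max": 1.0},
    {"name": "n", "type": "normal", "seed": 2, "mean": 0.0, "std": 1.0},
    {"name": "d", "type": "discrete", "seed": 3, "values": ["a", "b"], "weights": [0.2, 0.8]},
]

first_run = {c["name"]: draw(c, 5) for c in configs}
second_run = {c["name"]: draw(c, 5) for c in configs}
print(first_run == second_run)  # prints "True": same seeds, same data
```

Because each configuration owns its generator state in this sketch, changing the seed or parameters of one column leaves the values of the other columns unchanged.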
A data set with random values according to the configuration is generated. The number of rows depends on the number of partitions and rows per partition chosen. The number of columns depends on the number of distributions added to the configuration.
In this example we will use the Random Number Generator Processor to generate a random data set containing two columns. The first column contains numbers from a normal distribution with a mean of 10 and a standard deviation of 2. The second column contains numbers from a uniform distribution with a minimum of 0 and a maximum of 100. The workflow is given below.
- Partitions: 254232 (use the slider to increase or decrease the value, or enter a value manually in the VARIABLES field)
- Rows per partition: 10
- Random generators:
Resulting tables consist of two columns with randomly generated numbers based on the provided configuration.
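As a rough cross-check of this example, the sketch below computes the expected row count and draws a small sample from the same two distributions with Python's standard library. It assumes the wide format (long-list format disabled), so that the row count is partitions times rows per partition. The seed 42, the sample size, and the function `sample_columns` are arbitrary choices for illustration.

```python
import random
import statistics

PARTITIONS = 254232
ROWS_PER_PARTITION = 10

# Assumption: in wide format, every partition contributes ROWS_PER_PARTITION rows.
expected_rows = PARTITIONS * ROWS_PER_PARTITION
print(expected_rows)  # prints "2542320"


def sample_columns(n, seed=42):
    """Draw n rows for the two example columns (normal and uniform)."""
    rng = random.Random(seed)
    normal_col = [rng.gauss(10, 2) for _ in range(n)]      # mean 10, std 2
    uniform_col = [rng.uniform(0, 100) for _ in range(n)]  # between 0 and 100
    return normal_col, uniform_col


normal_col, uniform_col = sample_columns(10_000)
print("normal mean:", statistics.mean(normal_col))
print("normal stdev:", statistics.stdev(normal_col))
print("uniform range:", min(uniform_col), max(uniform_col))
```

The sample statistics should land close to the configured mean, standard deviation and bounds, which is a quick way to sanity-check a generated table.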