ONE DATA offers many processors for handling data within a workflow. Sometimes, however, a custom computation method is necessary or simply easier. An R-Script is a good way to customize data processing because R was designed for statistical computing and data analysis. More background can be found on the official site of the R-Project.
Since R-Scripts are a very useful tool for data science tasks, it is possible to include them in ONE DATA workflows with the R-Script processors. In this article, we will focus on the R-Script Dual Input Processor.
The R-Script Dual Input Processor executes an R-Script on two input datasets. An overview of how R-Scripts are processed in ONE DATA, along with useful tips on installing R packages, can be found here.
The processor can operate on any valid dataset produced by ONE DATA. It has two input nodes which may be connected to two different datasets that should be processed.
Column names in the input datasets should not contain any special characters and should not consist only of numbers. Otherwise ONE DATA will not recognize them.
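The naming rule above can be checked before a dataset is connected to the processor. The following is a minimal sketch, not part of ONE DATA: the helper is_safe_column_name is hypothetical, and it assumes that "special characters" means anything other than letters, digits, and underscores, which this article does not define precisely.

```r
# Hypothetical helper: flags column names that follow the naming rule.
# Assumption: only letters, digits, and underscores count as non-special.
is_safe_column_name <- function(name) {
  only_plain_chars <- grepl("^[A-Za-z0-9_]+$", name)
  only_digits <- grepl("^[0-9]+$", name)
  only_plain_chars & !only_digits
}

# Made-up column names for illustration
column_names <- c("Company", "Revenue_2020", "2020", "Profit %")
safe <- is_safe_column_name(column_names)

# Names that would need to be renamed before use
names_to_fix <- column_names[!safe]
```

Here names_to_fix contains "2020" (numbers only) and "Profit %" (space and percent sign), so those two columns would be candidates for renaming.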
If the input is valid, the column names are displayed at the bottom of the editor in the processor configuration. The R-Script itself is also entered in this editor.
In the upper text fields, it is possible to specify the names under which the input datasets are available in the R-Script. The default values are "input1" and "input2". The datasets can be saved into R data frames using:
dataset1 <- input1
dataset2 <- input2
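To try a script outside ONE DATA, the two inputs can be replaced by stand-in data frames. The sketch below assumes that the inputs behave like ordinary R data frames inside the processor; the sample columns and values are invented for illustration.

```r
# Stand-ins for the datasets that ONE DATA would provide (made-up values)
input1 <- data.frame(Company = c("Alpha", "Beta"), Revenue = c(10, 20))
input2 <- data.frame(Company = c("Alpha", "Beta"), Costs = c(4, 25))

# Same assignments as in the processor script
dataset1 <- input1
dataset2 <- input2

# Quick sanity checks before writing the actual logic
both_are_data_frames <- is.data.frame(dataset1) && is.data.frame(dataset2)
columns_first_input <- names(dataset1)
```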
The processor configuration gives some additional options on how the data should be processed and some definitions for the script execution itself.
Timeout For Script Execution
Time (in seconds) to wait for the R Server to return the calculation results of the script. If this timeout is exceeded, the calculation is interrupted and the connection of this Processor to the R Server is released. The timeout starts when the Processor submits the R script and the data to the R Server.
The default value is 300 seconds.
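A rough way to judge whether a script fits within the timeout is to time it locally first. This is only a sketch: run_script is a hypothetical stand-in for the real script body, and a local measurement does not include the transfer of data to and from the R Server, so it should be read as a lower bound.

```r
timeout_seconds <- 300  # default value of the processor configuration

# Hypothetical stand-in for the real script body
run_script <- function() {
  data <- data.frame(x = 1:1000)
  data$y <- data$x * 2
  as.data.frame(data)
}

# system.time() reports the elapsed wall-clock time in seconds
timing <- system.time(result <- run_script())
elapsed_seconds <- unname(timing["elapsed"])

within_timeout <- elapsed_seconds < timeout_seconds
```

If the local run already comes close to the configured value, raising the timeout or reducing the input size is worth considering.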
With the Manual TI (type inference) configuration option, it is possible to specify the scale type and representation type of the output columns, so that ONE DATA infers the correct types.
Possible scale types: nominal, interval, ordinal, ratio. Further information on scale types can be found here.
If the values of a column cannot be converted to the specified representation type, the processor uses the type that fits them best. If the result still does not fit the purpose, it is recommended to use the Data type Conversion Processor.
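Columns can also be converted explicitly inside the script before the result is returned, which makes conversion failures visible early. The sketch below uses invented sample data. Whether the R column classes influence the type inference in ONE DATA is an assumption here and is not confirmed by this article.

```r
# Made-up raw data in which every column arrives as text
raw <- data.frame(
  id = c("1", "2", "3"),
  price = c("9.99", "12.50", "n/a"),
  stringsAsFactors = FALSE
)

typed <- raw
typed$id <- as.integer(raw$id)

# as.numeric() turns unparseable values into NA (with a warning)
typed$price <- suppressWarnings(as.numeric(raw$price))

# Rows whose price could not be converted
failed_rows <- which(is.na(typed$price) & !is.na(raw$price))
```

In this sample, failed_rows points to the third row, whose price "n/a" cannot be represented as a number.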
The output of the R-Script Processor is the dataset that was produced by the script in the configuration. There are two things to note on how the output has to be specified:
- The output of the R-Script must be an R data frame. Convert the result with as.data.frame() if it has a different type.
- The last executed statement of the R-Script must be a return() call that contains the data to be returned as a data frame.
return (as.data.frame( "insert name of output data here" ))
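The two rules above can be sketched as follows. In plain R, a return() call at the top level of a script is an error, so for local testing this sketch wraps the script body in a function; the name script_body and the sample data are hypothetical.

```r
# Stand-ins for the two inputs (made-up values)
input1 <- data.frame(Company = c("A", "B"), Revenue = c(10, 20))
input2 <- data.frame(Company = c("A", "B"), Costs = c(4, 25))

# Hypothetical wrapper that plays the role of the processor script
script_body <- function(input1, input2) {
  merged <- merge(input1, input2, by = "Company")

  # cbind() yields a matrix, which is not a valid output type ...
  profit <- cbind(Profit = merged$Revenue - merged$Costs)

  # ... so it is converted to a data frame in the final return()
  return(as.data.frame(profit))
}

output <- script_body(input1, input2)
```

Inside the processor, only the body of the function would be entered, with return(as.data.frame(profit)) as the last statement.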
In this example, we use the R-Script Dual Input Processor to keep only those observations of the first input dataset A whose values also occur in the second input dataset B. The input datasets were generated using a Custom Input Table processor.
Example Script and Configuration
The example script filters table A by the values of table B:
# Select only companies that are in second input table
output <- tableA[tableA$Company %in% tableB$Categories, ]
return(as.data.frame(output))
We name the first input "tableA" and the second one "tableB".
For the other configuration options we keep the defaults: a timeout of 300 seconds and no Manual TI configuration.
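The filter can be tried locally with sample data. The tables below are invented stand-ins for the Custom Input Table processors, and the script is wrapped in a hypothetical function filter_script because a top-level return() is not valid in plain R.

```r
# Made-up stand-ins for the two Custom Input Tables
tableA <- data.frame(
  Company = c("Alpha", "Beta", "Gamma", "Delta"),
  Employees = c(120, 45, 300, 15),
  stringsAsFactors = FALSE
)
tableB <- data.frame(
  Categories = c("Beta", "Delta"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around the example script
filter_script <- function(tableA, tableB) {
  # Select only companies that are in second input table
  output <- tableA[tableA$Company %in% tableB$Categories, ]
  return(as.data.frame(output))
}

result <- filter_script(tableA, tableB)
```

With these sample tables, result keeps the rows for "Beta" and "Delta" and drops the other two companies.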