With ONE DATA there are many ways to process data within a workflow using different processors. Sometimes, however, a custom computation method is necessary or simply easier. An R-Script is a good way to customize the processing of data because the R language is designed for statistical computing. More background can be found on the official site of the R-Project.
Since R-Scripts are a very useful tool for data science tasks, they can be included in ONE DATA workflows with the R-Script processors. In this article, we will focus on the R-Script Single Input Processor.
The R-Script Single Input Processor takes a dataset as input, executes an R-Script on it and forwards the result to ONE DATA. An overview of how R-Scripts are processed in ONE DATA and useful tips on how to install R packages can be found here.
The processor can operate on any valid dataset produced by ONE DATA. Column names in the input dataset should not contain any special characters or consist only of numbers. Otherwise ONE DATA will not recognize the input dataset.
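The naming rule above can be checked locally in R before a dataset is loaded into ONE DATA. The following is a minimal sketch, not part of ONE DATA: the helper name find_suspect_columns is hypothetical, and it assumes that "special characters" means anything other than ASCII letters, digits and underscores, which may be stricter or looser than what ONE DATA actually accepts.

```r
# Hypothetical helper (not a ONE DATA function): returns the column names
# that contain special characters or consist only of digits.
# Assumption: only ASCII letters, digits and underscores count as "safe".
find_suspect_columns <- function(column_names) {
  has_special <- grepl("[^A-Za-z0-9_]", column_names)
  only_digits <- grepl("^[0-9]+$", column_names)
  column_names[has_special | only_digits]
}

# Illustrative data frame; check.names = FALSE keeps the names as written.
example <- data.frame(
  title = "a",
  `avg rating` = 4.5,
  `2019` = 1,
  check.names = FALSE
)

suspect <- find_suspect_columns(names(example))
print(suspect)
```

Any name the helper reports would be a candidate for renaming before the dataset is passed to the processor.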
If the input is valid, the column names are displayed at the bottom of the editor in the processor configuration. The R-Script itself is also entered in this editor.
In the "Input Name In Script" text field, it is possible to specify the name under which the input dataset is available in the R-Script. The default value is "input". The dataset can be saved into an R data frame using:
dataset <- input
The processor configuration gives some additional options on how the data should be processed and some definitions for the script execution itself.
Timeout For Script Execution
Time (in seconds) to wait for the R Server to return the results of the script calculation. The timeout starts as soon as the Processor submits the R script and the data to the R Server. If the timeout is exceeded, the calculation is interrupted, the connection of this Processor to the R Server is released and a ProcessorExecutionError is thrown.
The default value is 300 seconds.
Manual TI
With this configuration option it is possible to specify which scale type and representation type the columns of the output dataset have, so that ONE DATA infers the correct types.
Possible scale types: nominal, interval, ordinal, ratio. Further information on scale types can be found here.
Possible representation types: string, int, double, datetime, numeric
If it is not possible to convert the values of a column to the specified representation type, the processor will take the type that fits best for their representation. If types still do not fit the purpose, it is recommended to use the Data type Conversion Processor.
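Independently of the processor configuration, the script itself can cast columns explicitly before returning them. The sketch below uses only base R; the helper name coerce_columns and the sample values are illustrative assumptions, and whether ONE DATA's type inference follows the R column classes exactly is also an assumption rather than documented behavior.

```r
# Hypothetical helper: casts selected columns to explicit R types so that
# the returned data frame does not carry numbers stored as text.
coerce_columns <- function(df) {
  df$title <- as.character(df$title)
  df$average_rating <- as.numeric(df$average_rating)
  df$ratings_count <- as.integer(df$ratings_count)
  df
}

# Illustrative input in which every column arrives as text.
# The titles and ratings_count values are placeholders, not real data.
raw <- data.frame(
  title = c("Book A", "Book B"),
  average_rating = c("4.56", "4.49"),
  ratings_count = c("100", "250"),
  stringsAsFactors = FALSE
)

typed <- coerce_columns(raw)
str(typed)
```

If a value cannot be parsed, as.numeric() and as.integer() produce NA with a warning, so it can be worth checking for NA values before returning the result.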
The output of the R-Script Processor is the dataset that was produced by the R-Script in the configuration. There are two things to note on how the output has to be specified:
- The output of the R-Script needs to be an R data frame. Make sure to convert the output to a data frame if it has a different type.
- The last executed statement of the R-Script needs to be a return() call that passes the data to be returned as a data frame:
return (as.data.frame( "insert name of output data here" ))
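Both rules can be tried out locally with a small sketch. It assumes that ONE DATA evaluates the script in a context where return() is valid; to make the example runnable in a plain R session, the script body is wrapped in a hypothetical function called run_script, and the input data is made up for illustration.

```r
# Wrapper used only for local testing; in ONE DATA the body would be
# entered directly as the script (assumption, see the text above).
run_script <- function(input) {
  dataset <- input

  # colMeans() returns a named numeric vector, not a data frame.
  numeric_columns <- sapply(dataset, is.numeric)
  means <- colMeans(dataset[numeric_columns])

  # Build a data frame from the vector so the output has the required type.
  result <- data.frame(
    column = names(means),
    mean = unname(means),
    stringsAsFactors = FALSE
  )

  # The last executed statement returns the data frame.
  return(as.data.frame(result))
}

# Illustrative input, not real data.
input <- data.frame(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

output <- run_script(input)
print(output)
```

The as.data.frame() call is redundant here because result is already a data frame, but keeping it mirrors the pattern shown above and is harmless.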
In this example, we want to extract the best-rated books from a books dataset.
The following table represents a snippet of the dataset that we will use.
|1||Harry Potter and the Half-Blood Prince (Harry Potter #6)||J.K. Rowling||4.56||0439785960||978043|
|2||Harry Potter and the Order of the Phoenix (Harry Potter #5)||J.K. Rowling||4.49|
|3||Harry Potter and the Sorcerer's Stone (Harry Potter #1)||J.K. Rowling||4.47||0439554934||978043|
Example Script and Configuration
From the input we want to extract all books that have a rating greater than or equal to 4.5, and keep only the columns "authors", "title", "average_rating" and "ratings_count". This is the corresponding R-Script:
output <- subset(books, average_rating >= 4.5, select = c("authors", "title", "average_rating", "ratings_count"))
return(output)
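To check the filter outside ONE DATA, the script can be wrapped in a function and run against a small hand-built data frame. This is only a local sketch: the wrapper run_script is hypothetical, the titles and ratings are taken from the snippet above, and the ratings_count values are placeholders because the snippet does not show that column.

```r
# Wrapper used only for local testing; the argument name matches the
# input name "books" chosen in the processor configuration.
run_script <- function(books) {
  output <- subset(
    books,
    average_rating >= 4.5,
    select = c("authors", "title", "average_rating", "ratings_count")
  )
  return(output)
}

books <- data.frame(
  title = c(
    "Harry Potter and the Half-Blood Prince (Harry Potter #6)",
    "Harry Potter and the Order of the Phoenix (Harry Potter #5)",
    "Harry Potter and the Sorcerer's Stone (Harry Potter #1)"
  ),
  authors = "J.K. Rowling",
  average_rating = c(4.56, 4.49, 4.47),
  # Placeholder counts, NOT values from the real dataset.
  ratings_count = c(100L, 200L, 300L),
  stringsAsFactors = FALSE
)

result <- run_script(books)
print(result)
```

With these three rows, only the first book passes the threshold, since 4.49 and 4.47 are both below 4.5.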
In the processor configuration we choose "books" as name for the input dataset. We select the default timeout and do not configure Manual TI.
The workflow loads the books dataset with a Data Table Load Processor, passes it to the R-Script Single Input Processor and saves the output to a Result Table.
Here is a snippet of the results: