Motivation

With ONE DATA there are many ways to process data within a workflow using different processors. Sometimes, however, it is necessary or simply easier to use custom computation methods. An R-Script, for instance, is a good way to customize the processing of data because the R programming language is well suited to such tasks. If you want to know why, more information can be found on the official site of the R-Project.

Since R-Scripts are a very useful tool for data science tasks, they can be included in ONE DATA workflows with the R-Script processors. In this article, we will focus on the R-Script Single Input Processor.


Overview

The R-Script Single Input Processor takes a dataset as input, executes an R-Script on it and forwards the result to ONE DATA. An overview of how R-Scripts are processed in ONE DATA and useful tips on how to install R packages can be found here.


Input

The processor can operate on any valid dataset produced by ONE DATA. Column names in the input dataset must not contain special characters or consist only of digits. Otherwise, ONE DATA will not recognize the input dataset.
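The exact rules ONE DATA applies to column names are not spelled out here, so the following is only a rough sketch for checking names in R before data reaches the processor. It assumes that letters, digits and underscores are safe and flags everything else, as well as names made up only of digits. The helper name find_problem_columns is hypothetical.

```r
# Hypothetical helper: returns TRUE for every column name that is likely
# to cause trouble (assumed rule: only letters, digits and "_" are safe,
# and a name must not consist of digits only).
find_problem_columns <- function(column_names) {
  only_digits  <- grepl("^[0-9]+$", column_names)
  has_special  <- grepl("[^A-Za-z0-9_]", column_names)
  only_digits | has_special
}

column_names <- c("title", "average_rating", "#_num_pages", "2020")
problems <- find_problem_columns(column_names)

# Print the names that should be renamed before using the dataset.
print(column_names[problems])
```

Columns flagged this way could be renamed in an earlier workflow step before the dataset is passed to the R-Script processor.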

If the input is valid, the column names are displayed at the bottom of the editor in the processor configuration. The R-Script is also entered in this editor.

In the "Input Name In Script" text field, it is possible to specify the name under which the input dataset is available in the R-Script. The default value is "input". The dataset can be assigned to an R data frame using:

dataset <- input



Configuration

The processor configuration gives some additional options on how the data should be processed and some definitions for the script execution itself.

Timeout For Script Execution

Time (in seconds) to wait for the R Server to return the script calculation results. The timeout starts as soon as the processor submits the R-Script and the data to the R Server. If the timeout is exceeded, the calculation is interrupted, the processor's connection to the R Server is released, and a ProcessorExecutionError is thrown.

The default value is 300 seconds.



Manual TI

With this configuration option, it is possible to specify which scale type and representation type the columns of the output dataset have, so that ONE DATA infers the correct types.

Possible scale types: nominal, interval, ordinal, ratio. Further information on scale types can be found here.

Possible representation types: string, int, double, datetime, numeric

If the values of a column cannot be converted to the specified representation type, the processor uses the type that fits the values best. If the types still do not fit the purpose, it is recommended to use the Data type Conversion Processor.
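One way to reduce surprises with type inference is to give each column an explicit R type inside the script before it is returned. The sketch below is only an illustration: the sample data is made up, and the assumption that R's character, integer, double and POSIXct classes correspond to the string, int, double and datetime representation types is not confirmed by this article.

```r
# Made-up sample result in which every column arrived as text.
result <- data.frame(
  title     = c("Book A", "Book B"),
  pages     = c("652", "870"),
  rating    = c("4.56", "4.49"),
  published = c("2005-07-16 00:00:00", "2003-06-21 00:00:00"),
  stringsAsFactors = FALSE
)

# Convert each column to the R type that matches the intended
# representation type (assumed mapping, see the note above).
result$pages     <- as.integer(result$pages)                  # int
result$rating    <- as.numeric(result$rating)                 # double
result$published <- as.POSIXct(result$published, tz = "UTC")  # datetime

# Show the resulting class of each column.
print(sapply(result, function(column) class(column)[1]))
```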


Output

The output of the R-Script Processor is the dataset that was produced by the R-Script in the configuration. There are two things to note on how the output has to be specified:

  • The output of the R-Script must be an R data frame. If the result has another type, convert it to a data frame first.
  • The last executed statement of the R-Script must be a return() call that returns the output data as a data frame:
return(as.data.frame( "insert name of output data here" ))
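The two points above can be sketched as follows. Because return() is only valid inside a function in plain R, the script body is wrapped in a hypothetical function run_script here so that it can be tried outside ONE DATA; inside the processor, only the body would be entered. The sample data is made up.

```r
# Hypothetical wrapper that stands in for the processor; "input" is the
# default input name from the processor configuration.
run_script <- function(input) {
  dataset <- input

  # colMeans() returns a named numeric vector, not a data frame.
  means <- colMeans(dataset)

  # Build a data frame from the vector so the output has the required type.
  result <- data.frame(
    column = names(means),
    mean   = unname(means),
    stringsAsFactors = FALSE
  )

  # Last statement: return the output as a data frame.
  return(as.data.frame(result))
}

out <- run_script(data.frame(a = c(1, 2, 3), b = c(4, 5, 6)))
print(out)
```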



Example

In this example, we want to extract the best-rated books from a books dataset.


Example Input

The following table represents a snippet of the dataset that we will use.

bookID | title | authors | average_rating | isbn | isbn13 | language_code | #_num_pages | ratings_count | text_reviews_count
-------|-------|---------|----------------|------|--------|---------------|-------------|---------------|-------------------
1 | Harry Potter and the Half-Blood Prince (Harry Potter  #6) | J.K. Rowling | 4.56 | 0439785960 | 9780439785969 | eng | 652 | 1944099 | 26249
2 | Harry Potter and the Order of the Phoenix (Harry Potter  #5) | J.K. Rowling | 4.49 | 0439358078 | 9780439358071 | eng | 870 | 1996446 | 27613
3 | Harry Potter and the Sorcerer's Stone (Harry Potter  #1) | J.K. Rowling | 4.47 | 0439554934 | 9780439554930 | eng | 320 | 5629932 | 70390



Example Script and Configuration

From the input, we want to extract all books that have a rating greater than or equal to 4.5 and keep only the columns "authors", "title", "average_rating" and "ratings_count". This is the corresponding R-Script:

output <- subset(books, average_rating>=4.5, select=c("authors", "title", "average_rating", "ratings_count"))
return(output)

In the processor configuration we choose "books" as name for the input dataset. We select the default timeout and do not configure Manual TI.
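To try the script outside ONE DATA, it can be run against the three rows from the example input. This is a minimal sketch: the wrapper function run_script is hypothetical and only exists because return() needs an enclosing function in plain R, and the data frame contains just the four columns the script uses.

```r
# The three snippet rows from the example input, reduced to the columns
# that the script actually reads.
books <- data.frame(
  title = c(
    "Harry Potter and the Half-Blood Prince (Harry Potter  #6)",
    "Harry Potter and the Order of the Phoenix (Harry Potter  #5)",
    "Harry Potter and the Sorcerer's Stone (Harry Potter  #1)"
  ),
  authors        = rep("J.K. Rowling", 3),
  average_rating = c(4.56, 4.49, 4.47),
  ratings_count  = c(1944099, 1996446, 5629932),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper; the argument name matches the configured
# input name "books".
run_script <- function(books) {
  output <- subset(books, average_rating >= 4.5,
                   select = c("authors", "title", "average_rating", "ratings_count"))
  return(output)
}

result <- run_script(books)
print(result)
```

Of these three rows, only the first one has a rating of at least 4.5, so the local result contains a single row.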


Workflow

The workflow loads the books dataset with a Data Table Load Processor, passes it to the R-Script Single Input Processor and saves the output to a Result Table.



Result

Here is a snippet of the results:


Related Articles

Using R in ONE DATA

R-Script Dual Input Processor

R-Script Data Generator Processor