The Caching Processor caches its input dataset and forwards the cached dataset. This can improve performance for large, iterative, or complex calculations on the same dataset, because the cached result is reused instead of recalculating the entire preceding part of the workflow. Once the processor has run successfully, the workflow uses it as the starting point for further calculations.

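The effect described above can be illustrated with a minimal, Spark-free sketch. It is not the processor's implementation: `CachingStep`, `expensive_upstream`, and the `calls` counter are hypothetical names used only to show that the upstream calculation runs once and that later steps read the stored result.

```python
class CachingStep:
    """Toy stand-in for a caching step: runs the upstream once, then reuses the result."""

    def __init__(self, upstream):
        self._upstream = upstream   # callable that produces the dataset
        self._cached = None
        self._filled = False

    def get(self):
        # Compute the upstream result only on the first request.
        if not self._filled:
            self._cached = self._upstream()
            self._filled = True
        return self._cached


calls = {"upstream": 0}  # counts how often the upstream actually runs


def expensive_upstream():
    # Hypothetical stand-in for the preceding part of a workflow.
    calls["upstream"] += 1
    return [x * x for x in range(5)]


cache = CachingStep(expensive_upstream)

# Two independent downstream calculations start from the same cached dataset.
total = sum(cache.get())
largest = max(cache.get())
```

Both downstream calculations read from the cache, so `calls["upstream"]` stays at 1 no matter how many further steps are attached.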

The processor can have any valid dataset as input.


The "Dataframe-Based Caching" option determines whether the DataFrame or the underlying raw Resilient Distributed Dataset (RDD) is cached. Caching the DataFrame saves cache space and may improve the performance of subsequent queries, and of the workflow overall, because it gives Spark more optimization options. DataFrame-based caching is enabled by default.

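In plain PySpark terms, the two alternatives roughly correspond to the sketch below. This is an assumption about how such a switch could map onto Spark calls, not the processor's actual code: `cache_dataset` and its `dataframe_based` parameter are hypothetical, while `DataFrame.cache()`, `DataFrame.rdd`, `RDD.cache()`, and `SparkSession.createDataFrame()` are standard PySpark calls.

```python
def cache_dataset(spark, df, dataframe_based=True):
    """Return a cached version of df (hypothetical helper, for illustration only).

    spark: an active SparkSession
    df: a Spark DataFrame
    dataframe_based: mirrors the "Dataframe-Based Caching" option (default on)
    """
    if dataframe_based:
        # Mark the DataFrame itself for caching; Spark keeps its schema
        # and can keep optimizing queries that build on it.
        return df.cache()

    # Otherwise mark the underlying RDD of Row objects for caching and
    # rebuild a DataFrame on top of it using the original schema.
    cached_rdd = df.rdd.cache()
    return spark.createDataFrame(cached_rdd, df.schema)
```

In both branches Spark caches lazily: the data is materialized the first time an action runs on the returned dataset, and the rows themselves are the same either way.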

The cached dataset acts as a starting point for further calculations. The actual content of the dataset is not changed by the processor.