This processor stores data into a new or existing data table.
This processor is a save processor, so it is generally used at the end of a Workflow. The input is a dataset that should be stored as a data table in ONE DATA.
When the processor is opened, the following configuration interface pops up:
For the target storage type, there are three options available:
- Data Table (default): Uses a data table within OD as target storage.
- Connection: File System: Directly store your data to a filesystem without needing to define a respective data table in OD. A more detailed explanation can be found below.
- Connection: Column Family: Directly store your data to a Cassandra DB without needing to define a respective data table in OD. Similar to the file system, a Connection needs to be configured for it.
The user has two options to specify the data table in which to save the data:
- By name: using the input field.
- By variables: using System and/or Workflow variables is very handy for saving the data under a specific naming pattern (UserName_Workflow ...):
The processor also has some optional fields, which we will explain in the following sections.
The format in which the data will be stored; both external and internal formats are supported:
Create a new data table or Replace/Append an existing one.
The different modes behave as follows:
- Replace: If no data has been persisted yet, this mode will create new data at the specified location. If data already existed before, this "old" data will be replaced by the new data.
- Append: If no data has been persisted yet, this mode will create new data at the specified location. If data already existed before, this mode will append the newly created data to the existing one.
- Create Once: If no data has been persisted yet, this mode will create new data at the specified location. If data already existed before, this mode will not replace the existing data, so nothing happens.
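The semantics of the three modes can be sketched in Python, using a plain file as a stand-in for the data table (`save` is a hypothetical helper for illustration, not the ONE DATA API):

```python
import os

def save(path, data, mode):
    """Sketch of the three save modes; a plain file stands in for the data table."""
    if mode == "replace":
        # Create new data, discarding any previously persisted data.
        with open(path, "w") as f:
            f.write(data)
    elif mode == "append":
        # Create the data if missing; otherwise add the new rows at the end.
        with open(path, "a") as f:
            f.write(data)
    elif mode == "create_once":
        # Only write if nothing has been persisted yet; otherwise do nothing.
        if not os.path.exists(path):
            with open(path, "w") as f:
                f.write(data)
```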
Please note that you should always specify a name or a data table to save your data!
Schema Mismatch Behaviour
When replacing or appending an existing data table whose schema does not match the input data, the user has to select an option that determines how to resolve the mismatch:
Save to a Filesystem Connection
When you want to save your data to a Filesystem Connection, some additional configuration is needed. The following menu appears as soon as you select the corresponding target storage type.
With this option it is possible to define how the saved result should be compressed. There are three possibilities:
- PLAIN: If this option is selected, ONE DATA does not compress the result and saves the content as plain text to the file defined in the "File Name" section.
- ZIP: With this option the file containing the result content will be compressed by adding it to a ZIP archive.
- GZIP: Similar to the ZIP configuration, the file will be compressed using a GZIP archive.
File Name of the Packaging File
Define the name of the surrounding packaging file.
If needed, it is also possible to define a relative path here (relative to the path inside the Connection), e.g. subfolder/filename.csv. If the specified path does not exist yet, it will be created automatically.
Define the name of the file in which the data will be saved.
If needed, it is also possible to define a relative path here (relative to the path inside the Connection or the packaging file), e.g. subfolder/filename.csv. Note that GZIP archives do not support paths inside them.
Please note the following:
- The extension indicating the data type of the file (e.g. ".txt" or ".csv") won't be added automatically, so if you need it, it has to be defined here.
- The file name is still important if you have chosen ZIP or GZIP as packaging type, because it determines the name of the file contained inside the ZIP or GZIP archive.
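The interplay of packaging type, packaging file name, and file name could be sketched in Python as follows (`write_result` and its parameter names are hypothetical, not the ONE DATA implementation):

```python
import gzip
import zipfile

def write_result(packaging, package_name, file_name, content):
    """Sketch of the three packaging options for a Filesystem Connection."""
    if packaging == "PLAIN":
        # No compression: the content is written directly to the result file.
        with open(file_name, "w") as f:
            f.write(content)
    elif packaging == "ZIP":
        # The result is stored inside a ZIP archive; the file name
        # determines the name of the entry inside the archive.
        with zipfile.ZipFile(package_name, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(file_name, content)
    elif packaging == "GZIP":
        # GZIP compresses a single file; the packaging file name is the
        # archive on disk (no inner paths are supported).
        with gzip.open(package_name, "wt") as f:
            f.write(content)
```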
Data Type of File
With these configuration options it is possible to define the data type of the result file and how it should be parsed.
- Select the data type. At the moment only CSV is available.
- The delimiter token is used to indicate the boundary between two separate data entities. For example, if one row in a CSV looks like "Bier", "Butter", "Brezn" the comma indicates the separation into the three different words.
- The escape token is used to indicate that a character is not a control character but the actual character. For example, if a cell contains a string with a comma, the escape token is placed before the comma so that a CSV parser does not recognize it as a column separator. So, for a row containing the string "not\, separated", you will receive the desired single data entry "not, separated" rather than the string being split into "not" and "separated" (assuming the comma is selected as delimiter in the other configuration).
- The string escape token is used to indicate where a string starts.
- Select the file encoding. Currently only UTF-8 is supported.
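As an illustration with Python's standard `csv` module (not ONE DATA itself), this is how the delimiter, escape token, and string escape token interact when a row is written:

```python
import csv
import io

# Write one row using an explicit delimiter, escape token, and string
# escape token (quote character), with quoting disabled so the escape
# token is what protects embedded delimiters.
buf = io.StringIO()
writer = csv.writer(
    buf,
    delimiter=",",        # boundary between two data entities
    escapechar="\\",      # marks the following character as a literal
    quotechar='"',        # string escape token: marks where a string starts
    quoting=csv.QUOTE_NONE,
)
writer.writerow(["Bier", "Butter", "not, separated"])
print(buf.getvalue())  # Bier,Butter,not\, separated
```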
Note that even though the data type of the result file is CSV, and it is parsed as such, the file can for example be saved as ".txt" file if the extension is defined in the "File Name" section. For the mentioned example this works without errors, but this is not the case for every file type. So be careful what extension you define in the file name to avoid corrupted files.
This processor will either generate a data table, a file in the configured file system, or store the data directly to a database. Depending on the configuration, this data table will be created, or an existing one will be modified, i.e. replaced or appended.
In this example, several processors were used to apply transformations to an input dataset:
- Ordering Processor: used to sort the input dataset.
- Columization Processor: used to generate new columns.
- Bucketing Processor: used to generate buckets.
- Column Selection Processor: used to select some columns from the input.