TABLE OF CONTENTS
- Advanced Configuration
- Expert: Data Table Partitioning
- Expert: Truncate Table
- Related Articles
This Processor stores data into a new or existing Data Table.
This Processor is a Save Processor, so it is generally used at the end of the Workflow. The input is a dataset that should be stored as a Data Table in ONE DATA.
When opening the Processor, the following configuration interface pops up:
For the target storage type, there are three options available:
- Data Table (default): Uses a Data Table within OD as target storage.
- Connection: File System: Directly store your data to a filesystem without needing to define a respective Data Table in OD. A more detailed explanation can be found below.
- Connection: Column Family: Directly store your data to a Cassandra DB without needing to define a respective Data Table in OD. Similar to the file system, a Connection needs to be configured for it.
The user has two options to declare the Data Table to save:
- By name: using the input field
- By Variables: this is very handy for saving data following a specific naming pattern (UserName_Workflow ...), which can be done using System and/or Workflow variables:
The Processor also has some optional fields, which we will explain in the following sections.
The format in which the data will be stored. Both external and internal formats are supported:
Create a new Data Table or Replace/Append an existing one.
The different modes behave as follows:
- Replace: If no data has been persisted yet, this mode will create new data at the specified location. If data already existed before, this "old" data will be replaced by the new data.
- Append: If no data has been persisted yet, this mode will create new data at the specified location. If data already existed before, this mode will append the newly created data to the existing one.
- Create Once: If no data has been persisted yet, this mode will create new data at the specified location. If data already existed before, this mode will not replace the existing data, so nothing happens.
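The semantics of the three modes can be illustrated with a short sketch (plain Python; the function and mode names are purely illustrative, not part of the ONE DATA API):

```python
def save(existing_rows, new_rows, mode):
    """Illustrates the semantics of the three save procedures."""
    if existing_rows is None:
        # No data persisted yet: all three modes create the data.
        return list(new_rows)
    if mode == "REPLACE":
        return list(new_rows)                   # old data is replaced
    if mode == "APPEND":
        return existing_rows + list(new_rows)   # new data is appended
    if mode == "CREATE_ONCE":
        return existing_rows                    # existing data is kept, nothing happens
    raise ValueError(f"unknown mode: {mode}")
```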
Please note that you should always specify a name or a Data Table to save your data!
Schema Mismatch Behaviour
When choosing to replace/append an existing Data Table and the input data does not match the schema of the target data, the user has to choose an option that determines how to resolve this mismatch:
Save to a Filesystem Connection
When you want to save your data to a Filesystem Connection, some additional configuration is needed. The following menu will appear as soon as you select the corresponding target storage type.
With this option it is possible to define how the saved result should be compressed. There are three possibilities:
- PLAIN: With this option, ONE DATA does not compress the result and saves the content in plain text under the file name specified in the "File Name" section.
- ZIP: With this option the file containing the result content will be compressed by adding it to a ZIP archive.
- GZIP: Similar to the ZIP configuration, the file will be compressed using a GZIP archive.
File Name of the Packaging File
Define the name of the surrounding packaging file.
If needed it is also possible here to define a relative path (relative to the path inside the connection), e.g. subfolder/filename.csv. If the specified path does not exist yet, it will be created automatically.
Define the name of the file in which the data will be saved.
If needed, it is also possible here to define a relative path (relative to the path inside the Connection or packaging file), e.g. subfolder/filename.csv. Note that GZIP archives do not support paths inside them.
Please note the following:
- The file extension (e.g. ".txt" or ".csv") won't be added automatically, so if you need it, it has to be defined here.
- The file name is still important if you have chosen ZIP or GZIP as packaging type, because it determines the name of the file contained inside the ZIP or GZIP archive.
Data Type of File
With these configuration options it is possible to define the data type of the result file and how it should be parsed.
- Select the data type. At the moment only CSV is available.
- The delimiter token is used to indicate the boundary between two separate data entities. For example, if one row in a CSV looks like "Bier", "Butter", "Brezn" the comma indicates the separation into the three different words.
- The escape token is used to indicate that a character is not a control character but a literal one. For example, if a cell contains a string with a comma, the escape token is placed before the comma so that a CSV parser does not interpret it as a column separator. Given a row containing the string "not\, separated" (with the comma selected as delimiter), you receive the desired single data entry "not, separated" instead of the string being split into "not" and "separated".
- The string escape token is used to indicate where a string starts and ends.
- Select the file encoding. Currently, only UTF-8 is supported.
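The interplay of delimiter and escape token can be reproduced with Python's standard csv module (a sketch only; ONE DATA's parser is not necessarily identical):

```python
import csv
import io

# Write two rows using "," as delimiter and "\" as escape token.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", escapechar="\\", quoting=csv.QUOTE_NONE)
writer.writerow(["Bier", "Butter", "Brezn"])
writer.writerow(["not, separated"])  # this cell contains the delimiter

# The second row is written as: not\, separated

# Reading with the same settings restores the single cell "not, separated".
rows = list(csv.reader(io.StringIO(buf.getvalue()),
                       delimiter=",", escapechar="\\", quoting=csv.QUOTE_NONE))
```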
Note that even though the data type of the result file is CSV, and it is parsed as such, the file can for example be saved with a ".txt" extension if that extension is defined in the "File Name" section. For this example it works without errors, but that is not the case for every file type, so be careful what extension you define in the file name to avoid corrupted files.
This Processor will either generate a Data Table, create a file in the configured file system, or store the data directly in a database. Depending on the configuration, the Data Table will be created or an existing one will be modified, i.e. replaced or appended to.
In this example, some Processors were used to apply some transformations on an input dataset:
- Ordering Processor: used to sort the input dataset.
- Columization Processor: used to generate new columns.
- Bucketing Processor: used to generate buckets.
- Column Selection Processor: used to select some columns from the input.
Expert: Data Table Partitioning
The Data Table Save Processor has an Expert Mode, in which you can adjust the Processor configuration through a JSON object. With this, a list of columns to use for partitioning can be added. Partitioning divides data into more manageable segments, while knowing what data exists in each segment.
This is useful for huge datasets (e.g. 8 billion rows), which are often and regularly used or joined. Due to partitioning, it is not necessary to scan all rows for the necessary information, as it is known which partition contains which data.
For example, a table could be partitioned by a date column. Regardless of whether the analysis needs data from the last 3 months or years, only the data from the partitions containing these time frames are scanned.
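Conceptually (the exact layout depends on the storage backend and is shown here only as an illustration), a table partitioned by a date column is stored in one segment per value, so a query for a given time frame only touches the matching segments:

```
sales/
  date=2023-01/   ← scanned for a January query
  date=2023-02/   ← scanned for a February query
  date=2023-03/   ← skipped entirely if not requested
```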
How to Set Up Partitions
- Create and open a Data Table Save Processor
- Select a Data Table by name, or name your new Data Table
- Adjustments to the normal configuration (e.g. Save Procedure) need to be done now. More on this in the section "Normal Configuration with Partitions" below.
- Press Alt+Shift+E to enter Expert Mode
- Modify the JSON by adding "partitions": ["columnName1", "columnName2", ...] under "storageSpecification".
- Save the changes and run the Workflow
- Open the Processor again. You will see your dataset. The Processor configuration will still be in Expert Mode.
- To go back to the normal mode, press the arrow. A warning will pop up about the configuration being reset. This is why it is important to make adjustments to the normal configuration before opening Expert Mode.
- Select the Data Table by name. Additional to the information about the amount of rows and columns, partitioning information will now be visible.
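The JSON modification from the steps above could look like this (schematic; only the "partitions" entry under "storageSpecification" is the documented addition, the column names are placeholders, and any surrounding keys of the actual configuration are omitted):

```json
{
  "storageSpecification": {
    "partitions": ["columnName1", "columnName2"]
  }
}
```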
Normal Configuration with Partitions
Changes to the normal configuration need to be done before entering Expert Mode, as leaving Expert Mode resets the entire configuration.
Depending on the options Save Procedure and Schema Mismatch Behaviour, partition configuration is handled in different ways:
|Save Procedure + Schema Mismatch Behaviour|Partition handling|
|---|---|
|Append + Force exact match|Uses the existing partitioning and ignores the Processor configuration. Errors if the existing partitioning and the Processor configuration don't match (see below this table).|
|Append + Take schema from stored data|Keeps the existing partitioning of the Data Table and ignores the Processor configuration.|
|Replace + Force exact match|Uses the existing partitioning and ignores the Processor configuration. Errors if the existing partitioning and the Processor configuration don't match (see below this table).|
|Replace + Take schema from stored data|Keeps the existing partitioning of the Data Table and ignores the Processor configuration.|
|Replace + Take schema from new data|Takes what's in the Processor configuration. Without Expert Mode, no partitioning will be done.|
Expert: Truncate Table
Using the Expert Mode, the option truncateTable can be added to the configuration JSON. truncateTable is a boolean, turned off by default. When turned on, it allows ONE DATA to truncate the target table and insert the data into it directly. Normally, a temporary duplicate of the table is created instead.
This has the advantage that tables are preserved exactly as they are: properties not supported by ONE DATA (e.g. PostgreSQL partitioning, Oracle access settings) will not be removed. It also does not require users to have CREATE and DROP permissions. However, it is not concurrency-safe.
How to Use truncateTable
Expert Mode is entered by pressing Alt+Shift+E. Please follow the step-by-step guide for partitioning in Expert Mode to see when to enter and leave the Expert Mode without accidentally resetting your configuration.
The option needs to be added inside "saveSpecification".
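For example (schematic; only the "truncateTable" key inside "saveSpecification" is the documented option, and any surrounding keys of the actual configuration are omitted):

```json
{
  "saveSpecification": {
    "truncateTable": true
  }
}
```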