Article Content
Introduction
ONE DATA provides an efficient way to connect to external data sources (relational databases, APIs, file systems), with or without the use of Credentials.
Connections are mainly used to make resource handling in ONE DATA easier: if multiple Processors use the same resource (host name and port for a database Connection, URL for an API Connection, etc.), updating the Connection configuration once updates it for all Processors that use that Connection.
Overview
In the CONNECTIONS overview, all Connections a user has created or has access to within a project are listed, by default in descending order by creation time. Within the CONNECTIONS overview, the user can search for existing Connections by typing the name of a Connection or a keyword into the "Search Connections" field. To select several Connections from the list, move the mouse to the left of a Connection name; small checkboxes then appear next to the Connection names and can be ticked. The selected Connections can then be deselected, opened, deleted, or shared with other projects by choosing the respective option in the menu appearing on the left.
Create a Connection
- Press "Add New" on the left menu bar to create a new Connection.
- Clicking it opens the following dialog, in which the user can enter the name of the Connection. Additionally, a description, keywords, and notes can be added:
- Then the user is asked to select a Connection type:
Relational Database Connection
ONE DATA currently provides several Connection types to relational databases. Clicking one of the options below creates a new Connection, and the user can enter the details and the Credentials (either using existing ones or adding new ones):
MySQL
PostgreSQL
DB2
MSSQL
ORACLE
MSSQL Kerberos
Impala
Hive
In the dialog, the Connection information for the database has to be configured. The fields marked with "*" are mandatory:
- Connection Name: The name of the Connection within ONE DATA.
- Host: The host address of the running database to which the Connection should be established.
- Port: The respective port of the database server.
- Database Name: The name of the database.
- Credentials: The Credentials used for Connection authentication. This information is mandatory except for MSSQL Kerberos.
Also, the default Connection Schema can be set, but this configuration is optional.
Once all mandatory information is configured, the user can save and test the Connection via the "Test Connection" button.
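To make the meaning of these fields more tangible, here is a minimal sketch of how the same information would be used to open a PostgreSQL connection outside of ONE DATA with the psycopg2 client. Host, port, database name, credentials, and schema are hypothetical placeholders; this is not how ONE DATA connects internally.

```python
import psycopg2  # hypothetical stand-alone example; ONE DATA manages the actual Connection itself

conn = psycopg2.connect(
    host="db.example.com",               # Host
    port=5432,                           # Port
    dbname="analytics",                  # Database Name
    user="od_reader",                    # Credentials (user)
    password="secret",                   # Credentials (password)
    options="-c search_path=reporting",  # optional, comparable to the default Connection Schema
)
print(conn.server_version)
conn.close()
```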
Additional Information for Hive
- Currently, the Hive Connection can only be used for read operations on your database. Support for writing via the Hive Connection will follow soon.
- Depending on the configuration of your Hive database, make sure to use the correct port. The default port for a Hive database is 10000.
- There is currently no auto-discovery in place to detect your available database names or your schema, so make sure you know them beforehand.
- You always have to provide Credentials, even if you configured your Hive database to be accessible without them (see the sketch after this list).
- If "schema" is left empty, the default schema will be applied.
Hive Connection Performance Improvement
Connections to Hive metastores have been adapted to improve their performance. To speed up reading data from Hive tables, ONE DATA uses the Hive support in Spark SQL, which reads the Hive table's location from Hive's metastore and then uses it to read the data in parallel directly from MapR-FS.
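The following PySpark sketch roughly illustrates this mechanism (schema and table names are hypothetical; ONE DATA performs this step internally, so you do not need to write such code yourself):

```python
from pyspark.sql import SparkSession

# Enabling Hive support lets Spark SQL resolve table locations via the Hive metastore.
spark = (
    SparkSession.builder
    .appName("hive-read-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark looks up the table's storage location in the metastore and then reads the
# underlying files in parallel directly from the distributed file system (e.g. MapR-FS).
df = spark.sql("SELECT COUNT(*) AS row_count FROM my_schema.big_table")
df.show()
```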
To show the effect this change has, these use cases were tested with the following results:
| Use Case | New Time | Old Time |
|---|---|---|
| Simple row count on 2 billion rows | Less than a minute | 19 minutes |
| Replicating a big Hive table (2 billion rows) by saving it as Parquet in ONE DATA | 20 minutes | 2-3 hours |

Queries with "string contains" matches were also faster than before.
Prerequisites
To be able to use Spark SQL in the Database Connection Load Processor with Hive Connections, it needs to be enabled first. This is done via the instance configuration. More information on this can be found here.
Restriction
Be aware that this improvement can only be enabled for one Hive metastore Connection (a Spark limitation). If there is more than one, the other Connections are accessed as before via JDBC.
Additional Information for ORACLE
Oracle has a large variety of column types, and it is possible to create Data Tables inside ONE DATA for ORACLE tables independently of the column types. If an ORACLE column type is not supported, ONE DATA maps it to STRING. However, certain types cannot be loaded directly (TIMESTAMP WITH LOCAL TIME ZONE, INTERVAL YEAR, INTERVAL DAY), e.g. within a Workflow. In this case, load the data via the Database Connection Load Processor and cast the problematic column directly in the SQL statement to something supported, e.g. VARCHAR.
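As a sketch of such a cast (table and column names are hypothetical, and the python-oracledb client is only used here to run the query outside of ONE DATA; the same SELECT statement could be entered in the Database Connection Load Processor):

```python
import oracledb  # hypothetical stand-alone example

# Convert problematic column types to plain text directly in the SQL statement:
# "created_at" is assumed to be TIMESTAMP WITH LOCAL TIME ZONE, "duration" an INTERVAL column.
query = """
    SELECT id,
           TO_CHAR(created_at, 'YYYY-MM-DD HH24:MI:SS') AS created_at_text,
           TO_CHAR(duration) AS duration_text
    FROM events
"""

with oracledb.connect(user="scott", password="tiger", dsn="dbhost:1521/ORCLPDB1") as conn:
    with conn.cursor() as cur:
        for row in cur.execute(query):
            print(row)
```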
Filter Accessible Tables/Views by Schema
For ORACLE Connections, a dropdown to filter tables by schema is available in the Connection view. The dropdown contains three different options:
- Show table/views of default schema of Connection: Shows only tables/views owned by the current schema
- Show all tables/views: Show tables/views from all schemas
- Show all tables/views except blacklisted schemas: Show tables from all schemas the user has access to, except blacklisted ones. (If the schema blacklist is empty, this option is disabled.)
Troubleshooting for ORACLE 10 Connections in ONE DATA:
If you experience problems connecting an ORACLE 10 database to ONE DATA using standard Credentials, try entering the Credentials in capital letters, even if they are not persisted in capital letters in the respective database.
For example, while "dbname: ORACLE10DB, schema: oracle_schema, SID: ORACLEID" does not work, "dbname: ORACLE10DB, schema: ORACLESCHEMA, SID: ORACLEID" might do the trick, even if the schema is persisted as "oracleschema".
Public Cloud Provider Connection
API Connection
By selecting the REST option, the user can create a new (REST) API Connection. The user should enter the Connection details and the Credentials (either using existing ones or adding new ones).
Multiple headers with different HTTP methods (GET, POST, DELETE) can be set. Also, the base URL can be extended by specifying a different path for each header.
Once all mandatory information (Connection name and base URL) is filled in, the user can save and test the Connection via the "Test Connection" button.
Credentials are optional for this Connection type.
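The sketch below shows, with the Python requests library and hypothetical URLs and headers, what such a configuration corresponds to: a base URL, a path extension and header fields per entry, an HTTP method, and optional credentials. ONE DATA issues the actual requests itself when the Connection is used.

```python
import requests  # hypothetical stand-alone example

BASE_URL = "https://api.example.com/v1"  # base URL of the Connection

# One entry per configured header: HTTP method, path extension, and header fields.
endpoints = {
    "GET":    {"path": "/items",    "headers": {"Accept": "application/json"}},
    "POST":   {"path": "/items",    "headers": {"Content-Type": "application/json"}},
    "DELETE": {"path": "/items/42", "headers": {"Accept": "application/json"}},
}

auth = ("api_user", "api_password")  # Credentials are optional for this Connection type

response = requests.request(
    "GET",
    BASE_URL + endpoints["GET"]["path"],
    headers=endpoints["GET"]["headers"],
    auth=auth,
    timeout=10,
)
print(response.status_code)
```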
Filesystem Connection
By selecting SERVER, the user can create a new FILESYSTEM Connection. With it, files of the following types can be read directly from the file system: .csv, .txt, .zip, and .gz. If one of the latter two is read, every single file contained in it can be accessed later. It is then possible to create a data table out of every single file, or to create a combined data table by merging the associated files. (This is the standard way of reading files from the file system in ONE DATA and should mainly be used. However, it is also possible to read CSV files via Python, which gives more flexibility, e.g. if headers are more complicated, but comes with some drawbacks, so please read the article carefully first: Reading CSV Files with the Python Script Data Generator Processor)
> A prerequisite for using the FILESYSTEM Connection is that the chosen directory/file path can be accessed by the file system user of the ONE DATA instance (in most ONE DATA installations, this is the "docker" user).
In general, it is reasonable to use a directory/file path in the home directory of that user (e.g. "/home/docker/FS_directory" for the user named "docker" and an arbitrary subdirectory "FS_directory"), as the necessary access is already granted there. Otherwise, access needs to be granted to the user.
Important note: If other file system users or the root user put files that should be used via the FILESYSTEM Connection into the directory, the file system user for ONE DATA also needs read and write (R/W) rights on these files!
> Please note that the actual file content and its structure are of highest importance. In order to later transform your read data into a data table via ONE DATA (which can then be used in your Workflows), the content of the file needs to have a "CSV-like" structure, meaning that common CSV rules apply: the first row specifies your table headers, the following rows contain the data, and every entry needs to be separated by a respective delimiter (which can be chosen later on in the process).
The necessary configuration for a FILESYSTEM Connection consists of its name and the directory that the Connection should read from:
Once all mandatory information (Connection directory and name) is filled in, the user can save and test the Connection via the "Test Connection" button.
No Credentials are needed for this Connection type.
When a FILESYSTEM Connection is established, its overview is opened, which already shows all files that are contained in the directory and accessible via ONE DATA:
The shown files can now be transformed into data tables in order to make them accessible via ONE DATA. As mentioned above, this can be done in one of two ways:
- Transform a single file into a data table (orange buttons in the figure above), either via a prompt (left button) or directly for the respective file (button on the right, available for every listed entry).
- Transform multiple files into a combined data table by choosing the merge option (blue button in the figure above).
Creating a data table out of a single file opens the following prompt, asking you to define an identifying name for the data table to be created, with the option to add a description and keywords:
When you continue, you are asked to specify the CSV structure of your read file in order to properly transform it into a data table:
The following properties can be specified:
- Delimiter Token: This token defines the delimiter that is contained between the columns of your data. Prominent examples are commas (e.g. "a,b"), semicolons (e.g. "a;b"), or pipes (e.g. "a|b").
- String Escape Token: This token is important whenever your data contains the character that you use as the delimiter, but you want to indicate that the string still belongs together. As an example, imagine you have the entry "I want to create a data table out of my files, too!". If the Delimiter Token is set to a comma, without a String Escape Token you would receive two entries in the data table: "I want to create a data table out of my files" and "too!". With the String Escape Token set to "&" (to make it more explicit), the entry " &I want to create a data table out of my files, too!& " in your data results in the single entry "I want to create a data table out of my files, too!" in the data table.
- Escape Token: This token is used whenever you use characters that need escaping in certain encodings, a prominent example being the quotation mark ("). As an example, with the Escape Token set to "\" and the entry " \"escapes\" " in your read data, the data table will contain the entry " "escapes" ". (Both tokens are illustrated in the sketch after this list.)
- Encoding: The encoding that will be used for the created data table (you can read about what encodings are here).
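To see the tokens in action, the following sketch parses the two example entries from above with Python's csv module, which uses the analogous concepts of delimiter, quote character, and escape character. This only illustrates the semantics of the tokens; it is not how ONE DATA parses the file internally.

```python
import csv
import io

# Example content using a comma as Delimiter Token, "&" as String Escape Token,
# and "\" as Escape Token.
raw = (
    "id,comment\n"
    "1,&I want to create a data table out of my files, too!&\n"
    '2,\\"escapes\\"\n'
)

reader = csv.reader(
    io.StringIO(raw),
    delimiter=",",    # Delimiter Token
    quotechar="&",    # String Escape Token
    escapechar="\\",  # Escape Token
)
for row in reader:
    print(row)
# ['id', 'comment']
# ['1', 'I want to create a data table out of my files, too!']
# ['2', '"escapes"']
```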
Creating a merged file opens the following prompt:
Here you can define merge patterns in order to merge several files conveniently. The patterns have to be defined in JSON format, and you can use regular expressions to create more sophisticated rules, so you do not have to write down every single file name for the merge. The "fileName" property of the merge rules is only used as the display name in the overview of read files, so anything can be entered here that helps identify what has been read.
By default, a merge rule will match files in the directory and all subdirectories (which are whitelisted automatically when the top folder is whitelisted). It is, however, possible to match only files of a specific (sub-)directory.
There are some constraints on merge rules:
> Only files of the same file type can be merged.
> The files to be merged need to have the same schema (number of columns, data types, ...)
> When using a merge rule on a .zip file, always append ".*" to the rule, otherwise it will not work (this has to do with the internal structure of the .zip)
As an example, imagine you have the following files and subdirectories stored in the directory that a created FILESYSTEM Connection reads from:
- "ABC_I_123.csv" and "ABC_II_123.csv"
- "FS_entry_1", "FS_entry_2", and "FS_entry_3"
- A .zip file named "xyz_rows_combined.zip"
- As well as the subfolders: "Folder_1", "Folder_2", and "Folder_3"
Now the following merge rules should be applied:
[
{
"mergeRule":"ABC_.*_123.csv",
"fileName":"yourSymbolicName.csv"
},
{
"mergeRule":"FS_entry_.*",
"fileName":"file2"
},
{
"mergeRule":"FolderName\\\\Folder_.*",
"fileName":"filesInFolder"
},
{
"mergeRule": ".*rows_combined.zip.*",
"fileName": "combined_files.zip"
}
]
The following merges would be created out of the merge rules above:
- The first pattern merges the files "ABC_I_123.csv" and "ABC_II_123.csv" to "yourSymbolicName.csv".
- The second pattern merges the files "FS_entry_1", "FS_entry_2" and "FS_entry_3" to "file2".
- The third pattern merges the files from the specific folders "Folder_1", "Folder_2", and "Folder_3" to "filesInFolder".
- The fourth pattern merges all files contained in the .zip file "xyz_rows_combined.zip" to "combined_files.zip".
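The merge rules behave like regular expressions matched against the file paths. The following sketch mimics that matching for the example above with Python's re module (the exact matching semantics inside ONE DATA may differ slightly; this is only meant to help when designing your own patterns):

```python
import re

# File paths as they might appear in the Connection's directory (illustrative only).
files = [
    "ABC_I_123.csv", "ABC_II_123.csv",
    "FS_entry_1", "FS_entry_2", "FS_entry_3",
    "FolderName\\Folder_1\\part.csv",
    "xyz_rows_combined.zip\\contained_file.csv",
]

merge_rules = [
    {"mergeRule": "ABC_.*_123.csv",          "fileName": "yourSymbolicName.csv"},
    {"mergeRule": "FS_entry_.*",             "fileName": "file2"},
    {"mergeRule": "FolderName\\\\Folder_.*", "fileName": "filesInFolder"},
    {"mergeRule": ".*rows_combined.zip.*",   "fileName": "combined_files.zip"},
]

for rule in merge_rules:
    matched = [f for f in files if re.fullmatch(rule["mergeRule"], f)]
    print(rule["fileName"], "->", matched)
```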
Once merged, the merged files will appear in the overview of available files in the FILESYSTEM Connection. You can create a data table out of them just as with a single file.
Once you have created a data table out of your read file(s), you can always check whether your data has been converted correctly. The first thing you should check is whether all intended columns have been created. Many errors can be detected by seeing only one column that combines all your data, which most likely means that your delimiters have not been set correctly or that your CSV file is not structured correctly.
Column Family Connection
At the moment it is only possible to configure a Connection to Cassandra DB. The configuration dialog looks as follows:
- Connection Name: The name of the Connection within ONE DATA
- Host: The host address of the running database to which the Connection should be established.
- Port: The respective port of the database server.
- Keyspace Name: Cassandra Keyspace name for data replication.
- Datacenter: The Datacenter name of the respective cluster.
- Credentials: The Credentials used for Connection authentication.
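As an illustration of what the individual fields correspond to, here is a minimal sketch using the DataStax Python driver with hypothetical values. It is not how ONE DATA establishes the Connection internally.

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import DCAwareRoundRobinPolicy

cluster = Cluster(
    contact_points=["cassandra.example.com"],                       # Host
    port=9042,                                                      # Port
    auth_provider=PlainTextAuthProvider("db_user", "secret"),       # Credentials
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="DC1"),  # Datacenter
)
session = cluster.connect("my_keyspace")                            # Keyspace Name
print(session.execute("SELECT release_version FROM system.local").one())
cluster.shutdown()
```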
Final Demo
Connection creation
The following demo shows how to create a Connection of type REST API.
Further use
The created Connections can be used in several ONE DATA Processors for API requests and other use cases. A demo is available in the Flexible Rest API Processor documentation.