This Processor is similar to the deprecated Flexible ETL Load Processor, with the difference that it does not require a Data Table to be created from a Connection beforehand. The Connection itself is used directly.



Overview

The Database Connection Load Processor operates on a special type of dataset, namely a pre-defined ETL (Extract, Transform, Load) source. The ETL source can be re-used within this Processor, so sensitive information (e.g., user name and password) does not have to be entered repeatedly. The Processor loads the requested data for use in the Workflow.

Input

The Processor does not have an input port.


Configuration

  • Connection: Instructions on how to create a Connection can be found here.
  • Table Name or custom SQL: Be aware that the remote table name is not necessarily identical to the ONE DATA Data Table name.
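As a sketch of the "custom SQL" option, a statement could look like the following; the "PRODUCTS" table and its column names are illustrative assumptions, not part of the Processor itself:

```sql
-- Illustrative custom SQL used instead of a plain table name.
-- PRODUCTS and its columns are assumptions for this example.
SELECT PRODUCT_ID,
       PRODUCT_NAME,
       PRICE
FROM PRODUCTS
WHERE PRICE > 100
```

Note that the statement is executed on the remote database, so it must use the remote table and column names, not the ONE DATA Data Table names.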

Output

The loaded Data Table can be used as input in further Processors.


Example

Workflow


Example Configuration

"oracle19" is the name of an ORACLE 19 database Connection inside ONE DATA. Within this Connection, we have access to a "PRODUCTS" table.

The configuration of additional parameters is optional. However, in the following example we can see how different parameters interact with each other:

  • fetchSize: the number of rows retrieved per fetch from the database
  • partitioning for parallel reads, configured via numberPartitions, lowerBound, upperBound, and partitionColumn
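To illustrate how the partitioning parameters interact, the following sketch shows the slices a Spark-style parallel read would issue for illustrative values numberPartitions = 4, lowerBound = 0, upperBound = 1000, and partitionColumn = PRODUCT_ID (the table and column names are assumptions):

```sql
-- Sketch: each partition reads one slice of the partition column's range.
-- Stride = (upperBound - lowerBound) / numberPartitions = 250.
SELECT * FROM PRODUCTS WHERE PRODUCT_ID < 250                         -- partition 1
SELECT * FROM PRODUCTS WHERE PRODUCT_ID >= 250 AND PRODUCT_ID < 500   -- partition 2
SELECT * FROM PRODUCTS WHERE PRODUCT_ID >= 500 AND PRODUCT_ID < 750   -- partition 3
SELECT * FROM PRODUCTS WHERE PRODUCT_ID >= 750                        -- partition 4
```

Note that lowerBound and upperBound only control how the range is split across partitions; rows outside the bounds still end up in the first or last partition.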


Result


Troubleshooting

It might happen that the database returns column types that are not supported by ONE DATA, in which case the following error message occurs: "Please cast the problematic column manually within the provided select query."

In this case it is important to determine the problematic column and to cast it within the SELECT statement to a type that ONE DATA understands, e.g. STRING.
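A minimal sketch of such a manual cast, assuming a hypothetical PRODUCTS table whose PROBLEM_COL has an unsupported type:

```sql
-- Sketch: cast the unsupported column within the select query.
-- PRODUCTS and PROBLEM_COL are illustrative names.
SELECT PRODUCT_ID,
       TO_CHAR(PROBLEM_COL) AS PROBLEM_COL
FROM PRODUCTS
```

Depending on the source type, an explicit CAST(PROBLEM_COL AS VARCHAR2(4000)) may be needed instead of TO_CHAR.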

Some problematic ORACLE column types can be found here under "Additional Information for ORACLE".


Handling Oracle Binary Columns

ORACLE supports several binary column types, e.g. RAW, LONG_RAW, and BLOB.

Their content can be properly read via the Data Table Load Processor as seen in the following example.


However, when reading such columns via the Database Connection Load Processor, the configured select query is executed directly on the target database and the result shows exactly what the database returns. For binary columns, ONE DATA shows their STRING value, which might be the base64-encoded representation. The following figure shows the previous result returned by the ORACLE database.


In order to display the decoded binary content, the following options are available:

  • Set the "Auto Cast to String" toggle to TRUE (while keeping the original select query). In this case, all returned columns are properly cast to STRING, which also includes proper decoding.


  • Decoding and casting of the binary column via an SQL statement, e.g.
    SELECT UTL_RAW.cast_to_varchar2(RAW_COL), ... FROM TABLE_NAME

    The result is shown in the following figure (casting of two columns).


Hive Connection Performance Improvement

Connections to Hive metastores have been adapted to improve their performance. To speed up reading data from Hive tables, we use the Hive support in Spark SQL, which reads the Hive table's location from Hive's metastore and then uses it to read the data in parallel directly from MapR-FS.


To show the effect this change has, these use cases were tested with the following results:

  • Simple row count on 2 billion rows: less than a minute (previously 19 minutes)
  • Replicate a big Hive table (2 billion rows) by saving it as parquet in ONE DATA: 20 minutes (previously 2-3 hours)

Queries with "String contains match" were also faster than before.

Prerequisites

Before Spark SQL can be used in the Database Connection Load Processor with Hive Connections, it needs to be enabled first. This is done via the instance configuration. More information on this can be found here.

Restriction

Be aware that this improvement can only be enabled for one Hive metastore Connection (a Spark limitation). If more than one is configured, the others will be connected as-is via JDBC.