This Processor is similar to the (deprecated) Flexible ETL Load Processor, with the difference that it does not require a Data Table to be created from a Connection beforehand; the Connection itself is used directly.
TABLE OF CONTENTS
- Handling Oracle Binary Columns
- Hive Connection Performance Improvement
Handling Oracle Binary Columns
ORACLE supports several binary column types, e.g. RAW, LONG_RAW, and BLOB.
Their content can be properly read via the Data Table Load Processor as seen in the following example.
However, when reading such columns via the Database Connection Load Processor, the configured select query is executed directly on the target database and the result shows exactly what the database returns. For binary columns, ONE DATA therefore displays their STRING value, which may be the base64-encoded representation. The following figure shows such a result returned by the ORACLE database.
To display the decoded binary content, the following options are available:
- Set the "Auto Cast to String" toggle to TRUE (while keeping the original select query). In this case, all returned columns are cast to STRING, which also includes proper decoding of binary values.
- Decode and cast the binary column explicitly in the SQL statement, e.g.
SELECT UTL_RAW.cast_to_varchar2(RAW_COL), ... FROM TABLE_NAME
The result is seen in the following figure (casting of two columns).
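To illustrate the encoding involved, here is a small Python sketch (the byte value is made up, not actual ONE DATA output) showing how a binary value and its base64 text form relate, i.e. roughly what you see before and after decoding:

```python
import base64

raw_bytes = b"Hello, ONE DATA!"  # hypothetical content of a binary column

# Without decoding, a binary column may be shown as base64 text:
encoded = base64.b64encode(raw_bytes).decode("ascii")
print(encoded)

# Decoding recovers the readable value:
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # Hello, ONE DATA!
```

The options above perform the equivalent of this decoding step on the database or platform side, so no manual post-processing is needed.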
Hive Connection Performance Improvement
Connections to Hive metastores have been adapted to improve their performance. To speed up reading data from Hive tables, we use the Hive support in Spark SQL, which reads the Hive table's location from the Hive metastore and then reads the data in parallel directly from MapR-FS.
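The performance gain comes from bypassing a single query connection and reading the table's underlying files concurrently. The following is a conceptual Python sketch (not ONE DATA or Spark code; file names and row counts are made up) of that idea: look up a table's storage location, then let several workers read its part files in parallel:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def count_rows(path):
    # Each worker counts rows of one file independently -- the analogue
    # of Spark executors each reading one part file of the Hive table.
    with open(path) as f:
        return sum(1 for _ in f)

def parallel_row_count(table_location):
    # The "metastore lookup" here is simply listing the directory that
    # holds the table's part files.
    files = [os.path.join(table_location, n) for n in os.listdir(table_location)]
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(count_rows, files))

# Demo with a throwaway "table location" of three plain-text part files.
table_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(table_dir, f"part-{i}.txt"), "w") as f:
        f.write("row\n" * 100)
print(parallel_row_count(table_dir))  # 300
```

In the real setup, Spark SQL does the equivalent at scale: the metastore supplies only the location and schema, while the data itself never passes through a single JDBC connection.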
To show the effect of this change, the following use cases were tested, with these results:
|Use Case|New Time|Old Time|
|---|---|---|
|Simple row count on 2 billion rows|less than a minute|19 minutes|
|Replicate big Hive table (2 billion rows) by saving it as Parquet in ONE DATA|20 minutes|2-3 hours|
Queries with a "String contains" match were also faster than before.
Before Spark SQL can be used in the Database Connection Load Processor with Hive Connections, it needs to be enabled via the instance configuration. More information on this can be found here.
Be aware that this improvement can only be enabled for one Hive metastore Connection (a Spark limitation). Any additional Hive Connections will continue to be accessed as-is via JDBC.