TABLE OF CONTENTS

Introduction

As described in the Connections article and the Database Connection Load Processor, it's possible to improve performance for one Hive Connection. This is done through the Hive support in Spark SQL. It needs to be enabled in the instance configuration however.

Changes for the hive-jdbc driver

  • The jar files related to the hive-jdbc driver needs to be removed from the classpath (both onedata and spark executors)
  • As a replacement the Standalone Hive JDBC driver jar needs to be placed on the local file system where ONE DATA can access it
    • Mounted somewhere into the docker container but not inside the tomcat lib folder, e.g. to /tmp/hive-jdbc-2.3.8-standalone.jar 
    • Download: https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/2.3.8/hive-jdbc-2.3.8-standalone.jar (version numbers can be replaced with the desired Hive version)
    • Set the config property hive.jdbc.standalone.jar to /tmp/hive-jdbc-2.3.8-standalone.jar (see also README.md)
    • This needs to be done for the onedata container and all onedata-computation containers

Enabling Hive support for Spark

To enable the solution following changes must be done when running a docker container with ONE DATA's main and computation server(s):

  • use proper set of JARs to support Spark in YARN cluster,
  • add libraries for adding Hive support to Spark to the driver (i.e. the docker container),
  • create hive-site.xml with connection details to Hive metastore DB,
  • create and configure directory containing the JARs necessary for connection to Hive metastore,
  • configure Spark to connect to the Hive metastore.

Directories and files mentioned in the guide are available and set on edge node of ONE LOGIC's cluster and have to be adjusted to the environment.


Spark in YARN cluster

Currently, the Spark JARs used in YARN cluster (for mapr-dev installation) are at maprfs:///apps/onedata-dev/spark/2.4.5/jars directory, but they contain wrong Hive specific JARs (version 2.3.7 instead of the native ones shipped with Spark 2.4.5). 

To use the proper ones, there are 2 options:

  • Download the Spark 2.4.5 distribution and copy the JARs from there to a directory in MapR-FS or
  • use the list of JARs currently used for MapR-DEV testing at maprfs:///apps/onedata-dev/spark/2.4.5-hive/jars.


Spark Hive libraries on driver

To enable support for Hive in Spark in driver (i.e. docker container containing ONE DATA computation server) following libraries must be added to Tomcat's lib directory:

  • hive-exec-1.2.1.spark2.jar
  • hive-metastore-1.2.1.spark2.jar
  • spark-hive_2.11-2.4.5.jar

These files can be copied from the JARs added to MapR-FS in the previous step or copy them from the edge node from the directory /docker/mapr-dev/lib-spark-hive:


docker-compose.yaml

volumes:
- /docker/mapr-dev/lib-spark-hive/hive-exec-1.2.1.spark2.jar:/opt/tomcat/lib/hive-exec-1.2.1.spark2.jar
- /docker/mapr-dev/lib-spark-hive/hive-metastore-1.2.1.spark2.jar:/opt/tomcat/lib/hive-metastore-1.2.1.spark2.jar
- /docker/mapr-dev/lib-spark-hive/spark-hive_2.11-2.4.5.jar:/opt/tomcat/lib/spark-hive_2.11-2.4.5.jar
- /docker/mapr-dev/lib-spark-hive/libthrift-0.13.0.jar:/opt/tomcat/lib/libthrift-0.13.0.jar

if onedata does not start try /opt/webapps/onedata/WEB-INF/lib (/opt/webapps/onedata.computation/WEB-INF/lib for the onedata-computation containers) instead of /opt/tomcat/lib

→ libthrift is needed since 48.0.0, can be downloaded from https://repo1.maven.org/maven2/org/apache/thrift/libthrift/0.13.0/libthrift-0.13.0.jar


Hive metastore JARs

To connect to Hive metastore run in MapR cluster a proper set of JARs need to be passed to Spark via spark.sql.hive.metastore.jars property:

  1. Create a directory on local FS (e.g. /docker/mapr-dev/lib-hive-mapr) which will be mounted to the docker container.
  2. Copy JARs from any of the nodes where Hive is installed (i.e. /opt/mapr/hive/hive-2.3/lib) to this directory.
  3. Mount the directory to the container (e.g. to /tmp/hive-metastore-libs).


docker-compose.yaml

volumes:
- /docker/mapr-dev/lib-hive-mapr:/tmp/hive-metastore-libs

Create hive-site.xml file and add them to driver container

Hive support in Spark requires proper connection details to be specified, this can be done by creating a hive-site.xml file and adding them to Tomcat's lib directory.

  • Create a hive-site.xml file on local FS (e.g. /docker/mapr-dev/spark-sql-conf/hive-site.xml) and fill in following:


hive-site.xml

<?xml version="1.0"?>
<configuration>
    <!-- Database -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://stg-hel-mapr01-node01.onedata.de:3306/hive?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
<!--    <property>-->
<!--        <name>hadoop.security.credential.provider.path</name>-->
<!--        <value>jceks://maprfs/user/mapr/hivemetastore.jceks</value>-->
<!--        <description>password to use against metastore database</description>-->
<!--    </property>-->
</configuration>

Fill in the proper settings for target Hive metastore DB. You can either use javax.jdo.option.ConnectionPassword to specify password or hadoop.security.credential.provider.path to use a keystore file with the password (needs to be accessible by driver and executors; haven't tested it yet on ONE LOGIC's MapR cluster).

  • Mount the created XML file to Tomcat's lib as hive-site.xml.
  • Based on the type of DB used for Hive metastore mount the proper driver JAR to Tomcat's lib (should be present in /opt/mapr/hive/hive-2.3/lib directory mentioned in previous step).


docker-compose.yaml

volumes:
- /docker/mapr-dev/spark-sql-conf/hive-site.xml:/opt/tomcat/lib/hive-site.xml
- /docker/mapr-dev/lib/mysql-connector-java-8.0.19.jar:/opt/mapr-hive-libs/mysql-connector-java-8.0.19.jar

Spark config properties

Following Spark properties should be added to the docker container as environment variables:


docker-compose.yaml

environment:
- "spark.sql.catalogImplementation=hive"
- "hive.metastore.schema.verification=false"
- "spark.sql.hive.metastore.version=2.3.3"
- "spark.sql.hive.metastore.sharedPrefixes=com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs,com.mapr.security.JNISecurity,com.sun.crypto.provider.SealedObjectForKeyProtector"
- "spark.sql.hive.metastore.jars=/tmp/hive-metastore-libs/*"

Selecting Hive Connections which are also accessible via spark

To select Hive Connections which are backed by the same metastore as the metastore configured to be used by Spark the following config properties need to be set: connection.hive.load.spark.hostname and connection.hive.load.spark.port. Spark only supports the usage of one Hive metastore, but those settings can be set differently for each container (onedata and onedata-computation) because they use different Spark contexts.


Example:

Given are a Hive metastore running on metastore-host and a hive-server running on hive-server-host port 10000. Both that hive-server and Spark are configured to use the metastore on metastore-host. We then set connection.hive.load.spark.hostname=hive-server-host and connection.hive.load.spark.port=10000 so all Hive Connections in ONE DATA having equal settings for hostname and port will use the feature, all others will use Hive via JDBC.