Working with External Hive Metastore on Analytics Engine powered by Apache Spark

In traditional databases like RDBMS like MySQL and PostgreSQL, you must have used the metadata. In most of the databases, the metadata is a read-only set of tables that provide information about the database objects. Metadata is primarily used by administrators and the developers when they want to get some details about the database objects.

How metadata is relevant to spark ? Any kind of SQL engine requires metadata store. As we all know that from Apache Spark we can run SQL query on data stored on external storages like databases and Object storage. Here is a blog on how to connect with databases and Object storage. In typical Hadoop cluster where hive is already installed on that cluster, Spark and Hive metadata are stored in the hive metastore. But if you want to run only spark jobs, then setting up full stack Hadoop cluster is unnecessary.

Running Spark on cloud services will simplify the setup process. User can run spark jobs on IBM Watson Studio Spark environments or IBM Cloud Pak for Data which is IBM’s native cloud Enterprise solution. You can follow official documentation of Analytics Engine powered by Apache Spark to run spark jobs or notebooks on IBM Cloud Pak for Data.

Why user should externalize spark metadata ? Let’s take simple use case to understand this,

I am creating machine learning model and i want to run it in iterative manner.

I have spun Spark runtime on Watson Studio to load data which is in object store. I have build machine learning model, as part of that i have created few table definitions. The table definitions are stored in local metadata store of spark. Because default metadata store of the spark cluster is local metadata store.

Now I am spinning new spark runtime to train model. Since my data is in object store, I can easily access those data files from new clusters as well. If I try to execute the SELECT * FROM MYTABLE, I will get an error as the table does not exists. Every time you try to run the model this will happen.

What is happening here ? the metadata store of pervious cluster was local. The newly create cluster no longer has access to that metadata store.

Now to fix this issue what are the options user has ?
Either Execute all the Create Table scripts in this cluster which is very dumb thing to do.


He can point to external metadata store to store the information and every time he create spark runtimes/cluster he can point to external metadata store.

Following are the detailed instruction on how to externalize hive metastore to IBM Compose for MySQL for a spark job or runtimes.

Note : For this we need mysql instance, setting up Spark and MySQL on your local computer can be tedious and tricky.

You need to provision a Compose for MySQL service instance from IBM Cloud.

Next, you need to create service credential. Click on view credentials, It is in json format. The json key uri_cli_1 holds cli command to connect with mysql via cli client. It has hostname, username, password and port information, Please note it down information, it will be used when you run notebook.

Login to your IBM® Cloud Pak for Data cluster

Next, go to Watson Studio homepage, click on Create a project.

Next, hover around and click on Create an empty project.

Select “Environments” tab.

Scroll down, and choose environment definition Default Spark 2.4 & Scala 2.11 to run on Spark 2.4 or Default Spark Scala 2.11 to run on Spark 2.3

click on New Notebook

Once, Notebook is started, Copy paste following code in notebook.

import org.apache.spark.sql.{Row, SaveMode, SparkSession}case class Record(key: Int, value: String)// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.config("spark.hadoop.javax.jdo.option.ConnectionUserName", "REPLACE_WITH_MYSQL_USERNAME")
.config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://REPLACE_WITH_MYSQL_HOSTNAME:REPLACE_WITH_MYSQL_PORT/REPLACE_WITH_MYSQL_DBNAME?createDatabaseIfNotExist=true")
import spark.implicits._
import spark.sql
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")sql("LOAD DATA LOCAL INPATH '/opt/ibm/spark/examples/src/main/resources/kv1.txt' INTO TABLE src")// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()

Replace MySql hostname, username, password, and port value in notebook.

As part of notebook execution, it connects with IBM Cloud MySql instance. Once it is connected we are creating table with name src. We are using data which comes with spark already and it is available under examples directory. To verify whether data was loaded successfully we are running select query on same table.

How to validate the hive metastore configuration?

Use value of uri_cli_1 from MySQL service credential, to connect to the MYSQL, and use your database

$ mysql -u REPLACE_WITH_USERNAME -pREPLACE_WITH_PASSWORD --host REPLACE_WITH_HOSTNAME --port 19071 --ssl-mode=REQUIREDmysql> show databases;+--------------------+| Database           |+--------------------+| information_schema || hbdb               || mysql              || performance_schema || sys                |+--------------------+7 rows in set (0.16 sec)mysql> use REPLACE_WITH_MYSQL_DBNAME;Reading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -ADatabase changed

List the metadata tables. DBS and TBLS are the metadata tables for hive databases and hive tables.

mysql> show tables;+---------------------------+| Tables_in_hbdb            |+---------------------------+| BUCKETING_COLS            || CDS                       || COLUMNS_V2                || DATABASE_PARAMS           || DBS                       || FUNCS                     || TBLS                      |.....+---------------------------+29 rows in set (0.17 sec)

Run following query on DBS table, to see the DB metadata pointing to your hiveDB

mysql> select * from DBS;+-------+-----------------------+-----------------------------------------+---------+------------+------------+| DB_ID | DESC                  | DB_LOCATION_URI                         | NAME    | OWNER_NAME | OWNER_TYPE |+-------+-----------------------+-----------------------------------------+---------+------------+------------+|     1 | Default Hive database | file:/home/spark/shared/spark-warehouse | default | public     | ROLE       |+-------+-----------------------+-----------------------------------------+---------+------------+------------+1 row in set (0.16 sec)

Run following query on TBLS, to validate the metadata of table which we have created as part of notebook execution.

mysql> select * from TBLS;+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+|      1 |  1566587117 |     1 |                0 | spark |         0 |     1 | src      | MANAGED_TABLE | NULL               | NULL               |+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+1 rows in set (0.15 sec)

Analytics Engine powered by Apache Spark on Cloud Pak for Data provides user a platform where user can access, transform and analyze the data and they can integrate with multiple AI solution to build an complete solution for their use cases to grow their business.

This Story is co-authored with Dharmesh Jain who is one of the Architect for IBM Analytics Engine.

References :

Please give your feedback on this article and reach out to me Sunil Ganatra via twitter using my handler sunil_ganatra or reach out to me on LinkedIn.

Senior Engineer@IBM Analytics Engine, Watson Studio Spark Environments and Analytics Engine powered by Apache Spark on IBM Cloud Pak for Data. #k8s #docker