Accessing Multiple Data Sources on Analytics Engine powered by Apache Spark

Sunil Ganatra
5 min read · Dec 3, 2019

Apache Spark is a big data processing engine that is fast, flexible, and user-friendly. It has built-in modules for streaming, SQL, machine learning, and graph processing. Spark processes data in memory and holds it there only temporarily; that is not persistent storage. In real-life use cases you need persistent storage, such as a database or a repository, from which Spark can read data. Spark can access data in:

  • SQL databases (anything that can be connected using a JDBC/ODBC driver)
  • Cloud storage (e.g., Amazon S3)
  • NoSQL databases
  • Local files
  • and many more (see the short sketch below)
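
As a quick illustration of how uniform this looks from the application side, here is a minimal PySpark sketch. The paths, hostname, table name, and credentials are placeholders, and the spark session is assumed to already exist, as it does in a Watson Studio notebook.

# Placeholder values only: substitute your own paths, URLs, and credentials.
# Local (or mounted) CSV file
local_df = spark.read.option("header", "true").csv("/path/to/local/file.csv")

# Any SQL database reachable over JDBC (DB2 is shown in detail later in this blog)
jdbc_df = spark.read.jdbc(
    "jdbc:db2://HOSTNAME:PORT/DBNAME", "SCHEMA.TABLE",
    properties={"user": "USERNAME", "password": "PASSWORD",
                "driver": "com.ibm.db2.jcc.DB2Driver"})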

In this blog we will talk about how users can work with databases and object storage in their Spark jobs. Running Spark on IBM Watson Studio Spark environments or on IBM Cloud Pak for Data, IBM's cloud-native enterprise solution, simplifies the setup, because these environments provide drivers for connecting to Cloudant, DB2, and Object Storage.

Setting Up the Environment on Watson Studio Local to Run a Spark Notebook

Log in to your IBM Cloud Pak for Data cluster.

Next, go to the Watson Studio homepage and click Create a project.

Watson Studio homepage

Next, hover over the tile and click Create an empty project.

Click on Create an empty project

Select the Environments tab.

1. Connect with DB2 using JDBC and ODBC Connectors

1.1 Connect with DB2 using JDBC in a Python environment

Create a new notebook with the environment definition Default Spark 2.4 & Python 3.6.

Select Environment Definition Default Spark Python 3.6

Next, you can create the notebook from the following code.

# DB2 connection details: replace the placeholders with your instance's credentials.
credentials_1 = {
    'jdbcurl': 'jdbc:db2://REPLACE_WITH_DB2_HOSTNAME:REPLACE_WITH_DB2_PORT/REPLACE_WITH_DBNAME',
    'username': 'REPLACE_WITH_DB2_USERNAME',
    'password': 'REPLACE_WITH_DB2_PASSWORD'
}
# Read the table into a Spark DataFrame via the DB2 JDBC driver.
transportation = sqlContext.read.jdbc(
    credentials_1["jdbcurl"],
    "REPLACE_WITH_DB2_USERNAME.MYTABLE",
    properties={"user": credentials_1["username"],
                "password": credentials_1["password"],
                "driver": "com.ibm.db2.jcc.DB2Driver"})
transportation.show()

Once the notebook has started, replace the DB2 hostname, port, username, and password with your DB2 instance's credentials. I pre-created MYTABLE in my database and am creating a data frame from this table.

Connect to Db2 via JDBC Driver From Python Notebook
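
If you also need to persist results back to DB2, Spark's standard JDBC writer works with the same credentials. The following is a minimal sketch; MYRESULTS is a hypothetical target table name, not one used elsewhere in this article.

# A minimal sketch, reusing the credentials_1 dictionary from above.
# MYRESULTS is a hypothetical target table name.
transportation.write.jdbc(
    credentials_1["jdbcurl"],
    "REPLACE_WITH_DB2_USERNAME.MYRESULTS",
    mode="overwrite",  # or "append"
    properties={"user": credentials_1["username"],
                "password": credentials_1["password"],
                "driver": "com.ibm.db2.jcc.DB2Driver"})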

1.2 Connect with DB2 using JDBC in a Scala environment

Create a new notebook with the environment definition Default Spark 2.4 & Scala 2.11.

Select Environment Definition Scala

You can create the notebook from the following code.

import java.util.Properties

// DB2 connection details: replace the placeholders with your instance's credentials.
val connectionProperties = new Properties()
connectionProperties.put("user", "REPLACE_WITH_DB2_USERNAME")
connectionProperties.put("password", "REPLACE_WITH_DB2_PASSWORD")

// Read the table into a Spark DataFrame via the DB2 JDBC driver.
val transportation = spark.read
  .jdbc("jdbc:db2://REPLACE_WITH_DB2_HOSTNAME:REPLACE_WITH_DB2_PORT/REPLACE_WITH_DBNAME",
        "REPLACE_WITH_DB2_USERNAME.MYTABLE", connectionProperties)

transportation.count()
transportation.collect.foreach(println)

Replace the DB2 hostname, port, username, and password with your DB2 instance's credentials, and change the table name to one of the tables available in your DB2 database. This notebook will create a data frame from that table; you can then apply further transformations to your data based on your use case.

Connect to Db2 via JDBC Driver From Scala Notebook

1.3 Connect with DB2 using ODBC in a Python environment

You can create the notebook from the following code.

from ibmdbpy import IdaDataBase, IdaDataFrame
idadb_7b95653958cb435aaa3b542f2f711967 = IdaDataBase(
    dsn='DASHDB;Database=REPLACE_WITH_DBNAME;Hostname=REPLACE_WITH_DB2_HOSTNAME;Port=REPLACE_WITH_DB2_PORT;'
        'PROTOCOL=TCPIP;UID=REPLACE_WITH_DB2_USERNAME;PWD=REPLACE_WITH_DB2_PASSWORD')
ida_df_1 = IdaDataFrame(idadb_7b95653958cb435aaa3b542f2f711967, 'SAMPLES.ANCESTRY', indexer="ANC_CODE")
print(ida_df_1.head())

As in 1.1, create a notebook with the Default Spark 2.4 & Python 3.6 environment definition. Replace the DB2 hostname, port, username, and password with your DB2 instance's credentials. In this example, I am creating a data frame from a different table, SAMPLES.ANCESTRY.

Connect to Db2 via ODBC Driver From Python Notebook
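
ibmdbpy can also push SQL straight down to DB2, and the connection should be closed when you are done. The sketch below reuses the IdaDataBase object created above and relies on the ida_query and close methods as described in the ibmdbpy documentation; treat the exact query as a placeholder.

# A minimal sketch reusing the IdaDataBase object created above.
# ida_query runs the SQL in DB2 and returns the result to the notebook.
result = idadb_7b95653958cb435aaa3b542f2f711967.ida_query(
    "SELECT COUNT(*) FROM SAMPLES.ANCESTRY")
print(result)

# Close the connection once you are finished.
idadb_7b95653958cb435aaa3b542f2f711967.close()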

2. Cloudant Connectors

2.1 Connect with Cloudant using the Bahir Cloudant connector in a Python environment

You can create the notebook from the following code.

# Load the Cloudant data into a Spark DataFrame using the Bahir Cloudant connector.
data_df = sqlContext.read.format("org.apache.bahir.cloudant") \
    .option("cloudant.host", "REPLACE_WITH_CLOUDANT_HOSTNAME") \
    .option("cloudant.username", "REPLACE_WITH_CLOUDANT_USERNAME") \
    .option("cloudant.password", "REPLACE_WITH_CLOUDANT_PASSWORD") \
    .load("REPLACE_WITH_CLOUDANT_DOCUMENT_ID")

# Register a temporary table and validate that the data loaded correctly.
data_df.registerTempTable("tablei")
sqlContext.sql("SELECT name from tablei").show()

Once the notebook has started, replace the Cloudant hostname, username, and password with your Cloudant instance's credentials.

Notebook bahirCloudantPython.ipynb is started

For example, we have used the following document for the notebook execution. The document name is test.

NOTE: we will be using the same data in the other examples.

{
"_id": "ed5e83a3dc3d644391decda741d775a7",
"_rev": "1-f8142fd7dc417e11895870df30aec154",
"name": "IAECloudantTest"
}

As part of the notebook execution, we load the Cloudant document. Next, we create a temporary table and run a SELECT query on it to validate that the data was loaded correctly.

Example Notebook execution output
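
The Bahir Cloudant connector can also write a DataFrame back to Cloudant. The sketch below is based on the connector's documented save support; REPLACE_WITH_TARGET_DB is a placeholder database name, and createDBOnSave is assumed to create the database if it does not already exist.

# A minimal sketch: writing data_df back to a Cloudant database.
# REPLACE_WITH_TARGET_DB is a placeholder; createDBOnSave is assumed to
# create the database if it does not exist (see the Bahir sql-cloudant docs).
data_df.write.format("org.apache.bahir.cloudant") \
    .option("cloudant.host", "REPLACE_WITH_CLOUDANT_HOSTNAME") \
    .option("cloudant.username", "REPLACE_WITH_CLOUDANT_USERNAME") \
    .option("cloudant.password", "REPLACE_WITH_CLOUDANT_PASSWORD") \
    .option("createDBOnSave", "true") \
    .save("REPLACE_WITH_TARGET_DB")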

2.2 Connect with Cloudant using the Bahir Cloudant connector in a Scala environment

As in 1.2, create a new notebook with the environment definition Default Spark 2.4 & Scala 2.11. Next, you can create the notebook using the following code.

// Load the Cloudant data into a Spark DataFrame using the Bahir Cloudant connector.
val data_df = spark.sqlContext.read.format("org.apache.bahir.cloudant")
  .option("cloudant.host", "REPLACE_WITH_CLOUDANT_HOSTNAME")
  .option("cloudant.username", "REPLACE_WITH_CLOUDANT_USERNAME")
  .option("cloudant.password", "REPLACE_WITH_CLOUDANT_PASSWORD")
  .load("REPLACE_WITH_CLOUDANT_DOCUMENT_ID")

// Register a temporary table and validate that the data loaded correctly.
data_df.registerTempTable("tablei")
spark.sqlContext.sql("SELECT name from tablei").take(1).foreach(println)

Once the notebook has started, replace the Cloudant hostname, username, and password with your Cloudant instance's credentials.

Notebook bahirCloudantScala.ipynb is started

Using the same Cloudant data, the following is the output of the example notebook.

Notebook execution output

3. How to Connect with Object Storage

For object storage, you can follow an excellent blog by Rachit Arora, Efficient way to connect to Object storage in IBM Watson Studio — Spark Environments (linked in the reference below).
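
For a quick, self-contained illustration, here is a minimal PySpark sketch of one common approach: configuring the Stocator connector (pre-installed in Watson Studio Spark environments) for IBM Cloud Object Storage with HMAC credentials and reading a CSV file over the cos:// scheme. The service name myCos, the endpoint, bucket, object, and keys below are all placeholders; see Rachit's blog for the recommended setup.

# A minimal sketch, assuming the Stocator connector is already on the classpath
# (it ships with Watson Studio Spark environments). "myCos", the endpoint,
# bucket, object, and HMAC keys are placeholders.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.cos.myCos.endpoint", "REPLACE_WITH_COS_ENDPOINT")
hconf.set("fs.cos.myCos.access.key", "REPLACE_WITH_COS_ACCESS_KEY")
hconf.set("fs.cos.myCos.secret.key", "REPLACE_WITH_COS_SECRET_KEY")

# Read a CSV object from the bucket over the cos:// scheme.
cos_df = spark.read.option("header", "true") \
    .csv("cos://REPLACE_WITH_BUCKET.myCos/REPLACE_WITH_OBJECT.csv")
cos_df.show()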

Analytics Engine powered by Apache Spark on Cloud Pak for Data provides a platform where users can access, transform, and analyze their data, and integrate it with multiple AI solutions to build a complete solution for their use cases and grow their business.

Reference:

https://medium.com/@rachit1arora/efficient-way-to-connect-to-object-storage-in-ibm-watson-studio-spark-environments-d6c1199f9f97

Please share your feedback on this article and reach out to me, Sunil Ganatra, on Twitter using my handle sunil_ganatra, or connect with me on LinkedIn.

This story is co-authored with Shobhit Shukla, who is one of the developers for IBM Analytics Engine, IBM Watson Studio Spark Environments, and Analytics Engine powered by Apache Spark on IBM Cloud Pak for Data.

