Accessing Multiple Data Sources on Analytics Engine Powered by Apache Spark
Apache Spark is a big data processing engine that is fast, flexible, and user-friendly. It has built-in modules for streaming, SQL, machine learning, and graph processing. Spark processes data in memory and keeps it there only temporarily; memory is not persistent storage. In a real-life use case you need persistent storage, such as a database or repository, from which Spark can read data. Spark can access data in:
- SQL databases (anything accessible through a JDBC/ODBC driver)
- Cloud object storage (e.g., Amazon S3)
- NoSQL databases
- Local files
- and many more…
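To make the list above concrete, here is a small, purely illustrative helper (not part of any Spark API) that maps where data lives to the format string you would typically hand to Spark's DataFrameReader. The URL schemes and format choices are assumptions for illustration only.

```python
# Hypothetical helper, for illustration only: given a data location,
# return the format string you would typically pass to spark.read.format().
def spark_format_for(url: str) -> str:
    if url.startswith("jdbc:"):
        return "jdbc"                       # SQL databases via JDBC
    if url.startswith(("s3a://", "cos://")):
        return "parquet"                    # a common choice on object storage
    if url.startswith(("http://", "https://")):
        return "org.apache.bahir.cloudant"  # e.g., Cloudant (NoSQL)
    return "csv"                            # local files (by extension, in practice)

print(spark_format_for("jdbc:db2://host:50000/BLUDB"))  # jdbc
```

The same DataFrame API then covers all of these sources; only the format and options change.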
In this blog we will look at how you can work with a database or object storage from your Spark jobs. Running Spark on IBM Watson Studio Spark environments or on IBM Cloud Pak for Data, IBM's cloud-native enterprise solution, simplifies the setup, because these environments provide drivers to connect with Cloudant, DB2, and object storage.
Setting Up Environment on Watson Studio Local to Run Spark Notebook
Log in to your IBM Cloud Pak for Data cluster.
Next, go to the Watson Studio homepage and click Create a project.
Next, click Create an empty project.
Select the Environments tab.
1. Connect with DB2 using JDBC and ODBC Connectors
1.1 Connect with DB2 using JDBC in Python environment
Create a new notebook with the environment definition Default Spark 2.4 & Python 3.6.
Next, you can create the notebook from the following code.
credentials_1 = {
    'jdbcurl': 'jdbc:db2://REPLACE_WITH_DB2_HOSTNAME:REPLACE_WITH_DB2_PORT/REPLACE_WITH_DBNAME',
    'username': 'REPLACE_WITH_DB2_USERNAME',
    'password': 'REPLACE_WITH_DB2_PASSWORD'
}

transportation = sqlContext.read.jdbc(
    credentials_1["jdbcurl"],
    "REPLACE_WITH_DB2_USERNAME.MYTABLE",
    properties={"user": credentials_1["username"],
                "password": credentials_1["password"],
                "driver": "com.ibm.db2.jcc.DB2Driver"})

transportation.show()
Once the notebook has started, replace the DB2 hostname, port, username, and password with your DB2 instance's credentials. I pre-created MYTABLE in my database, and the notebook creates a data frame from this table.
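If the table is large, the same read can be split across executors. The sketch below is a hedged variation on the cell above: the keyword arguments `column`, `lowerBound`, `upperBound`, and `numPartitions` are standard parameters of PySpark's `DataFrameReader.jdbc`, but the column name "ID" and the bounds are placeholders you would replace with a numeric key from your own table.

```python
# Connection details, as in the notebook above (placeholders).
credentials_1 = {
    'jdbcurl': 'jdbc:db2://REPLACE_WITH_DB2_HOSTNAME:REPLACE_WITH_DB2_PORT/REPLACE_WITH_DBNAME',
    'username': 'REPLACE_WITH_DB2_USERNAME',
    'password': 'REPLACE_WITH_DB2_PASSWORD'
}

# JDBC connection properties passed to every executor that opens a connection.
properties = {"user": credentials_1["username"],
              "password": credentials_1["password"],
              "driver": "com.ibm.db2.jcc.DB2Driver"}

# In the notebook you would run (requires a live DB2 connection):
# transportation = sqlContext.read.jdbc(
#     credentials_1["jdbcurl"], "REPLACE_WITH_DB2_USERNAME.MYTABLE",
#     column="ID", lowerBound=1, upperBound=100000, numPartitions=4,
#     properties=properties)

print(properties["driver"])
```

Spark then issues four parallel range queries over the "ID" column instead of a single full scan.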
1.2 Connect with DB2 using JDBC in Scala environment
Create a new notebook with the environment definition Default Spark 2.4 & Scala 2.11.
You can create the notebook from the following code.
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "REPLACE_WITH_DB2_USERNAME")
connectionProperties.put("password", "REPLACE_WITH_DB2_PASSWORD")

val transportation = spark.read
  .jdbc("jdbc:db2://REPLACE_WITH_DB2_HOSTNAME:REPLACE_WITH_DB2_PORT/REPLACE_WITH_DBNAME",
        "REPLACE_WITH_DB2_USERNAME.MYTABLE", connectionProperties)

transportation.count()
transportation.collect.foreach(println)
Replace the DB2 hostname, port, username, and password with your DB2 instance's credentials, and change the table name to one of the tables available in your DB2 database. The notebook creates a data frame from that table; you can then apply further transformations to the data based on your use case.
1.3 Connect with DB2 using ODBC in Python environment
You can create the notebook from the following code.
from ibmdbpy import IdaDataBase, IdaDataFrame

idadb_7b95653958cb435aaa3b542f2f711967 = IdaDataBase(
    dsn='DASHDB;Database=REPLACE_WITH_DBNAME;'
        'Hostname=REPLACE_WITH_DB2_HOSTNAME;Port=REPLACE_WITH_DB2_PORT;'
        'PROTOCOL=TCPIP;UID=REPLACE_WITH_DB2_USERNAME;PWD=REPLACE_WITH_DB2_PASSWORD')

ida_df_1 = IdaDataFrame(idadb_7b95653958cb435aaa3b542f2f711967,
                        'SAMPLES.ANCESTRY', indexer="ANC_CODE")
print(ida_df_1.head())
As in 1.1, create a notebook with the Default Spark 2.4 & Python 3.6 environment definition. Replace the DB2 hostname, port, username, and password with your DB2 instance's credentials. In this example, I am creating a data frame from a different table.
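The single long DSN string in the cell above is easy to get wrong. A small helper (hypothetical, not part of ibmdbpy) that assembles it from named parts makes the required fields explicit; the field names (Database, Hostname, Port, UID, PWD) follow the ODBC DSN syntax used in the cell above.

```python
# Hypothetical helper: build the ODBC DSN string that ibmdbpy's
# IdaDataBase expects, from named parameters.
def build_dsn(database, hostname, port, uid, pwd):
    return (f"DASHDB;Database={database};"
            f"Hostname={hostname};Port={port};"
            f"PROTOCOL=TCPIP;UID={uid};PWD={pwd}")

dsn = build_dsn("REPLACE_WITH_DBNAME", "REPLACE_WITH_DB2_HOSTNAME",
                "REPLACE_WITH_DB2_PORT", "REPLACE_WITH_DB2_USERNAME",
                "REPLACE_WITH_DB2_PASSWORD")
print(dsn.split(";")[0])  # DASHDB
```

You would then pass `dsn` to `IdaDataBase(dsn=...)` exactly as in the cell above.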
2. Cloudant Connectors
2.1 Connect with Cloudant using the Apache Bahir Cloudant connector in a Python environment
You can create the notebook from the following code.
data_df = sqlContext.read.format("org.apache.bahir.cloudant") \
    .option("cloudant.host", "REPLACE_WITH_CLOUDANT_HOSTNAME") \
    .option("cloudant.username", "REPLACE_WITH_CLOUDANT_USERNAME") \
    .option("cloudant.password", "REPLACE_WITH_CLOUDANT_PASSWORD") \
    .load("REPLACE_WITH_CLOUDANT_DOCUMENT_ID")

data_df.registerTempTable("tablei")
sqlContext.sql("SELECT name from tablei").show()
Once the notebook has started, replace the Cloudant hostname, username, and password with your Cloudant instance's credentials.
For example, we have used the following document for the notebook execution. The document name is test.
NOTE: We will be using the same data in the other examples.
{
"_id": "ed5e83a3dc3d644391decda741d775a7",
"_rev": "1-f8142fd7dc417e11895870df30aec154",
"name": "IAECloudantTest"
}
As part of the notebook execution, we load the Cloudant document. Next we create a table and run a select query on it to validate whether the data was loaded correctly.
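Before wiring up the connector, it can help to sanity-check locally that the field your SQL selects ("name") is actually present in the document. This standalone check uses the sample document shown above and plain-Python JSON parsing; it is not part of the Bahir connector.

```python
import json

# The sample Cloudant document from the blog, as a local JSON string.
doc = json.loads("""
{
  "_id": "ed5e83a3dc3d644391decda741d775a7",
  "_rev": "1-f8142fd7dc417e11895870df30aec154",
  "name": "IAECloudantTest"
}
""")

# "name" is the column the notebook's SELECT query reads.
assert "name" in doc
print(doc["name"])  # IAECloudantTest
```

If the field is present here, the `SELECT name from tablei` query above should return the same value once the document is loaded through the connector.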
2.2 Connect with Cloudant using the Bahir Cloudant connector in a Scala environment
Following 1.2, create a new notebook with the environment definition Default Spark 2.4 & Scala 2.11. Next, you can create the notebook using the following code.
val data_df = spark.sqlContext.read.format("org.apache.bahir.cloudant")
  .option("cloudant.host", "REPLACE_WITH_CLOUDANT_HOSTNAME")
  .option("cloudant.username", "REPLACE_WITH_CLOUDANT_USERNAME")
  .option("cloudant.password", "REPLACE_WITH_CLOUDANT_PASSWORD")
  .load("REPLACE_WITH_CLOUDANT_DOCUMENT_ID")

data_df.registerTempTable("tablei")
spark.sqlContext.sql("SELECT name from tablei").take(1).foreach(println)
Once the notebook has started, replace the Cloudant hostname, username, and password with your Cloudant instance's credentials.
Using the same Cloudant data, the following is the output of the example notebook.
3. How to connect with Object storage
You can follow an excellent blog by Rachit Arora in which he describes an efficient way to connect to object storage in IBM Watson Studio Spark environments.
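As a minimal sketch of what such a setup involves: one common approach is the Stocator connector for IBM Cloud Object Storage, which is configured through per-service Hadoop configuration keys. The keys below come from the Stocator project's documentation; "mycos" is an arbitrary service name you choose (it reappears in the `cos://` URL), and the endpoint and keys are placeholders. Verify the exact keys against your connector version.

```python
# Hadoop configuration keys for Stocator's COS connector (placeholders).
cos_conf = {
    "fs.stocator.scheme.list": "cos",
    "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
    "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
    "fs.stocator.cos.scheme": "cos",
    "fs.cos.mycos.endpoint": "REPLACE_WITH_COS_ENDPOINT",
    "fs.cos.mycos.access.key": "REPLACE_WITH_ACCESS_KEY",
    "fs.cos.mycos.secret.key": "REPLACE_WITH_SECRET_KEY",
}

# In a notebook you would apply these to the Hadoop configuration:
# hconf = spark.sparkContext._jsc.hadoopConfiguration()
# for k, v in cos_conf.items():
#     hconf.set(k, v)
# df = spark.read.csv("cos://REPLACE_WITH_BUCKET.mycos/data.csv")

print(cos_conf["fs.stocator.scheme.list"])  # cos
```

Once the configuration is in place, object storage paths behave like any other file system path in `spark.read` and `df.write`.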
Analytics Engine powered by Apache Spark on Cloud Pak for Data gives users a platform where they can access, transform, and analyze their data, and integrate with multiple AI solutions to build a complete solution for their use cases and grow their business.
Please give your feedback on this article, and reach out to me, Sunil Ganatra, via Twitter using my handle sunil_ganatra, or reach out to me on LinkedIn.
This story is co-authored with Shobhit Shukla, one of the developers for IBM Analytics Engine, IBM Watson Studio Spark Environments, and Analytics Engine powered by Apache Spark on IBM Cloud Pak for Data.