Databricks Structured Streaming using Event Hub, Kafka & Power BI

Sanajit Ghosh
4 min read · May 3, 2020

The use case for this scenario is to capture real-time streaming data from sensors and IoT devices and then process it for analysis and visualization.

Databricks Streaming using Event Hub

Use Case:

Consider a wind farm where hundreds of wind turbines harness energy from the wind and store it in large cells. All of this heavy equipment sits in remote offshore locations, and engineers often have to travel long distances to troubleshoot it.

Leveraging IoT, machine-level data processing and streaming can save the industry a great deal of time and money.

The above architecture is a prototype of industrial cloud automation using sensor data.

For this scenario, I have created a small Python application that generates dummy sensor readings and sends them to Azure Event Hub/Kafka. I have used Azure Databricks to capture the streams from the event hub and Power BI to visualize the received data.

1. Preparing the data source

First, create an event hub in the Azure portal and note down its namespace, access-key name and value. If you are new to event hubs and message brokers, follow this link.

Once the event hub is ready, install the following packages. (For an IDE, you can create an application in VS Code or simply use a Jupyter notebook in an Anaconda environment.)

Packages required
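The exact list is in the screenshot above. If you are using the Python SDK for Event Hubs (an assumption on my part; older and newer SDK versions use different package names), the install is typically just:

pip install azure-eventhub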

Copy the credentials from the Event Hub namespace and paste them into the fields below. I have made the source code available on my GitHub.

Once the code is ready, run it and check whether a real-time spike appears in the event hub's dashboard metrics.

The output below is a simulation of parameters that are actually used in a wind turbine. The code generates JSON metrics and sends them to Azure Event Hub every 10 seconds.

The next step is to process this streaming data with Databricks Structured Streaming.

Python application/turbine source
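For reference, here is a minimal sketch of such a simulator. It assumes the azure-eventhub v5 Python SDK; the connection string, hub name and turbine field names are placeholders, not necessarily the ones used in my repository.

import json, random, time
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders -- copy these from your Event Hub namespace
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key-value>"
EVENTHUB_NAME = "<event-hub-name>"

producer = EventHubProducerClient.from_connection_string(CONNECTION_STR, eventhub_name=EVENTHUB_NAME)

while True:
    # Dummy turbine reading; field names are illustrative only
    reading = {
        "turbine_id": "T-01",
        "wind_speed": round(random.uniform(3, 25), 2),    # m/s
        "rpm": round(random.uniform(5, 20), 2),
        "power_kw": round(random.uniform(100, 3000), 2),
        "timestamp": time.time(),
    }
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
    time.sleep(10)    # one reading every 10 seconds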

2. Structured Streaming using Databricks and Event Hub

The idea in Structured Streaming is to process and analyse the streaming data from Event Hub. For this we need to connect the event hub to Databricks using its endpoint connection string; a connection sketch follows the package list below. Use this documentation to get familiar with Event Hub connection parameters and service endpoints.

Note: Install any one of these Maven packages on all clusters through the Databricks Libraries UI.

com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1

com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13

com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
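With one of these packages attached to the cluster, reading the stream looks roughly like the sketch below. The connection string is a placeholder; note that newer connector versions expect it to be wrapped with EventHubsUtils.encrypt(), while the versions listed above accept it as plain text.

from pyspark.sql.functions import col

# Placeholder connection string, including the EntityPath of the event hub
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key-value>;EntityPath=<event-hub-name>"

eh_conf = {"eventhubs.connectionString": connection_string}

raw_df = (spark.readStream
          .format("eventhubs")
          .options(**eh_conf)
          .load())

# The Event Hub payload arrives as binary in the 'body' column
json_df = raw_df.withColumn("body", col("body").cast("string"))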

Once the connection with the event hub is established, run the code below to check whether the stream has started.

For your reference, the entire Databricks code is made available here. You can also refer to this doc for more background.

Note: We are writing the stream to an in-memory table (memory sink) and then fetching the results.

.format("memory")        # testing purpose

.queryName("real_hub")   # acts like "select * from real_hub" to read the stream
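Put together, the memory-sink query looks something like this (json_df is the streaming DataFrame read from Event Hub in the sketch above, and real_hub is just the chosen query name):

query = (json_df.writeStream
         .format("memory")         # testing purpose only
         .queryName("real_hub")    # exposes the stream as an in-memory table
         .outputMode("append")
         .start())

display(spark.sql("select * from real_hub"))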

The spikes below show how data ingestion takes place from Event Hub to Databricks.

3. Getting into the actual scenario.

Our objective is not only to view the data in real time, but also to save it in a format that is much faster to process (Parquet) and then visualize it in external tools to draw insights.

I have made some tweaks to the code, appending the incoming streams as rows and then saving them as Parquet files.

Before doing so, I defined the schema of the incoming data and loaded it into a DataFrame.

Write the stream to Parquet/JSON
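A sketch of that step follows. The schema fields are hypothetical and must match whatever the simulator actually emits, and the mount paths are placeholders.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.sql.functions import from_json, col

# Hypothetical schema -- adjust the fields to match the simulator's JSON
turbine_schema = StructType([
    StructField("turbine_id", StringType()),
    StructField("wind_speed", DoubleType()),
    StructField("rpm", DoubleType()),
    StructField("power_kw", DoubleType()),
    StructField("timestamp", DoubleType()),
])

readings = (json_df
            .select(from_json(col("body"), turbine_schema).alias("r"))
            .select("r.*"))

# Append each micro-batch as Parquet files; paths are placeholders
parquet_query = (readings.writeStream
                 .format("parquet")
                 .option("path", "/mnt/turbine/stream")
                 .option("checkpointLocation", "/mnt/turbine/checkpoint")
                 .outputMode("append")
                 .start())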

After the stream has been captured, the next objective is to combine all the rows (Parquet files) into a single DataFrame and then create a temp view from it.

Appending all parquet files
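Reading the accumulated files back and registering a temp view takes only a couple of lines (the path and view name are the placeholders used above):

# Read every Parquet file written by the stream into a single DataFrame
turbine_df = spark.read.parquet("/mnt/turbine/stream")

# Register a temporary view so the data can be queried with SQL
turbine_df.createOrReplaceTempView("turbine_readings")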

From the snapshot below, you can see that all the Parquet files are processed into a single DataFrame and then an external Hive table is created for data analysis.

Saving the stream to Hive table
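A sketch of that step, using the DataFrame built above; supplying an explicit path makes the saved table external rather than managed:

# Persist the combined DataFrame as an external Hive table (names/paths are placeholders)
(turbine_df.write
 .mode("overwrite")
 .option("path", "/mnt/turbine/table")
 .saveAsTable("turbine_metrics"))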

Once the table is created from the DataFrame, I can easily run filtered queries against it. Thus, the idea of capturing streaming data and representing it in tabular form works as intended.
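For example (table and column names follow the placeholders used above):

spark.sql("""
  SELECT turbine_id, avg(power_kw) AS avg_power_kw
  FROM turbine_metrics
  WHERE wind_speed > 10
  GROUP BY turbine_id
""").show()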

In my next post, I will show you how to access this table and visualize the wind turbine metrics in Power BI.
