PySpark — Structured Streaming Read from Kafka

Subham Khandelwal
5 min read · Jan 9, 2023

Spark Streaming acts as a real-time data processing engine that lets you process data from various sources, including Apache Kafka. One benefit of using Spark Streaming with Kafka is that you can process large volumes of data in real time and make near-instantaneous decisions based on that data.


We are going to work on the same use case, but this time we will read data from a Kafka cluster, compute the average temperature per device per customer for each day, and write the result to the console.
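To make the goal concrete, here is a minimal sketch of what such a pipeline could look like in PySpark. Treat it as an illustration under assumptions, not the exact code for this use case: the broker address, the connector version, and the JSON field names (`customerId`, `eventTime`, `data.devices`, `deviceId`, `temperature`) are all assumed here, so adjust them to your own setup and payload.

```python
# Sketch only: broker address, connector version and event layout are assumptions.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "localhost:9092",  # assumption: your broker address
    "subscribe": "devices",                       # topic used in this article
    "startingOffsets": "earliest",                # read the topic from the beginning
}

# Assumed event layout, written as a Spark DDL schema string.
EVENT_SCHEMA = (
    "customerId STRING, eventTime STRING, "
    "data STRUCT<devices: ARRAY<STRUCT<deviceId: STRING, temperature: DOUBLE>>>"
)


def build_daily_average(spark):
    """Return a streaming DataFrame: average temperature per customer, device and day."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.sql.functions import avg, col, explode, from_json, to_date

    # Kafka delivers the payload as bytes in the "value" column.
    raw = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()

    # Parse the JSON payload using the assumed schema.
    events = raw.select(
        from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event")
    )

    # One row per device reading, tagged with its customer and calendar day.
    readings = events.select(
        col("event.customerId").alias("customerId"),
        to_date(col("event.eventTime")).alias("eventDate"),
        explode(col("event.data.devices")).alias("device"),
    )

    return readings.groupBy(
        "customerId", "eventDate", col("device.deviceId").alias("deviceId")
    ).agg(avg(col("device.temperature")).alias("avgTemperature"))


def run():
    """Start the streaming query and print the running averages to the console."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("kafka-devices-average")
        # Assumption: the connector coordinates must match your Spark/Scala build.
        .config("spark.jars.packages",
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
        .getOrCreate()
    )
    query = (
        build_daily_average(spark)
        .writeStream.format("console")
        .outputMode("complete")  # re-emit the full aggregate table on each trigger
        .option("truncate", "false")
        .start()
    )
    query.awaitTermination()
```

Calling `run()` from a notebook cell starts the query and blocks until it is stopped. The `complete` output mode is used here because the aggregation has no watermark; each trigger prints the full table of averages.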

Check out the use case: https://urlit.me/blog/pyspark-structured-streaming-read-from-files/

To get started quickly with the setup for Kafka, Spark and Jupyter Notebook, check out my custom Docker image on GitHub: https://github.com/subhamkharwal/docker-images/tree/master/pyspark-kafka-cluster

Check out this YouTube video to complete the setup.

Once the setup is complete, let's quickly create the Kafka topic “devices” to post our data to.

The source posts data to the Kafka topic “devices”, and the Spark Streaming consumer continuously reads from the same topic and processes the data using various transformations and actions.
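If you prefer to stay in Python, the topic can also be created and fed from a script. The sketch below uses the third-party `kafka-python` client, which is a swapped-in choice and not necessarily what the Docker setup itself uses (the Kafka command-line tools work just as well). The broker address, the partition and replication settings, and the shape of the sample event are all assumptions made for illustration.

```python
import json

TOPIC = "devices"                     # topic used in this article
BOOTSTRAP_SERVERS = "localhost:9092"  # assumption: your broker address


def sample_event():
    """Hypothetical device event; the field names are assumptions."""
    return {
        "customerId": "C001",
        "eventTime": "2023-01-09 10:15:00",
        "data": {"devices": [{"deviceId": "D001", "temperature": 21.5}]},
    }


def encode_event(event):
    """Serialize an event to the UTF-8 JSON bytes that Kafka will store."""
    return json.dumps(event).encode("utf-8")


def create_topic_and_post(event):
    """Create the topic if it is missing, then post one event to it."""
    # Imported here so this module can be loaded without kafka-python installed.
    from kafka import KafkaProducer
    from kafka.admin import KafkaAdminClient, NewTopic
    from kafka.errors import TopicAlreadyExistsError

    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP_SERVERS)
    try:
        # Assumption: a single partition and replica is enough for a local test.
        admin.create_topics(
            [NewTopic(name=TOPIC, num_partitions=1, replication_factor=1)]
        )
    except TopicAlreadyExistsError:
        pass  # the topic is already there, nothing to do
    finally:
        admin.close()

    producer = KafkaProducer(
        bootstrap_servers=BOOTSTRAP_SERVERS,
        value_serializer=encode_event,
    )
    producer.send(TOPIC, event)
    producer.flush()  # block until the message has actually been sent
    producer.close()
```

With a broker running, `create_topic_and_post(sample_event())` posts one message that a Spark consumer subscribed to “devices” would then receive as bytes in its `value` column.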

