PySpark — Structured Streaming Read from Files
Streaming use cases often need to read data from files or folders and process it for downstream requirements. To elaborate, let's consider the following use case.
A popular thermostat company takes readings from its customers' devices to understand device usage. The device data is generated as JSON files and dumped into an input folder. The notification team requires the data to be flattened in real time and written to an output folder in JSON format, for sending out notifications and reporting accordingly.
A sample JSON input file contains the following data points:
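Since the payload shape matters for the flattening logic later, here is a hypothetical payload of the kind described; every field name in it is an assumption, reused in the sketches further below:

{
  "eventId": "e3cb26d3-41b2-49a2-84f3-0156ed8d7502",
  "customerId": "CI00103",
  "eventTime": "2023-01-05 11:13:53.643364",
  "data": {
    "devices": [
      { "deviceId": "D001", "temperature": 15, "measure": "C", "status": "ERROR" },
      { "deviceId": "D002", "temperature": 16, "measure": "C", "status": "SUCCESS" }
    ]
  }
}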
Now, to design the real-time streaming pipeline that ingests and flattens the files, let's first create the SparkSession.
# Create the Spark Session
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Streaming Process Files") \
    .config("spark.streaming.stopGracefullyOnShutdown", True) \
    .master("local[*]") \
    .getOrCreate()
spark
Create the streaming DataFrame using the DataStreamReader
# Allow automatic schema inference while reading the stream
spark.conf.set("spark.sql.streaming.schemaInference", True)
#…
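As a minimal sketch, assuming the input files land in data/input/ (an illustrative path) and schema inference picks up the payload shape shown earlier, the streaming read could look like this:

# Read the JSON files from the input folder as a stream
# ("data/input/" and the maxFilesPerTrigger value are assumptions for illustration)
streaming_df = spark \
    .readStream \
    .option("maxFilesPerTrigger", 1) \
    .format("json") \
    .load("data/input/")

# Verify the inferred schema
streaming_df.printSchema()

Flattening the nested devices array and writing the result out as JSON could then look like the following; the field names follow the hypothetical sample above, and the file sink requires a checkpoint location:

from pyspark.sql.functions import col, explode

# Explode the nested array so there is one output row per device reading
flattened_df = streaming_df \
    .withColumn("device", explode(col("data.devices"))) \
    .drop("data") \
    .select("eventId", "customerId", "eventTime",
            col("device.deviceId").alias("deviceId"),
            col("device.temperature").alias("temperature"),
            col("device.status").alias("status"))

# Write the flattened stream to the output folder in JSON format
# ("data/output/" and "checkpoint/" are assumed paths)
query = flattened_df \
    .writeStream \
    .format("json") \
    .option("path", "data/output/") \
    .option("checkpointLocation", "checkpoint/") \
    .outputMode("append") \
    .start()

# Keep the query running (blocks the driver)
query.awaitTermination()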