PySpark — Structured Streaming Read from Files

Subham Khandelwal
3 min read · Jan 5, 2023

Streaming use cases often need to read data from files or folders and process it for downstream requirements. To illustrate, let's consider the following use case.

A popular thermostat company takes readings from its customers' devices to understand device usage. The device data is generated as JSON files and dumped into an input folder. The notification team requires the data to be flattened in real time and written out in JSON format for sending notifications and reporting.


A sample JSON input file contains the following data points:
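As a rough sketch, one input record with nested device readings might look like this (the field names and values here are illustrative assumptions, not the actual sample):

{
  "eventId": "aa90011f-3967-496c-b94b-a0c8de19a3d3",
  "eventTime": "2023-01-05 11:13:53.643364",
  "data": {
    "devices": [
      {"deviceId": "D001", "temperature": 15, "measure": "C", "status": "ERROR"},
      {"deviceId": "D002", "temperature": 16, "measure": "C", "status": "SUCCESS"}
    ]
  }
}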

Now, to design the real-time streaming pipeline that ingests and flattens these files, let's create the SparkSession.

# Create the Spark Session
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Streaming Process Files") \
    .config("spark.streaming.stopGracefullyOnShutdown", True) \
    .master("local[*]") \
    .getOrCreate()

spark

Create the streaming dataframe with the DataStreamReader

# Enable automatic schema inference while reading the stream
spark.conf.set("spark.sql.streaming.schemaInference", True)
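With schema inference enabled, the streaming dataframe itself can be created by pointing readStream at the input folder. A minimal sketch, assuming the JSON files land in a folder named data/input/ (both the path and the maxFilesPerTrigger option are illustrative choices, not from the original setup):

# Create the streaming dataframe by reading JSON files from the input folder
# NOTE: "data/input/" is an assumed example path
streaming_df = spark.readStream \
    .format("json") \
    .option("maxFilesPerTrigger", 1) \
    .load("data/input/")

# Verify the inferred schema
streaming_df.printSchema()

Reading one file per trigger keeps each micro-batch small, which makes the downstream flattening step easier to follow during development.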
