Member-only story

PySpark — Structured Streaming Read from Files

3 min readJan 5, 2023

Often Streaming use case needs to read data from files/folders and process for downstream requirements. And to elaborate this, lets consider the following use case.

A popular thermostat company takes reading from their customer devices to understand device usage. The device data JSON files are generated and dumped in an input folder. The notification team requires the data to be flattened in real-time and written in output in JSON format for sending out notifications and reporting accordingly.

A sample JSON input file contains the following data points

Now, to design the real time streaming pipeline to ingest the file and flatten, lets create the SparkSession

# Create the Spark Session
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Streaming Process Files") \
    .config("spark.streaming.stopGracefullyOnShutdown", True) \
    .master("local[*]") \
    .getOrCreate()

spark

Create the DataFrameStreamReader dataframe

# To allow automatic schemaInference while reading
spark.conf.set("spark.sql.streaming.schemaInference", True)

#…

PySpark — Structured Streaming Read from Files

Written by Subham Khandelwal

No responses yet