PySpark — Structured Streaming Read from Files

Streaming use cases often need to read data from files/folders and process it for downstream requirements. To elaborate on this, let's consider the following use case.

A popular thermostat company takes readings from its customer devices to understand device usage. The device data is generated as JSON files and dumped into an input folder. The notification team requires the data to be flattened in real time and written to an output folder in JSON format for sending out notifications and reporting accordingly.


A sample JSON input file contains the following data points:

Sample JSON file data format
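The original sample file is shown only as an image, but based on the article's description (readings for multiple devices nested in a list/array), a plausible file could look like the following; all field names and values here are assumptions, not the original data:

```python
import json

# Hypothetical sample record: one JSON file holds readings for
# multiple devices in a nested array, matching the article's description.
sample = """
{
  "eventId": "e3cb26d3-41b2-49a2-84f3-0156ed8d7502",
  "eventTime": "2023-01-05 11:13:53",
  "data": {
    "devices": [
      {"deviceId": "D001", "temperature": 15, "measure": "C", "status": "ERROR"},
      {"deviceId": "D002", "temperature": 16, "measure": "C", "status": "SUCCESS"}
    ]
  }
}
"""

record = json.loads(sample)
print(len(record["data"]["devices"]))  # number of device readings in this file
```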

Now, to design the real-time streaming pipeline that ingests and flattens the files, let's first create the SparkSession.

Create the streaming dataframe with the DataStreamReader

A few points to note:

  1. Option cleanSource — archives or deletes the source files after processing. Valid values are archive and delete; the default is off.
  2. Option sourceArchiveDir — the archive directory, required when cleanSource is set to archive.
  3. Option maxFilesPerTrigger — sets the maximum number of files processed in a single micro-batch.
  4. The config spark.sql.streaming.schemaInference must be set to true to allow streaming schema inference.

Now, if we need to check the schema, just replace readStream with read for debugging

Schema structure

Since each file holds data for multiple devices in a list/array, we need to explode the dataframe before flattening

Flattening dataframe

Let's write the streaming data to the output folder in JSON format

As soon as we put a new file in the input directory, it is processed in real time. See it in action:

Output files generated

Let's check the output

Flattened output data

Check the Spark UI to see the micro-batch executions

Spark UI

Check out Structured Streaming basics — https://urlit.me/blog/pyspark-the-basics-of-structured-streaming/

Check out the iPython Notebook on Github — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/32_spark_streaming_read_from_files.ipynb

Check out the docker images to quickly start — https://github.com/subhamkharwal/docker-images

Check out my Personal blog — https://urlit.me/blog/

Check out the PySpark Medium Series — https://medium.com/@subhamkharwal/learnbigdata101-spark-series-940160ff4d30
