PySpark — The Basics of Structured Streaming

Spark Structured Streaming is a stream-processing framework built on top of the Spark SQL engine. It reuses the existing DataFrame/Dataset APIs, so the same constructs you already use for batch jobs power your streaming computations.

Streaming Data (image source: Spark Documentation)

So, if you are familiar with DataFrames, Datasets and SQL on Spark, this one will be easy. Structured Streaming can be thought of as a micro-batch framework that treats the stream as an unbounded table and continuously appends new rows to the end of it. There is, however, a small twist to it.

To start and keep it simple, we can break Structured Streaming into 4 parts: What, How, When and Where. If you have an answer for all 4, the job is done.

Let's understand all 4, one by one.

What: simply means what is your input source. Spark supports the following input sources:

- File source (files written to a directory, e.g. CSV, JSON, Parquet)
- Kafka source
- Socket source (for testing)
- Rate source (for testing)
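As a minimal sketch (assuming a local SparkSession; the rate source and its option value below are just for illustration), creating a streaming DataFrame only swaps read for readStream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingBasics").getOrCreate()

# "What": a streaming DataFrame from the built-in rate source,
# which emits (timestamp, value) rows, handy for testing
rate_df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 5)  # illustrative value
    .load()
)

print(rate_df.isStreaming)  # True: this DataFrame is unbounded
```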

How: simply means how you are processing the data. It involves the transformations (same as batch, but with a few restrictions) and the output mode for the sink. Spark supports the following output modes:

- Append (default): only the new rows added since the last trigger are written to the sink
- Complete: the entire result table is written to the sink after every trigger
- Update: only the rows that changed since the last trigger are written to the sink
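Continuing the sketch above, transformations look just like batch code. Here is an illustrative windowed aggregation, which pairs with the Complete or Update modes:

```python
from pyspark.sql import functions as F

# "How": a running count per 10-second window over the rate stream;
# an aggregation like this cannot use Append mode without a watermark
counts = rate_df.groupBy(F.window("timestamp", "10 seconds")).count()
```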

Not all output sinks support all output modes. We will discuss these combinations when we run our examples.

When: refers to the trigger for the streaming pipeline. It defines when, and how often, a micro-batch fires, as shown in the sketch below.
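A few of the trigger options on a streaming writer (a sketch; the intervals are arbitrary):

```python
writer = counts.writeStream.outputMode("complete").format("console")

# Default (no trigger call): the next micro-batch starts as soon as
# the previous one finishes
writer = writer.trigger(processingTime="30 seconds")  # fixed-interval micro-batches
# writer = writer.trigger(once=True)            # run one micro-batch, then stop
# writer = writer.trigger(availableNow=True)    # Spark 3.3+: drain all available data, then stop
# writer = writer.trigger(continuous="1 second")  # experimental low-latency mode
```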

Where: defines the output sink where the data will be displayed/stored. A few supported sinks:

- File sink (e.g. Parquet, ORC, JSON, CSV)
- Kafka sink
- Foreach / ForeachBatch sink (run arbitrary logic on each batch of output)
- Console sink (for debugging)
- Memory sink (for debugging)
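Putting the four parts together, a minimal end-to-end sketch (the checkpoint path is hypothetical, and the console sink is for debugging only):

```python
query = (
    counts.writeStream
    .outputMode("complete")                    # How: full result table each trigger
    .format("console")                         # Where: console sink, debugging only
    .option("checkpointLocation", "/tmp/chk")  # hypothetical path; needed for fault tolerance on real sinks
    .trigger(processingTime="10 seconds")      # When: one micro-batch every 10 seconds
    .start()
)
query.awaitTermination()
```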

All of this might sound a bit new or weird right now, but in the upcoming posts we will go through each part with examples.

For more details, check out the Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Check out my personal blog: https://urlit.me/blog/

Check out the PySpark Medium series: https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30

Wish to buy me a coffee? Buy Subham a Coffee
