PySpark — The Basics of Structured Streaming
Spark Structured Streaming is a stream-processing framework built on top of the Spark SQL engine. It leverages the existing DataFrame and Dataset APIs, bringing the power of Spark SQL to streaming computation.
So, if you are familiar with DataFrames, Datasets and SQL on Spark, this one will be easy. You can think of Structured Streaming as a micro-batch framework that continuously appends incoming data to the end of an unbounded table, though with a small twist to it.
To start and keep it simple, we can break Structured Streaming into 4 parts: What, How, When and Where. If you have an answer for all 4, then the job is done.
Let's understand all 4, one by one.
What: simply means, what is your input source? Spark supports the following as input sources (a minimal read sketch follows the list):
- Streaming Input sources such as Kafka, Azure EventHub, Kinesis etc.
- File systems such as HDFS, S3 etc.
- Sockets
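To make this concrete, here is a minimal sketch of reading from two of these sources with the DataFrame API. The host, port and directory path are placeholder values chosen for illustration, and the file source needs its schema declared up front:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredStreamingBasics").getOrCreate()

# Socket source: reads UTF-8 lines from a TCP server (meant for testing,
# not production). localhost:9999 is a placeholder address.
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# File source: picks up new files landing in a directory as a stream.
# /tmp/streaming-input/ is a placeholder path; unlike batch reads,
# streaming file sources require an explicit schema.
files = (
    spark.readStream
    .format("csv")
    .option("header", "true")
    .schema("id INT, value STRING")
    .load("/tmp/streaming-input/")
)
```

Both calls return an ordinary streaming DataFrame, so everything downstream looks just like batch code.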
How: simply means, how are you processing the data? It involves the transformations (same as in batch, but with a few restrictions) and the output mode for the sink. Spark supports the following output modes (a write sketch follows the list):
- Append: writes only the new rows added since the last trigger
- Update: writes only the rows that changed since the last trigger
- Complete: rewrites the entire result table on every trigger
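Putting the two together, here is a minimal sketch, assuming the `lines` DataFrame from the earlier snippet: a running count of each distinct line, written to the console sink. Note that an unwindowed aggregation like this accepts complete and update modes but not append, because aggregated rows can still change after they are first produced.

```python
from pyspark.sql import functions as F

# Transformation: an unwindowed aggregation counting occurrences
# of each line read from the socket source above.
counts = lines.groupBy(F.col("value")).count()

# Sink: the console sink prints each micro-batch result, which is
# handy for learning and debugging. "complete" rewrites the whole
# result table every trigger; "update" would emit only changed rows.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)

query.awaitTermination()
```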