PySpark — Distributed Broadcast Variable
Spark provides distributed, immutable variables that can be shared efficiently across the cluster without wrapping them inside a function, and the broadcast variable is the best example of this.
The simplest way to use a variable is to reference it in a function and let Spark ship it to each task along with the code. This becomes inefficient for large variables, such as lookup tables, because they have to be serialized and deserialized on the worker nodes over and over again. Broadcast variables are the remedy.
Broadcast variables are shared and cached once on every node instead of being sent with each task, which avoids the repeated serialization cost and the performance hit it causes on large clusters.
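In API terms this is a two-step pattern: create the variable once on the driver with sparkContext.broadcast(), then read it inside tasks through its .value attribute. The snippet below is only a minimal sketch with placeholder names (lookup, BroadcastDemo), not the example we build later.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BroadcastDemo").getOrCreate()

# Created once on the driver; Spark ships it to and caches it on every worker node
lookup = spark.sparkContext.broadcast({"D001": "Computer Science"})

# Tasks read the cached copy through .value instead of receiving
# a fresh serialized copy with every task
codes = spark.sparkContext.parallelize(["D001", "D001", "D001"])
print(codes.map(lambda code: lookup.value[code]).collect())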
How can we use this functionality? Let's check it out with a simple example.
For our use case we will supply two broadcast variables, department name and establishment year, which will be used with our students data.
Let's generate our example dataset:
# Our example dataset
data = [
    ["Ramesh", "D001", "Apache Spark"],
    ["Siv", "D001", "C++"],
    ["Imran", "D002", "English"],
    ["Akshay", "D003", "Hindi"],
    ["Somesh", "D002", "Scala"],
    ["Hitesh", "D001", "Physics"]
]

cols = ["NAME", "DEPT_CODE", "FAV_SUBJECT"]
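To give a sense of where the broadcast variables fit in, here is a rough sketch of one way to attach a department-name and establishment-year lookup to this dataset. The dictionary values, the UDF approach, and the column names DEPT_NAME and EST_YEAR are illustrative assumptions, not necessarily how the rest of this walkthrough proceeds.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, IntegerType

spark = SparkSession.builder.appName("StudentsBroadcast").getOrCreate()

# Hypothetical lookups keyed by DEPT_CODE; the actual values are illustrative
dept_names = spark.sparkContext.broadcast(
    {"D001": "Science", "D002": "Arts", "D003": "Languages"})
est_years = spark.sparkContext.broadcast(
    {"D001": 1985, "D002": 1990, "D003": 2001})

# Build the students DataFrame from the data and cols defined above
df = spark.createDataFrame(data, cols)

# The UDFs read the broadcast values cached on each worker node
dept_name_udf = udf(lambda code: dept_names.value.get(code), StringType())
est_year_udf = udf(lambda code: est_years.value.get(code), IntegerType())

df.withColumn("DEPT_NAME", dept_name_udf("DEPT_CODE")) \
  .withColumn("EST_YEAR", est_year_udf("DEPT_CODE")) \
  .show()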