PySpark — Distributed Broadcast Variable

Spark supports distributed, immutable variables that can be shared efficiently across the cluster without being wrapped inside a function closure. The broadcast variable is the best example.

Broadcasting (image-credit: Wikipedia)

The simplest way to use a variable is to reference it in a function and let Spark ship it to each task along with the code. That is inefficient for large variables such as lookup tables, because the value is serialized with every task and deserialized on the worker nodes many times. Broadcast variables are the remedy.

Broadcast variables are sent once and cached on every node instead of being shipped with each task, which avoids the repeated serialization cost and the resulting performance hit in large systems.
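To make the difference concrete, here is a minimal sketch that does the same lookup both ways. It assumes `sc` is an existing SparkContext (for example `spark.sparkContext` on a running session); the `COUNTRY_NAMES` table and both function names are made up for illustration.

```python
# Hypothetical lookup table; imagine it holding millions of entries.
COUNTRY_NAMES = {"IN": "India", "US": "United States", "DE": "Germany"}


def names_via_closure(sc, codes):
    """The lookup is captured by the lambda, so it is shipped with every task."""
    lookup = COUNTRY_NAMES
    return sc.parallelize(codes).map(lambda c: lookup.get(c, "unknown")).collect()


def names_via_broadcast(sc, codes):
    """The lookup is broadcast once and cached on each executor."""
    bc_lookup = sc.broadcast(COUNTRY_NAMES)
    # Tasks carry only the small broadcast handle; .value is read on the worker.
    return (
        sc.parallelize(codes)
        .map(lambda c: bc_lookup.value.get(c, "unknown"))
        .collect()
    )
```

Both functions return the same result. The only change is how the table reaches the workers, which starts to matter once the table is large or the job has many tasks.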

How do we use this functionality? Let's check it out with a simple example.

For our use case we supply two broadcast variables, department name and establishment year, which will be used with our Students data.

Let's generate our example dataset.

Students Data frame
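A small stand-in for the Students data could look like the sketch below. The rows, the column names and the helper `make_students_df` are illustrative assumptions, not the original dataset, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical sample rows: (student id, name, department code).
STUDENTS = [
    (1, "Asha", "CS"),
    (2, "Ben", "EE"),
    (3, "Chen", "ME"),
    (4, "Dara", "CS"),
]
STUDENT_COLUMNS = ["id", "name", "dept_code"]


def make_students_df(spark):
    """Build the Students DataFrame on an existing SparkSession."""
    # Column types are inferred from the tuples; only the names are given.
    return spark.createDataFrame(STUDENTS, schema=STUDENT_COLUMNS)
```

With a live session, `make_students_df(spark).show()` displays the four rows.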

Our Broadcast variables

Broadcast variables
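The two broadcast variables can be created roughly as follows. The dictionaries and their values are invented for illustration and `broadcast_departments` is a hypothetical helper; the only Spark call involved is `SparkContext.broadcast`.

```python
# Hypothetical lookups keyed by department code.
DEPT_NAMES = {
    "CS": "Computer Science",
    "EE": "Electrical Engineering",
    "ME": "Mechanical Engineering",
}
DEPT_YEARS = {"CS": 1995, "EE": 1987, "ME": 1979}


def broadcast_departments(spark):
    """Return broadcast handles for the department name and year lookups."""
    sc = spark.sparkContext
    # Each call ships the dictionary to the executors once and returns a handle.
    return sc.broadcast(DEPT_NAMES), sc.broadcast(DEPT_YEARS)
```

The returned handles expose the data through `.value`, on the driver as well as on the workers, so `bc_names.value["CS"]` gives back the department name.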

Now, let's use the broadcast variables to look up data for our Students data frame.

Final Data Frame
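Putting the pieces together, one way to produce such a final data frame is sketched below. It maps over the underlying RDD, which is an assumption on my part (a UDF on the DataFrame would work too). The sample data and the names `enrich` and `build_final_df` are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical sample data, kept small so the block stands on its own.
STUDENTS = [(1, "Asha", "CS"), (2, "Ben", "EE"), (3, "Chen", "ME")]
DEPT_NAMES = {
    "CS": "Computer Science",
    "EE": "Electrical Engineering",
    "ME": "Mechanical Engineering",
}
DEPT_YEARS = {"CS": 1995, "EE": 1987, "ME": 1979}
FINAL_COLUMNS = ["id", "name", "dept_code", "dept_name", "established"]


def enrich(student, dept_names, dept_years):
    """Append department name and establishment year to one student row."""
    student_id, name, dept_code = student
    return (
        student_id,
        name,
        dept_code,
        dept_names.get(dept_code),  # None if the code is unknown
        dept_years.get(dept_code),
    )


def build_final_df(spark):
    """Join the students with the broadcast lookups, without a shuffle."""
    sc = spark.sparkContext
    bc_names = sc.broadcast(DEPT_NAMES)
    bc_years = sc.broadcast(DEPT_YEARS)
    # .value is evaluated inside the task, so it reads the executor's cached copy.
    enriched = sc.parallelize(STUDENTS).map(
        lambda s: enrich(s, bc_names.value, bc_years.value)
    )
    return spark.createDataFrame(enriched, schema=FINAL_COLUMNS)
```

Each row is resolved locally on the worker against the cached dictionaries, so there is no join and no data movement beyond the one-time broadcast.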

Spark transfers a broadcast variable in blocks. The default block size is 4 MB (4m), which can be changed through the spark.broadcast.blockSize parameter.
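A sketch of overriding that setting when the session is created is shown below. The helper names and the 8 MB figure are arbitrary choices for illustration, and the setting has to be supplied before the SparkContext starts.

```python
def broadcast_conf(block_size_mb):
    """Return the Spark setting for a broadcast block size given in MB."""
    if block_size_mb <= 0:
        raise ValueError("block size must be positive")
    return {"spark.broadcast.blockSize": f"{block_size_mb}m"}


def build_session(conf):
    """Create a local SparkSession with the given settings applied."""
    # Imported here so broadcast_conf can be used without a Spark install.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[2]").appName("broadcast-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session(broadcast_conf(8))` would start a session that uses 8 MB blocks. The same key can also be passed to `spark-submit` with `--conf`.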

The current example might not justify the use of a broadcast variable, but consider a case where you are dealing with a large machine learning model that needs to be distributed to every node for performance.
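As a rough sketch of that scenario, the block below broadcasts a toy "model" (just a dictionary of weights) and scores rows partition by partition. A real model object would be broadcast the same way as long as it can be pickled. The weights and the names `predict` and `score_all` are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
# Toy stand-in for a large trained model: bias plus one weight per feature.
MODEL_WEIGHTS = {"bias": 1.0, "hours": 2.0, "attendance": 0.5}


def predict(weights, features):
    """Linear score: bias + sum of weight * feature value."""
    return weights["bias"] + sum(weights[k] * v for k, v in features.items())


def score_all(sc, rows):
    """Score every feature dict in rows using a broadcast copy of the model."""
    bc_model = sc.broadcast(MODEL_WEIGHTS)

    def score_partition(partition):
        weights = bc_model.value  # looked up once per partition, not per row
        for features in partition:
            yield predict(weights, features)

    return sc.parallelize(rows).mapPartitions(score_partition).collect()
```

For example, `predict(MODEL_WEIGHTS, {"hours": 3, "attendance": 4})` evaluates to 1.0 + 6.0 + 2.0 = 9.0, and `score_all` applies the same function to every row without shipping the weights with each task.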

This example should give you a good idea of how to use it.
