PySpark — DataFrame Joins on Multiple Conditions

We often run into situations where we have to join two Spark DataFrames on multiple conditions, and those conditions can be complex and may change with requirements. In this post we will work through a simple trick that makes our join conditions much simpler to write and maintain.


Generate a SparkSession to start.

Spark Session

Now, let's create our employee and department DataFrames for the explanation. Both DataFrames share the department id as the common joining column.

Employee and Department data frames

To keep things simple, we will walk through a few joining scenarios.

Condition 1: Join Employee and Department based on dept id and active record = 1

We will create a list with all the joining conditions and use it while joining the datasets.

Condition 1

Note: All conditions specified in the list are automatically combined with logical AND. So the joining condition translates to SQL as: df_emp.dept_id = df_dept.id AND df_emp.active = 1

Looks cleaner and easier, right? Let's look at a more complex example now.

Condition 2: Join Employee and Department based on dept id and active record = 1 or salary > 5000

We will create a new list with the join conditions specified.

Condition 2

The logical OR (|) operator is used within the second condition. Translated to SQL, this becomes: df_emp.dept_id = df_dept.id AND ((df_emp.active = 1) OR (df_emp.salary > 5000))

Note: The parentheses "()" are very important when you combine more than one condition, because in Python the | and & operators bind more tightly than comparison operators.

What if we only need the first condition from the condition list (join_cond)? See the next example.

Condition 3: Join Emp and Dept based on dept id and active record = 1

We can reuse the join condition list above and use list indexing to pick out the conditions we need. For the first condition we use index 0 (i.e. df_emp.dept_id == df_dept.id).

Condition 3

Since we ignored the other conditions and joined on dept id only, we get all four records from employee.

Is it possible to take the first two conditions from the list? Yes, with list slicing.

Condition 4: Join both datasets based on the first two conditions — dept id and active = 1

Condition 4

We are simply using Python indexing and slicing to select the conditions required for joining the tables.

Conclusion: Using a Python list to specify join conditions makes our code much simpler to maintain and understand, and far more flexible to use.

Make sure to Like and Subscribe.

Check out the Ease With Data YouTube channel: https://www.youtube.com/@easewithdata

Check out the iPython Notebook on GitHub — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/35_join_on_multiple_conditions.ipynb

Check out my personal blog — https://urlit.me/blog/

Check out the PySpark Medium series — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30


⚒️ Senior Data Engineer with 10+ YOE | 📽️ YouTube channel: https://www.youtube.com/@easewithdata | 📞 TopMate : https://topmate.io/subham_khandelwal
