PySpark: DataFrame Joins on Multiple Conditions
Mar 3, 2023
We often run into situations where we need to join two Spark DataFrames on multiple conditions, and those conditions can be complex and may change as requirements evolve. In this article, we will work through a simple hack that makes such join conditions much simpler to write and maintain.
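To see the problem, a join on several conditions is typically written inline by chaining column expressions with &, which quickly gets verbose as conditions grow. A rough sketch of that pattern, using the emp_df and dept_df DataFrames we create below:
# Inline multi-condition join (illustrative)
joined_df = emp_df.join(
    dept_df,
    (emp_df["dept_id"] == dept_df["id"]) & (emp_df["active"] == 1),
    how="inner"
)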
To start, create a SparkSession.
# Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Join Condition") \
    .master("local[*]") \
    .getOrCreate()

spark
Now, let's create the employee and department DataFrames for this walkthrough. Both DataFrames share a department ID as the common join column.
# Create datasets

# Employee dataset
_emp_data = [
    ["Subham", "D01", 5000, 1],
    ["Rakesh", "D02", 6000, 0],
    ["Rohit", "D02", 7000, 1],
    ["Dinesh", "D03", 10000, 1]
]

# Employee schema (column names)
_emp_schema = ["name", "dept_id", "salary", "active"]

# Department dataset
_dept_data = [
    ["D01", "Economics"],
    ["D02", "Science"],
    ["D03", "Arts"]
]

# Department schema (column names)
_dept_schema = ["id", "name"]
# Create Employee and Department dataframes
# Employee Dataframe…
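# A minimal sketch, assuming the standard spark.createDataFrame API;
# the variable names emp_df and dept_df are illustrative.
emp_df = spark.createDataFrame(data=_emp_data, schema=_emp_schema)
emp_df.show()

# Department Dataframe
dept_df = spark.createDataFrame(data=_dept_data, schema=_dept_schema)
dept_df.show()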