PySpark — Data Frame Joins on Multiple Conditions

Subham Khandelwal
Mar 3, 2023

We often run into situations where we have to join two Spark Data Frames on multiple conditions, and those conditions can be complex and may change as requirements evolve. We will work through a simple hack that makes our join conditions much more effective and simpler to use.


Generate a SparkSession to start.

# Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Join Condition") \
    .master("local[*]") \
    .getOrCreate()

spark

Now, let's create the employee and department Data Frames for the explanation. Both Data Frames have the department ID as their common join column.

# Create dataset
# Employee dataset

_emp_data = [
    ["Subham", "D01", 5000, 1],
    ["Rakesh", "D02", 6000, 0],
    ["Rohit", "D02", 7000, 1],
    ["Dinesh", "D03", 10000, 1]
]

# Employee schema
_emp_schema = ["name", "dept_id", "salary", "active"]

# Department dataset
_dept_data = [
    ["D01", "Economics"],
    ["D02", "Science"],
    ["D03", "Arts"]
]

# Department schema
_dept_schema = ["id", "name"]

# Create Employee and Department dataframes
# Employee Dataframe
df_emp = spark.createDataFrame(data=_emp_data, schema=_emp_schema)

# Department Dataframe
df_dept = spark.createDataFrame(data=_dept_data, schema=_dept_schema)
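With the Data Frames in place, here is a minimal sketch of one such pattern: collecting the join conditions into a plain Python list. When join() receives a list of Column expressions, PySpark combines them with AND under the hood, so conditions can be added or removed without rewriting the join itself. The variable name join_cond and the "left" join type below are illustrative choices, not taken from the original.

# Build the join conditions as a list of Column expressions
# (join_cond is an illustrative name)
join_cond = [
    df_emp.dept_id == df_dept.id,  # match employee to department
    df_emp.active == 1             # keep only active employees
]

# When "on" is a list of Columns, PySpark AND-combines them
df_joined = df_emp.join(df_dept, on=join_cond, how="left")
df_joined.show()

Because join_cond is an ordinary Python list, a new condition such as df_emp.salary > 5000 can be appended, or an existing one removed, before calling join(). That is what keeps the join flexible when requirements change.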
