Member-only story

PySpark —Data Frame Joins on Multiple conditions

4 min readMar 3, 2023

We often run into situations where we have to join two Spark Data Frames on multiple conditions and those conditions can be complex and may change as per requirement. We will work on a simple hack that will make our join conditions way much more effective and simpler to use.

Generate a SparkSession to start.

# Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Join Condition") \
    .master("local[*]") \
    .getOrCreate()

spark

Now, lets create our employee and department data frames for explanation. Both data frames has department ids as common joining column.

# Create dataset
# Employee dataset
_emp_data = [
    ["Subham", "D01", 5000, 1],
    ["Rakesh", "D02", 6000, 0],
    ["Rohit", "D02", 7000, 1],
    ["Dinesh", "D03", 10000, 1]
]

# Employee schema
_emp_schema = ["name", "dept_id", "salary", "active"]

# Department dataset
_dept_data = [
    ["D01", "Economics"],
    ["D02", "Science"],
    ["D03", "Arts"]
]

# Department schema
_dept_schema = ["id", "name"]

# Create Employee and Department dataframes
# Employee Dataframe…

PySpark —Data Frame Joins on Multiple conditions

Written by Subham Khandelwal

No responses yet