PySpark — Merge DataFrames with Different Schemas
When consolidating data from multiple systems, we often need to merge DataFrames that don't have the same columns, or whose columns appear in a different order.
union() and unionByName() are the two well-known methods that come into play when we want to merge two DataFrames. But there is a small catch.
union() matches columns by position, so both DataFrames must have the same columns in the same order. unionByName(), on the other hand, does the same job but matches columns by name. Either way, as long as both DataFrames have the same set of columns, we can merge them easily.
Let's see this in action. First, we will create our example DataFrames.
# Example DataFrame 1
_data = [
["C101", "Akshay", 21, "22-10-2001"],
["C102", "Sivay", 20, "07-09-2000"],
["C103", "Aslam", 23, "04-05-1998"],
]
_cols = ["ID", "NAME", "AGE", "DOB"]

df_1 = spark.createDataFrame(data = _data, schema = _cols)
df_1.printSchema()
df_1.show(10, False)
# Example DataFrame 2
_data = [
["C106", "Suku", "Indore", ["Maths", "English"]],
["C110", "Jack", "Mumbai", ["Maths", "English", "Science"]],
["C113", "Gopi", "Rajkot", ["Social Science"]],
]
_cols = ["ID", "NAME", "ADDRESS", "SUBJECTS"]

df_2 = spark.createDataFrame(data = _data, schema = _cols)
df_2.printSchema()
df_2.show(10, False)