PySpark — Flatten JSON/Struct Data Frame dynamically

Flatten JSON data dynamically in tabular structure

Subham Khandelwal
3 min readOct 7, 2022

We always have use cases where we have to flatten the complex JSON/Struct Data Frame into flattened simple Data Frame just like the example below:

example.this.that => example_this_that

Flatten JSON/Struct Data Frame Data

Following code snippet does the exact job dynamically. No manual effort required to expand the data structure or to determine the schema.

Lets first create an example Data Frame for the job

# Lets create an Example Data Frame to hold JSON data# Example Data Frame with column having JSON data
_data = [
['EMP001', '{"dept" : "account", "fname": "Ramesh", "lname": "Singh", "skills": ["excel", "tally", "word"]}'],
['EMP002', '{"dept" : "sales", "fname": "Siv", "lname": "Kumar", "skills": ["biking", "sales"]}'],
['EMP003', '{"dept" : "hr", "fname": "MS Raghvan", "skills": ["communication", "soft-skills"], "hobbies" : {"cycling": "expert", "computers":"basic"}}']
]
# Columns for the data
_cols = ['emp_no', 'raw_data']
# Lets create the raw Data Frame
df_raw = spark.createDataFrame(data = _data, schema = _cols)
# Determine the schema of the JSON payload from the column

--

--