PySpark — Fix Column Header with Spaces

Subham Khandelwal
2 min readOct 31, 2022

We often run into a situation where the Source files have header but with spaces in column names. And moving the data to different systems becomes challenging due to these limitations of spaces in column names in other systems.

First Name => first_name

We will run through a small code for fixing the column names dynamically. There will be no need to do any manual effort.

Lets jump into the example. We will start from checking the data for our example file.

%%shmore "dataset/students.csv"
Example Students dataset

Python code to fix the header and generate the list of fixed headers. We would replace the spaces with underscore “-”

# Python function to read the column name and fix the space with underscore "_"
from pyspark.sql import DataFrame
def fix_header(df: DataFrame) -> list:
fixed_col_list: list = []
for col in df.columns:
fixed_col_list.append(f"`{str(col).strip()}` as {str(col).strip().replace(' ','_').lower()}")

return fixed_col_list
Python code to fix header

Now, lets read the CSV file to generate the raw DataFrame

# Read the CSV file with malformed header
raw_df = spark.read.format("csv").option("header", True).load("dataset/students.csv")
raw_df.printSchema()
raw_df.show()
Raw Data Frame

Run the Raw Data Frame though the Python code to get the header list and fix the headers in the Data Frame

# Create a new dataframe with fixed column names
fixed_headers = fix_header(df = raw_df)
print(fixed_headers)
# Apply to create the new dataframe
fixed_df = df.selectExpr(fixed_headers)
fixed_df.printSchema()
fixed_df.show()
Fixed Data Frame

This code will work with all data sources until you can read the data in Spark Data Frame and pass it through the Python Code.

Checkout the iPython Notebook on Github — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/23_Fix_Headers.ipynb

Checkout my Personal blog — https://urlit.me/blog/

Checkout the PySpark Medium Series — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30

--

--