PySpark — Fix Column Header with Spaces
We often run into a situation where the Source files have header but with spaces in column names. And moving the data to different systems becomes challenging due to these limitations of spaces in column names in other systems.
First Name => first_name
We will run through a small code for fixing the column names dynamically. There will be no need to do any manual effort.
Lets jump into the example. We will start from checking the data for our example file.
%%shmore "dataset/students.csv"
Python code to fix the header and generate the list of fixed headers. We would replace the spaces with underscore “-”
# Python function to read the column name and fix the space with underscore "_"
from pyspark.sql import DataFramedef fix_header(df: DataFrame) -> list:
fixed_col_list: list = []
for col in df.columns:
fixed_col_list.append(f"`{str(col).strip()}` as {str(col).strip().replace(' ','_').lower()}")
return fixed_col_list
Now, lets read the CSV file to generate the raw DataFrame
# Read the CSV file with malformed header
raw_df = spark.read.format("csv").option("header", True).load("dataset/students.csv")
raw_df.printSchema()…