PySpark — User Defined Functions vs Higher Order Functions

Subham Khandelwal
4 min read · Oct 10, 2022

In Spark structured DataFrame manipulations, we often reach for UDFs, i.e. User Defined Functions, for complex calculations because they give us more flexibility in writing the logic.

But Python UDFs run into performance bottlenecks when dealing with huge volumes of data, due to serialization and de-serialization. To Spark, a Python UDF is a complete black box over which it has no control, so it cannot apply any optimization to it at the Logical or Physical plan level.
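For illustration, such a UDF might look like the sketch below (the function name total_name_length and the array column cities are assumptions for this post, not fixed names); every row it touches has to be shipped to a Python worker and back.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Hypothetical Python UDF: sums the lengths of all city names in an array column.
# Spark cannot look inside this function, so each row is serialized to a Python
# worker, processed there, and de-serialized back; that round trip is the overhead.
@F.udf(returnType=IntegerType())
def total_name_length(cities):
    return sum(len(c) for c in cities) if cities else 0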

To avoid such bottlenecks, it is always recommended to use Higher Order Functions wherever possible for complex calculations.

Spark provides many popular higher order functions such as filter, transform, exists, aggregate etc., which can be chained to achieve the result in an optimized way.
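As a rough sketch of that chaining (the column name cities is assumed here, and the Python wrappers for these functions need Spark 3.1+; on older versions the same logic can be written as a SQL expression via F.expr):

from pyspark.sql import functions as F

# transform maps each city name to its length, aggregate then folds the
# resulting array into a single sum; everything runs inside the JVM and
# stays visible to the Catalyst optimizer.
total_length_col = F.aggregate(
    F.transform("cities", lambda c: F.length(c)),
    F.lit(0),
    lambda acc, x: acc + x,
)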

Let's check this in action. We will create an example data frame with city names and try to calculate the total length of all the city names together.

# Example Data Frame
_data = [
    [1, ["Bangalore", "Mumbai", "Pune", "Indore"]],
    [2, ["Bangalore"]],
    [3, []],
    [4, ["Kolkata", "Bhubaneshwar"]],
    [5, ["Bangalore", "Mumbai", "Pune", "Indore", "Ahmedabad", "Suratkal"]],
    [6, ["Delhi", "Mumbai", "Kolkāta", "Bangalore", "Chennai", "Hyderābād", "Pune", "Ahmedabad", "Sūrat", "Lucknow", "Jaipur", "Cawnpore", "Mirzāpur", "Nāgpur", "Ghāziābād", "Indore", "Vadodara", "Vishākhapatnam", "Bhopāl", "Chinchvad", "Patna", "Ludhiāna"]],  # … list truncated in the original
]
