PySpark — Count(1) vs Count(*) vs Count(col_name)

We are often unsure which way to get a count from a table is correct, even though none of them is wrong. But which one is the most efficient: count(1), count(*), or count(col_name)?


Both count(1) and count(*) give you the total count of records, whereas count(col_name) gives you the count of NOT NULL records in that column.
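To make that difference concrete, here is a minimal sketch (the session and the two-row dataset are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.getOrCreate()

# Two rows, one with a NULL city
demo = spark.createDataFrame([(1, "NY"), (2, None)], ["trx_id", "city"])

demo.select(
    count("*").alias("count_star"),     # 2 -- counts every row
    count("city").alias("count_city"),  # 1 -- skips the NULL city
).show()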

Spark has its own way of dealing with this situation. As usual, let's check it out with an example.

We will create a Python decorator and use the "noop" write format for performance benchmarking.

import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time) * 1000} ms"
    # Run the decorated function immediately and print its execution time
    print(inner_get_time())
Python decorator for performance measurement

Our example dataset.
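The notebook's dataset is shown only as an image, so here is a hypothetical stand-in with the two columns used later (trx_id and a nullable city_id), plus the imports the benchmark cells rely on:

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lit

spark = SparkSession.builder.appName("count-comparison").getOrCreate()

# Hypothetical stand-in data; the notebook's dataset is larger.
# city_id is nullable so that count(city_id) can differ from count(1)/count(*).
data = [
    (101, 1), (101, 2), (101, None),
    (102, 3), (102, None), (103, 4),
]
df = spark.createDataFrame(data, ["trx_id", "city_id"])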

df.show()
Example Dataset

Now, to test the performance of the different ways of counting, let's group by trx_id.

COUNT(1) performance

@get_time
def x(): df.groupBy("trx_id").agg(count(lit(1))).write.format("noop").mode("overwrite").save()

COUNT(*) performance

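The original snippet here is a screenshot; assuming it follows the same pattern as the count(1) cell, it would look like:

@get_time
def x(): df.groupBy("trx_id").agg(count("*")).write.format("noop").mode("overwrite").save()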

COUNT(col_name) performance

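Again assuming the same pattern, this time counting the nullable city_id column:

@get_time
def x(): df.groupBy("trx_id").agg(count("city_id")).write.format("noop").mode("overwrite").save()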

Now, if you keep an eye on the execution times, count(1) and count(*) have almost the same performance, while count(col_name) is slightly slower.

But why is that? The only way to find out is to check the explain plans for all three and understand what's happening under the hood.
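The plans in the original post are screenshots; one way to reproduce them yourself is to call explain() on each aggregation:

df.groupBy("trx_id").agg(count(lit(1))).explain()
df.groupBy("trx_id").agg(count("*")).explain()
df.groupBy("trx_id").agg(count("city_id")).explain()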

Explain Plan for COUNT(1)
Explain Plan for COUNT(*)
Explain Plan for COUNT(city_id)

If you look at the aggregate function in the explain plans, count(1) and count(*) produce exactly the same plan (function=[count(1)]) with no change at all, whereas for count(city_id) Spark applies function=[count(city_id)], which has to iterate over the column to check for NULL values.

Note: These results might vary with the complexity of the implementation, but in typical cases they should match what is demonstrated here.

Conclusion: count(1) and count(*) have almost identical performance (their explain plans are the same), while count(col_name) can take a slight performance hit. That's OK, since sometimes we need the correct count based on a specific column.

So, next time you can choose the right one for your implementation, or answer the interviewer's question wisely :)

Check out the iPython Notebook on GitHub — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/18_count_explain_plan_comp.ipynb

Check out my personal blog — https://urlit.me/blog/

Check out the PySpark Medium series — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30

Wish to Buy me a Coffee: Buy Subham a Coffee
