PySpark — Count(1) vs Count(*) vs Count(col_name)

Subham Khandelwal
3 min read · Oct 20, 2022

We are often unsure which is the right way to get a count from a table, even though none of the options is wrong. But which one is the most efficient? Is it count(1), count(*), or count(col_name)?

Both count(1) and count(*) return the total number of records, whereas count(col_name) returns the number of records in which that column is NOT NULL.
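To make the difference concrete, here is a minimal sketch of the three forms on a tiny table with NULLs. It uses Python's built-in sqlite3 module as a stand-in so it runs without a Spark session; the table name `sales` and its columns are made up for illustration. Spark SQL applies the same NULL rule to count(col_name).

```python
import sqlite3

# In-memory database; `sales`, `id` and `city` are hypothetical names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "Pune"), (2, None), (3, "Delhi"), (4, None)],
)

# COUNT(1) and COUNT(*) count every row; COUNT(city) skips NULL cities
count_one, count_star, count_city = conn.execute(
    "SELECT COUNT(1), COUNT(*), COUNT(city) FROM sales"
).fetchone()

print(count_one, count_star, count_city)  # 4 4 2
conn.close()
```

Four rows go in, two of them with a NULL city, so the first two counts are 4 and the column count is 2.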

Spark has its own way of handling these cases. As usual, let's check this out with an example.

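We will create a Python decorator and use the "noop" format for performance benchmarking. A write to the noop format runs the full query but discards the output, so the timing reflects execution rather than the cost of writing to storage.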

# Let's create a simple Python decorator - {get_time} to get the execution timings
# If you don't know about Python decorators - check out:
# https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return (f"Execution time: {(end_time - start_time)*1000} ms")
    # Runs func once, at decoration time, and prints the timing
    print(inner_get_time())
Python decorator for Performance measure
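Note that get_time calls the function once at decoration time and returns nothing, so the decorated name cannot be called again afterwards. If a reusable timer is preferred, one possible variant is sketched below. This is an assumption for illustration, not the author's method: the names `timed` and `sum_of_squares` are hypothetical, and a plain Python loop stands in for the Spark workload.

```python
import functools
import time


def timed(func):
    """Hypothetical variant of get_time that returns a reusable wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # perf_counter is a monotonic clock meant for measuring intervals
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{func.__name__} execution time: {elapsed_ms:.2f} ms")
        return result
    return wrapper


@timed
def sum_of_squares(n):
    # Stand-in workload; in a Spark job this would be an action
    return sum(i * i for i in range(n))


total = sum_of_squares(1000)
```

In a Spark benchmark, the body of the timed function would typically be an action such as a write to the noop format, for example `df.write.format("noop").mode("overwrite").save()`.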

Our example dataset.

# Let's read the dataframe to check the data
df =…
