PySpark — The Tiny File Problem

Subham Khandelwal
3 min read · Oct 14, 2022

The tiny file problem is one of the best-known problems in any data system. Spark's file read performance is often determined by the overhead it incurs to read each individual file.

But why are tiny files a problem? The major reason is overhead. When a job has to pay an extra cost every single time it touches a file, that's a problem.

What are these overheads? Check out the parameters in the Spark documentation: https://spark.apache.org/docs/latest/sql-performance-tuning.html
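As a quick orientation (a minimal sketch, assuming an already created SparkSession named spark), the two settings most relevant to file reads can be inspected and tuned directly:

# Inspect the file-read tuning parameters from the documentation above
# spark.sql.files.maxPartitionBytes - max bytes packed into a single read partition (default 128MB)
# spark.sql.files.openCostInBytes   - estimated cost to open a file, in bytes (default 4MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
print(spark.conf.get("spark.sql.files.openCostInBytes"))

# Both can be overridden per session if needed
spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB")

With many tiny files, the per-file open cost dominates the actual bytes read, which is exactly the overhead the benchmark below makes visible.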

Let's understand the problem with an example. We will use the noop format for performance benchmarking: it executes the full read plan but skips the actual write, so we measure only the read side.
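The snippets below assume a running SparkSession named spark. A minimal local setup (the app name and master URL here are illustrative) would look like this:

# Minimal SparkSession setup assumed by the snippets below
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Tiny File Problem")   # illustrative app name
    .master("local[*]")             # run locally using all available cores
    .getOrCreate()
)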

# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators, check out:
# https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time)*1000} ms"
    print(inner_get_time())
Python Decorator to measure timings

Let us first read a single, non-partitioned Parquet file of approx. 210 MB.

# Let's read the single parquet file directly
@get_time
def x(): spark.read.parquet("dataset/sales.parquet").write.format("noop").mode("overwrite").save()
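To make the contrast concrete, the same data could be exploded into thousands of small files and read back with the identical noop benchmark. This is a hypothetical sketch: the path dataset/sales_tiny.parquet and the partition count 5000 are made up for illustration, not taken from the article's dataset.

# Hypothetical comparison: write the same data out as thousands of tiny files,
# then benchmark reading them back through the same noop sink
spark.read.parquet("dataset/sales.parquet") \
    .repartition(5000) \
    .write.mode("overwrite") \
    .parquet("dataset/sales_tiny.parquet")   # path and partition count are illustrative

@get_time
def y(): spark.read.parquet("dataset/sales_tiny.parquet").write.format("noop").mode("overwrite").save()

The difference in the two timings comes from the per-file open and scheduling overhead, which is the essence of the tiny file problem.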
