PySpark — The Tiny File Problem
The tiny file problem is one of the most well-known problems in distributed data systems. Spark file read performance is often dictated by the overhead required to open and process each individual file.
But why are tiny files a problem? The main reason is overhead. When the same fixed cost has to be paid over and over for every file a job touches, that extra effort adds up quickly.
What are these overheads? Check out the file-related parameters in the Spark documentation — https://spark.apache.org/docs/latest/sql-performance-tuning.html
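Two of the settings that matter most here are spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes. Below is a minimal sketch of how you could inspect them in a running session (the default values noted in the comments are the documented Spark defaults):

# Maximum number of bytes packed into a single partition when reading files (default: 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Estimated cost to open a file, measured in bytes that could be scanned in the same time (default: 4 MB).
# Spark adds this cost per file, so thousands of tiny files inflate the planning and scheduling work.
print(spark.conf.get("spark.sql.files.openCostInBytes"))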
Let's understand the problem with an example. We will use the noop format for performance benchmarking.
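The noop write format fully materializes the DataFrame, so the read path is exercised end to end, but nothing is actually written out, which makes it handy for isolating read performance. A minimal sketch (df here is just a placeholder DataFrame):

# Force a full read of df without producing any output files
df.write.format("noop").mode("overwrite").save()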
# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators, check out: https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    # Run the wrapped function once and measure its wall-clock execution time
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time)*1000} ms"
    # The benchmark runs at decoration time and prints the result
    print(inner_get_time())
Let us first read a single, non-partitioned Parquet file of approximately 210 MB.
# Let's read the single Parquet file directly
@get_time
def x(): spark.read.parquet("dataset/sales.parquet").write.format("noop").mode("overwrite").save()
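It can also help to see how many input partitions (and hence tasks) Spark plans for this single 210 MB file. With the default maxPartitionBytes of 128 MB you would expect roughly two partitions, though the exact number also depends on the cluster's default parallelism. A quick check, reusing the same path as above:

# Count the input partitions Spark creates for the single ~210 MB file
df = spark.read.parquet("dataset/sales.parquet")
print(df.rdd.getNumPartitions())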