PySpark — The Tiny File Problem
The tiny file problem is one of the best-known performance problems in distributed data processing. Spark's file read performance is largely determined by the overhead it incurs to open and read each file.
But why are tiny files a problem? The main reason is overhead: when a job has to pay a fixed extra cost for every file it touches, thousands of tiny files make those costs add up quickly.
What are these overheads? Check out the relevant parameters in the Spark documentation: https://spark.apache.org/docs/latest/sql-performance-tuning.html
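For illustration, here is a minimal sketch of how you could inspect the two parameters most relevant to the tiny file problem from a running session. The session setup below (app name, local master) is an assumption for this sketch, not part of the original post; the printed values are Spark's defaults unless you have overridden them.
from pyspark.sql import SparkSession
# Hypothetical local session, just to inspect the configuration defaults
spark = SparkSession.builder.appName("tiny-files-demo").master("local[*]").getOrCreate()
# Maximum number of bytes packed into a single partition when reading files (default 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
# Estimated cost, in bytes, to open a file; each small file is "charged" at least this much,
# so thousands of tiny files inflate the scan cost (default 4 MB)
print(spark.conf.get("spark.sql.files.openCostInBytes"))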
Let's understand the problem with an example. We will use the noop format for performance benchmarking, since it executes the full read without actually writing any output.
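As a quick illustration of what a noop benchmark looks like (the DataFrame and source path below are placeholders, and the spark session from the sketch above is reused), the write triggers the complete read and execution plan but discards the result, so the measured time reflects the read overhead rather than the sink:
# Placeholder source path for this sketch
df = spark.read.format("parquet").load("/path/to/source/")
# noop: runs the full job but writes nothing
df.write.format("noop").mode("overwrite").save()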
# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators, check out: https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    # Runs the wrapped function once, measures the elapsed wall-clock time
    # and prints it. Note: this executes immediately when the function is decorated.
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time) * 1000} ms"
    print(inner_get_time())
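Putting the pieces together, here is a hedged sketch of how the decorator might be used to compare reading many tiny files against the same data compacted into one large file. The directory paths are placeholders (not from the original post), and the spark session from earlier is assumed to exist. Because get_time calls the wrapped function right away, each timing is printed as soon as its function is defined.
@get_time
def time_many_small_files():
    # Placeholder path pointing at a folder containing thousands of tiny Parquet files
    df = spark.read.format("parquet").load("/data/tiny_files/")
    df.write.format("noop").mode("overwrite").save()

@get_time
def time_single_large_file():
    # Placeholder path pointing at the same data compacted into a single large file
    df = spark.read.format("parquet").load("/data/single_file/")
    df.write.format("noop").mode("overwrite").save()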
Now if you are new to Spark, PySpark, Databricks or want to learn more — I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube Channel — Ease With Data