
PySpark — The Tiny File Problem

3 min read · Oct 14, 2022

The tiny file problem is a well-known issue in almost any data system. In Spark, file read performance is often determined by the overhead required to open and process each file, not just by the volume of data.

But why are tiny files a problem? The main reason is overhead. If every unit of work carries a fixed extra cost, then doing the same job as many tiny units instead of a few large ones becomes a problem.

What are these overheads? Check out the file-related parameters in the Spark documentation — https://spark.apache.org/docs/latest/sql-performance-tuning.html
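The two settings most relevant here are spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, both described on that page. The snippet below is a minimal sketch of how you might inspect them in a PySpark session; it is illustrative only, and the app name is an assumption.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiny-files-demo").getOrCreate()

# Maximum bytes packed into a single input partition when reading files (default 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Estimated cost, in bytes, of opening a file; a higher value makes Spark
# pack more small files into one partition (default 4 MB)
print(spark.conf.get("spark.sql.files.openCostInBytes"))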

Let's understand the problem with an example. We will use the noop write format for performance benchmarking (see the usage sketch after the decorator below).

# Let's create a simple Python decorator - get_time - to measure execution timings
# If you don't know about Python decorators, check out:
# https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time) * 1000} ms"
    # Run the decorated function immediately and print how long it took
    print(inner_get_time())
Python decorator to measure timings
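To see the overhead in practice, you can pair this decorator with a noop write, which forces Spark to read and process the data fully but discards the output, so only the read path is measured. The sketch below is illustrative: the parquet paths are hypothetical, and the @get_time usage simply follows the decorator defined above (the function runs and prints its timing as soon as it is decorated).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiny-files-benchmark").getOrCreate()

@get_time
def read_tiny_files():
    # Hypothetical path: the dataset split into thousands of small files
    df = spark.read.parquet("/data/sales_tiny_files/")
    # noop triggers the full read without writing anything anywhere
    df.write.format("noop").mode("overwrite").save()

@get_time
def read_consolidated_files():
    # Hypothetical path: the same data written as a few large files
    df = spark.read.parquet("/data/sales_consolidated/")
    df.write.format("noop").mode("overwrite").save()

Comparing the two printed timings isolates how much of the cost comes purely from opening and scheduling many tiny files.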

If you are new to Spark, PySpark, or Databricks, or want to learn more — I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube channel, Ease With Data.
