
PySpark — The Tiny File Problem

Subham Khandelwal
3 min read · Oct 14, 2022


The tiny file problem is one of the most well-known problems in any distributed system. Spark's file read performance is often dictated by the overhead it incurs to open and read each file.

But why are tiny files a problem? The major reason is overhead. If finishing a job means paying an extra fixed cost for every single file, that cost adds up quickly.

What are these overheads? Check out the parameters in the Spark documentation: https://spark.apache.org/docs/latest/sql-performance-tuning.html
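Two of the parameters on that page matter most for tiny files. The sketch below simply prints their default values; the SparkSession setup is illustrative and not part of the original article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TinyFileProblem").getOrCreate()

# Maximum number of bytes packed into a single read partition (default 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Estimated cost of opening a file, expressed as the number of bytes that could
# be scanned in the same time (default 4 MB). With thousands of tiny files,
# this open cost is paid thousands of times.
print(spark.conf.get("spark.sql.files.openCostInBytes"))
Read-side file parameters that influence partition planning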

Let's understand the problem with an example. We will use the noop write format for performance benchmarking, since it executes the full read without the cost of writing any output.

# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators, check out:
# https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time)*1000} ms"
    # The decorated function is executed immediately and its timing printed
    print(inner_get_time())
Python Decorator to measure timings
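To see where the decorator fits, here is a sketch of how it could be combined with the noop format to time a read over a folder of tiny files. The dataset path is hypothetical, and because the decorator calls the function immediately, the benchmark runs as soon as the function is defined.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TinyFileProblem").getOrCreate()

@get_time
def read_tiny_files():
    # Hypothetical dataset split across thousands of small Parquet files
    df = spark.read.format("parquet").load("datasets/sales_tiny_files/")
    # noop sink: triggers a full read/execution without writing anything,
    # so the measured time reflects only the read overhead
    df.write.format("noop").mode("overwrite").save()
Benchmarking a read over tiny files with the noop format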

If you are new to Spark, PySpark, or Databricks, or simply want to learn more, I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube channel, Ease With Data.
