PySpark — Columnar Read Optimization
Parquet, ORC, etc. are very popular columnar file formats used in Big Data for data storage and retrieval. They are a popular choice for fast analytics workloads.
To read more about the different columnar file formats, check out http://urlit.me/t43HI
We are going to see how we can proactively optimize our data read operations from those columnar files.
For our example we are going to use the Parquet file format, which is widely used for Big Data warehousing use cases.
Without delay, let's jump into action. Our example Parquet file is 118M in size.
As usual, we will create our Python decorator for performance measurement.
# Let's create a simple Python decorator - {get_time} to get the execution timings
# If you don't know about Python decorators - check out : https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return (f"Execution time: {(end_time - start_time)*1000} ms")
    # The decorated function runs right here, at decoration time, and its timing is printed
    print(inner_get_time())
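To make the decorator's behaviour concrete, here is a minimal sketch of it applied to a dummy function. `sample_job` and the `calls` list are hypothetical stand-ins (not from the original example) used only to show that the decorated function runs once, immediately, when it is decorated.

```python
import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time)*1000} ms"
    # Runs the wrapped function immediately and prints its timing
    print(inner_get_time())

calls = []  # hypothetical log, only here to show that the function body ran

@get_time
def sample_job():
    # Hypothetical workload standing in for a Spark read
    time.sleep(0.05)
    calls.append("ran")

# By this point the timing line has already been printed.
# get_time returns nothing, so the name sample_job is now bound to None.
print(sample_job)
```

Because the timing happens at decoration time, there is no need to call `sample_job()` afterwards. Each decorated block is a one-shot measurement, which suits notebook-style benchmarking.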
We are going to use the “noop” format for performance benchmarking: it runs the full read but discards the output instead of writing it anywhere. Keep a note of the timings.
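As a rough sketch of what a “noop” benchmark can look like, the helper below reads a Parquet file and pushes it through the noop sink, returning the elapsed time. The name `benchmark_read` and the path in the usage comment are assumptions for illustration, and the noop format assumes Spark 3.0 or later.

```python
import time

def benchmark_read(spark, path):
    """Read a Parquet file, send it to the noop sink, and return elapsed milliseconds.

    `spark` is expected to be an existing SparkSession; `path` points at the Parquet data.
    """
    start = time.time()
    df = spark.read.format("parquet").load(path)
    # The noop sink forces the whole read to execute but writes nothing
    df.write.format("noop").mode("overwrite").save()
    return (time.time() - start) * 1000

# Usage (hypothetical path, assumes an active SparkSession named `spark`):
# print(f"Execution time: {benchmark_read(spark, 'dataset/example.parquet')} ms")
```

Writing to noop avoids the cost of an actual output write and the driver-side overhead of something like `collect()`, so the timing mostly reflects the read itself.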