PySpark — Columnar Read Optimization

Parquet, ORC, and similar columnar file formats are widely used in Big Data systems for data storage and retrieval, and they are a popular choice for fast analytics workloads.

To learn more about the different columnar file formats, check out — http://urlit.me/t43HI

In this article we will see how to proactively optimize data read operations from these columnar files.

For our example we will use the Parquet file format, which is popular for Big Data warehousing use cases.

Without further delay, let's jump into action. Our example Parquet file is 118 MB in size.

Sales Parquet File

As usual, we will create our Python decorator for performance measurement.

Python decorator for Performance Measure
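The original screenshot is not reproduced here; a minimal sketch of such a timing decorator, assuming it simply wraps the function call with time.perf_counter() (the name get_time is a placeholder), could look like this:

```python
import time
from functools import wraps

def get_time(func):
    """Print the wall-clock time taken by the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Time taken by {func.__name__}: {elapsed:.4f} seconds")
        return result
    return wrapper
```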

We will use the "noop" write format for performance benchmarking: it triggers the full read and computation but discards the output, so we measure only the read cost. Keep a note of the timings.
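A minimal sketch of a noop benchmark write, assuming a DataFrame named df:

```python
# Write to the "noop" data source: the full read/computation runs,
# but nothing is persisted, so the timing reflects the read path only.
df.write.format("noop").mode("overwrite").save()
```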

Method 1 — Let's read the data without specifying any schema, allowing Spark to infer the schema on the fly.

Reading without specifying Schema
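A sketch of this step, assuming the get_time decorator from above, a SparkSession named spark, and a placeholder file path such as dataset/sales.parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Columnar Read Optimization").getOrCreate()

@get_time
def read_without_schema():
    # Spark scans the Parquet metadata to infer the schema before reading.
    df = spark.read.format("parquet").load("dataset/sales.parquet")
    df.write.format("noop").mode("overwrite").save()

read_without_schema()
```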

Method 2 — Reading data with schema specified

Reading with Schema specified
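A sketch with an explicit schema; the column names and types below are assumptions for illustration only and should be adjusted to the actual dataset:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType, LongType)

# Hypothetical sales schema -- replace with the real columns of the file.
sales_schema = StructType([
    StructField("transacted_at", TimestampType(), True),
    StructField("trx_id", LongType(), True),
    StructField("retailer_id", LongType(), True),
    StructField("description", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("city_id", LongType(), True),
])

@get_time
def read_with_schema():
    # Supplying the schema up front skips schema inference entirely.
    df = spark.read.format("parquet").schema(sales_schema).load("dataset/sales.parquet")
    df.write.format("noop").mode("overwrite").save()

read_with_schema()
```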

Method 3 — Reading data with only the required columns, not all columns

Reading with only required column
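Because Parquet is columnar, a schema containing only the needed columns means Spark reads just those column chunks. A sketch, again with assumed column names:

```python
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# Schema restricted to the columns we actually need.
required_schema = StructType([
    StructField("trx_id", LongType(), True),
    StructField("amount", DoubleType(), True),
])

@get_time
def read_required_columns():
    df = spark.read.format("parquet").schema(required_schema).load("dataset/sales.parquet")
    df.write.format("noop").mode("overwrite").save()

read_required_columns()
```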

Method 4 — Reading data with required columns using select()

Reading with select
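A sketch of the same idea using select(); the column names are placeholders:

```python
@get_time
def read_with_select():
    # Column pruning via select(): Spark pushes the projection down to the Parquet scan.
    df = spark.read.format("parquet").load("dataset/sales.parquet").select("trx_id", "amount")
    df.write.format("noop").mode("overwrite").save()

read_with_select()
```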

Method 5 — Reading required columns using drop() to remove unnecessary columns

Reading with drop
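A sketch using drop(); the list of dropped columns below is hypothetical:

```python
@get_time
def read_with_drop():
    # Dropping unwanted columns: Spark still prunes the scan to the remaining columns.
    df = (spark.read.format("parquet").load("dataset/sales.parquet")
          .drop("transacted_at", "retailer_id", "description", "city_id"))
    df.write.format("noop").mode("overwrite").save()

read_with_drop()
```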

Note: The above results will change with the number of columns and the size of the dataset. As both increase, the timing differences become more significant.

Conclusion — As demonstrated, if we specify a schema containing only the required columns during columnar reads, the read time can be cut by a huge margin. We can either specify the schema directly or use select()/drop() to achieve the same effect.
