PySpark — Read Compressed gzip files

Read Compressed files directly in Spark

Subham Khandelwal
3 min read · Oct 16, 2022

Spark natively supports reading gzip-compressed files directly into DataFrames. We just have to set the compression option accordingly to make it work.


But there is a catch. Gzip is not a splittable format, so Spark uses only a single core (a single task) to read the whole gzip file — there is no distribution or parallelization, and the data lands in a single partition. If the gzip file is large, this can cause Out of Memory errors.
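Why can't two tasks read the same gzip file at different offsets? Because a gzip (DEFLATE) stream can only be decompressed sequentially from the first byte. A quick standard-library sketch illustrates this (the file name and row contents here are illustrative, not from the article):

```python
import gzip
import os
import tempfile

# Write some sample rows into a gzip file
path = os.path.join(tempfile.mkdtemp(), "sales.csv.gz")
with gzip.open(path, "wt") as f:
    for i in range(1000):
        f.write(f"{i},product_{i}\n")

# The compressed bytes are opaque: to reach row 500, a reader must
# first decompress rows 0..499 — there is no random access into the
# stream, which is exactly why Spark cannot split the file across tasks.
with gzip.open(path, "rt") as f:
    rows = f.readlines()

print(len(rows))  # all 1000 rows, recovered strictly in order
```

Splittable formats (plain text, bzip2, or columnar formats like Parquet) do not have this limitation, which is why they parallelize out of the box.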

Now if you are new to Spark, PySpark or want to learn more — I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube Channel — Ease With Data

YouTube — Tutorials

Let's check that with an example. We will read the sales.csv.gz file.

# Read gzipped file directly from Spark
df_zipped = spark \
.read \
.format("csv") \
.option("compression", "gzip") \
.option("header", True) \
.load("dataset/tmp/sales.csv.gz")
Read the gzip file into a DataFrame

Now, if we check the distribution

# Check the number of partitions
df_zipped.rdd.getNumPartitions()
Number of partitions — the whole gzip file lands in a single partition

So, what is the best possible solution? The answer is simple: if we have a gzip file/dataset large enough to cause memory errors, just unzip it first and then read it into a DataFrame.

%%sh
gzip -d dataset/tmp/sales.csv.gz
ls -lhtr dataset/tmp/
Unzipped file

Read the un-zipped file

# Read un-zipped file directly from Spark
df_unzipped = spark \
.read \
.format("csv") \
.option("header", True) \
.load("dataset/tmp/sales.csv")
Reading the unzipped file

Checking the distribution

# Check the number of partitions
df_unzipped.rdd.getNumPartitions()
Unzipped distribution — the file is now split across multiple partitions

But if the file/dataset is small enough to fit and distribute without errors, just go ahead and read it directly into a DataFrame, then repartition according to your requirement.

Check out the iPython Notebook on GitHub —

Check out the PySpark Series on Medium —

Wish to Buy me a Coffee: Buy Subham a Coffee