PySpark — Read Compressed gzip files

Read Compressed files directly in Spark

3 min readOct 16, 2022

Spark natively supports reading gzip-compressed files directly into DataFrames. We only have to set the compression option accordingly to make it work.

Photo by JJ Ying on Unsplash

But there is a catch: gzip is not a splittable format, so Spark uses only a single core to read the whole file. There is no distribution or parallelization, and if the gzip file is large, this can lead to out-of-memory errors.
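The reason gzip cannot be split is that the format carries no block index: decompression must start from byte 0, so no task can begin reading from the middle of the file. A small sketch with Python's standard library (the sample payload is made up for illustration) shows this:

```python
import gzip
import zlib

# Build a sample gzip blob (stand-in for a large sales.csv.gz)
data = b"order_id,amount\n" * 100_000
blob = gzip.compress(data)

# Decompressing from the start works fine
assert gzip.decompress(blob) == data

# Decompressing from an arbitrary mid-file offset fails, because the
# bytes there are mid-stream deflate data, not a valid gzip header.
# This is exactly why Spark cannot hand half the file to another task.
try:
    zlib.decompress(blob[len(blob) // 2:], wbits=31)  # wbits=31 => expect gzip format
    split_ok = True
except zlib.error:
    split_ok = False

print(split_ok)  # False
```

Splittable formats such as bzip2 (or uncompressed text) avoid this limitation, which is why Spark can parallelize reads of those but not of gzip.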

Now if you are new to Spark, PySpark or want to learn more — I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube Channel — Ease With Data

Let's check that with an example. We will read the sales.csv.gz file:

# Read the gzip-compressed file directly with Spark
df_zipped = spark \
    .read \
    .format("csv") \
    .option("compression", "gzip") \
    .option("header", True) \
    .load("dataset/tmp/sales.csv.gz")

df_zipped.printSchema()
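One common workaround (my suggestion, not covered above) is to decompress the file once up front so Spark reads a plain, splittable CSV. A minimal sketch using Python's standard library, with a temporary stand-in for the dataset/tmp/sales.csv.gz path:

```python
import gzip
import os
import shutil
import tempfile

# Hypothetical stand-in paths for dataset/tmp/sales.csv.gz
tmp = tempfile.mkdtemp()
gz_path = os.path.join(tmp, "sales.csv.gz")
csv_path = os.path.join(tmp, "sales.csv")

# Create a tiny sample gzip CSV so the sketch is self-contained
with gzip.open(gz_path, "wb") as f:
    f.write(b"order_id,amount\n1,9.99\n")

# Stream-decompress to a plain .csv; Spark can split an uncompressed
# text file across many partitions, restoring parallelism
with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

with open(csv_path) as f:
    print(f.readline().strip())  # order_id,amount
```

Pointing spark.read at the decompressed path (or calling df_zipped.repartition(n) right after the load) lets downstream stages run in parallel again.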
