
PySpark — Read Binary Files like PNG or PDF

3 min read · Oct 15, 2022

Can Spark read a .png or .pdf file? The answer is yes. Spark can read almost any type of file as a binary file into a DataFrame.

Spark has a built-in binaryFile format that loads any binary file and stores its content as binary. The BLOB, or binary content, can later be written back to the appropriate file format as required.
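To illustrate the write-back step, here is a minimal sketch, not the author's code. It assumes the rows come from a binaryFile DataFrame, which exposes `path` and `content` columns. The helper name `save_binary_rows` and the output directory are hypothetical.

```python
import os
from urllib.parse import urlparse


def save_binary_rows(rows, out_dir):
    """Write each row's binary content to out_dir, keeping the original file name.

    `rows` is any iterable of objects with `path` and `content` attributes,
    for example the rows of a DataFrame loaded with the binaryFile format.
    (Hypothetical helper, for illustration only.)
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for row in rows:
        # binaryFile paths look like URIs (e.g. "file:/data/spark.png"),
        # so strip the scheme before taking the file name.
        file_name = os.path.basename(urlparse(row.path).path)
        target = os.path.join(out_dir, file_name)
        with open(target, "wb") as handle:
            handle.write(bytes(row.content))
        written.append(target)
    return written
```

With a binaryFile DataFrame `df`, one way to call it is `save_binary_rows(df.select("path", "content").toLocalIterator(), "output/files")`. This pulls every file to the driver, so it only suits small batches.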

Let's quickly read some binary files for demonstration. These are the files we are going to read:

%%sh
ls -lhtr dataset/files

Let's read one .png file and check the resulting DataFrame:

# Let's read a .png file
df_spark_png = spark \
    .read \
    .format("binaryFile") \
    .load("dataset/files/spark.png")

df_spark_png.printSchema()
df_spark_png.show()
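The binaryFile source returns one row per file with the columns `path`, `modificationTime`, `length`, and `content`. As a quick sanity check on what was loaded, the sketch below inspects the leading bytes of `content` to guess the file type. The helper name `guess_file_type` is hypothetical, and only the PNG and PDF signatures are covered.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
PDF_SIGNATURE = b"%PDF-"              # PDF files start with this header


def guess_file_type(content):
    """Return "png", "pdf", or "unknown" based on the leading bytes."""
    # PySpark hands binary columns over as bytearray; only the first
    # few bytes are needed, so avoid copying the whole file.
    head = bytes(content[:8])
    if head.startswith(PNG_SIGNATURE):
        return "png"
    if head.startswith(PDF_SIGNATURE):
        return "pdf"
    return "unknown"
```

For the DataFrame above, `guess_file_type(df_spark_png.first().content)` is expected to report `"png"` if the file is a valid PNG.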

Now let's read all the .png files from the path:

# Let's read all .png files
df_spark_png = spark \
    .read \
    .format("binaryFile") \
    .load("dataset/files/*.png")

df_spark_png.printSchema()
df_spark_png.show()
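A glob in the load path is one way to select files. Spark 3.0 and later also offer the `pathGlobFilter` and `recursiveFileLookup` reader options, which can express the same filter and additionally walk subdirectories. The sketch below wraps them in a hypothetical helper, `read_binary_files`, and assumes an existing `spark` session is passed in.

```python
def read_binary_files(spark, directory, glob_pattern="*.png", recursive=False):
    """Load files under `directory` whose names match `glob_pattern`.

    Hypothetical convenience wrapper around the binaryFile reader;
    `spark` is an existing SparkSession.
    """
    return (
        spark.read.format("binaryFile")
        # Keep only files whose name matches the glob.
        .option("pathGlobFilter", glob_pattern)
        # Optionally descend into subdirectories as well.
        .option("recursiveFileLookup", "true" if recursive else "false")
        .load(directory)
    )
```

For example, `read_binary_files(spark, "dataset/files", "*.pdf")` would pick up the PDF files in the same folder instead of the PNGs.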


Written by Subham Khandelwal
