PySpark — Setup Delta Lake

Delta Lake is one of the hot topics in the big data space right now. It brings a lot of improvements in terms of reliability to data lakes. But why is it gaining so much attention?

Representation Image (Credits: delta.io)

As per Delta IO documentation — “Delta Lake is an open source project that enables building a Lakehouse Architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS”.

Delta Lake comes with many key features; check out the image below.

Key Features (Credits: delta.io)

Today, we are going to set up Delta Lake using PySpark and enable our metastore for it.

In order to make PySpark support Delta Lake, we need to add the delta-core jar to the SparkSession and set the right configs.

Generate SparkSession

Make sure to enable a persistent metastore. Check out my previous blog on the same — https://urlit.me/blog/pyspark-implementing-persisting-metastore/

Now we are ready to work with Delta Lake. To create our first Delta table, let's read the Sales Parquet data.

Sales dataset

Now, we transform and write this data as a Delta table.

Sales Delta table

Since we didn’t specify an external location, by default this creates a managed Delta table.

Table definition
Table data

Delta tables support versioning out of the box.

Delta table versioning

Let's update a record to see the changes in versioning.

ACID transaction support

The record was updated without any hassle (Delta Lake supports DML operations).

Record updated

Changes in versioning

Version updated after DML operation

Let's validate whether a given table location is a Delta table.

Validate Delta table

Bonus: a shortcut to convert existing Parquet data to Delta.

Verify that the location has been converted to Delta.

Verification

But if we check the metadata in the Catalog, it is still a Hive table.

So, let's convert the metadata as well.

Table converted to delta

We are going to cover many more Delta Lake concepts in upcoming blogs.

Check out the iPython Notebook on GitHub — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/27_delta_with_pyspark.ipynb

Check out my personal blog — https://urlit.me/blog/

Check out the PySpark Medium Series — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30
