PySpark — Implementing Persisting Metastore

Subham Khandelwal
3 min read · Nov 11, 2022

A metastore is a central repository used to manage relational objects such as databases, tables, and views. Apache Hive, Impala, Spark, Presto, and others implement one in order to maintain this metadata.

Representation Image (Credits: Apache Spark)

Spark allows only two types of catalog implementation, "in-memory" and "hive". This can be verified with the following configuration parameter.

spark.sql.catalogImplementation

The in-memory implementation persists only for the session; once the session is stopped, the metadata is removed along with it. In order to persist our relational metadata, we have to change the catalog implementation to hive.

But what if we don't have a Hive Metastore implemented? Can't we still persist relational entities? The answer is: yes, we can. In that case, Spark falls back to an embedded Apache Derby database and creates a metastore_db directory in the current working directory.
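To see the problem concretely, here is a minimal sketch of the default in-memory behavior (demo_db is a placeholder name): a database created in one session disappears as soon as that session is stopped.

from pyspark.sql import SparkSession

# Default session: in-memory catalog
spark = SparkSession.builder.master("local[*]").getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")  # placeholder name
print(spark.catalog.listDatabases())  # demo_db shows up here
spark.stop()

# A brand new session starts with a fresh in-memory catalog
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.catalog.listDatabases())  # only "default" remains; demo_db is gone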

Now, how do we enable this? Let's check it out with an example.

Default Spark catalog implementation:

# Validate CatalogImplementation
spark.conf.get("spark.sql.catalogImplementation")
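For a plain session created without Hive support, this returns in-memory.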

Let's create a new SparkSession with the Hive catalog implementation; to achieve that, we add enableHiveSupport() while creating the SparkSession object.
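A minimal sketch of what that looks like (the app name, database, and table names are placeholders; adjust for your environment):

from pyspark.sql import SparkSession

# Hive-enabled session: with no external Hive Metastore configured,
# Spark falls back to an embedded Derby metastore_db created in the
# current working directory
spark = (
    SparkSession
    .builder
    .appName("Persisting Metastore")  # placeholder app name
    .enableHiveSupport()
    .getOrCreate()
)

# Validate CatalogImplementation - now returns "hive"
print(spark.conf.get("spark.sql.catalogImplementation"))

# Create a database and a managed table; both get registered
# in the Derby-backed metastore (placeholder names)
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id INT,
        amount DOUBLE
    ) USING parquet
""")

print(spark.catalog.listTables("sales_db"))
spark.stop()

Because the metadata now lives in the Derby-backed metastore_db directory (with the table data under the spark-warehouse directory), a new Hive-enabled session started from the same working directory will still see sales_db and its tables.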
