PySpark — Implementing Persisting Metastore

Subham Khandelwal
3 min read · Nov 11, 2022


A metastore is a central repository used to manage relational objects such as databases, tables, views, etc. Apache Hive, Impala, Spark, Presto, etc. all implement one in order to maintain this metadata.

Representation Image (Credits: Apache Spark)

Spark allows only two types of catalog implementation, “in-memory” and “hive”. The active implementation can be verified with the following configuration parameter.

spark.sql.catalogImplementation

The in-memory implementation persists only for the session; once the session is stopped, the metadata is removed along with it. In order to persist our relational metadata, we have to change the catalog implementation to hive.

But what if we don’t have a Hive Metastore set up? Can we still persist relational entities? The answer is: yes, we can. To achieve this, Spark uses an embedded Apache Derby database by default and creates a metastore_db in the current working directory.

Now, how do we enable this? Let's check that out with an example.

First, let's check the default Spark catalog implementation.

# Validate CatalogImplementation (returns "in-memory" unless Hive support is enabled)
spark.conf.get("spark.sql.catalogImplementation")
Default catalog Implementation

Let's create a new SparkSession with the Hive catalog implementation. To achieve that, we add enableHiveSupport() while building the SparkSession object.

# Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Metastore with Hive Support") \
    .master("local[*]") \
    .config("spark.sql.warehouse.dir", "spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

spark
Spark Session with hive catalog

Once the SparkSession object is created, we can find a metastore_db folder in the current working directory. This folder stores all the metadata required to persist the relational entities.

metastore_db created
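
To verify this locally (a quick check that is not part of the original notebook; it assumes a local filesystem), we can simply look for the folder:

# Quick check: the embedded Derby metastore lives in ./metastore_db
import os
print("metastore_db" in os.listdir("."))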

By default, the path specified in spark.sql.warehouse.dir is used to store the data of managed tables. For external tables, the location needs to be specified while creating them.
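
For illustration (hypothetical table name and path), an external table can be registered with an explicit location, so only its metadata is tracked in the metastore while the data stays at the given path:

# Sketch: register an external table at an explicit location
# (hypothetical path) instead of spark.sql.warehouse.dir
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external
    USING PARQUET
    LOCATION 'dataset/external/sales_external'
""")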

Now, let's quickly create a table to check out our metastore.

# Read data from Parquet

df = spark.read \
    .format("parquet") \
    .load("dataset/sales.parquet")

df.printSchema()
df.show(10)
Example Dataset

Write the dataset as a managed table.

# Write dataframe as table

df.write.saveAsTable("sales_managed")
Metadata
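
As a quick sanity check (not in the original notebook), the catalog should now list the managed table:

# The metastore should now know about sales_managed
print(spark.catalog.listTables())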

Now, if we stop the session and create a new SparkSession from the same working directory, we will still be able to access all table metadata, since it is persisted.
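
Here is a minimal sketch of that restart, assuming the same working directory and the same builder settings as before:

# Stop the current session and build a new one; the catalog is read
# back from the Derby metastore_db created earlier
spark.stop()

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Metastore with Hive Support") \
    .master("local[*]") \
    .config("spark.sql.warehouse.dir", "spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()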

%%sparksql

select * from sales_managed limit 10;
Table data

Conclusion: The benefit of a metastore implementation comes into play when we have to read a dataset and suffer performance hits due to data scanning.

Check out this article to understand more about data scanning: https://urlit.me/blog/pyspark-optimize-data-scanning-exponentially/

Now we can read the data directly, since the metastore stores the table name, the metadata/schema, and the path of the table.
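
For example (a small illustration, assuming the sales_managed table created above), the stored schema and location can be inspected directly:

# Show the columns plus table details (Provider, Type, Location)
# recorded in the metastore for the managed table
spark.sql("DESCRIBE EXTENDED sales_managed").show(50, truncate=False)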

Check out the IPython Notebook on GitHub: https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/26_metastore_in_pyspark.ipynb

Check out my personal blog: https://urlit.me/blog/

Check out the PySpark Series on Medium: https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30
