PySpark — Implementing Persisting Metastore
Metastore is a central repository that is used to manage relational objects such as database, tables, views etc. Apache Hive, Impala, Spark, Presto etc. implements the same in order to maintain the metadata.
Spark basically allows only two type of catalog Implementation “in-memory” and “hive”. This can be verified with the following configuration parameter.
In-memory implementation only persists for the session, once the session is stopped metadata is remove along with it. In-order to persist our relational metadata, we have to change our catalog implementation to hive.
But, what if we don’t have Hive Metastore implemented? Can’t we implement persisting relational entities? The answer is — Yes we can. In order to achieve the same, Spark uses default Apache Derby to create a metastore_db in the current working directory.
Now, how to we enable the same? Lets check that out with an example.
Default spark catalog Implementation.
# Validate CatalogImplementation
Lets create a new SparkSession with Hive catalog implementation and to achieve that we add enablehivesupport() while creating SparkSession object.
# Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession \
.appName("Metastore with Hive Support") \
.config("spark.sql.warehouse.dir", "spark-warehouse") \
Once the SparkSession object is created we can find a metastore_db folder created in default working directory to persist the metadata information. This folder stores all the required metadata.
By default path specified to spark.sql.warehouse.dir is used to store data for the in-house managed tables. For external tables location needs to specified while creating the same.
Now, lets quickly create a table to checkout our metastore.
# Read data from Parquet
df = spark.read \
Write the dataset as managed table
# Write dataframe as table
Now, if we stop and create a new SparkSession with same default working directory, we will still be able to access all table metadata, since it is permanent/persisted.
select * from sales_managed limit 10;
Conclusion: The benefit for metastore implementation comes into play when we have to read the a dataset and we have performance hits due to data scanning.
Checkout this article to understand more on data scanning — https://urlit.me/blog/pyspark-optimize-data-scanning-exponentially/
And now we can directly read data directly — as metastore will store the table name, metadata/schema and path of the table.
Checkout the iPython Notebook on Github — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/26_metastore_in_pyspark.ipynb
Checkout my Personal blog — https://urlit.me/blog/
Checkout the PySpark Series on Medium — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30