Data Lakehouse with PySpark — Setup Delta Lake Warehouse on S3 and Boto3 with AWS

Subham Khandelwal
3 min read · Feb 21, 2023

Next, as part of the series Data Lakehouse with PySpark, we need to set up boto3 and Delta Lake to communicate with AWS S3. This will help us create our default warehouse location for Delta Lake on AWS S3. We will also set up the metastore location for Delta Lake.

To start, we need AWS credentials: an Access Key and a Secret Access Key. Check out https://medium.com/@subhamkharwal/pyspark-connect-aws-s3-20660bb4a80e to know more. In case of any issues, please follow the YouTube video at the end.

Connect AWS from boto3:

Once we have the AWS Access Key and Secret Access Key, create a new folder named .aws in the user’s home directory and a file named credentials inside it (~/.aws/credentials). Add the following lines under the default profile, replacing the placeholders with your Access Key and Secret Key, and save the file.

Credential file
[default]
aws_access_key_id=<Your AWS Access Key>
aws_secret_access_key=<Your AWS Secret Key>

And we are done. boto3 can now use the credentials from the default profile to connect with AWS.
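As a quick check of this step, here is a minimal sketch in Python. The helper names read_profile and list_bucket_names are hypothetical and not part of the original setup. The first helper only parses the credentials file with the standard library; the second assumes boto3 is installed and that the keys are allowed to list S3 buckets.

```python
import configparser


def read_profile(path, profile="default"):
    """Return (access_key, secret_key) for a profile in an AWS credentials file."""
    parser = configparser.ConfigParser()
    parser.read(path)
    section = parser[profile]  # raises KeyError if the profile is missing
    return section["aws_access_key_id"], section["aws_secret_access_key"]


def list_bucket_names(profile="default"):
    """List S3 bucket names using the given profile (needs boto3 and network access)."""
    import boto3  # imported lazily so read_profile works without boto3 installed

    session = boto3.Session(profile_name=profile)
    s3 = session.client("s3")
    return [bucket["Name"] for bucket in s3.list_buckets()["Buckets"]]
```

Calling list_bucket_names() from a notebook should return the buckets visible to the default profile. If it raises a credentials error, the file location or the profile name is the first thing to check.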

Connect AWS from Delta Lake:

To connect Delta Lake with AWS S3 and create the default warehouse location on AWS S3, add the following lines at the bottom of the spark-defaults.conf file.

spark.jars.packages io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.2
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.warehouse.dir s3a://easewithdata/dw-with-pyspark/warehouse
spark.driver.extraJavaOptions -Dderby.system.home=/home/jupyter/ease-with-data/dw-with-pyspark/derby
Spark Defaults Conf Setup

Please change the parameters as per your location on S3 (the warehouse bucket path and the local Derby directory). This setup follows the Data Lakehouse session on YouTube: https://youtube.com/playlist?list=PL2IsFZBGM_IExqZ5nHg0wbTeiWVd8F06b
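If editing spark-defaults.conf is not an option, the same Delta and S3 settings can probably be passed when building the session. The sketch below is an assumption, not the setup used in the series: the function name build_session and the app name are hypothetical, and the Derby JVM option is left out on purpose because it is a driver JVM flag and is safer to keep in the conf file.

```python
# Settings copied from the spark-defaults.conf lines above; adjust the
# bucket path to your own S3 location.
DELTA_S3_CONF = {
    "spark.jars.packages": "io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.2",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.sql.warehouse.dir": "s3a://easewithdata/dw-with-pyspark/warehouse",
}


def build_session(app_name="delta-on-s3", conf=None):
    """Create a SparkSession with the Delta Lake and S3 warehouse settings applied."""
    from pyspark.sql import SparkSession  # lazy import: needs pyspark installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in (conf or DELTA_S3_CONF).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

These settings only take effect if no Spark session is already running in the process, so build_session() should be the first Spark call in the notebook.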

We can also define the location of the metastore for Delta Lake using the hive-site.xml file.

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/jupyter/ease-with-data/dw-with-pyspark/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/home/jupyter/ease-with-data/dw-with-pyspark/</value>
<description>location of default database for the warehouse</description>
</property>
</configuration>
Change Metastore Location

Save the files, and we can now create Delta tables with the default warehouse location on S3.
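To confirm that new tables land in the S3 warehouse, a small sketch like the following can help. The function name create_demo_table and the database and table names are hypothetical; it assumes an active SparkSession named spark that was started with the configuration described above.

```python
def create_demo_table(spark, database="demo_db", table="demo_tbl"):
    """Create a small managed Delta table and return the location Delta reports for it."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        "(id INT, name STRING) USING DELTA"
    )
    # DESCRIBE DETAIL returns one row of table metadata, including its storage location
    detail = spark.sql(f"DESCRIBE DETAIL {database}.{table}").collect()[0]
    return detail["location"]
```

If the setup worked, the returned location should start with the s3a:// warehouse path from spark-defaults.conf rather than a local directory.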

DB and tables created in S3

GitHub location for conf files: https://github.com/subhamkharwal/ease-with-data/tree/master/dw-with-pyspark/conf

Still struggling? Check out the following YouTube video.

Make sure to Like and Subscribe.

Follow us on YouTube: https://youtube.com/@easewithdata

If you are new to Data Lakehouse, check out https://youtube.com/playlist?list=PL2IsFZBGM_IExqZ5nHg0wbTeiWVd8F06b
