Data Lakehouse with PySpark — Setup Delta Lake Warehouse on S3 and Boto3 with AWS
Next in the series Data Lakehouse with PySpark, we need to set up boto3 and Delta Lake to communicate with AWS S3. This will let us create the default warehouse location for Delta Lake on AWS S3. We will also set up the metastore location for Delta Lake.
To start, we need AWS credentials: an Access Key and a Secret Access Key. Check out https://medium.com/@subhamkharwal/pyspark-connect-aws-s3-20660bb4a80e to know more. In case of any issues, please follow the YouTube video at the end.
Connect AWS from boto3:
Once we have the AWS Access Key and Secret Access Key, create a new folder .aws and a file named credentials in the user's home directory. Add the following lines under the default profile, replacing the placeholders with your keys, and save the file.
[default]
aws_access_key_id=<Your AWS Access Key>
aws_secret_access_key=<Your AWS Secret Key>
And we are done. boto3 can now pick up the credentials from the default profile to connect with AWS.
Connect AWS from Delta Lake:
To connect Delta Lake with AWS S3 and create the default warehouse location there, add the following lines at the bottom of the spark-defaults.conf file.
Please change parameter as per your location on S3. This setup is done as per the session of Data Lakehouse on YouTube — https://youtube.com/playlist?list=PL2IsFZBGM_IExqZ5nHg0wbTeiWVd8F06b
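For reference, a minimal spark-defaults.conf fragment might look like the following (the bucket name, package versions, and credentials provider here are assumptions; align them with your environment and Spark build):

```properties
# Enable Delta Lake SQL support
spark.sql.extensions                          io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog               org.apache.spark.sql.delta.catalog.DeltaCatalog

# Default warehouse location on S3 (replace with your bucket/path)
spark.sql.warehouse.dir                       s3a://your-bucket/warehouse

# Pull in Delta and the S3A connector (match versions to your Spark build)
spark.jars.packages                           io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.3.2

# Read AWS keys from the default profile in ~/.aws/credentials
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.profile.ProfileCredentialsProvider
```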
We can also define the location of the metastore for Delta Lake using the hive-site.xml file.
<configuration>
  <!-- Adjust the values below for your environment -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3a://your-bucket/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
Save the files, and we can now create Delta tables that land in the default warehouse location on S3.
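For example, a table created without an explicit LOCATION clause is placed under the configured S3 warehouse path. The table and column names below are illustrative; run the statements through spark.sql() or the Spark SQL shell:

```sql
-- Created under spark.sql.warehouse.dir, e.g. s3a://your-bucket/warehouse/sales_orders
CREATE TABLE IF NOT EXISTS sales_orders (
    order_id INT,
    amount   DOUBLE
) USING DELTA;

INSERT INTO sales_orders VALUES (1, 49.99);

SELECT * FROM sales_orders;
```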
Github location for conf files — https://github.com/subhamkharwal/ease-with-data/tree/master/dw-with-pyspark/conf
Still struggling? Check out the following YouTube video.
Make sure to Like and Subscribe.
Follow us on YouTube: https://youtube.com/@easewithdata
If you are new to Data Lakehouse, check out — https://youtube.com/playlist?list=PL2IsFZBGM_IExqZ5nHg0wbTeiWVd8F06b