Data Lakehouse with PySpark — Set Up the Delta Lake Warehouse on S3 with Boto3 and AWS
Next, as part of the series Data Lakehouse with PySpark, we need to set up boto3 and Delta Lake to communicate with AWS S3. This will let us create the default warehouse location for Delta Lake on AWS S3. We will also set up the metastore location for Delta Lake.
To start, we need AWS credentials: an Access Key and a Secret Access Key. Check out https://medium.com/@subhamkharwal/pyspark-connect-aws-s3-20660bb4a80e to learn how to generate them. In case of any issues, please follow the YouTube video at the end.
Connect to AWS from boto3:
Once we have the AWS Access Key and Secret Access Key, create a new folder .aws and a file named credentials in the user's home directory. Add the following lines, replacing the placeholders with your Access Key and Secret Key under the default profile, and save the file.
[default]
aws_access_key_id=<Your AWS Access Key>
aws_secret_access_key=<Your AWS Secret Key>
And we are done; boto3 can now pick up the credentials from the default profile to connect with AWS.
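To quickly verify the setup, a minimal boto3 sketch like the one below can list the buckets visible to these credentials; it assumes the default profile created above and nothing else from this series.
import boto3
# boto3 reads the Access Key and Secret Key from the default
# profile in ~/.aws/credentials automatically.
s3 = boto3.client("s3")
# List the buckets visible to these credentials to confirm the connection.
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(bucket["Name"])
If the call returns your bucket names without an authentication error, the credentials file is in place.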
Connect to AWS from Delta Lake:
To connect Delta Lake with AWS S3 and create the default warehouse location on S3, add the following lines at the bottom of the spark-defaults.conf file.
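For illustration, a typical set of properties looks like the sketch below; the bucket name is a placeholder and the exact values depend on your environment, so treat this as an assumption rather than the final configuration.
# Enable Delta Lake SQL extensions and catalog
spark.sql.extensions                           io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog                org.apache.spark.sql.delta.catalog.DeltaCatalog
# Use the S3A filesystem and the default AWS profile for credentials
spark.hadoop.fs.s3a.impl                       org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider   com.amazonaws.auth.profile.ProfileCredentialsProvider
# Point the default warehouse location at an S3 bucket (placeholder name)
spark.sql.warehouse.dir                        s3a://<your-bucket>/warehouse
With these properties in place, any Spark session started with this spark-defaults.conf will create managed Delta tables under the S3 warehouse path instead of the local filesystem.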