PySpark — Connect AWS S3
Cloud distributed storage services such as Google GCS, Amazon S3 and Azure ADLS often serve as data endpoints in big data workloads.
Today, we are going to connect AWS S3 to our PySpark cluster. To begin with, we need an AWS account and an S3 bucket already created.
Once the S3 bucket is ready, let's create a new user group S3Group under AWS IAM > User Groups. Attach the AmazonS3FullAccess policy to the S3Group user group and complete the wizard.
Next, create a new user S3User and add it to the S3Group group. Once the user is attached to S3Group, go to AWS IAM > Users > S3User > Security Credentials > Access Keys and generate a new access key pair.
The access keys come as a pair: an Access Key ID and a Secret Access Key. Note down both and log out of the AWS Console. Our setup on the AWS side is now complete.
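If you prefer the command line, the same IAM setup can be scripted with the AWS CLI. This is only a sketch of the console steps above, using the same group and user names:

# Create the group and attach the managed S3 policy
aws iam create-group --group-name S3Group
aws iam attach-group-policy --group-name S3Group --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Create the user, add it to the group and generate an access key pair
aws iam create-user --user-name S3User
aws iam add-user-to-group --user-name S3User --group-name S3Group
aws iam create-access-key --user-name S3User   # prints AccessKeyId and SecretAccessKey

The last command prints the Access Key ID and Secret Access Key, the same pair you would copy from the console.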
Let's move to the PySpark notebook/Spark environment and edit the spark-env.sh file to add the generated AWS credentials.
Find the spark-env.sh file under Spark Installation > conf. If only spark-env.sh.template exists, rename it to spark-env.sh.
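As a minimal sketch, the credentials can be exported in spark-env.sh under the standard variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which the S3A connector's environment credential provider picks up (the placeholder values below are assumptions; substitute your own key pair):

# Add to Spark Installation > conf/spark-env.sh
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

Keep the real values out of version control, since anyone with this key pair gets the full S3 access we granted to S3Group.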