Data Lakehouse with PySpark — Setup PySpark Docker Jupyter Lab env

As part of the series Data Lakehouse with PySpark, we need to set up the PySpark environment with Jupyter Lab on Docker. Today we are going to do exactly that in a few simple steps. This environment can then be used for your personal use cases and practice.


If you are not following along yet, check out the playlist on YouTube —

There are a few prerequisites for the environment setup; check the image below. We will not need an AWS account for now, but it will definitely be needed later in the course.

Prerequisites

Now, install Docker Desktop on your machine. You can download it from the official Docker website — .

The setup is pretty straightforward and does not need anything special.

Start Docker Desktop and wait until the Docker Engine status turns GREEN.

Docker Desktop

Go ahead and clone/download the Docker images GitHub repo — .

Once cloned/downloaded, move into the pyspark-jupyter-lab folder and open a command prompt to run the following command, which will build the required Docker image.

docker build --tag easewithdata/pyspark-jupyter-lab .
Docker Image Build

The command will run for some time depending on your internet bandwidth. Once complete, you can find the image listed in Docker Desktop.

Docker Desktop Images

Now, run the following command to create the container, which will run our PySpark Jupyter Lab env on Docker. Port 8888 serves Jupyter Lab, and port 4040 exposes the Spark web UI.

docker run -d -p 8888:8888 -p 4040:4040 --name jupyter-lab easewithdata/pyspark-jupyter-lab

You can find the newly created container in the Containers tab.

Docker container

Copy the token from the jupyter-lab container logs by clicking on the jupyter-lab container.

Jupyter-lab token

Now, open http://localhost:8888 to log in to the Jupyter Lab environment. Paste the copied token and set a new password.

Logging in to Jupyter Lab

Voila, we just created a brand new PySpark Jupyter Lab environment on Docker.

PySpark Jupyter Lab env

Next time, you just need to start the container (from Docker Desktop, or with docker start jupyter-lab), open http://localhost:8888, and log in using the password you set.

To check out the SparkSession web UI, open http://localhost:4040 (it becomes available once a SparkSession is active).
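You can also confirm the UI address from inside a notebook. A minimal sketch, assuming a running SparkSession object named spark (see the example in the note below):

print(spark.sparkContext.uiWebUrl)  # prints the Spark UI URL as seen from inside the container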

Note: This is a completely configured PySpark environment. Open a notebook, create a SparkSession, and start using it. There is no need to set up Spark, Python, etc. yourself. If you need an extra library, you can install it from inside the container, which comes with root privileges by default.
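For example, here is a minimal sketch of a first notebook cell; the app name and sample data are arbitrary placeholders:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the env needs no extra configuration
spark = SparkSession.builder.appName("pyspark-jupyter-lab-demo").getOrCreate()

# Quick sanity check: build a tiny DataFrame and display it
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()

To pull in an extra package from a notebook, the usual !pip install <package> shell escape works, since the container runs with root privileges.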

Still struggling? Check out the following YouTube video

Make sure to Like and Subscribe.

Follow us on YouTube:

If you are new to Data Warehousing, check out —


⚒️ Senior Data Engineer with 10+ YOE | 📽️ YouTube channel: https://www.youtube.com/@easewithdata | 📞 TopMate : https://topmate.io/subham_khandelwal
