How to Install PySpark Correctly on Windows: A Step-by-Step Guide

If you are struggling to install PySpark on your Windows machine, look no further. In this post, I will show you how to install PySpark on Windows without any hassle. Because the setup runs inside a Docker container, it should behave the same way on any Windows machine that can run Docker.

Let’s get started.

Installing PySpark using Docker

Why use Docker to install PySpark?

I had been trying to install PySpark on my Windows laptop for three days. I tried almost every method described in various blog posts, but nothing worked. Most of the problems with installing PySpark on Windows are related to Java, such as Py4JError. The methods described in many articles work on some machines but fail on many others, because we all have different hardware and software configurations. What works on one machine is not guaranteed to work on another.

This is the problem Docker tries to solve. Whatever hardware and software you are using, an application built with Docker that runs on one machine is expected to run on others too, because everything needed to run it successfully is included in the Docker container. Knowing how to use Docker is also an important skill for any data scientist, so learning it along the way is an added benefit.

So let’s see how to install PySpark with Docker.

First, go to the Docker website and create an account. Then download Docker Desktop.

Then double-click the Docker Desktop installer to install it. Click OK after making your selections.

Once installed, you will see a screen like this.
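If you already have Python on the Windows host, a quick way to confirm that the installer put the Docker command-line tool on your PATH is the sketch below. The helper name docker_on_path is my own, and the check only looks for the docker executable; it does not tell you whether Docker Desktop is actually running.

```python
import shutil
import subprocess


def docker_on_path():
    """Return True if a 'docker' executable can be found on PATH."""
    return shutil.which("docker") is not None


if docker_on_path():
    # 'docker --version' prints the client version string.
    result = subprocess.run(["docker", "--version"], capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())
else:
    print("docker was not found on PATH; try reopening the terminal")
```

If the command is not found right after installation, opening a fresh PowerShell window is usually worth trying first, since an already open terminal keeps its old PATH.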

Now we need to download the PySpark image from Docker Hub, which you can find here –

Copy the docker pull command and run it in Windows PowerShell or Git Bash. For the image used in this post, the command is docker pull jupyter/pyspark-notebook.

Once the image is downloaded, you will see the Pull complete message, and the PySpark image will appear inside the Docker Desktop app.

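Then run this command in PowerShell to start the container. The --name option names the container notebook so that it can be stopped and removed by that name later.

docker run --name notebook -p 8888:8888 jupyter/pyspark-notebook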
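The -p 8888:8888 option publishes the container's port 8888 on port 8888 of your machine, so the command will fail if something else is already listening there, such as another Jupyter server. Here is a rough way to check beforehand, assuming Python is available on the host. The helper port_in_use is hypothetical and not part of Docker.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, meaning the port is taken.
        return sock.connect_ex((host, port)) == 0


if port_in_use(8888):
    print("Port 8888 is busy; stop the other program or map a different host port")
else:
    print("Port 8888 looks free")
```

If the port is busy, changing the first number, as in -p 8889:8888, maps the notebook to a different port on your machine, and you would then open the address with 8889 in the browser.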

Then copy the address shown in PowerShell, paste it into your web browser, and press Enter.

This should open Jupyter Notebook in your web browser.
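The address printed in the terminal includes a login token as a query parameter, which is why you need to copy the whole address and not just the host and port. The sketch below pulls such an address apart with Python's standard library. The sample URL and its token are made up for illustration, and the exact path in your output may differ between image versions.

```python
from urllib.parse import urlparse, parse_qs

# Made-up example of the kind of address the container prints; yours will differ.
sample_address = "http://127.0.0.1:8888/lab?token=0123456789abcdef"

parts = urlparse(sample_address)
token = parse_qs(parts.query).get("token", [""])[0]

print("host:", parts.hostname)  # machine the browser connects to
print("port:", parts.port)      # host side of the -p 8888:8888 mapping
print("token:", token)          # login token generated when the server starts
```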

Now, let’s test whether PySpark runs without any errors. Create a new Jupyter notebook, then run the following code to start a PySpark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

If everything is installed correctly, the code above should run without any errors.
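Creating the session only shows that PySpark can be imported and that Spark starts. To see a job actually run, you can build a small DataFrame and trigger an action on it. This is a minimal sketch with made-up data; the helper names pyspark_available and run_smoke_test are my own, and the availability check is only there so the cell fails gracefully outside the container.

```python
import importlib.util


def pyspark_available():
    """Return True if the pyspark package can be imported in this environment."""
    return importlib.util.find_spec("pyspark") is not None


def run_smoke_test():
    """Create a tiny DataFrame and run two actions on it."""
    from pyspark.sql import SparkSession

    # Reuses the existing session if one was already created in the notebook.
    spark = SparkSession.builder.getOrCreate()
    rows = [("alice", 1), ("bob", 2), ("carol", 3)]  # made-up sample data
    df = spark.createDataFrame(rows, ["name", "value"])
    row_count = df.count()  # action: number of rows
    total = df.groupBy().sum("value").collect()[0][0]  # action: sum of the column
    return row_count, total


if pyspark_available():
    print(run_smoke_test())  # should print (3, 6)
else:
    print("PySpark is not importable in this environment")
```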

You have successfully installed PySpark on your machine.

To stop the container, either use the Docker Desktop app or run the following command. If your container is not named notebook, use the name or ID shown by docker ps instead.

# stop the container
docker stop notebook

And to remove the container permanently, run this command.

# remove the container permanently
docker rm notebook

To learn more about Docker, please follow this link on YouTube –

In future posts, I will write more about how to use Docker for data science. Make sure to subscribe to our blog below, and if you liked this post, please share it with others.
