
Install & Run Spark on your Mac machine & AWS Cloud: a step-by-step guide

http://www.rupeshtiwari.com/learning-apache-spark/

You will be able to install Spark and run the spark-shell and pyspark shells on your Mac.

Step 1: Install Java on your Mac

Java is a prerequisite for Spark, so install it on your Mac machine first.
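If you do not have Java yet, one option (Homebrew is an assumption about your setup, not the only way) is to install OpenJDK:

# Install OpenJDK 11 via Homebrew (the version is an assumption; Spark 3 runs on Java 8 or 11)
brew install openjdk@11

# Homebrew prints caveats about adding the JDK to your PATH; follow them, then verify
java -version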

Step 2: Install Spark on your Mac
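A minimal sketch of the installation, assuming you want Spark under ~/spark3 (the version number below is an assumption; pick the latest 3.x build from spark.apache.org/downloads):

# Download a pre-built Spark 3.x package (version is an assumption)
curl -O https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

# Extract it and move it to ~/spark3 so the later steps can use ~/spark3/bin
tar -xzf spark-3.3.0-bin-hadoop3.tgz
mv spark-3.3.0-bin-hadoop3 ~/spark3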

Step 3: Installing python3

If you already have python3, skip this step. To check, type python3 in your terminal.

  1. Install python3: brew install python3
  2. Check that python3 is installed: python3
  3. Next, set the PYSPARK_PYTHON environment variable to point to python3: export PYSPARK_PYTHON=python3
  4. Check the variable: echo $PYSPARK_PYTHON
  5. Also put this export in your shell startup file (.zshrc in my case), as shown below.
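For example, the line to add at the end of your .zshrc would be:

# ~/.zshrc — make the PySpark shell use python3
export PYSPARK_PYTHON=python3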

Step 4: Running the PySpark shell on your Mac laptop

Now run pyspark to open the Spark shell in Python.

Running a small program with the PySpark shell on your Mac laptop

You will learn about the Spark shell, the local cluster, the driver, executors, and the Spark Context UI.

Cluster   Mode          Tool
Local     Client Mode   spark-shell
# Navigate to spark3 bin folder
cd ~/spark3/bin

# 1. Start the PySpark shell
pyspark

# 2. Read and display a JSON file (multiline-formatted JSON, like my example file)
df = spark.read.option("multiline","true").json("/Users/rupeshti/workdir/git-box/learning-apache-spark/src/test.json")
df.show()

👉 option("multiline","true") is important if your JSON spans multiple lines, e.g. formatted by Prettier or any other formatter.

Analyzing Spark jobs using the Spark Context web UI on your Mac laptop

To monitor and investigate your Spark application, you can open the Spark Context web UI while the shell or application is running.
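You can also ask Spark for the UI address directly (a small sketch using the spark session that the pyspark shell already provides; for a local shell it is usually http://localhost:4040):

# Print the Spark Context web UI address from the running pyspark shell
print(spark.sparkContext.uiWebUrl)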

Running Jupyter Notebook (in local cluster and client mode) on your Mac laptop

Data scientists use Jupyter Notebook to develop and explore applications step by step. Spark programming in Python requires you to have Python on your machine.

Cluster   Mode          Tool
Local     Client Mode   Notebook

If you install the Anaconda environment, you get a Python development environment and, with the configuration shown below, Spark support as well. You can download the community edition and install Anaconda; it comes with Jupyter Notebook pre-configured.

How to use Spark with Jupyter Notebook?

A notebook is a shell-based environment: you type your code into a cell and run it.

  1. Set the SPARK_HOME environment variable
  2. Install the findspark package
  3. Initialize findspark: it connects the Anaconda Python environment to your Spark installation

Step 1: Setting the environment variable and starting the notebook
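The environment variable findspark needs is SPARK_HOME, pointing at your Spark installation; add it to your .zshrc as well (the ~/spark3 path is an assumption based on the earlier install step):

# ~/.zshrc — tell findspark (and other tools) where Spark is installed (path is an assumption)
export SPARK_HOME=~/spark3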

After installing Anaconda, type jupyter notebook in the terminal. Your default browser will open at http://localhost:8888/tree, and the shell keeps running in the terminal. Now you have a Jupyter Notebook environment.

Go to the desired folder and create a new Python 3 notebook.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.read.option("multiline","true").json("/Users/rupeshti/workdir/git-box/learning-apache-spark/src/test.json").show()

You will get the error ModuleNotFoundError: No module named 'pyspark' because the notebook is not yet connected to your Spark installation.

Step 2: Installing findspark

# 1. Install pipx
brew install pipx

# 2. Install findspark
pip3 install findspark

Step 3: Connecting Spark with the notebook shell

The script below will connect the notebook to Spark.

import findspark
findspark.init()

The final notebook code is:

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.option("multiline","true").json("/Users/rupeshti/workdir/git-box/learning-apache-spark/src/test.json").show()

Installing a Multi-Node Spark Cluster in the AWS Cloud

At AWS, the Amazon EMR (Elastic MapReduce) service can be used to create a Hadoop cluster with Spark.

Cluster   Mode          Tool
YARN      Client Mode   spark-shell, Notebook

This mode is used by data scientists for interactive exploration directly against the production cluster. In most cases we use notebooks for their web-based interface and graphing capability.

Step 1: Creating an EMR cluster in the AWS cloud

We will create a Spark shell on a real multi-node YARN cluster; a CLI sketch for creating the cluster follows the EMR overview below.

What is Amazon EMR?

Amazon EMR is the industry-leading cloud big data platform for data processing, interactive analysis, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto.

Benefits of using Amazon EMR are:

  1. You do not need to manage compute capacity or open-source applications, which saves you time and money.
  2. Amazon EMR lets you set up scaling rules to manage changing compute demand.
  3. You can set up CloudWatch alerts to notify you of changes in your infrastructure and take action immediately.
  4. EMR has an optimized runtime which speeds up your analysis and saves both time and money.
  5. You can submit your workload to either EC2 or EKS using EMR.
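Besides the AWS console, the cluster can also be created from the AWS CLI. A minimal sketch (cluster name, release label, key pair, and instance sizes are all assumptions; adjust them to your account):

# Create a small 3-node EMR cluster with Spark and Zeppelin installed (all values are assumptions)
aws emr create-cluster \
  --name "learning-spark-cluster" \
  --release-label emr-6.9.0 \
  --applications Name=Spark Name=Zeppelin \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair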

👉 Make sure to log in as the hadoop user.

Step 2: Running PySpark on the EMR cluster in the AWS cloud using the Spark shell
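A minimal sketch of this step (the key file, master node DNS, and S3 path are placeholders for illustration):

# SSH into the EMR master node as the hadoop user (key and host are placeholders)
ssh -i ~/my-key-pair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Start the PySpark shell on the cluster (on EMR it runs against YARN in client mode)
pyspark

# Read a JSON file from S3 instead of the local file system (bucket and path are placeholders)
df = spark.read.option("multiline","true").json("s3://your-bucket/test.json")
df.show()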

Step 3: Running PySpark in a notebook on the EMR cluster in the AWS cloud using Zeppelin

👉 Note: in the real world you will mostly not use the pyspark shell; people use notebooks instead. Therefore, we are going to use a Zeppelin notebook next.

Visit the Zeppelin URL

In a secured enterprise setup, you have to ask your cluster operations team to provide you the URL and grant you access to it.

A notebook is not like the Spark shell: it is not connected to Spark by default, so you have to run a Spark command to connect. You can simply run the spark.version command.

Create a new notebook and run spark.version. The default Zeppelin shell is a Scala shell, so you should use the interpreter directive %pyspark to run Python code, as in the sketch below.
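For example, a first Zeppelin paragraph could look like this, assuming the Spark interpreter exposes the spark session as it does by default (the S3 path is a placeholder):

%pyspark
# Confirm the notebook is connected to Spark
print(spark.version)

# Read data the same way as in the local example, but from S3 (path is a placeholder)
spark.read.option("multiline","true").json("s3://your-bucket/test.json").show()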

Working with spark-submit on the EMR cluster

Cluster   Mode           Tool
YARN      Cluster Mode   spark-submit

This mode of operation is mostly used for executing Spark applications on your production cluster. Run spark-submit --help to see all available options.

Let’s create and submit a Spark application.

import sys

# Read two integers from the command-line arguments
x = int(sys.argv[1])
y = int(sys.argv[2])

# Add them and print the result
total = x + y
print("The addition is:", total)
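Assuming you save this script as sum.py on the master node (the file name is an assumption), you can submit it to the cluster with spark-submit:

# Submit the application to YARN in cluster mode, passing the two numbers as arguments
spark-submit --master yarn --deploy-mode cluster sum.py 10 20

In cluster mode the print output goes to the YARN application logs rather than to your terminal.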

Todo

https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210705/

git pull && git add . && git commit -m 'adding new notes' && git push