Integrating IPython Notebook with Spark

1. To install Spark, download Apache Spark from here

2. Extract Spark from the downloaded archive and place it at the desired location, e.g. C:\spark

3. Create an environment variable named 'SPARK_HOME' whose value is the extraction path, e.g. 'C:\spark'
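
As an alternative to the System Properties dialog, the variable can be set from a command prompt with setx (open a new prompt afterwards, since setx does not affect the current session):

setx SPARK_HOME "C:\spark"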

4. Download and install the Anaconda Python distribution from here

5. Open a command prompt and enter the following command

ipython profile create pyspark

This should create a pyspark profile in which we need to make some changes.

Specifically, we need to edit two configuration files in that profile: ipython_notebook_config.py and 00-pyspark-setup.py.
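
If you are not sure where the profile directory was created, IPython can print its location:

ipython locate profile pyspark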

6. Locate the ipython_notebook_config.py file at C:\Users\username\.ipython\profile_pyspark\ipython_notebook_config.py and add the following lines

c = get_config()
c.NotebookApp.ip = '*'             # listen on all interfaces; use 'localhost' to restrict access
c.NotebookApp.open_browser = True  # open a browser window on startup
c.NotebookApp.port = 8880          # or whatever port you want

7. Now create a file named 00-pyspark-setup.py in C:\Users\username\.ipython\profile_pyspark\startup and add the following contents to it

import glob
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))

# Add py4j to the path; the version must match the zip shipped in
# SPARK_HOME/python/lib (py4j-0.9 for Spark 1.6.1), so pick it up with a
# glob rather than hard-coding the file name
sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip'))[0])

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python', 'pyspark', 'shell.py'))
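
Running shell.py is what gives every notebook a ready-made sc. If you would rather not depend on that script, a minimal alternative is to build the context by hand in the startup file; the master and app name below are illustrative values, not required settings:

from pyspark import SparkConf, SparkContext

# A minimal sketch: create the SparkContext manually instead of running shell.py.
# 'local[2]' and the app name are example values.
conf = SparkConf().setMaster('local[2]').setAppName('ipython-pyspark')
sc = SparkContext(conf=conf)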

To easily display the kernel in Jupyter, add the following code to the kernel.json file in ~/.ipython/kernels/pyspark (on Windows, C:\Users\username\.ipython\kernels\pyspark):

{
  "display_name": "pySpark (Spark 1.6.1)",
  "language": "python",
  "argv": [
    "/usr/bin/python2",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "<spark_dir>",
    "PYTHONPATH": "<spark_dir>/python/:<spark_dir>/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "<spark_dir>/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[2] pyspark-shell"
  }
}
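
Once the file is in place, you can confirm the kernel was registered (this assumes the jupyter command is available on your PATH, which is the case for Jupyter/IPython 4+ installations):

jupyter kernelspec list

The pyspark kernel should appear in the listing.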

8. Now open a command prompt and type the following command to run the IPython notebook

ipython notebook --profile=pyspark

This should launch the IPython notebook in a browser.
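
If the configured port is already in use, it can be overridden at launch time; 8888 here is just an example:

ipython notebook --profile=pyspark --port=8888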

9. Create a notebook, type sc in a cell, and run it. You should see something like

<pyspark.context.SparkContext at 0x4f4a490>

This indicates that your IPython notebook and Spark are successfully integrated. If instead the cell produces blank output, the integration was unsuccessful; in that case, re-check all of the configuration files and paths above. To run a sample, refer to this post
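
Before moving on to that post, a quick smoke test confirms the context actually executes jobs; the numbers here are arbitrary:

# Run a small job on the predefined context 'sc'
rdd = sc.parallelize(range(100))
print(rdd.map(lambda x: x * x).sum())  # expect 328350, the sum of squares of 0..99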
