PySpark/IPython on OS X El Capitan

STEP 1: Spark will run on top of Hadoop, so first install Hadoop (see the earlier post, Installing Hadoop on OSX (El-Capitan)). If you don't already have it, install Homebrew as well.

STEP 2: Then Install Spark

$ brew search spark
$ brew install apache-spark

Homebrew installs Spark under /usr/local/Cellar/apache-spark/1.5.0/


STEP 3: Create an HDFS directory for the test dataset

$ hdfs dfs -mkdir /Python

STEP 4: Download a sample book for the Word Count example

$ wget
$ mv 30760-0.txt book.txt
$ hdfs dfs -put book.txt /Python/
$ hdfs dfs -ls /Python/

STEP 5: Install Anaconda Python; it ships with IPython, which makes working with Python much easier. After installing, create a pyspark profile as described in this post, then launch the IPython notebook from the terminal:

$ IPYTHON_OPTS="notebook" pyspark

This starts an IPython kernel, creates a SparkContext (available in the notebook as sc), and automatically opens a browser window with the notebook interface. In the top-right corner click New Notebook, paste the following PySpark word-count example into a cell, and run it. Voila!

# Read the book from HDFS (each element of the RDD is one line of text);
# adjust the host/port to match your Hadoop configuration
words = sc.textFile("hdfs://localhost:9000/Python/book.txt")

# Sanity check: peek at the first five lines that start with a space
words.filter(lambda w: w.startswith(" ")).take(5)

# Word count: split each line into words, emit a (word, 1) pair per word,
# then sum the counts for each distinct word
counts = words.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
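For intuition, the flatMap / map / reduceByKey pipeline above computes the same result as this plain-Python sketch (no Spark needed; the sample lines here are made up for illustration):

```python
from collections import Counter

def word_count(lines):
    """Mimic the Spark pipeline: flatMap(split) -> map((word, 1)) -> reduceByKey(+)."""
    counts = Counter()
    for line in lines:                 # each line of the file is one record
        for word in line.split(" "):   # flatMap: one record per word
            counts[word] += 1          # reduceByKey: sum the 1s per word
    return dict(counts)

print(word_count(["a b a", "b c"]))
```

In Spark the per-word sums happen in parallel across partitions, but the logic is the same fold over (word, 1) pairs.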


Additional Resources

Sample Datasets