PySpark/IPython in OSX El-Capitan

STEP 1: To use Spark on Hadoop, first install Hadoop (see Installing Hadoop on OSX (El-Capitan)). If you haven't already, install Homebrew first.

STEP 2: Then Install Spark

$ brew search spark
$ brew install apache-spark

brew will install Spark to the directory /usr/local/Cellar/apache-spark/1.5.0/
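Homebrew usually symlinks the pyspark launcher into /usr/local/bin; if it isn't on your PATH, you can point your shell at the install directory manually. This is a sketch, and the version number below is an assumption that may differ on your machine; check what brew actually installed first.

```shell
# Adjust the version to match `brew info apache-spark` on your machine.
export SPARK_HOME="/usr/local/Cellar/apache-spark/1.5.0/libexec"
export PATH="$SPARK_HOME/bin:$PATH"
```

Add these lines to ~/.bash_profile so they persist across terminal sessions.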


STEP 3: Create an HDFS directory for the test dataset

$ hdfs dfs -mkdir /Python

STEP 4: Download a sample book for Word Count

$ wget http://www.gutenberg.org/files/30760/30760-0.txt
$ mv 30760-0.txt book.txt
$ hdfs dfs -put book.txt /Python/
$ hdfs dfs -ls /Python/

STEP 5: Install the Anaconda Python distribution (http://continuum.io/downloads); it bundles IPython, which makes working with Python much easier. After installing, create a pyspark profile as described in this post, then launch the IPython notebook by running the following in the terminal:

$ IPYTHON_OPTS="notebook" pyspark

This starts the IPython kernel, creates a SparkContext connected to HDFS, and automatically opens a browser window with the Python Notebook. In the top-right corner click New Notebook, paste the following PySpark word-count example, and run the cell. Voila!

# Load the book from HDFS as an RDD of lines
lines = sc.textFile("hdfs://localhost:9000/Python/book.txt")
# Peek at a few indented lines to confirm the file loaded
lines.filter(lambda line: line.startswith(" ")).take(5)
# Classic word count: split each line into words, pair each word with 1, then sum per word
counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
#counts.saveAsTextFile("hdfs://localhost:9000/Python/spark_output")
counts.collect()
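If you want to sanity-check what the Spark pipeline computes before running it on the cluster, the flatMap/map/reduceByKey chain is equivalent to this plain-Python sketch (a local illustration only, not part of the Spark job):

```python
# Local equivalent of the Spark word count: split lines into words
# and tally them, mirroring flatMap -> map -> reduceByKey.
from collections import Counter

def word_count(lines):
    counts = Counter()
    for line in lines:
        for word in line.split(" "):  # same split as the Spark example
            counts[word] += 1
    return dict(counts)

sample = ["to be or", "not to be"]
print(word_count(sample))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Running the Spark version on book.txt produces the same (word, count) pairs, just computed in parallel across partitions.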


Additional Resources
https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks
https://spark.apache.org/examples.html

Sample Datasets
https://scans.io/series/modbus-full-ipv4
http://www.gutenberg.org/
http://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki
