sparklyr — R interface for Apache Spark

Connect to Spark from R — the sparklyr package provides a complete dplyr backend. Filter and aggregate Spark datasets, then bring them into R for analysis and visualization. Orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water. Create extensions that call the full Spark API and provide interfaces to Spark packages.

Toree (Spark Kernel) on OS X El Capitan

Apache Spark is topping the charts as a reference for Big Data, Advanced Analytics and as a “fast engine for large-scale computing”. In an earlier post, we saw how to use PySpark through the Jupyter notebook’s interactive interface. Here we will see how to use the Apache Toree multi-interpreter and work with the Spark kernel, SparkR and Spark SQL as well. The GitHub docs for Toree are still in incubator mode and a work in progress.…
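As a quick illustration (not from the original post), this is the kind of PySpark cell you could run once a Toree kernel is active. It assumes the kernel has already created the SparkContext as sc, which Toree normally does at startup:

    # Runs inside a notebook backed by a Toree PySpark kernel,
    # where the SparkContext `sc` is created for you at startup.
    data = sc.parallelize(range(1, 101))              # distribute 1..100 across workers
    total = data.sum()                                # action: triggers the computation
    evens = data.filter(lambda x: x % 2 == 0).count()
    print(total, evens)                               # 5050 50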

Integrating IPython Notebook with Spark

1. To install Spark, download Apache Spark from here.
2. Extract Spark from the downloaded zip file and place it at the desired location.
3. Create an environment variable named ‘SPARK_HOME’ with a path value like ‘C:\spark’.
4. Download and install the Anaconda Python distribution from here.
5. Open a command prompt and enter the command. This should create a pyspark…
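The command itself is elided above, so as a hedged alternative sketch, the third-party findspark package can do the same wiring from inside a Python session: it reads the SPARK_HOME variable set in step 3 and makes pyspark importable. The path and app name below are illustrative, not from the post.

    import os
    # Point the session at the Spark install from step 3 (illustrative path).
    os.environ["SPARK_HOME"] = "C:\\spark"

    import findspark
    findspark.init()              # adds $SPARK_HOME/python to sys.path

    from pyspark import SparkContext
    sc = SparkContext("local[*]", "notebook-test")    # local mode, all cores
    print(sc.parallelize([1, 2, 3]).count())          # 3
    sc.stop()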

PySpark/IPython on OS X El Capitan

STEP 1: To use Spark on Hadoop, first install Hadoop (see Installing Hadoop on OSX (El-Capitan)). If not already installed, install Homebrew.
STEP 2: Then install Spark. brew will install Spark to the directory /usr/local/Cellar/apache-spark/1.5.0/.
STEP 3: Create an HDFS directory for the test dataset.
STEP 4: Download a sample book for Word Count.
STEP 5: Install Anaconda Python, because it contains IPython and that will…
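To make the later steps concrete, here is a minimal word-count sketch of the kind the post builds toward; the HDFS URL and book filename are placeholders, not the post's actual values.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")

    # Placeholder HDFS location for the sample book from STEP 4.
    lines = sc.textFile("hdfs://localhost:9000/user/test/book.txt")

    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))     # sum the counts per word

    for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, n)

    sc.stop()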

Distributed Deep Learning Network over Spark

Distributed Deep Learning Network over Spark is becoming an important AI paradigm for pattern recognition, image/video processing and fraud detection applications. The objective is to parallelize the training phase. Introduction – Geoffrey Hinton presented the paradigm for fast learning in a deep belief network [Hinton 2006]. This paper led to the breakthrough in this field. Consequently,…
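The post's architecture details are elided here, but one common way to parallelize the training phase on Spark is data-parallel parameter averaging: each partition computes a gradient over its shard of the data, and the driver averages them into a single update. Below is a minimal sketch with logistic regression standing in for a deep network; the dataset, learning rate and iteration count are all illustrative.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "param-averaging")

    # Toy dataset of (features, label) pairs; a real job would load shards from HDFS.
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 10)
    y = (X.dot(rng.randn(10)) > 0).astype(float)
    data = sc.parallelize(list(zip(X, y)), 4).cache()

    w = np.zeros(10)                      # shared model parameters
    lr = 0.1

    def partition_gradient(rows, w):
        # Logistic-regression gradient over one partition's shard.
        g, n = np.zeros_like(w), 0
        for x, label in rows:
            p = 1.0 / (1.0 + np.exp(-x.dot(w)))
            g += (p - label) * x
            n += 1
        return [(g, n)]

    for step in range(20):
        w_b = sc.broadcast(w)             # ship current weights to every executor
        g, n = (data.mapPartitions(lambda it: partition_gradient(it, w_b.value))
                    .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1])))
        w = w - lr * g / n                # driver applies the averaged gradient

    sc.stop()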

Why SPARK?

Computation over a large data set across a cluster involves a good amount of network and disk I/O for each of the Hadoop Map/Reduce stages, so most of the time is spent on I/O rather than on actual computation, leaving it still a very high-latency system. Although map-reduce is a great computing paradigm for distributed programming,…
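Spark addresses exactly this by keeping intermediate results in memory. As a small hedged sketch (the log path is illustrative), an RDD cached once can serve several subsequent actions without the per-stage disk round-trips of a Map/Reduce chain:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "caching-demo")

    logs = sc.textFile("hdfs://localhost:9000/logs/*.log")      # illustrative path
    errors = logs.filter(lambda line: "ERROR" in line).cache()  # keep in memory

    # Each action below reuses the in-memory partitions instead of
    # re-reading and re-filtering from disk, as a Map/Reduce chain would.
    print(errors.count())
    print(errors.filter(lambda line: "timeout" in line).count())
    print(errors.take(5))

    sc.stop()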

Spark Streaming vs. Storm Streaming

Overview – Both Storm and Spark Streaming are open-source frameworks for distributed stream processing. But there are important differences. Processing Model, Latency – Although both frameworks provide scalability and fault tolerance, they differ fundamentally in their processing model: Storm processes incoming events one at a time, while Spark Streaming batches up events that arrive within a short time…
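To make the micro-batch model concrete, here is a hedged Spark Streaming sketch: lines arriving on a socket are grouped into 1-second batches, and each batch is processed as a small RDD. The host and port are placeholders (nc -lk 9999 can feed test input).

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # local[2]: one thread for the receiver, one for processing.
    sc = SparkContext("local[2]", "microbatch-wordcount")
    ssc = StreamingContext(sc, 1)                 # 1-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                               # print each batch's counts

    ssc.start()                 # begin consuming the stream
    ssc.awaitTermination()      # block until stopped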