sparklyr — R interface for Apache Spark

Connect to Spark from R — the sparklyr package provides a complete dplyr backend. Filter and aggregate Spark datasets, then bring them into R for analysis and visualization. Orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water. Create extensions that call the full Spark API and provide interfaces to Spark packages.

Toree (Spark Kernel) in OSX El-Capitan

Apache Spark is topping the charts as a reference for Big Data and advanced analytics, billed as a “fast engine for large-scale computing”. In an earlier post, we saw how to use PySpark through the Jupyter notebook's interactive interface. Here we will see how to use the Apache Toree multi-interpreter to run the Spark kernel, SparkR, and Spark SQL as well. The GitHub docs for Toree are still in incubator mode and a work in progress.…

Solr in OSX El-Capitan

STEP 1: On OS X, Solr can be installed from Homebrew. STEP 2: To launch Solr, run: STEP 3: Then open http://localhost:8983/solr in a browser; you will see the Solr admin UI. STEP 4: INDEXING DATA – the Solr server is now up and running, but it doesn’t contain any data. The solr/bin directory includes the post* tool in order to…
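Besides the bundled post* tool, documents can also be indexed over HTTP with Solr's JSON update handler. A minimal Python sketch, assuming Solr is running on the default port; the core name "demo" and the document fields are hypothetical:

```python
import json
from urllib import request  # only needed for the real POST, shown commented out

SOLR_BASE = "http://localhost:8983/solr"

def build_update_request(core, docs, commit=True):
    """Build the URL and JSON body for Solr's /update JSON handler."""
    url = f"{SOLR_BASE}/{core}/update" + ("?commit=true" if commit else "")
    body = json.dumps(docs).encode("utf-8")
    return url, body

url, body = build_update_request("demo", [{"id": "1", "title": "Hello Solr"}])
print(url)  # http://localhost:8983/solr/demo/update?commit=true

# Sending it for real (requires a running Solr with a "demo" core):
# req = request.Request(url, data=body,
#                       headers={"Content-Type": "application/json"})
# request.urlopen(req)
```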

Integrating IPython Notebook with Spark

1. To install Spark, download Apache Spark from here. 2. Extract Spark from the downloaded zip file and place it at the desired location. 3. Create an environment variable named ‘SPARK_HOME’ with a path value like ‘C:\spark’. 4. Download and install the Anaconda Python distribution from here. 5. Open a command prompt and enter the command. This should create a pyspark…
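Once SPARK_HOME is set, Python still has to find Spark's bindings before `import pyspark` will work from a notebook. A sketch of that wiring (the kind of thing packages like findspark automate); the py4j zip name varies by Spark release and is an assumption here:

```python
import os
import sys

def add_pyspark_to_path(spark_home=None):
    """Append Spark's Python bindings to sys.path so `import pyspark` works."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    paths = [
        os.path.join(spark_home, "python"),
        # the py4j version differs per Spark release; this name is an assumption
        os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"),
    ]
    for p in paths:
        if p not in sys.path:
            sys.path.insert(0, p)
    return paths

# mirrors the 'C:\spark' example from the steps above
paths = add_pyspark_to_path("C:\\spark")
```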

PySpark/IPython in OSX El-Capitan

STEP 1: To use Spark on Hadoop, first install Hadoop (see Installing Hadoop on OSX (El-Capitan)); if not already installed, install Homebrew. STEP 2: Then install Spark; brew will install Spark to the directory /usr/local/Cellar/apache-spark/1.5.0/. STEP 3: Create an HDFS directory for the test dataset. STEP 4: Download a sample book for word count. STEP 5: Install Anaconda Python because it contains IPython and that will…
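The word count that the sample book sets up can be sketched first in plain Python, mirroring the flatMap/map/reduceByKey shape the PySpark job takes:

```python
from collections import Counter

def word_count(lines):
    """Count words across lines, like flatMap + map + reduceByKey in Spark."""
    words = (w for line in lines for w in line.split())  # flatMap
    return Counter(words)                                # map + reduceByKey

text = ["to be or not to be"]
print(word_count(text))  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})

# The equivalent PySpark pipeline (needs a SparkContext `sc`; the HDFS
# path is hypothetical):
# counts = (sc.textFile("hdfs:///books/sample.txt")
#           .flatMap(lambda line: line.split())
#           .map(lambda w: (w, 1))
#           .reduceByKey(lambda a, b: a + b))
```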

Distributed Deep Learning Network over Spark

A distributed deep learning network over Spark is becoming an important AI paradigm for pattern recognition, image/video processing, and fraud detection applications. The objective is to parallelize the training phase. Introduction – Geoffrey Hinton presented a paradigm for fast learning in a deep belief network [Hinton 2006]. This paper led to the breakthrough in this field. Consequently,…
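One common way the training phase is parallelized is parameter averaging: each worker computes a gradient on its shard of the data, and a driver averages the updates. A pure-Python sketch with a one-weight linear model; the model, data, and learning rate are illustrative, not the post's actual setup:

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w*x on one worker's data shard."""
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def parallel_sgd_step(w, shards, lr=0.01):
    """One synchronous step: map gradients over workers, reduce on the driver."""
    grads = [local_gradient(w, s) for s in shards]  # map over workers
    avg = sum(grads) / len(grads)                   # reduce (average) on driver
    return w - lr * avg

# data follows y = 3x, split across two "workers"
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
w = 0.0
for _ in range(200):
    w = parallel_sgd_step(w, shards)
print(round(w, 2))  # → 3.0
```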

Why SPARK?

Computation over a large data set across a cluster involves a good amount of network and disk I/O for each of the Hadoop Map/Reduce stages, so most of the time is spent on I/O rather than actual computation, leaving it a very high-latency system. Although MapReduce is a great computing paradigm for distributed programming,…
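The I/O argument can be made concrete with a toy model: chained Map/Reduce jobs pay a disk read and write around every stage, while an in-memory engine loads the data once. A counter-based sketch, purely illustrative rather than a benchmark:

```python
class FakeDisk:
    """Counts simulated disk operations; stands in for HDFS reads/writes."""
    def __init__(self):
        self.io_ops = 0
    def write(self, data):
        self.io_ops += 1
        return data
    def read(self, data):
        self.io_ops += 1
        return data

def mapreduce_pipeline(data, stages, disk):
    """Each stage reads its input from disk and writes its output back."""
    for stage in stages:
        data = disk.write(stage(disk.read(data)))
    return data

def in_memory_pipeline(data, stages, disk):
    """Load once; intermediate results stay in memory, as Spark allows."""
    data = disk.read(data)
    for stage in stages:
        data = stage(data)
    return data

stages = [lambda d: [x * 2 for x in d],
          lambda d: [x + 1 for x in d],
          lambda d: [x for x in d if x > 3]]
mr_disk, mem_disk = FakeDisk(), FakeDisk()
mr_result = mapreduce_pipeline([1, 2, 3], stages, mr_disk)
mem_result = in_memory_pipeline([1, 2, 3], stages, mem_disk)
print(mr_disk.io_ops, mem_disk.io_ops)  # 6 1
```

Same answer either way; the chained-jobs version just touches "disk" six times instead of once, which is the latency gap the post is describing.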