Processing a large data set across a cluster involves a good amount of network and disk I/O at each Hadoop MapReduce stage. Most of the time is therefore spent on I/O rather than on actual computation, which leaves Hadoop a very high-latency system.

Although MapReduce is a great paradigm for distributed programming, it is not easy to write programs in it. The need for a higher-level abstraction spawned Hive and Pig; any job written in Hive or Pig gets converted to a MapReduce job. That does not solve the high-latency problem, though. Imagine cases where an answer is needed within seconds or the purpose of the analysis is lost, such as fraud detection, spam analysis, or face detection. Many algorithms for this kind of work, as well as for machine learning or computing PageRank, are inherently multipass or iterative. This is another difficulty with Hadoop: each pass runs as a separate MapReduce job that reads its input from disk and writes its output back, so iterative operations pay the I/O cost again on every pass.

Thus there was a need for a low-latency distributed system on which iterative algorithms can run with ease. Spark solves these problems. Spark is part of the Berkeley Data Analytics Stack (BDAS) developed at AMPLab. Other parts of the stack include Mesos, a cluster manager, and Shark, which is Hive on Spark, along with BlinkDB and MLbase.

Key Features:

  1. Speed: a shared, immutable in-memory dataset greatly reduces network and disk I/O.
  2. Consistency comes for free, because the datasets are immutable.
  3. Local mode: things can be tried out on a single box. Once you are comfortable, you can move to a cluster setup. This shortens the learning curve.
  4. REPL: the ability to try things interactively from the command line.
  5. The primary language is Scala, although Java and Python are also supported. Scala is a great language that combines functional programming and OOP. It is a JVM language and can use existing Java libraries.
  6. Fault tolerance without replication, through RDDs (Resilient Distributed Datasets), which can rebuild lost data from the record of how it was derived. Recovery is fast and self-healing.
  7. Interoperability: it can talk to HDFS, S3, EC2, MPI, and even the local filesystem.
  8. Not only MapReduce. Spark's programming model includes mutable accumulators, broadcast variables, and immutable RDDs, along with two types of operations: lazy transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
  9. Persisting or caching a dataset in memory across operations. This significantly increases the speed of subsequent operations on the dataset.
  10. Easy deployment in the cloud, for example on Amazon Web Services.
  11. Can run in different modes: local, standalone, and with Mesos or YARN.
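
To make features 8 and 9 more concrete, here is a minimal single-machine sketch in plain Python of how lazy transformations, actions, and caching fit together. It is not Spark's API: the class `LazyDataset`, its methods, and the `lineage` string are hypothetical names chosen to mirror the terms used in the list, and nothing here is distributed. The only point is to show that transformations merely record a recipe, that an action triggers the actual work, and that a cached dataset is not recomputed by later operations.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy stand-in for an immutable dataset (hypothetical, NOT Spark's RDD API)."""

    def __init__(self, compute: Callable[[], Iterator], lineage: str):
        self._compute = compute              # recipe that rebuilds the data on demand
        self.lineage = lineage               # record of how this dataset was derived
        self._persist = False                # set by cache()
        self._cached: Optional[list] = None  # filled on first computation if persisted

    @classmethod
    def from_list(cls, items: Iterable) -> "LazyDataset":
        data = tuple(items)                  # immutable snapshot of the source data
        return cls(lambda: iter(data), "source")

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._iterate()),
                           self.lineage + " -> map")

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)),
                           self.lineage + " -> filter")

    def cache(self) -> "LazyDataset":
        self._persist = True                 # keep results in memory once computed
        return self

    def _iterate(self) -> Iterator:
        if self._cached is not None:
            return iter(self._cached)        # reuse the in-memory copy
        if self._persist:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()               # recompute from the recipe

    # Actions: run the computation and return a value to the caller.
    def collect(self) -> List:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


calls = []                                   # records every invocation of square()


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset.from_list(range(1, 6))
squares = numbers.map(square).cache()        # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)             # still 0, no action has been called
result = evens.collect()                     # action: triggers map, then filter
calls_after_first = len(calls)               # square() ran once per element
total = squares.count()                      # second action, served from the cache
calls_after_second = len(calls)              # unchanged, nothing was recomputed

print(evens.lineage)                         # source -> map -> filter
print(result, total)                         # [4, 16] 5
```

In this toy model, dropping the `.cache()` call would make `squares.count()` run `square` on every element a second time, which is the recomputation cost that persisting a dataset avoids. The `lineage` string also hints at feature 6: because each dataset remembers how it was derived, lost results can be rebuilt from the source instead of being restored from a replica.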