BDD: “Face of Hadoop”

Oracle Big Data Discovery was released last week, the latest addition to Oracle’s big data tools suite, which includes Oracle Big Data SQL, ODI and its Hadoop capabilities, and Oracle GoldenGate for Big Data 12c. Introduced by Oracle as “the visual face of Hadoop”, Big Data Discovery combines the data discovery and visualisation elements of Oracle Endeca Information Discovery with data loading and transformation features built on Apache Spark. The result is a tool aimed at the “Discovery Lab” part of the Oracle Big Data and Information Management Reference Architecture.


Endeca Information Discovery (OEID) can be viewed in two main ways:
– As a data discovery tool for textual and unstructured data that complements the more structured analysis capabilities of OBI.
– As a fast click-and-refine data exploration tool similar to QlikView and Tableau.

OEID will continue as a standalone product. Its data discovery and unstructured data analysis parts are making their way into a new product called Oracle Big Data Discovery, and the fast click-and-refine features will be part of Visual Analyzer in OBIEE 12c.

Big Data Discovery is basically “Endeca on Hadoop”. Endeca has three main parts. Data loading is performed using EID Integrator or Studio upload. Data is then ingested into Endeca Server, where it is stored in a key/value-store NoSQL database, indexed, parsed and enriched. Finally, it is analyzed using the GUI provided by Studio.
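To make the key/value storage and click-and-refine idea more concrete, here is a toy sketch in plain Python. It is not how Endeca Server is implemented. The records, attribute names and the `build_index`, `refine` and `facet_counts` helpers are all hypothetical, and the index is a simple in-memory map from attribute/value pairs to record ids.

```python
from collections import Counter, defaultdict

# Hypothetical records: each is a loose bag of attribute -> value pairs,
# so records do not need to share a fixed schema.
records = [
    {"id": 1, "type": "tweet", "lang": "en", "topic": "hadoop"},
    {"id": 2, "type": "tweet", "lang": "fr", "topic": "spark"},
    {"id": 3, "type": "review", "lang": "en", "topic": "hadoop", "rating": 4},
    {"id": 4, "type": "review", "lang": "en", "topic": "spark"},
]


def build_index(recs):
    """Map each (attribute, value) pair to the set of record ids holding it."""
    index = defaultdict(set)
    for rec in recs:
        for attr, value in rec.items():
            if attr != "id":
                index[(attr, value)].add(rec["id"])
    return index


def refine(index, all_ids, selections):
    """Intersect the id sets for every selected (attribute, value) pair."""
    ids = set(all_ids)
    for pair in selections:
        ids &= index.get(pair, set())
    return ids


def facet_counts(recs, ids, attr):
    """Count the values of one attribute among the currently selected records."""
    return Counter(r[attr] for r in recs if r["id"] in ids and attr in r)


index = build_index(records)
all_ids = {r["id"] for r in records}

# First "click": keep English records only.
english = refine(index, all_ids, [("lang", "en")])
# Second "click": narrow further to the hadoop topic.
english_hadoop = refine(index, all_ids, [("lang", "en"), ("topic", "hadoop")])
# Facet counts shown next to the remaining records.
types_in_english = facet_counts(records, english, "type")
```

Each refinement is just a set intersection against the index, which is why this style of exploration can feel instant on modest data volumes.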

BDD is made up of three component types:

  • The Studio web GUI, which combines the faceted search and data discovery parts of EID Studio with a lightweight data transformation capability.
  • The DGraph Gateway, which brings Endeca Server search & analytical capabilities to the world of Hadoop.
  • The Data Processing component, which runs on each of the Hadoop nodes and uses Hive’s HCatalog to read Hive table metadata and Apache Spark to load and transform data in the cluster.

Studio can run across several nodes for high availability and load balancing. DGraph can run on a single node, or in a cluster with a single “leader” node and multiple “follower” nodes for enhanced availability and throughput. DGraph works with Apache Spark to run intensive search and analytics on subsets of the whole Hadoop dataset: sample sets of data are moved into the DGraph engine, and any resulting transformations are then applied to the whole Hadoop dataset using Apache Spark.
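The sample-then-apply pattern described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not BDD's API. The dataset, the `take_sample` helper and the `normalise_city` transformation are hypothetical, and an ordinary list comprehension stands in for the Spark job that would process the full dataset in the cluster.

```python
import random


def take_sample(rows, size, seed=0):
    """Pick a reproducible random subset to explore interactively."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


def normalise_city(row):
    """Example transformation: trim whitespace and fix the casing of 'city'."""
    out = dict(row)  # copy, so the source row is left untouched
    out["city"] = row["city"].strip().title()
    return out


# Stand-in for the full dataset held in Hadoop.
raw_cities = [" london", "PARIS ", "london", "new york", " Paris"]
full_dataset = [{"id": i, "city": c} for i, c in enumerate(raw_cities)]

# Step 1: explore and design the transformation on a small sample.
sample = take_sample(full_dataset, size=2)
preview = [normalise_city(r) for r in sample]

# Step 2: once the preview looks right, apply the same function to everything.
transformed = [normalise_city(r) for r in full_dataset]
```

The point of the design is that the expensive, whole-dataset pass only happens once, after the transformation has been tried out cheaply on the sample.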