Our objective today is to ingest a log file and explore it in BDD. First, we create a Hive table over the web server log file, with a DDL somewhat like this:
CREATE EXTERNAL TABLE mylog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION '/user/oracle/mylogs';
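Before loading data through the SerDe, it can be worth sanity-checking the regex outside Hive. The sketch below does this with Python's re module, using the same pattern written as a raw string (so no doubled backslashes) against a made-up line in Apache Combined Log Format:

```python
import re

# Same pattern as in input.regex, as a Python raw string.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# A made-up sample line in Apache Combined Log Format.
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')

m = LOG_PATTERN.match(line)
host, identity, user, time, request, status, size, referer, agent = m.groups()
print(host, status, size)  # 127.0.0.1 200 2326
```

Each capture group maps onto one column of the Hive table, in the order they appear in the DDL.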
Next, use a Hive CTAS (CREATE TABLE AS SELECT) statement to create the table that will be ingested into BDD:
CREATE TABLE ACCESS_LOGS AS SELECT * FROM APLOG;
If the BDD Hive Table Detector is running, it should pick up the presence of the new Hive table and ingest it into BDD automatically. Alternatively, we can trigger the ingestion manually from the Data Processing CLI on the BDD node:
cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
./data_processing_CLI -t access_logs
This data processing step creates an Oozie job that samples a subset of the data into Apache Spark (with a 1% sample providing 95% accuracy), which is then profiled, enriched and loaded into the BDD DGraph engine for further transformation, exploration and analysis in BDD Studio.
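The exact sampling mechanics are internal to BDD, but conceptually the 1% sample is a per-row Bernoulli draw, which can be sketched as follows (an illustration only, not BDD's actual implementation; the seed is just for reproducibility):

```python
import random

def sample_rows(rows, fraction=0.01, seed=42):
    """Keep each row independently with the given probability."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))   # stand-in for the rows of a Hive table
sample = sample_rows(rows)
print(len(sample))            # roughly 1,000 rows
```

The point of sampling is that profiling and interactive exploration run against this small, representative subset rather than the full table.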
The profiling step in this process scans the incoming data and helps BDD determine the data type of each Hive table column, the distribution of values within each column and so on, whilst the enrichment step identifies keywords, phrases and other key lexical facts about the dataset. Once ingested, the new dataset is listed alongside the others, as below:
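To make the profiling idea concrete, here is a toy version of what such a step might compute for one column: an inferred type plus some simple value statistics. This is purely illustrative and not BDD's actual profiling algorithm:

```python
from collections import Counter

def profile_column(values):
    """Infer a rough type and value distribution for a column of strings."""
    def infer(v):
        try:
            int(v)
            return 'int'
        except ValueError:
            try:
                float(v)
                return 'double'
            except ValueError:
                return 'string'

    type_counts = Counter(infer(v) for v in values)
    return {
        'inferred_type': type_counts.most_common(1)[0][0],
        'distinct_values': len(set(values)),
        'top_values': Counter(values).most_common(3),
    }

# e.g. profiling the HTTP status column of the sampled log data
print(profile_column(['200', '200', '404', '200', '500']))
```

A real profiler would also handle nulls, dates and mixed-type columns, but the shape of the output is similar: per-column metadata that drives how BDD displays and indexes the data.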
Unfortunately, in this raw form the data in the access_logs table isn't all that useful. BDD therefore provides tools to cleanse, transform and enrich the data, with menu items for common transformations and a Groovy script editor for more complex ones, such as deriving sentiment values from textual data or stripping HTML and formatting characters out of text.
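One of the transformations mentioned above, stripping HTML and formatting characters from text, can be sketched in a few lines of plain Python (BDD exposes this as a built-in transform; this is just an illustration of the idea):

```python
import re

def strip_html(text):
    """Remove HTML tags and collapse formatting whitespace."""
    no_tags = re.sub(r'<[^>]+>', ' ', text)        # drop HTML tags
    return re.sub(r'\s+', ' ', no_tags).strip()    # collapse whitespace/newlines

print(strip_html('<p>Hello,\n  <b>world</b>!</p>'))  # Hello, world !
```

In BDD itself the equivalent transform runs against every value of a chosen attribute, with a preview of the result before you commit it.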
Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sampled dataset in the BDD DGraph engine, or apply the transformation rules you’ve defined to the entire Hive table contents back on Hadoop, with the transformation work being done by Apache Spark. Datasets are loaded into “projects”, and each project can have its own transformed view of the raw data, with a copy of the dataset kept in the BDD DGraph engine to represent each team’s specific view of the raw data.
As a result, data cleansing becomes a matter of hours rather than the days it would take with manual Hive, Pig and Spark scripting, and the resulting data is far more usable and easier to understand.