Apache Drill with ODI 12c

Here we will see how Drill works with ODI, using simple joins between different sources.

Prerequisites:

  1. Some familiarity with the Hadoop ecosystem. For more about Drill, refer to the documentation.
  2. You have VirtualBox with the CDH or Oracle BigDataLite appliance already imported and running.
  3. Hadoop and Hive services are up and running.

Drill is an open-source Apache SQL query engine for Big Data exploration. It is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language. In that sense it is quite similar to Cloudera Impala or Facebook Presto.

Check the Java version; it should be 1.7 or higher.

[oracle@bigdatalite ~]$ java -version
java version "1.8.0_111"

Download Drill, using either wget or curl:

[oracle@bigdatalite ~]$ wget http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz

[oracle@bigdatalite ~]$ curl -o apache-drill-1.10.0.tar.gz http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz

The file is downloaded to /home/oracle. Copy it to the directory where you want to install Drill, e.g. /etc/drill.
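
If the target directory does not exist yet, create it first (assuming, as in the rest of this walkthrough, that Drill is installed under /etc/drill):

[oracle@bigdatalite ~]$ sudo mkdir -p /etc/drill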

[oracle@bigdatalite ~]$ su
Password:

[root@bigdatalite oracle]# cd /etc/drill

[root@bigdatalite drill]# tar -xvzf /home/oracle/apache-drill-1.10.0.tar.gz
[root@bigdatalite drill]# ls
apache-drill-1.10.0
[root@bigdatalite drill]#

Before using the Drill shell, let's take a look at the drill-override.conf file, just to ensure we have a unique cluster ID and that the ZooKeeper port is correctly configured.

[root@bigdatalite apache-drill-1.10.0]# pwd
/etc/drill/apache-drill-1.10.0
[root@bigdatalite apache-drill-1.10.0]# ls
bin   git.properties  KEYS     log     README.md    winutils
conf  jars            LICENSE  NOTICE  sample-data
[root@bigdatalite apache-drill-1.10.0]# cd conf/
[root@bigdatalite conf]# ls
core-site-example.xml  drill-env.sh         drill-override-example.conf
distrib-env.sh         drill-override.conf  logback.xml
[root@bigdatalite conf]# cat drill-override.conf
# Licensed to the Apache Software Foundation (ASF) under one or more
# See 'drill-override-example.conf' for example configurations

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181"
}
[root@bigdatalite conf]#
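
These defaults are fine for embedded mode. On a real cluster you would point zk.connect at your ZooKeeper quorum instead; a sketch, with placeholder hostnames:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "zk1:2181,zk2:2181,zk3:2181"
}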

Starting the Drill shell:

So far everything looks fine. Let's start the Drill shell using:

env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /etc/drill/apache-drill-1.10.0/bin/drill-embedded

[root@bigdatalite oracle]# env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /etc/drill/apache-drill-1.10.0/bin/drill-embedded
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Apr 02, 2017 3:56:59 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"json ain't no thang"
0: jdbc:drill:zk=local>

Now open a browser and go to http://localhost:8047/. You will see the Drill home page.


Then click on Storage and enable the hive and dfs plugins if they are listed under Disabled Storage Plugins.


Click the Update button for dfs and update the configuration as shown below. We will do the same for hive.


Change the configuration wherever required and save it. Drill is now ready for use.
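
For reference, the dfs configuration is a JSON document along the lines of the minimal sketch below. The connection URL hdfs://localhost:8020 is an assumption for the BigDataLite appliance's namenode; adjust it to your cluster (the default file:/// points at the local file system):

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:8020/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "json": {
      "type": "json"
    },
    "parquet": {
      "type": "parquet"
    }
  }
}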

Using JSON files:

Run the show files command.

0: jdbc:drill:zk=local> show files;


Let's execute the following select statement. I will explain this query shortly.

SELECT employee_id, full_name,first_name,last_name,position_title FROM cp.`employee.json` LIMIT 5

0: jdbc:drill:zk=local> SELECT employee_id, full_name,first_name,last_name,position_title FROM cp.`employee.json` LIMIT 5;
+--------------+------------------+-------------+------------+-------------------------+
| employee_id  |    full_name     | first_name  | last_name  |     position_title      |
+--------------+------------------+-------------+------------+-------------------------+
| 1            | Sheri Nowmer     | Sheri       | Nowmer     | President               |
| 2            | Derrick Whelply  | Derrick     | Whelply    | VP Country Manager      |
| 4            | Michael Spence   | Michael     | Spence     | VP Country Manager      |
| 5            | Maya Gutierrez   | Maya        | Gutierrez  | VP Country Manager      |
| 6            | Roberta Damstra  | Roberta     | Damstra    | VP Information Systems  |
+--------------+------------------+-------------+------------+-------------------------+
5 rows selected (0.789 seconds)
0: jdbc:drill:zk=local>

Well, what we queried above was just a JSON file named employee.json, located inside ./jars/3rdparty/foodmart-data-json.0.4.jar. The cp prefix is not a command but the classpath storage plugin, and the JSON file is present inside that jar file.
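
If you want to verify that yourself, you can list the jar's contents from the installation directory; a quick check (the jar path is the one mentioned above):

[root@bigdatalite apache-drill-1.10.0]# jar tf jars/3rdparty/foodmart-data-json.0.4.jar | grep employee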

Using PARQUET files:

To query a Parquet file, we need to move the sample-data directory from the installation location to an HDFS location using the commands below.

[root@bigdatalite apache-drill-1.10.0]# hadoop fs -put sample-data /user/bhabani/

[root@bigdatalite apache-drill-1.10.0]# hadoop fs -ls /user/bhabani
Found 4 items
-rwxrwxrwx   1 oracle supergroup  45 2017-04-02 15:06 /user/bhabani/demo.tbl
-rwxrwxrwx   1 oracle supergroup  41 2017-04-02 14:32 /user/bhabani/demo.tsv
-rwxrwxrwx   1 oracle supergroup  53 2017-04-02 15:18 /user/bhabani/demo1.csv
drwxrwxrwx   - oracle supergroup   0 2017-04-02 14:20 /user/bhabani/sample-data

We are now ready to query Parquet files from the Drill shell.

0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/sample-data/nation.parquet`;
+--------------+------------+--------------+-----------------------+
| N_NATIONKEY  |   N_NAME   | N_REGIONKEY  |       N_COMMENT       |
+--------------+------------+--------------+-----------------------+
| 0            | ALGERIA    | 0            | haggle. carefully f   |
| 1            | ARGENTINA  | 1            | al foxes promise sly  |
| 2            | BRAZIL     | 1            | y alongside of the p  |
| 3            | CANADA     | 1            | eas hang ironic, sil  |
| 4            | EGYPT      | 4            | y above the carefull  |
| 5            | ETHIOPIA   | 0            | ven packages wake qu  |
| 6            | FRANCE     | 3            | refully final reques  |
| 7            | GERMANY    | 3            | l platelets. regular  |
| 8            | INDIA      | 2            | ss excuses cajole sl  |
+--------------+------------+--------------+-----------------------+
25 rows selected (0.324 seconds)
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/sample-data/region.parquet`;
+--------------+--------------+-----------------------+
| R_REGIONKEY  |    R_NAME    |       R_COMMENT       |
+--------------+--------------+-----------------------+
| 0            | AFRICA       | lar deposits. blithe  |
| 1            | AMERICA      | hs use ironic, even   |
| 2            | ASIA         | ges. thinly even pin  |
| 3            | EUROPE       | ly final courts cajo  |
| 4            | MIDDLE EAST  | uickly special accou  |
+--------------+--------------+-----------------------+
5 rows selected (0.275 seconds)
0: jdbc:drill:zk=local>

Using TEXT files:

What about text files? Yes, we can query those as well. Just note one thing first; we will also see, right after the list below, how Drill can convert between these formats.

Drill supports the following file types:

  - Plain text files:
      - Comma-separated values (CSV, type: text)
      - Tab-separated values (TSV, type: text)
      - Pipe-separated values (PSV, type: text)
  - Structured data files:
      - Avro (type: avro) (this file type is experimental)
      - JSON (type: json)
      - Parquet (type: parquet)
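
Drill can also convert between these formats with CREATE TABLE AS (CTAS). A minimal sketch, assuming the dfs plugin's tmp workspace is writable (it is in the default configuration shown earlier) and store.format is left at its parquet default; the table name employee_parquet is just an example:

0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.`employee_parquet` AS
. . . . . . . . . . . > SELECT employee_id, full_name, position_title FROM cp.`employee.json`;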

So I will create a CSV file locally and then move it to an HDFS location.

[oracle@bigdatalite ~]$ pwd
/home/oracle
[oracle@bigdatalite ~]$ cat demo1.csv
N_NAME,N_LOCATION
BHA,INDIA
KING,CANADA
SCOTT,BRAZIL
[oracle@bigdatalite ~]$ hadoop fs -cat /user/bhabani/demo1.csv
N_NAME,N_LOCATION
BHA,INDIA
KING,CANADA
SCOTT,BRAZIL
[oracle@bigdatalite ~]$

Now let's query this file from the Drill shell.

0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/demo1.csv`;
+--------------------------+
|         columns          |
+--------------------------+
| ["N_NAME","N_LOCATION"]  |
| ["BHA","INDIA"]          |
| ["KING","CANADA"]        |
| ["SCOTT","BRAZIL"]       |
+--------------------------+
4 rows selected (0.726 seconds)
0: jdbc:drill:zk=local>
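
Notice that every row, including the header, comes back as a single columns array, because the default csv format does not extract headers (the text format does have an extractHeader option if you want named columns). Until then, individual fields can be pulled out by index and aliased; a sketch:

0: jdbc:drill:zk=local> SELECT columns[0] AS n_name, columns[1] AS n_location
. . . . . . . . . . . > FROM dfs.`/user/bhabani/demo1.csv`;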

Sounds good? Then let's make it more interesting! What if I would like to join a Parquet and a CSV file? Can we do that? Let's check it out with the query below.

Using joins between files:

select tbl1.N_NAME, tbl1.N_COMMENT,tbl2.columns[0] as nm from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
join
dfs.`/user/bhabani/demo1.csv` as tbl2
on tbl1.N_NAME=tbl2.columns[1];

0: jdbc:drill:zk=local> select tbl1.N_NAME, tbl1.N_COMMENT,tbl2.columns[0] as nm from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
. . . . . . . . . . . > join
. . . . . . . . . . . > dfs.`/user/bhabani/demo1.csv` as tbl2
. . . . . . . . . . . > on tbl1.N_NAME=tbl2.columns[1];
SLF4J: Failed to load class “org.slf4j.impl.StaticLoggerBinder”.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+---------+-----------------------+--------+
| N_NAME  |       N_COMMENT       |   nm   |
+---------+-----------------------+--------+
| BRAZIL  | y alongside of the p  | SCOTT  |
| CANADA  | eas hang ironic, sil  | KING   |
| INDIA   | ss excuses cajole sl  | BHA    |
+---------+-----------------------+--------+
3 rows selected (2.488 seconds)
0: jdbc:drill:zk=local>

As you saw, I joined the CSV and Parquet files on the country name, because it is the only common column present in both data sets, and then displayed two columns from the Parquet file and one column from the CSV file.
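
If you also want to see the nations that have no match in the CSV file, the same query can be written as a left outer join; a sketch (unmatched rows will simply show null for nm):

select tbl1.N_NAME, tbl1.N_COMMENT, tbl2.columns[0] as nm from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
left join
dfs.`/user/bhabani/demo1.csv` as tbl2
on tbl1.N_NAME = tbl2.columns[1];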

Using Hive:

We will connect to Hive from the Drill shell and execute a select statement as shown below.

use hive;
select cust_id,first_name,last_name from cust limit 3;
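
Alternatively, you can skip the USE statement and qualify the table with the plugin name; the query below is equivalent, assuming the hive plugin was enabled earlier:

select cust_id, first_name, last_name from hive.cust limit 3;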


Using joins between file and hive:

The last example for this post is to join a CSV file with a Hive table. Use the query below to create a join based on the country and location columns.

select tbl1.cust_id, tbl1.city,tbl2.columns[0] as nam from cust as tbl1
join
dfs.`/user/bhabani/demo2.csv` as tbl2
on tbl1.country=tbl2.columns[1] limit 5;


That's it for today. I suggest you practice some basic commands and get your hands dirty with all types of joins. Once you are familiar with the commands, we can use them in ODI. Though ODI will do the work for you, knowing the underlying working principle will help you accomplish your objective quickly and seamlessly.

