
Oracle Direct Connector for HDFS

Some time back Oracle announced the Oracle Big Data Appliance, and along with it released the Oracle Big Data Connectors to facilitate data access between data stored in a Hadoop cluster and Oracle Database. I tried installing Oracle Direct Connector for HDFS (Hadoop Distributed File System) and have documented the steps below.

To use Oracle Direct Connector we need the Hadoop software installed on the system where the Oracle Database is running, since HDFS is accessed through the preprocessor feature of external tables (available in 11g).
You can download the connector software (from edelivery.oracle.com) and unzip it to a directory. This provides a file named hdfs_stream, which is used in the PREPROCESSOR clause.
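For reference, unpacking the connector might look like this (a minimal sketch; the zip file name is an assumption, the target directory matches the paths used later in this post):

[oracle@oradbdev01]~/product% unzip orahdfs-1.0.0.0.0.zip
[oracle@oradbdev01]~/product% ls orahdfs-1.0.0.0.0/bin
hdfs_stream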

To get started, we need to create a directory object for the external table file location and one for the Oracle Direct Connector bin directory:

SQL> CREATE OR REPLACE DIRECTORY hadoop_test_dir as '/home/oracle/ext_table';
SQL> CREATE OR REPLACE DIRECTORY hdfs_bin_path AS '/home/oracle/product/orahdfs-1.0.0.0.0/bin';
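The directory objects only point at OS paths; those paths must exist and be readable and writable by the oracle user. A quick sketch (paths taken from the directory objects above):

mkdir -p /home/oracle/ext_table
ls -ld /home/oracle/ext_table /home/oracle/product/orahdfs-1.0.0.0.0/bin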

I will be using the scott user, which needs the following permissions:

GRANT CREATE SESSION TO scott;
GRANT EXECUTE ON SYS.UTL_FILE TO scott;
GRANT READ, WRITE ON DIRECTORY hadoop_test_dir TO scott;
GRANT EXECUTE ON DIRECTORY hdfs_bin_path TO scott;

Next, create the external table:

DROP TABLE "SCOTT"."TEST_DATA_EXT_TAB";
CREATE TABLE "SCOTT"."TEST_DATA_EXT_TAB"
  ( OBJECT_ID   NUMBER,
    OBJECT_NAME VARCHAR2(128),
    OBJECT_TYPE VARCHAR2(19)
  )
  ORGANIZATION EXTERNAL
  ( TYPE ORACLE_LOADER
    DEFAULT DIRECTORY HADOOP_TEST_DIR
    ACCESS PARAMETERS
    ( RECORDS DELIMITED BY NEWLINE
      PREPROCESSOR HDFS_BIN_PATH:'hdfs_stream'
      FIELDS TERMINATED BY ','
    )
    LOCATION ( 'test_data1.txt','test_data2.txt','test_data3.txt','test_data4.txt')
  );

Selecting from the table at this point gives an error, as no location files are present yet:

select count(*) from "SCOTT"."TEST_DATA_EXT_TAB"
 *
 ERROR at line 1:
 ORA-29913: error in executing ODCIEXTTABLEOPEN callout
 ORA-29400: data cartridge error
 KUP-04040: file test_data1.txt in HADOOP_TEST_DIR not found

Next, create a configuration file containing the location of the files on HDFS and the database connection string. This will be used to populate the external table's location files:

[oracle@oradbdev01]~/product/orahdfs-1.0.0.0.0/log% cat ../myconf.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.hdfs.exttab.tableName</name>
    <value>SCOTT.TEST_DATA_EXT_TAB</value>
  </property>
  <property>
    <name>oracle.hadoop.hdfs.exttab.datasetPaths</name>
    <value>/user/oracle/TEST_DATA/part*</value>
  </property>
  <property>
    <name>oracle.hadoop.hdfs.exttab.connection.url</name>
    <value>jdbc:oracle:thin:@oradbdev01:1521:tintin1</value>
  </property>
  <property>
    <name>oracle.hadoop.hdfs.exttab.connection.user</name>
    <value>SCOTT</value>
  </property>
</configuration>
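Before publishing, it is worth checking that the dataset paths in the configuration actually match files on HDFS (a quick sanity check using the standard hadoop CLI):

hadoop fs -ls /user/oracle/TEST_DATA/part*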

Set HADOOP_CLASSPATH in hadoop-env.sh to include the Oracle JDBC driver ($ORACLE_HOME/jdbc/lib/ojdbc6.jar, or the ojdbc6.jar shipped under ${OLH_HOME}/jlib):

export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${OLH_HOME}/jlib/oraloader.jar:${OLH_HOME}/jlib/ojdbc6.jar:${OLH_HOME}/jlib/oraclepki.jar
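To verify the driver is picked up, you can inspect the classpath the hadoop command will use (a quick check; assumes hadoop is on the PATH):

hadoop classpath | tr ':' '\n' | grep -i ojdbc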

Run the ExternalTable command as below to create the location files for the external table. It prompts for the password of the database user mentioned in the configuration file:

[oracle@oradbdev01]~/product/orahdfs-1.0.0.0.0/jlib% hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable -conf ../myconf.xml -publish
 [Enter Database Password:]
 Oracle Direct HDFS Release 1.0.0.0.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
SCOTT.TEST_DATA_EXT_TAB publish operation succeeded
Details:
HADOOP_TEST_DIR:test_data2.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00000
HADOOP_TEST_DIR:test_data1.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00001
HADOOP_TEST_DIR:test_data3.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00002
HADOOP_TEST_DIR:test_data4.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00003

The test_data*.txt location files are now populated with the HDFS paths shown above.
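Under the covers, at query time the database invokes hdfs_stream once per location file and reads rows from the script's standard output; the script in turn streams the HDFS files listed in the location file. You can simulate this by hand (a sketch, assuming the location files have been published as above):

[oracle@oradbdev01]~% ~/product/orahdfs-1.0.0.0.0/bin/hdfs_stream /home/oracle/ext_table/test_data1.txt | head -3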

Now let’s select from the table

SQL> select /*+ parallel (a,4) */ count(*) from "SCOTT"."TEST_DATA_EXT_TAB" a;

  COUNT(*)
----------
 153450496

Elapsed: 00:02:13.69

As you can see, we can also use a parallel hint in the query.

The Oracle documentation mentions setting the following parameter in the configuration file:

<property>
  <name>oracle.hadoop.hdfs.exttab.datasetCompressionCodec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

But setting this (when the HDFS files are not compressed) causes no rows to be returned, and the following error is recorded in the log file:

cat directhdfs-log_13387-0126121327579379.log
 12/01/26 12:03:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Error reading data from hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00002: unknown compression method

The solution is to unset the CompressionCodec parameter and publish again; the regenerated location file then shows an empty CompressionCodec value:

[oracle@oradbdev01]~/product/orahdfs-1.0.0.0.0/log% cat /home/oracle/ext_table/test_data1.txt
 CompressionCodec=
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00001

You need to set this parameter only if the HDFS files are compressed. When they are compressed (as is common on Hadoop), we need to use the appropriate codec, e.g. org.apache.hadoop.io.compress.GzipCodec for gzip format, as shown below.
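For example, if the dataset had been written with gzip compression, the property (the same one shown earlier) would look like this:

<property>
  <name>oracle.hadoop.hdfs.exttab.datasetCompressionCodec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>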