Oracle Direct Connector for HDFS
Some time back Oracle announced the Oracle Big Data Appliance. Along with it, Oracle released Big Data Connectors to facilitate data access between data stored in a Hadoop cluster and Oracle Database. I tried installing Oracle Direct Connector for HDFS (Hadoop Distributed File System) and have documented the steps below.
To use Oracle Direct Connector, we need to install the Hadoop software on the system where Oracle Database is running. HDFS is accessed using the preprocessor feature of external tables (available in 11g).
You can download the software (from edelivery.oracle.com) and unzip it to a directory. This provides a file named hdfs_stream, which will be used in the PREPROCESSOR clause.
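For reference, a minimal sketch of that step (the zip file name is my assumption; use the actual archive you downloaded):

cd /home/oracle/product
# assumed archive name -- substitute the one from edelivery.oracle.com
unzip orahdfs-1.0.0.0.0.zip
# hdfs_stream is the preprocessor script we will reference below
ls orahdfs-1.0.0.0.0/bin/hdfs_stream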
To get started, we need to create directory objects for the external table file location and the Oracle Direct Connector bin directory.
SQL> CREATE OR REPLACE DIRECTORY hadoop_test_dir AS '/home/oracle/ext_table';
SQL> CREATE OR REPLACE DIRECTORY hdfs_bin_path AS '/home/oracle/product/orahdfs-1.0.0.0.0/bin';
I will be using the SCOTT user, and we need to give it the following grants:
GRANT CREATE SESSION TO scott;
GRANT EXECUTE ON SYS.UTL_FILE TO scott;
GRANT READ, WRITE ON DIRECTORY hadoop_test_dir TO scott;
GRANT EXECUTE ON DIRECTORY hdfs_bin_path TO scott;
Create the external table
DROP TABLE "SCOTT"."TEST_DATA_EXT_TAB";
CREATE TABLE "SCOTT"."TEST_DATA_EXT_TAB"
  (OBJECT_ID    NUMBER,
   OBJECT_NAME  VARCHAR2(128),
   OBJECT_TYPE  VARCHAR2(19))
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY HADOOP_TEST_DIR
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR HDFS_BIN_PATH:'hdfs_stream'
    FIELDS TERMINATED BY ',' )
  LOCATION ('test_data1.txt','test_data2.txt','test_data3.txt','test_data4.txt')
);
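The HDFS data files are expected to hold comma-separated rows matching this column list. For illustration, rows would look something like this (made-up sample values):

20,ICOL$,TABLE
46,I_USER1,INDEX
28,CON$,TABLE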
Selecting from the table gives an error, as no location files are present yet:
select count(*) from "SCOTT"."TEST_DATA_EXT_TAB"
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-04040: file test_data1.txt in HADOOP_TEST_DIR not found
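Also note that the dataset itself must already exist on HDFS under the path we will reference in the configuration file. A sketch of staging data manually (in my case the part-m-* files came out of a MapReduce job, but a plain copy works too; the local file name is an assumption):

hadoop fs -mkdir /user/oracle/TEST_DATA
hadoop fs -put /tmp/test_data.csv /user/oracle/TEST_DATA/
hadoop fs -ls /user/oracle/TEST_DATA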
Next, create a configuration file with the location of the files on HDFS and the database connection string. This will be used to populate the dummy external table location files.
[oracle@oradbdev01]~/product/orahdfs-1.0.0.0.0/log% cat ../myconf.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.hdfs.exttab.tableName</name>
    <value>SCOTT.TEST_DATA_EXT_TAB</value>
  </property>
  <property>
    <name>oracle.hadoop.hdfs.exttab.datasetPaths</name>
    <value>/user/oracle/TEST_DATA/part*</value>
  </property>
  <property>
    <name>oracle.hadoop.hdfs.exttab.connection.url</name>
    <value>jdbc:oracle:thin:@oradbdev01:1521:tintin1</value>
  </property>
  <property>
    <name>oracle.hadoop.hdfs.exttab.connection.user</name>
    <value>SCOTT</value>
  </property>
</configuration>
Set HADOOP_CLASSPATH in hadoop-env.sh to include $ORACLE_HOME/jdbc/lib/ojdbc6.jar
export CLASSPATH=$CLASSPATH:${OLH_HOME}/jlib/oraloader.jar:${OLH_HOME}/jlib/ojdbc6.jar:${OLH_HOME}/jlib/oraclepki.jar
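For example, in hadoop-env.sh (typically under $HADOOP_HOME/conf; the exact location depends on your install):

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$ORACLE_HOME/jdbc/lib/ojdbc6.jar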
Run the ExternalTable command as below to create the location files for the external table. It will prompt for the password of the database user mentioned in the conf file.
[oracle@oradbdev01]~/product/orahdfs-1.0.0.0.0/jlib% hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.ExternalTable -conf ../myconf.xml -publish
[Enter Database Password:]
Oracle Direct HDFS Release 1.0.0.0.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
SCOTT.TEST_DATA_EXT_TAB publish operation succeeded
Details:
HADOOP_TEST_DIR:test_data2.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00000
HADOOP_TEST_DIR:test_data1.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00001
HADOOP_TEST_DIR:test_data3.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00002
HADOOP_TEST_DIR:test_data4.txt file created with the following paths
 hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00003
The test_data*.txt location files are populated with the HDFS paths shown above.
Now let’s select from the table
SQL> select /*+ parallel (a,4) */ count(*) from "SCOTT"."TEST_DATA_EXT_TAB" a;

  COUNT(*)
----------
 153450496

Elapsed: 00:02:13.69
As you can see, we can also use a parallel hint in the query.
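Once the query works, a typical next step is to load the data from the external table into a regular database table. A minimal sketch (the target table name test_data_copy is my own):

SQL> CREATE TABLE scott.test_data_copy AS
     SELECT /*+ parallel (a,4) */ * FROM scott.test_data_ext_tab a;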
Oracle documentation mentions setting the following parameter in the conf file:
<property>
  <name>oracle.hadoop.hdfs.exttab.datasetCompressionCodec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
But setting this when the files are not actually compressed causes no rows to be returned, and the following error is recorded in the logfile:
cat directhdfs-log_13387-0126121327579379.log
12/01/26 12:03:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error reading data from hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00002: unknown compression method
The solution is to unset the CompressionCodec parameter; the location files then show an empty CompressionCodec entry:
[oracle@oradbdev01]~/product/orahdfs-1.0.0.0.0/log% cat /home/oracle/ext_table/test_data1.txt
CompressionCodec=
hdfs://oradbdev03:9000/user/oracle/TEST_DATA/part-m-00001
You need to set this parameter only if the HDFS files are compressed. In case the files are compressed (which is common on Hadoop), we need to use the appropriate codec, e.g. for gz format use org.apache.hadoop.io.compress.GzipCodec.
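For example, for gzip-compressed files the property would look like this:

<property>
  <name>oracle.hadoop.hdfs.exttab.datasetCompressionCodec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>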