Sunday, October 23, 2016

Sqoop Import and Export tables from Hive to Oracle Database

Moving table data between an Oracle database and Hive, in either direction, is one of the most common activities in the Hadoop world. Getting a few basics right up front makes the first integration seamless and helps avoid various parsing and loading errors.

We will perform the following activities in sequence, covering all the integration points between the Oracle database, Sqoop, HDFS and Hive.

    Step 1: Extract data from a source Oracle database table to Hadoop file system using Sqoop
    Step 2: Load the above Sqoop extracted data into a Hive table
    Step 3: Use Hive query to generate a file extract in the Hadoop file system
    Step 4: Load the generated file in Step 3 to a new target Oracle database table

Step 1: Sqoop import data from the Oracle database table to HDFS

Our first task is to identify our source Oracle database table, and then use Sqoop to fetch its data into HDFS.

sqoop import \
  --connect "jdbc:oracle:thin:@<<database host>>:<<database port number>>:<<database SID>>" \
  --username "<<database username>>" \
  --password "<<database password>>" \
  --table "<<source schema name>>.<<source database table>>" \
  --columns "<<column1>>,<<column2>>" \
  -m 1 \
  --target-dir "<<HDFS path>>" \
  --verbose
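
As an illustration only, with made-up connection details and a hypothetical SCOTT.EMPLOYEES table, the filled-in command could look like this:

sqoop import \
  --connect "jdbc:oracle:thin:@dbhost.example.com:1521:ORCL" \
  --username "SCOTT" \
  --password "tiger" \
  --table "SCOTT.EMPLOYEES" \
  --columns "EMP_ID,EMP_NAME,SALARY" \
  -m 1 \
  --target-dir "/user/hive/staging/employees" \
  --verbose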

Note that Sqoop expects a primary key on the source Oracle table so it can split the import across mappers. Without one, the import fails with the error "Error during import: No primary key could be found for table". If the table has no primary key, or we simply do not want to rely on one, the -m 1 parameter forces a single mapper and skips the key requirement altogether.
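
Alternatively, if we want to keep the import parallel on a table without a primary key, Sqoop's --split-by option lets us nominate a column (ideally a reasonably evenly distributed numeric one) to split on. A minimal sketch, reusing the same placeholders and choosing four mappers arbitrarily:

sqoop import \
  --connect "jdbc:oracle:thin:@<<database host>>:<<database port number>>:<<database SID>>" \
  --username "<<database username>>" \
  --password "<<database password>>" \
  --table "<<source schema name>>.<<source database table>>" \
  --split-by "<<numeric column>>" \
  -m 4 \
  --target-dir "<<HDFS path>>"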

Step 2: Load the above Sqoop extracted data to a Hive table

Assuming we already have a table created in Hive, we will load the files Sqoop created in Step 1 into that table using the syntax below.

LOAD DATA INPATH '<<HDFS path>>' INTO TABLE <<hive table name>>;
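
If the Hive table does not exist yet, it needs a layout that matches Sqoop's text output, which is comma-delimited by default. A minimal sketch of such a DDL, with a hypothetical table name and columns:

CREATE TABLE employees_stg (
  emp_id INT,
  emp_name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;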

Step 3: Export a file using Hive query to be consumed by Sqoop

Now that we have the data in our Hive table, we will use the command below to write the result of a custom Hive query into the directory given in INSERT OVERWRITE DIRECTORY (Hive creates its output files, such as 000000_0, inside that directory). The delimiter in FIELDS TERMINATED BY can be changed according to our requirement, but it must then match the --input-fields-terminated-by value used in Step 4.

INSERT OVERWRITE DIRECTORY '<<HDFS directory>>/<<file name>>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM <<hive table name>> WHERE <<condition>>;
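
Before moving on, we can peek at what Hive produced in that directory. The file name 000000_0 below is only the typical default and may vary:

hdfs dfs -ls "<<HDFS directory>>/<<file name>>"
hdfs dfs -cat "<<HDFS directory>>/<<file name>>/000000_0" | head -n 5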

Step 4: Load the file exported from Hive into the target Oracle database table

The command below points Sqoop at the same HDFS directory used in Step 3 and loads its contents into our target Oracle database table.

sqoop export \
  --connect "jdbc:oracle:thin:@<<database host>>:<<database port number>>:<<database SID>>" \
  --username "<<database username>>" \
  --password "<<database password>>" \
  --table "<<target schema name>>.<<target database table>>" \
  --export-dir "<<HDFS directory>>/<<file name>>" \
  --input-fields-terminated-by ',' \
  --input-lines-terminated-by '\n' \
  --input-null-string "\\\\N" \
  --input-null-non-string "\\\\N" \
  --verbose

The --input-fields-terminated-by and --input-lines-terminated-by parameters tell Sqoop how to split the exported file into records and fields, and the --input-null-string and --input-null-non-string parameters map the \N markers that Hive writes for NULL values back to database NULLs. Without them we might encounter the error "Can't parse input data".

Thus we have successfully loaded a table from the Oracle database into Hive and back again from Hive to the Oracle database, using Sqoop. We can query both Oracle and Hive to verify that the data has loaded correctly. How do you prefer to load your data between these two systems?
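
For instance, a quick row-count comparison on both sides is usually enough for a first sanity check (the placeholders are the same ones used above):

-- In Hive
SELECT COUNT(*) FROM <<hive table name>>;

-- In Oracle (for example via SQL*Plus)
SELECT COUNT(*) FROM <<target schema name>>.<<target database table>>;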
