We will perform the following activities sequentially to cover all the integration points between an Oracle database, Sqoop, HDFS, and Hive.
Step 1: Extract data from a source Oracle database table to Hadoop file system using Sqoop
Step 2: Load the above Sqoop extracted data into a Hive table
Step 3: Use Hive query to generate a file extract in the Hadoop file system
Step 4: Load the file generated in Step 3 into a new target Oracle database table
Step 1: Sqoop import data from the Oracle database into HDFS
Our first task is to identify the source Oracle database table, and then use Sqoop to fetch its data into HDFS.
sqoop import \
  --connect "jdbc:oracle:thin:@<<database host>>:<<database port number>>:<<database SID>>" \
  --username "<<database username>>" \
  --password "<<database password>>" \
  --table "<<source schema name>>.<<source database table>>" \
  --columns "<<column1>>,<<column2>>" \
  -m 1 \
  --target-dir "<<HDFS path>>" \
  --verbose
It is interesting to observe that Sqoop needs a primary key (or a column supplied via --split-by) on the source table so that it can split the import across parallel mappers. Otherwise the import fails with the error "Error during import: No primary key could be found for table". If the table has no suitable key, we can include the parameter -m 1 so that Sqoop runs a single mapper and no split column is required.
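As a concrete illustration, a filled-in import might look like the following. The host, credentials, schema, columns, and HDFS path are all hypothetical placeholders, not values from this article.

```shell
# Hypothetical example: every connection detail below is made up.
# Import two columns of HR.EMPLOYEES into HDFS with a single mapper,
# so no primary key / --split-by column is needed.
sqoop import \
  --connect "jdbc:oracle:thin:@dbhost.example.com:1521:ORCL" \
  --username "hr_user" \
  --password "hr_password" \
  --table "HR.EMPLOYEES" \
  --columns "EMPLOYEE_ID,FIRST_NAME" \
  -m 1 \
  --target-dir "/user/hadoop/employees_import" \
  --verbose
```

With -m 1 the result is a single part file (typically part-m-00000) under the target directory, which keeps the next step simple.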
Step 2: Load the above Sqoop-extracted data into a Hive table
Assuming we already have a table created in Hive, we will load the file created in Step 1 into the Hive table using the syntax below.
LOAD DATA INPATH '<<HDFS path>>' INTO TABLE <<hive table name>>;
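Continuing the hypothetical example from Step 1, with a staging table named employees_stg (an illustrative name), the load could be run from the shell as shown below. Note that LOAD DATA INPATH moves the HDFS files into the table's warehouse location rather than copying them, so they disappear from the source path.

```shell
# Hypothetical path and table name. LOAD DATA INPATH *moves* the files
# from /user/hadoop/employees_import into the Hive warehouse directory.
hive -e "LOAD DATA INPATH '/user/hadoop/employees_import' INTO TABLE employees_stg;"
```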
Step 3: Export a file using Hive query to be consumed by Sqoop
Now that we have the data in our Hive table, we will use the command below to create a file from a custom Hive query in the specified HDFS directory. The field delimiter (a comma here) can be changed according to our requirement, but it must then match the --input-fields-terminated-by value used in Step 4.
INSERT OVERWRITE DIRECTORY '<<HDFS directory>>/<<file name>>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from <<hive table name>> where <<condition>>;
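Filled in with the same hypothetical names used above, the export query might be run like this; the directory, table name, and filter condition are illustrative only.

```shell
# Hypothetical: write the filtered rows of employees_stg as
# comma-delimited text files under /user/hadoop/employees_export.
hive -e "
INSERT OVERWRITE DIRECTORY '/user/hadoop/employees_export'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM employees_stg WHERE employee_id > 1000;
"
```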
Step 4: Sqoop export the Hive-generated file to the target Oracle database table
The command below will use the Hive-exported file (from the same HDFS path as in Step 3) to load our target Oracle database table.
sqoop export \
  --connect "jdbc:oracle:thin:@<<database host>>:<<database port number>>:<<database SID>>" \
  --username "<<database username>>" \
  --password "<<database password>>" \
  --table "<<target schema name>>.<<target database table>>" \
  --export-dir "<<HDFS directory>>/<<file name>>" \
  --input-fields-terminated-by ',' \
  --input-lines-terminated-by '\n' \
  --input-null-string '\\N' \
  --input-null-non-string '\\N' \
  --verbose
The delimiter parameters (--input-fields-terminated-by, --input-lines-terminated-by) and the null-handling parameters (--input-null-string, --input-null-non-string) tell Sqoop how to parse the records while exporting to the target database table. Without them we might sometimes encounter the error "Can't parse input data".
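To see why the delimiter and null tokens must line up, here is a small stand-alone sketch (plain shell, no Hadoop required, with made-up data) of how one Hive-exported record splits. Hive writes NULL column values as the token \N, which is what --input-null-string '\\N' tells Sqoop to translate back into a database NULL.

```shell
# Simulated line of a Hive export: comma-delimited, NULL written as \N.
line='1001,John,\N,2023-01-15'

# Split on the same delimiter Sqoop is told about via
# --input-fields-terminated-by ','. A mismatch between this delimiter
# and the one in the file is exactly what triggers "Can't parse input data".
third_field=$(printf '%s\n' "$line" | cut -d',' -f3)

# The third field is the literal token \N, which Sqoop maps to SQL NULL.
printf '%s\n' "$third_field"
```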
Thus we have successfully loaded a table from an Oracle database into Hive and back from Hive into Oracle, using Sqoop. We can query both Oracle and Hive to verify that the data loaded correctly. How do you prefer to load your data between these two systems?