2017/05/23

Apache Sqoop Import - Import data from Relational database to HDFS

Apache Sqoop is used to import data from relational databases (MySQL, Oracle, SQL Server, etc.) into the Hadoop ecosystem.

This post details different variants of using the Sqoop tool for importing data.

#1 Transfer Entire table content


[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories

The above command imports the entire contents of the "categories" table into HDFS as comma-delimited text files.

The parameters to the sqoop command are -
import - specifies that data is to be imported into HDFS from a relational db
--connect - the JDBC URL of the source relational db
--username & --password - credentials to connect to the relational db
--table - name of the table to be transferred to HDFS

When the above command is executed, the contents of the table are imported into HDFS, and files are created under the user's home directory - /user/$USER - in a folder named after the table.
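As a quick check, the imported files can be listed with plain HDFS commands. This is a sketch, assuming the root user's default home directory and sqoop's default of 4 map tasks:

```shell
# List the import output (paths assume the defaults described above).
hdfs dfs -ls /user/root/categories
# Sqoop writes one part file per map task, e.g. part-m-00000 to part-m-00003.
# Peek at the first few comma-delimited rows:
hdfs dfs -cat /user/root/categories/part-m-00000 | head -5
```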



#2 Specifying a Target Directory


Sqoop can import the contents of a table into a specific HDFS directory. Two options control this:
  • --target-dir - specifies the exact directory the data will be imported into.
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories \
--target-dir /user/root/output/categories

With this variation, the categories folder is created under /user/root/output and the data is imported into it. The --target-dir parameter needs to be modified every time a new table is imported.

  • --warehouse-dir - specifies a root directory under which a new folder is created during the import
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--warehouse-dir /user/root/output

With this variation a parent directory is specified in HDFS. Rather than writing the contents into that parent directory, sqoop creates a new folder matching the name of the table and imports the data there.

In both cases, sqoop will fail with an error if the output directory already exists.
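When re-running an import, there are two common workarounds. A sketch, assuming the same retail_db connection as above (--delete-target-dir applies to --target-dir imports):

```shell
# Option 1: remove the old directory first.
hdfs dfs -rm -r /user/root/output/categories

# Option 2: let sqoop remove it with the --delete-target-dir flag.
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories \
--delete-target-dir \
--target-dir /user/root/output/categories
```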

#3 Importing only subset of data


Sqoop allows you to import only a portion of a table's contents into HDFS using a where clause.

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--where "substr(product_name,1,2) = 'Sn'" \
--target-dir /user/root/output/products-Sn

The above command imports only the products whose names start with "Sn" from the "products" table and stores them under the specified directory.
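The same subset can also be fetched with a free-form query instead of --table/--where. A sketch: sqoop requires the literal $CONDITIONS token in the query, and a --split-by column when more than one mapper is used; the column names and target directory here are assumptions based on the examples above.

```shell
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--query 'SELECT product_id, product_name FROM products WHERE product_name LIKE "Sn%" AND $CONDITIONS' \
--split-by product_id \
--target-dir /user/root/output/products-Sn-query
```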

#4 File Formats


Sqoop can import data in 3 formats:
  • CSV - text - default
  • SequenceFile - binary
  • Avro - binary
To store the imported data in one of the binary formats, use the syntax below:

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-sequencefile \
--warehouse-dir /user/root/output

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-avrodatafile \
--warehouse-dir /user/root/output
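The binary formats are not human-readable with hdfs dfs -cat. For Avro, the avro-tools jar can dump records as JSON; this is a sketch, and the jar name/version and file name are assumptions:

```shell
# Sqoop names Avro output files part-m-NNNNN.avro under the table folder.
hdfs dfs -get /user/root/output/products/part-m-00000.avro
java -jar avro-tools-1.8.2.jar tojson part-m-00000.avro | head -5
```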


#5 Importing all tables


Sqoop supports importing all tables in a database, rather than importing them one by one. To do this, use the import-all-tables tool.

[root@ODIGettingStarted ~]# sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--warehouse-dir /user/root/output

When executed, this command creates one folder per table in the database and transfers the data into it. The folder structure at the end of the command's execution is shown below:

mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


mysql> 

After import, the HDFS directory structure is:
[root@ODIGettingStarted conf.dist]# hdfs dfs -ls /user/root/output
Found 6 items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:36 /user/root/output/categories
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/customers
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/departments
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/order_items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/orders
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/products
[root@ODIGettingStarted conf.dist]# 
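If some tables should be skipped, import-all-tables accepts an --exclude-tables option with a comma-separated list. A sketch, reusing the connection details above (the output directory name is an assumption):

```shell
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--exclude-tables order_items,orders \
--warehouse-dir /user/root/output/subset
```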

