Sqoop Tutorial : Hadoop : Importing data from RDBMS to HDFS

In this article we will go through a very important technique – importing data from a SQL table to HDFS. We will do so using a sample database, say 'bigdata', and a sample table, say 'employee', containing employee data.

We will do this in three parts. Part 1 is covered in this article; we will look at the remaining parts in subsequent articles.

Part 1. Sqoop basics and the basic import actions we can perform with Sqoop.

Part 2. Other Sqoop capabilities, covered in a separate article ( Sqoop : Hadoop : Other Sqoop capabilities explored ).

Part 3. Importing data from an RDBMS using MapReduce and writing it to HDFS as Avro.

Sqoop Data import


What is Sqoop ?

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. More on Sqoop can be found here : http://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
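In its simplest form, an import pulls one table into a directory on HDFS and an export pushes an HDFS directory back into a table. The two commands below are only a generic sketch of that shape; every value in angle brackets is a placeholder, and we will plug in real values in the steps that follow.

bin/sqoop import --connect jdbc:mysql://<host>:3306/<db> --username <user> --password <pass> --table <table>
bin/sqoop export --connect jdbc:mysql://<host>:3306/<db> --username <user> --password <pass> --table <table> --export-dir /user/<user>/<table>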

Step 0 : Sqoop Installation

  • A two-node Hadoop 1.0.1 cluster, up and running.
  • sqoop-1.4.1 extracted to /usr/local.
  • Assuming the MySQL database lies on server aaa.bbb.ccc.ddd, the database name is bigdata, the user name is root and the password is test, create an options file $HOME/bigdatadboptions.txt and add the following to it:
    --connect
    jdbc:mysql://aaa.bbb.ccc.ddd:3306/bigdata
    --username
    root
    --password
    test
    --direct
  • Also issue the following MySQL command to grant this host access for executing SQL >> grant all privileges on bigdata.* to 'root'@'10.222.67.102' identified by 'test';
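Before moving on, it helps to sanity-check that Sqoop can actually reach the database. The command below is an optional check and only a sketch; it spells out the same host and credentials as above instead of reusing the options file.

../bin/sqoop list-tables --connect jdbc:mysql://aaa.bbb.ccc.ddd:3306/bigdata --username root --password test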

Step 1 : Preparation

  • Create a database called bigdata on your MySQL (or any SQL) server.
  • Create a table 'employee' in that database.
  • The employee table will have the following structure:

describe employee;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| emp_id      | bigint(20)  | NO   | PRI | NULL    |       |
| emp_name    | varchar(50) | NO   |     | NULL    |       |
| emp_address | text        | NO   |     | NULL    |       |
| emp_salary  | float       | NO   |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+

  • Add 4 sample records. We have added the following 4 records to the table (a SQL sketch of this whole setup follows the sample data below):

+--------+---------------+-------------------------------------+------------+
| emp_id | emp_name      | emp_address                         | emp_salary |
+--------+---------------+-------------------------------------+------------+
|      1 | Shantanu Deo  | 21 Timbuktoo, Lionfield , CA        |     200000 |
|      2 | Suruchi Bhide | 22 Aberdeen Park , Georgia Road, CA |     400000 |
|      3 | Neo Dogman    | 32, Heaven Door, Indrapuri, Swarg   |     500000 |
|      4 | Scooby Doo    | 34, Vishnu St. Vaikunth, Swarg      |     450000 |
+--------+---------------+-------------------------------------+------------+
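For completeness, here is one way to create the database, the table and the four sample rows from the MySQL command line. This is only a sketch of the setup described above; adjust the host and credentials to your environment.

mysql -h aaa.bbb.ccc.ddd -u root -p <<'SQL'
-- create the sample database and the employee table with the structure shown above
CREATE DATABASE IF NOT EXISTS bigdata;
USE bigdata;
CREATE TABLE employee (
  emp_id      BIGINT      NOT NULL PRIMARY KEY,
  emp_name    VARCHAR(50) NOT NULL,
  emp_address TEXT        NOT NULL,
  emp_salary  FLOAT       NOT NULL
);
-- the four sample records used in this article
INSERT INTO employee VALUES
  (1, 'Shantanu Deo',  '21 Timbuktoo, Lionfield , CA',        200000),
  (2, 'Suruchi Bhide', '22 Aberdeen Park , Georgia Road, CA', 400000),
  (3, 'Neo Dogman',    '32, Heaven Door, Indrapuri, Swarg',   500000),
  (4, 'Scooby Doo',    '34, Vishnu St. Vaikunth, Swarg',      450000);
SQL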

Step 2 : Importing the Data

Now that we are all set up, we can create a folder, say temp, in the Sqoop directory and execute the Sqoop import in one of the
following two ways:

../bin/sqoop import --options-file /home/hduser/bigdatadboptions.txt --table employee

OR

../bin/sqoop import --connect jdbc:mysql://10.222.65.158:3306/bigdata --username root --password test --table employee
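A couple of commonly used options are worth mentioning here. If you want the data in a specific HDFS directory, or in a single output file instead of several part files, you can add --target-dir and -m (the number of map tasks). The command below is only a sketch; the target directory name is an arbitrary example.

../bin/sqoop import --connect jdbc:mysql://10.222.65.158:3306/bigdata --username root --password test --table employee --target-dir /user/hduser/employee_single -m 1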

In the background, Sqoop connects to the database using JDBC and invokes the database's dump command if available (that is what --direct asks for); it then runs a MapReduce job to fetch the records and put them on HDFS. In the process it also generates a Java source file for reading and writing this table's records, which will be useful to us later.
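If you want to keep that generated source in a known location, or regenerate it later without re-importing anything, Sqoop's codegen tool can do that. The command below is a minimal sketch reusing the same connection details; the output directory is just an example.

../bin/sqoop codegen --connect jdbc:mysql://10.222.65.158:3306/bigdata --username root --password test --table employee --outdir ./generated-src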

The data, in my case, is available on HDFS under the /user/hduser/employee directory as multiple part files.
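You can first list the directory to see how many part files were produced; the exact names and count depend on the number of map tasks Sqoop used.

hadoop dfs -ls /user/hduser/employee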

Following is the output from these files :

$ hadoop dfs -cat /user/hduser/employee/part-m-00000
Warning: $HADOOP_HOME is deprecated.

1,Shantanu Deo,21 Timbuktoo, Lionfield , CA,200000.0

$ hadoop dfs -cat /user/hduser/employee/part-m-00001
Warning: $HADOOP_HOME is deprecated.

2,Suruchi Bhide,22 Aberdeen Park , Georgia Road, CA,400000.0

$ hadoop dfs -cat /user/hduser/employee/part-m-00002
Warning: $HADOOP_HOME is deprecated.

3,Neo Dogman,32, Heaven Door, Indrapuri, Swarg,500000.0
4,Scooby Doo,34, Vishnu St. Vaikunth, Swarg,450000.0
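To see all the records in one go instead of cat-ing each part file, a wildcard works, and hadoop dfs -getmerge will pull everything down into a single local file; the local file name below is just an example.

hadoop dfs -cat /user/hduser/employee/part-m-*
hadoop dfs -getmerge /user/hduser/employee ./employee.csv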

Please let me know your comments. Happy Coding.


3 thoughts on “Sqoop Tutorial : Hadoop : Importing data from RDBMS to HDFS”

  1. Couple of questions on this exercise. When you actually import this data into HDFS, can you have this data readily available for a Hive external table? If yes, then this article can be framed as a data import from MySQL to Hive 🙂
    How is this different from -copyFromLocal? I could do it the other way: dump a file from MySQL and move the data to HDFS using -copyFromLocal. I have never used Sqoop, so I want to understand this import process.


  2. Hi Manish,
    Yes, certainly we can have it imported into Hive. The purpose of this article was to give an introduction to Sqoop.
    One follow-up article is already in the pipeline for importing data using direct MapReduce. I will add a small article to this one on tweaking the Sqoop import process to import into Hive.
    Thanks a ton for your suggestion.
    Happy Coding
    Shantanu

