In this article we will go through a very important technique: importing data from a SQL table to HDFS. We will do so on a sample database, say 'bigdata', and a sample table, say 'employee', containing employee data.
We will do this in 3 parts. Part 1 is in the scope of this article; the next parts will be covered in subsequent articles.
Part 1. Learn Sqoop basics and the basic import actions we can perform with Sqoop.
Part 2. Other Sqoop capabilities. This article can be found here: (Sqoop : Hadoop : Other Sqoop capabilities explored)
Part 3. Importing data from an RDBMS using MapReduce and writing the output to HDFS using Avro.
Sqoop Data import
What is Sqoop?
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. More on Sqoop can be found here : http://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
Step 0 : Sqoop Installation
- A two-node Hadoop 1.0.1 cluster ready.
- Extracted sqoop-1.4.1 to /usr/local
- Assuming the MySQL database lies on server aaa.bbb.ccc.ddd, the database name is bigdata, the user name is root and the password is test, create a config file $HOME/bigdatadboptions.txt and add the connection options to it (a sample options file is shown after this list).
- Also issue the following MySQL command to grant this host access to the database: grant all privileges on bigdata.* to 'root'@'10.222.67.102' identified by 'test';
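As a minimal sketch, and assuming the connection details above (adjust to your environment), the options file $HOME/bigdatadboptions.txt could look like the following, with the tool name and each option and value on its own line:

import
--connect
jdbc:mysql://aaa.bbb.ccc.ddd:3306/bigdata
--username
root
--password
test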
Step 1 : Preparation
- Create a database called bigdata on your MySQL (or any SQL) server.
- Create a table 'employee' in that database.
- The employee table will have the following structure:
| Field | Type | Null | Key | Default | Extra |
| emp_id | bigint(20) | NO | PRI | NULL | |
| emp_name | varchar(50) | NO | | NULL | |
| emp_address | text | NO | | NULL | |
| emp_salary | float | NO | | NULL | |
- Add 4 sample records. We have added the following 4 records to the table (a sample SQL sketch for creating the table and loading these records follows this list):
| emp_id | emp_name | emp_address | emp_salary |
| 1 | Shantanu Deo | 21 Timbuktoo, Lionfield , CA | 200000 |
| 2 | Suruchi Bhide | 22 Aberdeen Park , Georgia Road, CA | 400000 |
| 3 | Neo Dogman | 32, Heaven Door, Indrapuri, Swarg | 500000 |
| 4 | Scooby Doo | 34, Vishnu St. Vaikunth, Swarg | 450000 |
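For reference, a minimal SQL sketch for creating this table and loading the sample rows could look like the following (the column types are taken from the structure above; adjust as needed for your database):

CREATE TABLE employee (
  emp_id BIGINT(20) NOT NULL PRIMARY KEY,
  emp_name VARCHAR(50) NOT NULL,
  emp_address TEXT NOT NULL,
  emp_salary FLOAT NOT NULL
);

INSERT INTO employee VALUES (1, 'Shantanu Deo', '21 Timbuktoo, Lionfield , CA', 200000);
INSERT INTO employee VALUES (2, 'Suruchi Bhide', '22 Aberdeen Park , Georgia Road, CA', 400000);
INSERT INTO employee VALUES (3, 'Neo Dogman', '32, Heaven Door, Indrapuri, Swarg', 500000);
INSERT INTO employee VALUES (4, 'Scooby Doo', '34, Vishnu St. Vaikunth, Swarg', 450000);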
Step 2 : Importing the Data
Now that we are all set up, we can create a folder, say temp, in the sqoop directory and execute the sqoop import in one of the following 2 ways:
../bin/sqoop --options-file /home/hduser/bigdatadboptions.txt --table employee

OR

../bin/sqoop import --connect jdbc:mysql://10.222.65.158:3306/bigdata --username root --password test --table employee
In the background, Sqoop connects to the database using JDBC (and can use the database's native dump utility if one is available), then invokes MapReduce to fetch the records and put them on HDFS. In the process it also generates a Java source file for reading the data. This will be useful to us later.
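By default the generated class (for this table, typically employee.java) is written to the current working directory. If you would like to keep it in a specific place, Sqoop's --outdir option can be added to the import; for example (the directory path below is only an illustration):

../bin/sqoop import --connect jdbc:mysql://10.222.65.158:3306/bigdata --username root --password test --table employee --outdir /home/hduser/generated-src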
The data, in my case, is available on HDFS in the /user/hduser/employee directory, split across multiple part files.
Following is the output from these files:
[[email protected] importtemps]$ hadoop dfs -cat /user/hduser/employee/part-m-00000
Warning: $HADOOP_HOME is deprecated.
1,Shantanu Deo,21 Timbuktoo, Lionfield , CA,200000.0
[[email protected] importtemps]$ hadoop dfs -cat /user/hduser/employee/part-m-00001
Warning: $HADOOP_HOME is deprecated.
2,Suruchi Bhide,22 Aberdeen Park , Georgia Road, CA,400000.0
[[email protected] importtemps]$ hadoop dfs -cat /user/hduser/employee/part-m-00002
Warning: $HADOOP_HOME is deprecated.
3,Neo Dogman,32, Heaven Door, Indrapuri, Swarg,500000.0
4,Scooby Doo,34, Vishnu St. Vaikunth, Swarg,450000.0
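Since the import runs with several mappers by default, the output lands in multiple part-m-* files. If a single output file or a specific target directory is preferred, options such as -m (number of mappers) and --target-dir can be passed to the import; for example (the target path below is only an illustration):

../bin/sqoop import --connect jdbc:mysql://10.222.65.158:3306/bigdata --username root --password test --table employee -m 1 --target-dir /user/hduser/employee_single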
Please let me know your comments. Happy Coding.