The following assumptions are made:
- you have set up Hadoop Cluster
- Hive is installed & configured on the cluster
- All environment variables are set accordingly
- Hive Datawareouse default location is - /user/hive/warehouse
In this tutorial, we will do the following
- Copy the data set from local system to HDFS
- create hive database
- create hive table
- load data into hive table
- verify the data in hive table
- run queries to answer few business questions
Dataset
The dataset that I will be using is movies dataset. You can download the data from movies. This is a simple CSV file and contains the following information , in the order specified, related to movies:
- Movie ID - Integer
- Movie Name - String
- Year of Release - Integer ( only year )
- Movie Rating - Float ( decimal )
- Duration in Seconds - Integer
Contents of the file looks as below:
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
Ok, now you got the file and you know the contents of it. Let's get started with the tutorial.
File Transfer
When you downloaded the file it is available on your local drive. Again I am assuming that you have downloaded from your Linux VM. If you have downloaded on your Windows host, FTP the file to Linux VM.
I have set up my directories on Linux VM as below:
/root/samples => local system where the downloaded file exists
/user/root/samples => HDFS Directory where the data file will be made available to Hive
Hadoop allows to copy a file from local file system to HDFS using the command - copyFromLocal.
Syntax: hdfs dfs -copyFromLocal
Verify the file presence and its contents on HDFS. Again most of unix commands like, ls, cat, chmod, rm etc., can be used with hadoop.
To list the contents of a HDFS directory using the command as below:
To view the contents of the file we have copied use the command as below. Certain unix commands are not recognized within hadoop, for example: head or tail, commands. But we can pipe outputs of hadoop command results onto unix shell commands as demonstrated below:
So now, we have copied the file from Local File System to HDFS and verified the contents of the file as well. Now onto Hive operations.
Hive - DDL & DML
If you are a DWH professional like me, who has very limited exposure to Java World, Hive and Impala offers great escape out of writing MapReduce jobs for processing data in Hadoop.
Hive supports most of ANSI SQLs, which makes it very convenient for people who are comfortable with ANSI SQL to venture into Hadoop technologies. Also, Hive translates the DMLs into a series of MapReduce jobs, no need to MapReduce jobs in Java.
So first step, initiating Hive on Cluster. It is very easy, just type "hive" at the command prompt. Assuming that you have set up all environment variables you should see the Hive Shell as below:
Depending on your configuration, hive will be started either in Local Mode , pseudodistributed or Distributed Mode. If you want to know more about Hive, I would start with this Programming Hive book which is a great source.
Database & Table Creation
First we will create a database where we will be creating tables. Database and Table creation commands are similar to Oracle.
Within Hive, when you create a new database it actually creates a directory, under the default Hive Warehouse directory as per your configuration ,on HDFS. As I mentioned before my default Hive Warehouse location is - /user/hive/warehouse, You need to explicitly create the relevant directories for this as part of your Hive Installation & Configuration.
On HDFS, before creating the database available directories are:
We will be creating a database named - practice. So to create the database, from hive shell use the command - create database. Below listing shows available databases before and after creating the "practice" database.
You can verify that after successful creation of the database, a directory with the db name will be created on HDFS as below:
Next, we will create a table named "movies" that matches the structure of the "movies" dataset as described in the section Dataset.
Within Hive, we need to point to the database where we want our tables to be created, otherwise Hive will use the "default" database.
As you see here, the Hive shell has changed to specified database name.
The table creation is as shown below:
The Create table statement is more or less similar to Oracle's command. The difference is that for Hive you need to specify how your underlying data file is organized, which is done by 3 lines starting from ROW FORMAT ... LINES Terminated by. These lines are self explanatory.
The LOCATION parameter is optional, if specified it tells Hive where to place the file on HDFS directory path. Please note that this is the target directory path not the source file path.
After successful creation of the file,
Loading Data into Hive Table
Hive allows to load data into table from source files using the command LOAD DATA. The syntax for this command is as below:
LOAD DATA [LOCAL] INPATH '' OVERWRITE INTO TABLE
When LOCAL is omitted, Hive looks for the file on HDFS, otherwise Hive looks for the source file on local file system. The data loading in Hive is as below:
Verifying Hive Table Contents
To describe the table metadata and to get the row count of the table:
As you can see, the "select count(*) from movies" has initiated MapReduce job by Hive.
To view a limited row set , 5 lines, use the LIMIT clause as below:
In this tutorial, we have shown how to load data into HDFS from Local System, Hive Database and Table creation, Loading data into Hive Table and running some simple HiveQL queries to manipulate the data on the table.
It's a great post! Thank you for sharing your knowledge to others, it was very informative and in depth one.
ReplyDeleteApache Pig Training in Electronic City