2018/07/20

Hive OpenCSVSerde - Impala error

SerDe library 'org.apache.hadoop.hive.serde2.OpenCSVSerde'


I have a Hive external table created on a CSV file using the SerDe 'org.apache.hadoop.hive.serde2.OpenCSVSerde'.  The external table was created OK and I am able to access it from Hive.

But when I try to access the same table from Impala, it throws the error - "SerDe library 'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported."

It looks like Impala does not recognize Hive's OpenCSVSerde.

To get around the problem, I followed these steps (a HiveQL sketch follows the list):


  1. Create Hive External Table - stored as TextFile
  2. Create Hive table ( managed ) using CTAS - stored as PARQUET
  3. Access the Parquet table from Impala
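
A minimal HiveQL sketch of the three steps; the table, column and path names here are placeholders rather than the actual objects I used:

-- step 1: external table over the CSV data, using OpenCSVSerde
CREATE EXTERNAL TABLE sales_csv (
  order_id STRING,
  amount   STRING              -- OpenCSVSerde exposes every column as STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/hive/external/sales_csv';

-- step 2: managed Parquet copy via CTAS, casting columns to proper types
CREATE TABLE sales_parquet STORED AS PARQUET AS
SELECT CAST(order_id AS INT)    AS order_id,
       CAST(amount   AS DOUBLE) AS amount
FROM sales_csv;

-- step 3: in impala-shell, refresh the metadata and query the Parquet table
-- INVALIDATE METADATA sales_parquet;
-- SELECT * FROM sales_parquet LIMIT 10;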

I hope anyone facing the same issue finds this solution useful.

2017/12/28

Sqoop Import --direct option - ORA-00942: Table or view does not exist error

While testing the "--direct" option, which invokes "OraOop" for Sqoop imports into HDFS, I stumbled across the error -

"Ora-00942 Table or View Does not exist"

I was pretty sure that the table existed, as the Sqoop import had already been tested using the generic JDBC driver.

Going through the Sqoop docs, it is mentioned that the connector requires several roles and privileges, as below:

The Oracle user for The Data Connector for Oracle and Hadoop requires the following roles and privileges:
  • create session
In addition, the user must have the select any dictionary privilege or select_catalog_role role or all of the following object privileges:
  • select on v_$instance
  • select on dba_tables
  • select on dba_tab_columns
  • select on dba_objects
  • select on dba_extents
  • select on dba_segments — Required for Sqoop imports only
  • select on dba_constraints — Required for Sqoop imports only
  • select on v_$database — Required for Sqoop imports only
  • select on v_$parameter — Required for Sqoop imports only

In addition to the roles and privileges listed in the doc, the import also required the "SELECT_CATALOG_ROLE" role to succeed with the "--direct" option.

After granting the role to the user performing the import, the Sqoop import process succeeded.
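
A hedged sketch of the fix, run as a DBA account; "sqoop_user" is a placeholder for the account performing the import:

-- run as SYS or another DBA user
GRANT CREATE SESSION TO sqoop_user;
GRANT SELECT_CATALOG_ROLE TO sqoop_user;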


2017/12/22

Installing Python 3.6.4, IPython and Jupyter on RHEL 7

My Virtual Box Environment:

VBox   5.1.26
Oracle Linux 7

To install Guest Additions (as root):

yum groupinstall 'Development Tools'

yum install kernel-devel
yum install kernel-uek-devel

Install the Guest Additions

Python3.6 install pre-requisites

#1 Requires SQLite3 - either build it from source:

wget https://www.sqlite.org/2017/sqlite-autoconf-3210000.tar.gz
tar xzvf sqlite-autoconf-3210000.tar.gz
cd sqlite-autoconf-3210000
./configure
make
make install

or install it from the repositories: yum install sqlite
#2 yum install sqlite-devel


Python3.6 installation

cd $HOME
mkdir pythonsw
cd pythonsw

Download Python 3.6.4:
wget https://www.python.org/ftp/python/3.6.4/Python-3.6.4.tgz

Extract source code:
cd $HOME/pythonsw
tar -xzf Python-3.6.4.tgz

This will create a directory named "Python-3.6.4" under pythonsw.

If you are installing Python 3.* on top of an existing Python 2.* installation, it is better to keep the installations separate.

Create a directory under $HOME to host Python 3.*.  In my case I created usr/local under $HOME:

cd $HOME
mkdir usr
cd usr
mkdir local

Installation
cd $HOME/pythonsw/Python-3.6.4/

./configure --enable-optimizations
make altinstall prefix=$HOME/usr/local exec-prefix=$HOME/usr/local

This should install Python 3.6


Accessing Python 3.6

Create a symlink for python3.6:
cd $HOME/usr/local/bin
ln -s python3.6 python3

Now add this path to your $PATH environment variable. Notice that we add the new path before the existing $PATH, so that our local Python is always found before any other.

echo "export PATH=\$HOME/usr/local/bin:\$PATH" >> ~/.bashrc

source ~/.bashrc

Verify the installation by running these commands:

which python3 => this should list the custom Python 3.6 location path

which python => this should list the default Python 2.* location, if you have one

IPython Installation

Pre-requisite:

After installing Python 3.6, when I tried to install IPython using pip, I got the error:

"pip is configured with locations that require TLS/SSL"

Collecting pip
  Could not fetch URL https://pypi.python.org/simple/pip/: There was a problem confirming the ssl certificate: Can't connect to HTTPS URL because the SSL module is not available. - skipping
  Could not find a version that satisfies the requirement pip (from versions: )
No matching distribution found for pip


To resolve this error, the OpenSSL configuration below is required.
You need to remove any existing Python 3.6 installation on your system first, make the changes below and re-install Python 3.6.

Enable OpenSSL =>
Navigate to the directory where the Python 3.6 source was extracted

vi /root/Downloads/Python3.6/Modules/Setup.dist
search for SSL
make sure SSL is set to the correct path and uncomment the lines as below:
SSL=/usr/lib64/openssl  # this is the location where openssl is installed in my VM
_ssl _ssl.c \
-DUSE_SSL -I$(SSL)/include -I$(SSL)/include/openssl \
-L$(SSL)/lib -lssl -lcrypto
Save Changes

After re-installing Python 3.6 with SSL enabled, pip can reach PyPI; continue and install IPython:

python3.6 -m pip install ipython
Verify -- launch "ipython" from the shell and check that it starts without errors.



Jupyter Notebook installation

python3.6 -m pip install jupyter

Initially, after installing Jupyter, I got errors stating "SQLite3" was not available when attempting to access it.  So I had to trash everything - Python 3.6, IPython and Jupyter - follow the steps starting from the "Python 3.6 install prerequisites" section of this document, and re-install Python 3.6, IPython and Jupyter.


Verify - launch the notebook server with "jupyter notebook".





2017/09/11

AWS CLI - Error - AWS was not able to validate credentials

I set up an AWS EC2 instance through the AWS Management Console and am able to launch it.  While using the AWS CLI to connect to the instance, I encountered the error "AWS was not able to validate credentials".

I carried out the steps below:


  • Set up the AWS instance through the Console
  • Created access keys and downloaded the rootkeys.csv file
  • On the Linux machine, installed Python 3.6
  • Installed the AWS CLI tool
  • Configured the AWS CLI on the Linux machine using the command - aws configure
After all these steps, when I ran the command - aws ec2 describe-instances - I still encountered the error "AWS was not able to validate credentials".

I verified the AWS credentials file; the details like the AWS Access Key ID and AWS Secret Access Key were all present.

To resolve this error, I had to explicitly export the variables AWS_ACCESS_KEY and AWS_SECRET_KEY, as sketched below.
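
A minimal sketch of the exports, with placeholder key values; note that the AWS CLI documentation lists the variable names as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, so exporting those as well is a safe bet:

export AWS_ACCESS_KEY=AKIAXXXXXXXXXXXXXXXX            # placeholder access key ID
export AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx # placeholder secret key
# names documented by the AWS CLI:
export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_KEY

aws ec2 describe-instances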




After this I am able to connect to the running instances from the AWS CLI.




2017/08/28

Data Visualization with R ggplot2 - Part 2

In my previous post, Data Visualization with R ggplot2 - Part 1, I detailed the pre-requisites for getting started with ggplot2 in R.

In this post, I will focus more on the usage of R package - ggplot2 and various visualizations that can be generated using this package.

Within R, a ggplot2 graph is built by calling the function "ggplot()".  The basic syntax of ggplot is -

ggplot(data = <data set>) +
<geom_function>(mapping = aes(<MAPPINGS>))

Provide a data set to ggplot through the "data" argument;
specify what type of graph you need through the geom function, e.g. geom_bar, geom_point, etc.;
and specify how your graph should look through the combination of "mapping" and "aes" (short for aesthetics).

Data Set

To illustrate the usage of "ggplot2" I am using the "mpg" data set available through the "tidyverse" package.  The "mpg" data set provides the fuel efficiency of vehicles for the model years 1999 - 2008 and has the following variables:
manufacturer, model, displ (engine displacement in litres), year of manufacture, number of cylinders, type of transmission, type of drive, city efficiency, highway efficiency, type of fuel, vehicle class

Usage

Blank Graph - No plotting

ggplot(data = mpg, mapping = aes(x=displ, y=hwy))

If you omit the geom function in the call to "ggplot", R will produce a blank graph with nothing plotted.
As you can see here, R produced a graph with "displ" along the X-axis and "hwy" along the Y-axis, but no data points are plotted because we have not specified what type of graph we need.

geom_bar: Basic Bar Chart

To produce bar charts use the "geom_bar" function.  By default this function accepts one variable for the x position and plots the count of observations at each x value.

ggplot(data = mpg) + geom_bar(mapping = aes(x=displ))

This will generate a vertical bar chart showing the number of vehicles by engine size.

To add some colour to the graph:
ggplot(data = mpg) + geom_bar(mapping = aes(x=displ), color = "orange")

This will generate a graph with the chosen colour.

To show the number of vehicles by their class, and to distinguish each vehicle class by colour:

ggplot(data = mpg) + geom_bar(mapping = aes(x=class, fill = class))

Horizontal Bar Chart

By default, the "geom_bar" generates a vertical bar chart.  To display horizontal bar chart add "coord_flip() function to the command as below:

ggplot(data = mpg) + geom_bar(mapping = aes(x=class, fill = class)) + coord_flip()

geom_point: Scatter Chart

To generate scatter charts use the "geom_point" function.  This function takes one variable for the x position and one for the y position, and plots one point per observation.

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy))

This will generate a scatter chart of highway fuel efficiency against engine size.
We can further enhance this chart by adding colour, or by changing the shape, based on the "vehicle class" attribute as below:

Scatter Chart - Color attribute

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, color = class))

In this chart each vehicle class is color coded.

Scatter Chart - Shape attribute

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, shape = class))

In this chart each vehicle class is given a different shape.


Scatter Chart - Size attribute

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, size = class))

In this chart the points for each vehicle class are drawn at a different size.

geom_point: Facets / Subplots

If you notice, in the scatter charts above we displayed all vehicle classes in a single chart, distinguishing each by shape, size or colour.

What if we want to produce one chart per vehicle class but still display them together?  R has an option for this: facets/subplots, which are equivalent to "trellis" views.

Facets/Subplots - single Variables

We will reproduce the scatter chart, but split it by vehicle class, one panel per class.  Use "facet_wrap" when the plot is split on one variable, in this case "class".  You can control how many rows of subplots are drawn with "nrow".

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy)) + facet_wrap( ~ class, nrow = 2)


As you can see, the scatter plot is split by vehicle class, one panel per class.

We can change the colour of the points by passing the required colour to the "color" argument.

gglpot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy), color = "blue") + facet_wrap( ~ class, nrow = 2)


Facets/Subplots - two Variables

To facet the plot by two variables, use the "facet_grid" function as below:

gglpot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy), color = "blue") + facet_grid( drv ~ class)

We are asking R to produce a scatter plot, but with one subplot for each combination of the "drv" and "class" variables.


As you can notice, each subplot corresponds to a combination of the two variables.

geom_smooth: Line Charts

To generate a smoothed line chart in R, use the geom_smooth function as below:

ggplot(data = mpg) + geom_smooth(mapping = aes(x=displ, y=hwy))

The above line chart shows the relationship between highway fuel efficiency and engine size in litres.

Line Types

We can plot the relationship by vehicle drive type, with one line for each drive type, using the linetype attribute as below:

ggplot(data = mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype = drv))


Multiple Charts in same plot

R supports combining multiple chart types (i.e. scatter and line charts) in the same plot.  The code example below shows scatter and line charts in the same plot:

ggplot(data = mpg) +
geom_smooth(mapping = aes(x=displ, y=hwy, linetype = drv, color = drv)) +
geom_point(mapping = aes(x=displ, y=hwy, color = drv))

geom_boxplots

A box plot can be generated in R using the syntax below:

ggplot(data = mpg, aes(class, hwy)) +
geom_boxplot()


There are several other options available to plot various charts using R.  See the cheatsheet for all the  available options:  Data Visualization with R

2017/07/15

Data Visualization with R - ggplot2 - Part1

R has several systems for visualizing data; ggplot2 is one of them. ggplot2 is pretty easy to use and offers various options to generate impressive graphs to wow your users. This post showcases the usage of ggplot2 and its impressive features.
What you need: 
  • R System installed on your PC.
  • RStudio, or the R console if you prefer
  • tidyverse package
What is tidyverse?  tidyverse is a collection of the most commonly used packages for data manipulation and visualization. It contains the packages below:
  • data visualization - ggplot2
  • data manipulation - dplyr
  • data tidying - tidyr
  • data import - readr
  • functional programming - purrr
Usage - To start using tidyverse, one needs to load the package into R first. Loading the package is very easy: just fire up RStudio and issue the command below (a sketch follows, assuming tidyverse has not yet been installed from CRAN).
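
A minimal sketch, assuming the tidyverse package still needs to be installed from CRAN:

install.packages("tidyverse")   # one-time install from CRAN
library(tidyverse)              # loads ggplot2, dplyr, tidyr, readr, purrr, ...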



The "ggplot2" package has several data sets which we can use for plotting graphs. To view details about the data sets ( obeservations => rows, variables => columns) , type the data set name at command prompt , as below. This will lists a sample data and the variable names.

To know what each of these variables (columns) in the data set means, just use the help function:
?datasetname

The help function can be used with any of the R commands / packages to know more details about them.

How to use "ggplot2" in next part... soon.

2017/05/23

Apache Sqoop Import - Import data from Relational database to HDFS

Apache Sqoop is used to import data from relational databases (MySQL, Oracle, SQL Server, etc.) into the Hadoop ecosystem.

This post details different variants of using the Sqoop tool for importing data.

#1 Transfer Entire table content


[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories

The above command will import the entire contents of the "categories" table into HDFS as a comma-delimited text file.

The parameters to the sqoop command are -
import - specifies that data is to be imported into HDFS from a relational database
--connect - the JDBC URL for the source relational database
--username & --password - credentials to connect to the relational database
--table - name of the table to be transferred to HDFS

When the above command is executed, the contents of the table are imported into HDFS, and a directory named after the table is created under /user/<username>, as below:
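
A quick way to confirm, assuming the import ran as the root user (the path is illustrative):

[root@ODIGettingStarted ~]# hdfs dfs -ls /user/root/categories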



#2 Specifying a Target Directory


Sqoop offers two options for importing the contents of a table into a specific HDFS directory:
  • --target-dir - allows specifying the final directory into which the data will be imported.
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories \
--target-dir /user/root/output/categories

With this variation, the categories folder will be created under /user/root/output and the data will be imported into it.  The --target-dir parameter needs to be modified every time a new table is imported.

  • --warehouse-dir - allows specifying a root directory under which a new folder will be created during the import
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--warehouse-dir /user/root/output

With this variation a parent directory is specified in HDFS.  When executed, rather than writing the contents into the parent directory itself, Sqoop creates a new folder matching the name of the table and imports the data there.

In both cases, Sqoop will fail the import if the output directory already exists.

#3 Importing only subset of data


Sqoop allows importing only a portion of a table's contents into HDFS using a where clause.

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--where "substr(product_name,1,2) = 'Sn'" \
--target-dir /user/root/output/products-Sn

The above command imports only the products whose names start with "Sn" from the "products" table into HDFS and stores them under the specified directory.

#4 File Formats


Sqoop can import data in 3 formats:
  • CSV  - text - default
  • Sequencefile - binary
  • Avro - binary
To invoke a Sqoop import that stores data in a binary format, use the syntax below:

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-sequencefile \
--warehouse-dir /user/root/output

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-avrodatafile \
--warehouse-dir /user/root/output


#5 Importing all tables


Sqoop supports importing all tables in a database, rather than importing them one by one.  To import all tables use the import-all-tables tool.

[root@ODIGettingStarted ~]# sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--warehouse-dir /user/root/output

When executed, this command creates one folder for each table in the database and transfers the data into it.  The directory structure at the end of the command execution is shown below:

mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


mysql> 

After import, the HDFS directory structure is:
[root@ODIGettingStarted conf.dist]# hdfs dfs -ls /user/root/output
Found 6 items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:36 /user/root/output/categories
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/customers
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/departments
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/order_items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/orders
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/products
[root@ODIGettingStarted conf.dist]# 

