2017/08/28

Data Visualization with R ggplot2 - Part 2

In my previous post, Data Visualization with R ggplot2 - Part 1, I detailed the prerequisites for getting started with ggplot2 in R.

In this post, I will focus on using the R package ggplot2 and the various visualizations that can be generated with it.

Within R, a ggplot2 plot is started by calling the "ggplot" function.  The basic syntax of ggplot is -

ggplot(data = <data set>) +
<geom_function>(mapping = aes(<MAPPINGS>))

Provide a data set to ggplot through the "data" argument;
Specify what graph you need by adding a "geom_function", e.g. geom_bar, geom_point etc.;
And specify which variables drive the look of your graph through the "mapping" argument and "aes" (short for aesthetics).

Data Set

To illustrate the usage of "ggplot2" I am using the "mpg" data set, which ships with ggplot2 and is available once the "tidyverse" package is loaded.  The "mpg" data set provides the fuel efficiency of vehicles for the model years 1999 and 2008 and it has the variables below:
manufacturer, model, displ (engine displacement in litres), year (year of manufacture), cyl (number of cylinders), trans (type of transmission), drv (type of drive), cty (city efficiency), hwy (highway efficiency), fl (type of fuel), class (vehicle class)
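Before plotting, it can help to take a quick look at the data set itself.  The lines below are a minimal sketch, assuming the ggplot2 package is installed; they only inspect "mpg" and do not draw anything.

```r
library(ggplot2)  # mpg ships with ggplot2; library(tidyverse) loads it as well

dim(mpg)                 # number of rows and columns
names(mpg)               # the variable names listed above
sort(unique(mpg$year))   # model years present in the data
head(mpg, 3)             # first three vehicles
```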

Usage

Blank Graph - No plotting

ggplot(data = mpg, mapping = aes(x=displ, y=hwy))

If you omit the geom_function in the call to "ggplot", R will produce a blank graph with no data plotted.
As you can see here, R produced a graph with "displ" along the X-axis and "hwy" along the Y-axis.  But no data points are plotted, as we have not specified what type of graph we need.

geom_bar: Basic Bar Chart

To produce Bar Charts use the "geom_bar" function.  By default this function accepts one variable for the x position and plots the count of observations at each x value.

ggplot(data = mpg) + geom_bar(mapping = aes(x=displ))

This will generate a vertical bar chart showing the number of vehicles by engine size.
To add some color to the graph (note that "color" sets the outline of the bars; use "fill" instead to color the inside):
ggplot(data = mpg) + geom_bar(mapping = aes(x=displ), color = "orange" )

This will generate a graph with the chosen color.
To show the number of vehicles by class and to distinguish each vehicle class by color:

ggplot(data = mpg) + geom_bar(mapping = aes(x=class, fill = class ))
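The bar heights in this chart are simply the number of rows per vehicle class.  As a rough check (a sketch, not part of the original workflow), the same counts can be computed with base R's "table" and compared with what ggplot2 computes for the bars via "layer_data":

```r
library(ggplot2)

# Count the rows per vehicle class with base R
class_counts <- table(mpg$class)
print(class_counts)

# Build the bar chart and pull out the counts ggplot2 computed for the bars
p_bar <- ggplot(data = mpg) + geom_bar(mapping = aes(x = class, fill = class))
bar_heights <- layer_data(p_bar)$count

# Both approaches should give the same set of counts
all(sort(bar_heights) == sort(as.vector(class_counts)))
```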

Horizontal Bar Chart

By default, "geom_bar" generates a vertical bar chart.  To display a horizontal bar chart, add the "coord_flip()" function to the command as below:

ggplot(data = mpg) + geom_bar(mapping = aes(x=class, fill = class )) + coord_flip()

geom_point: Scatter Chart

To generate Scatter Charts use the "geom_point" function.  This function takes one variable for the x position and one for the y position and draws one point per observation.

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy))

This will generate a scatter chart of highway fuel efficiency against engine size.
We can further enhance this chart by mapping the "class" (vehicle class) variable to color, shape or size as below:

Scatter Chart - Color attribute

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, color = class))

In this chart each vehicle class is color coded.

Scatter Chart - Shape attribute

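ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, shape = class))

In this chart each vehicle class is given a different shape.  Note that ggplot2 uses only six shapes by default, so with seven vehicle classes one class is left unplotted and a warning is shown.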


Scatter Chart - Size attribute

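ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, size = class))

In this chart each vehicle class is drawn with a different point size.  Since "class" is a discrete variable with no natural order, ggplot2 warns that using size for a discrete variable is not advised.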

geom_point: Facets / Subplots

In the scatter charts above we displayed all vehicle classes in a single chart and distinguished each one by shape, size or color.

What if we want to produce one chart for each vehicle class but still display them together?  ggplot2 has an option for this: facets (subplots).  This is equivalent to "trellis" views.

Facets/Subplots - single Variable

We will reproduce the Scatter chart, but split it by vehicle class, with one subplot for each class.  We use "facet_wrap" when the plot is split on one variable, in this case "class".   You can control the number of rows of subplots with "nrow".

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy)) + facet_wrap( ~ class, nrow = 2)


As you can see, the scatter plot is split by vehicle class, with one subplot for each class.

We can change the color of all points by setting the "color" argument to the required color outside of "aes".

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy), color = "blue") + facet_wrap( ~ class, nrow = 2)
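The difference between mapping a variable to color inside "aes" and setting a fixed color outside of it is easy to mix up.  The sketch below (variable names are my own, for illustration only) builds both versions and uses "layer_data" to look at the colors ggplot2 actually assigns to the points:

```r
library(ggplot2)

# Color MAPPED to a variable: goes inside aes(), one color per class
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Color SET to a constant: goes outside aes(), every point is blue
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inspect the colors computed for the point layer of each plot
length(unique(layer_data(p_mapped)$colour))  # number of distinct colors used
unique(layer_data(p_fixed)$colour)           # the single fixed color
```

If "color = "blue"" is placed inside "aes" by mistake, ggplot2 treats "blue" as a data value rather than a color name, so the points will not come out blue.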


Facets/Subplots - two Variables

To facet the plot by 2 variables use the "facet_grid" function as below:

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy), color = "blue") + facet_grid( drv ~ class)

Here we ask for a Scatter plot with one subplot for each combination of the "drv" (rows) and "class" (columns) variables.


As you can see, each subplot is labelled by two variables: "class" across the top and "drv" down the side.
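Some subplots in such a grid can be empty, because not every drive type occurs in every vehicle class.  A small sketch with base R's "table" shows how many vehicles fall into each "drv" / "class" combination, which corresponds to the panels of the grid:

```r
library(ggplot2)  # only needed here for the mpg data set

# Cross-tabulate drive type (rows) against vehicle class (columns),
# matching the layout of facet_grid(drv ~ class)
combos <- table(drv = mpg$drv, class = mpg$class)
print(combos)

# Number of combinations with no vehicles, i.e. empty panels
sum(combos == 0)
```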

geom_smooth: Line Charts

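To generate a line chart in R, use the function geom_smooth as below.  Note that geom_smooth does not connect the individual points; it draws a smoothed trend line fitted to the data, by default with a shaded confidence band around it:

ggplot(data = mpg) + geom_smooth(mapping = aes(x=displ, y=hwy))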

The line chart above shows the relationship between fuel efficiency on highways and engine size in litres.
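By default geom_smooth chooses the smoothing method itself.  If a plain straight-line fit is preferred, the "method" and "se" arguments can be set explicitly.  The following is a minimal sketch under that assumption; the "lm" call at the end is only there to show the slope of the same straight-line fit as a number:

```r
library(ggplot2)

# Scatter chart plus a straight-line (linear model) fit without the confidence band
p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
print(p_lm)

# The same straight-line fit with base R, to read off intercept and slope
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)
```

A negative slope for "displ" means that highway efficiency tends to fall as engine size grows.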

Line Types

We can plot the relationship by drive type, with one line for each drive type, using the "linetype" attribute as below:

ggplot(data = mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype = drv))


Multiple Charts in same plot

ggplot2 supports combining multiple chart types (i.e. scatter & line charts) in the same plot.  The code example below shows the Scatter & Line Charts in the same plot:

ggplot(data = mpg) + 
geom_smooth(mapping = aes(x=displ, y=hwy, linetype = drv, color = drv)) +
geom_point(mapping = aes(x=displ, y=hwy, color = drv))
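When several layers use the same mappings, repeating them in every geom gets tedious.  One option is to put the shared mappings in the "ggplot" call, so that every layer inherits them, and keep only the layer-specific ones in the geom.  A sketch of the same plot written that way:

```r
library(ggplot2)

# x, y and color are shared by both layers, so they go in ggplot();
# linetype only applies to the smooth line, so it stays in geom_smooth()
p_shared <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(mapping = aes(linetype = drv)) +
  geom_point()
print(p_shared)
```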

geom_boxplot: Box Plots

A box plot can be generated in R using the syntax below:

ggplot(data = mpg, aes(class, hwy)) + 
geom_boxplot()
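Box plots are often easier to compare when the boxes are sorted.  The sketch below is one possible variation, not the only way to do it: it uses base R's "reorder" to order the vehicle classes by their median highway efficiency, and "tapply" to show those medians as numbers.

```r
library(ggplot2)

# Median highway efficiency per vehicle class
medians <- tapply(mpg$hwy, mpg$class, median)
print(sort(medians))

# Box plot with the classes ordered by that median
p_box <- ggplot(data = mpg,
                mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "class")
print(p_box)
```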


There are several other options available for plotting various charts with ggplot2.  See the cheatsheet for all the available options:  Data Visualization with R
