Graphs are often the starting point for statistical analysis. One of the main advantages of R is how easy it is for the user to create many different kinds of graphs. We begin this chapter with conventional graphs and then move on to more elaborate representations, which are built with the ggplot2 package.

1. Conventional Graphical Functions

To begin with, it may be interesting to examine a few examples of graphical representations which can be constructed with R. We use the demo function:

demo(graphics)

The plot function

The plot function is a generic function used to represent all kinds of data. Its classical use is to draw a scatterplot of a variable y against another variable x. For example, to draw the graph of the function \(x\mapsto \sin(x)\) on \([-2\pi,2\pi]\) at regular steps, we use the following commands:

x <- seq(-2*pi,2*pi,by=0.05)
y <- sin(x)
plot(x,y) #dot representation (default)
plot(x,y,type="l") #line representation
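
Because plot is generic, the picture it draws depends on the class of its first argument. The sketch below illustrates this on the built-in iris dataset, which is an assumption made here for illustration and is not used elsewhere in this chapter.

```r
# plot() dispatches on the class of its argument (iris ships with R)
data(iris)
plot(iris$Sepal.Length)                  # numeric vector: values against their index
plot(iris$Species)                       # factor: barplot of the counts
plot(Sepal.Length~Species,data=iris)     # numeric ~ factor: one boxplot per level
plot(iris[,1:4])                         # data frame: matrix of scatterplots
# the method that handles factors can be retrieved and inspected
plot_factor_method <- getS3method("plot","factor")
```

Typing methods(plot) lists all the classes for which a dedicated plot method is available in the current session.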

We provide examples of representations for quantitative and qualitative variables. We use the data file ozone.txt, imported using

setwd("~/Dropbox/LAURENT/COURS/EDHEC/R/FICHES")
path <- file.path("../DATA", "ozone.txt") 
ozone <- read.table(path)
summary(ozone)

Let us start by representing two quantitative variables: maximum ozone maxO3 according to temperature T12:

plot(ozone[,"T12"],ozone[,"maxO3"])

As the two variables are contained and named within the same table, a simpler syntax can be used, which automatically inserts the variables as labels for the axes:

plot(maxO3~T12,data=ozone)

The same labels can also be set by hand, at the cost of a longer call:

plot(ozone[,"T12"],ozone[,"maxO3"],xlab="T12",ylab="maxO3")
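
Readers who do not have ozone.txt at hand can try the same calls on a dataset shipped with R. The sketch below assumes that the built-in airquality data frame is an acceptable stand-in, with Ozone and Temp playing the roles of maxO3 and T12.

```r
# airquality ships with R; Ozone and Temp stand in for maxO3 and T12
data(airquality)
plot(airquality[,"Temp"],airquality[,"Ozone"])  # axis labels are the R expressions
plot(Ozone~Temp,data=airquality)                # axis labels are the variable names
# both calls draw the same points; rows with a missing value are not drawn
n_drawn <- sum(complete.cases(airquality[,c("Temp","Ozone")]))
n_drawn
```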

The functions hist, barplot and boxplot draw classical graphs:

hist(ozone$maxO3,main="Histogram")
barplot(table(ozone$wind)/nrow(ozone),col="blue")
boxplot(maxO3~wind,data=ozone)
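
In the barplot call above, dividing the table by the number of rows turns counts into relative frequencies. A minimal sketch on the built-in mtcars dataset (used here only as a stand-in for ozone) shows the intermediate objects; prop.table is a base R shortcut that gives the same result.

```r
data(mtcars)
counts <- table(mtcars$cyl)     # number of cars for each value of cyl
freq <- counts/nrow(mtcars)     # relative frequencies, summing to 1
freq2 <- prop.table(counts)     # equivalent shortcut
barplot(freq,col="blue",ylab="Relative frequency")
```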

Interactive graphs with rAmCharts

We can use this package to obtain dynamic graphs. It is easy: we just have to add the prefix am before the name of the function (with its first letter capitalised):

library(rAmCharts)
amHist(ozone$maxO3)
amPlot(ozone,col=c("T9","T12"))
amBoxplot(maxO3~wind,data=ozone)

Exercise 1

  • Draw the sine function between 0 and \(2\pi\).
  • Add the following title: plot of sine function.

Exercise 2

  • Draw the pdf (probability density function) of the standard Gaussian distribution between \(-4\) and 4 (use dnorm).
  • Add the vertical dashed line with equation \(x=0\) (use abline).
  • On the same graph, draw Student’s \(t\)-distribution with 5 and 30 degrees of freedom (use dt). Use the lines function and a different colour for each line.
  • Add a legend at the top left to differentiate between the distributions (use legend).

Exercise 3 (Law of Large Numbers) - optional

  • After setting the seed of the random generator (use set.seed), simulate a sample \((x_1,\dots,x_{1000})\) from the Bernoulli distribution with parameter \(p=0.6\) (use rbinom).
  • Calculate the successive means \(M_l=S_l/l\) where \(S_l=\sum_{i=1}^l x_i\) (use cumsum). Plot \(M_l\) against \(l\), then add the horizontal line with equation \(y=0.6\).

Exercise 4 (Central Limit Theorem) - optional

  • Let us denote \(X_1,X_2,\cdots,X_N\) i.i.d. random variables following Bernoulli’s distribution with parameter \(p\). Recall the distribution of \(S_N=X_1+\ldots + X_N\). Specify the mean and standard deviation.
  • Set \(p=0.5\). For \(N=10\), using the rbinom function, simulate \(n=1000\) realisations \(S_1,\dots,S_{1000}\) of a binomial distribution with parameters \(N\) and \(p\). Store the quantities \(\frac{S_i-N\times p}{\sqrt{N\times p\times (1-p)}}\) in a vector U10.
  • Do the same with \(N=30\) and \(N=1000\) to obtain two new vectors U30 and U1000.
  • In one window (use par(mfrow=c(1,3))), draw the histograms of U10, U30 and U1000, each time overlaying the density of the standard Gaussian distribution (obtained using dnorm).

2. ggplot2

ggplot2 is a plotting system for R based on the grammar of graphics, in the same spirit as dplyr for data manipulation. Documentation is available on the ggplot2 website. We consider a subsample of the diamonds dataset from the ggplot2 package:

library(ggplot2)
set.seed(1234)
diamonds2 <- diamonds[sample(nrow(diamonds),5000),] 
summary(diamonds2)
help(diamonds)

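Given a dataset, a graph is defined from several layers. We have to specify the data, the variables we want to plot, and the type of representation (scatterplot, barplot…). In ggplot, the data are given to the ggplot function, the variables to aes, and the type of representation to a geom_ function.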

With the plot function, the scatterplot of price versus carat is obtained with

plot(price~carat,data=diamonds2)

With ggplot, we use

ggplot(diamonds2) # empty panel: no variables yet
ggplot(diamonds2)+aes(x=carat,y=price) # axes only: no geom yet
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point() # scatterplot
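
A ggplot graph is an ordinary R object, so it can be stored and completed step by step before being drawn. The sketch below re-creates diamonds2 so that it runs on its own; the names p and pts are illustrative choices, and layer_data is used only to look at the data drawn by the first layer.

```r
library(ggplot2)
set.seed(1234)
diamonds2 <- diamonds[sample(nrow(diamonds),5000),]
p <- ggplot(diamonds2)+aes(x=carat,y=price)  # stored, not drawn yet
p <- p+geom_point()                          # add the layer that draws the points
p                                            # printing the object renders the graph
pts <- layer_data(p,1)                       # data drawn by the first layer
nrow(pts)                                    # one row per diamond in diamonds2
```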

Exercise 5

  • Draw the histogram of carat (use geom_histogram)
  • Draw the histogram of carat with 10 bins (help(geom_histogram))
  • Draw the barplot for the variable cut (use geom_bar)

2.1 ggplot grammar

In ggplot, the syntax is defined from independent elements. These elements define the grammar of ggplot. Main elements of the grammar include:

  • Data (ggplot): the dataset, which should be a data frame
  • Aesthetics (aes): to describe how variables in the data are mapped to visual properties. All the variables used in the graph should be specified in aes
  • Geometrics (geom_…): to control the type of plot
  • Statistics (stat_…): to describe transformation of the data
  • Scales (scale_…): to control the mapping from data to aesthetic attributes (change of colors…)

All these elements are combined with a +.

Data and aesthetics

These two elements specify the data and the variables we want to represent. For a scatterplot of price vs carat we enter the command

ggplot(diamonds2)+aes(x=carat,y=price)

aes also accepts arguments such as color, size and fill. We use these arguments whenever a color or a size is determined by a variable of the dataset:

ggplot(diamonds2)+aes(x=carat,y=price,color=cut)

Geometrics

To obtain the graph, we need to specify the type of representation. We use geometrics to do that. For a scatter plot, we use geom_point:

ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
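
An aesthetic placed inside aes is mapped to a variable, whereas the same argument given directly to the geom sets a constant value. The following sketch contrasts the two cases; it re-creates diamonds2 so that it runs on its own, and the names p_map and p_set are illustrative.

```r
library(ggplot2)
set.seed(1234)
diamonds2 <- diamonds[sample(nrow(diamonds),5000),]
# mapped: one color per level of cut, with a legend
p_map <- ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
# set: every point is blue, no legend
p_set <- ggplot(diamonds2)+aes(x=carat,y=price)+geom_point(color="blue")
p_map
p_set
```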

Observe that ggplot adds the legend automatically. Examples of geometrics are described here:

Geom               Description                              Aesthetics
geom_point()       Scatter plot                             x, y, shape, fill
geom_line()        Line (ordered according to x)            x, y, linetype
geom_abline()      Line                                     slope, intercept
geom_path()        Line (ordered according to the index)    x, y, linetype
geom_text()        Text                                     x, y, label, hjust, vjust
geom_rect()        Rectangle                                xmin, xmax, ymin, ymax, fill, linetype
geom_polygon()     Polygon                                  x, y, fill, linetype
geom_segment()     Segment                                  x, y, xend, yend, linetype
geom_bar()         Barplot                                  x, fill, linetype, weight
geom_histogram()   Histogram                                x, fill, linetype, weight
geom_boxplot()     Boxplot                                  x, y, fill, weight
geom_density()     Density                                  x, y, fill, linetype
geom_contour()     Contour lines                            x, y, fill, linetype
geom_smooth()      Smoother (linear or nonlinear)           x, y, fill, linetype
All geoms                                                   color, size, group

Exercise 6

  • Draw the barplot of cut (with blue bars).
  • Draw the barplot of cut with one color for each level of cut.

Statistics (this part can be omitted for beginners)

Many graphs (barplot, histogram…) require the data to be transformed before they can be drawn. Simple transformations can be obtained quickly. For instance, we can draw the sine function with

D <- data.frame(X=seq(-2*pi,2*pi,by=0.01))
ggplot(D)+aes(x=X,y=sin(X))+geom_line()

The sine transformation is specified in aes. For more complex transformations, we have to use statistics. A stat function takes a dataset as input and returns a dataset as output, and so a stat can add new variables to the original dataset. It is possible to map aesthetics to these new variables. For example, stat_bin, the statistic used to make histograms, produces the following variables:

  • count, the number of observations in each bin
  • density, the density of observations in each bin (percentage of total / bar width)
  • x, the center of the bin

By default, geom_histogram represents on the \(y\)-axis the number of observations in each bin (the output count).

ggplot(diamonds)+aes(x=price)+geom_histogram(bins=40)

For the density, we use

ggplot(diamonds)+aes(x=price,y=after_stat(density))+geom_histogram(bins=40) # y=..density.. before ggplot2 3.3.0
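
The variables computed by stat_bin can be inspected with layer_data, which returns the data frame actually drawn by a layer. The sketch below assumes ggplot2 3.3.0 or later, where after_stat(density) refers to a computed variable; the names p and h are illustrative.

```r
library(ggplot2)
# after_stat() needs ggplot2 >= 3.3.0
p <- ggplot(diamonds)+aes(x=price,y=after_stat(density))+geom_histogram(bins=40)
h <- layer_data(p)                 # one row per bin, as computed by stat_bin
names(h)                           # includes count, density, x, xmin and xmax
sum(h$count)                       # total count: the number of rows of diamonds
sum(h$density*(h$xmax-h$xmin))     # area under the density histogram: 1
```

The last line checks the definition given above: the density of a bin is its share of the observations divided by the bin width, so the bar areas add up to one.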

ggplot offers another way to build a representation: we can use stat_ instead of geom_. Formally, each stat function has a default geom and each geom has a default stat. For instance,

ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")
ggplot(diamonds2)+aes(x=carat,y=price)+stat_smooth(method="loess")

lead to the same graph. We can change the type of representation in the stat_ with the argument geom:

ggplot(diamonds2)+aes(x=carat,y=price)+stat_smooth(method="loess",geom="point")

Here are some examples of stat functions:

Stat              Description         Parameters
stat_identity()   No transformation
stat_bin()        Count               binwidth, origin
stat_density()    Density             adjust, kernel
stat_smooth()     Smoother            method, se
stat_boxplot()    Boxplot             coef

stat and geom are not always easy to combine. For beginners, we recommend using only geom.

Exercise 7

We consider a color variable \(X\) with probability distribution \[P(X=red)=0.3,\ P(X=blue)=0.2,\ P(X=green)=0.4,\ P(X=black)=0.1\] Draw the barplot of this distribution.

Scales

Scales control the mapping from data to aesthetic attributes (change of colors, sizes…). We generally use this element at the end of the process to refine the graph. Scales are defined as follows:

  • begin with scale_
  • add the aesthetic we want to modify (color, fill, x…)
  • end with the name of the scale (manual, identity…)
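
The three steps above can be checked programmatically. The following sketch assembles a few scale names with paste and verifies that ggplot2 exports functions with these names; the vector names candidates and found are illustrative.

```r
library(ggplot2)
# scale_ + aesthetic + name of the scale
candidates <- c(paste("scale","color","manual",sep="_"),
                paste("scale","fill","brewer",sep="_"),
                paste("scale","x","continuous",sep="_"),
                paste("scale","shape","identity",sep="_"))
candidates
# each assembled name is a function exported by ggplot2
found <- candidates %in% getNamespaceExports("ggplot2")
found
```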

For instance,

ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()+
scale_color_manual(values=c("Fair"="black","Good"="yellow",
"Very Good"="blue","Premium"="red","Ideal"="green"))

Here are the main scales:

aes                      Discrete    Continuous
Color (color and fill)   brewer      gradient
                         grey        gradient2
                         hue         gradientn
                         identity
                         manual
Position (x and y)       discrete    continuous
                                     date
Shape                    shape
                         identity
                         manual
Size                     identity    size
                         manual

Some examples:

  • Color of a barplot:

p1 <- ggplot(diamonds2)+aes(x=cut)+geom_bar(aes(fill=cut))
p1

We change the colors by using the Purples palette:

p1+scale_fill_brewer(palette="Purples")

  • Gradient color for a scatter plot:

p2 <- ggplot(diamonds2)+aes(x=carat,y=price)+geom_point(aes(color=depth))
p2

We change the color gradient:

p2+scale_color_gradient(low="red",high="yellow")

  • Changes on the axes and on the legend title:

p2+scale_x_continuous(breaks=seq(0.5,3,by=0.5))+scale_y_continuous(name="Price")+scale_color_gradient("Depth")

Group and facets

ggplot can draw representations for subgroups of individuals. We can proceed in two ways:

  • to represent the subgroups on the same graph, we use group in aes
  • to represent the subgroups on different graphs, we use facets

We can represent (on the same graph) the smoother of price vs carat for each level of cut with

ggplot(diamonds2)+aes(x=carat,y=price,group=cut,color=cut)+geom_smooth(method="loess")

To obtain the representation on many graphs, we use

ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")+facet_wrap(~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")+facet_wrap(~cut,nrow=1)

facet_grid and facet_wrap do the same job but split the screen in different ways:

ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+geom_smooth(method="lm")+facet_grid(color~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+geom_smooth(method="lm")+facet_wrap(color~cut)
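
The difference shows when some combinations of the two variables are empty: facet_grid draws the full rows-by-columns grid, while facet_wrap draws one panel per combination observed in the data. The sketch below uses the built-in mtcars dataset, in which no car has both cyl=8 and gear=4; the names tab, n_grid and n_wrap are illustrative.

```r
library(ggplot2)
data(mtcars)
tab <- table(mtcars$cyl,mtcars$gear)
n_grid <- nrow(tab)*ncol(tab)   # panels drawn by facet_grid(cyl~gear)
n_wrap <- sum(tab>0)            # panels drawn by facet_wrap(cyl~gear)
c(n_grid,n_wrap)
p <- ggplot(mtcars)+aes(x=disp,y=mpg)+geom_point()
p+facet_grid(cyl~gear)          # full grid, one panel stays empty
p+facet_wrap(cyl~gear)          # only the observed combinations
```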

2.2 Complements

Syntax for ggplot is defined according to the following scheme:

ggplot()+aes()+geom_()+scale_()

The syntax is flexible: for instance, aes can be specified in ggplot or in geom_:

ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()
ggplot(diamonds2,aes(x=carat,y=price))+geom_point()
ggplot(diamonds2)+geom_point(aes(x=carat,y=price))
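
With a single geom the three forms are equivalent. With several geoms they differ: aesthetics given in ggplot or added with + aes are inherited by every layer, while aesthetics given inside a geom apply to that layer only. A minimal sketch on the built-in mtcars dataset, with illustrative names p and q:

```r
library(ggplot2)
data(mtcars)
# global aes: shared by the points and by the smoother
p <- ggplot(mtcars,aes(x=disp,y=mpg))+geom_point()+geom_smooth(method="lm")
# local aes: color is mapped for the points only, so a single line is fitted
q <- ggplot(mtcars,aes(x=disp,y=mpg))+geom_point(aes(color=factor(cyl)))+
  geom_smooth(method="lm")
p
q
```

Moving color=factor(cyl) into the global aes would instead fit one line per value of cyl, as in the group example of the previous section.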

We can also build a graph from several datasets:

X <- seq(-2*pi,2*pi,by=0.001)
Y1 <- cos(X)
Y2 <- sin(X)
donnees1 <- data.frame(X,Y1)
donnees2 <- data.frame(X,Y2)
ggplot(donnees1)+geom_line(aes(x=X,y=Y1))+
geom_line(data=donnees2,aes(x=X,y=Y2),color="red")

ggplot2 provides many other functions:

  • ggtitle to add a title
  • ggsave to save a graph
  • theme_ to change the theme of the graph

p <- ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
p+theme_bw()
p+theme_classic()
p+theme_grey()
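
As a sketch of how these functions fit together, the block below adds a title and a theme to the scatterplot and saves the result. The title text and the temporary output file are hypothetical choices made for this example; diamonds2 is re-created so that the block runs on its own.

```r
library(ggplot2)
set.seed(1234)
diamonds2 <- diamonds[sample(nrow(diamonds),5000),]
p <- ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
p2 <- p+ggtitle("Price versus carat")+theme_bw()
# hypothetical output file, written to the session's temporary directory
out_file <- tempfile(fileext=".pdf")
ggsave(out_file,plot=p2,width=6,height=4)  # size in inches by default
file.exists(out_file)
```

When the plot argument is omitted, ggsave saves the last graph that was displayed.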

Exercise 8

  1. Draw the sine and cosine functions on the same graph. First use two datasets (one for the sine function, the other for the cosine function).

  2. Do the same with one dataset. Hint: use the melt function of the package reshape2.

  3. Draw the two functions on two different graphs (use facet_wrap).

  4. Do the same with the function grid.arrange from the package gridExtra.

Exercise 9

We consider the dataset mtcars

data(mtcars)
summary(mtcars)

  1. Draw the histogram of mpg (try several numbers of bins).

  2. Represent the density on the \(y\)-axis.

  3. Draw the barplot of cyl.

  4. Draw the scatter plot disp vs mpg for each value of cyl (one color for each value of cyl).

  5. Add the linear smoother on each graph.

Exercise 10

  1. Draw the sine function on \([-2\pi,2\pi]\)
  2. Add the lines (in blue) of equation \(y=1\) and \(y=-1\). Use size=2.

Exercise 11

  1. Simulate a sample \((x_i,y_i),i=1,\dots,100\) according to the linear model \[Y_i=3+X_i+\varepsilon_i\] where the \(X_i\) are i.i.d. uniform on \([0,1]\) and the \(\varepsilon_i\) are i.i.d. Gaussian \(N(0,0.2^2)\) (use runif and rnorm).

  2. Draw the scatter plot y vs x and add the linear smoother.

  3. Draw the residuals: add a vertical line from each point to the linear smoother (use geom_segment).

Exercise 12 (optional)

File exo3_ggplot.dta contains \(n=100\) observations. We want to explain \(Y\) by the other variables with a linear model.

  1. Fit the linear model (lm function). What can we say about it?

D1 <- read.table("../DATA/exo3_ggplot.dta",header=TRUE,sep=";")
m1 <- lm(Y~.,data=D1)
summary(m1)

  2. Calculate the partial residuals (residuals(…,type="partial")):

res <- data.frame(residuals(m1,type="partial"))

  3. For each explanatory variable, draw the scatter plot of the partial residuals versus the variable. Add the loess smoother.

  4. Improve the linear model.

m2 <- lm(Y~X1+X2+X3+X4+I(X4^2),data=D1)
summary(m2)

Challenge (optional)

We consider the diamonds dataset.

  1. Obtain the following graphs (use coord_flip for the second graph).

  2. Add to the third graph a line for the quartiles of the variable carat (for each value of cut).

  3. Draw the following graph (use the ggstance package).

