Graphs are often the starting point for statistical analysis. One of the main advantages of R is how easy it is for the user to create many different kinds of graphs. We begin this chapter by studying conventional graphs, followed by an examination of some more complex representations. This final part uses the ggplot2 package.
To begin with, it may be interesting to examine a few example of graphical representations which can be constructed with R. We use the demo function:
demo(graphics)
The plot function is a generic function used to represent all kinds of data. Classical use of the plot function consists of representing a scatterplot for a variable y according to another variable x. For example, to represent the graph of the function \(x\mapsto \sin(2\pi x)\) on \([0,1]\), at regular steps we use the following commands:
x <- seq(-2*pi,2*pi,by=0.05)
y <- sin(x)
plot(x,y) #dot representation (default)
plot(x,y,type="l") #line representation
We provide examples of representations for quantitative and qualitative variables. We use the data file ozone.txt, imported using
setwd("~/Dropbox/LAURENT/COURS/EDHEC/R/FICHES")
path <- file.path("../DATA", "ozone.txt")
ozone <- read.table(path)
summary(ozone)
Let us start by representing two quantitative variables: maximum ozone maxO3 according to temperature T12:
plot(ozone[,"T12"],ozone[,"maxO3"])
As the two variables are contained and named within the same table, a simpler syntax can be used, which automatically inserts the variables as labels for the axes:
plot(maxO3~T12,data=ozone)
We can also use (more complicated)
plot(ozone[,"T12"],ozone[,"maxO3"],xlab="T12",ylab="maxO3")
Functions histogram, barplot and boxplot allow to draw classical graphs:
hist(ozone$maxO3,main="Histogram")
barplot(table(ozone$wind)/nrow(ozone),col="blue")
boxplot(maxO3~wind,data=ozone)
We can use this package to obtain dynamic graphs. It is easy, we just have to use the prefix am beforme the name of the function:
library(rAmCharts)
amHist(ozone$maxO3)
amPlot(ozone,col=c("T9","T12"))
amBoxplot(maxO3~wind,data=ozone)
ggplot2 is a plotting system for R based on the grammar of graphics (as dplyr to manipulate data). We can find documentation here. We consider a subsample of the diamond dataset from the package ggplot2:
library(ggplot2)
set.seed(1234)
diamonds2 <- diamonds[sample(nrow(diamonds),5000),]
summary(diamonds2)
help(diamonds)
Given a dataset, a graph is defined from many layers. We have to specify:
Ggplot graphs are defined from these layers. We indicate
The scatterplot carat vs price is obtained with the plot function with
plot(price~carat,data=diamonds2)
With ggplot, we use
ggplot(diamonds2) #nothing
ggplot(diamonds2)+aes(x=carat,y=price) #nothing
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point() #good
In ggplot, the syntax is defined from independent elements. These elements define the grammar of ggplot. Main elements of the grammar include:
All these elements are conbined with a +.
Data and aesthetics
These two elements specify the data and the variables we want to represent. For a scaterplot price vs carat we enter the command
ggplot(diamonds2)+aes(x=carat,y=price)
aes also use arguments such as color, size, fill. We use these arguments as soon as a color or a size is defined from a variable of the dataset:
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)
Geometrics
To obtain the graph, we need to precise the type of representation. We use geometrics to do that. For a scatter plot, we use geom_point:
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
Observe that ggplot adds the lengend automatically. Exemples of geometrics are described here:
Geom | Description | Aesthetics |
---|---|---|
geom_point() | Scatter plot | x, y, shape, fill |
geom_line() | Line (ordered according to x) | x, y, linetype |
geom_abline() | Line | slope, intercept |
geom_path() | Line (ordered according to the index) | x, y, linetype |
geom_text() | Text | x, y, label, hjust, vjust |
geom_rect() | Rectangle | xmin, xmax, ymin, ymax, fill, linetype |
geom_polygon() | Polygone | x, y, fill, linetype |
geom_segment() | Segment | x, y, fill, linetype |
geom_bar() | Barplot | x, fill, linetype, weight |
geom_histogram() | Histogram | x, fill, linetype, weight |
geom_boxplot() | Boxplots | x, y, fill, weight |
geom_density() | Density | x, y, fill, linetype |
geom_contour() | Contour lines | x, y, fill, linetype |
geom_smooth() | Smoothers (linear or non linear) | x, y, fill, linetype |
All | color, size, group |
Statistics
(this part can be omitted for beginners)Many graphs need to transform the data to make the representation (barplot, histogram). Simple transformations can be obtained quickly. For instance we can draw the sine function with
D <- data.frame(X=seq(-2*pi,2*pi,by=0.01))
ggplot(D)+aes(x=X,y=sin(X))+geom_line()
The sine transformation is precised in aes. For more complex transformations, we have to used statistics. A stat function takes a dataset as input and returns a dataset as output, and so a stat can add new variables to the original dataset. It is possible to map aesthetics to these new variables. For example, stat_bin, the statistic used to make histograms, produces the following variables:
count
, the number of observations in each bindensity
, the density of observations in each bin (percentage of total / bar width)x
, the center of the binBy default geom_histogram represents on the \(y\)-axis the number of observations in each bin (the outuput count).
ggplot(diamonds)+aes(x=price)+geom_histogram(bins=40)
For the density, we use
ggplot(diamonds)+aes(x=price,y=..density..)+geom_histogram(bins=40)
ggplot propose another way to make the representations: we can use stat_ instead of geom_. Formally, each stat function has a geom and each geom has a stat. For instance,
ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")
ggplot(diamonds2)+aes(x=carat,y=price)+stat_smooth(method="loess")
lead to the same graph. We can change the type of representation in the stat_ with the argument geom:
ggplot(diamonds2)+aes(x=carat,y=price)+stat_smooth(method="loess",geom="point")
Here are some examples of stat functions
Stat | Description | Parameters |
---|---|---|
stat_identity() | No transformation | |
stat_bin() | Count | binwidth, origin |
stat_density() | Density | adjust, kernel |
stat_smooth() | Smoother | method, se |
stat_boxplot() | Boxplot | coef |
stat and geom are not always easy to combine. For beginners, we recommand to only use geom.
We consider a color variable \(X\) with probability distribution \[P(X=red)=0.3,\ P(X=blue)=0.2,\ P(X=green)=0.4,\ P(X=black)=0.1\] Draw the barplot of this distribution.
Scales
Scales control the mapping from data to aesthetic attributes (change of colors, sizes…). We generally use this element at the end of the process to refine the graph. Scales are defined as follows:
For instance,
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()+
scale_color_manual(values=c("Fair"="black","Good"="yellow",
"Very Good"="blue","Premium"="red","Ideal"="green"))
Here are the main scales:
aes | Discrete | Continuous |
---|---|---|
Couleur (color et fill) | brewer | gradient |
- | grey | gradient2 |
- | hue | gradientn |
- | identity | |
- | manual | |
Position (x et y) | discrete | continous |
- | date | |
Forme | shape | |
- | identity | |
- | manual | |
Taille | identity | size |
- | manual |
Some examples:
color of a barplot
p1 <- ggplot(diamonds2)+aes(x=cut)+geom_bar(aes(fill=cut))
p1
We change colors by using the palette Purples :
p1+scale_fill_brewer(palette="Purples")
Gradient color for a scatter plot
:p2 <- ggplot(diamonds2)+aes(x=carat,y=price)+geom_point(aes(color=depth))
p2
We change the gradient color
p2+scale_color_gradient(low="red",high="yellow")
Change on the axis
p2+scale_x_continuous(breaks=seq(0.5,3,by=0.5))+scale_y_continuous(name="prix")+scale_color_gradient("Profondeur")
Group and facets
ggplot allows to make representations for subgroup of individuals. We can proceed in two ways:
We can represent (on the same graph) the smoother price vs carat for each modality of cut with
ggplot(diamonds2)+aes(x=carat,y=price,group=cut,color=cut)+geom_smooth(method="loess")
To obtain the representation on many graphs, we use
ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")+facet_wrap(~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")+facet_wrap(~cut,nrow=1)
facet_grid and facet_wrap do the same job but split the screen in different ways:
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+geom_smooth(method="lm")+facet_grid(color~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+geom_smooth(method="lm")+facet_wrap(color~cut)
Syntax for ggplot is defined according to the following scheme:
ggplot()+aes()+geom_()+scale_()
It is more flexible: for instance aes could be specified in ggplot or in geom_
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()
ggplot(diamonds2,aes(x=carat,y=price))+geom_point()
ggplot(diamonds2)+geom_point(aes(x=carat,y=price))
We can also built our graph with many datasets:
X <- seq(-2*pi,2*pi,by=0.001)
Y1 <- cos(X)
Y2 <- sin(X)
donnees1 <- data.frame(X,Y1)
donnees2 <- data.frame(X,Y2)
ggplot(donnees1)+geom_line(aes(x=X,y=Y1))+
geom_line(data=donnees2,aes(x=X,y=Y2),color="red")
Many other functions are proposed by ggplot:
p <- ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
p+theme_bw()
p+theme_classic()
p+theme_grey()
p+theme_bw()
Draw the sine and cosine funtions on the same graph. You first use two datasets (one for the sine function, the other for the cosine function).
Do the same with one dataset. Hint: use the melt function of the package reshape2.
Draw the two functions on two different graphs (use facet_wrap).
Do the same with the function grid.arrange from the package gridExtra.
We consider the dataset mtcars
data(mtcars)
summary(mtcars)
Draw the histogram of mpg (use many number of bins)
Represent the density on the \(y\)-axis.
Draw the barplot of cyl.
Draw the scatter plot disp vs mpg for each value of cyl (one color for each value of cyl).
Add the linear smoother on each graph.
Simulate a sample \((x_i,y_i),i=1,\dots,100\) according to the linear model \[Y_i=3+X_i+\varepsilon_i\] where \(X_i\) are i.i.d. and uniform on \([0,1]\) and \(\varepsilon_i\) are gaussian \(N(0,0.2^2)\) (use runif and rnorm)
Draw the scatter plot y vs x and add the linear smoother.
Draw the residuals: add a vertical line from each point to the linear smoother (use geom_segment).
File exo3_ggplot.dta contains \(n=100\) observations. We want to explain \(Y\) by the other variables with a linear model.
D1 <- read.table("../DATA/exo3_ggplot.dta",header=T,sep=";")
m1 <- lm(Y~.,data=D1)
summary(m1)
res <- data.frame(residuals(m1,type="partial"))
Draw for each explanatory variable, the scatter plot of the partial residuals versus the variable. Add the smoother loess.
Improve the linear model.
m2 <- lm(Y~X1+X2+X3+X4+I(X4^2),data=D1)
summary(m2)
We consider the mtcars dataset (exercise 9).
Add on the third graph a line for the quartiles of the variable carat (for each value of cut)
Draw the following graph (use the ggstance package).