This in-class exercise explores programming data visualisation in R. It introduces the “Grammar of Graphics”, ggplot2 for static graphics, ggiraph and plotly for interactive graphics, and tidyverse for the data science workflow.
Check, install and launch ggiraph, plotly, DT and tidyverse packages of R
packages <- c('DT', 'ggiraph', 'plotly', 'tidyverse')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Use read_csv() of readr package to import Exam_data.csv into R
exam_data <- read_csv("data/Exam_data.csv")
glimpse(exam_data)
Rows: 322
Columns: 7
$ ID <chr> "Student321", "Student305", "Student289", "Student22~
$ CLASS <chr> "3I", "3I", "3H", "3F", "3I", "3I", "3I", "3I", "3I"~
$ GENDER <chr> "Male", "Female", "Male", "Male", "Male", "Female", ~
$ RACE <chr> "Malay", "Malay", "Chinese", "Chinese", "Malay", "Ma~
$ ENGLISH <dbl> 21, 24, 26, 27, 27, 31, 31, 31, 33, 34, 34, 36, 36, ~
$ MATHS <dbl> 9, 22, 16, 77, 11, 16, 21, 18, 19, 49, 39, 35, 23, 3~
$ SCIENCE <dbl> 15, 16, 16, 31, 25, 16, 25, 27, 15, 37, 42, 22, 32, ~
summary(exam_data)
      ID               CLASS              GENDER
 Length:322         Length:322         Length:322
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character
     RACE              ENGLISH          MATHS          SCIENCE
 Length:322         Min.   :21.00   Min.   : 9.00   Min.   :15.00
 Class :character   1st Qu.:59.00   1st Qu.:58.00   1st Qu.:49.25
 Mode  :character   Median :70.00   Median :74.00   Median :65.00
                    Mean   :67.18   Mean   :69.33   Mean   :61.16
                    3rd Qu.:78.00   3rd Qu.:85.00   3rd Qu.:74.75
                    Max.   :96.00   Max.   :99.00   Max.   :96.00
The data set contains the year-end examination grades of a cohort of Primary 3 students from a local school.
There are seven attributes in total. Four of them are categorical and the other three are continuous.
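The split between categorical and continuous attributes can be checked from the column classes. The sketch below is illustrative only: `exam_sample` is a hypothetical stand-in built from the first three rows shown in the glimpse() output above, so the snippet runs without the CSV file. With the full data, the same two lines at the end would be applied to exam_data.

```r
# Hypothetical stand-in for exam_data: the first three rows from the glimpse() output
exam_sample <- data.frame(
  ID      = c("Student321", "Student305", "Student289"),
  CLASS   = c("3I", "3I", "3H"),
  GENDER  = c("Male", "Female", "Male"),
  RACE    = c("Malay", "Malay", "Chinese"),
  ENGLISH = c(21, 24, 26),
  MATHS   = c(9, 22, 16),
  SCIENCE = c(15, 16, 16),
  stringsAsFactors = FALSE
)

# Class of each column: "character" marks the categorical attributes,
# "numeric" marks the continuous ones
col_types <- sapply(exam_sample, class)
type_counts <- table(col_types)
print(type_counts)
```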
hist(exam_data$MATHS)
ggplot(data = exam_data, aes(x=MATHS)) +
geom_histogram( bins = 10,
boundary = 100,
color = "black",
fill = "grey") +
ggtitle("Distribution of Maths Score")
Plot a bar chart
ggplot(data = exam_data,
aes(x=RACE)) +
geom_bar()
The width of a dot corresponds to the bin width (or maximum width, depending on the binning algorithm), and dots are stacked, with each dot representing one observation.
ggplot(data = exam_data,
aes(x=MATHS,
fill = RACE)) +
geom_dotplot(binwidth = 2.5,
dotsize = 0.5) +
scale_y_continuous(NULL,
breaks = NULL)
geom_histogram() is used to create a simple histogram by using values in the MATHS field of exam_data:
bins argument was changed to 20 from the default value of 30
color argument, used to change the outline colour, was set to black
fill argument, used to shade the histogram, was set to light blue
ggplot(data = exam_data,
aes(x=MATHS)) +
geom_histogram(bins = 20,
color = "black",
fill = "light blue")
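To see what the bins argument does, the binned values behind a histogram layer can be inspected with layer_data() of ggplot2. This is a minimal sketch, not part of the original exercise: `maths_sample` is a hypothetical stand-in that holds only the first 13 MATHS values visible in the glimpse() output, so the counts differ from those of the full data.

```r
library(ggplot2)

# Hypothetical stand-in: first 13 MATHS values from the glimpse() output
maths_sample <- data.frame(
  MATHS = c(9, 22, 16, 77, 11, 16, 21, 18, 19, 49, 39, 35, 23)
)

p <- ggplot(data = maths_sample,
            aes(x = MATHS)) +
  geom_histogram(bins = 20,
                 color = "black",
                 fill = "light blue")

# layer_data() returns the data ggplot2 computed for the layer:
# one row per bin, with its limits (xmin, xmax) and its count
bin_table <- layer_data(p)[, c("xmin", "xmax", "count")]
print(bin_table)
```

Every observation falls into exactly one bin, so the count column sums to the number of rows in the data.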
The interior colour of the histogram bars is varied by sub-group by mapping the fill aesthetic to GENDER inside aes()
ggplot(data = exam_data,
aes(x=MATHS,
fill = GENDER)) +
geom_histogram(bins = 20,
color = "grey30")
geom_density() computes and plots a kernel density estimate, which is a smoothed version of the histogram
ggplot(data=exam_data,
aes(x = MATHS)) +
geom_density()
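The smoothing can also be looked at outside ggplot2 with density() of base R, which, to my understanding, is the estimator geom_density() relies on by default. The sketch below is illustrative only: `maths_sample` is a hypothetical stand-in holding the first 13 MATHS values visible in the glimpse() output, not the full column.

```r
# Hypothetical stand-in: first 13 MATHS values from the glimpse() output
maths_sample <- c(9, 22, 16, 77, 11, 16, 21, 18, 19, 49, 39, 35, 23)

# Gaussian kernel density estimate with the default bandwidth rule
kde <- density(maths_sample)

# The estimate is a curve evaluated on an evenly spaced grid of x values
print(head(data.frame(x = kde$x, density = kde$y)))

# The area under the curve is approximately 1 (simple Riemann sum over the grid)
step <- diff(kde$x[1:2])
area <- sum(kde$y) * step
print(area)
```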
Two kernel density lines are plotted by mapping GENDER to the colour or fill aesthetic in aes()
ggplot(data=exam_data,
aes(x = MATHS,
colour = GENDER)) +
geom_density()
Boxplots by using geom_boxplot()
ggplot(data=exam_data,
aes(y = MATHS,
x= GENDER)) +
geom_boxplot()
Notches are used in box plots to help visually assess whether the medians of distributions differ. If the notches do not overlap, this is evidence that the medians are different.
ggplot(data=exam_data,
aes(y = MATHS,
x= GENDER)) +
geom_boxplot(notch=TRUE)
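According to the geom_boxplot() documentation, the notches extend 1.58 * IQR / sqrt(n) on either side of the median, which gives a roughly 95% confidence interval for comparing medians. The arithmetic can be sketched as follows. `notch_interval()` is a hypothetical helper written for this note, and `maths_sample` holds only the first 13 MATHS values visible in the glimpse() output, so the numbers are not those of the full data.

```r
# Hypothetical stand-in: first 13 MATHS values from the glimpse() output
maths_sample <- c(9, 22, 16, 77, 11, 16, 21, 18, 19, 49, 39, 35, 23)

# Hypothetical helper: notch limits as median +/- 1.58 * IQR / sqrt(n)
notch_interval <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width,
    upper = median(x) + half_width)
}

notch <- notch_interval(maths_sample)
print(notch)
```

Computing the interval for each group and checking whether the ranges overlap reproduces the visual comparison described above.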
Plot data points using both geom_boxplot() and geom_point()
ggplot(data = exam_data,
aes(y = MATHS,
x = GENDER)) +
geom_boxplot() +
geom_point(position = "jitter",
size = 0.5)
Interactivity: hovering displays student’s ID
p <- ggplot(data = exam_data,
aes(x = MATHS)) +
geom_dotplot_interactive(
aes(tooltip = ID),
stackgroups = TRUE,
binwidth = 1,
method = "histodot") +
scale_y_continuous(NULL,
breaks = NULL)
girafe(
ggobj = p,
width_svg = 6,
height_svg = 6*0.618
)
Interactivity: elements associated with a data_id (i.e. CLASS) will be highlighted on mouse-over
p <- ggplot(data = exam_data,
aes(x = MATHS)) +
geom_dotplot_interactive(
aes(data_id = CLASS,
tooltip = CLASS),
stackgroups = TRUE,
binwidth = 1,
method = "histodot") +
scale_y_continuous(NULL,
breaks = NULL)
girafe(
ggobj = p,
width_svg = 6,
height_svg = 6*0.618
)
CSS code is used to change the highlighting effect
p <- ggplot(data=exam_data,
aes(x = MATHS)) +
geom_dotplot_interactive(
aes(data_id = CLASS),
stackgroups = TRUE,
binwidth = 1,
method = "histodot") +
scale_y_continuous(NULL,
breaks = NULL)
girafe(
ggobj = p,
width_svg = 6,
height_svg = 6*0.618,
options = list(
opts_hover(css = "fill: #202020;"),
opts_hover_inv(css = "opacity:0.2;")
))
exam_data$onclick <- sprintf("window.open(\"%s%s\")",
"https://www.moe.gov.sg/schoolfinder?journey=Primary%20school", as.character(exam_data$ID) )
p <- ggplot(data=exam_data,
aes(x = MATHS)) +
geom_dotplot_interactive(
aes(onclick = onclick),
stackgroups = TRUE,
binwidth = 1,
method = "histodot") +
scale_y_continuous(NULL,
breaks = NULL)
girafe(
ggobj = p,
width_svg = 6,
height_svg = 6*0.618)
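The onclick column used above holds one JavaScript statement per student, built with sprintf(). The sketch below shows the string produced for a single ID taken from the glimpse() output. The variable names `student_id`, `base_url` and `onclick_js` are introduced here for illustration only.

```r
# One ID taken from the glimpse() output (illustration only)
student_id <- "Student321"
base_url <- "https://www.moe.gov.sg/schoolfinder?journey=Primary%20school"

# The two %s placeholders are filled in order: first the base URL, then the ID
onclick_js <- sprintf("window.open(\"%s%s\")", base_url, student_id)
cat(onclick_js, "\n")
```

Note that the ID is appended directly to the URL with no separator, so the opened address ends in `Primary%20schoolStudent321`.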
plot_ly(data = exam_data,
x = ~MATHS,
y = ~ENGLISH)
color argument is mapped to a qualitative visual variable (i.e. RACE)
plot_ly(data = exam_data,
x = ~ENGLISH,
y = ~MATHS,
color = ~RACE)
colors argument is used to change the default colour palette to a ColorBrewer colour palette.
plot_ly(data = exam_data,
x = ~ENGLISH,
y = ~MATHS,
color = ~RACE,
colors = "Set1")
pal <- c("red", "purple", "blue", "green")
plot_ly(data = exam_data,
x = ~ENGLISH,
y = ~MATHS,
color = ~RACE,
colors = pal)
text argument is used to change the default tooltip
plot_ly(data = exam_data,
x = ~ENGLISH,
y = ~MATHS,
text = ~paste("Student ID:", ID,
"<br>Class:", CLASS),
color = ~RACE,
colors = "Set1")
layout() is used to change the layout of the chart, such as the title and the axis ranges.
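A minimal sketch of layout() follows. It is not the original code chunk: the title text and the 0 to 100 axis ranges are assumptions, and `exam_sample` is a hypothetical stand-in built from the first three rows of the glimpse() output so that the snippet runs without the CSV file.

```r
library(plotly)

# Hypothetical stand-in for exam_data: first three rows from the glimpse() output
exam_sample <- data.frame(
  ENGLISH = c(21, 24, 26),
  MATHS   = c(9, 22, 16),
  RACE    = c("Malay", "Malay", "Chinese")
)

# layout() is piped onto the plot; the title and ranges are assumed values
fig <- plot_ly(data = exam_sample,
               x = ~ENGLISH,
               y = ~MATHS,
               color = ~RACE,
               type = "scatter",
               mode = "markers") %>%
  layout(title = "English Score versus Maths Score",
         xaxis = list(range = c(0, 100)),
         yaxis = list(range = c(0, 100)))
```

Printing `fig` in an R Markdown chunk or at the console renders the interactive chart.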
The only extra line you need to include in the code chunk is ggplotly(), which converts a ggplot2 object into an interactive plotly graphic
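A minimal sketch of this pattern follows. It is illustrative rather than the original chunk: `exam_sample` is a hypothetical stand-in built from the first three rows of the glimpse() output, and the plot mirrors the scatterplots used elsewhere in this exercise.

```r
library(ggplot2)
library(plotly)

# Hypothetical stand-in for exam_data: first three rows from the glimpse() output
exam_sample <- data.frame(
  ENGLISH = c(21, 24, 26),
  MATHS   = c(9, 22, 16)
)

# An ordinary static ggplot2 scatterplot
p <- ggplot(data = exam_sample,
            aes(x = MATHS,
                y = ENGLISH)) +
  geom_point(size = 1) +
  coord_cartesian(xlim = c(0, 100),
                  ylim = c(0, 100))

# The extra line: wrap the ggplot object in ggplotly()
interactive_p <- ggplotly(p)
```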
subplot() of the plotly package places two scatterplots next to each other, side by side
To create coordinated scatterplots, highlight_key() of the plotly package is used
d <- highlight_key(exam_data)
p1 <- ggplot(data=d,
aes(x = MATHS,
y = ENGLISH)) +
geom_point(size=1) +
coord_cartesian(xlim=c(0,100),
ylim=c(0,100))
p2 <- ggplot(data=d,
aes(x = MATHS,
y = SCIENCE)) +
geom_point(size=1) +
coord_cartesian(xlim=c(0,100),
ylim=c(0,100))
subplot(ggplotly(p1),
ggplotly(p2))
Click on a data point in one of the scatterplots and see how the corresponding point in the other scatterplot is selected.
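Thing to learn from the code chunk: highlight_key() creates an object of class crosstalk::SharedData, which is what allows the two plots to share a selection.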
A wrapper of the JavaScript Library DataTables
Data objects in R can be rendered as HTML tables using the JavaScript library ‘DataTables’ (typically via R Markdown or Shiny).
DT::datatable(exam_data)
Linked brushing: the scatterplot and the data table share the same highlight_key() object, so points selected in the plot are also selected in the table
d <- highlight_key(exam_data)
p <- ggplot(d, aes(ENGLISH, MATHS)) +
geom_point(size=1) +
coord_cartesian(xlim=c(0,100),
ylim=c(0,100))
gg <- highlight(ggplotly(p),
"plotly_selected")
crosstalk::bscols(gg, DT::datatable(d), widths = 5)
Things to learn from the code chunk:
highlight() is a function of plotly package. It sets a variety of options for brushing (i.e., highlighting) multiple plots. These options are primarily designed for linking multiple plotly graphs, and may not behave as expected when linking plotly to another htmlwidget package via crosstalk. In some cases, other htmlwidgets will respect these options, such as persistent selection in leaflet.
bscols() is a helper function of the crosstalk package. It makes it easy to put HTML elements side by side. It can be called directly from the console but is especially designed to work in an R Markdown document. Warning: This will bring in all of Bootstrap!
For attribution, please cite this work as
Dolit (2021, June 26). FinTech & Analytics: Programming Data Visualisation in R. Retrieved from https://adolit.github.io/posts/2021-06-26-programming-datavis-r/
BibTeX citation
@misc{dolit2021programming,
  author = {Dolit, Archie},
  title = {FinTech & Analytics: Programming Data Visualisation in R},
  url = {https://adolit.github.io/posts/2021-06-26-programming-datavis-r/},
  year = {2021}
}