Programming Data Visualisation in R

Static Visualisation Interactive Visualisation R In-class Exercise

This in-class exercise explores programming data visualisation in R. It introduces the “Grammar of Graphics”, ggplot2 for static graphics, ggiraph and plotly for interactive graphics, and tidyverse for the data science workflow.

Archie Dolit (https://www.linkedin.com/in/adolit/), School of Computing and Information Systems, Singapore Management University (https://scis.smu.edu.sg/)
06-26-2021

Installing R Packages and Importing Data

Install and Launch R Packages

Check whether the ggiraph, plotly, DT and tidyverse packages are installed, install any that are missing, and load them all.

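# Packages used in this exercise
packages <- c('DT', 'ggiraph', 'plotly', 'tidyverse')
for (p in packages) {
  # require() returns FALSE when a package cannot be loaded, so install it first
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}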
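
The loop above installs whatever fails to load. As a minimal sketch, assuming you would rather see which packages are missing before installing anything, requireNamespace() can be used to test availability without attaching a package to the search path. The helper name missing_packages is hypothetical and not part of the original exercise.

```r
# Hypothetical helper: return the names of packages that are not installed.
# requireNamespace() loads a namespace if it can, but does not attach it.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# 'stats' and 'utils' ship with R, so neither should be reported as missing.
missing_packages(c("stats", "utils"))
```

The result could then be passed to install.packages() in one call, for example install.packages(missing_packages(packages)).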

Importing Data

Use read_csv() of the readr package to import Exam_data.csv into R.

exam_data <- read_csv("data/Exam_data.csv")
glimpse(exam_data)
Rows: 322
Columns: 7
$ ID      <chr> "Student321", "Student305", "Student289", "Student22~
$ CLASS   <chr> "3I", "3I", "3H", "3F", "3I", "3I", "3I", "3I", "3I"~
$ GENDER  <chr> "Male", "Female", "Male", "Male", "Male", "Female", ~
$ RACE    <chr> "Malay", "Malay", "Chinese", "Chinese", "Malay", "Ma~
$ ENGLISH <dbl> 21, 24, 26, 27, 27, 31, 31, 31, 33, 34, 34, 36, 36, ~
$ MATHS   <dbl> 9, 22, 16, 77, 11, 16, 21, 18, 19, 49, 39, 35, 23, 3~
$ SCIENCE <dbl> 15, 16, 16, 31, 25, 16, 25, 27, 15, 37, 42, 22, 32, ~
summary(exam_data)
      ID               CLASS              GENDER         
 Length:322         Length:322         Length:322        
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
     RACE              ENGLISH          MATHS          SCIENCE     
 Length:322         Min.   :21.00   Min.   : 9.00   Min.   :15.00  
 Class :character   1st Qu.:59.00   1st Qu.:58.00   1st Qu.:49.25  
 Mode  :character   Median :70.00   Median :74.00   Median :65.00  
                    Mean   :67.18   Mean   :69.33   Mean   :61.16  
                    3rd Qu.:78.00   3rd Qu.:85.00   3rd Qu.:74.75  
                    Max.   :96.00   Max.   :99.00   Max.   :96.00  

Static Visualisation

Comparing Base R Histogram vs ggplot2

Base R histogram

hist(exam_data$MATHS)

ggplot2 histogram

ggplot(data = exam_data, aes(x=MATHS)) +
  geom_histogram( bins = 10,
                  boundary = 100,
                  color = "black",
                  fill = "grey") +
  ggtitle("Distribution of Maths Score")

Essential Elements in ggplot2

Geometric Objects: geom_bar

Plot a bar chart

ggplot(data = exam_data,
       aes(x=RACE)) +
  geom_bar()

Geometric Objects: geom_dotplot

The width of a dot corresponds to the bin width (or maximum width, depending on the binning algorithm), and dots are stacked, with each dot representing one observation.

ggplot(data = exam_data,
       aes(x=MATHS,
           fill = RACE)) +
  geom_dotplot(binwidth = 2.5,
               dotsize = 0.5) +
  scale_y_continuous(NULL,
                     breaks = NULL)

Geometric Objects: geom_histogram

geom_histogram() creates a simple histogram from the values in the MATHS field of exam_data:

ggplot(data = exam_data,
       aes(x=MATHS)) + 
  geom_histogram(bins = 20,
                 color = "black",
                 fill = "light blue")

Modifying a geometric object by changing aes()

Mapping GENDER to the fill aesthetic inside aes() colours the interior of the histogram bars by sub-group.

ggplot(data = exam_data,
       aes(x=MATHS,
           fill = GENDER)) + 
  geom_histogram(bins = 20,
                 color = "grey30")

Geometric Objects: geom_density

geom_density() computes and plots a kernel density estimate, which is a smoothed version of the histogram.

ggplot(data=exam_data, 
       aes(x = MATHS)) +
  geom_density()
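
To get a feel for what a kernel density estimate is, the same idea can be tried with base R's density() function. This is only a sketch: the scores vector is synthetic (it is not Exam_data.csv), and the variable names are assumptions made for illustration.

```r
# Synthetic exam-like scores, used only for illustration
set.seed(1)
scores <- rnorm(200, mean = 70, sd = 15)

# density() uses a Gaussian kernel by default and evaluates the
# estimate on a grid of 512 points
kde <- density(scores)
length(kde$x)

# A density curve should enclose an area of roughly 1;
# approximate the area with the trapezoid rule
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
area
```

Unlike a histogram, whose heights depend on where the bin edges fall, the curve depends on the kernel and its bandwidth (kde$bw here).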

Plot one kernel density line per gender by mapping GENDER to the colour (or fill) aesthetic in aes().

ggplot(data=exam_data, 
       aes(x = MATHS, 
           colour = GENDER)) +
  geom_density()

Geometric Objects: geom_boxplot

Draw boxplots with geom_boxplot().

ggplot(data=exam_data, 
       aes(y = MATHS,
           x= GENDER)) +
  geom_boxplot()

Notches are used in box plots to help visually assess whether the medians of distributions differ. If the notches do not overlap, this is evidence that the medians are different.

ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot(notch=TRUE)
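
The notch comparison can also be worked through numerically. The sketch below assumes the rule given in the ggplot2 documentation for geom_boxplot(), where notches extend 1.58 * IQR / sqrt(n) on either side of the median. The two groups are synthetic data, and notch_interval and notches_overlap are hypothetical helper names.

```r
# Hypothetical helper: lower and upper notch limits for a numeric vector,
# assuming the 1.58 * IQR / sqrt(n) rule
notch_interval <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  median(x) + c(-1, 1) * half_width
}

# Hypothetical helper: TRUE when two notch intervals overlap
notches_overlap <- function(a, b) {
  a[1] <= b[2] && b[1] <= a[2]
}

# Synthetic scores for two groups (not taken from Exam_data.csv)
set.seed(42)
group_a <- rnorm(100, mean = 65, sd = 15)
group_b <- rnorm(100, mean = 75, sd = 15)

notch_a <- notch_interval(group_a)
notch_b <- notch_interval(group_b)
notches_overlap(notch_a, notch_b)
```

Because the half-width shrinks with sqrt(n), larger groups produce narrower notches, so the same difference in medians is easier to see.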

Combined geom Objects

Overlay the individual data points on the boxplots by combining geom_boxplot() and geom_point().

ggplot(data = exam_data,
       aes(y = MATHS,
           x = GENDER)) + 
  geom_boxplot() +
  geom_point(position = "jitter",
             size = 0.5)

Interactive Data Visualisation with R - ggiraph methods

Interactive dotplot

Tooltip effect with tooltip aesthetic

Interactivity: hovering displays student’s ID

p <- ggplot(data = exam_data,
            aes(x = MATHS)) +
  geom_dotplot_interactive(
    aes(tooltip = ID),
    stackgroups = TRUE,
    binwidth = 1,
    method = "histodot") +
  scale_y_continuous(NULL,
                     breaks = NULL)
girafe(
    ggobj = p,
    width_svg = 6,
    height_svg = 6*0.618
  )

Hover effect with data_id aesthetic

Interactivity: elements that share a data_id (here, CLASS) are highlighted together on mouse-over.

p <- ggplot(data = exam_data,
            aes(x = MATHS)) +
  geom_dotplot_interactive(
    aes(data_id = CLASS,
        tooltip = CLASS),
    stackgroups = TRUE,
    binwidth = 1,
    method = "histodot") +
  scale_y_continuous(NULL,
                     breaks = NULL)
girafe(
    ggobj = p,
    width_svg = 6,
    height_svg = 6*0.618
  )

Styling hover effect

CSS declarations passed to opts_hover() and opts_hover_inv() change the highlighting effect.

p <- ggplot(data=exam_data, 
       aes(x = MATHS)) +
  geom_dotplot_interactive(              
    aes(data_id = CLASS),              
    stackgroups = TRUE,                  
    binwidth = 1,                        
    method = "histodot") +               
  scale_y_continuous(NULL,               
                     breaks = NULL)
girafe(                                  
  ggobj = p,                             
  width_svg = 6,                         
  height_svg = 6*0.618,
  options = list(
    opts_hover(css = "fill: #202020;"),
    opts_hover_inv(css = "opacity:0.2;")
  ))

Click effect with onclick

exam_data$onclick <- sprintf("window.open(\"%s%s\")",
"https://www.moe.gov.sg/schoolfinder?journey=Primary%20school", as.character(exam_data$ID) )
p <- ggplot(data=exam_data, 
       aes(x = MATHS)) +
  geom_dotplot_interactive(              
    aes(onclick = onclick),
    stackgroups = TRUE,                  
    binwidth = 1,                        
    method = "histodot") +               
  scale_y_continuous(NULL,               
                     breaks = NULL)
girafe(                                  
  ggobj = p,                             
  width_svg = 6,                         
  height_svg = 6*0.618)
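
It can help to look at the JavaScript strings that end up in the onclick column. The sketch below rebuilds them for two sample IDs taken from the glimpse() output above; the variable names ids and base_url are assumptions made for illustration.

```r
# Two sample student IDs, copied from the glimpse() output
ids <- c("Student321", "Student305")
base_url <- "https://www.moe.gov.sg/schoolfinder?journey=Primary%20school"

# sprintf() is vectorised over ids, giving one JavaScript call per student.
# The "%20" in base_url is safe because the URL is passed as an argument,
# not as part of the format string.
onclick <- sprintf("window.open(\"%s%s\")", base_url, ids)
onclick[1]
# "window.open(\"https://www.moe.gov.sg/schoolfinder?journey=Primary%20schoolStudent321\")"
```

Note that the ID is appended directly to the end of the URL, so each student gets a distinct string even though the address itself is only a placeholder destination.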

Interactive Data Visualisation with R - plotly methods

Interactive scatter plot

plot_ly(data = exam_data, 
             x = ~MATHS, 
             y = ~ENGLISH)

Visual Variable

The color argument maps a qualitative variable (here, RACE) to the colour of the markers.

plot_ly(data = exam_data, 
        x = ~ENGLISH, 
        y = ~MATHS, 
        color = ~RACE)

Changing colour palette

The colors argument is used to change the default colour palette to a ColorBrewer colour palette.

plot_ly(data = exam_data, 
        x = ~ENGLISH, 
        y = ~MATHS, 
        color = ~RACE, 
        colors = "Set1")

Customising colour scheme

pal <- c("red", "purple", "blue", "green")
plot_ly(data = exam_data, 
        x = ~ENGLISH, 
        y = ~MATHS, 
        color = ~RACE, 
        colors = pal)

Customising tooltip

The text argument is used to change the default tooltip.

plot_ly(data = exam_data, 
        x = ~ENGLISH, 
        y = ~MATHS,
        text = ~paste("Student ID:", ID,
                      "<br>Class:", CLASS),
        color = ~RACE, 
        colors = "Set1")
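
The tooltip text is ordinary character data built by paste(), so it can be inspected outside plotly. A minimal sketch, using two sample rows copied from the glimpse() output above (the vectors id and class_name are assumptions made for illustration):

```r
# Sample values copied from the first two rows of exam_data
id <- c("Student321", "Student305")
class_name <- c("3I", "3I")

# paste() joins its arguments with a single space and is vectorised,
# so each row gets its own tooltip string; "<br>" is rendered by plotly
# as a line break
tooltip <- paste("Student ID:", id, "<br>Class:", class_name)
tooltip[1]
# "Student ID: Student321 <br>Class: 3I"
```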

Working with layout

The layout() function is used to set the chart title and the ranges of the x- and y-axes.

plot_ly(data = exam_data, 
        x = ~ENGLISH, 
        y = ~MATHS,
        text = ~paste("Student ID:", ID,     
                      "<br>Class:", CLASS),  
        color = ~RACE, 
        colors = "Set1") %>%
  layout(title = 'English Score versus Maths Score',
         xaxis = list(range = c(0, 100)),
         yaxis = list(range = c(0, 100)))

Interactive Data Visualisation with R - ggplotly methods

Interactive scatter plot

The only extra line needed in the code chunk is the call to ggplotly().

p <- ggplot(data=exam_data, 
            aes(x = MATHS,
                y = ENGLISH)) +
  geom_point(size = 1) +
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))
ggplotly(p)

Coordinated Multiple Views with plotly

Create two scatterplots and place them side by side using subplot() of the plotly package.

p1 <- ggplot(data=exam_data, 
              aes(x = MATHS,
                  y = ENGLISH)) +
  geom_point(size=1) +
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))
p2 <- ggplot(data=exam_data, 
            aes(x = MATHS,
                y = SCIENCE)) +
  geom_point(size=1) +
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))
subplot(ggplotly(p1),
        ggplotly(p2))

Coordinated Multiple Views with plotly

To create coordinated scatterplots, highlight_key() of the plotly package is used.

d <- highlight_key(exam_data)
p1 <- ggplot(data=d, 
            aes(x = MATHS,
                y = ENGLISH)) +
  geom_point(size=1) +
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))
p2 <- ggplot(data=d, 
            aes(x = MATHS,
                y = SCIENCE)) +
  geom_point(size=1) +
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))
subplot(ggplotly(p1),
        ggplotly(p2))

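Click on a data point in one of the scatterplots and see how the corresponding point in the other scatterplot is selected.

Thing to learn from the code chunk: highlight_key() creates an object of class crosstalk::SharedData, which is what lets the two plots share a selection.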

Interactive Data Table: DT package

DT::datatable(exam_data)

Linked brushing: crosstalk method

Link a scatterplot to an interactive data table, so that points selected by brushing in the plot are picked out in the table.

d <- highlight_key(exam_data)
p <- ggplot(d, aes(ENGLISH, MATHS)) + 
  geom_point(size=1) +
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))
gg <- highlight(ggplotly(p), 
                "plotly_selected")
crosstalk::bscols(gg, DT::datatable(d), widths = 5)

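Things to learn from the code chunk: highlight() is a plotly function that sets options for brushing (highlighting) linked views, and bscols() is a crosstalk helper that places HTML elements side by side.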

Citation

For attribution, please cite this work as

Dolit (2021, June 26). FinTech & Analytics: Programming Data Visualisation in R. Retrieved from https://adolit.github.io/posts/2021-06-26-programming-datavis-r/

BibTeX citation

@misc{dolit2021programming,
  author = {Dolit, Archie},
  title = {FinTech & Analytics: Programming Data Visualisation in R},
  url = {https://adolit.github.io/posts/2021-06-26-programming-datavis-r/},
  year = {2021}
}