Data Visualisation with `ggplot2`

Communicating your data

Author

Affiliation

Daniela Palleschi

Humboldt-Universität zu Berlin

Published

June 3, 2024

1 Learning objectives

visualise variable distributions
visualise summary statistics
save figures as rds or as figures

2 Load packages and data

# load packages
pacman::p_load(tidyverse, here)

# load data
df_lifetime <- readr::read_csv(here::here("data/tidy_data_lifetime_pilot.csv"), 
                               # for special characters
                               locale = readr::locale(encoding = "latin1") 
                               ) |>
  mutate_if(is.character,as.factor) |> # all character variables as factor
  filter(type == "critical", # only critical trials
         px != "px3") # this participant had lots of 0's for some reason

2.1 Summary of first-fixation times

this code is from the previous topic

# compute summary 
summary_ff <- df_lifetime |> 
  filter(region=="verb") |> 
  group_by(condition, lifetime, tense) %>%
  summarise(N = n(),
            mean.ff = mean(ff, na.rm = T),
            sd = sd(ff, na.rm = T)) %>%
  # compute standard error, confidence intervals, and lower/upper ci bounds
  mutate(se = sd / sqrt(N),
         ci = qt(1 - (0.05 / 2), N - 1) * se,
         lower.ci = mean.ff - qt(1 - (0.05 / 2), N - 1) * se,
         upper.ci = mean.ff + qt(1 - (0.05 / 2), N - 1) * se) |> 
  ungroup()

3 Plotting reading times

reading times are (usually) continuous variables
- as are e.g., reaction times
they are truncated at 0, meaning they cannot have negative values
- because of this, they tend to have a skewed distribution

4 Plots with `ggplot2`

ggplot2 is part of the tidyverse (like dplyr)
- uses a layered grammar of graphics
- i.e., we build layers

5 An example: histogram

iris |> 
  ggplot(aes(x=Sepal.Length)) +
  geom_histogram() +
  labs(title = "A histogram",
       x = "variable")

5.1 Start layering

df_lifetime |> ggplot(aes(ff)) # aes = 'aesthetic'

5.2 Add labels

df_lifetime |> ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixation times",
       x = "First fixation times (ms)")

5.3 Add

df_lifetime |> ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram()

Distribution of first fixation times at the verb region (raw milliseconds)

5.4 Add condition

df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "First fixataion times at the verb region",
       x = "First fixation times (ms)") +
  geom_histogram()

The colour here is STACKED!! i.e., not layered. Notice the distribution doesn’t change from all grey to coloured

5.5 Customisation

we can add arguments to our geoms
- e.g., transparency: alpha = takes a value between 0 to 1
we can use theme() to customise font sizes, legend placement, etc.
tehre are also popular preset themes, such as theme_bw() and theme_minimal()

theme_bw()
theme_minimal()

df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram(alpha=.5) +
  theme_bw()

df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram(alpha=.5) +
  theme_minimal()

6 Distributions

show the distribution of observations
- so we can see where the data are clustered
- and eyeball the shape of the distribution
we already saw the histogram, which shows the number of observations per variable value
density plots are another useful plot for visualising distributions

6.1 Density plots

below I just replaced geom_histogram() with geom_density()
- I also filtered the data to include only values of ff above 0
what is plotted along the y-axis? how does this differ from a histogram?

df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_density() +
  theme_minimal()

6.2 Grouped density plots

just like with histograms, we can look at the density plots of different subsets of the data with aes(fill = )
- like region

df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff, fill = region)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_minimal()

6.2.1 `facet_grid()`

there are a lot of overlapping density curves, let’s try to separate them with facet_grid(x~y)

df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff, fill = region)) + 
  facet_grid(.~region) +
  labs(title = "Density plot of first fixataion times by region",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_bw()

how would you describe the density plots of the different regions?

6.2.2 re-ordering factors

by default, factors will be ordered alphabetically
- but we don’t always want that
- here, verb-1 should be before verb

df_lifetime <- df_lifetime %>%
  mutate(region = factor(region, 
                         levels = c("verb-1","verb","verb+1","verb+2","verb+3","verb+4")))

summary(df_lifetime$region)

verb-1   verb verb+1 verb+2 verb+3 verb+4 
   559    559    559    559    559    182

6.2.2.1 Exercise

create a density plot with the fill colour set to condition, but:

subset the data to only include the verb region
you can decide if you want to use facets or to have the density curves overlayed
your plot should look something like A or B:

6.2.2.2 Extra exercise

Can you produce these plots?

6.3 Scatterplots

histograms and density plots plot a single variable along the x-axis
- in most other plots the dependent (measure) variable is plotted along the y-axis by convention
scatterplots plot the relationship between two variables

iris |> 
  ggplot(aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point() +
  labs(title = "A scatterplot",
       x = "variable X",
       y = "variable Y")

6.3.1 Scatterplots

the figure below plots total reading times (verb region) to the verb region (x-axis) and reaction times to the critical sentence (y-axis)
- what does each point represent?
- how would you describe the relationship between the two variables?

df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tt, y = rt)) +
  labs(title = "Scatter plot of total reading times (verb region)
and reaction times (critical sentence)",
       x = "Total reading time (ms)",
       y = "Reaction time (ms)") +
  geom_point(alpha = .2) +
  theme_bw()

6.3.1.1 Exercise

Generate a scatterplot of total reading times and reaction times, with:
- colour and shape set to condition
- tip: these both belong in aes()
What information does this plot suggest?

6.4 Bar plot

show the distribution of categorical factor levels
- i.e., the frequency of observations per level
be sure to read in accept as a factor!

Code

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept) )) +
  geom_bar() +
  theme_bw()

Code

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept), fill = condition)) +
  labs(title = "Binary responses",
       x = "Naturalness response",
       fill = "Condition") +
  geom_bar() +
  theme_bw()

6.5 Grouped bar plots

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept), fill = condition)) +
  labs(title = "Binary responses",
       x = "Naturalness response",
       fill = "Condition") +
  geom_bar(position = "dodge") +
  theme_bw()

6.5.1 Exercise

Generate a grouped bar plot (i.e., dodge) with:
- a facet grid for tense
- plots lifetime on the x-axis
- and fills the bars based on accept
- change the labels accordingly
- customise as you like

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Binary responses",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "dodge") +
  theme_bw()

6.6 Grouped bar plots

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Grouped and faceted barplot",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "dodge") +
  theme_bw()

6.7 Stacked bar plots

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Stacked and faceted barplot",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "stack") +
  theme_bw()

6.7.1 Exercise

Choose the barplot you like best for binary data
Reproduce that barplot, but with reg_in at the verb1 region

6.7.2 Extra exercise

Create another bar plot, but for reg_out for all sentence regions
Use facet_grid()

to have facets by region (columns) and by tense (in 2 rows)

7 Summary statistics

measures of location: mean, median, mode
measures of spread: (interquartile) range, standard deviation

7.1 Boxplots

boxplots provide information about the distribution of a continuous variable
- but includes information like median (dark line) and quartiles (box and whiskers)
- and outliers (dots)
like scatterplots, require x and y variables
- but one of them needs to be categorical

7.2

iris |> 
  ggplot(aes(x = Species, y = Sepal.Length)) +
  labs(title = "A scatterplot",
       x = "Categorical variable",
       y = "Continuous variable") +
  geom_boxplot()

A scatterplot. Median (50th percentile): thick black lines; interquartile range (IQR; 25th and 75th percentile): box limits; minimum (0th percentile) and maximum (100th percentile) excluding outliers: : whiskers; outliers: points

7.3 Boxplot explained

Image source: Winter (2019) (all rights reserved)

7.4 Boxplots

let’s change our scatterplot to a boxplot

df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tense, y = ff)) +
  labs(title = "First-fixation duration (verb region)",
       x = "Tense",
       y = "First-fixation duration (ms)") +
  geom_boxplot(alpha = .2) +
  theme_bw()

7.4.1 Grouped boxplots

df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tense, y = ff, colour = lifetime)) +
  labs(title = "First-fixation duration (verb region)",
       x = "Tense",
       y = "First-fixation duration (ms)") +
  geom_boxplot(alpha = .2) +
  theme_bw()

7.4.1.1 Exercise

Create a group boxplot (x = tense, fill = lifetime) for

first-pass reading time (verb region)
regression path duration (verb region)
total reading time (verb region)
reaction times (use the distinct() verb to have a single observation per participant and per trial)

7.5 Interaction plots

common for factorial designs, i.e., comparing categorical predictors
there are 2 ways of producing them:
- with your data frame and stat_summary()
- or with a summary table and ggplot geoms geom_point(), geom_errorbar(), and geom_line()
we’ll need our summary table to plot an interaction plot

condition	lifetime	tense	N	mean.ff	sd	se	ci	lower.ci	upper.ci
deadPP	dead	PP	140	198.9	57.9	4.9	9.7	189.2	208.6
deadSF	dead	SF	139	194.6	67.9	5.8	11.4	183.2	205.9
livingPP	living	PP	140	194.2	77.3	6.5	12.9	181.3	207.1
livingSF	living	SF	140	186.0	57.6	4.9	9.6	176.4	195.6

Code

library(patchwork)

df_lifetime |> 
  filter(region == "verb") |> 
  ggplot(aes(x = lifetime, y = ff, 
                     shape = tense,
                     group = tense,
                     color = tense)) +
  labs(title="Interaction plot (`stat_summary()`)",
       x = "Lifetime",
       y = "First fix (ms)",
       shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") +
  stat_summary(fun = "mean", geom = "point", size = 3, position = position_dodge(0.2)) +
  stat_summary(fun = "mean", geom = "line", position = position_dodge(0.2), aes(linetype=tense)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2
               , position = position_dodge(0.2)) +
  theme_bw() +
  
summary_ff |> 
  ggplot(aes(x = lifetime, y = mean.ff, 
                     shape = tense,
                     group = tense,
                     color = tense)) +
  labs(title="Interaction plot (geoms)",
       x = "Lifetime",
       y = "First fix (ms)",
       shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") +
  geom_point(size = 3,
                position = position_dodge(0.2)) +
  geom_line(aes(linetype=tense), position = position_dodge(0.2)) +
  geom_errorbar(aes(ymin = mean.ff - ci,
                    ymax = mean.ff + ci),
                width = .2,
                position = position_dodge(0.2)) +
  theme_bw() +
  plot_annotation(tag_levels = "A") +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Figure 1: Identical interaction plots produced two ways: feeding a dataset into `ggplot()` and using `stat_summary()` (A) or feeding a summary table into `ggplot()` and using geoms (B)

8 Question: Binomial data

binomial data are those with 2 categories, for example
- present, absent
- yes, no
in our dataset, each trial ended with a binary naturalness judgement task
- how might we plot such data?

9 Saving our plots

we can save our plots two ways:
1. as image files (e.g., JPEG, PNG, SVG, etc.): when writing in Word, LaTeX, etc.
2. as a single R object (Rds: R data structure): when writing in Rmarkdown/Quarto

9.1 `ggsave()`

the ggsave() function is useful for saving ggplot objects
- we first have to save one of our figures as an object
- I usually save ggplot objects with the prefix fig_ (short for figure)
make sure you also have a useful place to store these figures
- e.g., a folder called figures

Code

fig_lifetime_ff <-
summary_ff |> 
  ggplot(aes(x = lifetime, y = mean.ff, 
                     shape = tense,
                     group = tense,
                     color = tense)) +
  labs(title="Mean first-fixation times (verb region) with 95% CIs",
       x = "Lifetime",
       y = "First fix (ms)",
       shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") +
  geom_point(size = 3,
                position = position_dodge(0.2)) +
  geom_line(aes(linetype=tense), position = position_dodge(0.2)) +
  geom_errorbar(aes(ymin = mean.ff - ci,
                    ymax = mean.ff + ci),
                width = .2,
                position = position_dodge(0.2)) +
  theme_bw()

ggsave(fig_lifetime_ff, filename = here("figures", "fig_lifetime_ff.png"))

ggsave() has lots of arguments to control width, height, resolution, etc.
- to see more, run ?ggsave in the Console
you can also save as JPG/JPEG, SVG, even PDF by just changing the filename extension

9.2 `saveRDS()`

we can also save the figure as R code
- which means we can control the width, height, resolution, etc. later on when we load it in
- useful if you’ll be writing up your results in R markdown or Quarto

saveRDS(fig_lifetime_ff, file = here("figures", "fig_lifetime_ff.rds"))

9.3 `readRDS()`

you can’t click on the file to view the figure because it’s R code
- you’d need to load the data into R again

fig_lifetime_ff <- readRDS(here("figures", "fig_lifetime_ff.rds"))

9.4

Naming files and saving code

You’ll notice I saved the PNG and RDS files using the same name that the I used for the figure in my script. This is an important point: I want to be able to traceback my figures from the code so I can easily track them. It also helps encourage informative object and file names.

Of course, saving the code used to save the files in our scripts is also useful because we can easily adjust the saved files (e.g., change figure width or height)

References

Nordmann, E., & DeBruine, L. (2022). Applied data skills. Zenodo. https://doi.org/10.5281/zenodo.6365078

Nordmann, E., McAleer, P., Toivo, W., Paterson, H., & DeBruine, L. M. (2022). Data Visualization Using R for Researchers Who Do Not Use R. Advances in Methods and Practices in Psychological Science, 5(2), 251524592210746. https://doi.org/10.1177/25152459221074654

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science (2nd ed.).

Winter, B. (2019). Statistics for Linguists: An Introduction Using R. In Statistics for Linguists: An Introduction Using R. Routledge. https://doi.org/10.4324/9781315165547

--- title: "Data Visualisation with `ggplot2`" subtitle: "Communicating your data" author: "Daniela Palleschi" institute: Humboldt-Universität zu Berlin footer: "Data Visualisation" lang: en date: 2024-06-03 format: revealjs: output-file: slides-data_viz.html theme: [dark] width: 1600 height: 900 progress: true # smaller: true scrollable: true slide-number: c/t code-link: true code-overflow: wrap code-tools: true # logo: logos/hu_logo.png # css: logo.css incremental: true # number-sections: true toc-title: 'Topics' navigation-mode: linear controls-layout: bottom-right fig-cap-location: top font-size: 0.6em slide-level: 4 title-slide-attributes: data-background-image: logos/logos.tif data-background-size: 15% data-background-position: 50% 92% html: self-contained: true output-file: sheet-data_viz.html number-sections: true toc: true code-overflow: wrap code-tools: true pdf: output-file: pdf-data_viz.pdf toc: true number-sections: false colorlinks: true code-overflow: wrap editor_options: chunk_output_type: console bibliography: ../../references.bib csl: ../../apa.csl --- # Learning objectives - visualise variable distributions - visualise summary statistics - save figures as `rds` or as figures # Load packages and data ```{r} #| message: false #| warning: false # load packages pacman::p_load(tidyverse, here) ``` ```{r} #| message: false #| warning: false # load data df_lifetime <- readr::read_csv(here::here("data/tidy_data_lifetime_pilot.csv"), # for special characters locale = readr::locale(encoding = "latin1") ) |> mutate_if(is.character,as.factor) |> # all character variables as factor filter(type == "critical", # only critical trials px != "px3") # this participant had lots of 0's for some reason ``` ## Summary of first-fixation times - this code is from the previous topic ```{r} # compute summary summary_ff <- df_lifetime |> filter(region=="verb") |> group_by(condition, lifetime, tense) %>% summarise(N = n(), mean.ff = mean(ff, na.rm = T), sd = sd(ff, na.rm = T)) %>% # compute standard error, confidence intervals, and lower/upper ci bounds mutate(se = sd / sqrt(N), ci = qt(1 - (0.05 / 2), N - 1) * se, lower.ci = mean.ff - qt(1 - (0.05 / 2), N - 1) * se, upper.ci = mean.ff + qt(1 - (0.05 / 2), N - 1) * se) |> ungroup() ``` # Plotting reading times - reading times are (usually) continuous variables + as are e.g., reaction times - they are truncated at 0, meaning they cannot have negative values + because of this, they tend to have a *skewed distribution* # Plots with `ggplot2` - `ggplot2` is part of the tidyverse (like `dplyr`) + uses a *layered grammar of graphics* + i.e., we build layers # An example: histogram ```{r} #| output-location: column-fragment iris |> ggplot(aes(x=Sepal.Length)) + geom_histogram() + labs(title = "A histogram", x = "variable") ``` ## Start layering ```{r} #| output-location: column df_lifetime |> ggplot(aes(ff)) # aes = 'aesthetic' ``` ## Add labels ```{r} #| output-location: column df_lifetime |> ggplot(aes(ff)) + labs(title = "Histogram of first fixation times", x = "First fixation times (ms)") ``` ## Add ```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)"} #| output-location: column df_lifetime |> ggplot(aes(ff)) + labs(title = "Histogram of first fixataion times", x = "First fixation times (ms)") + geom_histogram() ``` ## Add condition ```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)"} #| output-location: column df_lifetime |> ggplot(aes(ff, fill = condition)) + labs(title = "First fixataion times at the verb region", x = "First fixation times (ms)") + geom_histogram() ``` ::: {.notes} The colour here is STACKED!! i.e., not layered. Notice the distribution doesn't change from all grey to coloured ::: ## Customisation - we can add arguments to our geoms + e.g., transparency: `alpha = ` takes a value between 0 to 1 - we can use `theme()` to customise font sizes, legend placement, etc. - tehre are also popular preset themes, such as `theme_bw()` and `theme_minimal()` ::: panel-tabset ### `theme_bw()` ```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."} #| output-location: column #| code-line-numbers: "|4|5|" df_lifetime |> ggplot(aes(ff, fill = condition)) + labs(title = "Histogram of first fixataion times", x = "First fixation times (ms)") + geom_histogram(alpha=.5) + theme_bw() ``` ### `theme_minimal()` ```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."} #| output-location: column #| code-line-numbers: "5" df_lifetime |> ggplot(aes(ff, fill = condition)) + labs(title = "Histogram of first fixataion times", x = "First fixation times (ms)") + geom_histogram(alpha=.5) + theme_minimal() ``` ::: # Distributions {.smaller} - show the distribution of observations + so we can see where the data are clustered + and eyeball the shape of the distribution - we already saw the histogram, which shows the number of observations per variable value - **density plots** are another useful plot for visualising distributions ```{r} #| echo: false iris |> ggplot(aes(x=Sepal.Length)) + geom_density() + labs(title = "A density plot", x = "variable") ``` ## Density plots - below I just replaced `geom_histogram()` with `geom_density()` + I also filtered the data to include only values of `ff` above 0 - what is plotted along the y-axis? how does this differ from a histogram? ```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."} #| code-line-numbers: "|2|6|" #| output-location: column-fragment df_lifetime |> filter(ff > 0) |> ggplot(aes(ff)) + labs(title = "Histogram of first fixataion times", x = "First fixation times (ms)") + geom_density() + theme_minimal() ``` ## Grouped density plots - just like with histograms, we can look at the density plots of different subsets of the data with `aes(fill = )` + like region ```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."} #| code-line-numbers: "|2|6|" #| output-location: column-fragment df_lifetime |> filter(ff > 0) |> ggplot(aes(ff, fill = region)) + labs(title = "Histogram of first fixataion times", x = "First fixation times (ms)") + geom_density(alpha=.5) + theme_minimal() ``` ### `facet_grid()` - there are a lot of overlapping density curves, let's try to separate them with `facet_grid(x~y)` ```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."} #| code-line-numbers: "|4|" #| output-location: column-fragment df_lifetime |> filter(ff > 0) |> ggplot(aes(ff, fill = region)) + facet_grid(.~region) + labs(title = "Density plot of first fixataion times by region", x = "First fixation times (ms)") + geom_density(alpha=.5) + theme_bw() ``` - how would you describe the density plots of the different regions? ### re-ordering factors - by default, factors will be ordered alphabetically + but we don't always want that + here, `verb-1` should be before `verb` ```{r} #| output-location: column-fragment df_lifetime <- df_lifetime %>% mutate(region = factor(region, levels = c("verb-1","verb","verb+1","verb+2","verb+3","verb+4"))) summary(df_lifetime$region) ``` #### Exercise 1. create a density plot with the fill colour set to `condition`, but: - subset the data to only include the verb region - you can decide if you want to use facets or to have the density curves overlayed - your plot should look something like A or B: ```{r, echo = F} #| output-location: fragment # with facet p_dens1 <- df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(ff, fill = condition)) + facet_grid(. ~ condition) + labs(title = "Density plot of first fixataion times (verb region)", x = "First fixation times (ms)", y = "Density", fill = "Condition") + geom_density(alpha = .5) + theme_bw() + theme(title = element_text(size=10), axis.text.x = element_text(angle = 45)) # without facet p_dens2 <- df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(ff, fill = condition)) + # facet_grid(.~condition) + labs(title = "Density plot of first fixataion times (verb region)", x = "First fixation times (ms)", y = "Density", fill = "Condition") + geom_density(alpha = .5) + theme_bw() + theme(plot.title = element_text(size=10)) # create legend dens_legend <- cowplot::get_legend(p_dens1 + theme(legend.position = "bottom")) dens_legend <- ggpubr::as_ggplot(dens_legend) # combine ggpubr::ggarrange( ggpubr::ggarrange(p_dens1 + theme(legend.position = "none"), p_dens2 +theme(legend.position = "none"), labels = c("A","B"), widths = c(.6,.4)), dens_legend, nrow = 2, heights = c(.9,.1)) ``` #### Extra exercise 2. Can you produce these plots? ```{r, echo = F} p_tense_life <- df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(ff, fill = lifetime)) + facet_grid(.~tense) + labs(title = "Density plot of first fixataion times (verb region)", x = "First fixation times (ms)") + geom_density(alpha=.5) + theme_bw() p_tense <- df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(ff, fill = tense)) + # facet_grid(.~tense) + labs(title = "Density plot of first fixataion times (verb region)", x = "First fixation times (ms)") + geom_density(alpha=.5) + theme_bw() + theme(title = element_text(size=8)) p_lifetime <- df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(ff, fill = lifetime)) + # facet_grid(.~tense) + labs(title = "Density plot of first fixataion times (verb region)", x = "First fixation times (ms)") + geom_density(alpha=.5) + theme_bw() + theme(title = element_text(size=8)) ggpubr::ggarrange( ggpubr::ggarrange(p_tense, p_lifetime, labels = c("A","B")), ggpubr::ggarrange(NULL, p_tense_life,NULL, nrow = 1, widths = c(.1,.8,.1), labels = c("", "C", "")), nrow = 2 ) ``` ## Scatterplots - histograms and density plots plot a **single variable** along the x-axis + in most other plots the dependent (measure) variable is plotted along the y-axis by convention - scatterplots plot the relationship between **two variables** ```{r} #| output-location: column-fragment iris |> ggplot(aes(x=Sepal.Length, y=Sepal.Width)) + geom_point() + labs(title = "A scatterplot", x = "variable X", y = "variable Y") ``` ### Scatterplots - the figure below plots total reading times (verb region) to the verb region (x-axis) and reaction times to the critical sentence (y-axis) + what does each point represent? + how would you describe the relationship between the two variables? ```{r} #| code-line-numbers: "4,7,8,9" #| output-location: column-fragment df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(x = tt, y = rt)) + labs(title = "Scatter plot of total reading times (verb region) and reaction times (critical sentence)", x = "Total reading time (ms)", y = "Reaction time (ms)") + geom_point(alpha = .2) + theme_bw() ``` #### Exercise 1. Generate a scatterplot of total reading times and reaction times, with: - colour and shape set to condition + tip: these both belong in `aes()` 2. What information does this plot suggest? ```{r, eval = F, echo = F} #| code-line-numbers: "4,7,8,9" #| output-location: fragment df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(x = tt, y = rt, colour = condition, shape = condition)) + labs(title = "Scatter plot of total reading times (verb region) and reaction times (critical sentence)", x = "Total reading time (ms)", y = "Reaction time (ms)", colour = "Condition", shape = "Condition") + geom_point(alpha = .5, size = 2) + theme_bw() ``` ## Bar plot :::: columns ::: {.column width="100%"} - show the distribution of categorical factor levels + i.e., the frequency of observations per level - be sure to read in `accept` as a factor! ::: ::: {.column width="50%"} ```{r} #| code-fold: true df_lifetime |> distinct(px,trial,.keep_all=T) |> ggplot(aes(x = as.factor(accept) )) + geom_bar() + theme_bw() ``` ::: ::: {.column width="50%"} ```{r} #| code-fold: true df_lifetime |> distinct(px,trial,.keep_all=T) |> ggplot(aes(x = as.factor(accept), fill = condition)) + labs(title = "Binary responses", x = "Naturalness response", fill = "Condition") + geom_bar() + theme_bw() ``` ::: :::: ## Grouped bar plots ```{r} df_lifetime |> distinct(px,trial,.keep_all=T) |> ggplot(aes(x = as.factor(accept), fill = condition)) + labs(title = "Binary responses", x = "Naturalness response", fill = "Condition") + geom_bar(position = "dodge") + theme_bw() ``` ### Exercise 1. Generate a grouped bar plot (i.e., `dodge`) with: - a facet grid for `tense` - plots `lifetime` on the x-axis - and fills the bars based on `accept` - change the labels accordingly - customise as you like ```{r, eval = F} df_lifetime |> distinct(px,trial,.keep_all=T) |> ggplot(aes(x = lifetime, fill = as.factor(accept))) + facet_grid(.~tense) + labs(title = "Binary responses", x = "Lifetime", fill = "Response") + geom_bar(position = "dodge") + theme_bw() ``` ## Grouped bar plots ```{r, eval = T} df_lifetime |> distinct(px,trial,.keep_all=T) |> ggplot(aes(x = lifetime, fill = as.factor(accept))) + facet_grid(.~tense) + labs(title = "Grouped and faceted barplot", x = "Lifetime", fill = "Response") + geom_bar(position = "dodge") + theme_bw() ``` ## Stacked bar plots ```{r, eval = T} #| code-line-numbers: "8" df_lifetime |> distinct(px,trial,.keep_all=T) |> ggplot(aes(x = lifetime, fill = as.factor(accept))) + facet_grid(.~tense) + labs(title = "Stacked and faceted barplot", x = "Lifetime", fill = "Response") + geom_bar(position = "stack") + theme_bw() ``` ### Exercise 1. Choose the barplot you like best for binary data 2. Reproduce that barplot, but with `reg_in` at the `verb1` region ```{r} #| echo: false #| eval: true #| code-line-numbers: "8" df_lifetime |> filter(region=="verb") |> ggplot(aes(x = lifetime, fill = as.factor(reg_in))) + facet_grid(.~tense) + labs(title = "Regressions in (yes/no, verb region)", x = "Lifetime", fill = "Regressions in (yes/no)") + geom_bar(position = "stack") + theme_bw() ``` ### Extra exercise 1. Create another bar plot, but for `reg_out` for all sentence regions 2. Use `facet_grid()` + to have facets by region (columns) and by tense (in 2 rows) ```{r} #| echo: false #| eval: true #| code-line-numbers: "8" df_lifetime |> # filter(region=="verb") |> ggplot(aes(x = lifetime, fill = as.factor(reg_out))) + facet_grid(tense~region) + labs(title = "Regressions in (yes/no, verb region)", x = "Lifetime", fill = "Regressions out (yes/no)") + geom_bar(position = "stack") + theme_bw() ``` # Summary statistics - measures of location: mean, median, mode - measures of spread: (interquartile) range, standard deviation ::: {.content-visible when-format="revealjs"} ::: notes IQR: middle 50% of data ::: ::: ## Boxplots - boxplots provide information about the distribution of a *continuous* variable + but includes information like *median* (dark line) and *quartiles* (box and whiskers) + and *outliers* (dots) - like scatterplots, require x and y variables + but one of them needs to be **categorical** ## ```{r, fig.cap="A scatterplot. Median (50th percentile): thick black lines; interquartile range (IQR; 25th and 75th percentile): box limits; minimum (0th percentile) and maximum (100th percentile) excluding outliers: : whiskers; outliers: points"} #| output-location: column-fragment iris |> ggplot(aes(x = Species, y = Sepal.Length)) + labs(title = "A scatterplot", x = "Categorical variable", y = "Continuous variable") + geom_boxplot() ``` ## Boxplot explained ::: {.content-visible when-format="revealjs"} ```{r echo = F, fig.align = "center"} #| fig-cap: "Image source: @winter_statistics_2019 (all rights reserved)" #| fig-cap-location: bottom # invert colours for dark mode in slides library(magick) y <- magick::image_read(here::here("media/Winter_2019_boxplot.png")) magick::image_negate(y) ``` ::: ::: {.content-visible unless-format="revealjs"} ```{r echo = F, fig.align = "center"} #| fig-cap: "Image source: @winter_statistics_2019 (all rights reserved)" #| fig-cap-location: bottom magick::image_read(here::here("media/Winter_2019_boxplot.png")) ``` ::: ::: {.content-visible when-format="revealjs"} ::: notes whiskers: 1.5*IQR from Q1 (lower whisker) or Q3 (upper whisker) outliers: outside that range ::: ::: ## Boxplots - let's change our scatterplot to a boxplot ```{r} #| code-line-numbers: "4," #| output-location: column-fragment df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(x = tense, y = ff)) + labs(title = "First-fixation duration (verb region)", x = "Tense", y = "First-fixation duration (ms)") + geom_boxplot(alpha = .2) + theme_bw() ``` ### Grouped boxplots ```{r} #| code-line-numbers: "4," #| output-location: column-fragment df_lifetime |> filter(ff > 0, region == "verb") |> ggplot(aes(x = tense, y = ff, colour = lifetime)) + labs(title = "First-fixation duration (verb region)", x = "Tense", y = "First-fixation duration (ms)") + geom_boxplot(alpha = .2) + theme_bw() ``` #### Exercise 1. Create a group boxplot (x = tense, fill = lifetime) for + first-pass reading time (verb region) + regression path duration (verb region) + total reading time (verb region) + reaction times (use the `distinct()` verb to have a single observation per participant and per trial) ## Interaction plots {.smaller} ::: {.content-visible when-format="revealjs"} :::: {.columns} ::: {.column width="50%"} - common for **factorial** designs, i.e., comparing *categorical* predictors - there are 2 ways of producing them: + with your data frame and `stat_summary()` + or with a summary table and ggplot geoms `geom_point()`, `geom_errorbar()`, and `geom_line()` - we'll need our summary table to plot an interaction plot ::: ::: {.column width="50%"} ```{r} #| echo: false knitr::kable(summary_ff, digits = 1) |> kableExtra::kable_styling(font_size = 20) ``` ```{r} #| code-fold: true #| label: fig-interaction-revealjs #| fig-cap: "Identical interaction plots produced two ways: feeding a dataset into `ggplot()` and using `stat_summary()` (A) or feeding a summary table into `ggplot()` and using geoms (B)" library(patchwork) df_lifetime |> filter(region == "verb") |> ggplot(aes(x = lifetime, y = ff, shape = tense, group = tense, color = tense)) + labs(title="Interaction plot (`stat_summary()`)", x = "Lifetime", y = "First fix (ms)", shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") + stat_summary(fun = "mean", geom = "point", size = 3, position = position_dodge(0.2)) + stat_summary(fun = "mean", geom = "line", position = position_dodge(0.2), aes(linetype=tense)) + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2 , position = position_dodge(0.2)) + theme_bw() + summary_ff |> ggplot(aes(x = lifetime, y = mean.ff, shape = tense, group = tense, color = tense)) + labs(title="Interaction plot (geoms)", x = "Lifetime", y = "First fix (ms)", shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") + geom_point(size = 3, position = position_dodge(0.2)) + geom_line(aes(linetype=tense), position = position_dodge(0.2)) + geom_errorbar(aes(ymin = mean.ff - ci, ymax = mean.ff + ci), width = .2, position = position_dodge(0.2)) + theme_bw() + plot_annotation(tag_levels = "A") + plot_layout(guides = "collect") & theme(legend.position = "bottom") ``` ::: :::: ::: ::: {.content-hidden when-format="revealjs"} - common for **factorial** designs, i.e., comparing *categorical* predictors - there are 2 ways of producing them: + with your data frame and `stat_summary()` + or with a summary table and ggplot geoms `geom_point()`, `geom_errorbar()`, and `geom_line()` - we'll need our summary table to plot an interaction plot ::: {.content-hidden when-format="pdf"} ```{r} #| echo: false knitr::kable(summary_ff, digits = 1) |> kableExtra::kable_styling(font_size = 20) ``` ::: ::: {.content-visible when-format="pdf"} ```{r} #| echo: false # smaller font for PDF output knitr::kable(summary_ff, digits = 1) ``` ::: ```{r} #| code-fold: true #| label: fig-interaction #| fig-cap: "Identical interaction plots produced two ways: feeding a dataset into `ggplot()` and using `stat_summary()` (A) or feeding a summary table into `ggplot()` and using geoms (B)" library(patchwork) df_lifetime |> filter(region == "verb") |> ggplot(aes(x = lifetime, y = ff, shape = tense, group = tense, color = tense)) + labs(title="Interaction plot (`stat_summary()`)", x = "Lifetime", y = "First fix (ms)", shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") + stat_summary(fun = "mean", geom = "point", size = 3, position = position_dodge(0.2)) + stat_summary(fun = "mean", geom = "line", position = position_dodge(0.2), aes(linetype=tense)) + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2 , position = position_dodge(0.2)) + theme_bw() + summary_ff |> ggplot(aes(x = lifetime, y = mean.ff, shape = tense, group = tense, color = tense)) + labs(title="Interaction plot (geoms)", x = "Lifetime", y = "First fix (ms)", shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") + geom_point(size = 3, position = position_dodge(0.2)) + geom_line(aes(linetype=tense), position = position_dodge(0.2)) + geom_errorbar(aes(ymin = mean.ff - ci, ymax = mean.ff + ci), width = .2, position = position_dodge(0.2)) + theme_bw() + plot_annotation(tag_levels = "A") + plot_layout(guides = "collect") & theme(legend.position = "bottom") ``` ::: # Question: Binomial data - binomial data are those with 2 categories, for example + present, absent + yes, no - in our dataset, each trial ended with a binary naturalness judgement task + how might we plot such data? # Saving our plots - we can save our plots two ways: 1. as image files (e.g., JPEG, PNG, SVG, etc.): when writing in Word, LaTeX, etc. 2. as a single R object (Rds: R data structure): when writing in Rmarkdown/Quarto ## `ggsave()` - the `ggsave()` function is useful for saving ggplot objects + we first have to save one of our figures as an object + I usually save ggplot objects with the prefix `fig_` (short for figure) - make sure you also have a useful place to store these figures + e.g., a folder called `figures` ```{r} #| code-fold: true fig_lifetime_ff <- summary_ff |> ggplot(aes(x = lifetime, y = mean.ff, shape = tense, group = tense, color = tense)) + labs(title="Mean first-fixation times (verb region) with 95% CIs", x = "Lifetime", y = "First fix (ms)", shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") + geom_point(size = 3, position = position_dodge(0.2)) + geom_line(aes(linetype=tense), position = position_dodge(0.2)) + geom_errorbar(aes(ymin = mean.ff - ci, ymax = mean.ff + ci), width = .2, position = position_dodge(0.2)) + theme_bw() ``` ```{r} ggsave(fig_lifetime_ff, filename = here("figures", "fig_lifetime_ff.png")) ``` - `ggsave()` has lots of arguments to control width, height, resolution, etc. + to see more, run `?ggsave` in the Console - you can also save as JPG/JPEG, SVG, even PDF by just changing the filename extension ## `saveRDS()` - we can also save the figure as R code + which means we can control the width, height, resolution, etc. later on when we load it in + useful if you'll be writing up your results in R markdown or Quarto ```{r} saveRDS(fig_lifetime_ff, file = here("figures", "fig_lifetime_ff.rds")) ``` ## `readRDS()` - you can't click on the file to view the figure because it's R code + you'd need to load the data into R again ```{r} fig_lifetime_ff <- readRDS(here("figures", "fig_lifetime_ff.rds")) ``` ## ::: {.callout-tip} ## Naming files and saving code You'll notice I saved the PNG and RDS files using the same name that the I used for the figure in my script. This is an important point: I want to be able to traceback my figures from the code so I can easily track them. It also helps encourage informative object and file names. Of course, saving the code used to save the files in our scripts is also useful because we can easily adjust the saved files (e.g., change figure width or height) ::: # References {.unlisted .unnumbered visibility="uncounted"} --- nocite: | @nordmann_data_2022, @nordmann_applied_2022, @wickham_r_2023 --- ::: {#refs custom-style="Bibliography"} :::

1 Learning objectives

2 Load packages and data

2.1 Summary of first-fixation times

3 Plotting reading times

4 Plots with ggplot2

5 An example: histogram

5.1 Start layering

5.2 Add labels

5.3 Add

5.4 Add condition

5.5 Customisation

6 Distributions

6.1 Density plots

6.2 Grouped density plots

6.2.1 facet_grid()

6.2.2 re-ordering factors

6.2.2.1 Exercise

6.2.2.2 Extra exercise

6.3 Scatterplots

6.3.1 Scatterplots

6.3.1.1 Exercise

6.4 Bar plot

6.5 Grouped bar plots

6.5.1 Exercise

6.6 Grouped bar plots

6.7 Stacked bar plots

6.7.1 Exercise

6.7.2 Extra exercise

7 Summary statistics

7.1 Boxplots

7.2

7.3 Boxplot explained

7.4 Boxplots

7.4.1 Grouped boxplots

7.4.1.1 Exercise

7.5 Interaction plots

8 Question: Binomial data

9 Saving our plots

9.1 ggsave()

9.2 saveRDS()

9.3 readRDS()

9.4

References

4 Plots with `ggplot2`

6.2.1 `facet_grid()`

9.1 `ggsave()`

9.2 `saveRDS()`

9.3 `readRDS()`