Reproducible analysis reports with eye-tracking reading time data
  • D. Palleschi
  • Download PDF
  • Download ePub
  1. Exploratory Data Analysis
  2. 5  Data Visualisation with `ggplot2`
  • Course Intro
  • Background
    • 1  Reproducible Analayses
    • 2  Eye-tracking during reading
    • 3  Working with eye-tracking reading data in R
  • Exploratory Data Analysis
    • 4  Data wrangling
    • 5  Data Visualisation with ggplot2
  • Modelling

Table of contents

  • 6 Data communication
    • 6.1 Load packages and data
    • 6.2 Tables
      • 6.2.1 Exercise
  • 7 Plotting reading times
  • 8 Plots with ggplot2
  • 9 An example: histogram
    • 9.1 Start layering
    • 9.2 Add labels
    • 9.3 Add
    • 9.4 Add condition
    • 9.5 Customisation
  • 10 Distributions
    • 10.1 Density plots
    • 10.2 Grouped density plots
      • 10.2.1 facet_grid()
      • 10.2.2 re-ordering factors
    • 10.3 Scatterplots
      • 10.3.1 Scatterplots
  • 11 Summary statistics
    • 11.1 Boxplots
    • 11.2 Boxplot explained
    • 11.3 Boxplots
      • 11.3.1 Grouped boxplots
    • 11.4 Violin plots
    • 11.5 Violin boxplots
  • 12 Summary statistics
    • 12.1 Interaction plots
  • 13 Binomial data
    • 13.1 Bar plot
    • 13.2 Grouped bar plots
      • 13.2.1 Exercise
    • 13.3 Grouped bar plots
    • 13.4 Stacked bar plots
      • 13.4.1 Exercise
      • 13.4.2 Extra exercise
  • 14 Resources

5  Data Visualisation with ggplot2

  • Show All Code
  • Hide All Code

  • View Source

Communicating your data

Author
Affiliation

Daniela Palleschi

Humboldt-Universität zu Berlin

Published

April 13, 2023

# Create references.json file based on the citations in this script
# make sure you have 'bibliography: references.json' in the YAML
rbbt::bbt_update_bib("data_viz.qmd")

6 Data communication

6.1 Load packages and data

# load tidyverse
library(tidyverse)

# load data
df_lifetime <- readr::read_csv(here::here("data/tidy_data_lifetime_pilot.csv"), 
                               # for special characters
                               locale = readr::locale(encoding = "latin1") 
                               ) |>
  mutate_if(is.character,as.factor) |> # all character variables as factor
  filter(type == "critical", # only critical trials
         px != "px3") # this participant had lots of 0's for some reason

6.2 Tables

  • we can create summaries of our data
Code
# compute summary 
summary_ff <- df_lifetime |> 
  filter(region=="verb") |> 
  group_by(condition,lifetime,tense) %>%
  summarise(N = n(),
            mean.ff = mean(ff, na.rm = T),
            sd = sd(ff, na.rm = T)) %>%
  # compute standard error, confidence intervals, and lower/upper ci bounds
  mutate(se = sd / sqrt(N),
         ci = qt(1 - (0.05 / 2), N - 1) * se,
         lower.ci = mean.ff - qt(1 - (0.05 / 2), N - 1) * se,
         upper.ci = mean.ff + qt(1 - (0.05 / 2), N - 1) * se)
  • and print the output with the kable() function from the knitr package
    • for extra customisation you can also use the kableExtra package (e.g., with the kable_styling() function)
# install.packages("knitr") # if not yet installed
knitr::kable(summary_ff, digits=1,
             caption = "Table with summmary statistics for first-fixation duration at the verb region")
Table with summmary statistics for first-fixation duration at the verb region
condition lifetime tense N mean.ff sd se ci lower.ci upper.ci
deadPP dead PP 140 198.9 57.9 4.9 9.7 189.2 208.6
deadSF dead SF 139 194.6 67.9 5.8 11.4 183.2 205.9
livingPP living PP 140 194.2 77.3 6.5 12.9 181.3 207.1
livingSF living SF 140 186.0 57.6 4.9 9.6 176.4 195.6

6.2.1 Exercise

  1. install the knitr package (install.packages("knitr"))
  2. create an object with some summary statistics of the variable rt
  • call it summary_rt
  1. use kable() from knitr to print a table
knitr::kable(summary_rt, digits=1,
             caption = "Summary of reaction times (ms) per condition")
Summary of reaction times (ms) per condition
lifetime tense condition N mean.rt sd
dead PP deadPP 140 3530.5 2915.8
dead SF deadSF 139 1747.0 1153.4
living PP livingPP 140 2257.7 1346.3
living SF livingSF 140 2578.1 1958.7

7 Plotting reading times

  • reading times are (usually) continuous variables
    • as are e.g., reaction times
  • they are truncated at 0, meaning they cannot have negative values
    • because of this, they tend to have a skewed distribution

8 Plots with ggplot2

  • ggplot2 is part of the tidyverse (like dplyr)
    • uses a layered grammar of graphics
    • i.e., we build layers

9 An example: histogram

iris |> 
  ggplot(aes(x=Sepal.Length)) +
  geom_histogram() +
  labs(title = "A histogram",
       x = "variable")

9.1 Start layering

df_lifetime |> ggplot(aes(ff)) # aes = 'aesthetic'

9.2 Add labels

df_lifetime |> ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixation times",
       x = "First fixation times (ms)")

9.3 Add

df_lifetime |> ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram()

Distribution of first fixation times at the verb region (raw milliseconds)

9.4 Add condition

df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "First fixataion times at the verb region",
       x = "First fixation times (ms)") +
  geom_histogram()

Distribution of first fixation times at the verb region (raw milliseconds)

The colour here is STACKED!! i.e., not layered. Notice the distribution doesn’t change from all grey to coloured

9.5 Customisation

  • we can add arguments to our geoms
    • e.g., transparency: alpha = takes a value between 0 to 1
  • we can use theme() to customise font sizes, legend placement, etc.
  • tehre are also popular preset themes, such as theme_bw() and theme_minimal()
  • theme_bw()
  • theme_minimal()
df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram(alpha=.5) +
  theme_bw()

Distribution of first fixation times at the verb region (raw milliseconds).
df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram(alpha=.5) +
  theme_minimal()

Distribution of first fixation times at the verb region (raw milliseconds).

10 Distributions

  • show the distribution of observations
    • so we can see where the data are clustered
    • and eyeball the shape of the distribution
  • we already saw the histogram, which shows the number of observations per variable value
  • density plots are another useful plot for visualising distributions
iris |> 
  ggplot(aes(x=Sepal.Length)) +
  geom_density() +
  labs(title = "A density plot",
       x = "variable")

10.1 Density plots

  • below I just replaced geom_histogram() with geom_density()
    • I also filtered the data to include only values of ff above 0
  • what is plotted along the y-axis? how does this differ from a histogram?
df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_density() +
  theme_minimal()

Distribution of first fixation times at the verb region (raw milliseconds).

10.2 Grouped density plots

  • just like with histograms, we can look at the density plots of different subsets of the data with aes(fill = )
    • like region
df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff, fill = region)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_minimal()

Distribution of first fixation times at the verb region (raw milliseconds).

10.2.1 facet_grid()

  • there are a lot of overlapping density curves, let’s try to separate them with facet_grid(x~y)
df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff, fill = region)) + 
  facet_grid(.~region) +
  labs(title = "Density plot of first fixataion times by region",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_bw()

Distribution of first fixation times at the verb region (raw milliseconds).
  • how would you describe the density plots of the different regions?

10.2.2 re-ordering factors

  • by default, factors will be ordered alphabetically
    • but we don’t always want that
    • here, verb-1 should be before verb
df_lifetime <- df_lifetime %>%
  mutate(region = factor(region, 
                         levels = c("verb-1","verb","verb+1","verb+2","verb+3","verb+4")))

summary(df_lifetime$region)
verb-1   verb verb+1 verb+2 verb+3 verb+4 
   559    559    559    559    559    182 

10.2.2.1 Exercise

  1. create a density plot with the fill colour set to condition, but:
  • subset the data to only include the verb region
  • you can decide if you want to use facets or to have the density curves overlayed
  • your plot should look something like A or B:

10.2.2.2 Extra exercise

  1. Can you produce these plots?

10.3 Scatterplots

  • histograms and density plots plot a single variable along the x-axis
    • in most other plots the dependent (measure) variable is plotted along the y-axis by convention
  • scatterplots plot the relationship between two variables
iris |> 
  ggplot(aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point() +
  labs(title = "A scatterplot",
       x = "variable X",
       y = "variable Y")

10.3.1 Scatterplots

  • the figure below plots total reading times (verb region) to the verb region (x-axis) and reaction times to the critical sentence (y-axis)
    • what does each point represent?
    • how would you describe the relationship between the two variables?
df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tt, y = rt)) +
  labs(title = "Scatter plot of total reading times (verb region)
and reaction times (critical sentence)",
       x = "Total reading time (ms)",
       y = "Reaction time (ms)") +
  geom_point(alpha = .2) +
  theme_bw()

10.3.1.1 Exercise

  1. Generate a scatterplot of total reading times and reaction times, with:
    • colour and shape set to condition
    • tip: these both belong in aes()
  2. What information does this plot suggest?

11 Summary statistics

  • measures of location: mean, median, mode
  • measures of spread: (interquartile) range, standard deviation

11.1 Boxplots

  • boxplots provide information about the distribution of a continuous variable
    • but includes information like median (dark line) and quartiles (box and whiskers)
    • and outliers (dots)
  • like scatterplots, require x and y variables
    • but one of them needs to be categorical
iris |> 
  ggplot(aes(x = Species, y = Sepal.Length)) +
  labs(title = "A scatterplot",
       x = "Categorical variable",
       y = "Continuous variable") +
  geom_boxplot()

A scatterplot. Median (50th percentile): thick black lines; interquartile range (IQR; 25th and 75th percentile): box limits; minimum (0th percentile) and maximum (100th percentile) excluding outliers: : whiskers; outliers: points

11.2 Boxplot explained

Image source: (winter_statistics_2019?) (all rights reserved)

11.3 Boxplots

  • let’s change our scatterplot to a boxplot
df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tense, y = ff)) +
  labs(title = "First-fixation duration (verb region)",
       x = "Tense",
       y = "First-fixation duration (ms)") +
  geom_boxplot(alpha = .2) +
  theme_bw()

11.3.1 Grouped boxplots

df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tense, y = ff, colour = lifetime)) +
  labs(title = "First-fixation duration (verb region)",
       x = "Tense",
       y = "First-fixation duration (ms)") +
  geom_boxplot(alpha = .2) +
  theme_bw()

11.3.1.1 Exercise

  1. Create a group boxplot (x = tense, fill = lifetime) for
  • first-pass reading time (verb region)
  • regression path duration (verb region)
  • total reading time (verb region)
  • reaction times (use the distinct() verb to have a single observation per participant and per trial)

11.4 Violin plots

11.5 Violin boxplots

12 Summary statistics

12.1 Interaction plots

  • common for factorial designs, i.e., comparing categorical predictors
  • there are 2 ways of producing them:
    • with your data frame and stat_summary()
    • or with a summary table and ggplot geoms geom_point(), geom_errorbar(), and geom_line()
  • we’ll need our summary table to plot an interaction plot
condition lifetime tense N mean.ff sd se ci lower.ci upper.ci
deadPP dead PP 140 198.9 57.9 4.9 9.7 189.2 208.6
deadSF dead SF 139 194.6 67.9 5.8 11.4 183.2 205.9
livingPP living PP 140 194.2 77.3 6.5 12.9 181.3 207.1
livingSF living SF 140 186.0 57.6 4.9 9.6 176.4 195.6

13 Binomial data

  • binomial data are those with 2 categories, for example
    • present, absent
    • yes, no
  • in our dataset, each trial ended with a binary naturalness judgement task
    • how might we plot such data?

13.1 Bar plot

  • be sure to read in accept as a factor!
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept) )) +
  geom_bar() +
  theme_bw()

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept), fill = condition)) +
  labs(title = "Binary responses",
       x = "Naturalness response",
       fill = "Condition") +
  geom_bar() +
  theme_bw()

13.2 Grouped bar plots

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept), fill = condition)) +
  labs(title = "Binary responses",
       x = "Naturalness response",
       fill = "Condition") +
  geom_bar(position = "dodge") +
  theme_bw()

13.2.1 Exercise

  1. Generate a grouped bar plot (i.e., dodge) with:
    • a facet grid for tense
    • plots lifetime on the x-axis
    • and fills the bars based on accept
    • change the labels accordingly
    • customise as you like
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Binary responses",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "dodge") +
  theme_bw()

13.3 Grouped bar plots

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Grouped and faceted barplot",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "dodge") +
  theme_bw()

13.4 Stacked bar plots

df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Stacked and faceted barplot",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "stack") +
  theme_bw()

13.4.1 Exercise

  1. Choose the barplot you like best for binary data
  2. Reproduce that barplot, but with reg_in at the verb1 region

13.4.2 Extra exercise

  1. Create another bar plot, but for reg_out for all sentence regions
  2. Use facet_grid()
  • to have facets by region (columns) and by tense (in 2 rows)

14 Resources

Nordmann et al. (2022)

Nordmann & DeBruine (2022)

Wickham et al. (n.d.), Chapter 2

References

Nordmann, E., & DeBruine, L. (2022). Applied data skills (Version 2.0). Zenodo. https://doi.org/10.5281/zenodo.6365078
Nordmann, E., McAleer, P., Toivo, W., Paterson, H., & DeBruine, L. M. (2022). Data Visualization Using R for Researchers Who Do Not Use R. Advances in Methods and Practices in Psychological Science, 5(2), 251524592210746. https://doi.org/10.1177/25152459221074654
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (n.d.). R for Data Science (2nd ed.). https://r4ds.hadley.nz/
4  Data wrangling
Modelling
Source Code
---
title: "Data Visualisation with `ggplot2`"
subtitle: "Communicating your data"
author: "Daniela Palleschi"
institute: Humboldt-Universität zu Berlin
footer: "Data Visualisation"
lang: en
date: 2023-04-13
bibliography: references/references.json
csl: references/apa.csl
---

```{r}
knitr::opts_chunk$set(eval = T, # evaluate chunks
                      echo = T, # 'print code chunk?'
                      message = F, # 'print messages (e.g., warnings)?'
                      error = F, # stop when error encountered
                      warning = F) # don't print warnings
```

```{r, eval = T, cache = F}
# Create references.json file based on the citations in this script
# make sure you have 'bibliography: references.json' in the YAML
rbbt::bbt_update_bib("data_viz.qmd")
```

# Data communication

## Load packages and data

```{r}
# load tidyverse
library(tidyverse)

# load data
df_lifetime <- readr::read_csv(here::here("data/tidy_data_lifetime_pilot.csv"), 
                               # for special characters
                               locale = readr::locale(encoding = "latin1") 
                               ) |>
  mutate_if(is.character,as.factor) |> # all character variables as factor
  filter(type == "critical", # only critical trials
         px != "px3") # this participant had lots of 0's for some reason
```

## Tables

- we can create summaries of our data

```{r, eval = F, echo = F}
# with Rmisc::summarySEwithin
df_lifetime |> 
  filter(region == "verb",
         !is.na(ff)) |> 
  Rmisc::summarySEwithin(measurevar="ff", withinvars ="condition")
```

```{r}
#| code-fold: true
# compute summary 
summary_ff <- df_lifetime |> 
  filter(region=="verb") |> 
  group_by(condition,lifetime,tense) %>%
  summarise(N = n(),
            mean.ff = mean(ff, na.rm = T),
            sd = sd(ff, na.rm = T)) %>%
  # compute standard error, confidence intervals, and lower/upper ci bounds
  mutate(se = sd / sqrt(N),
         ci = qt(1 - (0.05 / 2), N - 1) * se,
         lower.ci = mean.ff - qt(1 - (0.05 / 2), N - 1) * se,
         upper.ci = mean.ff + qt(1 - (0.05 / 2), N - 1) * se)
```

- and print the output with the `kable()` function from the `knitr` package
  + for extra customisation you can also use the `kableExtra` package (e.g., with the `kable_styling()` function)

```{r}
#| output-location: fragment

# install.packages("knitr") # if not yet installed
knitr::kable(summary_ff, digits=1,
             caption = "Table with summmary statistics for first-fixation duration at the verb region")
```

### Exercise

1. install the `knitr` package (`install.packages("knitr")`)
2. create an object with some summary statistics of the variable `rt`
  + call it `summary_rt`
3. use `kable()` from `knitr` to print a table

```{r}
#| echo: false
summary_rt <- df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  group_by(lifetime,tense,condition) %>%
  summarise(N = n(),
            mean.rt = mean(rt, na.rm = T),
            sd = sd(rt, na.rm = T))
```

```{r}
#| output-location: fragment
knitr::kable(summary_rt, digits=1,
             caption = "Summary of reaction times (ms) per condition")
```

# Plotting reading times

- reading times are (usually) continuous variables
  + as are e.g., reaction times
- they are truncated at 0, meaning they cannot have negative values
  + because of this, they tend to have a *skewed distribution*

# Plots with `ggplot2`

- `ggplot2` is part of the tidyverse (like `dplyr`)
  + uses a *layered grammar of graphics*
  + i.e., we build layers

# An example: histogram

```{r}
#| output-location: column-fragment
iris |> 
  ggplot(aes(x=Sepal.Length)) +
  geom_histogram() +
  labs(title = "A histogram",
       x = "variable")
```

## Start layering

```{r}
#| output-location: column
df_lifetime |> ggplot(aes(ff)) # aes = 'aesthetic'
```

## Add labels

```{r}
#| output-location: column
df_lifetime |> ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixation times",
       x = "First fixation times (ms)")
```

## Add 

```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)"}
#| output-location: column
df_lifetime |> ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram()
```

## Add condition

```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)"}
#| output-location: column
df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "First fixataion times at the verb region",
       x = "First fixation times (ms)") +
  geom_histogram()
```

::: {.notes}
The colour here is STACKED!! i.e., not layered. Notice the distribution doesn't change from all grey to coloured
:::

## Customisation

- we can add arguments to our geoms
  + e.g., transparency: `alpha = ` takes a value between 0 to 1
- we can use `theme()` to customise font sizes, legend placement, etc.
- tehre are also popular preset themes, such as `theme_bw()` and `theme_minimal()`

::: panel-tabset
### `theme_bw()`
```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."}
#| output-location: column
#| code-line-numbers: "|4|5|"
df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram(alpha=.5) +
  theme_bw()
```

### `theme_minimal()`
```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."}
#| output-location: column
#| code-line-numbers: "5"
df_lifetime |> ggplot(aes(ff, fill = condition)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_histogram(alpha=.5) +
  theme_minimal()
```
:::

# Distributions {.smaller}

- show the distribution of observations
  + so we can see where the data are clustered
  + and eyeball the shape of the distribution
- we already saw the histogram, which shows the number of observations per variable value
- **density plots** are another useful plot for visualising distributions

```{r}
#| output-location: column-fragment
iris |> 
  ggplot(aes(x=Sepal.Length)) +
  geom_density() +
  labs(title = "A density plot",
       x = "variable")
```

## Density plots

- below I just replaced `geom_histogram()` with `geom_density()`
  + I also filtered the data to include only values of `ff` above 0
- what is plotted along the y-axis? how does this differ from a histogram?

```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."}
#| code-line-numbers: "|2|6|"
#| output-location: column-fragment
df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_density() +
  theme_minimal()
```

## Grouped density plots

- just like with histograms, we can look at the density plots of different subsets of the data with `aes(fill = )`
  + like region

```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."}
#| code-line-numbers: "|2|6|"
#| output-location: column-fragment
df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff, fill = region)) + 
  labs(title = "Histogram of first fixataion times",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_minimal()
```

### `facet_grid()`

- there are a lot of overlapping density curves, let's try to separate them with `facet_grid(x~y)`

```{r, fig.cap="Distribution of first fixation times at the verb region (raw milliseconds)."}
#| code-line-numbers: "|4|"
#| output-location: column-fragment
df_lifetime |> 
  filter(ff > 0) |> 
  ggplot(aes(ff, fill = region)) + 
  facet_grid(.~region) +
  labs(title = "Density plot of first fixataion times by region",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_bw()
```

- how would you describe the density plots of the different regions?

### re-ordering factors

- by default, factors will be ordered alphabetically
  + but we don't always want that
  + here, `verb-1` should be before `verb`
  
```{r}
#| output-location: column-fragment
df_lifetime <- df_lifetime %>%
  mutate(region = factor(region, 
                         levels = c("verb-1","verb","verb+1","verb+2","verb+3","verb+4")))

summary(df_lifetime$region)
```

#### Exercise


1. create a density plot with the fill colour set to `condition`, but:
  
- subset the data to only include the verb region
- you can decide if you want to use facets or to have the density curves overlayed
- your plot should look something like A or B:

```{r, echo = F}
#| output-location: fragment

# with facet
p_dens1 <- df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(ff, fill = condition)) +
  facet_grid(. ~ condition) +
  labs(title = "Density plot of first fixataion times (verb region)",
       x = "First fixation times (ms)",
       y = "Density",
       fill = "Condition") +
  geom_density(alpha = .5) +
  theme_bw() +
  theme(title = element_text(size=10),
        axis.text.x = element_text(angle = 45))

# without facet
p_dens2 <- df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(ff, fill = condition)) +
  # facet_grid(.~condition) +
  labs(title = "Density plot of first fixataion times
(verb region)",
       x = "First fixation times (ms)",
       y = "Density",
       fill = "Condition") +
  geom_density(alpha = .5) +
  theme_bw() +
  theme(plot.title = element_text(size=10))

# create legend

dens_legend <- cowplot::get_legend(p_dens1 + theme(legend.position = "bottom"))
dens_legend <- ggpubr::as_ggplot(dens_legend)

# combine
ggpubr::ggarrange(
ggpubr::ggarrange(p_dens1 + theme(legend.position = "none"),
                  p_dens2 +theme(legend.position = "none"),
                  labels = c("A","B"),
                  widths = c(.6,.4)),
dens_legend, nrow = 2, heights = c(.9,.1))

```


#### Extra exercise

2. Can you produce these plots?

```{r, echo = F}
p_tense_life <- df_lifetime |> 
  filter(ff > 0,
         region == "verb") |> 
  ggplot(aes(ff, fill = lifetime)) + 
  facet_grid(.~tense) +
  labs(title = "Density plot of first fixataion times (verb region)",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_bw() 

p_tense <- df_lifetime |> 
  filter(ff > 0,
         region == "verb") |> 
  ggplot(aes(ff, fill = tense)) + 
  # facet_grid(.~tense) +
  labs(title = "Density plot of first fixataion times (verb region)",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_bw() +
  theme(title = element_text(size=8))

p_lifetime <- df_lifetime |> 
  filter(ff > 0,
         region == "verb") |> 
  ggplot(aes(ff, fill = lifetime)) + 
  # facet_grid(.~tense) +
  labs(title = "Density plot of first fixataion times (verb region)",
       x = "First fixation times (ms)") +
  geom_density(alpha=.5) +
  theme_bw() +
  theme(title = element_text(size=8))

ggpubr::ggarrange(
  ggpubr::ggarrange(p_tense, p_lifetime,
                    labels = c("A","B")),
  ggpubr::ggarrange(NULL, p_tense_life,NULL,
                    nrow = 1,
                    widths = c(.1,.8,.1),
                    labels = c("", "C", "")),
  nrow = 2
)
```


## Scatterplots

- histograms and density plots plot a **single variable** along the x-axis
  + in most other plots the dependent (measure) variable is plotted along the y-axis by convention
- scatterplots plot the relationship between **two variables**

```{r}
#| output-location: column-fragment
iris |> 
  ggplot(aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point() +
  labs(title = "A scatterplot",
       x = "variable X",
       y = "variable Y")
```

### Scatterplots

- the figure below plots total reading times (verb region) to the verb region (x-axis) and reaction times to the critical sentence (y-axis)
  + what does each point represent?
  + how would you describe the relationship between the two variables?

```{r}
#| code-line-numbers: "4,7,8,9"
#| output-location: column-fragment
df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tt, y = rt)) +
  labs(title = "Scatter plot of total reading times (verb region)
and reaction times (critical sentence)",
       x = "Total reading time (ms)",
       y = "Reaction time (ms)") +
  geom_point(alpha = .2) +
  theme_bw()
```

#### Exercise

1. Generate a scatterplot of total reading times and reaction times, with:
    - colour and shape set to condition
    + tip: these both belong in `aes()`
2. What information does this plot suggest?

```{r, eval = F, echo = F}
#| code-line-numbers: "4,7,8,9"
#| output-location: fragment
df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tt, y = rt, colour = condition, shape = condition)) +
  labs(title = "Scatter plot of total reading times (verb region)
and reaction times (critical sentence)",
       x = "Total reading time (ms)",
       y = "Reaction time (ms)",
colour = "Condition", shape = "Condition") +
  geom_point(alpha = .5, size = 2) +
  theme_bw()
```

# Summary statistics

- measures of location: mean, median, mode
- measures of spread: (interquartile) range, standard deviation

::: {.content-visible when-format="revealjs"}
::: notes
IQR: middle 50% of data
:::
:::

## Boxplots

- boxplots provide information about the distribution of a *continuous* variable
  + but includes information like *median* (dark line) and *quartiles* (box and whiskers)
  + and *outliers* (dots)
- like scatterplots, require x and y variables
  + but one of them needs to be **categorical**

```{r, fig.cap="A scatterplot. Median (50th percentile): thick black lines; interquartile range (IQR; 25th and 75th percentile): box limits; minimum (0th percentile) and maximum (100th percentile) excluding outliers: : whiskers; outliers: points"}
#| output-location: column-fragment
iris |> 
  ggplot(aes(x = Species, y = Sepal.Length)) +
  labs(title = "A scatterplot",
       x = "Categorical variable",
       y = "Continuous variable") +
  geom_boxplot()
```

## Boxplot explained

::: {.content-visible when-format="revealjs"}
```{r echo = F, fig.align = "center"}
#| fig-cap: "Image source: @winter_statistics_2019 (all rights reserved)"
#| fig-cap-location: bottom

# invert colours for dark mode in slides
library(magick)
y <- magick::image_read(here::here("media/Winter_2019_boxplot.png"))

magick::image_negate(y)
```
:::

::: {.content-visible unless-format="revealjs"}
```{r echo = F, fig.align = "center"}
#| fig-cap: "Image source: @winter_statistics_2019 (all rights reserved)"
#| fig-cap-location: bottom
magick::image_read(here::here("media/Winter_2019_boxplot.png"))
```
:::

::: {.content-visible when-format="revealjs"}
::: notes
whiskers: 1.5*IQR from Q1 (lower whisker) or Q3 (upper whisker)
outliers: outside that range
:::
:::

## Boxplots

- let's change our scatterplot to a boxplot

```{r}
#| code-line-numbers: "4,"
#| output-location: column-fragment
df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tense, y = ff)) +
  labs(title = "First-fixation duration (verb region)",
       x = "Tense",
       y = "First-fixation duration (ms)") +
  geom_boxplot(alpha = .2) +
  theme_bw()
```

### Grouped boxplots

```{r}
#| code-line-numbers: "4,"
#| output-location: column-fragment
df_lifetime |>
  filter(ff > 0,
         region == "verb") |>
  ggplot(aes(x = tense, y = ff, colour = lifetime)) +
  labs(title = "First-fixation duration (verb region)",
       x = "Tense",
       y = "First-fixation duration (ms)") +
  geom_boxplot(alpha = .2) +
  theme_bw()
```

#### Exercise

1. Create a group boxplot (x = tense, fill = lifetime) for
  + first-pass reading time (verb region)
  + regression path duration (verb region)
  + total reading time (verb region)
  + reaction times (use the `distinct()` verb to have a single observation per participant and per trial)

## Violin plots

## Violin boxplots


# Summary statistics

## Interaction plots {.smaller}

:::: {.columns}
::: {.column width="50%"}
- common for **factorial** designs, i.e., comparing *categorical* predictors
- there are 2 ways of producing them:
  + with your data frame and `stat_summary()`
  + or with a summary table and ggplot geoms `geom_point()`, `geom_errorbar()`, and `geom_line()`
- we'll need our summary table to plot an interaction plot
:::

::: {.content-hidden when-format="pdf"}
::: {.column width="50%"}
```{r}
#| echo: false
knitr::kable(summary_ff, digits = 1) |> 
  kableExtra::kable_styling(font_size = 20)
```
:::
:::

::: {.content-visible when-format="pdf"}
::: {.column width="50%"}
```{r}
#| echo: false

# smaller font for PDF output
knitr::kable(summary_ff, digits = 1)
```
:::
:::

::::

::: {.column width="50%"}
```{r}
#| echo: false
df_lifetime |> 
  filter(region == "verb") |> 
  ggplot(aes(x = lifetime, y = ff, 
                     shape = tense,
                     group = tense,
                     color = tense)) +
  labs(title="Interaction plot with `stat_summary()`",
       x = "Lifetime",
       y = "First fix (ms)",
       shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") +
  stat_summary(fun = "mean", geom = "point", size = 3, position = position_dodge(0.2)) +
  stat_summary(fun = "mean", geom = "line", position = position_dodge(0.2), aes(linetype=tense)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2
               , position = position_dodge(0.2)) +
  theme_bw()
```
:::

::: {.column width="50%"}
```{r}
#| echo: false
# plot the means and standard errors
summary_ff |> 
  ggplot(aes(x = lifetime, y = mean.ff, 
                     shape = tense,
                     group = tense,
                     color = tense)) +
  labs(title="Interaction plot with summary table and geoms",
       x = "Lifetime",
       y = "First fix (ms)",
       shape = "Tense", group = "Tense", color = "Tense", linetype = "Tense") +
  geom_point(size = 3,
                position = position_dodge(0.2)) +
  geom_line(aes(linetype=tense), position = position_dodge(0.2)) +
  geom_errorbar(aes(ymin = mean.ff - ci,
                    ymax = mean.ff + ci),
                width = .2,
                position = position_dodge(0.2)) +
  theme_bw()
```
:::

# Binomial data

- binomial data are those with 2 categories, for example
  + present, absent
  + yes, no
- in our dataset, each trial ended with a binary naturalness judgement task
  + how might we plot such data?
  
## Bar plot

::: {.column width="100%"}
- be sure to read in `accept` as a factor!
:::

::: {.column width="50%"}

```{r}
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept) )) +
  geom_bar() +
  theme_bw()
```

:::

::: {.column width="50%"}

```{r}
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept), fill = condition)) +
  labs(title = "Binary responses",
       x = "Naturalness response",
       fill = "Condition") +
  geom_bar() +
  theme_bw()
```

:::

## Grouped bar plots

```{r}
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = as.factor(accept), fill = condition)) +
  labs(title = "Binary responses",
       x = "Naturalness response",
       fill = "Condition") +
  geom_bar(position = "dodge") +
  theme_bw()
```

### Exercise

1. Generate a grouped bar plot (i.e., `dodge`) with:
    - a facet grid for `tense`
    - plots `lifetime` on the x-axis
    - and fills the bars based on `accept`
    - change the labels accordingly
    - customise as you like
  
```{r, eval = F}
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Binary responses",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "dodge") +
  theme_bw()
```

## Grouped bar plots

```{r, eval = T}
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Grouped and faceted barplot",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "dodge") +
  theme_bw()
```

## Stacked bar plots

```{r, eval = T}
#| code-line-numbers: "8"
df_lifetime |> 
  distinct(px,trial,.keep_all=T) |> 
  ggplot(aes(x = lifetime, fill = as.factor(accept))) +
  facet_grid(.~tense) +
  labs(title = "Stacked and faceted barplot",
       x = "Lifetime",
       fill = "Response") +
  geom_bar(position = "stack") +
  theme_bw()
```

### Exercise

1. Choose the barplot you like best for binary data
2. Reproduce that barplot, but with `reg_in` at the `verb1` region

```{r}
#| echo: false
#| eval: true
#| code-line-numbers: "8"
df_lifetime |> 
  filter(region=="verb") |> 
  ggplot(aes(x = lifetime, fill = as.factor(reg_in))) +
  facet_grid(.~tense) +
  labs(title = "Regressions in (yes/no, verb region)",
       x = "Lifetime",
       fill = "Regressions in (yes/no)") +
  geom_bar(position = "stack") +
  theme_bw()
```

### Extra exercise

1. Create another bar plot, but for `reg_out` for all sentence regions
2. Use `facet_grid()`
  + to have facets by region (columns) and by tense (in 2 rows)

```{r}
#| echo: false
#| eval: true
#| code-line-numbers: "8"
df_lifetime |> 
  # filter(region=="verb") |> 
  ggplot(aes(x = lifetime, fill = as.factor(reg_out))) +
  facet_grid(tense~region) +
  labs(title = "Regressions in (yes/no, verb region)",
       x = "Lifetime",
       fill = "Regressions out (yes/no)") +
  geom_bar(position = "stack") +
  theme_bw()
```

# Resources

@nordmann_data_2022

@nordmann_applied_2022

@wickham_r_nodate, [Chapter 2](https://r4ds.hadley.nz/data-visualize.html)

# References {.unlisted .unnumbered visibility="uncounted"}

::: {#refs custom-style="Bibliography"}
:::