
Table of contents

  • Set-up environment
  • 5.1 Categorical predictors
    • 5.1.1 Including a categorical predictor
  • 5.2 Sum contrasts
    • 5.2.1 Exploring predicted values
  • 5.3 Summary
    • Important terms
  • 5.4 Task
    • 5.4.1 Treatment contrasts
    • 5.4.2 Sum contrasts
    • 5.4.3 Multiple regression
  • Session Info

5  Categorical predictors

Regression for Linguists

Author: Daniela Palleschi

Affiliation: Humboldt-Universität zu Berlin

Published: October 11, 2023

Modified: January 5, 2024

Learning Objectives

Today we will learn…

  • about categorical predictors
  • how to interpret different contrast coding schemes

Set-up environment

# suppress scientific notation
options(scipen=999)

We’ll also need to load in our required packages.

# load libraries
pacman::p_load(
               tidyverse,
               here,
               broom,
               lme4,
               janitor,
               languageR,
               patchwork) # patchwork: used below to combine ggplots

Load data

Let’s continue working with the english dataset from the languageR package. Let’s just call it df_freq_eng.

df_freq_eng <-
  as.data.frame(english) |> 
  dplyr::select(RTlexdec, RTnaming, Word, LengthInLetters, AgeSubject, WrittenFrequency) |> 
  rename(rt_lexdec = RTlexdec,
         rt_naming = RTnaming,
         freq_written = WrittenFrequency) |> 
  clean_names() |> 
  # standardize continuous predictors
  mutate(
    freq_z = scale(freq_written),
    length_z = scale(length_in_letters)
  ) |> 
  relocate(word) |> 
  arrange(word)

In your exploratory data analysis, you might've noticed that rt_lexdec has a bimodal distribution, i.e., there are two modes (the most frequent values, i.e., peaks in a histogram). What might be driving this? We know that there were two subject groups: old and young. How does the distribution of these two groups look?

Running our model of the log reaction times as predicted by frequency and length, we see:

fit_freq_length <-
  lm(rt_lexdec ~ freq_z*length_z,
     data = df_freq_eng)
glance(fit_freq_length)$r.squared
[1] 0.1896649
glance(fit_freq_length)$adj.r.squared
[1] 0.1891323

Seems like we don’t have any overfitting in our model (\(R^2\) and adjusted \(R^2\) are comparable). Let’s look at our coefficients.

tidy(fit_freq_length) |> select(term, estimate)
# A tibble: 4 × 2
  term            estimate
  <chr>              <dbl>
1 (Intercept)      6.55   
2 freq_z          -0.0682 
3 length_z         0.00328
4 freq_z:length_z -0.00196

There is a negative slope for frequency, indicating shorter reaction times for words with higher frequency (when holding length constant). There is a positive slope for length, indicating longer reaction times for longer words (holding frequency constant). There is also a negative interaction estimate, indicating that the (already negative) frequency slope becomes slightly more negative as length increases; equivalently, the length effect shrinks for higher-frequency words. This seems similar to the dataset we explored in the previous sections. But the bimodal distribution suggests we should include age group as a predictor, since the two groups seem to pattern differently in their reaction times. Could it be that the effects of frequency and length also differ as a function of age group?
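
To unpack the interaction numerically, the slope of freq_z at a given value of length_z is the freq_z coefficient plus length_z times the interaction coefficient. A quick arithmetic check using the estimates above:

# frequency slope at length_z = +1 (one SD above mean length):
-0.0682 + 1 * (-0.00196)    # = -0.07016: slightly stronger frequency effect
# frequency slope at length_z = -1 (one SD below mean length):
-0.0682 + (-1) * (-0.00196) # = -0.06624: slightly weaker frequency effect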

5.1 Categorical predictors

In linguistic research we often want to compare the effect of groups or categories, such as native or non-native speakers, or grammatical or ungrammatical stimuli. We might expect longer reading times for non-native (compared to native) speakers of a language, or for ungrammatical (versus grammatical) sentences. With our current dataset, we’d predict longer response times for older participants than for younger participants (although we should hypothesise before collecting and visualising our data!). How might these age effects interact with the effects of word frequency and length?

5.1.1 Including a categorical predictor

What would happen if we just include age_subject in our model?

fit_age <-
  lm(rt_lexdec ~ freq_z*length_z + age_subject,
     data = df_freq_eng)

First, we see that adding age to our model results in a large increase in variance explained, and that the \(R^2\) and adjusted \(R^2\) values are comparable. In addition, the variance inflation factor (VIF) values for all coefficients are near 1, indicating that our predictors are not correlated with one another (no multicollinearity), so each contributes its own information to the model.

glance(fit_age)$r.squared
[1] 0.6888949
glance(fit_age)$adj.r.squared
[1] 0.6886222
car::vif(fit_age)
         freq_z        length_z     age_subject freq_z:length_z 
       1.012553        1.004461        1.000000        1.008108 

Now that we see that our model is not overfit and that our predictors are not correlated, let’s take a look at our model estimates.

tidy(fit_age) |> select(term,estimate)
# A tibble: 5 × 2
  term             estimate
  <chr>               <dbl>
1 (Intercept)       6.66   
2 freq_z           -0.0682 
3 length_z          0.00328
4 age_subjectyoung -0.222  
5 freq_z:length_z  -0.00196

In addition to the effects we observed in our earlier model, we see that there is a negative slope for age_subjectyoung, indicating that reaction times decrease when…what? How do we interpret a slope for a categorical variable? Regression works with numerical values, so how does a categorical variable get fit to a line? If we feed a categorical variable into the lm() function, the factor levels (i.e., the categories in a categorical variable) are given numerical values. We need to know what these values are in order to know how to interpret our model estimates. We call these numerical values mapped onto factor levels contrast coding, and we can check the contrasts of a given factor using the function contrasts().

contrasts(df_freq_eng$age_subject)
      young
old       0
young     1
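
To see exactly what numbers the model receives, we can peek at the design matrix with base R’s model.matrix(): the factor is replaced by a single 0/1 dummy column.

# the design matrix the model sees: age_subjectyoung is 0 for old, 1 for young
head(model.matrix(~ age_subject, data = df_freq_eng), n = 4)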

We see that old was coded as \(0\) and young as \(1\). This means that our slope for age_subjectyoung represents the change in reaction times when we move from old to young, which corresponds to a 1-unit change in our predictor (because the difference between 0 and 1 is 1). This is called treatment coding, or dummy coding, where one factor level is coded as 0 and the other as 1. Let’s remove the continuous variables for now and focus on age_subject. Let’s also look at raw reaction times, to more easily interpret the results.

fit_age <-
  lm(exp(rt_lexdec) ~ age_subject,
     data = df_freq_eng)
glance(fit_age)$r.squared
[1] 0.4682224

Our \(R^2\) value (0.47) is lower than for the model with frequency, length, and age (0.69), but still higher than for the model with frequency and length but no age (0.19).

tidy(fit_age) |> select(term, estimate)
# A tibble: 2 × 2
  term             estimate
  <chr>               <dbl>
1 (Intercept)          787.
2 age_subjectyoung    -157.

We see that there is an estimated decrease in reaction times of 157 ms for the young group compared to the old group. But what does the intercept represent here? Let’s look at our data again.

df_freq_eng |> 
  select(rt_lexdec, age_subject) |> 
  mutate(rt_lexdec = exp(rt_lexdec)) |> 
  summary()
   rt_lexdec      age_subject 
 Min.   : 495.4   old  :2284  
 1st Qu.: 617.4   young:2284  
 Median : 699.6               
 Mean   : 708.1               
 3rd Qu.: 775.3               
 Max.   :1323.2               

And how does rt_lexdec differ between the groups?

df_freq_eng |> 
  select(rt_lexdec, age_subject) |> 
  mutate(rt_lexdec = exp(rt_lexdec)) |> 
  summarise(mean = mean(rt_lexdec),
            min = min(rt_lexdec),
            max = max(rt_lexdec),
    .by = "age_subject"
  )
  age_subject     mean    min    max
1       young 629.5473 495.38  971.8
2         old 786.7200 603.77 1323.2

We see here that the intercept for our model actually corresponds to the mean reaction time for the old group. Why is this? Recall that the intercept corresponds to the \(y\) value (reaction time) when \(x\) is \(0\). In treatment/dummy coding, one factor level is coded as \(0\). In our case this was old, and so the intercept corresponds to the mean reaction time for participants in the old group. How does R choose which variable to code as \(0\)? It simply takes the first level name alphabetically: old comes before young, so old was automatically taken as the ‘baseline’ to which young was compared.
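
If we wanted young to be the baseline instead, we could re-order the factor levels before fitting; a brief sketch using base R’s relevel() (on a copy, so our data frame is unchanged):

# make 'young' the reference level (coded 0); the slope would then be +157
df_releveled <- df_freq_eng
df_releveled$age_subject <- relevel(df_releveled$age_subject, ref = "young")
contrasts(df_releveled$age_subject)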

Back to our model: if we add the slope to the intercept, we get the mean for the young group. Why is this?

coef(fit_age)['(Intercept)'] + coef(fit_age)['age_subjectyoung']
(Intercept) 
   629.5473 

Why are the means for the two groups used? The mean is the value closest to all values in a univariate dataset (it minimises the sum of squared deviations), and regression aims to minimise residuals (recall the line of best fit). So, a line is fit between the means of these two factor levels to achieve minimal residuals. This is actually the same thing as a t-test:

t.test(exp(rt_lexdec) ~ age_subject, data = df_freq_eng)

    Welch Two Sample t-test

data:  exp(rt_lexdec) by age_subject
t = 63.406, df = 4144.6, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group old and group young is not equal to 0
95 percent confidence interval:
 152.3128 162.0325
sample estimates:
  mean in group old mean in group young 
           786.7200            629.5473 

If we compare this to our model, we see that the t- and p-values match (the sign of t simply reflects the direction of the comparison; more on these later).

tidy(fit_age)
# A tibble: 2 × 5
  term             estimate std.error statistic p.value
  <chr>               <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)          787.      1.75     449.        0
2 age_subjectyoung    -157.      2.48     -63.4       0
fig_nocontrasts <-
  df_freq_eng |> 
  ggplot() +
  aes(x = age_subject, y = exp(rt_lexdec)) +
  labs(title = "No contrasts") +
  # geom_vline(xintercept = 0, linetype = "dashed", linewidth = .5) +
  geom_point(position = position_dodge(.6)) + 
  geom_smooth(method = 'lm', aes(group = 1)) +
  theme_bw()

fig_treatment <-
  df_freq_eng |> 
  mutate(age_subject = if_else(age_subject == "young", 1, 0)) |>
  ggplot() +
  aes(x = age_subject, y = exp(rt_lexdec)) +
  labs(title = "Treatment contrasts") +
  geom_vline(xintercept = 0, linetype = "dashed", linewidth = .5) +
  geom_point(position = position_dodge(.6)) + 
  geom_smooth(method = 'lm', aes(group = 1)) +
  theme_bw()

fig_nocontrasts + fig_treatment

5.2 Sum contrasts

Treatment/dummy coding is the default contrast coding scheme. Sum coding is another frequently used coding scheme, which is essentially centring categorical variables. Just as with continuous variables, the motivation for sum contrast coding mainly lies in the interpretation of interaction effects. How can we tell R we want to use sum contrast coding, and not dummy coding? There are different ways to do this:

# first, make sure your variable is a factor
df_freq_eng$age_subject <- as.factor(df_freq_eng$age_subject)
# check
class(df_freq_eng$age_subject)
[1] "factor"
# next, you could use the contr.sum() function
contrasts(df_freq_eng$age_subject) <- contr.sum(2) # where 2 means we have 2 levels
contrasts(df_freq_eng$age_subject)
      [,1]
old      1
young   -1

Here we see that old is coded as \(+1\) and young as \(-1\). I prefer to use +/-0.5 for reasons we don’t need to go into here (one convenient property: with +/-0.5 coding, the slope equals the difference between the two group means). Keeping young as the negative value and old as the positive value also aids in the way I interpret the slope: the change in reaction times for the older group compared to the younger group.

# or, you could manually control the sum contrasts
## check the order of the levels
levels(df_freq_eng$age_subject)
[1] "old"   "young"
## code 'old' as +.5 and 'young' as -.5
contrasts(df_freq_eng$age_subject) <- c(+0.5, -0.5)
contrasts(df_freq_eng$age_subject)
      [,1]
old    0.5
young -0.5
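
As an aside, an equivalent shortcut is to scale the built-in sum-contrast matrix by one half, which yields the same +/-0.5 coding:

# same result: divide the +1/-1 sum contrasts by 2
contrasts(df_freq_eng$age_subject) <- contr.sum(2) / 2
contrasts(df_freq_eng$age_subject)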

You could also choose to store the contrast values in their own variable.

df_freq_eng <- 
  df_freq_eng |> 
  mutate(age_numeric = ifelse(age_subject == "young", -0.5, +0.5))
df_freq_eng |> 
  select(age_subject, age_numeric) |> 
  head()
     age_subject age_numeric
338        young        -0.5
1790         old         0.5
3125       young        -0.5
3957         old         0.5
3313       young        -0.5
4145         old         0.5

Now, we can run our model using either age_subject or age_numeric.

fit_age_sum <-
  lm(exp(rt_lexdec) ~ age_subject,
     data = df_freq_eng)
glance(fit_age_sum)$r.squared
[1] 0.4682224
glance(fit_age)$r.squared
[1] 0.4682224

There is no difference in the variance accounted for by our model.

tidy(fit_age_sum) |> select(term,estimate)
# A tibble: 2 × 2
  term         estimate
  <chr>           <dbl>
1 (Intercept)      708.
2 age_subject1     157.
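
As a quick sanity check, fitting the model with the hand-coded age_numeric column should give identical estimates, since its values match the contrasts exactly (a sketch):

# same model via the numeric column; estimates should match fit_age_sum
lm(exp(rt_lexdec) ~ age_numeric, data = df_freq_eng) |> 
  tidy() |> 
  select(term, estimate)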

Compared to the treatment-coded model, though, there is a difference in the intercept, and a change in sign in our slope. Why is this?

fig_sum1 <-
  df_freq_eng |> 
  mutate(age_subject = if_else(age_subject == "young", -1, 1)) |>
  ggplot() +
  aes(x = age_subject, y = exp(rt_lexdec)) +
  labs(title = "Sum contrasts (+/-1)") +
  geom_vline(xintercept = 0, linetype = "dashed", linewidth = .5) +
  geom_point(position = position_dodge(.6)) + 
  geom_smooth(method = 'lm', aes(group = 1)) +
  theme_bw()

fig_sum5 <-
  df_freq_eng |> 
  mutate(age_subject = if_else(age_subject == "young", -.5, .5)) |>
  ggplot() +
  aes(x = age_subject, y = exp(rt_lexdec)) +
  labs(title = "Sum contrasts (+/-0.5)") +
  geom_vline(xintercept = 0, linetype = "dashed", linewidth = .5) +
  geom_point(position = position_dodge(.6)) + 
  geom_smooth(method = 'lm', aes(group = 1)) +
  theme_bw()

fig_treatment + fig_sum5 + plot_annotation(tag_levels = "A")

Figure 5.1: The sign of the slope corresponds to which level is coded as 0 (treatment/dummy coding) or as the negative value, -0.5 or -1 (sum coding)

As we see in Figure 5.1, the sign of the slope depends on how we’ve contrast coded our factor levels. In Figure 5.1 A, the old group is coded as \(0\) and young as \(1\). In Figure 5.1 B, the young group is coded as \(-.5\) and the old group as \(+.5\).

The intercept value is also now the overall mean of all observed reaction times, because the \(y\) value when \(x\) equals zero now lies in the middle of the two groups. The slope magnitude (i.e., the size of the value) hasn’t changed, because the difference between the two group means has not changed.

mean(exp(df_freq_eng$rt_lexdec))
[1] 708.1336
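
And, as noted above, under +/-0.5 coding the slope is simply the difference between the two group means (using the group means computed earlier):

# old mean minus young mean: matches the slope estimate of ~157
786.7200 - 629.5473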

5.2.1 Exploring predicted values

Let’s also explore the predicted values of our model with a categorical variable.

head(fitted(fit_age), n = 10)
     338     1790     3125     3957     3313     4145      337     1789 
629.5473 786.7200 629.5473 786.7200 629.5473 786.7200 629.5473 786.7200 
    3513     4345 
629.5473 786.7200 

We see that there are only 2 values, 630 and 787. These correspond to the means for each group that we saw above. They also seem to be in a pattern: young-mean, old-mean, young-mean, old-mean, etc. How does this correspond to the age group of the participant for the first ten observations?

head(df_freq_eng$age_subject, n = 10)
 [1] young old   young old   young old   young old   young old  
attr(,"contrasts")
      [,1]
old    0.5
young -0.5
Levels: old young

The first ten observations in our data are in young-old pairs. What are the first values in the raw data?

head(exp(df_freq_eng$rt_lexdec), n = 10)
 [1] 623.61 775.67 617.10 715.52 575.70 742.19 592.42 748.37 541.67 824.76

And what is the difference between these reaction times and the fitted values?

head(exp(df_freq_eng$rt_lexdec), n = 10) - head(fitted(fit_age), n = 10)
       338       1790       3125       3957       3313       4145        337 
 -5.937299 -11.049991 -12.447299 -71.199991 -53.847299 -44.529991 -37.127299 
      1789       3513       4345 
-38.349991 -87.877299  38.040009 
head(residuals(fit_age))
       338       1790       3125       3957       3313       4145 
 -5.937299 -11.049991 -12.447299 -71.199991 -53.847299 -44.529991 
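
We can also obtain the two group predictions directly, without relying on the row ordering, by passing a minimal new data frame to predict() (a quick sketch):

# predicted mean reaction time for each group
predict(fit_age, newdata = data.frame(age_subject = c("old", "young")))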

5.3 Summary

  • we saw that categorical predictors are mapped onto numerical values (contrast coding) in order to be fit in a regression model

  • we fit our first linear model with a categorical predictor, comparing treatment (0/1) and sum (+/-0.5) contrasts and how they change the interpretation of the intercept and slope


Learning Objectives

Today we learned…

  • about categorical predictors
  • how to interpret different contrast coding schemes

5.4 Task

We’ll use a dataset from Biondo et al. (2022), an eye-tracking reading study exploring the processing of adverb-tense concord in Spanish past and future tenses. Participants read sentences that began with a temporal adverb (e.g., yesterday/tomorrow), and had a verb marked with the congruent or incongruent tense (past/future).

Load in the data.

df_tense <-
  read_csv(here("data", "Biondo.Soilemezidi.Mancini_dataset_ET.csv"),
           locale = locale(encoding = "Latin1") # for special characters in Spanish
           ) |> 
  mutate(gramm = ifelse(gramm == "0", "ungramm", "gramm")) |> 
  clean_names()

5.4.1 Treatment contrasts

We will look at the measure total reading time (tt) at the verb region (roi == 4). Subset the data to only include the verb region.

df_verb <-
  df_tense |> 
  filter(roi == 4)

  1. Run a simple linear model with (log-transformed) total reading time (tt) as the dependent variable and grammaticality (gramm) as the independent variable. Use treatment contrasts. (A minimal starter sketch follows this list.)

  2. Inspect your coefficients again. What conclusions do you draw?

  3. Run model diagnostics:

    • check model assumptions where relevant (normality, constant variance, collinearity)
    • check model fit (\(R^2\))
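
A minimal starting sketch for step 1 (a sketch assuming tt contains positive reading times; a starting point, not a full solution):

# step 1: log-transformed total reading time predicted by grammaticality
# treatment contrasts are the default, so no contrast-setting is needed
# (assumes tt > 0; zero or missing reading times would need filtering first)
fit_gramm <- lm(log(tt) ~ gramm, data = df_verb)
tidy(fit_gramm) |> select(term, estimate)
glance(fit_gramm)$r.squared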

5.4.2 Sum contrasts

  1. Re-run your model with sum contrasts.

  2. Inspect your coefficients again. Do your conclusions change?

  3. Re-run your model diagnostics. How does it compare to your first model?

5.4.3 Multiple regression

  1. Add verb tense (verb_t: past, future) as a predictor, including an interaction term. Use sum contrasts.

  2. Inspect your coefficients again. Do your conclusions change?

  3. Re-run your model diagnostics. How does it compare to the last models?

Session Info

Developed with Quarto using R version 4.4.0 (2024-04-24) (Puppy Cup) and RStudio version 2023.9.0.463 (Desert Sunflower), and the following packages:

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] kableExtra_1.4.0 knitr_1.43       patchwork_1.2.0  languageR_1.5.0 
 [5] janitor_2.2.0    lme4_1.1-35.3    Matrix_1.7-0     broom_1.0.5     
 [9] here_1.0.1       lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1   
[13] dplyr_1.1.4      purrr_1.0.2      readr_2.1.5      tidyr_1.3.1     
[17] tibble_3.2.1     ggplot2_3.5.1    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      xfun_0.40         htmlwidgets_1.6.4 lattice_0.22-6   
 [5] tzdb_0.4.0        vctrs_0.6.5       tools_4.4.0       generics_0.1.3   
 [9] parallel_4.4.0    fansi_1.0.6       highr_0.10        pacman_0.5.1     
[13] pkgconfig_2.0.3   lifecycle_1.0.4   farver_2.1.1      compiler_4.4.0   
[17] munsell_0.5.1     carData_3.0-5     snakecase_0.11.1  htmltools_0.5.8.1
[21] yaml_2.3.7        crayon_1.5.2      car_3.1-2         pillar_1.9.0     
[25] nloptr_2.0.3      MASS_7.3-60.2     abind_1.4-5       boot_1.3-30      
[29] nlme_3.1-164      tidyselect_1.2.1  digest_0.6.33     stringi_1.8.3    
[33] labeling_0.4.3    splines_4.4.0     rprojroot_2.0.4   fastmap_1.1.1    
[37] grid_4.4.0        colorspace_2.1-0  cli_3.6.2         magrittr_2.0.3   
[41] utf8_1.2.4        withr_3.0.0       scales_1.3.0      backports_1.4.1  
[45] bit64_4.0.5       timechange_0.3.0  rmarkdown_2.24    bit_4.0.5        
[49] hms_1.1.3         evaluate_0.21     viridisLite_0.4.2 mgcv_1.9-1       
[53] rlang_1.1.3       Rcpp_1.0.12       glue_1.7.0        xml2_1.3.6       
[57] renv_1.0.7        vroom_1.6.5       svglite_2.1.3     rstudioapi_0.16.0
[61] minqa_1.2.6       jsonlite_1.8.7    R6_2.5.1          systemfonts_1.0.6

References

Biondo, N., Soilemezidi, M., & Mancini, S. (2022). Yesterday is history, tomorrow is a mystery: An eye-tracking investigation of the processing of past and future time reference during sentence reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 48(7), 1001–1018. https://doi.org/10.1037/xlm0001053