
Table of contents

  • 12.1 Review: random intercepts and slopes
  • 12.2 History of LMMs revisited
    • 12.2.1 2013: Keep it maximal
    • 12.2.2 2015 & 2017: Parsimonious models
  • 12.3 Model building
    • 12.3.1 Choosing predictors
    • 12.3.2 Choosing a maximal random effects structure (RES)
    • 12.3.3 Observations per cell
    • 12.3.4 Data structure
  • 12.4 Variability in methods
    • 12.4.1 Researcher degrees of freedom
    • 12.4.2 Justify and document
  • 12.5 A practical example
    • 12.5.1 Set-up
    • 12.5.2 Start maximal
    • 12.5.3 Convergence issues
    • 12.5.4 Non-intrusive methods
    • 12.5.5 Intrusive methods
    • 12.5.6 Comparing to ‘bad’ models
    • 12.5.7 Reporting
  • 12.6 Alternatives to lme4
  • Important terms

12  Model selection


Strolling through the garden of forking paths

Author: Daniela Palleschi
Affiliation: Humboldt-Universität zu Berlin

Published: February 9, 2024
Modified: March 11, 2024

Under construction

This chapter is not fully translated from bullet points (from my slides) to prose. This will happen eventually (hopefully by spring 2024).

Learning Objectives

Today we will learn about…

  • the history of mixed models (again)
  • strategies for model selection
  • variability in model selection

and how to…

  • apply remedies for nonconvergence
  • reduce our RES with a data-driven approach
  • compare a parsimonious model to maximal and intercept-only models

Resources

  • relevant papers for this topic
    • Barr et al. (2013)
    • Bates et al. (2015)
    • Matuschek et al. (2017)
    • Brauer & Curtin (2018)
    • Meteyard & Davies (2020)
  • practical resources covered in the practical application:
    • Sections 10.3-5 in Sonderegger (2023)
    • Section 15.7.3 ‘Convergence Issues’ in Winter (2019)
    • Brauer & Curtin (2018)
    • Meteyard & Davies (2020)
  • we will continue using the data from Biondo et al. (2022)

12.1 Review: random intercepts and slopes

  • violation of independence assumption → increased Type I error (false positive)
    • include random effects per plausible grouping factor (herein “unit,” à la Barr et al., 2013)
  • random-intercepts only models → increased Type I error (false positive)
    • add random slopes per within-unit manipulation, i.e., “maximal” models (Barr et al., 2013)
  • but such models often fail to “converge”
    • and have been shown to increase Type II error (false negative) (Bates et al., 2015; Matuschek et al., 2017)
  • some questions remain:
    • how to define our maximal model
    • how to handle convergence issues

12.2 History of LMMs revisited

  • recall Clark (1973)’s language-as-fixed-effect fallacy and the issue of generalisability (see also Winter & Grice, 2021; Yarkoni, 2022)
  • Baayen et al. (2008) motivated LMMs for linguistic data in the JML special issue
    • effect: everybody adopted random-intercepts only models
  • Barr et al. (2013): random-intercepts only models are overconfident, “keep it maximal!”
    • effect: everybody adopted maximal models
  • Matuschek et al. (2017) and Bates et al. (2015): maximal models are underconfident and have lower statistical power! Use data-driven model selection to find a “parsimonious” model!
    • effect: some people adopt this method, but many psycholinguists just want a “recipe” to follow

12.2.1 2013: Keep it maximal

A maximal model should optimize generalization of the findings to new subjects and new items.

– Barr et al. (2013), p. 261

  • random-intercepts-only models tend to be anti-conservative (inflated Type I error)
  • for this reason, Barr et al. (2013) suggested using a maximal random effects structure justified by the experimental design

12.2.2 2015 & 2017: Parsimonious models

[W]hile the maximal model indeed performs well as far as Type I error rates were concerned, power decreases substantially with model complexity.

— Matuschek et al. (2017), p. 310-311

  • there is a trade-off between Type I (overconfidence) and Type II error (underconfidence)
  • i.e., maximal models can lead to over-fitting
    • lowers statistical power, which increases Type II error (false negatives, i.e., failing to detect a true effect)
  • but we should strive for the most parsimonious model
    • parsimonious models are a compromise between the maximal model justified by your design and theory, and a given data set
  • best way to maintain low Type I and II error: collect lots of data

12.3 Model building

Every statistical model is a description of some real or hypothetical state of affairs in the world.

– Yarkoni (2022), p. 2

  • our models reflect not only our hypotheses or effects of interest, but any other plausible or known co-variates
  • our predictors should reflect our research questions or theories tested
    • plus any plausible or previously motivated co-variates (e.g., trial order)
    • plus known sources of nonindependence, i.e., our random effects

12.3.1 Choosing predictors

  • your model should be defined a priori
    • i.e., you should define what predictors you will include and any covariates
    • e.g., if you have a prediction about the effect of phonological neighbours on vowel duration
      • define what phonological characteristics you will include (e.g., place of articulation? manner?)
      • these should be related to specific hypotheses/research questions (a hypothetical sketch follows this list)
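As a rough illustration of how such a priori decisions translate into a model formula, here is a hypothetical sketch for the vowel-duration example above; every name in it (df_vowels, duration, n_neighbours, poa, speaker, word) is invented purely for illustration, and it assumes lme4/lmerTest is loaded (the random effects structure is the topic of the next section).

## hypothetical sketch: predictors chosen a priori for a vowel-duration study
## (all variable and data frame names are invented for illustration)
fit_duration <- lmer(log(duration) ~ n_neighbours + poa +
                       (1 | speaker) +
                       (1 | word),
                     data = df_vowels)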

12.3.2 Choosing a maximal random effects structure (RES)

  • how do we define our maximal model? Some tips from Barr et al. (2013)
    • between-unit factor (e.g., age): include random intercept only
    • within-unit factor with multiple observations per unit-level (e.g., age in longitudinal data): include random slopes
    • if all factors in an interaction are within-unit: include by-unit random slopes for the interaction term

12.3.2.1 Example: Biondo et al. (2022)

  • Biondo et al. (2022): 2x2 design
    • verb-tense (past, future) and grammaticality (grammatical, ungrammatical)
    • repeated-measures: within-participant and -item design
      • so we should have by-participant and -item random intercepts (multiple observations per unit level)
    • each participant and item contributed multiple data points per condition (i.e., tense and grammaticality were manipulated within each unit level)
      • so we should have varying tense and grammaticality slopes by- item and -participant
Code
fit_verb_fp_mm <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t*gramm|sj) +
                      (1 + verb_t*gramm|item),
                    data = df_biondo,
                    subset = roi == 4)

12.3.3 Observations per cell

  • if there is only a single observation per cell (e.g., you collected one observation from each participant per condition), then you can’t reliably estimate the corresponding random slopes (and with only one observation per unit overall, not even random intercepts)
  • ideally you would have at least 5 observations per cell (per unit level per condition, e.g., each participant has at least 5 observations per condition)
  • this is also a question of statistical power
## observations per subject (sj) per condition
df_biondo |> 
  filter(roi == 4) |> 
  count(sj, verb_t, gramm) |> 
  count(n)
# A tibble: 1 × 2
      n    nn
  <int> <int>
1    16   240
## observations per item per condition
df_biondo |> 
  filter(roi == 4) |> 
  count(item, verb_t, gramm) |> 
  arrange(desc(n)) |> 
  count(n)
# A tibble: 1 × 2
      n    nn
  <int> <int>
1    10   384

12.3.4 Data structure

  • random effects must be factors/categorical
  • single observation per row
    • generally speaking, there should be n(participants) * n(items) rows
    • every fixed or random effect in your model should correspond to a column in your dataset (a quick check is sketched below)
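A minimal sketch of such a check, using the df_biondo data and tidyverse verbs introduced in the practical example below (which columns you inspect will of course depend on your own model):

## quick structural checks: one observation per row, grouping variables as factors
df_biondo |>
  filter(roi == 4) |>
  summarise(n_rows = n(),
            n_sj = n_distinct(sj),
            n_item = n_distinct(item),
            sj_is_factor = is.factor(sj),
            item_is_factor = is.factor(item))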

12.4 Variability in methods

  • Meteyard & Davies (2020)
    • survey of (psychology) researchers
    • review of papers using LMMs
  • insecurity in researchers re: choosing models
  • great variation in papers in how models are built and reported

12.4.1 Researcher degrees of freedom

What we hope to make clear is that there is no single correct way in which LMM analyses should be conducted, and this has important implications for how the reporting of LMMs should be approached.

— Meteyard & Davies (2020), p. 9

  • the problem:
    • ‘researcher degrees of freedom’ (Simmons et al., 2011), or ‘the garden of forking paths’ (Gelman & Loken, 2013)
    • the same data can be analysed in a variety of ways
  • this leads to insecurity for many researchers

12.4.2 Justify and document

Replicability and reproducibility are critical for scientific progress, so the way in which researchers have implemented LMM analysis must be entirely transparent. We also hope that the sharing of analysis code and data becomes widespread, enabling the periodic re-analysis of raw data over multiple experiments as studies accumulate over time.

— Meteyard & Davies (2020), p. 9

  • the (partial) solution:
    • make model building/selection decisions a priori
    • be transparent
    • share your data and code

12.5 A practical example

We’ll now look at an example of a model that encounters convergence issues, and take some steps to reach convergence. We are considering a case where your predictors and covariates are already selected, and will focus on selecting model options and random effects structure that achieve model convergence.

12.5.1 Set-up

We first need to set up our environment.

## suppress scientific notation
options(scipen=999)

Load packages

## load libraries
pacman::p_load(
               tidyverse,
               here,
               janitor,
               ## new packages for mixed models:
               lme4,
               lmerTest,
               broom.mixed,
               lattice)
## use lmerTest's lmer() so that model summaries include p-values
lmer <- lmerTest::lmer

Load data

  • data from Biondo et al. (2022)
df_biondo <-
  read_csv(here("data", "Biondo.Soilemezidi.Mancini_dataset_ET.csv"),
           locale = locale(encoding = "Latin1") ### for special characters in Spanish
           ) |> 
  clean_names() |> 
  mutate(gramm = ifelse(gramm == "0", "ungramm", "gramm")) |> 
  mutate_if(is.character,as_factor) |> ## all character variables as factors
  droplevels() |> 
  filter(adv_type == "Deic")

12.5.1.1 Set contrasts

contrasts(df_biondo$verb_t) <- c(-0.5,+0.5)
contrasts(df_biondo$gramm) <- c(-0.5,+0.5)
contrasts(df_biondo$verb_t)
       [,1]
Past   -0.5
Future  0.5
contrasts(df_biondo$gramm)
        [,1]
gramm   -0.5
ungramm  0.5

12.5.2 Start maximal

  • model structure should be decided a priori
    • including fixed (predictors and covariates) and random effects

12.5.2.1 Maximal model

  • starting point: most maximal model structure justified by your design
    • if this converges, great!
    • if it doesn’t, what does this mean and what should we do?
fit_verb_fp_mm <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t*gramm|sj) +
                      (1 + verb_t*gramm|item),
                    data = df_biondo,
                    subset = roi == 4)
  • we get a warning of singular fit

12.5.3 Convergence issues

  • “Convergence is not a metric of model quality” (Sonderegger, 2023, p. 365, Box 10.2)
    • convergence does not always indicate “overfitting” or “overparameterisation”
    • can also be due to optimizer choice
      • since the default optimizer was changed from bobyqa to nloptwrap, there seem to be more ‘false positive’ convergence warnings
  • false-positive convergence: you get a convergence warning, but changing the optimizer and/or iteration count does not produce a warning
  • false-negative convergence: you do not get a warning, but your variance-covariance matrix might indicate overfitting

12.5.3.1 Nonconvergence remedies

  • unfortunately there is no one “right” way to deal with convergence issues
    • what matters is that you transparently report and justify your method
  • Table 17 in Brauer & Curtin (2018) (p. 404) suggests 20 remedies, whittled down to 10 suggestions in Sonderegger (2023)

Figure 12.1: From Sonderegger (2023), p. 366

12.5.3.2 Intrusive vs. Non-intrusive remedies

  • non-intrusive remedies amount to checking/adjusting data and model specifications

  • intrusive remedies involve reducing random effects structure

    • there are different schools of thought
      • random-intercepts only: increased Type I error rate = overconfident estimates
      • maximal-but-singular-fit model (Barr et al., 2013): reduces power = underconfident estimates
      • data-driven approach (Bates et al., 2015): can lose the forest for the trees, e.g., removing random slopes for predictors of interest
  • each strategy has its drawback

    • what matters is that you choose your strategy a priori and transparently report and justify it
    • even better: share/publish your data and code, which should be reproducible

12.5.3.3 ?convergence

  • type ?convergence in the Console and read the vignette
    • what suggestions does it make?
  • compare this to ?isSingular (a quick programmatic check is sketched below)
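As a programmatic complement to the help pages, lme4’s isSingular() reports whether the fit sits on the boundary of the parameter space (some variance near zero or a correlation near ±1); a minimal sketch using the maximal model fitted above:

## TRUE if the fit is singular (some variance near 0 or a correlation near +/-1)
isSingular(fit_verb_fp_mm, tol = 1e-4)
## inspect the estimated random-effect variances and correlations directly
VarCorr(fit_verb_fp_mm)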

12.5.4 Non-intrusive methods

  • check your data structure/variables
    • check model assumptions (e.g., normality, missing transformations of variables)
    • check your RES is justified by your experimental design/data structure
    • centre your predictors (e.g., sum contrasts for factors, or centring/standardizing continuous variables) to reduce multicollinearity; this also reduces collinearity in the random effects, a possible source of nonconvergence (a small sketch follows this list)
    • check observations per cell (e.g., is there a participant with very few observations, or with few observations in one condition? There should be at least 5 per cell)
  • alter model controls:
    • increase iterations
    • check optimizer
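For the categorical predictors in our example, centring is handled by the sum contrasts set in the set-up above; for a continuous covariate, centring might look like the following sketch (the trial column is an assumption made purely for illustration):

## hypothetical example: centre a continuous covariate
## (the `trial` column is assumed here for illustration)
df_centred <- df_biondo |>
  mutate(trial_c = as.numeric(scale(trial, center = TRUE, scale = FALSE)))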

12.5.4.1 Check optimizer

  • optimizer
    • lme4::allFit(model) (can take a while to run)
all_fit_verb_fp_mm <- allFit(fit_verb_fp_mm)
## bobyqa : boundary (singular) fit: see help('isSingular')
## [OK]
## Nelder_Mead : [OK]
## nlminbwrap : boundary (singular) fit: see help('isSingular')
## [OK]
## nmkbw : [OK]
## optimx.L-BFGS-B : boundary (singular) fit: see help('isSingular')
## [OK]
## nloptwrap.NLOPT_LN_NELDERMEAD : boundary (singular) fit: see help('isSingular')
## [OK]
## nloptwrap.NLOPT_LN_BOBYQA : boundary (singular) fit: see help('isSingular')
## [OK]
## There were 11 warnings (use warnings() to see them)

12.5.4.2 Optimizers

  • the default optimizer for lmer() is nloptwrap, formerly bobyqa (Bound Optimization BY Quadratic Approximation)

    • usually changing to bobyqa helps
  • see ?lmerControl for more info

  • if the fits are very similar across optimizers (or across all optimizers except the default), the convergence warning was a false positive

    • it’s safe to use the new optimizer
summary(all_fit_verb_fp_mm)$llik
                       bobyqa                   Nelder_Mead 
                    -2105.109                     -2179.479 
                   nlminbwrap                         nmkbw 
                    -2105.106                     -2105.109 
              optimx.L-BFGS-B nloptwrap.NLOPT_LN_NELDERMEAD 
                    -2105.106                     -2105.106 
    nloptwrap.NLOPT_LN_BOBYQA 
                    -2105.106 
summary(all_fit_verb_fp_mm)$fixef
                              (Intercept)    verb_t1      gramm1 verb_t1:gramm1
bobyqa                           5.956403 0.06170602 0.003369634    -0.01418865
Nelder_Mead                      5.956350 0.06188102 0.003488675    -0.01397531
nlminbwrap                       5.956403 0.06170726 0.003369637    -0.01419047
nmkbw                            5.956404 0.06170653 0.003369153    -0.01419036
optimx.L-BFGS-B                  5.956403 0.06170717 0.003369787    -0.01419044
nloptwrap.NLOPT_LN_NELDERMEAD    5.956403 0.06170725 0.003369649    -0.01419046
nloptwrap.NLOPT_LN_BOBYQA        5.956403 0.06170771 0.003369203    -0.01419184

12.5.4.3 Increase iterations

  • and/or increase number of iterations
    • the default is on the order of 100 000 evaluations (1e5 in scientific notation)
    • you can try 200 000 (2e5), 500 000 (5e5), etc.
    • this usually helps with larger data or models with complex RES
## check n of iterations
fit_verb_fp_mm@optinfo$feval
[1] 2318

12.5.4.4 lmerControl()

fit_verb_fp_mm <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t*gramm|sj) +
                      (1 + verb_t*gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
  • alternatively, you can update() the existing model fit to save some typing
fit_verb_fp_mm <- update(fit_verb_fp_mm,
                         control = lmerControl(optimizer = "bobyqa", 
                                                optCtrl = list(maxfun = 2e5)))
boundary (singular) fit: see help('isSingular')
Warning: Model failed to converge with 1 negative eigenvalue: -5.3e-01

12.5.4.5 Removing parameters

  • still won’t converge?
    • it’s time to consider intrusive remedies: removing random effects parameters

12.5.5 Intrusive methods

  • nonconvergence in maximal models is often due to overfitting
    • i.e., the model is overly complex given your data
    • this is typically due to an overly complex random effects structure
  • if the non-intrusive methods don’t lead to convergence, the problem is likely overfitting

12.5.5.1 Parsimonious vs. maximal

  • there are different camps on how to deal with this issue
  • I personally follow the suggestions in Bates et al. (2015) (for now)
    1. run random effects Principal Components Analysis (summary(rePCA(model)), lme4 package)
      • informs by how many parameters our model is overfit
    2. check variance-covariance matrix (VarCorr(model))
    3. remove parameters with correlation terms near ±1 and/or much lower variance compared to other terms
    4. fit simplified model
    5. wash, rinse, repeat
  • we’ll practice this method today, but keep in mind that it’s up to you to decide and justify which method you use

12.5.5.2 Random effects Principal Components Analysis

  • gives us a ranking of all parameters (‘components’) in our RES per unit
summary(rePCA(fit_verb_fp_mm))
$item
Importance of components:
                         [,1]   [,2]    [,3]                    [,4]
Standard deviation     0.3638 0.2493 0.08366 0.000000000000000001309
Proportion of Variance 0.6567 0.3085 0.03474 0.000000000000000000000
Cumulative Proportion  0.6567 0.9653 1.00000 1.000000000000000000000

$sj
Importance of components:
                         [,1]    [,2]        [,3]          [,4]
Standard deviation     0.6490 0.01470 0.000001281 0.00000001467
Proportion of Variance 0.9995 0.00051 0.000000000 0.00000000000
Cumulative Proportion  0.9995 1.00000 1.000000000 1.00000000000

12.5.5.3 Interpreting the output

  • the important row is the Cumulative Proportion
    • how much of the cumulative variance explained by all the by-unit parameters does this one component contribute?
    • we see for item, the first component accounts for 66% of the variance explained, and the next contributes an additional 31%, and the next 3%
    • so two components account for roughly 97% of variance explained by our RES
    • in other words, we can remove one component for sure, and possibly another
    • we could potentially remove 3 components from participant

12.5.5.4 Variance-covariance matrix

  • so we can remove one or two parameters from item, and up to three from participant
    • so either the varying intercept, or slope for tense, grammaticality, or their interaction
  • we can check this with VarCorr(fit_verb_fp_mm)
VarCorr(fit_verb_fp_mm)
 Groups   Name           Std.Dev. Corr                
 item     (Intercept)    0.139189                     
          verb_t1        0.055890  0.488              
          gramm1         0.022569 -0.109 -0.921       
          verb_t1:gramm1 0.095314 -0.283  0.456 -0.646
 sj       (Intercept)    0.257535                     
          verb_t1        0.018296 0.974               
          gramm1         0.012055 0.960  0.872        
          verb_t1:gramm1 0.017731 0.991  0.934  0.990 
 Residual                0.399095                     
  • for item I would remove gramm because it has the lowest variance, and has a pretty high correlation with verb_t (which is unlikely to be true)
  • I would also remove gramm for participant for the same reason, as well as its high correlation with the intercept and verb_t
12.5.5.4.1 Alternate model 1
  • for now let’s just remove the interaction term
    • for reproducibility reasons, do not delete the code for a model that did not converge
    • rather, write a comment on what decision was made (and why) for the new model
fit_verb_fp_m1 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t+gramm|sj) +
                      (1 + verb_t+gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
boundary (singular) fit: see help('isSingular')
12.5.5.4.1.1 rePCA()
summary(rePCA(fit_verb_fp_m1))
$item
Importance of components:
                         [,1]   [,2]          [,3]
Standard deviation     0.3559 0.1291 0.00000007181
Proportion of Variance 0.8837 0.1163 0.00000000000
Cumulative Proportion  0.8837 1.0000 1.00000000000

$sj
Importance of components:
                         [,1]         [,2] [,3]
Standard deviation     0.6465 0.0000006824    0
Proportion of Variance 1.0000 0.0000000000    0
Cumulative Proportion  1.0000 1.0000000000    1
12.5.5.4.1.2 VarCorr()
VarCorr(fit_verb_fp_m1)
 Groups   Name        Std.Dev. Corr         
 item     (Intercept) 0.139274              
          verb_t1     0.055550  0.489       
          gramm1      0.020747 -0.117 -0.924
 sj       (Intercept) 0.257657              
          verb_t1     0.017584 1.000        
          gramm1      0.011554 1.000  1.000 
 Residual             0.399869              
  • when we see Corr of +/-1, the model could not properly estimate the correlations between these parameters (they sit on the boundary of the parameter space)
    • it is an invitation to explore
  • correlations of exactly +/-1 are not plausible, and indicate overfitting in our model
    • we can remove all slopes from sj
12.5.5.4.1.3 by-item random effects
lattice::dotplot(ranef(fit_verb_fp_m1))$item

12.5.5.4.1.4 by-participant random effects (with +1 correlations)
lattice::dotplot(ranef(fit_verb_fp_m1))$sj

12.5.5.4.2 Alternate model 2
fit_verb_fp_m2 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 + verb_t+gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
boundary (singular) fit: see help('isSingular')
12.5.5.4.2.1 rePCA()
summary(rePCA(fit_verb_fp_m2))
$item
Importance of components:
                         [,1]   [,2]          [,3]
Standard deviation     0.3559 0.1297 0.00000001647
Proportion of Variance 0.8827 0.1173 0.00000000000
Cumulative Proportion  0.8827 1.0000 1.00000000000

$sj
Importance of components:
                         [,1]
Standard deviation     0.6441
Proportion of Variance 1.0000
Cumulative Proportion  1.0000
12.5.5.4.2.2 VarCorr()
VarCorr(fit_verb_fp_m2)
 Groups   Name        Std.Dev. Corr         
 item     (Intercept) 0.139364              
          verb_t1     0.055805  0.485       
          gramm1      0.020546 -0.097 -0.917
 sj       (Intercept) 0.257648              
 Residual             0.399995              
  • by-item slopes for gramm and verb_t are highly correlated
  • gramm has least variance, so let’s remove it
12.5.5.4.2.3 by-item random effects
lattice::dotplot(ranef(fit_verb_fp_m2))$item

12.5.5.4.3 Alternate model 3
fit_verb_fp_m3 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 + verb_t|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
  • converged!
12.5.5.4.3.1 rePCA()
summary(rePCA(fit_verb_fp_m3))
$item
Importance of components:
                         [,1]    [,2]
Standard deviation     0.3553 0.10311
Proportion of Variance 0.9223 0.07768
Cumulative Proportion  0.9223 1.00000

$sj
Importance of components:
                         [,1]
Standard deviation     0.6438
Proportion of Variance 1.0000
Cumulative Proportion  1.0000
12.5.5.4.3.2 VarCorr()
VarCorr(fit_verb_fp_m3)
 Groups   Name        Std.Dev. Corr 
 item     (Intercept) 0.139365      
          verb_t1     0.050134 0.542
 sj       (Intercept) 0.257714      
 Residual             0.400315      
12.5.5.4.4 Alternate model 4
  • but we might’ve also decided to remove verb_t, so let’s run that model
fit_verb_fp_m4 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 + gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
boundary (singular) fit: see help('isSingular')
  • singular fit again, so we’re justified in keeping by-item verb_t slopes rather than gramm slopes

12.5.5.5 Final model

  • give the final model a conventional name to make your life easier
    • e.g., remove the model index
fit_verb_fp <- fit_verb_fp_m3
12.5.5.5.0.1 by-participant random effects
lattice::dotplot(ranef(fit_verb_fp))$sj

12.5.5.5.0.2 by-item random effects
lattice::dotplot(ranef(fit_verb_fp))$item

12.5.5.5.0.3 summary()
summary(fit_verb_fp)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: log(fp) ~ verb_t * gramm + (1 | sj) + (1 + verb_t | item)
   Data: df_biondo
Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 200000))
 Subset: roi == 4

REML criterion at convergence: 4216.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.1758 -0.6096 -0.0227  0.6060  4.0568 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 item     (Intercept) 0.019423 0.13936      
          verb_t1     0.002513 0.05013  0.54
 sj       (Intercept) 0.066417 0.25771      
 Residual             0.160252 0.40032      
Number of obs: 3795, groups:  item, 96; sj, 60

Fixed effects:
                  Estimate  Std. Error          df t value             Pr(>|t|)
(Intercept)       5.956384    0.036763   79.243172 162.021 < 0.0000000000000002
verb_t1           0.061733    0.013971   93.410519   4.419            0.0000267
gramm1            0.003298    0.012999 3544.451690   0.254                 0.80
verb_t1:gramm1   -0.014380    0.025998 3544.762213  -0.553                 0.58
                  
(Intercept)    ***
verb_t1        ***
gramm1            
verb_t1:gramm1    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) vrb_t1 gramm1
verb_t1      0.077              
gramm1       0.000 -0.002       
vrb_t1:grm1  0.000  0.002  0.000
  • importantly, only look at the fixed effects once you’ve got your final model!
    • i.e., run model -> convergence warning -> rePCA() + VarCorr() -> run reduced model -> … -> converges -> only NOW run summary(model)

12.5.6 Comparing to ‘bad’ models

  • let’s compare our final model to our ‘bad’ models
    • random intercepts-only model (overconfident)
    • maximal model (underconfident)

12.5.6.1 Random-intercepts only

fit_verb_fp_intercepts <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 |item),
                    data = df_biondo,
                    subset = roi == 4
)
  • converges
Code
sum_fit_verb_fp <-
  tidy(fit_verb_fp,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = p.value,
         model = "parsimonious") 

sum_fit_verb_fp_mm <-
  tidy(fit_verb_fp_mm,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = p.value,
         model = "maximal") 

sum_fit_verb_fp_intercepts <-
  tidy(fit_verb_fp_intercepts,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = p.value,
         model = "intercepts")

12.5.6.2 coefficient estimates

Code
## kable() and kable_styling() come from the knitr and kableExtra packages
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, estimate, model) |>
  mutate(estimate = round(estimate,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = estimate
  ) |> 
  mutate(measure = "estimate") |> 
  kable() |> 
  kable_styling()
Table 12.1: Coefficient estimates for our parsimonious model, a random-intercepts only model, and a maximal model
term parsimonious intercepts maximal measure
(Intercept) 5.9564 5.9564 5.9564 estimate
verb_t1 0.0617 0.0619 0.0617 estimate
gramm1 0.0033 0.0032 0.0034 estimate
verb_t1:gramm1 -0.0144 -0.0143 -0.0142 estimate

12.5.6.3 standard error

Code
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, std.error, model) |>
  mutate(std.error = round(std.error,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = std.error
  ) |> 
  mutate(measure = "std.error") |> 
  kable() |> 
  kable_styling()
Table 12.2: Standard error of coefficient estimates for our parsimonious model, a random-intercepts only model, and a maximal model
term parsimonious intercepts maximal measure
(Intercept) 0.0368 0.0368 0.0367 std.error
verb_t1 0.0140 0.0130 0.0144 std.error
gramm1 0.0130 0.0130 0.0133 std.error
verb_t1:gramm1 0.0260 0.0260 0.0278 std.error
  • standard error (\(SE = \frac{\sigma}{\sqrt{n}}\)) is a measure of uncertainty
    • larger values reflect greater uncertainty
    • because \(n\) is in the denominator, SE gets smaller with more observations
  • compared to our parsimonious model with by-item varying verb_t slopes:
    • smaller SE for our overconfident (intercepts) model
    • larger SE for our underconfident (maximal) model
    • but only for the term that is also included in the random effects (verb_t)

12.5.6.4 t-values

Code
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, statistic, model) |>
  mutate(statistic = round(statistic,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = statistic
  ) |> 
  mutate(measure = "statistic") |> 
  kable() |> 
  kable_styling()
Table 12.3: t-values of each estimates for our parsimonious model, a random-intercepts only model, and a maximal model
term parsimonious intercepts maximal measure
(Intercept) 162.0213 161.9025 162.1605 statistic
verb_t1 4.4188 4.7517 4.2982 statistic
gramm1 0.2537 0.2466 0.2542 statistic
verb_t1:gramm1 -0.5531 -0.5496 -0.5108 statistic
  • the t-value (\(t = \frac{\bar{x}_1 - \bar{x}_2}{SE}\), i.e., an estimate divided by its standard error) relates the size of an effect to its uncertainty
    • larger (absolute) values reflect stronger evidence for an effect
    • more \(n\) increases \(t\) (via a smaller SE)
  • again, verb_t: \(t_{max}\) < \(t_{pars}\) < \(t_{int}\)

12.5.6.5 degrees of freedom

Code
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, df, model) |>
  mutate(df = round(df,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = df
  ) |> 
  mutate(measure = "df") |> 
  kable() |> 
  kable_styling()
Table 12.4: Degrees of freedom of each estimates for our parsimonious model, a random-intercepts only model, and a maximal model
term parsimonious intercepts maximal measure
(Intercept) 79.2432 79.2008 79.1789 df
verb_t1 93.4105 3637.1332 71.4491 df
gramm1 3544.4517 3637.1834 179.9254 df
verb_t1:gramm1 3544.7622 3637.1023 91.8597 df
  • degrees of freedom: not trivially defined in mixed models; we’re using the Satterthwaite approximation (the default in lmerTest::lmer(); an alternative is sketched below)
    • larger degrees of freedom correspond to larger \(n\)
    • including more random effects parameters reduces our effective \(n\) and therefore reduces \(df\)
  • again, verb_t: \(df_{max}\) < \(df_{pars}\) < \(df_{int}\)
    • and large differences between our maximal model and the other two for the other terms
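If you prefer Kenward-Roger degrees of freedom, the lmerTest summary method can compute them instead of the Satterthwaite default (this requires the pbkrtest package and can be slow for large models); a sketch:

## Satterthwaite degrees of freedom (lmerTest default)
summary(fit_verb_fp)
## Kenward-Roger degrees of freedom (requires the pbkrtest package)
summary(fit_verb_fp, ddf = "Kenward-Roger")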

12.5.6.6 p-values

Code
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, p.value, model) |>
  mutate(p.value = round(p.value, 8)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = p.value
  ) |> 
  mutate(measure = "p.value") |> 
  kable() |> 
  kable_styling()
Table 12.5: p-values of coefficient estimates for our parsimonious model, a random-intercepts only model, and a maximal model
term parsimonious intercepts maximal measure
(Intercept) 0.0000000 0.0000000 0.0000000 p.value
verb_t1 0.0000267 0.0000021 0.0000535 p.value
gramm1 0.7997645 0.8052568 0.7996181 p.value
verb_t1:gramm1 0.5802114 0.5826522 0.6107496 p.value
  • p-values: inversely related to (absolute) t-values (larger t-values = smaller p-values)
  • again, verb_t: \(p_{int}\) < \(p_{pars}\) < \(p_{max}\)
    • this would matter for ‘significance’ if the values were closer to the conventional alpha levels (p < .05, p < .01, p < .001)
    • but here the different random effects structures don’t qualitatively change the conclusion (all verb_t p-values are < .001)
  • this is not always the case, however!
    • this is why we do not peek at the fixed effects until we have our final model
    • we don’t want to be influenced (consciously or not) by seeing small p-values in one model but not another

12.5.7 Reporting

  • in Data Analysis section, e.g.,

We included Time Reference (past, future), and Verb Match (match, mismatch) as fixed-effect factors in the models used to investigate the processing of past–future violations (Q1), by adopting sum contrast coding (Schad et al., 2020): past and match conditions were coded as –.5. while future and mismatch conditions were coded as .5. […] Moreover, we included crossed random intercepts and random slopes for all fixed-effect parameters for subject and item grouping factors (Barr et al., 2013) in all models.

We reduced the complexity of the random effect structure of the maximal model by performing a principal component analysis so as to identify the most parsimonious model properly supported by the data (Bates et al., 2015). […] all reading time data were log transformed before performing the analyses.

— Biondo et al. (2022), p. 9

12.5.7.1 Formatted p-values

We can create our own function, which we will call format_pval(), to produce formatted p-values. Here we use the case_when() function to print < .001 when the p-value is smaller than .001 (which is the APA convention). Otherwise (TRUE), the p-value is rounded to 3 decimal places and the leading zero is removed.

## function to format p-values to APA7 guidelines
format_pval <- function(pval){
    dplyr::case_when(
        pval < .001 ~ "< .001",
        TRUE ~ str_remove(round(pval, 3), "^0+")
    )
}

We can now use our format_pval() function to format our p-values and print them in a table, as in Table 12.6.

Code
  tidy(fit_verb_fp,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = format_pval(p.value)) |> 
  select(-p.value) |> 
  knitr::kable() |> 
  kableExtra::kable_styling()
Table 12.6: Table with formatted p-values from format_pval()
effect term estimate std.error statistic df p_value
fixed (Intercept) 5.9563839 0.0367630 162.0213327 79.24317 < .001
fixed verb_t1 0.0617330 0.0139706 4.4187865 93.41052 < .001
fixed gramm1 0.0032976 0.0129994 0.2536709 3544.45169 .8
fixed verb_t1:gramm1 -0.0143804 0.0259984 -0.5531269 3544.76221 .58
APA 7 guidelines: formatting p-values

The guidelines from the 7th edition from the American Psychological Association pertaining to numbers and statistics define how we should be formatting p-values, summarised in the Guide for Numbers and Statistics (American Psychological Association, 2022). A summary:

  • Don’t use leading 0’s (i.e., zero before a decimal) when the statistic cannot be greater than 1 (proportion, correlation, level of statistical significance).
  • Report exact p-values to two or three decimals (e.g., p = .006, p = .03), i.e., no trailing 0’s.
  • Report p-values less than .001 as “p < .001”.
  • Do not repeat statistics in both the text and a table or figure

We used the round() function from base R to round p-values to 3 decimal points. This function drops trailing 0’s, so 0.030 will become 0.03. We also use the str_remove() function from the stringr package (Tidyverse), which takes the results from round() and removes any leading 0’s (i.e., those before the decimal) as defined by the regular expression (regex) ^0+. This will take 0.03 and print .03.

Here is some example output using round():

round(0.020001,3)
[1] 0.02

And removing the leading 0 using str_remove():

str_remove(round(0.020001,3), "^0+")
[1] ".02"

If our p-value is smaller than .001, then it will be written as < .001:

format_pval(c(0.2, 0.02, 0.002, 0.0002))
[1] ".2"     ".02"    ".002"   "< .001"

So now our p-values are formatted according to APA 7 guidelines, as long as we pass them through format_pval().

12.6 Alternatives to lme4

  • other alternatives that have fewer convergence issues:
    • Julia
      • e.g., run from the VS Code IDE
    • the Bayesian framework (e.g., the brms R package)
      • also fits (G)LMMs, but abandons arbitrary p-values
      • instead it quantifies uncertainty
  • both are more computationally powerful
    • but they are not (yet) as widely used in the field (a brms sketch follows this list)
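As a rough illustration only (not run here), the parsimonious model from this chapter could be fit in brms with essentially the same formula syntax; priors, chains, and all other settings are left at package defaults purely for the sake of the sketch:

## minimal brms sketch; all settings left at package defaults for illustration only
library(brms)
fit_verb_fp_brm <- brm(log(fp) ~ verb_t * gramm +
                         (1 | sj) +
                         (1 + verb_t | item),
                       data = subset(df_biondo, roi == 4),
                       family = gaussian())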

Learning objectives 🏁

Today we learned…

  • the history of mixed models (again) ✅
  • strategies for model selection ✅
  • variability in model selection ✅

and how to…

  • apply remedies for nonconvergence ✅
  • reduce our RES with a data-driven approach ✅
  • compare a parsimonious model to maximal and intercept-only models ✅

Important terms

Term                          Definition  Equation/Code
linear mixed (effects) model  NA          NA

References

American Psychological Association. (2022). APA Style numbers and statistics guide. American Psychological Association.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412. https://doi.org/10.1016/j.jml.2007.12.005
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious Mixed Models. arXiv Preprint, 1–27. https://doi.org/10.48550/arXiv.1506.04967
Biondo, N., Soilemezidi, M., & Mancini, S. (2022). Yesterday is history, tomorrow is a mystery: An eye-tracking investigation of the processing of past and future time reference during sentence reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 48(7), 1001–1018. https://doi.org/10.1037/xlm0001053
Brauer, M., & Curtin, J. J. (2018). Linear mixed-effects models and the analysis of nonindependent data: A unified framework to analyze categorical and continuous independent variables that vary within-subjects and/or within-items. Psychological Methods, 23(3), 389–411. https://doi.org/10.1037/met0000159
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359. https://doi.org/10.1016/S0022-5371(73)80014-3
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315. https://doi.org/10.1016/j.jml.2017.01.001
Meteyard, L., & Davies, R. A. I. (2020). Best practice guidance for linear mixed-effects models in psychological science. Journal of Memory and Language, 112, 104092. https://doi.org/10.1016/j.jml.2020.104092
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Sonderegger, M. (2023). Regression Modeling for Linguistic Data.
Winter, B. (2019). Statistics for Linguists: An Introduction Using R. In Statistics for Linguists: An Introduction Using R. Routledge. https://doi.org/10.4324/9781315165547
Winter, B., & Grice, M. (2021). Independence and generalizability in linguistics. Linguistics, 59(5), 1251–1277. https://doi.org/10.1515/ling-2019-0049
Yarkoni, T. (2022). The generalizability crisis. Behavioral and Brain Sciences, 45, e1. https://doi.org/10.1017/S0140525X20001685
11  Shrinkage and Partial Pooling
13  Model selection: Example
Source Code
---
title: "Model selection"
subtitle: "Strolling through the garden of forking paths"
author: "Daniela Palleschi"
institute: Humboldt-Universität zu Berlin
# footer: "Lecture 1.1 - R und RStudio"
lang: en
date: "02/09/2024"
date-modified: last-modified
shift-heading-level-by: +1
---

::: {.callout-warning}
# Under construction {.unnumbered .uncounted .unlisted}

This chapter is not fully translated from bullet points (from my slides) to prose. This will happen eventually (hopefully by spring 2024).
:::

```{r setup, eval = T, echo = F}
knitr::opts_chunk$set(echo = T, # print chunks?
                      eval = T, # run chunks?
                      error = F, # print errors?
                      warning = F, # print warnings?
                      message = F, # print messages?
                      cache = F # cache?; be careful with this!
                      )
```

# Learning Objectives {.unnumbered .unlisted}

Today we will learn about...

- the history of mixed models (again)
- strategies for model selection
- variability in model selection

and how to...

- apply remedies for nonconvergence
- reduce our RES with a data-driven approach
- compare a parsimonious model to maximal and intercept-only models

# Resources {.unnumbered .unlisted}

- relevant papers for this topic
  - @barr_random_maximal_2013
  - @bates_parsimonious_2015
  - @matuschek_balancing_2017
  - @brauer_linear_2018
  - @meteyard_best_2020
  
- practical resources covered in the practical application:
  + Sections 10.3-5 in @sonderegger_regression_2023
  + Section 15.7.3 'Convergence Issues' in @winter_statistics_2019
  + @brauer_linear_2018
  + @meteyard_best_2020 
- we will continue using the data from @biondo_yesterday_2022


```{r hidden_packages}
#| echo: false
#| eval: true

# extra packages for the lecture notes/slides
pacman::p_load(
  tidyverse,
  here, janitor,
               googlesheets4,
               gt)

# tell googlesheets4 we don't want private
gs4_deauth()
```

```{r}
#| echo: false
#| eval: true
df_biondo <-
  read_csv(here("data", "Biondo.Soilemezidi.Mancini_dataset_ET.csv"),
           locale = locale(encoding = "Latin1") ## for special characters in Spanish
           ) |> 
  clean_names() |> 
  mutate(gramm = ifelse(gramm == "0", "ungramm", "gramm")) |> 
  mutate_if(is.character,as_factor) |> # all character variables as factors
  droplevels() |> 
  filter(adv_type == "Deic")
```

# Review: random intercepts and slopes

- violation of independence assumption --> increased Type I error (false positive)
  + include random effects per plausible grouping factor [herein "unit", à la @barr_random_maximal_2013]
- random-intercepts only models --> increased Type I error (false positive)
  + add random slopes per within-unit manipulation, i.e., "maximal" models [@barr_random_maximal_2013]
- but such models often fail to "converge"
  + and have been shown to increase Type II error (false negative) [@bates_parsimonious_2015; @matuschek_balancing_2017]
  
- some questions remain:
  + how to define our maximal model
  + how to handle convergence issues

# History of LMMs revisited

- recall @clark_language-as-fixed-effect_1973's language-as-fixed-effect fallacy and the issue of generalisability [see also @winter_independence_2021; @yarkoni_generalizability_2022]
- @baayen_mixed-effects_2008 motivated LMM's for linguistic data in the JML special issue
  + effect: everybody adopted random-intercepts only models
- @barr_random_maximal_2013: random-intercepts only models are overconfident, "keep it maximal!"
  + effect: everybody adopted maximal models
- @matuschek_balancing_2017 and @bates_parsimonious_2015: maximal models are underconfident and lower statistical power! Use data-driven model selection to find a "parsimonious" model!
  + effect: some people adopt this method, but many psycholinguists just want a "recipe" to follow

## 2013: Keep it maximal

> A maximal model should optimize generalization of the findings to new subjects and new items.
>
> -- @barr_random_maximal_2013, p. 261

- random-intercepts-only models tend to be underpowered
- for this reason, @barr_random_maximal_2013 suggested using a maximal random effects structure justified by the experimental design

## 2015 & 2017: Parsimonious models

> [W]hile the maximal model indeed performs well as far as Type I error rates were concerned, power decreases substantially with model complexity.
>
> --- @matuschek_balancing_2017, p. 310-311

- there is a trade-off between Type I (overconfidence) and Type II error (underconfidence)
- i.e., maximal models can lead to over-fitting
  + lowers statistical power, which increases Type II error (false rejection)
- but we should strive for the most *parsimonious* model
  + parsimonous models are a compromise between the maximal model justified by your design and theory, and a given data set
- best way to maintain low Type I and II error: collect lots of data

# Model building

> Every statistical model is a description of some real or hypothetical state of affairs in the world.
>
> -- @yarkoni_generalizability_2022, p. 2

- our models reflect not only our hypotheses or effects of interest, but any other plausible or known co-variates
- our predictors should reflect our research questions or theories tested
  + plus any plausible or previously motivated co-variates (e.g., trial order)
  + plus known sources of nonindependence, i.e., our random effects


## Choosing predictors

- your model should be defined *a priori*
  + i.e., you should define what predictors you will include and any covariates
  + e.g., if you have a prediction about the effect of phonological neighbours on vowel duration
    + define what phonological characteristics you will include (e.g., place of articulation? manner?)
    + these should be related to specific hypotheses/research questions

## Choosing a maximal random effects structure (RES)

- how do we define our maximal model? Some tips from @barr_random_maximal_2013
  + between-unit factor (e.g., age): include random intercept only
  + within-unit factor with multiple observations per unit-level (e.g., age in longitudinal data): include random slopes
- all factors in an interaction are within-unit: include by-unit random slopes for interaction terms
      
### Example: @biondo_yesterday_2022

- @biondo_yesterday_2022: 2x2 design
  + verb-tense (past, future) and grammaticality (grammatical, ungrammatical)
  + repeated-measures: within-participant and -item design
    + so we should have by-participant and -item random intercepts (multiple observations per unit level)
  + each participant and item contributed multiple data points *per condition* (i.e., tense and grammaticality were manipulated within each unit level)
    + so we should have varying tense and grammaticality slopes by- item and -participant

```{r}
#| eval: false
#| warning: true
#| output-location: fragment
#| code-fold: true
fit_verb_fp_mm <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t*gramm|sj) +
                      (1 + verb_t*gramm|item),
                    data = df_biondo,
                    subset = roi == 4)
```

## Observations per cell

- if there is only a single observation per cell, e.g., you collected one observation from each participant per condition, then you can't fit random intercepts or slopes
- ideally you would have at least 5 observations per cell (per unit level per condition, e.g., each participant has at least 5 observations per condition)
- this is also a question of statistical power

```{r}
#| output-location: column-fragment
#| eval: true
# obvz per sj per condition
df_biondo |> 
  filter(roi == 4) |> 
  count(sj, verb_t, gramm) |> 
  count(n)
```

```{r}
#| eval: true
#| output-location: column-fragment
# obvz per item per condition
df_biondo |> 
  filter(roi == 4) |> 
  count(item, verb_t, gramm) |> 
  arrange(desc(n)) |> 
  count(n)
```

## Data structure

- random effects must be factors/categorical
- single observation per row
  + generally speaking, there should be n(participants) * n(items) rows
  + every fixed or random effect in your model should correspond to a column in your dataset

# Variability in methods

- @meteyard_best_2020
  + survey of (psychology) researchers
  + review of papers using LMMs
- insecurity in researchers re: choosing models
- great variation in papers in how models are built and reported

## Researcher degrees of freedom

> What we hope to make clear is that there is no single correct way in which LMM analyses should be conducted, and this has important implications for how the reporting of LMMs should be approached.
>
> --- @meteyard_best_2020, p. 9

- the problem:
  + 'researcher degrees of freedom' [@simmons_false-positive_2011], or 'the garden of forking paths' [@gelman_garden_2013]
  + the same data can be analysed in a variety of ways
- this leads to insecurity for many researchers

## Justify and document

> Replicability and reproducibility are critical for scientific progress, so the way in which researchers have implemented LMM analysis must be entirely transparent. We also hope that the sharing of analysis code and data becomes widespread, enabling the periodic re-analysis of raw data over multiple experiments as studies accumulate over time.
>
> --- @meteyard_best_2020, p. 9

- the (partial) solution: 
  + make model building/selection decisions a priori
  + be transparent
  + share your data and code

# A practical example

We'll now look at an example of a model that encounters convergence issues, and take some steps to reach convergence. We are considering a case where your predictors and covariates are already selected, and will focus on selecting model options and  random effects structure that achieve model convergence.

## Set-up

We first need to set up our environment.

```{r}
## suppress scientific notation
options(scipen=999)
```

### Load packages {.unnumbered}

```{r}
#| eval: true
## load libraries
pacman::p_load(
               tidyverse,
               here,
               janitor,
               ## new packages for mixed models:
               lme4,
               lmerTest,
               broom.mixed,
               lattice)
```

```{r}
#| echo: false
## set preferred ggplot2 theme
theme_set(theme_bw() + theme(plot.title = element_text(size = 10)))
```

```{r}
lmer <- lmerTest::lmer
```

```{r}
#| echo: false

## extra packages for the lecture notes/slides
pacman::p_load(
               patchwork,
               knitr,
               kableExtra,
               googlesheets4,
               gt)

## tell googlesheets4 we don't want private
gs4_deauth()
```


### Load data {.unnumbered}

- data from @biondo_yesterday_2022

```{r}
df_biondo <-
  read_csv(here("data", "Biondo.Soilemezidi.Mancini_dataset_ET.csv"),
           locale = locale(encoding = "Latin1") ### for special characters in Spanish
           ) |> 
  clean_names() |> 
  mutate(gramm = ifelse(gramm == "0", "ungramm", "gramm")) |> 
  mutate_if(is.character,as_factor) |> ## all character variables as factors
  droplevels() |> 
  filter(adv_type == "Deic")
```

### Set contrasts

```{r}
contrasts(df_biondo$verb_t) <- c(-0.5,+0.5)
contrasts(df_biondo$gramm) <- c(-0.5,+0.5)
```


```{r}
contrasts(df_biondo$verb_t)
```

```{r}
contrasts(df_biondo$gramm)
```

## Start maximal

- model structure should be decided *a priori*
  + included fixed (predictors and covariates) and random effects
  
### Maximal model

- starting point: most maximal model structure justified by your design
  + if this converges, great!
  + if it doesn't, what does this mean and what should we do?

```{r}
#| eval: false
#| warning: true
#| message: true
#| output-location: fragment
fit_verb_fp_mm <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t*gramm|sj) +
                      (1 + verb_t*gramm|item),
                    data = df_biondo,
                    subset = roi == 4)
```

- we get a warning of singular fit

```{r}
#| echo: false
#| eval: true

## saveRDS(fit_verb_fp_mm, here::here("slides", "12-model_selection_example", "rds", "fit_verb_fp_mm.rds"))

fit_verb_fp_mm <- readRDS(here::here("slides", "12-model_selection_example", "rds", "fit_verb_fp_mm.rds"))
```

## Convergence issues

- "*Convergence is not a metric of model quality*" [@sonderegger_regression_2023, p. 365, Box 10.2]
  + convergence does not always indicate "overfitting" or "overparameterisation"
  + can also be due to optimizer choice
    + since default optimizer was changed to `nloptwrap` from `bobyqa`, there seem to be more 'false positive' convergence warnings

- false-positive convergence: you get a convergence warning, but changing the optimizer and/or iteration count does not produce a warning
- false-negative convergence: you do not get a warning, but your variance-covariance matrix might indicate overfitting

### Nonconvergence remedies

- unfortunately there is no one "right" way to deal with convergence issues
  + important is to transparently report and justify your method
- Table 17 in @brauer_linear_2018 (p. 404) suggests 20 remedies, whittled down to 10 suggestions in @sonderegger_regression_2023

```{r}
#| label: fig-table10
#| fig-cap: "From @sonderegger_regression_2023, p. 366"
#| echo: false
include_graphics(here::here("media", "sonderegger_2023_convergence_table10.png"))
```

### Intrusive vs. Non-intrusive remedies

- non-intrusive remedies amount to checking/adjusting data and model specifications

- intrusive remedies involve reducing random effects structure
  + there are different schools of thought
    + random-intercepts only: increased Type I error rate = overconfident estimates
    + maximal-but-singular-fit model [@barr_random_maximal_2013]: reduces power = underconfident estimates
    + data-driven approach [@bates_parsimonious_2015]: can lose the forest for the trees, e.g., removing random slopes for predictors of interest
- each strategy has its drawback
  + important is to choose your strategy *a priori* and transparently report and justify your strategy
  + even better: share/publish your data and code, which should be reproducible

### `?convergence`

- type `?convergence` in the Console and read the vignette
  + what suggestions does it make?
- compare this to `?isSingular`

## Non-intrusive methods

- check your data structure/variables
  + check model assumptions (e.g., normality, missing transformations of variables)
  + check your RES is justified by your experimental design/data structure
  + centre your predictors (e.g., sum contrasts, or centring/standardizing) to reduce multicollinearity; reduces collinearity in the random effects (a possible source of nonconvergence)
  + check observations per cell (e.g., is there a participant very few observations, or few observations per one condition? Should be at least >5 per cell)
- alter model controls:
  - increase iterations
  - check optimizer
  
### Check optimzer

- optimizer
  + `lme4::allFit(model)` (can take a while to run)
  
```{r}
#| output-location: column-fragment
#| eval: false
all_fit_verb_fp_mm <- allFit(fit_verb_fp_mm)
## bobyqa : boundary (singular) fit: see help('isSingular')
## [OK]
## Nelder_Mead : [OK]
## nlminbwrap : boundary (singular) fit: see help('isSingular')
## [OK]
## nmkbw : [OK]
## optimx.L-BFGS-B : boundary (singular) fit: see help('isSingular')
## [OK]
## nloptwrap.NLOPT_LN_NELDERMEAD : boundary (singular) fit: see help('isSingular')
## [OK]
## nloptwrap.NLOPT_LN_BOBYQA : boundary (singular) fit: see help('isSingular')
## [OK]
## There were 11 warnings (use warnings() to see them)
```

```{r}
#| echo: false

## saveRDS(all_fit_verb_fp_mm, here::here("slides", "12-model_selection_example", "rds", "all_fit_verb_fp_mm.rds"))
all_fit_verb_fp_mm <- readRDS(here::here("slides", "12-model_selection_example", "rds", "all_fit_verb_fp_mm.rds"))
```

### Optimizers

  + the default optimizer for `lmer()` is `nloptwrap`, formerly `bobyqa` (Bound Optimization BY Quadratic Approximation)
    + switching back to `bobyqa` often helps
  + see `?lmerControl` for more info

- if the fits are very similar across optimizers (or all optimizers except the default converge), the convergence warning was a false positive
  + it's then safe to use the new optimizer

```{r}
summary(all_fit_verb_fp_mm)$llik
```

```{r}
summary(all_fit_verb_fp_mm)$fixef
```

### Increase iterations

- and/or increase the maximum number of iterations (function evaluations)
  + the default depends on the optimizer (see `?lmerControl`)
  + you can raise it, e.g., to `2e5` (200 000), as in the `lmerControl()` call below
  + this usually helps with larger datasets or models with a complex RES
  
```{r}
## number of function evaluations used by the optimizer
fit_verb_fp_mm@optinfo$feval
```

### `lmerControl()`

```{r}
#| eval: false
#| warning: true
#| output-location: fragment
fit_verb_fp_mm <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t*gramm|sj) +
                      (1 + verb_t*gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
```

- or you can simply `update()` the model to avoid retyping the full call

```{r}
#| eval: true
#| warning: true
#| output-location: fragment
fit_verb_fp_mm <- update(fit_verb_fp_mm,
                         control = lmerControl(optimizer = "bobyqa", 
                                                optCtrl = list(maxfun = 2e5)))
```

### Removing parameters

- still won't converge?
  + it's time to consider intrusive remedies: removing random effects parameters

## Intrusive methods

- nonconvergence in maximal models is often due to overfitting
  + i.e., the random effects structure is more complex than your data can support
- if the non-intrusive methods don't lead to convergence, overfitting is the likely culprit

### Parsimonious vs. maximal

- there are different camps on how to deal with this issue
- I personally follow the suggestions in @bates_parsimonious_2015 (for now)
    1. run random effects Principal Components Analysis (`summary(rePCA(model))`, `lme4` package)
        + informs by how many parameters our model is overfit
    2.  check variance-covariance matrix (`VarCorr(model)`)
    3. remove parameters with correlations at or near ±1 and/or much lower variance than the other terms
    4. fit simplified model
    5. wash, rinse, repeat
- we'll practice this method today, but keep in mind that it's up to you to decide and justify which method you use

### Random effects Principal Components Analysis

- gives us a ranking of all parameters ('components') in our RES per unit

```{r}
#| output-location: fragment
summary(rePCA(fit_verb_fp_mm))
```

### Interpreting `rePCA()` output

- the key row is the Cumulative Proportion
  + how much of the variance in a unit's random effects is captured by the components up to (and including) this one?
  + for item, the first component accounts for 66% of the variance explained, the second adds roughly 31%, and the third only 3%
  + so two components account for roughly 97% of the variance explained by the by-item RES
  + in other words, we can certainly remove one parameter, and possibly a second
  + for participant, we could potentially remove 3 components

### Variance-covariance matrix

- so we can remove about 2 parameters each from item and participant
  + candidates are the varying intercept or the slopes for tense, grammaticality, or their interaction
- we can check which with `VarCorr(fit_verb_fp_mm)`

```{r}
#| output-location: fragment
VarCorr(fit_verb_fp_mm)
```

- for item I would remove `gramm` because it has the lowest variance, and has a pretty high correlation with `verb_t` (which is unlikely to be true)
- I would also remove `gramm` for participant for the same reason, as well as its high correlation with the intercept and `verb_t`

#### Alternate model 1

- for now let's just remove the interaction term
  + for reproducibility reasons, do not delete the code for a model that did not converge
  + rather, add a comment documenting what decision was made (and why) before the code for the new model (a hypothetical example follows)
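
A hypothetical example of such a comment (the wording is illustrative only):

```{r}
#| eval: false
## fit_verb_fp_mm (maximal RES) produced a singular fit;
## rePCA() and VarCorr() indicated overparameterisation,
## so the interaction term is dropped from the by-sj and by-item slopes below
```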

```{r}
#| message: true
#| warning: true
#| output-location: fragment

fit_verb_fp_m1 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 + verb_t+gramm|sj) +
                      (1 + verb_t+gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
```

##### `rePCA()`
```{r}
#| output-location: fragment
summary(rePCA(fit_verb_fp_m1))
```

##### `VarCorr()`

```{r}
#| output-location: fragment
VarCorr(fit_verb_fp_m1)
```

- a Corr of +/-1 means the correlation between two random effects was estimated at the boundary (a degenerate estimate), not that a genuine perfect correlation exists
  + it is an invitation to explore further

- a perfect correlation is not plausible, and indicates overfitting in our model
  + we can remove all slopes from `sj`

##### by-item random effects

```{r}
lattice::dotplot(ranef(fit_verb_fp_m1))$item
```

##### by-participant random effects (with +1 correlations)

```{r}
lattice::dotplot(ranef(fit_verb_fp_m1))$sj
```

  
#### Alternate model 2

```{r}
#| message: true
#| warning: true
#| output-location: fragment

fit_verb_fp_m2 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 + verb_t+gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
```

##### `rePCA()`

```{r}
summary(rePCA(fit_verb_fp_m2))
```

##### `VarCorr()`

```{r}
VarCorr(fit_verb_fp_m2)
```

- by-item slopes for `gramm` and `verb_t` are highly correlated
- `gramm` has the lowest variance, so let's remove it

##### by-item random effects

```{r}
lattice::dotplot(ranef(fit_verb_fp_m2))$item
```


#### Alternate model 3

```{r}
#| message: true
#| warning: true
#| output-location: fragment

fit_verb_fp_m3 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 + verb_t|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
```

- converged!

##### `rePCA()`

```{r}
summary(rePCA(fit_verb_fp_m3))
```

##### `VarCorr()`

```{r}
VarCorr(fit_verb_fp_m3)
```

#### Alternate model 4

- but we might also have decided to remove `verb_t` instead, so let's fit that model

```{r}
#| message: true
#| warning: true
#| output-location: fragment

fit_verb_fp_m4 <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 + gramm|item),
                    data = df_biondo,
                    subset = roi == 4,
                    control = lmerControl(optimizer = "bobyqa",
                                          optCtrl = list(maxfun = 2e5))
)
```

- does not converge, so we're justified in keeping by-item `verb_t` slopes

### Final model

- give the final model a consistent name to make your life easier
  + e.g., drop the model index from the name

```{r}
fit_verb_fp <- fit_verb_fp_m3
```

##### by-participant random effects

```{r}
lattice::dotplot(ranef(fit_verb_fp))$sj
```

##### by-item random effects

```{r}
lattice::dotplot(ranef(fit_verb_fp))$item
```

##### `summary()`

```{r}
summary(fit_verb_fp)
```

- importantly, only look at the fixed effects after you've settled on your final model!
  + i.e., fit model -> convergence warning -> `rePCA()` + `VarCorr()` -> fit reduced model -> ... -> converges -> only *now* run `summary(model)`

## Comparing to 'bad' models

- let's compare our final model to our 'bad' models
  + random intercepts-only model (overconfident)
  + maximal model (underconfident)
  
### Random-intercepts only

```{r}
#| message: true
#| warning: true
#| output-location: fragment

fit_verb_fp_intercepts <- lmer(log(fp) ~ verb_t*gramm + 
                      (1 |sj) +
                      (1 |item),
                    data = df_biondo,
                    subset = roi == 4
)
```

- converges

```{r}
#| code-fold: true
sum_fit_verb_fp <-
  tidy(fit_verb_fp,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = p.value,
         model = "parsimonious") 

sum_fit_verb_fp_mm <-
  tidy(fit_verb_fp_mm,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = p.value,
         model = "maximal") 

sum_fit_verb_fp_intercepts <-
  tidy(fit_verb_fp_intercepts,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = p.value,
         model = "intercepts")
```

### coefficient estimates {.smaller}

```{r}
#| label: tbl-estimates
#| code-fold: true
#| tbl-cap: "Coefficient estimates for our parsimonious model, a random-intercepts only model, and a maximal model"
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, estimate, model) |>
  mutate(estimate = round(estimate,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = estimate
  ) |> 
  mutate(measure = "estimate") |> 
  kable() |> 
  kable_styling()
```

### standard error {.smaller}

```{r}
#| label: tbl-std_error
#| code-fold: true
#| tbl-cap: "Standard error of coefficient estimates for our parsimonious model, a random-intercepts only model, and a maximal model"
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, std.error, model) |>
  mutate(std.error = round(std.error,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = std.error
  ) |> 
  mutate(measure = "std.error") |> 
  kable() |> 
  kable_styling()
```

- the standard error ($SE = \frac{\sigma}{\sqrt{n}}$) is a measure of uncertainty (see the small numeric illustration below)
  + larger values reflect greater uncertainty
  + because $n$ is in the denominator, the SE shrinks as the number of observations grows
- compared to our parsimonious model with by-item varying `verb_t` slopes:
  + the overconfident (intercepts-only) model has a smaller SE
  + the underconfident (maximal) model has a larger SE
  + but only for the term that also appears in the random slopes (`verb_t`)
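
A small numeric illustration of how the SE shrinks with $n$ (using hypothetical values for $\sigma$ and $n$):

```{r}
## hypothetical residual SD and sample sizes
sigma <- 0.5
n     <- c(10, 100, 1000)

## SE = sigma / sqrt(n): shrinks as n grows
sigma / sqrt(n)
```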

### t-values {.smaller}

```{r}
#| label: tbl-t_value
#| code-fold: true
#| tbl-cap: "t-values of each estimates for our parsimonious model, a random-intercepts only model, and a maximal model"
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, statistic, model) |>
  mutate(statistic = round(statistic,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = statistic
  ) |> 
  mutate(measure = "statistic") |> 
  kable() |> 
  kable_styling()
```

- the t-value ($t = \frac{\hat{\beta}}{SE}$, i.e., an estimate divided by its standard error) is a signal-to-noise measure (see the sanity check below)
  + larger absolute values reflect stronger evidence for an effect
  + a larger $n$ (and hence a smaller SE) increases $t$
- again, for `verb_t`: $t_{max}$ < $t_{pars}$ < $t_{int}$
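
Since the reported t-value is just the estimate divided by its standard error, we can verify this from the tidied fixed effects created above (a quick sanity check; `statistic` is the t-value returned by `tidy()`):

```{r}
#| eval: false
## t-value = estimate / std.error for each coefficient
sum_fit_verb_fp |> 
  mutate(t_check = estimate / std.error) |> 
  select(term, estimate, std.error, statistic, t_check)
```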
  
### degrees of freedom {.smaller}

```{r}
#| label: tbl-df
#| code-fold: true
#| tbl-cap: "Degrees of freedom of each estimates for our parsimonious model, a random-intercepts only model, and a maximal model"
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, df, model) |>
  mutate(df = round(df,4)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = df
  ) |> 
  mutate(measure = "df") |> 
  kable() |> 
  kable_styling()
```

- degrees of freedom are not trivially defined in mixed models; we're using the Satterthwaite approximation (the default in `lmerTest::lmer()`; a Kenward-Roger alternative is sketched below)
  + larger degrees of freedom correspond to a larger (effective) $n$
  + including more random effects reduces the effective $n$ and therefore the $df$
- again, for `verb_t`: $df_{max}$ < $df_{pars}$ < $df_{int}$
  + and there are large differences between the maximal model and the other two for the remaining terms
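
For comparison, `lmerTest` can also compute Kenward-Roger degrees of freedom; a minimal sketch (this assumes the `pbkrtest` package is installed, and can be slow for larger models):

```{r}
#| eval: false
## Satterthwaite (the default) vs. Kenward-Roger approximations
summary(fit_verb_fp, ddf = "Satterthwaite")
summary(fit_verb_fp, ddf = "Kenward-Roger")
```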

### p-values {.smaller}

```{r}
#| label: tbl-p_value
#| code-fold: true
#| tbl-cap: "p-values of coefficient estimates for our parsimonious model, a random-intercepts only model, and a maximal model"
rbind(sum_fit_verb_fp, sum_fit_verb_fp_intercepts, sum_fit_verb_fp_mm) |> 
  select(term, p.value, model) |>
  mutate(p.value = round(p.value, 8)) |> 
  pivot_wider(
    id_cols = c(term),
    names_from = model,
    values_from = p.value
  ) |> 
  mutate(measure = "p.value") |> 
  kable() |> 
  kable_styling()
```

- p-values are inversely related to t-values (larger t-values = smaller p-values)
- again, for `verb_t`: $p_{max}$ > $p_{pars}$ > $p_{int}$ (mirroring the t-values)
  + this would matter for 'significance' if the values were closer to the conventional alpha levels (p < .05, p < .01, p < .001)
  + but here the different random effects structures don't qualitatively change the conclusions (all are < .001)
- this is not always the case, however!
  + this is why we do not peek at the fixed effects until we have our final model
  + we don't want to be influenced (consciously or not) by seeing small p-values in one model but not another

## Reporting

- in Data Analysis section, e.g., 
  
> We included Time Reference (past, future), and Verb Match (match, mismatch) as fixed-effect factors in the models used to investigate the processing of past–future violations (Q1), by adopting sum contrast coding (Schad et al., 2020): past and match conditions were coded as –.5. while future and mismatch conditions were coded as .5. [...] Moreover, we included crossed random intercepts and random slopes for all fixed-effect parameters for subject and item grouping factors (Barr et al., 2013) in all models. 
>
> We reduced the complexity of the random effect structure of the maximal model by performing a principal component analysis so as to identify the most parsimonious model properly supported by the data (Bates et al., 2015). [...] all reading time data were log transformed before performing the analyses.
>
> --- @biondo_yesterday_2022, p. 9
  
### Formatted p-values

We can create our own function, which we will call `format_pval()`, to produce formatted p-values. Here we use the `case_when()` function to print `< .001` when the p-value is smaller than .001 (a common convention). Otherwise (`TRUE`), the p-value is rounded to 3 decimal places and its leading zero is removed.

```{r}
## function to format p-values to APA7 guidelines
format_pval <- function(pval){
    dplyr::case_when(
        pval < .001 ~ "< .001",
        TRUE ~ str_remove(round(pval, 3), "^0+")
    )
}
```

We can now use our `format_pval()` function to format our p-values and print them in a table, as in @tbl-format_pval. 

```{r}
#| label: tbl-format_pval
#| tbl-cap: "Table with formatted p-values from `format_pval()`"
#| code-fold: true
#| eval: true
  tidy(fit_verb_fp,
     effects = "fixed") |> 
  as_tibble() |> 
  mutate(p_value = format_pval(p.value)) |> 
  select(-p.value) |> 
  knitr::kable() |> 
  kableExtra::kable_styling()
```


::: {.callout-tip}
# APA 7 guidelines: formatting p-values

The guidelines from the 7th edition from the American Psychological Association pertaining to numbers and statistics define how we should be formatting *p*-values, summarised in the [Guide for Numbers and Statistics](https://apastyle.apa.org/instructional-aids/numbers-statistics-guide.pdf) [@apa_numbers_2022]. A summary:

- Don't use leading 0's (i.e., zero before a decimal) when the statistic cannot be greater than 1 (proportion, correlation, level of statistical significance).
- Report exact *p*-values to two or three decimals (e.g., p = .006, p = .03). i.e., no trailing 0's
- Report *p*-values less than .001 as "p < .001". 
- Do not repeat statistics in both the text and a table or figure

We used the `round()` function from base R to round *p*-values to 3 decimal points. This function drops trailing 0's, so `0.030` will become `0.03`. We also use the `str_remove()` function from the `stringr` package (Tidyverse), which takes the results from `round()` and removes any leading 0's (i.e., those before the decimal) as defined by the regular expression (`regex`) `^0+`. This will take `0.03` and print `.03`.

Here is some example output using `round()`:

```{r}
round(0.020001,3)
```

And removing the leading 0 using `str_remove()`:

```{r}
str_remove(round(0.020001,3), "^0+")
```

If our *p*-value is smaller than .001, then it will be written as `< .001`:

```{r}
format_pval(c(0.2, 0.02, 0.002, 0.0002))
```

So now our *p*-values are formatted according to APA 7 guidelines, as long as we pass them through `format_pval()`.
:::

# Alternatives to `lme4`

- other frameworks tend to have fewer convergence issues:
  + Julia (the `MixedModels.jl` package, e.g., run from the VS Code IDE)
  + the Bayesian framework (e.g., the `brms` R package)
    + also fits (G)LMMs, but abandons p-values in favour of quantifying uncertainty directly
- both are computationally more powerful
  + but they are not (yet) as widely used in the field (a minimal `brms` sketch follows)
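
For illustration only, a minimal `brms` sketch of the same maximal model (this assumes `brms` is installed, uses its default priors, which you would want to revisit in practice, and takes considerably longer to fit than `lmer()`):

```{r}
#| eval: false
library(brms)

## Bayesian analogue of the maximal model: uncertainty is expressed
## via posterior distributions rather than p-values
fit_verb_fp_brm <- brm(
  log(fp) ~ verb_t * gramm +
    (1 + verb_t * gramm | sj) +
    (1 + verb_t * gramm | item),
  data = subset(df_biondo, roi == 4),
  chains = 4, cores = 4
)

summary(fit_verb_fp_brm)
```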

# Learning objectives 🏁 {.unnumbered .unlisted .uncounted}

Today we learned...

- the history of mixed models (again)  ✅
- strategies for model selection  ✅
- variability in model selection  ✅

and how to...

- apply remedies for nonconvergence ✅
- reduce our RES with a data-driven approach  ✅
- compare a parsimonious model to maximal and intercept-only models  ✅

# Important terms {.unnumbered .smaller .uncounted}

```{r terms}
#| echo: false
content <- 
  googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/17CqdxKL9lyy-PbTB2ZnfWNWs4oV--CcBvrqlh_aEPGQ/edit?usp=sharing")

content |> 
  filter(`Lecture topic` == "08 - LMMs 1: random intercepts") |> 
  select(-`Lecture topic`) |> 
  gt() 
```


# References {.unlisted .unnumbered visibility="uncounted"}

::: {#refs custom-style="Bibliography"}
:::