::p_load(
pacman
tidyverse,
lme4,
broom,
broom.mixed,
brms )
Reporting regression results
Extracting model estimates and data structure
Learning objectives
Today we will…
- run our first linear (mixed) model
- produce model summary tables
- report analyses and cite packages
- report model results reproducibly
Resources
Set-up
Packages
- we’ll need the following packages for today
Data
- we’ll need the dataset we’ve been working with
<- readr::read_csv(here::here("data", "tidy_data_lifetime_pilot.csv"),
df_lifetime # for special characters
locale = readr::locale(encoding = "latin1")
|>
) mutate_if(is.character,as.factor) |> # all character variables as factor
filter(type == "critical", # only critical trials
!= "px3") # this participant had lots of 0's for some reason px
Data dictionary
- let’s review the data dictionary
<- read_csv(
dict_lifetime ::here("data", "tidy_data_lifetime_pilot_dictionary.csv")
here )
Model prep
- we want to only have critical items
<-
df_lifetime |>
df_lifetime filter(type == "critical") |>
droplevels()
- we want our predictors (
lifetime
andtense
) to be factors
$lifetime <- as.factor(df_lifetime$lifetime)
df_lifetime$tense <- as.factor(df_lifetime$tense) df_lifetime
- and to have sum contrast coding:
PP
andliving
are-0,5
SF
anddead
are+0,5
contrasts(df_lifetime$lifetime)
living
dead 0
living 1
contrasts(df_lifetime$tense)
SF
PP 0
SF 1
- we see the default treatment (or ‘dummy’) contrasts here
- to change them:
contrasts(df_lifetime$lifetime) <- c(-0.5, +0.5)
contrasts(df_lifetime$tense) <- c(+0.5, -0.5)
contrasts(df_lifetime$lifetime)
[,1]
dead -0.5
living 0.5
contrasts(df_lifetime$tense)
[,1]
PP 0.5
SF -0.5
1 Linear regression
fitting a straight line to data
for an introduction see webbook for my course Regression for Linguists, e.g., Ch. 1 Understanding straight lines
fixed-effects only multiple regression:
lm(
log(fp) ~ tense * lifetime,
subset = region == "verb" & ff > 0,
data = df_lifetime
)
Call:
lm(formula = log(fp) ~ tense * lifetime, data = df_lifetime,
subset = region == "verb" & ff > 0)
Coefficients:
(Intercept) tense1 lifetime1 tense1:lifetime1
5.63477 0.03357 -0.10130 -0.08320
1.1 Naming conventions
- it’s very helpful to save our models to our environment so we can inspect them
- always try to use a prefix that defines what the object is
- e.g.,
fit_
ormod_
for models more generally - or
lm_
orglm_
to specify the function used to produce the model - whatever you choose, be consistent within your project
<-
lm_fp lm(
log(fp) ~ tense * lifetime,
subset = region == "verb" & ff > 0,
data = df_lifetime
)
1.2 Model summary
- the most straightforward way to inspect your model and the results is with
summary()
summary(lm_fp)
Call:
lm(formula = log(fp) ~ tense * lifetime, data = df_lifetime,
subset = region == "verb" & ff > 0)
Residuals:
Min 1Q Median 3Q Max
-1.18141 -0.34282 0.00058 0.26923 1.38822
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.63477 0.01877 300.193 < 2e-16 ***
tense1 0.03357 0.03754 0.894 0.37158
lifetime1 -0.10130 0.03754 -2.698 0.00718 **
tense1:lifetime1 -0.08320 0.07508 -1.108 0.26828
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4374 on 539 degrees of freedom
Multiple R-squared: 0.01712, Adjusted R-squared: 0.01165
F-statistic: 3.129 on 3 and 539 DF, p-value: 0.02538
1.3 Extracting information
lme4
objects contains lots of information- navigate to the model object in your Environment and explore the elements
- for example
nobs(model)
will tell you how many data points were included in the modelformula(model)
will give you the model formula
nobs(lm_fp)
[1] 543
formula(lm_fp)
log(fp) ~ tense * lifetime
- want to check which contrasts were used in your model?
$contrasts lm_fp
$tense
[,1]
PP 0.5
SF -0.5
$lifetime
[,1]
dead -0.5
living 0.5
2 Running models reproducibly
- as your models get more computationally complex, they can take time to fit
- sometimes you also run several models in the same script, which can take a long time
- with literate programming, we want to be able to produce dynamic reports
- we don’t want to run our models anew every time we re-render our document!
- this also might result in different results if e.g., we update a package or R version
- or if we send the script to somebody else with a different set-up
- ideally, we would have our source code (in our
.qmd
script), output (results printed in e.g., HTML our PDF output), and the actual model all stored and retrievable (.rds
files)
2.1 Storing our models
- we can save our models as R Data Serialisation files (RDS,
.rds
file extension)saveRDS(object, filename)
writes an R object as RDS filereadRDS(filename)
reads it in
- first, you’d want to create a folder where you store your models
- then save your models once you’ve run them
- but MAKE SURE you set
eval: false
- or even better, comment out this code!
```{r}
#| eval: false
saveRDS(lm_fp, here::here("models", "lm_fp"))
```
- ideally you would also set the code chunks that fit the model to
echo: false
- you don’t want to re-fit the model every time you render the document
- what’s more, any differences in the re-fit model will not reflect what you’re actually reporting elsewhere when using the stored RDS model
2.2 Load in RDS model
- and load them in e.g., when you want to report your models in Rmarkdown
<- readRDS(here::here("models", "lm_fp")) lm_fp
3 Model reporting
- when you run your models, you’ll want to also print the output
- this way you avoid needing to re-run your code to see the results
- for
lme4
models we can usesummary()
- as well as
formula()
,nobs()
- as well as
3.1 broom
- the
broom
package also contains useful functions for reportinglme4
modelsbroom::tidy()
produces the model estimates for our fixed effects
::tidy(lm_fp) broom
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.63 0.0188 300. 0
2 tense1 0.0336 0.0375 0.894 0.372
3 lifetime1 -0.101 0.0375 -2.70 0.00718
4 tense1:lifetime1 -0.0832 0.0751 -1.11 0.268
- it’s also helpful to store the output as an object
<- broom::tidy(lm_fp) tidy_lm_fp
3.1.1 Example table
Code
|>
tidy_lm_fp mutate(p.value = case_when(
< .001 ~ "< .001",
p.value TRUE ~ as.character(round(p.value,3))
|>
)) ::kable(digits = 2) |>
knitr::kable_styling() kableExtra
broom::tidy()
, knitr::kable()
, and kableExtra::kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.63 | 0.02 | 300.19 | < .001 |
tense1 | 0.03 | 0.04 | 0.89 | 0.372 |
lifetime1 | -0.10 | 0.04 | -2.70 | 0.007 |
tense1:lifetime1 | -0.08 | 0.08 | -1.11 | 0.268 |
3.2 Publication-ready tables
- we’ve already seen how to produce publication-ready tables
- TASK: produce a summary of the fixed effects of a model and feed it into our
knitr
andkableExtra
functions to produce a publication-ready table
- TASK: produce a summary of the fixed effects of a model and feed it into our
4 Mixed models
- the assumption of independence is often violated
- due to multiple observations collected from e.g., the same person or item, language, etc.
- this can lead to the inflation of Type I error (the chance of a false positive)
- to account for this nonindependence of data points, we can used mixed models
- mixed because they contain fixed effects (i.e., our predictors at the population-level) as well as random effects (group-level overall means and predictor effects)
4.1 lmer()
<-
lmer_lifetime lmer(log(fp) ~ tense * lifetime +
1|px) + (1|item_id),
(subset = region == "verb" & fp > 0,
data = df_lifetime)
4.2 Extracting model elements
- we can use the same functions as for
lm()
models
nobs(lmer_lifetime)
[1] 543
formula(lmer_lifetime)
log(fp) ~ tense * lifetime + (1 | px) + (1 | item_id)
- we can also check how many levels we had for each grouping factor (i.e., random effect: items and participants)
summary(lmer_lifetime)$ngrps["px"]
px
7
summary(lmer_lifetime)$ngrps["item_id"]
item_id
80
4.3 summary()
summary(lmer_lifetime)
Linear mixed model fit by REML ['lmerMod']
Formula: log(fp) ~ tense * lifetime + (1 | px) + (1 | item_id)
Data: df_lifetime
Subset: region == "verb" & fp > 0
REML criterion at convergence: 582.5
Scaled residuals:
Min 1Q Median 3Q Max
-2.99792 -0.67159 0.02895 0.67995 2.67790
Random effects:
Groups Name Variance Std.Dev.
item_id (Intercept) 0.007012 0.08374
px (Intercept) 0.034124 0.18473
Residual 0.154885 0.39355
Number of obs: 543, groups: item_id, 80; px, 7
Fixed effects:
Estimate Std. Error t value
(Intercept) 5.63161 0.07245 77.736
tense1 0.03794 0.03389 1.120
lifetime1 -0.10308 0.03387 -3.044
tense1:lifetime1 -0.09587 0.06778 -1.414
Correlation of Fixed Effects:
(Intr) tense1 liftm1
tense1 -0.001
lifetime1 0.000 0.013
tens1:lftm1 0.003 0.002 -0.002
4.4 broom.mixed
- the
broom.mixed
package also contains useful functions for reportinglme4
mixed modelsbroom.mixed::tidy()
produces the model estimates for fixed and random effects- if you only want fixed effects, use the argument
effects = "fixed"
- if you only want random effects, use the argument
effects = "ran_pars"
4.4.1 tidy()
<- broom.mixed::tidy(lmer_lifetime) tidy_lmer_lifetime
tidy_lmer_lifetime
# A tibble: 7 × 6
effect group term estimate std.error statistic
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 fixed <NA> (Intercept) 5.63 0.0724 77.7
2 fixed <NA> tense1 0.0379 0.0339 1.12
3 fixed <NA> lifetime1 -0.103 0.0339 -3.04
4 fixed <NA> tense1:lifetime1 -0.0959 0.0678 -1.41
5 ran_pars item_id sd__(Intercept) 0.0837 NA NA
6 ran_pars px sd__(Intercept) 0.185 NA NA
7 ran_pars Residual sd__Observation 0.394 NA NA
4.5 lmerTest
- is something missing from our model summary?
- which effects were statistically significant (p < .05)?
- calculating p-values is not trivial for mixed models
lme4
doesn’t do it for linear models
- we can use the
lmerTest
package, which also haslme4
<-
lmerTest_lifetime ::lmer(log(fp) ~ tense * lifetime +
lmerTest1|px) + (1|item_id),
(subset = region == "verb" & fp > 0,
data = df_lifetime)
<- broom.mixed::tidy(lmerTest_lifetime) tidy_lmerTest_lifetime
tidy_lmerTest_lifetime
# A tibble: 7 × 8
effect group term estimate std.error statistic df p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 fixed <NA> (Intercept) 5.63 0.0724 77.7 6.20 1.65e-10
2 fixed <NA> tense1 0.0379 0.0339 1.12 471. 2.63e- 1
3 fixed <NA> lifetime1 -0.103 0.0339 -3.04 468. 2.47e- 3
4 fixed <NA> tense1:lifeti… -0.0959 0.0678 -1.41 471. 1.58e- 1
5 ran_pars item_id sd__(Intercep… 0.0837 NA NA NA NA
6 ran_pars px sd__(Intercep… 0.185 NA NA NA NA
7 ran_pars Residual sd__Observati… 0.394 NA NA NA NA
4.5.1 Publication-ready tables
- we’ve already seen how to produce publication-ready tables
- TASK: produce publication-ready tables using
knitr
andkableExtra
for:- a table of the model estimates of the fixed effects of our model
- a table of the model estimates of the random effects of our model
- a table of the with both the fixed and random effects
- TASK: produce publication-ready tables using
- produce one of these tables as LaTeX code (if you haven’t already)
- can you then produce this table in Overleaf?
4.5.2 Example
Code
|>
tidy_lmerTest_lifetime mutate(p.value = case_when(
< .001 ~ "< .001",
p.value TRUE ~ as.character(round(p.value,3))
|>
)) ::kable(digits = 2) |>
knitr::kable_styling() kableExtra
broom.mixed::tidy()
, knitr::kable()
, and kableExtra::kable_styling()
effect | group | term | estimate | std.error | statistic | df | p.value |
---|---|---|---|---|---|---|---|
fixed | NA | (Intercept) | 5.63 | 0.07 | 77.74 | 6.20 | < .001 |
fixed | NA | tense1 | 0.04 | 0.03 | 1.12 | 470.73 | 0.263 |
fixed | NA | lifetime1 | -0.10 | 0.03 | -3.04 | 468.30 | 0.002 |
fixed | NA | tense1:lifetime1 | -0.10 | 0.07 | -1.41 | 470.76 | 0.158 |
ran_pars | item_id | sd__(Intercept) | 0.08 | NA | NA | NA | NA |
ran_pars | px | sd__(Intercept) | 0.18 | NA | NA | NA | NA |
ran_pars | Residual | sd__Observation | 0.39 | NA | NA | NA | NA |
4.6 Save our model
- let’s store our model locally for future use
```{r}
#| eval: false
lmerTest_lifetime <- saveRDS(lmerTest_lifetime, here::here("models", "lmerTest_lifetime"))
```
- another tip: you don’t want to re-run and re-save your
5 Citing resources
- you should cite the packages you used to fit your models
- as well as other package versions used (but
SessionInfo()
also takes care of this)
- as well as other package versions used (but
- to cite packages in-text, you can first extract the citation: with
citation(package = "lme4")
To cite lme4 in publications use:
Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015).
Fitting Linear Mixed-Effects Models Using lme4. Journal of
Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
A BibTeX entry for LaTeX users is
@Article{,
title = {Fitting Linear Mixed-Effects Models Using {lme4}},
author = {Douglas Bates and Martin M{\"a}chler and Ben Bolker and Steve Walker},
journal = {Journal of Statistical Software},
year = {2015},
volume = {67},
number = {1},
pages = {1--48},
doi = {10.18637/jss.v067.i01},
}
- or as BibTeX citation:
print(citation(package = "lme4"), bibtex=TRUE)
To cite lme4 in publications use:
Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015).
Fitting Linear Mixed-Effects Models Using lme4. Journal of
Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
A BibTeX entry for LaTeX users is
@Article{,
title = {Fitting Linear Mixed-Effects Models Using {lme4}},
author = {Douglas Bates and Martin M{\"a}chler and Ben Bolker and Steve Walker},
journal = {Journal of Statistical Software},
year = {2015},
volume = {67},
number = {1},
pages = {1--48},
doi = {10.18637/jss.v067.i01},
}
- then copy it to your
.bib
file, add a citation reference key (e.g.,bates_fitting_2015
), and cite it in-text with@bates_fitting_2015
, which will be formatted as Bates, Mächler, et al. (2015)- alternatively, save the package reference in Zotero and access it using
rbbt
- e.g., copy the DOI then find the Zotero button for ‘Add Item by Identifier’
- alternatively, save the package reference in Zotero and access it using
5.1 Citing packages
- would usually be done in Data Analysis
`lmerTest` package which produces the Satterwaite approximation for degrees of freedom to calculate *p*-values [@kuznetsova_lmertest_2017]. Linear mixed models were fit using the
Linear mixed models were fit using the
lmerTest
package which produces the Satterwaite approximation for degrees of freedom to calculate p-values (Kuznetsova et al., 2017).
6 How to report
- you would typically have a (sub)section called Data Analysis which includes all the steps taken prior to inspecting your results
- i.e., how you prepared and analysed your data
- this should be information stated in a pre-registration but in the past tense
- this would typically be followed by a Results section
- containing the results from these cleaning/analysis steps
6.1 Data analysis
- should include, e.g.,
- model summaries in tables and/or in-text
- any inclusion/exclusion criteria for participants or observations
- any transformations performed on dependent or independent variables
- e.g., we log-transformed first-pass reading times, and used sum contrast coding for our two 2-level predictors
- we should therefore define which levels were coded as what value (+/-0.5)
- model selection
- what model did you start with? E.g., maximal model given the data and research questions (à la Barr et al., 2013)
- how did you handle convergence issues? E.g., parsimonious model selection (à la Bates, Kliegl, et al., 2015)
6.2 Results
- this would be followed by a section called Results, which reports e.g.,
- the number of participants and trials ex-/included in analyses
- possibly raw means and 95% confidence intervals
- model formula
- information about the number of observations per model
- a structure overview of model results
6.3 Example model report
- in an Rmarkdown script where you write up a paper or thesis, you could include the following hidden code chunk
- i.e., set
{r, echo = FALSE}
, which is slightly different than the Quarto#| eval: false
chunk option syntax
- i.e., set
::p_load(lme4, lmerTest, tidyverse)
pacman
<- readRDS(here::here("models", "lmerTest_lifetime")) lmerTest_lifetime
`r summary(lmerTest_lifetime)$ngrps["px"]` participants and `r summary(lmerTest_lifetime)$ngrps["item_id"] items`. The final model formula was `log(fp) ~ tense * lifetime + (1 | px) + (1 | item_id)`.
A linear mixed model was fit to log-transformed first-pass reading times, with
`r signif(tidy_lmerTest_lifetime$p.value[tidy_lmer_lifetime$term=="tense1:lifetime1"],2)`). An effect of *lifetime* was found in first-pass reading times at the verb region (Est = `r signif(tidy_lmerTest_lifetime$estimate[tidy_lmer_lifetime$term=="lifetime1"],2)`, *t* = `r signif(tidy_lmerTest_lifetime$statistic[tidy_lmer_lifetime$term=="lifetime1"],2)`, *p* = `r signif(tidy_lmerTest_lifetime$p.value[tidy_lmer_lifetime$term=="lifetime1"],2)`). A significant effect of *tense* was not found found (Est = `r signif(tidy_lmerTest_lifetime$estimate[tidy_lmer_lifetime$term=="tense1"],2)`, *t* = `r signif(tidy_lmerTest_lifetime$statistic[tidy_lmer_lifetime$term=="tense1"],2)`, *p* = `r signif(tidy_lmerTest_lifetime$p.value[tidy_lmer_lifetime$term=="tense1"],2)`), nor an interaction of *lifetime* and *tense* (Est = `r signif(tidy_lmerTest_lifetime$estimate[tidy_lmer_lifetime$term=="tense1:lifetime1"],2)`, *t* = `r signif(tidy_lmerTest_lifetime$statistic[tidy_lmer_lifetime$term=="tense1:lifetime1"],2)`, *p* =
6.4 Output
A linear mixed model was fit to log-transformed first-pass reading times, with 7 participants and 80 items. The final model formula was
log(fp) ~ tense * lifetime + (1 | px) + (1 | item_id)
.An effect of lifetime was found in first-pass reading times at the verb region (Est = -0.1, t = -3, p = 0.0025). A significant effect of tense was not found found (Est = 0.038, t = 1.1, p = 0.26), nor an interaction of lifetime and tense (Est = -0.096, t = -1.4, p = 0.16).
Learning objectives 🏁
Today we…
- run our first linear (mixed) model ✅
- produce model summary tables ✅
- report analyses and cite packages ✅
- report model results reproducibly ✅
Session Info
print(sessionInfo(), locale = F)
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] lmerTest_3.1-3 brms_2.21.5 Rcpp_1.0.12
[4] broom.mixed_0.2.9.5 broom_1.0.6 lme4_1.1-35.3
[7] Matrix_1.7-0 lubridate_1.9.3 forcats_1.0.0
[10] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[16] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 loo_2.7.0
[4] fastmap_1.2.0 tensorA_0.36.2.1 pacman_0.5.1
[7] digest_0.6.35 timechange_0.3.0 estimability_1.5.1
[10] lifecycle_1.0.4 magrittr_2.0.3 posterior_1.5.0
[13] compiler_4.4.0 rlang_1.1.4 tools_4.4.0
[16] utf8_1.2.4 yaml_2.3.8 knitr_1.47
[19] bridgesampling_1.1-2 htmlwidgets_1.6.4 bit_4.0.5
[22] here_1.0.1 xml2_1.3.6 abind_1.4-5
[25] numDeriv_2016.8-1.1 withr_3.0.0 grid_4.4.0
[28] fansi_1.0.6 xtable_1.8-4 colorspace_2.1-0
[31] future_1.33.2 globals_0.16.3 emmeans_1.10.2
[34] scales_1.3.0 MASS_7.3-60.2 cli_3.6.2
[37] mvtnorm_1.2-5 crayon_1.5.2 rmarkdown_2.27
[40] generics_0.1.3 RcppParallel_5.1.7 rstudioapi_0.16.0
[43] tzdb_0.4.0 minqa_1.2.7 splines_4.4.0
[46] bayesplot_1.11.1 parallel_4.4.0 matrixStats_1.3.0
[49] vctrs_0.6.5 boot_1.3-30 jsonlite_1.8.8
[52] hms_1.1.3 bit64_4.0.5 listenv_0.9.1
[55] systemfonts_1.1.0 glue_1.7.0 parallelly_1.37.1
[58] nloptr_2.0.3 codetools_0.2-20 distributional_0.4.0
[61] stringi_1.8.4 gtable_0.3.5 munsell_0.5.1
[64] furrr_0.3.1 pillar_1.9.0 htmltools_0.5.8.1
[67] Brobdingnag_1.2-9 R6_2.5.1 rprojroot_2.0.4
[70] kableExtra_1.4.0 vroom_1.6.5 evaluate_0.23
[73] lattice_0.22-6 highr_0.11 backports_1.5.0
[76] renv_1.0.7 rstantools_2.4.0 svglite_2.1.3
[79] coda_0.19-4.1 nlme_3.1-165 checkmate_2.3.1
[82] xfun_0.44 pkgconfig_2.0.3