Regression for Linguists

Table of contents

  • Resources
  • Assumptions about you
  • Software
    • Install R
    • Install RStudio
    • Install LaTeX
  • Further resources
    • Troubleshooting
    • Session Information

Resources and Set-up

Author: Daniela Palleschi

Affiliation: Humboldt-Universität zu Berlin

Published: April 29, 2024

Resources

This course is mainly based on Winter (2019), which is an excellent introduction to regression for linguists. For even more introductory tutorials, I recommend going through Winter (2013) and Winter (2014). For a more intermediate textbook, I’d recommend Sonderegger (2023).

If you’re interested in the foundational writings on the topic of (frequentist) linear mixed models in (psycho)linguistic research, I’d recommend reading Baayen (2008); Baayen et al. (2008); Barr et al. (2013); Bates et al. (2015); Jaeger (2008); Matuschek et al. (2017); Vasishth (2022); Vasishth & Nicenboim (2016).

Assumptions about you

For this course, I assume that you are familiar with more classical statistical tests, such as the t-test, Chi-square test, etc. I also assume you are familiar with measures of central tendency (mean, median, mode), measures of dispersion/spread (standard deviation), and with the concept of a normal distribution. Lacking this knowledge will not impede your progress in the course, but it is an important foundation on which we’ll be building. We can review these concepts in class as needed.
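As a quick refresher, the measures named above can be computed in base R. This is a minimal sketch on a made-up vector `x` (the values are purely illustrative and not course data). Base R has no built-in function for the statistical mode, so the `table()` approach below is just one common workaround.

```r
# Hypothetical example data (not from the course materials)
x <- c(2, 4, 4, 5, 7, 9)

# Central tendency
x_mean   <- mean(x)     # arithmetic mean
x_median <- median(x)   # middle value of the sorted data

# Mode: the most frequent value, found by tabulating the data
x_mode <- as.numeric(names(which.max(table(x))))

# Dispersion/spread
x_sd <- sd(x)           # sample standard deviation (n - 1 denominator)

# Collect the results in one named vector
c(mean = x_mean, median = x_median, mode = x_mode, sd = x_sd)
```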

Software

  • R: a statistical programming language (the underlying language)

  • RStudio: a program that facilitates working with R; our preferred IDE (integrated development environment)

  • LaTeX: a typesetting system that generates documents in PDF format

  • why R?

    • R and RStudio are open-source and free software
    • they are widely used in science and business

Install R

  • we need the free and open source statistical software R to analyze our data
  • download and install R: https://www.r-project.org
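Once R is installed, one way to confirm that it works is to run the following in the R console. This is only a sketch: the minimum version `4.0.0` is an arbitrary threshold chosen for illustration, not a course requirement.

```r
# Show the full version string of the running R installation
R.version.string

# Compare the running version against an illustrative minimum
# ("4.0.0" is an assumption, not a requirement stated by the course)
r_is_recent <- getRversion() >= "4.0.0"
r_is_recent
```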

Install RStudio

  • we need RStudio to work with R more easily
  • Download and install RStudio: https://rstudio.com
  • it can be helpful to keep English as the language in RStudio
    • we will find more helpful information if we search error messages in English on the internet
  • If you have problems installing R or RStudio, check out this help page (in German): http://methods-berlin.com/wp-content/uploads/Installation.html
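On the point about keeping English as the language: independent of the RStudio interface, R itself can usually be asked to print its messages in English for the current session. A minimal sketch, assuming your R build supports message translation (behaviour can vary by platform):

```r
# Ask R to use English for messages, warnings, and errors in this session
Sys.setenv(LANGUAGE = "en")

# Confirm that the environment variable was set
Sys.getenv("LANGUAGE")
```

The setting only lasts for the current R session, so it would need to be run again after a restart.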

Install LaTeX

  • we will not work with LaTeX directly, but it is needed in the background
  • Download and install LaTeX: https://www.latex-project.org/get/
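Because LaTeX only runs in the background, it is not obvious whether the installation succeeded. One rough check from within R is sketched below. It assumes that your LaTeX distribution provides the `pdflatex` program and that it is on the system search path, so a `FALSE` result does not necessarily mean that LaTeX is missing.

```r
# Look for the pdflatex executable on the system search path;
# Sys.which() returns an empty string when the program is not found
latex_path <- Sys.which("pdflatex")

# TRUE if pdflatex was found, FALSE otherwise
has_latex <- unname(nzchar(latex_path))
has_latex
```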

Further resources

  • many aspects of this course are inspired by (nordmann_applied_2022?) and (wickham_r_nodate?)
    • both freely available online (in English)
  • for German-language resources, visit the website of Methodengruppe Berlin: http://methods-berlin.com/de/r-lernplattform/

Troubleshooting

  • Error messages are very common in programming, at all levels.
  • Finding solutions to these error messages is an art in itself.
  • Google is your friend! If possible, search in English to get more information.
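When an error does occur, it helps to have the exact message text to search for. The sketch below deliberately triggers an error (taking the logarithm of a character string, chosen purely as an illustration) and captures its message with `tryCatch()`:

```r
# Deliberately cause an error and capture the message instead of stopping
error_text <- tryCatch(
  log("a"),                                  # log() of text is not defined
  error = function(e) conditionMessage(e)    # return the message as a string
)

# The captured message can be copied into a search engine
error_text
```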

Session Information

The current version of this Quarto book was developed using R version 4.4.0 (2024-04-24) (Puppy Cup) in RStudio version 2023.3.0.386 (Cherry Blossom). At the bottom of each chapter is a list of the packages (and version info) used in that chapter (under Session Information). I highly recommend you do the same at the bottom of each script that you write. You can easily do this by writing the following at the bottom of any R Markdown (.Rmd) or Quarto (.qmd) script:

# Session Info

```{r}
sessionInfo()
```

References

American Psychological Association. (2022). APA Style numbers and statistics guide. American Psychological Association.
Baayen, R. H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics using R.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412. https://doi.org/10.1016/j.jml.2007.12.005
Baayen, R. H., & Shafaei-Bajestan, E. (2019). languageR: Analyzing linguistic data: A practical introduction to statistics. https://CRAN.R-project.org/package=languageR
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious Mixed Models. arXiv Preprint, 1–27. https://doi.org/10.48550/arXiv.1506.04967
Biondo, N., Soilemezidi, M., & Mancini, S. (2022). Yesterday is history, tomorrow is a mystery: An eye-tracking investigation of the processing of past and future time reference during sentence reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 48(7), 1001–1018. https://doi.org/10.1037/xlm0001053
Brauer, M., & Curtin, J. J. (2018). Linear mixed-effects models and the analysis of nonindependent data: A unified framework to analyze categorical and continuous independent variables that vary within-subjects and/or within-items. Psychological Methods, 23(3), 389–411. https://doi.org/10.1037/met0000159
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359. https://doi.org/10.1016/S0022-5371(73)80014-3
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59(4), 434–446. https://doi.org/10.1016/j.jml.2007.11.007
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26. https://doi.org/10.18637/jss.v082.i13
Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315. https://doi.org/10.1016/j.jml.2017.01.001
Meteyard, L., & Davies, R. A. I. (2020). Best practice guidance for linear mixed-effects models in psychological science. Journal of Memory and Language, 112, 104092. https://doi.org/10.1016/j.jml.2020.104092
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Sonderegger, M. (2023). Regression Modeling for Linguistic Data.
Troyer, M., & Kutas, M. (2020). To catch a Snitch: Brain potentials reveal variability in the functional organization of (fictional) world knowledge during reading. Journal of Memory and Language, 113, 104111. https://doi.org/10.1016/j.jml.2020.104111
Vasishth, S. (2022). Some right ways to analyze (psycho)linguistic data [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/y54va
Vasishth, S., & Nicenboim, B. (2016). Statistical methods for linguistic research: Foundational Ideas. Language and Linguistics Compass, 10(11), 591–613. https://doi.org/10.1111/lnc3.12207
Winter, B. (2011). Pseudoreplication in phonetic research.
Winter, B. (2013). Linear models and linear mixed effects models in R: Tutorial 1.
Winter, B. (2014). A very basic tutorial for performing linear mixed effects analyses (Tutorial 2).
Winter, B. (2019). Statistics for Linguists: An Introduction Using R. Routledge. https://doi.org/10.4324/9781315165547
Winter, B., & Grice, M. (2021). Independence and generalizability in linguistics. Linguistics, 59(5), 1251–1277. https://doi.org/10.1515/ling-2019-0049
Yarkoni, T. (2022). The generalizability crisis. Behavioral and Brain Sciences, 45, e1. https://doi.org/10.1017/S0140525X20001685
Source code
---
author: "Daniela Palleschi"
institute: Humboldt-Universität zu Berlin
# footer: "Lecture 1.1 - R und RStudio"
lang: de
date: "`r Sys.Date()`"
format:
  html:
    number-sections: true
    number-depth: 3
    toc: true
    code-overflow: wrap
    code-tools: true
    self-contained: true
    fig-width: 6
bibliography: references.bib
csl: apa.csl
execute: 
  eval: true # evaluate chunks
  echo: true # 'print code chunk?'
  message: false # 'print messages (e.g., warnings)?'
  error: true # ignore errors when rendering?
  warning: false
---

# Resources and Set-up {.unnumbered}

```{r, eval = T, cache = F}
#| echo: false
# Create references.json file based on the citations in this script
# make sure you have 'bibliography: references.json' in the YAML
```

# Resources

This course is mainly based on @winter_statistics_2019, which is an excellent introduction to regression for linguists. For even more introductory tutorials, I recommend going through @winter_linear_2013 and @winter_very_2014. For a more intermediate textbook, I'd recommend @sonderegger_regression_2023.

If you're interested in the foundational writings on the topic of (frequentist) linear mixed models in (psycho)linguistic research, I'd recommend reading @baayen_analyzing_2008; @baayen_mixed-effects_2008; @barr_random_2013-1; @bates_parsimonious_2015; @jaeger_categorical_2008; @matuschek_balancing_2017; @vasishth_right_2022-1; @vasishth_statistical_2016.
    
# Assumptions about you

For this course, I assume that you are familiar with more classical statistical tests, such as the t-test, Chi-square test, etc. I also assume you are familiar with measures of central tendency (mean, median, mode), measures of dispersion/spread (standard deviation), and with the concept of a normal distribution. Lacking this knowledge will not impede your progress in the course, but it is an important foundation on which we'll be building. We can review these concepts in class as needed.

# Software {#sec-software}

- R: a statistical programming language (the underlying language)
- RStudio: a program that facilitates working with R; our preferred IDE (integrated development environment)
- LaTeX: a typesetting system that generates documents in PDF format

- why R?
  -  R and RStudio are open-source and free software
  -  they are widely used in science and business

::: {.content-hidden when-format="pdf"}
::: {.column width="30%"}
```{r eval = F, fig.env = "figure", out.width="50%", fig.align = "center"}
#| echo: false

magick::image_read(here::here("media/R_logo.png"))
```
:::

::: {.column width="30%"}
```{r eval = F, fig.env = "figure", out.width="75%", fig.align = "center"}
#| echo: false

magick::image_read(here::here("./media/RStudio_logo.png"))
```
:::
:::

```{r eval = F, fig.env = "figure", out.width="75%", fig.align = "center"}
#| echo: false

magick::image_read(here::here("./media/LaTeX_logo.png"))
```


::: {.content-visible when-format="pdf"}
```{r eval = F, fig.env = "figure", fig.pos="H", out.width="75%", fig.align = "center"}
#| echo: false

R <- grid::rasterGrob(as.raster(png::readPNG(here::here("./media", "R_logo.png"))))

RStudio <- grid::rasterGrob(as.raster(png::readPNG(here::here("./media", "RStudio_logo.png"))))

latex <- grid::rasterGrob(as.raster(png::readPNG(here::here("./media", "LaTeX_logo2.png"))))

gridExtra::grid.arrange(R, NULL, RStudio, NULL, latex, ncol=5,
                        widths=c(.25,.125,.25,.125,.25))
```
:::

## Install R

- we need the free and open source statistical software R to analyze our data
- download and install R: <https://www.r-project.org>

## Install RStudio

- we need RStudio to work with R more easily
- Download and install RStudio: <https://rstudio.com>
- it can be helpful to keep English as the language in RStudio
    - we will find more helpful information if we search error messages in English on the internet

- If you have problems installing R or RStudio, check out this help page (in German): <http://methods-berlin.com/wp-content/uploads/Installation.html>

## Install LaTeX

- we will not work with LaTeX directly, but it is needed in the background
- Download and install LaTeX: <https://www.latex-project.org/get/>

# Further resources

- many aspects of this course are inspired by @nordmann_applied_2022 and @wickham_r_nodate
    - both freely available online (in English)
- for German-language resources, visit the website of [Methodengruppe Berlin](http://methods-berlin.com/de/r-lernplattform/)

## Troubleshooting

- Error messages are very common in programming, at all levels.
- Finding solutions to these error messages is an art in itself.
- Google is your friend! If possible, search in English to get more information.

## Session Information

The current version of this Quarto book was developed using `r R.version.string` (`r R.version$nickname`) in RStudio version 2023.3.0.386 (Cherry Blossom). At the bottom of each chapter is a list of the packages (and version info) used in that chapter (under Session Information). I highly recommend you do the same at the bottom of each script that you write. You can easily do this by writing the following at the bottom of any R Markdown (`.Rmd`) or Quarto (`.qmd`) script:

````markdown
# Session Info

```{r}`r ''`
sessionInfo()
```
````


# References {.unlisted .unnumbered visibility="uncounted"}

::: {#refs custom-style="Bibliography"}
:::