Open Science Practices for Linguistic Research: Reproducible Analyses in R (SSOL 2024 Workshop)
  • D. Palleschi
  • Day 1: Introduction to reproducibility
    • Reproducibility in linguistic research and R
    • Building a reproducible workflow in R
  • Day 2: Project-oriented workflow
    • RProjects
    • Package management
  • Day 3: Literate programming and Modularity
    • Writing Reproducible Code
  • Day 4: Publishing and Peer Review
    • Publishing analyses + Peer code review

Workshop overview

This web-book contains slides for the workshop ‘Open Science Practices for Linguistic Research: Reproducible Analyses in R’, given by Daniela Palleschi at the Summer School of Linguistics in Budweis, Czechia, in August 2024. The tools discussed are specific to the R environment, but the concepts are universal and programming-language agnostic. The materials are restructured from a semester-length course on the same topic taught at the Humboldt-Universität zu Berlin in the summer semester of 2024; the materials for that course are more exhaustive (see https://daniela-palleschi.github.io/r4repro_SoSe2024/).

Workshop abstract:

The Open Science movement began as an answer to the replication crisis and aims to encourage transparency across all stages of research. In this workshop, we will focus on practicing transparency in our analyses through reproducibility: what does it mean, why should we practice it, and how can we do it? We will focus on establishing and maintaining a reproducible, project-oriented workflow in the R environment. After the workshop, participants will be able to put reproducibility concepts and tools into practice, such as data management and documentation, R projects, and R packages developed specifically for reproducibility. The workshop assumes participants have at least basic familiarity with R and RStudio.

Schedule

Table 1 shows the tentative plan for the workshop, which may be adjusted based on the needs of the participants.

Table 1: Tentative schedule for the 4-day workshop

| Session | Topic(s) |
|---------|----------|
| Day 1 | The state of reproducibility in linguistic research |
| Day 2 | (i) Setting up a reproducible project-oriented workflow; (ii) Implementing literate and modular analyses |
| Day 3 | Putting it into practice |
| Day 4 | (i) Publishing analyses; (ii) Conducting peer code review |

Workshop goals

We will discuss and implement the following:

  • reproducibility rates in linguistics
  • project-oriented workflow in R
    • with RProjects
    • folder structure
    • naming conventions
    • using project-relative filepaths with the here package
  • literate programming
    • writing linear code
    • modular analyses
    • dynamic reports with Quarto
  • sharing and checking our code
    • uploading code to an OSF repository
    • conducting a code review

What we will NOT cover:

  • version control (e.g., git/GitHub)
  • learning R/the RStudio environment
  • how to appropriately analyse data (e.g., which analyses to use, etc.)
  • how to produce tables and figures
  • how to write a manuscript in R Markdown

What we might cover if there’s interest and time:

  • project-relative package management with the renv package

How to navigate this website

Each topic is listed in the sidebar in chronological order. Three output formats are available, all with the same content:

  1. HTML page
  2. PDF of content (sub-optimally formatted)
  3. RevealJS slides

The contents were formatted for the slide output. Tables and figures may be too large or too small in the HTML and PDF formats (especially the latter). Each page of the website presents the HTML format. The other two formats can be viewed by clicking on their symbols under ‘Other Formats’ (right sidebar).

Preparation

It is assumed you have at least some basic familiarity with R and RStudio. At the very least, please make sure you have the required software before Day 1. If you have any problems, I can take a look after class on Day 1 (we will begin using the software from Day 2).

Software (before Day 1)

  1. Install or update R
    • N.B., I am currently using version 4.4.0 (Puppy Cup, 2024-04-24), although there is a newer version 4.4.1 (Race for Your Life, 2024-06-14)
    • having an R version from 2022.07 or later should suffice
    • Disclaimer: updating R can interfere with ongoing R projects, most notably because you will need to re-install your packages, and the more recent package versions you install may break existing code. If you are currently in the middle of analysing some data, you may not want to update R right now. In this case, just make a note of which version you’re currently running (e.g., by running R.version in the Console)
  2. Install or update RStudio
    • I am currently using RStudio version 2023.12.1+402, as I encountered issues when updating to 2024.04.2+764 in April when it was released. As a rule of thumb, I update R and/or RStudio a few months after their initial release, and when I know I have time to fix any bugs that might pop up (i.e., I don’t have a looming deadline)

To check which version of R you currently have, run the command R.version$version.string in the Console (to print just the version number and release date), or R.version$nickname (to print the nickname).

In the Console: print R version and release date
R.version$version.string
[1] "R version 4.4.0 (2024-04-24)"
In the Console: print R version nickname
R.version$nickname
[1] "Puppy Cup"

To check which version of RStudio you currently have, go to Help > About RStudio in RStudio. You should see a pop-up like Figure 1.

Figure 1: Help > About RStudio

Additional steps (before Day 4)

Create an OSF account at https://osf.io/register if you don’t have one already.

Suggested readings (before, during, or after the workshop)

There is currently a wealth of literature on reproducibility, covering both meta-science reviews of reproducibility rates and best-practice advice. For a gentle introduction to the latter, I suggest:

  • Nagler, J. (1995). Coding Style and Good Computing Practices. PS: Political Science & Politics, 28(3), 488–492. https://doi.org/10.2307/420315
  • Bowers, J., & Voors, M. (2016). How to improve your relationship with your future self. Revista de Ciencia Política (Santiago), 36(3), 829–848. https://doi.org/10.4067/S0718-090X2016000300011
  • Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. https://doi.org/10.1371/journal.pcbi.1005510
  • Seibold, H. (2024). 6 Steps Towards Reproducible Research (v1 ed.). Zenodo. https://doi.org/10.5281/zenodo.12744715

For a book-length treatment of R-specific reproducible workflows:

  • Rodrigues, B. (2023). Building reproducible analytical pipelines with R. https://raps-with-r.dev/

And for an overview of R-specific data analysis suggestions, I recommend the following online resources:

  • Bryan, J., Hester, J., Pileggi, S., & Aja, D. E. (n.d.). What They Forgot to Teach You About R. Retrieved May 6, 2024, from https://rstats.wtf/
  • Bryan, J., & the STAT 545 TAs. (n.d.). R Basics and workflows. In STAT 545 Course materials. Retrieved May 6, 2024, from https://stat545.com/

For more general discussions on Open Science Practices:

  • Kathawalla, U.-K., Silverstein, P., & Syed, M. (2021). Easing Into Open Science: A Guide for Graduate Students and Their Advisors. Collabra: Psychology, 7(1), 18684. https://doi.org/10.1525/collabra.18684
  • Crüwell, S., Van Doorn, J., Etz, A., Makel, M. C., Moshontz, H., Niebaum, J. C., Orben, A., Parsons, S., & Schulte-Mecklenbeck, M. (2019). Seven Easy Steps to Open Science: An Annotated Reading List. Zeitschrift Für Psychologie, 227(4), 237–248. https://doi.org/10.1027/2151-2604/a000387
Source Code
---
csl: apa-cv.csl
bibliography: references.bib
suppress-bibliography: true
link-citations: false
citations-hover: false
---

# Workshop overview

```{r}
#| eval: false
#| echo: false
# https://www.andrewheiss.com/blog/2023/01/09/syllabus-csl-pandoc/#using-other-styles
# run manually
rbbt::bbt_update_bib(here::here("index.qmd"))
```

This web-book contains slides for the workshop 'Open Science Practices for Linguistic Research: Reproducible Analyses in R', given by Daniela Palleschi at the [Summer School of Linguistics](https://ssol.ff.cuni.cz/) in Budweis, Czechia, in August 2024. The tools discussed are specific to the R environment, but the concepts are universal and programming-language agnostic. The materials are restructured from a semester-length course on the same topic taught at the Humboldt-Universität zu Berlin in the summer semester of 2024; the materials for that course are more exhaustive ([click here](https://daniela-palleschi.github.io/r4repro_SoSe2024/) to see course materials).

Workshop abstract:

> The Open Science movement began as an answer to the replication crisis and aims to encourage transparency across all stages of research. In this workshop, we will focus on practicing transparency in our analyses through reproducibility: what does it mean, why should we practice it, and how can we do it? We will focus on establishing and maintaining a reproducible, project-oriented workflow in the R environment. After the workshop, participants will be able to put reproducibility concepts and tools into practice, such as data management and documentation, R projects, and R packages developed specifically for reproducibility. The workshop assumes participants have at least basic familiarity with R and RStudio.

## Schedule

@tbl-sched shows the tentative plan for the workshop, which may be adjusted based on the needs of the participants.

```{r}
#| echo: false
#| label: tbl-sched
#| tbl-cap: "Tentative schedule for the 4-day workshop"
dplyr::tribble(
  ~"Session", ~"Topic(s)",
  "Day 1", "The state of reproducibility in linguistic research",
  "Day 2", "(i) Setting up a reproducible project-oriented workflow <br> (ii) Implementing literate and modular analyses",
  "Day 3", "Putting it into practice",
  "Day 4", "(i) Publishing analyses<br> (ii) Conducting peer code review"
) |> 
  knitr::kable() 
```

## Workshop goals

We will discuss and implement the following:

- reproducibility rates in linguistics
- project-oriented workflow in R
  + with RProjects
  + folder structure
  + naming conventions
  + using project-relative filepaths with the `here` package
- literate programming
  + writing linear code
  + modular analyses
  + dynamic reports with Quarto
- sharing and checking our code
  + uploading code to an OSF repository
  + conducting a code review
  
What we will *NOT* cover:

- version control (e.g., git/GitHub)
- learning R/the RStudio environment
- how to appropriately analyse data (e.g., which analyses to use, etc.)
- how to produce tables and figures
- how to write a manuscript in R Markdown

What we *might* cover if there's interest and time:

- project-relative package management with the `renv` package
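
As a rough illustration of the project-oriented workflow goals listed above (folder structure, naming conventions, and project-relative filepaths), here is a minimal sketch. The folder names, the file name, and the temporary project root are all hypothetical choices made for this example, not the layout used in the workshop. The paths are built with base R's `file.path()` so the sketch runs without extra packages; inside an RProject, `here::here()` would play the same role.

```r
# Sketch only: folder and file names below are hypothetical.
# A throwaway directory stands in for the project root, so nothing is
# written to your real projects.
project_root <- file.path(tempdir(), "my-project")

# One folder per kind of file (an assumed layout)
folders <- c("data", "scripts", "output")
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# A machine-readable file name: lower case, no spaces, numbered prefix
script_name <- "01_load-data.R"

# Build the path from the project root instead of hard-coding an
# absolute path that only exists on one computer
script_path <- file.path(project_root, "scripts", script_name)
file.create(script_path)

# List the files that now exist, relative to the project root
created <- list.files(project_root, recursive = TRUE)
print(created)

# If the 'here' package is installed, the equivalent project-relative
# call is here::here(), which resolves paths from the project root
if (requireNamespace("here", quietly = TRUE)) {
  print(here::here("scripts", script_name))
}
```

Because every path is assembled from the project root, the same script should keep working when the project folder is moved or shared, which is the main motivation for avoiding absolute paths.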

## How to navigate this website

Each topic is listed in the sidebar in chronological order. Three output formats are available, all with the same content:

1. HTML page
2. PDF of content (sub-optimally formatted)
3. RevealJS slides

The contents were formatted for the slide output. Tables and figures may be too large or too small in the HTML and PDF formats (especially the latter). Each page of the website presents the HTML format. The other two formats can be viewed by clicking on their symbols under 'Other Formats' (right sidebar).

# Preparation

It is assumed you have at least some basic familiarity with R and RStudio. At the very least, please make sure you have the required software before Day 1. If you have any problems, I can take a look after class on Day 1 (we will begin using the software from Day 2).

## Software (before Day 1)

1. [Install or update R](https://www.r-project.org/) 
    + N.B., I am currently using version 4.4.0 (Puppy Cup, 2024-04-24), although there is a newer version 4.4.1 (Race for Your Life, 2024-06-14)
    + having an R version from 2022.07 or later should suffice
    + *Disclaimer*: updating R can interfere with ongoing R projects, most notably because you will need to re-install your packages, and the more recent package versions you install may break existing code. If you are currently in the middle of analysing some data, you may not want to update R right now. In this case, just make a note of which version you're currently running (e.g., by running `R.version` in the Console)
2. [Install or update RStudio](https://posit.co/download/rstudio-desktop/)
    + I am currently using RStudio version 2023.12.1+402, as I encountered issues when updating to 2024.04.2+764 in April when it was released. As a rule of thumb, I update R and/or RStudio a few months after their initial release, and when I know I have time to fix any bugs that might pop up (i.e., I don't have a looming deadline)
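
If you do decide to update, one possible precaution is to record your current setup first. The sketch below is only one way to do this with base R: it saves the R version and the versions of all installed packages to a CSV file. The file name and the use of a temporary folder are assumptions made for the example; in practice you would save the file somewhere permanent.

```r
# Sketch only: the output file name and location are hypothetical choices.
# Collect the name and version of every installed package
pkgs <- as.data.frame(
  installed.packages()[, c("Package", "Version")],
  stringsAsFactors = FALSE
)
rownames(pkgs) <- NULL

# Store the R version alongside the package versions
pkgs$R_version <- R.version$version.string

# Write the record to a file (a temporary folder is used for this demo)
snapshot_file <- file.path(tempdir(), "package-versions_before-update.csv")
write.csv(pkgs, snapshot_file, row.names = FALSE)

# Show the first few rows of the record
head(pkgs, 3)
```

The `renv` package mentioned under the workshop goals is designed to keep this kind of record per project; the snippet above is just a quick manual alternative.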
    
To check which version of R you currently have, run the command `R.version$version.string` in the Console (to print just the version number and release date), or `R.version$nickname` (to print the nickname).

```{r filename="In the Console: print R version and release date"}
R.version$version.string
```

```{r filename="In the Console: print R version nickname"}
R.version$nickname
```
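
The same check can also be done programmatically, for example at the top of a script. The following is a minimal sketch using base R's `getRversion()`; the minimum version `4.2.0` is a placeholder assumed purely for illustration and is not a requirement stated for this workshop.

```r
# Sketch only: the minimum version is an assumed placeholder; substitute
# whatever version your own project needs.
minimum_version <- "4.2.0"

# getRversion() returns the running R version as a comparable object
current_version <- getRversion()

# Compare the running version against the assumed minimum
is_recent_enough <- current_version >= minimum_version

if (is_recent_enough) {
  message("R ", as.character(current_version),
          " meets the assumed minimum of ", minimum_version)
} else {
  message("R ", as.character(current_version),
          " is older than ", minimum_version,
          "; consider updating when you have time to fix any breakage")
}
```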

To check which version of RStudio you currently have, go to `Help > About RStudio` in RStudio. You should see a pop-up like @fig-RStudio.

```{r}
#| label: fig-RStudio
#| fig-cap: "Help > About RStudio"
#| out-width: 60%
#| echo: false

knitr::include_graphics(here::here("media", "about_RStudio.png"))
```

## Additional steps (before Day 4)

Create an OSF account [here](https://osf.io/register) if you don't have one already.

# Suggested readings (before, during, or after the workshop)

There is currently a wealth of literature on reproducibility, covering both meta-science reviews of reproducibility rates and best-practice advice. For a gentle introduction to the latter, I suggest:

- @nagler_coding_1995
- @bowers_how_2016
- @wilson_good_2017
- @seibold_6_nodate

For a book-length treatment of R-specific reproducible workflows:

- @rodrigues_building_nodate

And for an overview of R-specific data analysis suggestions, I recommend the following online resources:

- @bryan_what_nodate
- @bryan_chapter_nodate

For more general discussions on Open Science Practices:

- @kathawalla_easing_2021
- @cruwell_seven_2019