
Reproducibility

Principles and Practice

Author: Daniela Palleschi
Affiliation: Humboldt-Universität zu Berlin
Published: April 30, 2024

Learning Objectives

Today we will learn about…

  • reproducibility rates in linguistics
  • FAIR principles
  • concepts for building a reproducible workflow

Reproducibility

  • generating the same results with the same data and analysis scripts

    • seems obvious, but requires organisation and forethought
  • bare minimum: share the code and the data (Laurinavichyute et al., 2022)

  • rates of reproducibility vary across fields (Bochynska et al., 2023)

    • open access: 25-65%
    • data and analyses sharing: 11-33%
    • pre-registrations: 0-3%
  • what constitutes “reproducibility”?

What should (ideally) be shared?

  • materials
    • protocols
    • stimuli
    • experiment set-up
  • documentation
    • README
    • metadata
  • data
    • raw
      • e.g., text files, audio, video, or images
    • processed
  • analysis code
    • pre-processing
    • analyses
  • materials are helpful for replication
    • but also for inspecting, e.g., the experimental design
  • data and code are necessary for reproducibility
    • along with proper documentation of software used

Reproducibility rates in linguistic research

Figure 1: Source: Bochynska et al. (2023), p. 11 (all rights reserved)
  • meta-analysis of 519 randomly sampled articles from various linguistics journals
    • pre- and post-reproducibility crisis (2008/9, 2018/19) (Bochynska et al., 2023)
    • differentiated between primary (collected for study) and secondary (pre-existing) data
  • reported a post-RC increase in shared materials, data, and analyses
    • but still low rates of each
  • higher rates of secondary data sharing, presumably due to publicly available corpora
  • data shared more often than analyses, pre- and post-RC

Journal of Memory and Language

  • meta-analysis of articles from JML (Laurinavichyute et al., 2022)
    • before and after an Open Science Policy was introduced in 2019
Figure 2: Source: Laurinavichyute et al. (2022), p. 5 (all rights reserved)
  • code and data availability improved
  • but the reproducibility rate ranged from 34% to 56%, depending on the criteria
  • higher rates compared to field-wide meta-analysis (Bochynska et al., 2023)

FAIR principles

  • guidelines for sharing digital resources
  • refers broadly to data, but we’ll consider it in terms of analyses
Figure 3: Source: National Library of Medicine (all rights reserved)
  • findable and accessible refer to where materials are stored
    • in findable repositories
    • that are accessible, i.e., do not require an account
  • interoperable and reusable emphasise the format of data (and code)
    • the importance of future use
    • and use beyond your precise computational environment
  • a great way to test the FAIR principles
    • code review!
    • i.e., have a colleague try to access your data/run your code
      • either via an online repository
      • or send them your project folder

Findable

  • refers to data and supplementary materials

  • materials should have a “persistent identifier”

    • e.g., Digital Object Identifier (DOI) for scholarly articles
  • digital, long-term storage of data

    • not on a personal or professional website
    • GitHub files don’t typically have sufficient metadata
    • ideally: OSF, Zenodo or some other repository
  • in recent papers, an OSF link is typically provided

  • also: discoverable

    • e.g., in data-specific search engines (Google’s Dataset search)

Accessible

  • data (and code) should be
    • machine- and human-readable
    • available on a trusted repository, e.g., the OSF
    • Open Access
      • not behind a paywall
      • nor require a login

Interoperable

  • data (and code) should
    • not depend on a particular operating system
    • nor rely entirely on specific software/package versions
  • easiest workaround:
    • document your software versions (see the sketch below)
    • this doesn’t automatically make your work interoperable
    • but it helps pinpoint where problems come from
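
A minimal sketch of documenting your computational environment at the end of an analysis script; base R’s sessionInfo() is enough, and the sessioninfo package (an assumption, not required here) prints a tidier summary:

  # record the computational environment (OS, R version, package versions)
  sessionInfo()

  # or, if the sessioninfo package is installed:
  # install.packages("sessioninfo")
  sessioninfo::session_info()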

Reusable

  • data (and code) should
    • be reusable for future research
  • data format should be generic
    • i.e., not tied to a specific program
    • for tabular data, I recommend the .csv format (example below)
  • in the context of analyses, ‘reusable’ effectively means ‘reproducible’
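
For instance, a small sketch of saving processed data in a generic format (the file name is just a placeholder; base R is used here, readr::write_csv() would work equally well):

  # write tabular data to a plain .csv file that any software can read
  write.csv(mtcars, "processed_data.csv", row.names = FALSE)

  # read it back in on any machine, with any spreadsheet or stats program
  df <- read.csv("processed_data.csv")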

Task: finding data

Go to datasetsearch.research.google.com/

  • do a search for data related to a topic of interest to you

  • what type of information does the search provide?

  • what type of links?

  • do you find analysis code, or just data?

  • do the same search at osf.io

  • and at zenodo.org/

    • do you get the same number of hits?

Data and code availability

  • “data available upon (reasonable) request”
    • generally not true
  • data was not available in 68% of the most cited psychology studies (2006-2016) (Hardwicke & Ioannidis, 2018)
    • a further 18% were available with restrictions
    • only 11% available without restriction
  • data alone is not sufficient
    • ‘Data Analysis’ sections are rarely exhaustive/unambiguous
    • very difficult to re-create analyses without code
    • e.g., is data trimming explicitly defined?
      • this will even affect descriptive statistics
Figure 4: Source: Hardwicke & Ioannidis (2018), p. 6 (all rights reserved)

Data and code ≠ Reproducibility

  • even including code does not guarantee reproducibility

  • access to data and code does not mean the analyses are reproducible

  • what can go wrong? Examples from Laurinavichyute et al. (2022)

  1. Data problems
    • inaccessible data
    • incomplete data (e.g., only 2 of 3 experiments)
  2. Code problems
    • incomplete code
    • error messages
    • code rot: outdated syntax or environment
    • proprietary software
  3. Documentation problems
    • data difficult to interpret
    • no README file/data dictionary
    • unclear folder/file/variable naming convention
    • manuscript contradicts code
  4. Unclear terms of use
    • no licence specification

Share the code, not just the data

  • Why?
    • key details are often missing from ‘Methods’ sections
  • suggestions for researchers from Laurinavichyute et al. (2022)
  1. Share data in usable form
    • with pre-processing code
  2. Use publicly accessible repositories
    • e.g., OSF
  3. Use non-proprietary data formats
    • e.g., not .xls files (Excel)
  4. Provide documentation
    • e.g., README, data dictionaries
  5. Share code and data
    • they estimate a 38% increase in reproducibility
  6. Teach data management and computing skills
    • that’s what this course is for!

Building a reproducible workflow

  • there are different levels of reproducibility
    • the bare minimum is sharing the code and data
    • and including session information (e.g., via sessionInfo(); see the sketch in the Interoperable section):
      • which operating system was used
      • which software/package versions were used
  • going bigger:
    • project-oriented workflow
    • project-specific filepaths
    • contained in a single project folder
  • we will be using RProjects to achieve this

Project management

  • folder structure
  • project-relative file paths (sketched below)
  • appropriate documentation
    • e.g., README
  • it’s great to map out your project structure early on
    • but it will grow as you go along
    • a reproducible set-up makes it easier to adapt the structure as it grows
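
A minimal sketch of project-relative file paths with the here package (an assumption; paths and file names are placeholders). here::here() resolves paths from the project root, so the script runs on any machine that holds the project folder:

  library(here)

  # build paths from the project root rather than hard-coding "C:/Users/me/..."
  dat <- read.csv(here("data", "raw_data.csv"))
  write.csv(dat, here("data", "processed_data.csv"), row.names = FALSE)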

Literate programming

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

— Knuth (1984), p. 97

  • originally used to refer to writing programs
  • but also applies to analysis code
    • especially if we’re aiming for reproducibility
  • main concepts:

    • code is linear (this pre-dates Knuth, 1984)
    • informative but concise commenting
  • main benefits:

    • facilitates maintenance
    • helpful for future-you, collaborators, etc.

Documentation

  • metadata

    • project README
    • codebook/data dictionary
  • the README should contain

    • a project description
    • relevant links
    • a description of the folder structure
  • it can be updated as the project develops (a minimal sketch follows this list)

  • README.md files on GitHub/GitLab are automatically displayed as the repository description

    • .md is a plain-text format
    • it uses markdown syntax
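
A minimal sketch of writing a README skeleton from R (the headings are just one possible layout; adapt them to your project):

  # create a plain-text README.md at the project root
  writeLines(c(
    "# Project title",
    "",
    "## Description",
    "One-paragraph summary of the study, with links to the OSF project/preprint.",
    "",
    "## Folder structure",
    "- data/: raw and processed data",
    "- scripts/: pre-processing and analysis code",
    "- manuscript/: dynamic report source"
  ), con = "README.md")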

Version control

  • git: local tracking
  • useful for the analysis and writing phases
    • but can be tricky for collaboration
  • GitHub/GitLab: remote tracking
    • commit your changes to your local git repository
    • then push them to your remote repository
  • safeguards against local hardware/software issues
    • e.g., a lost or damaged computer, or lost local files
  • and allows for collaboration and sharing (see the sketch below)
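
A sketch of setting up version control from within R via the usethis package (an assumption; the same can be done from the command line or the RStudio Git pane):

  # install.packages("usethis")
  usethis::use_git()     # initialise a local git repository for the project
  usethis::use_github()  # create and link a remote GitHub repository
                         # (assumes a GitHub account and a configured token)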

Persistent (public) storage

  • GitHub/Lab are sub-optimal
    • developer-focused
    • typically lack thorough documentation/metadata
    • not very user-friendly for non-users
  • OSF, Zenodo
    • Open Science-focused
    • can be linked to a GitHub/Lab repository
    • facilitate thorough documentation
    • user-friendly

Writing

  • dynamic reports with Markdown syntax (a minimal sketch follows this list)

    • e.g., Rmarkdown, Quarto
    • integration of data, code, and prose
      • facilitates cross-referencing within document
      • integration of citation management tools
      • supports LaTeX syntax for example sentences and tables
  • papaja package for APA-formatted Rmarkdown documents

  • challenge: collaboration

    • not all collaborators know these tools
    • a Word-style track-changes workflow is not currently possible
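
To give a feel for a dynamic report, a minimal Quarto sketch (the chunk, object name, and citation key are illustrative):

  ---
  title: "My reproducible report"
  format: html
  bibliography: references.bib
  ---

  ```{r}
  n_obs <- nrow(mtcars)  # analysis code lives in chunks like this
  ```

  The data set contains `r n_obs` observations [@some_citation].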

Setting up a project

  • next week: hands-on
  • required installations/recent versions of:
    • R
      • version 4.4.0, “Puppy Cup”
      • check your current version with R.version (example below)
      • download/update: https://cran.r-project.org/bin/macosx/
    • RStudio
      • version 2023.12.1.402, “Ocean Storm”
      • Help > Check for updates
      • new install: https://posit.co/download/rstudio-desktop/
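
A couple of lines you can paste into the console to check your set-up (the rstudioapi call is an assumption and only works from within RStudio):

  R.version.string                   # e.g., "R version 4.4.0 (2024-04-24)"
  rstudioapi::versionInfo()$version  # RStudio version
  packageVersion("knitr")            # version of any installed package, e.g. knitr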

Learning objectives 🏁

Today we learned…

  • reproducibility rates in linguistics ✅
  • FAIR principles ✅
  • concepts for building a reproducible workflow ✅

References

Bochynska, A., Keeble, L., Halfacre, C., Casillas, J. V., Champagne, I.-A., Chen, K., Röthlisberger, M., Buchanan, E. M., & Roettger, T. B. (2023). Reproducible research practices and transparency across linguistics. Glossa Psycholinguistics, 2(1). https://doi.org/10.5070/G6011239
Corker, K. S. (2022). An Open Science Workflow for More Credible, Rigorous Research. In M. J. Prinstein (Ed.), The Portable Mentor (3rd ed., pp. 197–216). Cambridge University Press. https://doi.org/10.1017/9781108903264.012
Hardwicke, T. E., & Ioannidis, J. P. A. (2018). Populating the Data Ark: An attempt to retrieve, preserve, and liberate data from the most highly-cited psychology and psychiatry articles. PLOS ONE, 13(8), e0201856. https://doi.org/10.1371/journal.pone.0201856
Knuth, D. (1984). Literate programming. The Computer Journal, 27(2), 97–111.
Laurinavichyute, A., Yadav, H., & Vasishth, S. (2022). Share the code, not just the data: A case study of the reproducibility of articles published in the Journal of Memory and Language under the open data policy. Journal of Memory and Language, 125, 12.