Open Science Practices for Linguistic Research: Reproducible Analyses in R (SSOL 2024 Workshop)
  • D. Palleschi

Building a reproducible workflow in R

Project-oriented workflow

Author: Daniela Palleschi

Affiliation: Humboldt-Universität zu Berlin

Workshop Day 1: Wed Aug 21, 2024

Last Modified: Wed Aug 21, 2024

Learning Objectives

Today we will learn…

  • about reproducibility practices beyond sharing code and data
  • about project-oriented workflows
  • what we will cover in this workshop

1 Building a reproducible workflow in R

  • we now know some important principles of a reproducible workflow
    • and that ‘reproducibility’ is not black-and-white
    • but even the reproducibility spectrum is an oversimplification (Peng, 2011)
  • some additional resources that provide a list of tips include:
    • Bowers & Voors (2016); Nagler (1995); Wilson et al. (2017); Corker (2022)

1.1 Broadening the reproducibility spectrum

  • there are different levels of reproducibility
    • the bare minimum is sharing the code and data
    • and including session information:
      • which operating system was used
      • which software/package versions were used
  • going further:
    • a project-oriented workflow
    • project-relative file paths
    • everything contained in a single project folder
  • we will be using RProjects to achieve this
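
Session information can be recorded from within R itself. The following is a minimal sketch using base R's `sessionInfo()`; the file name `session_info.txt` and the temporary folder are illustrative assumptions, and in a real project the file would sit inside the project folder.

```r
# Capture the current session: R version, operating system, attached packages
info <- sessionInfo()

# The two pieces of information listed above
r_version <- info$R.version$version.string  # which R version was used
os_name   <- info$running                   # which operating system was used

# Save the full report alongside the analysis output
# (assumption: a temporary folder stands in for the project folder here)
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(print(info)), out_file)
```

Sharing this file together with the code and data lets others see which software versions produced the reported results.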

1.2 Project management

  • folder structure
  • project-relative file paths
  • appropriate documentation
    • e.g., README
  • it’s great to map out your project structure early on
    • but it will grow as you go along
    • reproducible principles facilitate adapting as it grows
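
As a rough illustration of a folder structure with project-relative file paths, the sketch below builds a small project in base R. The folder names (`data`, `scripts`, `output`), the project name `my-project`, and the toy data are all assumptions chosen for the example, not a required layout.

```r
# Assumption: a temporary directory stands in for the project root;
# the folder names below are one possible layout, not a fixed standard
project_root <- file.path(tempdir(), "my-project")

# Create the project folder and its subfolders
folders <- c("data", "scripts", "output")
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Build paths from the project root instead of hard-coding
# machine-specific absolute paths
data_path <- file.path(project_root, "data", "results.csv")

# Write and re-read a toy data set using that path
toy_data <- data.frame(id = 1:3, score = c(10, 20, 30))
write.csv(toy_data, data_path, row.names = FALSE)
results <- read.csv(data_path)
```

The `here` package offers `here::here("data", "results.csv")` for the same purpose: it resolves paths from the project root, so the code keeps working when the project folder is moved to another machine.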

Naming conventions

  • there are some “rules” for naming files and folders
    • The Turing Way: Naming files, folders, and other things
    • Jenny Bryan: naming things (Reproducible Science Workshop 2015)
  1. Avoid special characters
    • ensures machine readability
  2. Make names concise but meaningful
    • ensures human-readability
  3. Avoid spaces
    • try CamelCase, snake case (snake_case), or skewer case (skewer-case)
    • or use hyphens (-) to separate chunks, and underscores (_) to connect words of the same chunk
  4. Consider default ordering
    • e.g., with dates: YYYY-MM-DD
    • with folders or files: numerical prefixes (e.g., 01-data_cleaning.R, 02-data_visualisation.R)
  5. Be consistent
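
These rules can be checked and demonstrated in a few lines of R. The helper `is_clean_name()` below is hypothetical (it is not part of any package), and the pattern it uses is just one way to express "no spaces or special characters".

```r
# Hypothetical helper: TRUE if a file name contains only letters, digits,
# dots, underscores, and hyphens (so no spaces or special characters)
is_clean_name <- function(file_name) {
  grepl("^[A-Za-z0-9._-]+$", file_name)
}

ok_name  <- is_clean_name("01-data_cleaning.R")      # follows the rules
bad_name <- is_clean_name("my analysis (final!).R")  # spaces, brackets, "!"

# Numerical prefixes make scripts sort in the order they should be run
scripts <- c("02-data_visualisation.R", "01-data_cleaning.R")
sorted_scripts <- sort(scripts)

# YYYY-MM-DD dates make alphabetical order match chronological order
dated <- c("2024-08-21_notes.md", "2023-12-01_notes.md", "2024-01-05_notes.md")
sorted_dated <- sort(dated)
```

Here `ok_name` is `TRUE` and `bad_name` is `FALSE`, and both sorted vectors come out in the intended order without any extra sorting logic.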

1.3 Literate programming

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

— Knuth (1984), p. 97

  • originally used to refer to writing programs
  • but also applies to analysis code
    • especially if we’re aiming for reproducibility
  • main concepts:

    • code is linear (this pre-dates Knuth, 1984)
    • informative but concise commenting
  • main benefits:

    • facilitates maintenance
    • helpful for future-you, collaborators, etc.
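
A short sketch of what linear, concisely commented analysis code might look like. The data frame is a toy example with made-up values standing in for real experimental data, and the variable names are assumptions.

```r
# 1. Load: toy reaction-time data (hypothetical values)
rt_data <- data.frame(
  participant = c("p01", "p01", "p02", "p02"),
  condition   = c("a", "b", "a", "b"),
  rt_ms       = c(512, 640, 498, NA)
)

# 2. Clean: drop trials with a missing reaction time
rt_clean <- rt_data[!is.na(rt_data$rt_ms), ]

# 3. Summarise: mean reaction time per condition
rt_means <- aggregate(rt_ms ~ condition, data = rt_clean, FUN = mean)
```

Each step depends only on the steps above it, so the script can be read and run from top to bottom, and each comment says why the step exists rather than restating the code.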

1.4 Documentation

  • metadata

    • project README
    • codebook/data dictionary
  • README should contain

    • a project description
    • relevant links
    • description of folder structure
  • the README can be updated as the project develops

  • on GitHub/GitLab, a README.md file in the repository root is automatically displayed as the project description

    • .md marks a plain-text document
    • written in Markdown syntax
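
A README skeleton covering the three elements listed above can be generated from R. This is only a sketch: the project title, section names, link placeholders, and folder names are hypothetical and meant to be replaced.

```r
# Hypothetical project details; replace every entry with your own
readme_lines <- c(
  "# My project",
  "",
  "## Description",
  "A short summary of the research question and the analyses.",
  "",
  "## Links",
  "- Preregistration: <add link>",
  "- Data repository: <add link>",
  "",
  "## Folder structure",
  "- `data/`: raw and processed data",
  "- `scripts/`: analysis scripts, numbered in run order",
  "- `output/`: figures and tables"
)

# Assumption: a temporary folder stands in for the project root
readme_path <- file.path(tempdir(), "README.md")
writeLines(readme_lines, readme_path)
```

Because the file is plain text, it can be opened and edited in any editor as the project develops.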

1.5 Version control (not covered in this workshop)

  • Git: local tracking
  • useful for the analysis and writing phases
    • but can be tricky for collaboration
  • GitHub/GitLab: remote tracking
    • commit your changes to your local Git repository
    • then push them to your remote repository
  • safeguards against local hardware/software issues
    • lost or damaged computer or local files
  • and allows for collaboration or sharing

1.6 Persistent (public) storage

  • GitHub/GitLab are sub-optimal for this purpose
    • developer-focused
    • typically lack thorough documentation/metadata
    • not very user-friendly for people unfamiliar with these platforms
  • OSF, Zenodo
    • Open Science-focused
    • can be linked to a GitHub/GitLab repository
    • facilitate thorough documentation
    • user-friendly

1.7 Writing (not covered in this workshop)

  • dynamic reports with Markdown syntax

    • e.g., R Markdown, Quarto
    • integration of data, code, and prose
      • facilitates cross-referencing within document
      • integration of citation management tools
      • supports LaTeX syntax for example sentences and tables
  • papaja package for APA-formatted R Markdown documents

  • challenge: collaboration

    • not all collaborators know these tools
    • tracked changes (as in a word processor) are not currently possible

2 Setting up a project

  • tomorrow: hands-on
  • required installations/recent versions of:
    • R
      • preferably version 4.4.0, “Puppy Cup”
      • check current version with R.version
      • download/update: https://cran.r-project.org/bin/macosx/ (macOS; other platforms via https://cran.r-project.org/)
    • RStudio
      • preferably version 2023.12.1+402, “Ocean Storm”
      • Help > Check for updates
      • new install: https://posit.co/download/rstudio-desktop/
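
The version checks can also be run from the R console. A minimal sketch: the comparison against `4.4.0` follows the preferred version named above, and the RStudio part assumes the `rstudioapi` package is installed and is skipped otherwise.

```r
# Print the full version string of the running R installation
R.version.string

# Compare the running version with the version preferred for this workshop
r_is_recent <- getRversion() >= "4.4.0"
if (!r_is_recent) {
  message("R is older than 4.4.0; consider updating before the hands-on session.")
}

# Inside RStudio only: report the RStudio version
# (assumption: the rstudioapi package is installed; skipped otherwise)
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  print(rstudioapi::versionInfo()$version)
}
```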

Learning objectives 🏁

Today we learned…

  • about reproducibility practices beyond sharing code and data ✅
  • about project-oriented workflows ✅
  • what we will cover in this workshop ✅

References

Bowers, J., & Voors, M. (2016). How to improve your relationship with your future self. Revista de Ciencia Política (Santiago), 36(3), 829–848. https://doi.org/10.4067/S0718-090X2016000300011
Corker, K. S. (2022). An Open Science Workflow for More Credible, Rigorous Research. In M. J. Prinstein (Ed.), The Portable Mentor (3rd ed., pp. 197–216). Cambridge University Press. https://doi.org/10.1017/9781108903264.012
Knuth, D. (1984). Literate programming. The Computer Journal, 27(2), 97–111.
Nagler, J. (1995). Coding Style and Good Computing Practices. PS: Political Science & Politics, 28(3), 488–492. https://doi.org/10.2307/420315
Peng, R. D. (2011). Reproducible Research in Computational Science. Science, 334(6060), 1226–1227. https://doi.org/10.1126/science.1213847
Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. https://doi.org/10.1371/journal.pcbi.1005510
Source Code
---
title: "Building a reproducible workflow in R"
subtitle: "Project-oriented workflow"
author: "Daniela Palleschi"
institute: "Humboldt-Universität zu Berlin"
lang: en
date: 2024-08-21
date-format: "ddd MMM D, YYYY"
date-modified: last-modified
language: 
  title-block-published: "Workshop Day 1"
  title-block-modified: "Last Modified"
format: 
  html:
    number-sections: true
    number-depth: 2
    toc: true
    code-overflow: wrap
    code-tools: true
    embed-resources: false
  pdf:
    toc: true
    number-sections: false
    colorlinks: true
    code-overflow: wrap
  revealjs:
    footer: "SSOL 2024"
    include-in-header: ../../mathjax.html # for multiple
    output-file: R-workflow_slides.html
    code-overflow: wrap
    theme: [dark]
    width: 1600
    height: 900
    progress: true
    scrollable: true
    # smaller: true
    slide-number: c/t
    code-link: true
    incremental: true
    # number-sections: true
    toc: false
    toc-depth: 2
    toc-title: 'Overview'
    navigation-mode: linear
    controls-layout: bottom-right
    fig-cap-location: top
    font-size: 0.6em
    slide-level: 4
    embed-resources: false
    fig-align: center
    fig-dpi: 300
editor_options: 
  chunk_output_type: console
bibliography: references.bib
csl: ../../../apa.csl
---

```{r setup, eval = TRUE, echo = FALSE}
knitr::opts_chunk$set(echo = TRUE,     # show chunk code in the output?
                      eval = TRUE,     # run chunks?
                      error = FALSE,   # keep rendering and print the error when a chunk fails?
                      warning = FALSE, # print warnings?
                      message = FALSE, # print messages?
                      cache = FALSE    # cache chunk results? be careful with this!
                      )
```

```{r}
#| echo: false
source(here::here("functions", "print_image.R"))
```


```{r}
#| echo: false
#| eval: false
# run manually
rbbt::bbt_update_bib(here::here("slides", "day1", "R-workflow", "R-workflow.qmd"))
```

# Learning Objectives {.unnumbered .unlisted}

Today we will learn...

-   about reproducibility practices beyond sharing code and data
-   about project-oriented workflows
-   what we will cover in this workshop


# Building a reproducible workflow in R

- we now know some important principles of a reproducible workflow
  + and that 'reproducibility' is not black-and-white
  + but even the reproducibility spectrum is an oversimplification [@peng_reproducible_2011]

- some additional resources that provide a list of tips include:
  - @bowers_how_2016; @nagler_coding_1995; @wilson_good_2017; @corker_open_2022

## Broadening the reproducibility spectrum

-   there are different levels of reproducibility
    -   the *bare minimum* is sharing the code and data
    -   *and* including session information:
        -   which operating system was used
        -   which software/package versions were used
-   going further:
    -   a project-oriented workflow
    -   project-relative file paths
    -   everything contained in a single project folder
-   we will be using RProjects to achieve this

## Project management

-   folder structure
-   project-relative file paths
-   appropriate documentation
    -   e.g., README
-   it's great to map out your project structure early on
    -   but it will grow as you go along
    -   reproducible principles facilitate adapting as it grows

::: {.content-visible when-format="revealjs"}
### Naming conventions {.smaller}
:::
::: {.content-visible unless-format="revealjs"}
### Naming conventions
:::

- there are some "rules" for naming files and folders
    + [The Turing Way: Naming files, folders, and other things](https://the-turing-way.netlify.app/project-design/filenaming.html)
    + [Jenny Bryan: naming things (Reproducible Science Workshop 2015)](https://speakerdeck.com/jennybc/how-to-name-files)

::: columns

::: {.column width="50%"}

1. Avoid special characters
    + ensures machine readability
2. Make names concise but meaningful
    + ensures human-readability
3. Avoid spaces
    + try `CamelCase`, snake case (`snake_case`), or skewer case (`skewer-case`)
    + or use hyphens (`-`) to separate chunks, and underscores (`_`) to connect words of the same chunk

:::

::: {.column width="50%"}
4. Consider default ordering
    + e.g., with dates: `YYYY-MM-DD`
    + with folders or files: numerical prefixes (e.g., `01-data_cleaning.R`, `02-data_visualisation.R`)
5. Be *consistent*

:::

:::

## Literate programming

> Instead of imagining that our main task is to instruct a *computer* what to do, let us concentrate rather on explaining to *human beings* what we want a computer to do.

--- @Knuth_literate_1984, p. 97

::: columns
::: {.column width="100%"}
-   originally used to refer to writing programs
-   but also applies to analysis code
    -   especially if we're aiming for reproducibility
:::

::: {.column width="50%"}
-   main concepts:

    -   code is linear [this pre-dates @Knuth_literate_1984]
    -   informative but concise commenting
:::

::: {.column width="50%"}
-   main benefits:

    -   facilitates maintenance
    -   helpful for future-you, collaborators, etc.
:::
:::

## Documentation

-   metadata
    -   project README
    -   codebook/data dictionary

-   README should contain
    -   a project description
    -   relevant links
    -   description of folder structure

-   the README can be updated as the project develops

-   on GitHub/GitLab, a `README.md` file in the repository root is automatically displayed as the project description
    -   `.md` marks a plain-text document
    -   written in Markdown syntax

## Version control (not covered in this workshop)

-   Git: local tracking
-   useful for the analysis and writing phases
    -   but can be tricky for collaboration
-   GitHub/GitLab: remote tracking
    -   commit your changes to your local Git repository
    -   then push them to your remote repository
-   safeguards against local hardware/software issues
    -   lost or damaged computer or local files
-   and allows for collaboration or sharing

## Persistent (public) storage

-   GitHub/GitLab are sub-optimal for this purpose
    -   developer-focused
    -   typically lack thorough documentation/metadata
    -   not very user-friendly for people unfamiliar with these platforms
-   OSF, Zenodo
    -   Open Science-focused
    -   can be linked to a GitHub/GitLab repository
    -   facilitate thorough documentation
    -   user-friendly

## Writing (not covered in this workshop)

-   dynamic reports with Markdown syntax

    -   e.g., R Markdown, Quarto
    -   integration of data, code, and prose
        -   facilitates cross-referencing within document
        -   integration of citation management tools
        -   supports LaTeX syntax for example sentences and tables

-   `papaja` package for APA-formatted R Markdown documents

-   challenge: collaboration

    -   not all collaborators know these tools
    -   tracked changes (as in a word processor) are not currently possible


# Setting up a project

-   tomorrow: hands-on
-   required installations/recent versions of:
    -   R
        -   preferably version `4.4.0`, "Puppy Cup"
        -   check current version with `R.version`
        -   download/update: <https://cran.r-project.org/bin/macosx/> (macOS; other platforms via <https://cran.r-project.org/>)
    -   RStudio
        -   preferably version `2023.12.1+402`, "Ocean Storm"
        -   Help \> Check for updates
        -   new install: <https://posit.co/download/rstudio-desktop/>

# Learning objectives 🏁 {.unnumbered .unlisted visibility="uncounted"}

Today we learned...

-   about reproducibility practices beyond sharing code and data ✅
-   about project-oriented workflows ✅
-   what we will cover in this workshop ✅

# References {.unlisted .unnumbered visibility="uncounted"}

---
nocite: |
  @seibold_6_nodate
---

::: {#refs custom-style="Bibliography"}
:::