R for Reproducibility
Writing Reproducible Code


Literate, linear programming

Author: Daniela Palleschi
Affiliation: Humboldt-Universität zu Berlin
Published: May 14, 2024

Learning objectives

  • learn what literate programming is
  • create and render a dynamic report with Quarto
  • load data
  • include a table and figure

Reproducible code

  • how you write your code is the first step in making it reproducible

  • the first principle is that your code must be linear

    • this means every line of code may only depend on code that appears above it
    • this is because we typically run a script from top-to-bottom
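Example (non-linear code, shown as a counter-example)
# the function is called before its packages are loaded,
# so running this from top to bottom in a fresh session fails
read_csv(here("data", "my_data.csv"))

library(readr)
library(here)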

Writing linear code

  • you need to load a package before you call a function from it
    • if we’re just working in an R session, before means temporally prior
    • with linear code, before means higher up in the script
  • such pre-requisite code must
    1. be present in the script
    2. appear above the first line of code that uses a function from this package
  • missing pre-requisite code might not throw an error message
    • but might produce output we aren’t expecting
    • e.g., forgetting to filter out certain observations
    • or forgetting that some observations have been filtered out
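
The silent failure described above can be sketched with a toy example. The data, the column names, and the 2000 ms cut-off are all hypothetical and only serve to show how a missing filtering step changes a result without raising an error:

```r
# Toy reaction-time data (hypothetical values, in milliseconds)
df_rt <- data.frame(
  participant = c("p1", "p2", "p3", "p4"),
  rt = c(350, 420, 5000, 390)
)

# Pre-requisite step: remove implausibly slow responses (assumed cut-off: 2000 ms)
df_rt_clean <- df_rt[df_rt$rt < 2000, ]

# With the filter in place, the mean reflects only the plausible responses
mean_clean <- mean(df_rt_clean$rt)

# If the filter line is missing (or sits further down the script),
# the same call on the raw data runs without error but gives a different number
mean_raw <- mean(df_rt$rt)

print(c(clean = mean_clean, raw = mean_raw))
```

Neither call to `mean()` throws an error, which is why the order and presence of pre-requisite steps has to be checked by reading the script from top to bottom.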

Literate programming

  • introduced in 1984 by Donald Knuth (Knuth, 1984)

  • refers to writing and documenting our code so that humans can understand it

    • important for us: we are (generally) not professional programmers, nor are our peers
  • our code needs to be understandable not only to us when we look back at it in the future, but also to anyone we share it with

  • the easiest way: informative comments

    • the length and frequency of these comments are your choice

Example R script

Example
# Analysis script for phoneme paper
# author: Joe DiMaggio
# date: Feb. 29, 2024
# purpose: analyse cleaned dataset

# Set-up ####

# load required packages
library(dplyr)
library(readr)
library(ggplot2)
library(lme4)
library(broom.mixed) # tidy model summaries
library(ggeffects) # model predictions
library(here) # project-relative file path

# load-in data
df_phon <- read_csv(here("data", "phoneme_tidy_data.csv"))

# Explore data ####

  • begins with some meta-information about the document, including its purpose
    • aids in knowing which scripts to run in which sequence
  • there are four hashtags after some headings (####)
    • this is helpful because RStudio treats comment lines that end in at least four hashtags as section headings and lists them in the document outline
  • the purpose of each chunk of code is written above it
    • descriptions of specific lines of code are also given

Dynamic reports

  • R scripts are useful, but don’t show the code output
    • and commenting can get clunky
  • dynamic reports combine prose, code, and code output
    • R Markdown (.Rmd file extension) and Quarto (.qmd) are extensions of Markdown
      • can embed R code ‘chunks’ in a script, thus producing ‘dynamic’ reports
    • produce a variety of output files which contain text, R code chunks, and the code chunk outputs all in one
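
To make the structure of a dynamic report concrete, here is a minimal sketch that writes a small Quarto file from R. The file name, title, and chunk contents are hypothetical. The file is written to a temporary folder so the sketch has no side effects on a project:

```r
# Build a minimal Quarto report (.qmd); all names and contents are hypothetical
fence <- strrep("`", 3) # three backticks open and close a code chunk

report_lines <- c(
  "---",                           # YAML header: document metadata
  "title: \"Phoneme analysis\"",
  "format: html",
  "---",
  "",
  "# Purpose",                     # prose, written in Markdown
  "",
  "This report summarises a toy vector of reaction times.",
  "",
  paste0(fence, "{r}"),            # an R code chunk
  "rt <- c(350, 420, 390)",
  "mean(rt)",
  fence,
  "",
  "# Session info",
  "",
  paste0(fence, "{r}"),
  "sessionInfo()",
  fence
)

report_path <- file.path(tempdir(), "my_report.qmd")
writeLines(report_lines, report_path)

# Rendering requires a Quarto installation, so it is left commented out here:
# quarto::quarto_render(report_path)
```

When rendered, the output document would contain the prose, the code in each chunk, and the output of that code in one place.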

Structure your reports

  • describe the function/purpose at the beginning

  • document your train of thought and findings throughout the script

    • e.g., why are you producing this plot, what does it tell you?
  • give an overview of the findings/end result at the end

  • it’s wise to avoid very long, multi-purpose scripts

    • rule of thumb: one script per product or purpose
    • e.g., data cleaning, exploration, analysis, publication figures, etc.

Session Information

  • R and R packages are open source and are frequently updated
    • you might’ve run your code using dplyr version 1.1.0 or later, which introduced the .by per-operation grouping argument
    • what happens when somebody who has an older version of dplyr tries to run your code?
      • They won’t be able to!
    • the reverse of this situation is more common:
      • a newer version of a package no longer supports a deprecated function or argument
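
One way to make a version requirement explicit is to check it at the top of a script. The helper below is a hypothetical sketch (`has_min_version()` is not part of any package); it relies only on the base R functions `requireNamespace()` and `packageVersion()`:

```r
# Hypothetical helper: is `pkg` installed in at least version `min_version`?
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    packageVersion(pkg) >= min_version
}

# The `.by` argument needs dplyr 1.1.0 or later
dplyr_ok <- has_min_version("dplyr", "1.1.0")

if (!dplyr_ok) {
  message("This script assumes dplyr >= 1.1.0 (it uses the `.by` argument).")
}
```

A check like this does not replace recording the session information, but it tells the reader early on which version the code was written for.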

Printing session info

  • so, print your session info at the end of every script!
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.0    fastmap_1.2.0     cli_3.6.2        
 [5] htmltools_0.5.8.1 tools_4.4.0       rstudioapi_0.16.0 yaml_2.3.8       
 [9] rmarkdown_2.27    knitr_1.47        jsonlite_1.8.8    xfun_0.44        
[13] digest_0.6.35     rlang_1.1.4       renv_1.0.7        evaluate_0.23    
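
Beyond printing it, the session information can also be stored next to the results. A minimal sketch, assuming a hypothetical file name and writing to a temporary folder (in a real project the path would point into the project directory):

```r
# Save the session information to a text file (hypothetical file name)
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# The saved file can be read back in and inspected later
saved_info <- readLines(session_file)
head(saved_info, 3)
```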

References

Knuth, D. (1984). Literate programming. The Computer Journal, 27(2), 97–111.