Writing Reproducible Code

Literate, linear programming and modularity

Daniela Palleschi

Leibniz-Zentrum Allgemeine Sprachwissenschaft

Thu Oct 17, 2024

Topics

  • modular analyses and literate programming
  • create and render a dynamic report with Quarto
  • documenting your dependencies

Reproducible code

Reproducible code

  • how you write your code is the first step in making it reproducible

  • the first principle is that your code must be linear

    • this means code must be written in a linear fashion
    • i.e., our scripts should run from top-to-bottom
Example
Non-linear code
read_csv(here("data", "my_data.csv"))

library(readr)
library(here)

Writing linear code

  • you need to load a package before you call a function from it
    • if we’re just working in an R session, before means temporally prior
    • with linear code, before means higher up in the script
  • such pre-requisite code must
    1. be present in the script
    2. appear above the first line of code that uses a function from this package
  • missing pre-requisite code might not throw an error message
    • but might produce output we aren’t expecting
    • e.g., forgetting to filter out certain observations
    • or forgetting that some observations have been filtered out

Literate programming

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

Knuth (1984), p. 97

  • our code should also be literate
  • i.e., we should write and document our code so that humans can understand it
    • important for us: we are (generally) not professional programmers, nor are our peers
  • we need to know what our code is doing when we look back at it in the future/share it
  • the easiest way: informative comments
    • the length and frequency of these comments is your choice

Example R script

analysis4.R
library(dplyr)
library(readr)
library(here)

df_phon <- read_csv(here("data", "phoneme_tidy_data.csv"))

summary(df_phone)

plot(df_phon$duration, df_phon$mean_f0)
phoneme_analysis.R
# [example script]
# Analysis script for phoneme paper
# author: Joe DiMaggio
# date: Feb. 29, 2024
# purpose: analyse cleaned dataset

# Set-up ###

# load required packages
library(dplyr) # data wrangling
library(readr) # loding data
library(here) # project-relative file path

# Load-in data
df_phon <- read_csv(here("data", "phoneme_tidy_data.csv"))

# Explore data ###
summary(df_phone)

# scatterplot: phoneme duration by mean f0
plot(df_phon$duration, df_phon$mean_f0)
1
begins with some meta-information about the document, including its purpose
2
heading with three hashtags at the end (###) –> creates structured Outline
3
the purpose of chunks of code are written above
4
description of specific lines of code are also given
  • the metadata, headings, and informative comments in phoneme_analysis.R make the second script much easier to follow
  • this becomes more important with longer, more complex scripts

Modular analyses

  • recall our scripts folder (which you might’ve named analysis or something else)
  • ideally, this would also contain subfolders, one for each stage of your analysis
    • or at least, multiple scripts
  • this is the concept of modularity (Bowers & Voors, 2016; Nagler, 1995)
    • separating data cleaning, pre-processing, recoding, merging, analyses, etc. into files/scripts

Dynamic reports

Dynamic reports: .Rmd and .qmd

  • R scripts are useful, but don’t show the code output
    • and commenting can get clunky
  • dynamic reports combine prose, code, and code output
    • R markdown (.Rmd file extension) and Quarto (.qmd ) are extensions of markdown
      • can embed R code ‘chunks’ in a script, thus producing ‘dynamic’ reports
    • produce a variety of output files which contain text, R code chunks, and the code chunk outputs all in one
  • for example, we can look at the example script phoneme_analysis.R, but we have no idea what the scatterplot it produced looks like

Task: New Quarto document

  1. Navigate to File > New file > Quarto document
  2. Write some title, your name (Author), make sure ‘Visual markdown Editor’ is unchecked
  3. Click ‘Create’
  4. A new tab will open in R Studio. Press the ‘Render’ button above the top of the document, you will be prompted to save the document. Store it in a folder called scripts and save it as 01-literate-programming.qmd.
  5. What happens?

R v. Rmarkdown v. Quarto

  • .R files contain (R) source code only
  • .Rmd files are dynamic reports that support
    • R-Code (and R-packages)
  • .qmd files are dynamic reports (RStudio v2022.07 or later)
    • R-Code (and R-packages)
    • native support for Python (and Jupyter-Notebooks)
    • native support for Julia

Check your RStudio version

Run the following in the Console: RStudio.Version()$version

  • if the output is 2022.07 or higher you can use Quarto
  • if not: update RStudio: Help > Check for updates

YAML

  • the section at the very top fenced by ---
  • contains all the meta information about your document
    • e.g. title, author name, date
    • also formatting information, e.g. type of output file
  • there are many document formatting and customisation options, checkout the Quarto website for more
  • but for example I have many YAML formatting options in the source code of my slides
---
title: "My title"
---

YAML

YAML

  1. change the title if you want to do so.
  2. guess how to add a subtitle (hint: it is similar to adding a title)
  3. add an author, author: ‘firstname lastname’ (see example below)
  4. add a table of contents (Table of Contents = toc) by changing format so that it looks like this:
---
title: "Dynamic reports"
author: "Daniela Palleschi"
format:
  pdf:
    toc: true
---
  1. Render the document. Do you see the changes?

Structure your reports

  • remember to use (sub-)headings (e.g., # Set-up)
    • N.B., you don’t need the 3 hashtags here (only in R scripts)
  • describe the function/purpose at the beginning of the script
  • document your train of thought and findings throughout the script
    • e.g., why are you producing this plot, what does it tell you?
  • give an overview of the findings/end result at the end
  • it’s wise to avoid very long, multi-purpose scripts
    • rule of thumb: one script per product or purpose
    • e.g., data cleaning, exploration, analysis, publication figures, etc.

Code chunks

  • the main benefit of dynamic reports: combining text with code (and code output)
  • R code goes in code chunks:
```{r}
2+2
```
[1] 4
  • to add a code chunk: Code > Insert Chunk
    • or use the keyboard shortcut: Cmd+Opt+I (Mac) / Ctrl+Alt+I (Windows)

Adding content

Adding structure and code chunks

  1. Use the example R script above to create a structured document
    • use headings (#) and subheadings (##) accordingly
  2. Load in our dataset in a code chunk
  3. Render the document. Do you see the changes?

Documenting package dependencies

  • R and R package versions are both open source, and are frequently updated
    • you might’ve run your code using dplyr version 1.1.0 or later, which introduced the .by per-operation grouping argument
    • what happens when somebody who has an older version of dplyr tries to run your code?
      • They won’t be able to!
    • the reverse of this situation is more common:
      • a newer version of a package no longer supports a deprecated function or argument

Session info

  • so, print your session info at the end of every script
    • this will print your R version, package versions, and more
sessionInfo()
  • with dynamic reports: this will be produced the output
    • for R scripts: you can save the info as an object and save it as an RDS file (I recommend saving it alongside the relevant script, with the same name plus session_info or something of the like)
my_session <- sessionInfo()
saveRDS(my_session, file = here("scripts", "03-analyses", "phoneme_analyses-session_info.rds"))
  • or run it, copy-and-paste the output in the script, and comment it all out

Tips and tricks

  • when you start a new script make sure you always start with a clean R environment: Session > Restart R or Cmd/Ctrl+Shift+0
    • this means no packages, data, functions, or any other dependencies are loaded
  • at the top of your script, always load packages required below
    • you can always add more packages to the list as you add to your script
  • Render/Knit often: when you make changes to your script make sure you re-render your document
    • checks you haven’t introduced any errors
    • easier to troubleshoot if smaller changes have been made
  • if you can run your script manually from source but it won’t render, restart your R session and see if you can still run it from source
    • often the problem is some dependency in your environment that is not linearly introduced in the script

Hands-on: working with Quarto

Follow the instructions on the workshop website: Hands-on: working with Quarto

Topics 🏁

  • modular analyses and literate programming ✅
  • create and render a dynamic report with Quarto ✅
  • documenting your dependencies ✅

Session Info

My session info.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.2         knitr_1.47        rlang_1.1.4      
 [5] xfun_0.45         renv_1.0.7        jsonlite_1.8.8    glue_1.7.0       
 [9] rprojroot_2.0.4   htmltools_0.5.8.1 hms_1.1.3         fansi_1.0.6      
[13] rmarkdown_2.27    evaluate_0.24.0   tibble_3.2.1      tzdb_0.4.0       
[17] fastmap_1.2.0     yaml_2.3.8        lifecycle_1.0.4   compiler_4.4.1   
[21] pkgconfig_2.0.3   here_1.0.1        rstudioapi_0.16.0 digest_0.6.35    
[25] R6_2.5.1          readr_2.1.5       utf8_1.2.4        pillar_1.9.0     
[29] magrittr_2.0.3    tools_4.4.1      

References

Bowers, J., & Voors, M. (2016). How to improve your relationship with your future self. Revista de Ciencia Política (Santiago), 36(3), 829–848. https://doi.org/10.4067/S0718-090X2016000300011
Knuth, D. (1984). Literate programming. The Computer Journal, 27(2), 97–111.
Nagler, J. (1995). Coding Style and Good Computing Practices. PS: Political Science & Politics, 28(3), 488–492. https://doi.org/10.2307/420315