Reproducible Workflow in R (ZAS Workshop)

R-Projects

Creating a project-oriented workflow in R

Author: Daniela Palleschi
Affiliation: Leibniz-Zentrum Allgemeine Sprachwissenschaft

Workshop Day 1: Tue Oct 8, 2024
Last Modified: Mon Oct 7, 2024

Topics

  • Project-oriented workflows
  • creating an R-Project
  • project-relative filepaths with the here package

1 Installation requirements

  • required installations/recent versions of:
    • R
      • at least version 4.4.0, “Puppy Cup”
      • check current version with R.version
      • download/update: https://cran.r-project.org/ (macOS installers: https://cran.r-project.org/bin/macosx/)
    • RStudio
      • at least version 2023.12.1.402, “Ocean Storm”
      • Help > Check for updates
      • new install: https://posit.co/download/rstudio-desktop/
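
The version check mentioned above can also be scripted. The following is a minimal sketch in base R; the names `min_version` and `r_is_recent` are illustrative and not part of the workshop materials.

```r
# Print the full version string of the running R installation
R.version.string

# Compare the running version against the workshop minimum
min_version <- "4.4.0"
r_is_recent <- getRversion() >= min_version
r_is_recent
```

If `r_is_recent` is `FALSE`, update R before continuing.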

2 Project-oriented workflow

  1. Folder structure:
    • keeping everything related to a project in one place
    • i.e., contained in a single folder, with subfolders as needed
  2. Project-relative working directory
    • the project folder should act as your working directory
    • all file paths should be relative to this folder

2.1 Folder structure

  • a core computer literacy skill
    • keep your Desktop as empty as possible
    • have a sensible folder structure
    • avoid mixing subfolders and files
      • i.e., if a folder contains subfolders, ideally it should not contain files

3 R-Projects

  • in data analysis, using an IDE is beneficial
    • e.g., RStudio
  • most IDEs have their own implementation of a Project
  • in RStudio, this is the R-Project
    • creates a .Rproj file in a project folder
    • stores project settings
  • you can have several R-Projects open simultaneously
    • and run several scripts across projects simultaneously
  • most importantly, R-Projects (can) centralise a specific project’s workflow and file path
  • to read more about R-Projects, see Section 6.2: Projects (https://r4ds.hadley.nz/workflow-scripts.html#projects) in Wickham et al. (2023), or Ch. 8 - Workflow: Projects (https://r4ds.had.co.nz/workflow-projects.html) in Wickham & Grolemund (2016)

3.1 Creating a new Project

  • when?
    • whenever you’re starting a new course or project which will use R
  • why?
    • to keep all the relevant materials in one place
  • where?
    • somewhere that makes sense, e.g., a folder called SoSe2024 or Mastersarbeit
  • how?
    • File > New Project > New Directory > New Project > [Directory name] > Create Project

New R-Project

Create a new R-Project for this workshop

  • File > New Project > New Directory > New Project > [Directory name] > Create Project
  • make sure you choose a sensible location

3.2 Opening a Project

  • to open a project, locate its .Rproj file and double-click
  • or if you’re already in RStudio, you can use the Project (None) drop-down (top right)
Figure 1: Double-click .Rproj
Figure 2: Open from RStudio

3.3 Adding a README file

  • File > New File > Markdown File (not R Markdown!)
    • add some text describing the purpose of this project
    • include your name, the date
    • use Markdown formatting (e.g., # for headings, *italics*, **bold**)
  • save as README.md in your project directory
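
If you prefer to create the README from the Console, a sketch along these lines would work, assuming the working directory is the project folder. The title, author and purpose strings are placeholders to replace with your own.

```r
# Placeholder content; replace title, name and purpose with your own
readme_lines <- c(
  "# My project title",
  "",
  "*Author*: Your Name",
  paste0("*Date*: ", format(Sys.Date(), "%Y-%m-%d")),
  "",
  "**Purpose**: describe what this project is for."
)

# Write the file only if it does not exist yet, so nothing is overwritten
if (!file.exists("README.md")) {
  writeLines(readme_lines, "README.md")
}
```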

3.4 Global RStudio options

Figure 3: RStudio settings for reproducibility
  • Tools > Global Options
    • Workspace: Restore .RData into workspace at startup: NO
    • Save workspace to .RData on exit: Never
  • this will ensure that you are always starting with a clean slate
    • and that your code is not dependent on some package or object you loaded or created in another session
  • this is also how RMarkdown and Quarto scripts run
    • they start with an empty environment and run the script linearly
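
One way to see whether a session really starts from a clean slate is to look for leftover objects and a saved workspace file. This is only a sketch in base R, assuming the working directory is the project folder; the object names are illustrative.

```r
# Objects in the global environment; this is empty in a fresh session
leftover_objects <- ls(envir = globalenv())
leftover_objects

# A .RData file in the project folder is what would be restored at
# startup if the "Restore .RData" option were still switched on
has_saved_workspace <- file.exists(".RData")
has_saved_workspace
```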

Global settings

Change your Global Options so that

  • Workspace: Restore .RData into workspace at startup: NO
  • Save workspace to .RData on exit: Never

3.5 Identifying your R-Project

  • there are a few ways to check which (if any) R-Project you’re in
    • there are 6 differences between Figure 4 and Figure 5
    • which is in an R-Project session?
Figure 4: RStudio Session A
Figure 5: RStudio Session B
Figure 6: How to tell if you’re in a project
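
The same check can be made from the Console. As a sketch in base R: when RStudio opens an R-Project, the working directory is set to the project folder, so that folder should contain a `.Rproj` file. The variable names are illustrative.

```r
# Where is the current session working?
current_dir <- getwd()
current_dir

# Look for a .Rproj file in the working directory
rproj_files <- list.files(current_dir, pattern = "\\.Rproj$")
in_project_folder <- length(rproj_files) > 0
in_project_folder
```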

3.6 Folder structure

  • some folders you’ll typically want to have:
    • data: containing your dataset(s)
    • scripts (or analyses, etc.): containing any analysis scripts
    • manuscript: containing any write-ups of your results
    • materials: containing relevant experiment materials (e.g., stimuli)
  • let’s just create the first 2 (data and scripts)
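
The two folders can be created in your file browser, in the RStudio Files pane, or from the Console. A minimal sketch in base R, assuming the working directory is the project folder:

```r
folders <- c("data", "scripts")

for (folder in folders) {
  # showWarnings = FALSE keeps this quiet if the folder already exists
  dir.create(folder, showWarnings = FALSE)
}

# Confirm that both folders are there
dir.exists(folders)
```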

data/

  • do you have “raw” (i.e., not yet processed) data?
    • if so, you might want to create a raw sub-folder
    • and any other relevant sub-folders (e.g., processed or tidy)
  • download the online_cleaned.csv dataset from the GitHub (https://github.com/bodowinter/iconicity_challenge/tree/master) or OSF (https://osf.io/4na58/) repo from Ćwiek et al. (2021)
    • or, move a dataset of your own to this folder
  • save the file as cwiek_2021-online_cleaned.csv
  • description of data collection:

In an online experiment with listeners of 25 different languages (from nine language families), participants listened to the 90 vocalizations (three for each of the 30 meanings), and for each, guessed its intended meaning from six written alternatives

– Ćwiek et al. (2021)

  • you could also download the data directly from GitHub in R:
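# download the dataset and save it to the project's data/ folder
write.csv(
  x = read.csv("https://raw.githubusercontent.com/bodowinter/iconicity_challenge/refs/heads/master/data/online_cleaned.csv"),
  file = "data/cwiek_2021-online_cleaned.csv",
  row.names = FALSE # avoid writing row numbers as an extra column
)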

scripts/

  • try to create a single script for each “product”
    • e.g., anonymised data, ‘cleaned’ data, data exploration, visualisation, analyses, etc.
  • you can create sub-folders as the project develops and move scripts around
    • for now, let’s create a new script to take a look at our data

New script

Create a new script:

  1. File > New File > Choose your preferred script type
  2. Save it in your scripts/ folder: File > Save as...

Load in the data

  • load in the data however you normally would
    • e.g., read.csv(), readr::read_csv(), …
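
For instance, with base R and a path relative to the project folder. This sketch assumes the dataset was saved under the file name given above; `data_path` is an illustrative name.

```r
data_path <- "data/cwiek_2021-online_cleaned.csv"

if (file.exists(data_path)) {
  df_icon <- read.csv(data_path)
  # Quick look at the column names and types
  str(df_icon)
} else {
  message("No file found at ", data_path, "; check your working directory")
}
```

The `file.exists()` guard is optional; it only turns a missing file into a readable hint.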

Exercise: mini-Code Review

R-Project template
  1. Download the R-Project template at https://osf.io/ctmwj/
  2. Open (or switch to) rproject-template.Rproj
  3. Inspect the folder structure and the files.
  4. Look at the scripts/ folder. Is it clear which scripts should be run first?
  5. Try running 02-visualisation.R first. Do you encounter any problems?

4 here-package

  • the here package (Müller, 2020) enables project-relative file referencing
    • avoids the use of setwd()
Figure 7: Illustration by Allison Horst

4.1 The problem with setwd()

If the first line of your R script is

setwd("C:\Users\jenny\path\that\only\I\have")

I will come into your office and SET YOUR COMPUTER ON FIRE🔥.

— Jenny Bryan

  • setwd() depends on your entire machine’s folder structure
  • setwd() breaks when you
    • send your project folder to a collaborator
    • make your analyses open
    • change the location of your project folder
  • the path separator (forward slash vs. backslash) also depends on your operating system
  • trying to use somebody else’s (or your former) folder path will result in an error message like:

Error in setwd("/Users/danielapalleschi/Documents/R/rproject-template") : cannot change working directory
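
The failure is easy to reproduce safely. The sketch below points `setwd()` at a made-up folder inside R's temporary directory, which should not exist on any machine, and captures the error instead of stopping the script. The object names are illustrative.

```r
# A made-up path that should not exist
missing_dir <- file.path(tempdir(), "path", "that", "only", "I", "have")

# Capture the error message rather than halting the script
setwd_result <- tryCatch(
  {
    setwd(missing_dir)
    "working directory changed"
  },
  error = function(e) conditionMessage(e)
)
setwd_result
```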

4.2 The benefit of here()

  • builds file paths relative to the top-level directory of your Project
    • meaning we never need to spell out where the project folder sits within our machine’s folder structure
  • can separate folder names with a comma
    • meaning it doesn’t matter if the original code was written on a Mac or a Windows machine
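
Passing path components as separate arguments is the same idea as base R's `file.path()`; `here()` additionally prepends the project root. A sketch, which only calls `here` if the package is installed (the object names are illustrative):

```r
# Components as separate arguments: no hard-coded absolute path
relative_path <- file.path("data", "cwiek_2021-online_cleaned.csv")
relative_path

# here() takes the same components and returns the full path,
# starting from the project's top-level directory
if (requireNamespace("here", quietly = TRUE)) {
  full_path <- here::here("data", "cwiek_2021-online_cleaned.csv")
  print(full_path)
}
```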

here

In your R Project, load the cwiek_2021-online_cleaned.csv data using here

  1. Install here (if needed; e.g., install.packages("here"))
  2. Load here at the beginning of your script
    • or use here:: before calling a function
  3. Use the here() function to load in your data
  4. Inspect the dataset however you usually would (e.g., summary(), names(), etc.)
  5. Save your script

4.3 here::here()

  • install package
In the Console
install.packages("here")
  • load package and call the here function
# load package
library(here)

# read in data
df_icon <- read.csv(here("data", "cwiek_2021-online_cleaned.csv"))
  • or directly call the here function without loading the package
# read in data without loading here
df_icon <- read.csv(here::here("data", "cwiek_2021-online_cleaned.csv"))
  • note that I stored the data with the prefix df_
    • df stands for dataframe
  • I recommend using object-type defining prefixes for all objects in your Environment
    • e.g., fit_ for models, fig_ for figures, sum_ for summaries, tbl_ for tables, etc.
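
As an illustration of this naming scheme, here is a sketch using R's built-in `mtcars` dataset (not part of the workshop data); all object names are examples only.

```r
df_cars <- mtcars                        # dataframe
sum_mpg <- summary(df_cars$mpg)          # summary
tbl_cyl <- table(df_cars$cyl)            # table
fit_mpg <- lm(mpg ~ wt, data = df_cars)  # model

# List the objects whose names carry one of the prefixes
ls(pattern = "^(df|sum|tbl|fit)_")
```

The prefix tells you what kind of object you are dealing with before you inspect it.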

Reproduce your analysis
  1. Perform some data exploration (e.g., with names(), summary(), dplyr::glimpse(), whatever you typically do)
  2. Save your script, then close RStudio/your R-Project.
  3. Re-open the project. Can you re-run the script?

Topics 🏁

  • Project-oriented workflows ✅
  • creating an R-Project ✅
  • project-relative filepaths with the here package ✅

References

Bryan, J., Hester, J., Pileggi, S., & Aja, D. E. (n.d.). What They Forgot to Teach You About R. Retrieved May 6, 2024, from https://rstats.wtf/
Bryan, J., & The STAT 545 TAs. (n.d.). R Basics and workflows. In STAT 545 Course materials. Retrieved May 6, 2024, from https://stat545.com/
Ćwiek, A., Fuchs, S., Draxler, C., Asu, E. L., Dediu, D., Hiovain, K., Kawahara, S., Koutalidis, S., Krifka, M., Lippus, P., Lupyan, G., Oh, G. E., Paul, J., Petrone, C., Ridouane, R., Reiter, S., Schümchen, N., Szalontai, Á., Ünal-Logacev, Ö., … Perlman, M. (2021). Novel vocalizations are understood across cultures. Scientific Reports, 11(1), 10108. https://doi.org/10.1038/s41598-021-89445-4
Müller, K. (2020). Here: A Simpler Way to Find Your Files (Version 1.0.1). https://CRAN.R-project.org/package=here
Using RStudio Projects. (2024, April 16). Posit Support. https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science (2nd ed.). https://r4ds.hadley.nz/
Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media, Inc.