Data tidying

From wide-to-long and long-to-wide with tidyr

Author

Affiliation

Daniela Palleschi

Humboldt-Universität zu Berlin

Published

May 14, 2024

Learning objectives

Today we will…

learn how to re-structure our data with the tidyr package
use pivot_longer() to make data longer
use pivot_wider() to make data wider

Set-up

Load the tidyverse package

library(tidyverse)

Load a subset of the tidy_data_lifetime_pilot.csv data. For demonstration purposes, we’ll only look at two trials from a single participant.

df_lifetime <- readr::read_csv(here::here("data/tidy_data_lifetime_pilot.csv"), 
                               # for special characters
                               locale = readr::locale(encoding = "latin1") 
                               ) |>
  filter(type=="critical", px=="px5", trial %in% c(3,8)) |> 
  select(px,trial,region,ff,fp,rpd,tt)

Inspect data

we’ll be changing the shape of our data, so let’s first see how it looks as-is

df_lifetime

# A tibble: 10 × 7
   px    trial region    ff    fp   rpd    tt
   <chr> <dbl> <chr>  <dbl> <dbl> <dbl> <dbl>
 1 px5       3 verb-1   190   190   190   190
 2 px5       3 verb     175   175   175   321
 3 px5       3 verb+1   154   154  1258  1723
 4 px5       3 verb+2   160   283   283   672
 5 px5       3 verb+3   156   575  1940   575
 6 px5       8 verb-1   246   246   246   246
 7 px5       8 verb     228   960   960  1892
 8 px5       8 verb+1   176   573   573   967
 9 px5       8 verb+2   151   151   151   450
10 px5       8 verb+3   216   981  2852   981

of importance, we have the following variables:
- region: contains info on which sentence region the row’s reading times correspond to
- ff: first fixation time, an eye-tracking reading measure
- fp: first-pass reading time, an eye-tracking reading measure
- rpd: regression path duration, an eye-tracking reading measure
- tt: total reading time, an eye-tracking reading measure
we see that we have 10 rows x 4 reading time measures = 40 reading time measures

Reshape data

this is the major step of data tidying
- make each column a variable
- make each row an observation
- make each cell a data point
what variable and observation mean will depend on what you want to do, and will change at different steps of your analyses
you typically want long data
- but our dataset isn’t as long as it could be
the tidyr package from the tidyverse has some useful functions to faciliate this: pivot_longer() and pivot_wider()

Pivot with `tidyr`

to pivot (verb): to turn or rotate on a point, like a hinge. Or a basketball player pivoting back and forth on one foot to protect the ball. (vocabulary.com)
a pivot (noun): a fixed point supporting something that turns or balances (dictonary.Cambridge.org)

Figure 1: A memorable scence (to millenials) from Friends where the word ‘pivot’ is repeatedly used (YouTube clip)

Wide-to-long: `pivot_longer()`

pivot_longer() takes wide data and makes it longer
- converts headers of columns into values of a new column
- combines the values of those columns into a new condensed column
takes a few arguments:
- cols: which columns do we want to combine into a single column?
- names_to: what should we call the new column containing the previous column names?
- values_to: what should we call the new column containing the values from the previous columns?

let’s take our four reading time measures and list them in a single variable that we’ll call measure, and put their values in a second variable called time

df_longer <-
  df_lifetime |> 
  pivot_longer(
  cols = c(ff,fp,rpd,tt), # columns to make long 
  names_to = "measure", # new column name for headers
  values_to = "time" # new column name for values
)

df_longer

# A tibble: 40 × 5
   px    trial region measure  time
   <chr> <dbl> <chr>  <chr>   <dbl>
 1 px5       3 verb-1 ff        190
 2 px5       3 verb-1 fp        190
 3 px5       3 verb-1 rpd       190
 4 px5       3 verb-1 tt        190
 5 px5       3 verb   ff        175
 6 px5       3 verb   fp        175
 7 px5       3 verb   rpd       175
 8 px5       3 verb   tt        321
 9 px5       3 verb+1 ff        154
10 px5       3 verb+1 fp        154
# ℹ 30 more rows

now instead of having the four reading time values in a single row across four columns called ff, fp, rpd, and tt, we have two columns (measure and time) which contain the reading time measure names and corresponding reading times
we again still have 40 reading time values: 40 rows x 1 column containing reading time values (time)

Long-to-wide: `pivot_wider()`

pivot_wider() takes long data and makes it wider
takes a few arguments:
- id_cols: identifying columns
- names_from: what should we call the new column containing the previous column names?
- names_prefix:
- values_from: new column values

let’s now take our region column in df_longer and widen it
- we’ll do this only for tt (total reading time) the resultfour reading time measures and list them in a single variable that we’ll call measure, and put their values in a second variable called time

df_longer_wider <-
  df_longer |> 
  pivot_wider(
  id_cols = c(px,trial,measure), # columns to make long 
  names_from = region, # new column name for headers
  names_prefix = "reg_", # new column name for values (optional)
  values_from = time
)

df_longer_wider

# A tibble: 8 × 8
  px    trial measure `reg_verb-1` reg_verb `reg_verb+1` `reg_verb+2`
  <chr> <dbl> <chr>          <dbl>    <dbl>        <dbl>        <dbl>
1 px5       3 ff               190      175          154          160
2 px5       3 fp               190      175          154          283
3 px5       3 rpd              190      175         1258          283
4 px5       3 tt               190      321         1723          672
5 px5       8 ff               246      228          176          151
6 px5       8 fp               246      960          573          151
7 px5       8 rpd              246      960          573          151
8 px5       8 tt               246     1892          967          450
# ℹ 1 more variable: `reg_verb+3` <dbl>

again, we have 40 reading time values: 8 rows x 5 variables containing reading time values per region

Re-structuring, not changing

in df_lifetime, df_longer, and df_longer_wider, we have 40 reading time values
- we have the exact same information in all three versions
- we have not removed or changed our data
- we have only changed the structure of the data
this might not always be the case, based on what you’re trying to achieve
- but it’s important to understand that you can find the same information in long versus wide data
- the way you structure your data should reflect/facilitate what you’re trying to say about your data

look at the three versions of the data below, and ask yourself: what does each one more easily communicate?

df_longer

# A tibble: 40 × 5
   px    trial region measure  time
   <chr> <dbl> <chr>  <chr>   <dbl>
 1 px5       3 verb-1 ff        190
 2 px5       3 verb-1 fp        190
 3 px5       3 verb-1 rpd       190
 4 px5       3 verb-1 tt        190
 5 px5       3 verb   ff        175
 6 px5       3 verb   fp        175
 7 px5       3 verb   rpd       175
 8 px5       3 verb   tt        321
 9 px5       3 verb+1 ff        154
10 px5       3 verb+1 fp        154
# ℹ 30 more rows

# only first 15 rows
df_longer |> head(15)

# A tibble: 15 × 5
   px    trial region measure  time
   <chr> <dbl> <chr>  <chr>   <dbl>
 1 px5       3 verb-1 ff        190
 2 px5       3 verb-1 fp        190
 3 px5       3 verb-1 rpd       190
 4 px5       3 verb-1 tt        190
 5 px5       3 verb   ff        175
 6 px5       3 verb   fp        175
 7 px5       3 verb   rpd       175
 8 px5       3 verb   tt        321
 9 px5       3 verb+1 ff        154
10 px5       3 verb+1 fp        154
11 px5       3 verb+1 rpd      1258
12 px5       3 verb+1 tt       1723
13 px5       3 verb+2 ff        160
14 px5       3 verb+2 fp        283
15 px5       3 verb+2 rpd       283

df_longer_wider

# A tibble: 8 × 8
  px    trial measure `reg_verb-1` reg_verb `reg_verb+1` `reg_verb+2`
  <chr> <dbl> <chr>          <dbl>    <dbl>        <dbl>        <dbl>
1 px5       3 ff               190      175          154          160
2 px5       3 fp               190      175          154          283
3 px5       3 rpd              190      175         1258          283
4 px5       3 tt               190      321         1723          672
5 px5       8 ff               246      228          176          151
6 px5       8 fp               246      960          573          151
7 px5       8 rpd              246      960          573          151
8 px5       8 tt               246     1892          967          450
# ℹ 1 more variable: `reg_verb+3` <dbl>

Learning objectives 🏁

Today we…

learned how to re-structure our data with the tidyr package ✅
used pivot_longer() to make data longer ✅
used pivot_wider() to make data wider ✅

--- title: "Data tidying" subtitle: "From wide-to-long and long-to-wide with `tidyr`" author: "Daniela Palleschi" institute: Humboldt-Universität zu Berlin lang: en date: 2024-05-14 format: html: output-file: tidying.html number-sections: false toc: true code-overflow: wrap code-tools: true self-contained: true pdf: output-file: tidying.pdf toc: true number-sections: false colorlinks: true code-overflow: wrap revealjs: output-file: tidying_slides.html include-in-header: ../../mathjax.html # for multiple equation hyperrefs code-overflow: wrap theme: [dark] width: 1600 height: 900 # chalkboard: # src: chalkboard.json progress: true scrollable: true # smaller: true slide-number: c/t code-link: true # logo: logos/hu_logo.png # css: logo.css incremental: true # number-sections: true toc: false toc-depth: 2 toc-title: 'Overview' navigation-mode: linear controls-layout: bottom-right fig-cap-location: top font-size: 0.6em slide-level: 4 self-contained: true title-slide-attributes: data-background-image: logos/logos.tif data-background-size: 15% data-background-position: 50% 92% fig-align: center fig-dpi: 300 editor_options: chunk_output_type: console bibliography: ../../references.bib csl: ../../apa.csl execute: echo: true --- # Learning objectives Today we will... - learn how to re-structure our data with the `tidyr` package - use `pivot_longer()` to make data longer - use `pivot_wider()` to make data wider # Set-up {.unnumbered} Load the `tidyverse` package ```{r} library(tidyverse) ``` Load a subset of the `tidy_data_lifetime_pilot.csv` data. For demonstration purposes, we'll only look at two trials from a single participant. ```{r} df_lifetime <- readr::read_csv(here::here("data/tidy_data_lifetime_pilot.csv"), # for special characters locale = readr::locale(encoding = "latin1") ) |> filter(type=="critical", px=="px5", trial %in% c(3,8)) |> select(px,trial,region,ff,fp,rpd,tt) ``` ## Inspect data - we'll be changing the shape of our data, so let's first see how it looks as-is ```{r} df_lifetime ``` - of importance, we have the following variables: + `region`: contains info on which sentence region the row's reading times correspond to + `ff`: first fixation time, an eye-tracking reading measure + `fp`: first-pass reading time, an eye-tracking reading measure + `rpd`: regression path duration, an eye-tracking reading measure + `tt`: total reading time, an eye-tracking reading measure - we see that we have 10 rows x 4 reading time measures = 40 reading time measures # Reshape data - this is the major step of data tidying + make each column a variable + make each row an observation + make each cell a data point - what variable and observation mean will depend on what you want to do, and will change at different steps of your analyses - you typically want *long* data + but our dataset isn't as long as it could be - the `tidyr` package from the `tidyverse` has some useful functions to faciliate this: `pivot_longer()` and `pivot_wider()` ## Pivot with `tidyr` - to pivot (verb): *to turn or rotate on a point, like a hinge. Or a basketball player pivoting back and forth on one foot to protect the ball.* ([vocabulary.com](https://www.vocabulary.com/dictionary/pivot#:~:text=To%20pivot%20is%20to%20turn,foot%20to%20protect%20the%20ball.)) - a pivot (noun): *a fixed point supporting something that turns or balances* ([dictonary.Cambridge.org](https://dictionary.cambridge.org/dictionary/english/pivot)) ::: {.content-visible when-format="revealjs"} ## ::: ::: {.content-hidden when-format="pdf"} ```{r echo = F, fig.align = "center", out.width="100%"} #| fig-cap-location: bottom #| label: fig-friends #| fig-cap: "A memorable scence (to millenials) from Friends where the word 'pivot' is repeatedly used ([YouTube clip](https://www.youtube.com/watch?v=8w3wmQAMoxQ))" magick::image_read(here::here("media/pivot_friends.jpeg")) ``` ::: ::: {.content-visible when-format="pdf"} ```{r echo = F, fig.align = "center", out.width="50%"} #| fig-cap-location: bottom magick::image_read(here::here("media/pivot_friends.jpeg")) ``` ::: # Wide-to-long: `pivot_longer()` - `pivot_longer()` takes wide data and makes it longer + converts headers of columns into values of a new column + combines the values of those columns into a new condensed column - takes a few arguments: + `cols`: which columns do we want to combine into a single column? + `names_to`: what should we call the new column containing the previous column names? + `values_to`: what should we call the new column containing the values from the previous columns? ::: {.content-visible when-format="revealjs"} ## `pivot_longer()` ::: - let's take our four reading time measures and list them in a single variable that we'll call `measure`, and put their values in a second variable called `time` ```{r} #| output-location: fragment df_longer <- df_lifetime |> pivot_longer( cols = c(ff,fp,rpd,tt), # columns to make long names_to = "measure", # new column name for headers values_to = "time" # new column name for values ) ``` ```{r} df_longer ``` - now instead of having the four reading time values in a single row across four columns called `ff`, `fp`, `rpd`, and `tt`, we have two columns (`measure` and `time`) which contain the reading time measure names and corresponding reading times - we again still have 40 reading time values: 40 rows x 1 column containing reading time values (`time`) # Long-to-wide: `pivot_wider()` - `pivot_wider()` takes long data and makes it wider - takes a few arguments: + `id_cols`: identifying columns + `names_from`: what should we call the new column containing the previous column names? + `names_prefix`: + `values_from`: new column values ::: {.content-visible when-format="revealjs"} ## `pivot_wider()` ::: - let's now take our `region` column in `df_longer` and widen it + we'll do this only for `tt` (total reading time) the resultfour reading time measures and list them in a single variable that we'll call `measure`, and put their values in a second variable called `time` ```{r} #| output-location: fragment df_longer_wider <- df_longer |> pivot_wider( id_cols = c(px,trial,measure), # columns to make long names_from = region, # new column name for headers names_prefix = "reg_", # new column name for values (optional) values_from = time ) ``` ```{r} df_longer_wider ``` - again, we have 40 reading time values: 8 rows x 5 variables containing reading time values per region # Re-structuring, not changing - in `df_lifetime`, `df_longer`, and `df_longer_wider`, we have 40 reading time values + we have the exact same information in all three versions + we have not removed or changed our data + we have only changed the *structure* of the data - this might not always be the case, based on what you're trying to achieve + but it's important to understand that you can find the same information in long versus wide data + the way you structure your data should reflect/facilitate what you're trying to say about your data ::: {.content-visible when-format="revealjs"} ## ::: - look at the three versions of the data below, and ask yourself: what does each one more easily communicate? ```{r} df_longer ``` ```{r} # only first 15 rows df_longer |> head(15) ``` ```{r} df_longer_wider ``` ::: footer More reading: [PsyTeachR](https://psyteachr.github.io/ads-v2/08-tidy.html#long-to-wide) ::: # Learning objectives 🏁 Today we... - learned how to re-structure our data with the `tidyr` package ✅ - used `pivot_longer()` to make data longer ✅ - used `pivot_wider()` to make data wider ✅