Raw to tidy data
Humboldt-Universität zu Berlin
2024-05-14
Today we will…
dplyr verbs to wrangle columns and rows
.RProj file (which opens RStudio)
File > New Project > New Directory > New Project > New Project > Create Project
.md file: File > New File > Markdown File
README.md in the project folder
here
here package
here() from within a project; what’s the output?
Session > Restart R
# Session Info) containing sessionInfo()
renv package, targets package, docker for environment containers
RProject
.RProj
README.md
data/
scripts/ (for analyses)
notes/ (if for class notes)
Scripts (.qmd/.Rmd)
sessionInfo() at the end
/ˈraŋɡl/
noun
a dispute or argument, typically one that is long and complicated. “an insurance wrangle is holding up compensation payments”
verb
have a long, complicated dispute or argument. “the bureaucrats continue wrangling over the fine print”
NORTH AMERICAN round up, herd, or take charge of (livestock). “the horses were wrangled early”
Tidy
Transform
Three rules (Wickham et al., 2023):
Image source: Wickham et al. (2023) (all rights reserved)
tidyverse
dplyr package
tidyverse you don’t need to also load dplyr
|>
Ctrl/Cmd+Shift+M
%>% (magrittr package)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
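To make the pipe concrete, here is a minimal sketch using the built-in iris data and the native pipe (which needs R 4.1 or later). The object names nested and piped are just for illustration.

```r
# Nested call: the data is the first argument of head()
nested <- head(iris, n = 3)

# Same call with the native pipe: the left-hand side is passed
# as the first argument of the function on the right-hand side
piped <- iris |> head(n = 3)

# Both versions return exactly the same data frame
identical(nested, piped)
```

The piped version reads left to right, which becomes more helpful once several functions are chained one after another.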
# A tibble: 4,431 × 28
RECORDING_SESSION_LABEL TRIAL_INDEX EYE_USED IA_DWELL_TIME
<chr> <dbl> <chr> <dbl>
1 px3 1 RIGHT 0
2 px3 2 RIGHT 0
3 px3 3 RIGHT 0
4 px3 3 RIGHT 0
5 px3 3 RIGHT 0
6 px3 3 RIGHT 0
7 px3 3 RIGHT 0
8 px3 3 RIGHT 0
9 px3 4 RIGHT 0
10 px3 5 RIGHT 0
# ℹ 4,421 more rows
# ℹ 24 more variables: IA_FIRST_FIXATION_DURATION <dbl>,
# IA_FIRST_RUN_DWELL_TIME <dbl>, IA_FIXATION_COUNT <dbl>, IA_ID <dbl>,
# IA_LABEL <chr>, IA_REGRESSION_IN <dbl>, IA_REGRESSION_IN_COUNT <dbl>,
# IA_REGRESSION_OUT <dbl>, IA_REGRESSION_OUT_COUNT <dbl>,
# IA_REGRESSION_PATH_DURATION <dbl>, KeyPress <dbl>, rt <dbl>, bio <chr>,
# critical <chr>, gender <chr>, item_id <dbl>, list <dbl>, match <chr>, …
<-
<- code_output_to_be_saved_as_object_name
df_lifetime in the Environment pane
A note on annotation
# comment)
First we load required libraries.
tidyverse package
dplyr package, which is part of the tidyverse
rename()
RECORDING_SESSION_LABEL is a long way of saying ‘participant’
Change the following names:
EYE_USED to eye
IA_DWELL_TIME to tt
IA_FIRST_FIXATION_DURATION to ff
IA_FIXATION_COUNT to fix_count
IA_FIRST_RUN_DWELL_TIME to fp
IA_ID to region_n
IA_LABEL to region_text
IA_REGRESSION_IN to reg_in
IA_REGRESSION_IN_COUNT to reg_in_count
IA_REGRESSION_OUT to reg_out
IA_REGRESSION_OUT_COUNT to reg_out_count
IA_REGRESSION_PATH_DURATION to rpd
name_vital_status to lifetime
[1] "px" "trial" "eye" "tt"
[5] "ff" "fp" "fix_count" "region_n"
[9] "region_text" "reg_in" "reg_in_count" "reg_out"
[13] "reg_out_count" "rpd" "KeyPress" "rt"
[17] "bio" "critical" "gender" "item_id"
[21] "list" "match" "condition" "name"
[25] "lifetime" "tense" "type" "yes_press"
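A minimal sketch of how such a renaming step could look. The toy tibble df_toy is made up for illustration, and only three of the columns are shown; the new names px and trial are taken from the names() output above.

```r
library(dplyr)

# Hypothetical toy data with the long original column names
df_toy <- tibble(
  RECORDING_SESSION_LABEL = c("px3", "px3", "px5"),
  TRIAL_INDEX = c(1, 2, 1),
  IA_DWELL_TIME = c(0, 0, 1603)
)

# rename() takes pairs of the form new_name = old_name
df_toy <- df_toy |>
  rename(
    px = RECORDING_SESSION_LABEL,
    trial = TRIAL_INDEX,
    tt = IA_DWELL_TIME
  )

names(df_toy)
```

Note the order inside rename(): the new name goes on the left of the equals sign, the existing name on the right.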
mutate()
Mutate column(s):
new_column contain?
# A tibble: 6 × 3
px new_column trial
<chr> <chr> <dbl>
1 px3 new 1
2 px3 new 2
3 px3 new 3
4 px3 new 3
5 px3 new 3
6 px3 new 3
new_column and trial contain?
# A tibble: 6 × 3
px new_column trial
<chr> <chr> <dbl>
1 px3 px3 6
2 px3 px3 7
3 px3 px3 8
4 px3 px3 8
5 px3 px3 8
6 px3 px3 8
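The two outputs above are consistent with code along the following lines. This is a sketch on a made-up toy tibble (df_toy), not necessarily the exact code used for the slides.

```r
library(dplyr)

# Hypothetical toy data
df_toy <- tibble(px = rep("px3", 3), trial = c(1, 2, 3))

# Create a new column that contains the same string in every row;
# .after = px places it directly after the px column
df_new <- df_toy |>
  mutate(new_column = "new", .after = px)

# Create a column from an existing column (px) and
# overwrite an existing column (trial) in the same call
df_changed <- df_toy |>
  mutate(
    new_column = px,
    trial = trial + 5
  )

df_new
df_changed
```

If the name on the left of the equals sign already exists, mutate() overwrites that column; otherwise it creates a new one.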
if_else()
mutate()
ifelse(condition, output_if_true, output_if_false)
Logical operators
symbols used to describe a logical condition
== is identical (1 == 1)
!= is not identical (1 != 2)
> is greater than (2 > 1)
< is less than (1 < 2)
& and also (for multiple conditions)
| or (for multiple conditions)
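A short sketch of how these operators can be combined with if_else() inside mutate(). The toy tibble and the new column names (match_num, rt_ok, rt_label) are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data
df_toy <- tibble(
  match = c("yes", "no", "yes", "no"),
  rt = c(350, 1200, 800, 90)
)

df_toy <- df_toy |>
  mutate(
    # == : 1 where match is identical to "yes", otherwise 0
    match_num = if_else(match == "yes", 1, 0),
    # & : both conditions have to be true
    rt_ok = if_else(rt > 100 & rt < 1000, "keep", "drop"),
    # | : at least one of the conditions has to be true
    rt_label = if_else(rt < 100 | rt > 1000, "extreme", "typical")
  )

df_toy
```

dplyr's if_else() is stricter than base R's ifelse(): the two possible outputs have to be of the same type (e.g., both numbers or both strings).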
case_when()
mutate()
ifelse()
case_when(condition & other_condition | other_condition ~ output, TRUE ~ output_otherwise)
TRUE ~ output then NAs will be created
accept that checks whether the button pressed (KeyPress) equals the button that corresponds to an acceptance (yes_press)
KeyPress and yes_press are the same, accept should be 1. If not, accept should be 0
if_else() or case_when()
accuracy where:
match is yes and accept is 1, accuracy is 1
match is no and accept is 0, accuracy is 1
match is yes and accept is 0, accuracy is 0
match is no and accept is 1, accuracy is 0
region, that has the following values based on region_n
1 is region verb-1
2 is region verb
3 is region verb+1
4 is region verb+2
5 is region verb+3
6 is region verb+4
region is before region_n
KeyPress is after yes_press
[1] "px" "trial" "region" "region_n"
[5] "region_text" "eye" "ff" "fp"
[9] "rpd" "tt" "fix_count" "reg_in"
[13] "reg_in_count" "reg_out" "reg_out_count" "rt"
[17] "bio" "critical" "gender" "item_id"
[21] "list" "match" "condition" "name"
[25] "lifetime" "tense" "type" "yes_press"
[29] "KeyPress" "new_column" "newer_column" "accept"
[33] "accuracy"
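One possible way to approach steps like these, sketched on a made-up toy tibble with only three of the six regions. It combines case_when() for multiple conditions with relocate() for moving a column; treat it as an illustration rather than the model solution.

```r
library(dplyr)

# Hypothetical toy data; region_n = 7 is deliberately not covered below
df_toy <- tibble(
  region_n = c(1, 2, 3, 7),
  match = c("yes", "no", "yes", "no"),
  accept = c(1, 0, 0, 1)
)

df_toy <- df_toy |>
  mutate(
    # Two conditions lead to 1; TRUE ~ 0 catches everything else
    accuracy = case_when(
      match == "yes" & accept == 1 ~ 1,
      match == "no" & accept == 0 ~ 1,
      TRUE ~ 0
    ),
    # No TRUE ~ fallback here, so region_n = 7 becomes NA
    region = case_when(
      region_n == 1 ~ "verb-1",
      region_n == 2 ~ "verb",
      region_n == 3 ~ "verb+1"
    )
  ) |>
  # Move region so that it sits directly before region_n
  relocate(region, .before = region_n)

df_toy
```

relocate() also takes .after = if you would rather name the column that should come first.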
group_by() and ungroup()
Group data by certain variable(s)
[1] 0.26 0.90
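A minimal sketch of grouping, computing something per group, and ungrouping again. The toy values are invented; the column name px_accuracy mirrors the one that appears in the later outputs.

```r
library(dplyr)

# Hypothetical toy data: two participants, two trials each
df_toy <- tibble(
  px = c("px1", "px1", "px2", "px2"),
  accuracy = c(1, 0, 1, 1)
)

df_grouped <- df_toy |>
  group_by(px) |>                        # every following step runs per participant
  mutate(px_accuracy = mean(accuracy)) |>
  ungroup()                              # remove the grouping again

df_grouped
```

Forgetting ungroup() is a common source of confusion, because later operations would silently keep running per group.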
.by
mutate() also takes .by = as an argument
group_by()/ungroup()
dplyr 1.1.0 version (more info)
[1] 0.26 0.90
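A sketch of per-group computation with the .by argument, again on invented toy data. It assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data: two participants, two trials each
df_toy <- tibble(
  px = c("px1", "px1", "px2", "px2"),
  accuracy = c(1, 0, 1, 1)
)

# .by groups only for this one mutate() call,
# so there is nothing to ungroup afterwards
df_by <- df_toy |>
  mutate(px_accuracy = mean(accuracy), .by = px)

df_by
```

Because the grouping applies to a single call, the result is a regular ungrouped tibble.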
separate()
unite()
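A brief sketch of what these two tidyr functions do, using two names from the data. The column names First and Last appear in a later output; full_name and the toy tibble are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
df_toy <- tibble(name = c("Edith Piaf", "David Beckham"))

# separate(): split one column into several at a separator;
# remove = FALSE keeps the original column as well
df_sep <- df_toy |>
  separate(name, into = c("First", "Last"), sep = " ", remove = FALSE)

# unite(): paste several columns together into one
df_uni <- df_sep |>
  unite("full_name", First, Last, sep = " ")

df_sep
df_uni
```

By default unite() drops the columns it pasted together, so df_uni only contains name and full_name.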
select()
<-) will remove all other columns
df <- df |> select(...))
# A tibble: 10 × 1
px
<chr>
1 px3
2 px3
3 px3
4 px3
5 px3
6 px3
7 px3
8 px3
9 px3
10 px3
select(-)
# A tibble: 10 × 34
region region_n region_text eye ff fp rpd tt fix_count reg_in
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 filler 1 He owned innu… RIGHT 0 0 0 0 0 0
2 filler 1 She is a moth… RIGHT 0 0 0 0 0 0
3 verb-1 1 She RIGHT 0 0 0 0 0 0
4 verb 2 will perform RIGHT 0 0 0 0 0 0
5 verb+1 3 in prestigiou… RIGHT 0 0 0 0 0 0
6 verb+2 4 in the future, RIGHT 0 0 0 0 0 0
7 verb+3 5 as reported i… RIGHT 0 0 0 0 0 0
8 verb+4 6 as reported i… RIGHT 0 0 0 0 0 0
9 filler 1 He interviewe… RIGHT 0 0 0 0 0 0
10 verb-1 1 She RIGHT 0 0 0 0 0 0
# ℹ 24 more variables: reg_in_count <dbl>, reg_out <dbl>, reg_out_count <dbl>,
# rt <dbl>, bio <chr>, critical <chr>, gender <chr>, item_id <dbl>,
# list <dbl>, match <chr>, condition <chr>, name <chr>, First <chr>,
# Last <chr>, lifetime <chr>, tense <chr>, type <chr>, yes_press <dbl>,
# KeyPress <dbl>, new_column <chr>, newer_column <chr>, accept <dbl>,
# accuracy <dbl>, px_accuracy <dbl>
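A minimal sketch of keeping versus dropping columns with select(), on an invented toy tibble.

```r
library(dplyr)

# Hypothetical toy data
df_toy <- tibble(
  px = c("px3", "px5"),
  trial = c(1, 1),
  rt = c(350, 420)
)

# Keep only the named column(s)
only_px <- df_toy |> select(px)

# A minus sign drops the named column(s) and keeps the rest
without_px <- df_toy |> select(-px)

names(only_px)
names(without_px)
```

As long as the result is only printed or saved under a new name, df_toy itself still contains all three columns.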
Select criteria
You can also use criteria for select:
select(starts_with("x")) select columns that start with a character string
select(ends_with("x")) select columns that end with a character string
select(contains("x")) select columns that contain a character string
select(num_range("prefix",10:20)) select columns with a prefix followed by a range of values
Remove the example variables we created with mutate:
new_column, newer_column, First, Last
[1] "px" "trial" "region" "region_n"
[5] "region_text" "eye" "ff" "fp"
[9] "rpd" "tt" "fix_count" "reg_in"
[13] "reg_in_count" "reg_out" "reg_out_count" "rt"
[17] "bio" "critical" "gender" "item_id"
[21] "list" "match" "condition" "name"
[25] "lifetime" "tense" "type" "yes_press"
[29] "KeyPress" "accept" "accuracy" "px_accuracy"
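A sketch of the selection helpers in action, including one way to drop example columns again. The toy tibble is made up, with column names borrowed from the data.

```r
library(dplyr)

# Hypothetical toy data (one row is enough to look at the columns)
df_toy <- tibble(
  px = "px3",
  reg_in = 0,
  reg_out = 1,
  reg_in_count = 0,
  new_column = "new",
  newer_column = "px3"
)

reg_cols <- df_toy |> select(starts_with("reg"))    # reg_in, reg_out, reg_in_count
count_cols <- df_toy |> select(ends_with("count"))  # reg_in_count
example_cols <- df_toy |> select(contains("column")) # new_column, newer_column

# Drop the example columns and save the result
df_clean <- df_toy |>
  select(-new_column, -newer_column)

names(df_clean)
```

The helpers can also be negated, e.g. select(-contains("column")) would drop both example columns in one go.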
filter()
==, !=, >, <, |)
== is needed
# A tibble: 8 × 32
px trial region region_n region_text eye ff fp rpd tt
<chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 px3 1 filler 1 He owned innumerabl… RIGHT 0 0 0 0
2 px5 1 filler 1 She is a mother of … RIGHT 145 1603 1603 1603
3 px6 1 filler 1 He is a father of t… RIGHT 147 1224 1224 1224
4 px2 1 filler 1 She made innumerabl… RIGHT 84 1829 1829 1829
5 px7 1 filler 1 In the '70s, he own… RIGHT 138 2456 2456 2456
6 px1 1 filler 1 Beloved morning sho… RIGHT 160 1708 1708 1708
7 px8 1 filler 1 She was a mother of… RIGHT 220 806 806 806
8 px4 1 filler 1 In the '70s, he own… LEFT 171 3557 3557 3557
# ℹ 22 more variables: fix_count <dbl>, reg_in <dbl>, reg_in_count <dbl>,
# reg_out <dbl>, reg_out_count <dbl>, rt <dbl>, bio <chr>, critical <chr>,
# gender <chr>, item_id <dbl>, list <dbl>, match <chr>, condition <chr>,
# name <chr>, lifetime <chr>, tense <chr>, type <chr>, yes_press <dbl>,
# KeyPress <dbl>, accept <dbl>, accuracy <dbl>, px_accuracy <dbl>
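A minimal sketch of filter() with a few logical operators, on invented toy data. The object names are just for illustration.

```r
library(dplyr)

# Hypothetical toy data
df_toy <- tibble(
  px = c("px3", "px3", "px5", "px5"),
  trial = c(1, 2, 1, 2),
  rt = c(0, 900, 1603, 450)
)

# Keep rows where trial is identical to 1 (note the double ==)
first_trials <- df_toy |> filter(trial == 1)

# Keep rows where px is not "px3"
not_px3 <- df_toy |> filter(px != "px3")

# Keep rows where at least one of the two conditions is true
slow_or_second <- df_toy |> filter(rt > 1000 | trial == 2)

first_trials
not_px3
slow_or_second
```

A single = inside filter() is a common slip; dplyr stops with an error message that suggests == instead.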
filter()
What are these code chunks doing?
df_crit that includes only critical trials
df_fill that includes only filler trials
type
# A tibble: 6 × 1
type
<chr>
1 critical
2 critical
3 critical
4 critical
5 critical
6 critical
distinct()
filter(), but for distinct values of a variable
# A tibble: 639 × 2
px name
<chr> <chr>
1 px3 Edith Piaf
2 px3 Aaliyah
3 px3 David Beckham
4 px3 Jana Novotna
5 px3 Grace Kelly
6 px3 Nigella Lawson
7 px3 Coco Chanel
8 px3 Ben Kingsley
9 px3 Jim Carrey
10 px3 Judy Garland
# ℹ 629 more rows
# A tibble: 639 × 32
px trial region region_n region_text eye ff fp rpd tt
<chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 px3 3 verb-1 1 She RIGHT 0 0 0 0
2 px3 5 verb-1 1 She RIGHT 0 0 0 0
3 px3 8 verb-1 1 He RIGHT 0 0 0 0
4 px3 10 verb-1 1 She RIGHT 0 0 0 0
5 px3 13 verb-1 1 She RIGHT 0 0 0 0
6 px3 16 verb-1 1 She RIGHT 0 0 0 0
7 px3 18 verb-1 1 She RIGHT 0 0 0 0
8 px3 21 verb-1 1 He RIGHT 0 0 0 0
9 px3 23 verb-1 1 He RIGHT 0 0 0 0
10 px3 26 verb-1 1 She RIGHT 0 0 0 0
# ℹ 629 more rows
# ℹ 22 more variables: fix_count <dbl>, reg_in <dbl>, reg_in_count <dbl>,
# reg_out <dbl>, reg_out_count <dbl>, rt <dbl>, bio <chr>, critical <chr>,
# gender <chr>, item_id <dbl>, list <dbl>, match <chr>, condition <chr>,
# name <chr>, lifetime <chr>, tense <chr>, type <chr>, yes_press <dbl>,
# KeyPress <dbl>, accept <dbl>, accuracy <dbl>, px_accuracy <dbl>
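The two outputs above (2 columns versus all 32 columns) are consistent with calling distinct() without and with .keep_all = TRUE. A sketch on made-up toy data:

```r
library(dplyr)

# Hypothetical toy data: the first two rows share the same px and name
df_toy <- tibble(
  px = c("px3", "px3", "px3", "px5"),
  name = c("Edith Piaf", "Edith Piaf", "Aaliyah", "Aaliyah"),
  region_n = c(1, 2, 3, 4)
)

# One row per unique px/name combination; only these two columns are kept
unique_pairs <- df_toy |> distinct(px, name)

# Same rows, but all other columns are kept as well
# (taken from the first row of each combination)
first_rows <- df_toy |> distinct(px, name, .keep_all = TRUE)

unique_pairs
first_rows
```

With .keep_all = TRUE the values in the remaining columns always come from the first occurrence, so row order matters.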
arrange()
# A tibble: 639 × 4
px trial name condition
<chr> <dbl> <chr> <chr>
1 px1 3 Amy Winehouse deadPP
2 px1 5 John Wayne deadPP
3 px1 8 Abraham Lincoln deadPP
4 px1 10 Helen Mirren livingSF
5 px1 13 Paul McCartney livingSF
6 px1 16 Ariana Grande livingPP
7 px1 18 Kate Middleton livingSF
8 px1 21 Johan Cruyff deadSF
9 px1 23 Marilyn Monroe deadPP
10 px1 26 Biggie Smalls deadSF
# ℹ 629 more rows
# A tibble: 639 × 4
px trial name condition
<chr> <dbl> <chr> <chr>
1 px8 3 Whitney Houston deadPP
2 px8 5 Elton John livingSF
3 px8 8 Jackie Chan livingPP
4 px8 10 Romy Schneider deadPP
5 px8 13 James Cameron livingSF
6 px8 16 Ella Fitzgerald deadSF
7 px8 18 Kathryn Hepburn deadPP
8 px8 21 Kate Middleton livingPP
9 px8 23 Janis Joplin deadPP
10 px8 26 Serena Williams livingSF
# ℹ 629 more rows
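The two outputs above differ in sort direction: the first starts with px1, the second with px8. A sketch of how that could be produced, on invented toy data:

```r
library(dplyr)

# Hypothetical toy data in no particular order
df_toy <- tibble(
  px = c("px8", "px1", "px8", "px1"),
  trial = c(5, 3, 3, 5)
)

# Ascending by px, then by trial within each px
ascending <- df_toy |> arrange(px, trial)

# desc() reverses the order for px; trial is still ascending within px
descending_px <- df_toy |> arrange(desc(px), trial)

ascending
descending_px
```

Columns are sorted in the order they are listed, so the first column has priority and later ones only break ties.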
| wrangle | have a long dispute |
| data wrangling | tidying and transforming your data |
| tidy data | data where each column is a variable and each row is an observation |
| the tidyverse | a group of packages for tidy data |
| dplyr | a package within the tidyverse for data wrangling |
| pipe operator (|> or %>%) | operational function, passes the result of one function/argument to the next |
| logical operators | compare values of two arguments: &, |, ==, !=, >, < |
| read_csv() | read in a csv as a tibble (from readr package) |
| rename() | rename variables |
| relocate() | move variables |
| mutate() | change or create new variables |
| if_else() | condition for `mutate()` |
| case_when() | handle multiple conditions for `mutate()` |
| group_by() | group by a certain variable |
| select() | keep (or exclude) certain variables |
| filter() | keep (or exclude) rows based on some criteria |
| distinct() | keep rows with distinct value of given variable(s) |
| arrange() | sort variable(s) in ascending or descending order |
| separate() | split a variable into multiple variables |
| pivot_longer() | make wide data longer |
| pivot_wider() | make long data wider |
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 magick_2.8.3
loaded via a namespace (and not attached):
[1] bit_4.0.5 gtable_0.3.5 jsonlite_1.8.8 crayon_1.5.2
[5] compiler_4.4.0 renv_1.0.7 tidyselect_1.2.1 Rcpp_1.0.12
[9] parallel_4.4.0 scales_1.3.0 yaml_2.3.8 fastmap_1.2.0
[13] here_1.0.1 R6_2.5.1 generics_0.1.3 knitr_1.47
[17] munsell_0.5.1 rprojroot_2.0.4 tzdb_0.4.0 pillar_1.9.0
[21] rlang_1.1.4 utf8_1.2.4 stringi_1.8.4 xfun_0.44
[25] bit64_4.0.5 timechange_0.3.0 cli_3.6.2 withr_3.0.0
[29] magrittr_2.0.3 digest_0.6.35 grid_4.4.0 vroom_1.6.5
[33] rstudioapi_0.16.0 hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5
[37] evaluate_0.23 glue_1.7.0 fansi_1.0.6 colorspace_2.1-0
[41] rmarkdown_2.27 tools_4.4.0 pkgconfig_2.0.3 htmltools_0.5.8.1