Raw to tidy data
Humboldt-Universität zu Berlin
2024-05-14
Today we will…
use dplyr verbs to wrangle columns and rows
.RProj file (which opens RStudio)
File > New Project > New Directory > New Project > New Project > Create Project
.md file: File > New File > Markdown File
README.md in the project folder
here
here package
here() from within a project; what’s the output?
Session > Restart R
a section (# Session Info) containing sessionInfo()
renv package, targets package, docker for environment containers
RProject
.RProj
README.md
data/
scripts/ (for analyses)
notes/ (if for class notes)
Scripts (.qmd/.Rmd)
sessionInfo() at the end
/ˈraŋɡl/
noun
a dispute or argument, typically one that is long and complicated. “an insurance wrangle is holding up compensation payments”
verb
have a long, complicated dispute or argument. “the bureaucrats continue wrangling over the fine print”
NORTH AMERICAN round up, herd, or take charge of (livestock). “the horses were wrangled early”
Tidy
Transform
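Three rules (Wickham et al., 2023):
Each variable is a column; each column is a variable
Each observation is a row; each row is an observation
Each value is a cell; each cell is a single value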
tidyverse
dplyr package
if you load tidyverse, you don’t need to also load dplyr
|> Ctrl/Cmd+Shift+M
%>% (magrittr package)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
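As a small illustration of the pipe, the two calls below should be equivalent. This is a sketch using only base R and the built-in iris data; it assumes R 4.1 or later, where the native pipe is available.

```r
# Nested call: read from the inside out
nested <- head(iris)

# Piped call: read from left to right;
# the left-hand side becomes the first argument of head()
piped <- iris |> head()

# Both approaches return the same first six rows
identical(nested, piped)
```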
# A tibble: 4,431 × 28
RECORDING_SESSION_LABEL TRIAL_INDEX EYE_USED IA_DWELL_TIME
<chr> <dbl> <chr> <dbl>
1 px3 1 RIGHT 0
2 px3 2 RIGHT 0
3 px3 3 RIGHT 0
4 px3 3 RIGHT 0
5 px3 3 RIGHT 0
6 px3 3 RIGHT 0
7 px3 3 RIGHT 0
8 px3 3 RIGHT 0
9 px3 4 RIGHT 0
10 px3 5 RIGHT 0
# ℹ 4,421 more rows
# ℹ 24 more variables: IA_FIRST_FIXATION_DURATION <dbl>,
# IA_FIRST_RUN_DWELL_TIME <dbl>, IA_FIXATION_COUNT <dbl>, IA_ID <dbl>,
# IA_LABEL <chr>, IA_REGRESSION_IN <dbl>, IA_REGRESSION_IN_COUNT <dbl>,
# IA_REGRESSION_OUT <dbl>, IA_REGRESSION_OUT_COUNT <dbl>,
# IA_REGRESSION_PATH_DURATION <dbl>, KeyPress <dbl>, rt <dbl>, bio <chr>,
# critical <chr>, gender <chr>, item_id <dbl>, list <dbl>, match <chr>, …
<-
object_name <- code_output_to_be_saved_as_object_name
df_lifetime in the Environment pane
A note on annotation
(# comment)
First we load required libraries.
tidyverse package
dplyr package, which is part of the tidyverse
rename()
RECORDING_SESSION_LABEL is a long way of saying ‘participant’
Change the following names:
EYE_USED to eye
IA_DWELL_TIME to tt
IA_FIRST_FIXATION_DURATION to ff
IA_FIXATION_COUNT to fix_count
IA_FIRST_RUN_DWELL_TIME to fp
IA_ID to region_n
IA_LABEL to region_text
IA_REGRESSION_IN to reg_in
IA_REGRESSION_IN_COUNT to reg_in_count
IA_REGRESSION_OUT to reg_out
IA_REGRESSION_OUT_COUNT to reg_out_count
IA_REGRESSION_PATH_DURATION to rpd
name_vital_status to lifetime
[1] "px" "trial" "eye" "tt"
[5] "ff" "fp" "fix_count" "region_n"
[9] "region_text" "reg_in" "reg_in_count" "reg_out"
[13] "reg_out_count" "rpd" "KeyPress" "rt"
[17] "bio" "critical" "gender" "item_id"
[21] "list" "match" "condition" "name"
[25] "lifetime" "tense" "type" "yes_press"
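A minimal sketch of how rename() could be used for a few of these columns. The data frame df_toy and its values are made up for illustration and stand in for the real eye-tracking data.

```r
library(dplyr)

# Hypothetical stand-in for the eye-tracking data (values are invented)
df_toy <- tibble(
  RECORDING_SESSION_LABEL = c("px1", "px2"),
  TRIAL_INDEX = c(1, 2),
  EYE_USED = c("RIGHT", "LEFT")
)

# rename() takes pairs of the form new_name = old_name
df_toy <- df_toy |>
  rename(
    px = RECORDING_SESSION_LABEL,
    trial = TRIAL_INDEX,
    eye = EYE_USED
  )

names(df_toy)
```

Note that the new name goes on the left of the equals sign, which is easy to get backwards.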
mutate()
Mutate column(s):
What does new_column contain?
# A tibble: 6 × 3
px new_column trial
<chr> <chr> <dbl>
1 px3 new 1
2 px3 new 2
3 px3 new 3
4 px3 new 3
5 px3 new 3
6 px3 new 3
What do new_column and trial contain?
# A tibble: 6 × 3
px new_column trial
<chr> <chr> <dbl>
1 px3 px3 6
2 px3 px3 7
3 px3 px3 8
4 px3 px3 8
5 px3 px3 8
6 px3 px3 8
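One way output like the two tables above could be produced is sketched here. The slide code itself is not shown, so this is an assumption; df_toy is a made-up data frame.

```r
library(dplyr)

# Hypothetical toy data
df_toy <- tibble(px = c("px3", "px3", "px3"), trial = c(1, 2, 3))

# Create a new column that holds the same string in every row;
# .after controls where the new column is placed
df_new <- df_toy |>
  mutate(new_column = "new", .after = px)

# Copy an existing column into a new one, and overwrite trial in place
df_changed <- df_toy |>
  mutate(new_column = px, trial = trial + 5, .after = px)
```

Using an existing column name on the left-hand side (here trial) changes that column rather than adding a new one.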
if_else()
mutate()
if_else(condition, output_if_true, output_if_false)
Logical operators
symbols used to describe a logical condition
== is identical (1 == 1)
!= is not identical (1 != 2)
> is greater than (2 > 1)
< is less than (1 < 2)
& and also (for multiple conditions)
| or (for multiple conditions)
case_when()
mutate()
ifelse()
case_when(condition & other_condition | other_condition ~ output, TRUE ~ output_otherwise)
if there is no final TRUE ~ output, then NAs will be created
Create a new variable accept that checks whether the button pressed (KeyPress) equals the button that corresponds to an acceptance (yes_press)
If KeyPress and yes_press are the same, accept should be 1. If not, accept should be 0
Use if_else() or case_when()
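One possible approach to this exercise, sketched with if_else(). The toy values for KeyPress and yes_press are invented; only the column names come from the data.

```r
library(dplyr)

# Hypothetical toy data: which key was pressed, and which key means "yes"
df_toy <- tibble(
  KeyPress = c(4, 5, 5),
  yes_press = c(4, 4, 5)
)

# accept is 1 when the pressed key matches the "yes" key, otherwise 0
df_toy <- df_toy |>
  mutate(accept = if_else(KeyPress == yes_press, 1, 0))

df_toy
```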
Create a new variable accuracy where:
match is yes and accept is 1, accuracy is 1
match is no and accept is 0, accuracy is 1
match is yes and accept is 0, accuracy is 0
match is no and accept is 1, accuracy is 0
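The four conditions above could be written with case_when() roughly as follows. This is one possible solution on invented toy data, not necessarily the one used in class.

```r
library(dplyr)

# Hypothetical toy data covering all four combinations
df_toy <- tibble(
  match = c("yes", "no", "yes", "no"),
  accept = c(1, 0, 0, 1)
)

df_toy <- df_toy |>
  mutate(
    accuracy = case_when(
      match == "yes" & accept == 1 ~ 1,
      match == "no" & accept == 0 ~ 1,
      match == "yes" & accept == 0 ~ 0,
      match == "no" & accept == 1 ~ 0,
      # anything not covered above becomes a missing value
      TRUE ~ NA_real_
    )
  )

df_toy
```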
Create a new variable region, that has the following values based on region_n
1 is region verb-1
2 is region verb
3 is region verb+1
4 is region verb+2
5 is region verb+3
6 is region verb+4
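A sketch of the region mapping with case_when(), again on a made-up data frame that contains only region_n.

```r
library(dplyr)

# Hypothetical toy data with one row per region number
df_toy <- tibble(region_n = c(1, 2, 3, 4, 5, 6))

df_toy <- df_toy |>
  mutate(
    region = case_when(
      region_n == 1 ~ "verb-1",
      region_n == 2 ~ "verb",
      region_n == 3 ~ "verb+1",
      region_n == 4 ~ "verb+2",
      region_n == 5 ~ "verb+3",
      region_n == 6 ~ "verb+4"
    )
  )

df_toy
```

Because there is no final TRUE ~ output here, any other value of region_n would be turned into NA.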
relocate()
region is before region_n
KeyPress is after yes_press
[1] "px" "trial" "region" "region_n"
[5] "region_text" "eye" "ff" "fp"
[9] "rpd" "tt" "fix_count" "reg_in"
[13] "reg_in_count" "reg_out" "reg_out_count" "rt"
[17] "bio" "critical" "gender" "item_id"
[21] "list" "match" "condition" "name"
[25] "lifetime" "tense" "type" "yes_press"
[29] "KeyPress" "new_column" "newer_column" "accept"
[33] "accuracy"
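Moving columns as in the output above can be done with relocate(). A minimal sketch on a made-up one-row data frame, with the starting column order chosen so that both moves are visible:

```r
library(dplyr)

# Hypothetical toy data with columns in an inconvenient order
df_toy <- tibble(
  px = "px1",
  KeyPress = 4,
  region_n = 1,
  yes_press = 4,
  region = "verb-1"
)

df_toy <- df_toy |>
  relocate(region, .before = region_n) |>  # put region directly before region_n
  relocate(KeyPress, .after = yes_press)   # put KeyPress directly after yes_press

names(df_toy)
```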
group_by() and ungroup()
Group data by certain variable(s)
[1] 0.26 0.90
.by
mutate() also takes .by = as an argument
instead of group_by()/ungroup()
requires dplyr version 1.1.0 or later
[1] 0.26 0.90
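The two grouping styles can be compared side by side. This sketch computes a by-participant mean accuracy on invented data; the column name px_accuracy matches the one in the data, but how it was actually computed in class is an assumption. The .by argument needs dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data: two participants, two trials each
df_toy <- tibble(
  px = c("px1", "px1", "px2", "px2"),
  accuracy = c(1, 0, 1, 1)
)

# Option 1: group, compute, then ungroup
grouped <- df_toy |>
  group_by(px) |>
  mutate(px_accuracy = mean(accuracy)) |>
  ungroup()

# Option 2: per-operation grouping with .by (no ungroup() needed)
by_arg <- df_toy |>
  mutate(px_accuracy = mean(accuracy), .by = px)

# Range of the by-participant means
range(by_arg$px_accuracy)
```

With .by the grouping applies only to that one mutate() call, so there is no grouped data frame to forget about later.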
separate()
unite()
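separate() and unite() come from the tidyr package. The sketch below splits a name column into First and Last and then pastes them back together; the toy data and the column name full_name are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
df_toy <- tibble(name = c("Edith Piaf", "Grace Kelly"))

# Split name at the space into two new columns, keeping the original
df_sep <- df_toy |>
  separate(name, into = c("First", "Last"), sep = " ", remove = FALSE)

# Paste First and Last back together into a single column
df_uni <- df_sep |>
  unite("full_name", First, Last, sep = " ")
```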
select()
assigning the result back to the same object (<-) will remove all other columns (df <- df |> select(...))
# A tibble: 10 × 1
px
<chr>
1 px3
2 px3
3 px3
4 px3
5 px3
6 px3
7 px3
8 px3
9 px3
10 px3
select(-)
# A tibble: 10 × 34
region region_n region_text eye ff fp rpd tt fix_count reg_in
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 filler 1 He owned innu… RIGHT 0 0 0 0 0 0
2 filler 1 She is a moth… RIGHT 0 0 0 0 0 0
3 verb-1 1 She RIGHT 0 0 0 0 0 0
4 verb 2 will perform RIGHT 0 0 0 0 0 0
5 verb+1 3 in prestigiou… RIGHT 0 0 0 0 0 0
6 verb+2 4 in the future, RIGHT 0 0 0 0 0 0
7 verb+3 5 as reported i… RIGHT 0 0 0 0 0 0
8 verb+4 6 as reported i… RIGHT 0 0 0 0 0 0
9 filler 1 He interviewe… RIGHT 0 0 0 0 0 0
10 verb-1 1 She RIGHT 0 0 0 0 0 0
# ℹ 24 more variables: reg_in_count <dbl>, reg_out <dbl>, reg_out_count <dbl>,
# rt <dbl>, bio <chr>, critical <chr>, gender <chr>, item_id <dbl>,
# list <dbl>, match <chr>, condition <chr>, name <chr>, First <chr>,
# Last <chr>, lifetime <chr>, tense <chr>, type <chr>, yes_press <dbl>,
# KeyPress <dbl>, new_column <chr>, newer_column <chr>, accept <dbl>,
# accuracy <dbl>, px_accuracy <dbl>
Select criteria
You can also use criteria for select:
select(starts_with("x")): select columns that start with a character string
select(ends_with("x")): select columns that end with a character string
select(contains("x")): select columns that contain a character string
select(num_range("prefix", 10:20)): select columns with a prefix followed by a range of values
Remove the example variables we created with mutate: new_column, newer_column, First, Last
[1] "px" "trial" "region" "region_n"
[5] "region_text" "eye" "ff" "fp"
[9] "rpd" "tt" "fix_count" "reg_in"
[13] "reg_in_count" "reg_out" "reg_out_count" "rt"
[17] "bio" "critical" "gender" "item_id"
[21] "list" "match" "condition" "name"
[25] "lifetime" "tense" "type" "yes_press"
[29] "KeyPress" "accept" "accuracy" "px_accuracy"
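The different uses of select() covered in this part can be sketched on a small made-up data frame: keeping a column, dropping a column, selecting by a name pattern, and removing the example variables.

```r
library(dplyr)

# Hypothetical toy data
df_toy <- tibble(
  px = "px1",
  trial = 1,
  reg_in = 0,
  reg_out = 1,
  new_column = "new",
  newer_column = "px1"
)

only_px  <- df_toy |> select(px)                          # keep only px
no_px    <- df_toy |> select(-px)                         # keep everything except px
reg_cols <- df_toy |> select(starts_with("reg"))          # keep columns starting with "reg"
cleaned  <- df_toy |> select(-new_column, -newer_column)  # drop the example columns

names(cleaned)
```

None of these calls change df_toy itself; the columns are only gone for good if the result is assigned back to the same name.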
filter()
uses logical operators (==, !=, >, <, |)
to test equality, == is needed (not =)
# A tibble: 8 × 32
px trial region region_n region_text eye ff fp rpd tt
<chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 px3 1 filler 1 He owned innumerabl… RIGHT 0 0 0 0
2 px5 1 filler 1 She is a mother of … RIGHT 145 1603 1603 1603
3 px6 1 filler 1 He is a father of t… RIGHT 147 1224 1224 1224
4 px2 1 filler 1 She made innumerabl… RIGHT 84 1829 1829 1829
5 px7 1 filler 1 In the '70s, he own… RIGHT 138 2456 2456 2456
6 px1 1 filler 1 Beloved morning sho… RIGHT 160 1708 1708 1708
7 px8 1 filler 1 She was a mother of… RIGHT 220 806 806 806
8 px4 1 filler 1 In the '70s, he own… LEFT 171 3557 3557 3557
# ℹ 22 more variables: fix_count <dbl>, reg_in <dbl>, reg_in_count <dbl>,
# reg_out <dbl>, reg_out_count <dbl>, rt <dbl>, bio <chr>, critical <chr>,
# gender <chr>, item_id <dbl>, list <dbl>, match <chr>, condition <chr>,
# name <chr>, lifetime <chr>, tense <chr>, type <chr>, yes_press <dbl>,
# KeyPress <dbl>, accept <dbl>, accuracy <dbl>, px_accuracy <dbl>
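A few filter() calls on made-up data, showing a single condition and combined conditions. The exact call that produced the output above is not shown on the slide, so these are illustrations only.

```r
library(dplyr)

# Hypothetical toy data
df_toy <- tibble(
  px = c("px1", "px1", "px2", "px2"),
  trial = c(1, 2, 1, 2),
  type = c("filler", "critical", "filler", "critical")
)

# Keep only the first trial of each participant
first_trials <- df_toy |> filter(trial == 1)

# Both conditions must hold
px1_crit <- df_toy |> filter(px == "px1" & type == "critical")

# At least one condition must hold
either <- df_toy |> filter(px == "px1" | trial == 1)
```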
filter()
What are these code chunks doing?
df_crit that includes only critical trials
df_fill that includes only filler trials
type
# A tibble: 6 × 1
type
<chr>
1 critical
2 critical
3 critical
4 critical
5 critical
6 critical
distinct()
like filter(), but for distinct values of a variable
# A tibble: 639 × 2
px name
<chr> <chr>
1 px3 Edith Piaf
2 px3 Aaliyah
3 px3 David Beckham
4 px3 Jana Novotna
5 px3 Grace Kelly
6 px3 Nigella Lawson
7 px3 Coco Chanel
8 px3 Ben Kingsley
9 px3 Jim Carrey
10 px3 Judy Garland
# ℹ 629 more rows
# A tibble: 639 × 32
px trial region region_n region_text eye ff fp rpd tt
<chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 px3 3 verb-1 1 She RIGHT 0 0 0 0
2 px3 5 verb-1 1 She RIGHT 0 0 0 0
3 px3 8 verb-1 1 He RIGHT 0 0 0 0
4 px3 10 verb-1 1 She RIGHT 0 0 0 0
5 px3 13 verb-1 1 She RIGHT 0 0 0 0
6 px3 16 verb-1 1 She RIGHT 0 0 0 0
7 px3 18 verb-1 1 She RIGHT 0 0 0 0
8 px3 21 verb-1 1 He RIGHT 0 0 0 0
9 px3 23 verb-1 1 He RIGHT 0 0 0 0
10 px3 26 verb-1 1 She RIGHT 0 0 0 0
# ℹ 629 more rows
# ℹ 22 more variables: fix_count <dbl>, reg_in <dbl>, reg_in_count <dbl>,
# reg_out <dbl>, reg_out_count <dbl>, rt <dbl>, bio <chr>, critical <chr>,
# gender <chr>, item_id <dbl>, list <dbl>, match <chr>, condition <chr>,
# name <chr>, lifetime <chr>, tense <chr>, type <chr>, yes_press <dbl>,
# KeyPress <dbl>, accept <dbl>, accuracy <dbl>, px_accuracy <dbl>
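The difference between the two outputs above (2 columns versus all columns) is consistent with the .keep_all argument of distinct(). A sketch on invented data, assuming that is how they were produced:

```r
library(dplyr)

# Hypothetical toy data: px1 saw name "A" in two regions
df_toy <- tibble(
  px = c("px1", "px1", "px1", "px2"),
  name = c("A", "A", "B", "A"),
  region_n = c(1, 2, 1, 1)
)

# Unique combinations of px and name; only those columns are returned
pairs <- df_toy |> distinct(px, name)

# Same rows, but all other columns are kept
# (values come from the first row of each combination)
full <- df_toy |> distinct(px, name, .keep_all = TRUE)
```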
arrange()
# A tibble: 639 × 4
px trial name condition
<chr> <dbl> <chr> <chr>
1 px1 3 Amy Winehouse deadPP
2 px1 5 John Wayne deadPP
3 px1 8 Abraham Lincoln deadPP
4 px1 10 Helen Mirren livingSF
5 px1 13 Paul McCartney livingSF
6 px1 16 Ariana Grande livingPP
7 px1 18 Kate Middleton livingSF
8 px1 21 Johan Cruyff deadSF
9 px1 23 Marilyn Monroe deadPP
10 px1 26 Biggie Smalls deadSF
# ℹ 629 more rows
# A tibble: 639 × 4
px trial name condition
<chr> <dbl> <chr> <chr>
1 px8 3 Whitney Houston deadPP
2 px8 5 Elton John livingSF
3 px8 8 Jackie Chan livingPP
4 px8 10 Romy Schneider deadPP
5 px8 13 James Cameron livingSF
6 px8 16 Ella Fitzgerald deadSF
7 px8 18 Kathryn Hepburn deadPP
8 px8 21 Kate Middleton livingPP
9 px8 23 Janis Joplin deadPP
10 px8 26 Serena Williams livingSF
# ℹ 629 more rows
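Sorting in ascending and descending order, as in the two outputs above, can be sketched like this. The toy data are invented and the original slide code may have differed.

```r
library(dplyr)

# Hypothetical toy data in no particular order
df_toy <- tibble(
  px = c("px2", "px1", "px1"),
  trial = c(3, 5, 3)
)

# Ascending by px, then by trial within px
asc <- df_toy |> arrange(px, trial)

# Descending by px (wrap the column in desc()), trial still ascending
desc_px <- df_toy |> arrange(desc(px), trial)
```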
wrangle | have a long dispute |
data wrangling | tidying and transforming your data |
tidy data | data where each column is a variable and each row is an observation |
the tidyverse | a group of packages for tidy data |
dplyr | a package within the tidyverse for data wrangling |
pipe operator (|> or %>%) | operational function, passes the result of one function/argument to the next |
logical operators | compare values of two arguments: &, |, ==, !=, >, < |
read_csv() | read in a csv as a tibble (from readr package) |
rename() | rename variables |
relocate() | move variables |
mutate() | change or create new variables |
if_else() | condition for `mutate()` |
case_when() | handle multiple conditions for `mutate()` |
group_by() | group by a certain variable |
select() | keep (or exclude) certain variables |
filter() | keep (or exclude) rows based on some criteria |
distinct() | keep rows with distinct value of given variable(s) |
arrange() | sort variable(s) in ascending or descending order |
separate() | split a variable into multiple variables |
pivot_longer() | make wide data longer |
pivot_wider() | make long data wider |
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 magick_2.8.3
loaded via a namespace (and not attached):
[1] bit_4.0.5 gtable_0.3.5 jsonlite_1.8.8 crayon_1.5.2
[5] compiler_4.4.0 renv_1.0.7 tidyselect_1.2.1 Rcpp_1.0.12
[9] parallel_4.4.0 scales_1.3.0 yaml_2.3.8 fastmap_1.2.0
[13] here_1.0.1 R6_2.5.1 generics_0.1.3 knitr_1.47
[17] munsell_0.5.1 rprojroot_2.0.4 tzdb_0.4.0 pillar_1.9.0
[21] rlang_1.1.4 utf8_1.2.4 stringi_1.8.4 xfun_0.44
[25] bit64_4.0.5 timechange_0.3.0 cli_3.6.2 withr_3.0.0
[29] magrittr_2.0.3 digest_0.6.35 grid_4.4.0 vroom_1.6.5
[33] rstudioapi_0.16.0 hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5
[37] evaluate_0.23 glue_1.7.0 fansi_1.0.6 colorspace_2.1-0
[41] rmarkdown_2.27 tools_4.4.0 pkgconfig_2.0.3 htmltools_0.5.8.1