library(tidyverse)
Data dictionary
A data dictionary is (usually) a table defining each variable in a dataset, often including information like data type (continuous, categorical, etc.) and a description of its values. It’s helpful for collaborators, others trying to reproduce your analyses, and of course future-you.
Data dictionary in R
I first started creating data dictionaries after attending a free Coding Club session on creating an R package, given by Lisa DeBruine.
You’ll need the package dplyr from the tidyverse.
I prefer to load packages using the function p_load from the pacman package. It installs any packages you don’t already have and then loads them; packages that are already installed are simply loaded.
# install.packages("pacman")
pacman::p_load(tidyverse)
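If you would rather not depend on pacman, the install-if-missing behaviour described above can be roughly approximated in base R. This is only a minimal sketch: load_or_install is a hypothetical helper name, not a pacman function, and it handles a single package name at a time.

```r
# Hypothetical helper (not part of pacman): install a package only if it is
# missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is installed by this call
load_or_install("stats")
```

In a real script you would pass the package you need, for example load_or_install("dplyr").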
Step 1: Load in data
df_penguins <- palmerpenguins::penguins
Step 2: Get variable names
dput(names(df_penguins))
c("species", "island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm",
"body_mass_g", "sex", "year")
Copy the list of names (without c( and )) and place them in tribble() from dplyr (re-exported from tibble). Add variable headers ~variable and ~description, or whatever you like.
tribble(
~ "variable", ~ "description",
"species", "island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm",
"body_mass_g", "sex", "year"
)
Highlight the variable names and use the keyboard shortcut Cmd/Ctrl+Shift+A to format the list, making a single row for each. Then add "", after each variable name. Now we’ve got all our variable names, which will appear under the header variable, and an empty string for each where we can write the variable description that will appear under the header description. Try printing this to see how the table looks.
tribble(
~ "variable", ~ "description",
"species", "",
"island", "",
"bill_length_mm", "",
"bill_depth_mm", "",
"flipper_length_mm", "",
"body_mass_g", "",
"sex", "",
"year", ""
)
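For datasets with many variables, the copy, paste and reformat steps above could also be scripted. The following is a minimal sketch in base R, not the method used in this post: it builds the same skeleton as text and prints it so you can paste it into your script. The names vars, rows and skeleton are made up for illustration.

```r
# Variable names as printed by dput(names(df_penguins)) in Step 2
vars <- c("species", "island", "bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g", "sex", "year")

# One row per variable: the quoted name, then an empty description
rows <- sprintf('  "%s", "",', vars)

# Drop the trailing comma from the last row
rows[length(rows)] <- sub(",$", "", rows[length(rows)])

# Wrap the rows in a tribble() call with the two column headers
skeleton <- c("tribble(", "  ~variable, ~description,", rows, ")")

# Print the skeleton so it can be pasted into a script
cat(skeleton, sep = "\n")
```

With a data frame already loaded, vars could instead be set to names(df_penguins).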
Step 3: Create table
Fill in all the variable descriptions. Then feed the tibble through your favourite table formatter. I’ll use kable() from knitr and kable_styling() from kableExtra for some extra HTML formatting. You might also try gt() from the gt package.
dplyr::tribble(
  ~ "variable", ~ "description",
  "species", "penguin species",
  "island", "island penguin lives on",
  "bill_length_mm", "length of bill in millimeters",
  "bill_depth_mm", "depth of bill in millimeters",
  "flipper_length_mm", "length of flipper in millimeters",
  "body_mass_g", "penguin weight in grams",
  "sex", "penguin sex",
  "year", "year of data collection"
) |>
  knitr::kable() |>
  kableExtra::kable_styling()
variable | description |
---|---|
species | penguin species |
island | island penguin lives on |
bill_length_mm | length of bill in millimeters |
bill_depth_mm | depth of bill in millimeters |
flipper_length_mm | length of flipper in millimeters |
body_mass_g | penguin weight in grams |
sex | penguin sex |
year | year of data collection |
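Because the dictionary is typed by hand, it can drift out of sync with the data. Below is a minimal base R sketch of two optional extras: checking that every variable is documented, and adding the data type mentioned at the top of this post. It assumes a tiny stand-in data frame (df_example, with illustrative values) and a plain data.frame dictionary (dict) so that it runs without extra packages; both names are hypothetical.

```r
# Tiny stand-in for df_penguins (values are for illustration only)
df_example <- data.frame(
  species = factor(c("Adelie", "Gentoo")),
  bill_length_mm = c(39.1, 46.1),
  year = c(2007L, 2008L)
)

# A matching dictionary, kept as a plain data.frame for this sketch
dict <- data.frame(
  variable = c("species", "bill_length_mm", "year"),
  description = c("penguin species",
                  "length of bill in millimeters",
                  "year of data collection")
)

# Variables in the data that the dictionary does not describe
undocumented <- setdiff(names(df_example), dict$variable)

# Dictionary entries that no longer exist in the data
stale <- setdiff(dict$variable, names(df_example))

# Look up the first class of each column and add it as a type column
types <- vapply(df_example, function(x) class(x)[1], character(1))
dict$type <- unname(types[dict$variable])

print(dict)
```

If undocumented or stale is not empty, the dictionary needs updating before you render the table.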