DAL tutorial - Week 7

Data tidying and joining

1 Pivoting and joining data

For this tutorial, you should go through the following chapters of the R4DS textbook:

R4DS Ch 5 on tidying data.
R4DS Ch 19 on joining data.

You can read them in any order you like (even jump from one to the other). Make sure you try the code out yourself and feel free to try the exercises as well (you can find the solutions here.

These chapters cover functions and methods that you will need to complete Exercise 2 of Summative Assessment 1.

2 Reading multiple files at once

Another important skill to learn is how to read multiple files at once and save the output into a single tibble/data frame.

This can be achieved with the list.files() function.

For example, let’s read individual files with tongue contours data from ultrasound tongue imaging (UTI). These files are in data/coretta2018/ultrasound/.

We can list all files like so:

list.files("data/coretta2018/ultrasound")

 [1] "it01-tongue-cart.tsv"  "it01-vowel-series.tsv" "it02-tongue-cart.tsv" 
 [4] "it02-vowel-series.tsv" "it03-tongue-cart.tsv"  "it03-vowel-series.tsv"
 [7] "it04-tongue-cart.tsv"  "it04-vowel-series.tsv" "it05-tongue-cart.tsv" 
[10] "it05-vowel-series.tsv" "it07-tongue-cart.tsv"  "it07-vowel-series.tsv"
[13] "it09-tongue-cart.tsv"  "it09-vowel-series.tsv" "it11-tongue-cart.tsv" 
[16] "it11-vowel-series.tsv" "it12-tongue-cart.tsv"  "it12-vowel-series.tsv"
[19] "it13-tongue-cart.tsv"  "it13-vowel-series.tsv" "it14-tongue-cart.tsv" 
[22] "it14-vowel-series.tsv" "pl02-tongue-cart.tsv"  "pl02-vowel-series.tsv"
[25] "pl03-tongue-cart.tsv"  "pl03-vowel-series.tsv" "pl04-tongue-cart.tsv" 
[28] "pl04-vowel-series.tsv" "pl05-tongue-cart.tsv"  "pl05-vowel-series.tsv"
[31] "pl06-tongue-cart.tsv"  "pl06-vowel-series.tsv" "pl07-tongue-cart.tsv" 
[34] "pl07-vowel-series.tsv"

By default, the list.files() function returns just the name of the file, but we need the full path for the files to be read in R.

list.files("data/coretta2018/ultrasound", full.names = TRUE)

 [1] "data/coretta2018/ultrasound/it01-tongue-cart.tsv" 
 [2] "data/coretta2018/ultrasound/it01-vowel-series.tsv"
 [3] "data/coretta2018/ultrasound/it02-tongue-cart.tsv" 
 [4] "data/coretta2018/ultrasound/it02-vowel-series.tsv"
 [5] "data/coretta2018/ultrasound/it03-tongue-cart.tsv" 
 [6] "data/coretta2018/ultrasound/it03-vowel-series.tsv"
 [7] "data/coretta2018/ultrasound/it04-tongue-cart.tsv" 
 [8] "data/coretta2018/ultrasound/it04-vowel-series.tsv"
 [9] "data/coretta2018/ultrasound/it05-tongue-cart.tsv" 
[10] "data/coretta2018/ultrasound/it05-vowel-series.tsv"
[11] "data/coretta2018/ultrasound/it07-tongue-cart.tsv" 
[12] "data/coretta2018/ultrasound/it07-vowel-series.tsv"
[13] "data/coretta2018/ultrasound/it09-tongue-cart.tsv" 
[14] "data/coretta2018/ultrasound/it09-vowel-series.tsv"
[15] "data/coretta2018/ultrasound/it11-tongue-cart.tsv" 
[16] "data/coretta2018/ultrasound/it11-vowel-series.tsv"
[17] "data/coretta2018/ultrasound/it12-tongue-cart.tsv" 
[18] "data/coretta2018/ultrasound/it12-vowel-series.tsv"
[19] "data/coretta2018/ultrasound/it13-tongue-cart.tsv" 
[20] "data/coretta2018/ultrasound/it13-vowel-series.tsv"
[21] "data/coretta2018/ultrasound/it14-tongue-cart.tsv" 
[22] "data/coretta2018/ultrasound/it14-vowel-series.tsv"
[23] "data/coretta2018/ultrasound/pl02-tongue-cart.tsv" 
[24] "data/coretta2018/ultrasound/pl02-vowel-series.tsv"
[25] "data/coretta2018/ultrasound/pl03-tongue-cart.tsv" 
[26] "data/coretta2018/ultrasound/pl03-vowel-series.tsv"
[27] "data/coretta2018/ultrasound/pl04-tongue-cart.tsv" 
[28] "data/coretta2018/ultrasound/pl04-vowel-series.tsv"
[29] "data/coretta2018/ultrasound/pl05-tongue-cart.tsv" 
[30] "data/coretta2018/ultrasound/pl05-vowel-series.tsv"
[31] "data/coretta2018/ultrasound/pl06-tongue-cart.tsv" 
[32] "data/coretta2018/ultrasound/pl06-vowel-series.tsv"
[33] "data/coretta2018/ultrasound/pl07-tongue-cart.tsv" 
[34] "data/coretta2018/ultrasound/pl07-vowel-series.tsv"

You see now the full path is return, relative to the Quarto Project directory.

In our case, we really just want to read the *-tongue-cart.tsv files, so we can specify a regular expression to list only those files that contain -tongue-cart.tsv.

list.files("data/coretta2018/ultrasound", full.names = TRUE, pattern = "*-tongue-cart.tsv")

 [1] "data/coretta2018/ultrasound/it01-tongue-cart.tsv"
 [2] "data/coretta2018/ultrasound/it02-tongue-cart.tsv"
 [3] "data/coretta2018/ultrasound/it03-tongue-cart.tsv"
 [4] "data/coretta2018/ultrasound/it04-tongue-cart.tsv"
 [5] "data/coretta2018/ultrasound/it05-tongue-cart.tsv"
 [6] "data/coretta2018/ultrasound/it07-tongue-cart.tsv"
 [7] "data/coretta2018/ultrasound/it09-tongue-cart.tsv"
 [8] "data/coretta2018/ultrasound/it11-tongue-cart.tsv"
 [9] "data/coretta2018/ultrasound/it12-tongue-cart.tsv"
[10] "data/coretta2018/ultrasound/it13-tongue-cart.tsv"
[11] "data/coretta2018/ultrasound/it14-tongue-cart.tsv"
[12] "data/coretta2018/ultrasound/pl02-tongue-cart.tsv"
[13] "data/coretta2018/ultrasound/pl03-tongue-cart.tsv"
[14] "data/coretta2018/ultrasound/pl04-tongue-cart.tsv"
[15] "data/coretta2018/ultrasound/pl05-tongue-cart.tsv"
[16] "data/coretta2018/ultrasound/pl06-tongue-cart.tsv"
[17] "data/coretta2018/ultrasound/pl07-tongue-cart.tsv"

There’s another catch. These files don’t have column headings! We need to supply them ourselves as a character vector to the col_names argument of read_tsv(). Alternatively you can set that to FALSE and automatic column names will be created for you.

Finally, we might want to create a new column on the fly which has the file path. This is helpful when the files you are reading don’t have a column that allows you to distinguish data from different files (in these files the first column do this for us).

You can create a new column with the path by specifying a name for this new column as the value of the id argument. With id = "file" a new column called file will be created with the path of the file.

files <- list.files(
  "data/coretta2018/ultrasound",
  full.names = TRUE,
  pattern = "*-tongue-cart.tsv"
)

# Column names of the first 14 columns. The rest of the columns are X and Y
# coordinates of tongue contours of 42 points along the contour:
# X1,Y1,X2,Y2,X3,Y3,...,X42,Y42.
#
# Note that R automatically names unnamed columns with X followed by
# the column number, so the 84 coordinate columns will be all named Xn.
columns <- c(
  "speaker",
  "seconds",
  "rec_date",
  "prompt",
  "label",
  "TT_displacement_sm",
  "TT_velocity",
  "TT_velocity_abs",
  "TD_displacement_sm",
  "TD_velocity",
  "TD_velocity_abs",
  "TR_displacement_sm",
  "TR_velocity",
  "TR_velocity_abs"
)

tongue <- read_tsv(files, id = "file", col_names = columns, na = "*")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 7598 Columns: 99
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (4): speaker, rec_date, prompt, label
dbl (94): seconds, TT_displacement_sm, TT_velocity, TT_velocity_abs, TD_disp...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tongue

If you are wondering what na = "*" does, it just tells R that cells with * in them should be treated as NAs.

Fab! Now why don’t you practice pivoting and change the tongue tibble from a wide format to a long format where each row has the X,Y coordinates of a single point out of the 42 points?

I will give you a head start by renaming the columns with the X,Y coordinates for you. Make sure you inspect the code and understand what it does.

new_col_names <- paste0(rep(c("X", "Y"), 42), "_", rep(1:42, each = 2))

# The square brakets [] are a way of extracting specific items from a vector.
# This method is called "indexing" and specifically this is the "braket indexing"
# method. See the Box below for a more detailed explanation.
existing_col_names <- colnames(tongue)[1:15]

colnames(tongue) <- c(existing_col_names, new_col_names)

Now transform the data so that you end up with three new columns that replace all the X_n, Y_n columns: point with the point number (1 to 42), X with the x coordinate and Y with the y coordinate. You will need to use the separate() function to separate the X_n, Y_n names into X and n and Y and n.

Hint

Follow this workflow:

Pivot all the Xn,Yn columns from wide to longer.
Separate the column that has X_n’s and X_y’s into two columns: one that has X or Y and one with the point number 1 to 42. You can use the separate() function to do so. Check ?separate.
Now pivot from long to wider so that you get an X column and a Y column

It should look like this (only relevant columns shown):

Solution

tongue |> 
  pivot_longer(X_1:Y_42, names_to = "coord", values_to = "value") |> 
  drop_na(value) |> 
  separate(coord, c("axis", "point")) |> 
  pivot_wider(names_from = axis, values_from = value) |> 
  select(prompt, point, X, Y)