Pivoting

Learn how to change the format of data tables from long to wide and viceversa using the pivot functions from tidyr
Author

Stefano Coretta

Published

July 15, 2024

1 Pivoting

Pivoting is the process of changing the shape of the data from a long format to a wide format and vice versa. You can find nice animations on Garrick’s website that illustrate the process.

Ch 5 of the R4DS book introduces the concept of long and wide formats and teaches you how to pivot data using the pivot_wider() and pivot_longer() functions from tidyr (a tidyverse package).

Make sure you try the code out yourself and feel free to try the exercises as well (you can find the solutions here).

The documentation of the pivot functions is also a good place to learn about how they work: https://tidyr.tidyverse.org/articles/pivot.html.

2 Practice

You can practice with the coretta2018/ultrasound/ data which includes a set of files with data from ultrasound tongue imaging of Italian and Polish speakers.

files <- list.files(
  "data/coretta2018/ultrasound",
  full.names = TRUE,
  pattern = "*-tongue-cart.tsv"
)

# Column names of the first 14 columns. The rest of the columns are X and Y
# coordinates of tongue contours of 42 points along the contour:
# X1,Y1,X2,Y2,X3,Y3,...,X42,Y42.
#
# Note that R automatically names unnamed columns with X followed by
# the column number, so the 84 coordinate columns will be all named Xn.
columns <- c(
  "speaker",
  "seconds",
  "rec_date",
  "prompt",
  "label",
  "TT_displacement_sm",
  "TT_velocity",
  "TT_velocity_abs",
  "TD_displacement_sm",
  "TD_velocity",
  "TD_velocity_abs",
  "TR_displacement_sm",
  "TR_velocity",
  "TR_velocity_abs"
)

tongue <- read_tsv(files, id = "file", col_names = columns, na = "*")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 7598 Columns: 99
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (4): speaker, rec_date, prompt, label
dbl (94): seconds, TT_displacement_sm, TT_velocity, TT_velocity_abs, TD_disp...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tongue

If you are wondering what na = "*" does, it just tells R that cells with * in them should be treated as NAs.

Fab! Now why don’t you practice pivoting and change the tongue tibble from a wide format to a long format where each row has the X,Y coordinates of a single point out of the 42 points?

I will give you a head start by renaming the columns with the X,Y coordinates for you. Make sure you inspect the code and understand what it does.

new_col_names <- paste0(rep(c("X", "Y"), 42), "_", rep(1:42, each = 2))

# The square brakets [] are a way of extracting specific items from a vector.
# This method is called "indexing" and specifically this is the "braket indexing"
# method. See the Box below for a more detailed explanation.
existing_col_names <- colnames(tongue)[1:15]

colnames(tongue) <- c(existing_col_names, new_col_names)

Now transform the data so that you end up with three new columns that replace all the X_n, Y_n columns: point with the point number (1 to 42), X with the x coordinate and Y with the y coordinate. You will need to use the separate() function to separate the X_n, Y_n names into X and n and Y and n.

Follow this workflow:

  • Pivot all the Xn,Yn columns from wide to longer.

  • Separate the column that has X_n’s and X_y’s into two columns: one that has X or Y and one with the point number 1 to 42. You can use the separate() function to do so. Check ?separate.

  • Now pivot from long to wider so that you get an X column and a Y column

It should look like this (only relevant columns shown):

tongue |> 
  pivot_longer(X_1:Y_42, names_to = "coord", values_to = "value") |> 
  drop_na(value) |> 
  separate(coord, c("axis", "point")) |> 
  pivot_wider(names_from = axis, values_from = value) |> 
  select(prompt, point, X, Y)