Reading multiple data files

Learn how to read multiple data files in R

Published

July 10, 2024

Prerequisites

1 Reading multiple files at once

An important skill to learn is how to read multiple files at once and save the output into a single tibble/data frame.

This can be achieved by listing the files with the list.files() function and passing the resulting paths to a reading function such as read_tsv().

For example, let’s read individual files with tongue contour data from ultrasound tongue imaging (UTI). These files are in data/coretta2018/ultrasound/.

We can list all files like so:

list.files("data/coretta2018/ultrasound")
 [1] "it01-tongue-cart.tsv"  "it01-vowel-series.tsv" "it02-tongue-cart.tsv" 
 [4] "it02-vowel-series.tsv" "it03-tongue-cart.tsv"  "it03-vowel-series.tsv"
 [7] "it04-tongue-cart.tsv"  "it04-vowel-series.tsv" "it05-tongue-cart.tsv" 
[10] "it05-vowel-series.tsv" "it07-tongue-cart.tsv"  "it07-vowel-series.tsv"
[13] "it09-tongue-cart.tsv"  "it09-vowel-series.tsv" "it11-tongue-cart.tsv" 
[16] "it11-vowel-series.tsv" "it12-tongue-cart.tsv"  "it12-vowel-series.tsv"
[19] "it13-tongue-cart.tsv"  "it13-vowel-series.tsv" "it14-tongue-cart.tsv" 
[22] "it14-vowel-series.tsv" "pl02-tongue-cart.tsv"  "pl02-vowel-series.tsv"
[25] "pl03-tongue-cart.tsv"  "pl03-vowel-series.tsv" "pl04-tongue-cart.tsv" 
[28] "pl04-vowel-series.tsv" "pl05-tongue-cart.tsv"  "pl05-vowel-series.tsv"
[31] "pl06-tongue-cart.tsv"  "pl06-vowel-series.tsv" "pl07-tongue-cart.tsv" 
[34] "pl07-vowel-series.tsv"

By default, the list.files() function returns only the file names, but we need the full paths for the files to be read in R.

list.files("data/coretta2018/ultrasound", full.names = TRUE)
 [1] "data/coretta2018/ultrasound/it01-tongue-cart.tsv" 
 [2] "data/coretta2018/ultrasound/it01-vowel-series.tsv"
 [3] "data/coretta2018/ultrasound/it02-tongue-cart.tsv" 
 [4] "data/coretta2018/ultrasound/it02-vowel-series.tsv"
 [5] "data/coretta2018/ultrasound/it03-tongue-cart.tsv" 
 [6] "data/coretta2018/ultrasound/it03-vowel-series.tsv"
 [7] "data/coretta2018/ultrasound/it04-tongue-cart.tsv" 
 [8] "data/coretta2018/ultrasound/it04-vowel-series.tsv"
 [9] "data/coretta2018/ultrasound/it05-tongue-cart.tsv" 
[10] "data/coretta2018/ultrasound/it05-vowel-series.tsv"
[11] "data/coretta2018/ultrasound/it07-tongue-cart.tsv" 
[12] "data/coretta2018/ultrasound/it07-vowel-series.tsv"
[13] "data/coretta2018/ultrasound/it09-tongue-cart.tsv" 
[14] "data/coretta2018/ultrasound/it09-vowel-series.tsv"
[15] "data/coretta2018/ultrasound/it11-tongue-cart.tsv" 
[16] "data/coretta2018/ultrasound/it11-vowel-series.tsv"
[17] "data/coretta2018/ultrasound/it12-tongue-cart.tsv" 
[18] "data/coretta2018/ultrasound/it12-vowel-series.tsv"
[19] "data/coretta2018/ultrasound/it13-tongue-cart.tsv" 
[20] "data/coretta2018/ultrasound/it13-vowel-series.tsv"
[21] "data/coretta2018/ultrasound/it14-tongue-cart.tsv" 
[22] "data/coretta2018/ultrasound/it14-vowel-series.tsv"
[23] "data/coretta2018/ultrasound/pl02-tongue-cart.tsv" 
[24] "data/coretta2018/ultrasound/pl02-vowel-series.tsv"
[25] "data/coretta2018/ultrasound/pl03-tongue-cart.tsv" 
[26] "data/coretta2018/ultrasound/pl03-vowel-series.tsv"
[27] "data/coretta2018/ultrasound/pl04-tongue-cart.tsv" 
[28] "data/coretta2018/ultrasound/pl04-vowel-series.tsv"
[29] "data/coretta2018/ultrasound/pl05-tongue-cart.tsv" 
[30] "data/coretta2018/ultrasound/pl05-vowel-series.tsv"
[31] "data/coretta2018/ultrasound/pl06-tongue-cart.tsv" 
[32] "data/coretta2018/ultrasound/pl06-vowel-series.tsv"
[33] "data/coretta2018/ultrasound/pl07-tongue-cart.tsv" 
[34] "data/coretta2018/ultrasound/pl07-vowel-series.tsv"

You can see that the full path is now returned, relative to the Quarto Project directory.

In our case, we only want to read the *-tongue-cart.tsv files, so we can pass a regular expression to the pattern argument to list only the files whose names end in -tongue-cart.tsv. In the pattern, \\. matches a literal dot and $ anchors the match to the end of the file name.

list.files("data/coretta2018/ultrasound", full.names = TRUE, pattern = "-tongue-cart\\.tsv$")
 [1] "data/coretta2018/ultrasound/it01-tongue-cart.tsv"
 [2] "data/coretta2018/ultrasound/it02-tongue-cart.tsv"
 [3] "data/coretta2018/ultrasound/it03-tongue-cart.tsv"
 [4] "data/coretta2018/ultrasound/it04-tongue-cart.tsv"
 [5] "data/coretta2018/ultrasound/it05-tongue-cart.tsv"
 [6] "data/coretta2018/ultrasound/it07-tongue-cart.tsv"
 [7] "data/coretta2018/ultrasound/it09-tongue-cart.tsv"
 [8] "data/coretta2018/ultrasound/it11-tongue-cart.tsv"
 [9] "data/coretta2018/ultrasound/it12-tongue-cart.tsv"
[10] "data/coretta2018/ultrasound/it13-tongue-cart.tsv"
[11] "data/coretta2018/ultrasound/it14-tongue-cart.tsv"
[12] "data/coretta2018/ultrasound/pl02-tongue-cart.tsv"
[13] "data/coretta2018/ultrasound/pl03-tongue-cart.tsv"
[14] "data/coretta2018/ultrasound/pl04-tongue-cart.tsv"
[15] "data/coretta2018/ultrasound/pl05-tongue-cart.tsv"
[16] "data/coretta2018/ultrasound/pl06-tongue-cart.tsv"
[17] "data/coretta2018/ultrasound/pl07-tongue-cart.tsv"
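
Note that the pattern argument is interpreted as a regular expression, not as a shell-style wildcard: an unescaped . matches any character, and an unanchored pattern also matches names that merely contain the text. The following is a minimal sketch of the difference, using made-up file names (the .bak file is hypothetical and not part of the data):

```r
# Hypothetical file names, including a backup copy that should not be read
file_names <- c(
  "it01-tongue-cart.tsv",
  "it01-vowel-series.tsv",
  "it01-tongue-cart.tsv.bak"
)

# "\\." matches a literal dot and "$" anchors the match at the end of the name
strict <- "-tongue-cart\\.tsv$"
matched <- grepl(strict, file_names)
file_names[matched]

# glob2rx() (utils package) converts a shell-style wildcard into a regex
glob_regex <- glob2rx("*-tongue-cart.tsv")
file_names[grepl(glob_regex, file_names)]
```

Either pattern can be passed to the pattern argument of list.files().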

There’s another catch. These files don’t have column headings! We need to supply them ourselves as a character vector to the col_names argument of read_tsv(). Alternatively, you can set col_names to FALSE and automatic column names will be created for you.

Finally, we might want to create a new column on the fly which has the file path. This is helpful when the files you are reading don’t have a column that allows you to distinguish data from different files (in these files the first column does this for us).
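
To see the two options side by side, here is a small sketch on inline toy data rather than the ultrasound files. The column names and values are invented for illustration; I() tells readr that the string is literal data, not a file path.

```r
library(readr)

# Two rows of header-less, tab-separated data (made-up values)
raw <- I("it01\t0.5\nit02\t0.7\n")

# Option 1: supply the column names ourselves
named <- read_tsv(
  raw,
  col_names = c("speaker", "seconds"),
  show_col_types = FALSE
)

# Option 2: let readr generate the names (X1, X2, ...)
auto <- read_tsv(raw, col_names = FALSE, show_col_types = FALSE)

names(named)
names(auto)
```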

You can create a new column with the path by specifying a name for this new column as the value of the id argument. With id = "file" a new column called file will be created with the path of the file.
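
As a self-contained illustration of the id argument, the sketch below writes two tiny header-less files to a temporary directory and reads them into one tibble. The directory, file names and contents are all hypothetical.

```r
library(readr)

# Write two small header-less TSV files to a temporary directory
tmp <- file.path(tempdir(), "id-demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("a\t1\nb\t2", file.path(tmp, "first.tsv"))
writeLines("c\t3", file.path(tmp, "second.tsv"))

demo_files <- list.files(tmp, full.names = TRUE, pattern = "\\.tsv$")

# Read both files at once; the `file` column records each row's source path
demo <- read_tsv(
  demo_files,
  id = "file",
  col_names = c("item", "value"),
  show_col_types = FALSE
)

demo
```

Because the column holds the full path, basename(demo$file) can be used afterwards if only the file name is needed.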

library(readr)  # provides read_tsv()

files <- list.files(
  "data/coretta2018/ultrasound",
  full.names = TRUE,
  pattern = "-tongue-cart\\.tsv$"
)

# Names for the first 14 columns. The remaining 84 columns hold the X and Y
# coordinates of 42 points along the tongue contour, in the order
# X1, Y1, X2, Y2, ..., X42, Y42.
#
# Note that readr automatically names columns without a supplied name with X
# followed by the column number, so all 84 coordinate columns will be named Xn.
columns <- c(
  "speaker",
  "seconds",
  "rec_date",
  "prompt",
  "label",
  "TT_displacement_sm",
  "TT_velocity",
  "TT_velocity_abs",
  "TD_displacement_sm",
  "TD_velocity",
  "TD_velocity_abs",
  "TR_displacement_sm",
  "TR_velocity",
  "TR_velocity_abs"
)

tongue <- read_tsv(files, id = "file", col_names = columns, na = "*")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 7598 Columns: 99
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (4): speaker, rec_date, prompt, label
dbl (94): seconds, TT_displacement_sm, TT_velocity, TT_velocity_abs, TD_disp...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tongue
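
The warning printed by read_tsv() suggests calling problems() on the result, which for our data would be problems(tongue). The sketch below shows the idea on invented inline data with a deliberately unparsable value, so it is not a reproduction of the issues in the ultrasound files.

```r
library(readr)

# Toy data: "abc" in the first column cannot be parsed as a number
# (hypothetical example, unrelated to the ultrasound files)
dat <- read_tsv(I("x\ty\n1\ta\nabc\tb\n"), col_types = "dc")

# One row per parsing issue, describing where it occurred and what was found
issues <- problems(dat)
issues
```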

If you are wondering what na = "*" does, it tells read_tsv() that cells containing * should be treated as missing values (NA).
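
Here is a minimal sketch of the effect of na on inline toy data (the values are made up):

```r
library(readr)

# Toy tab-separated data where "*" marks a missing measurement
raw <- I("speaker\tvalue\nit01\t*\nit02\t3.5\n")

with_na <- read_tsv(raw, na = "*", show_col_types = FALSE)

# `value` is parsed as a number, with NA where the data had "*"
with_na$value
```

Keep in mind that setting na replaces the default of c("", "NA"), so if a file also uses empty cells or NA for missing values, those strings need to be listed too, for example na = c("", "NA", "*").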

Fab! Now you have a single tibble, tongue, with data from all the files!
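
If you would rather give the coordinate columns explicit names instead of relying on the automatic ones, one possible approach is to build the names programmatically and append them to the vector passed to col_names. This is only a sketch: it assumes the coordinates are interleaved as X1, Y1, ..., X42, Y42, as described in the code comment, and named_columns is a shortened stand-in for the full vector of 14 names.

```r
# Shortened stand-in for the 14 named columns, so the sketch runs on its own
named_columns <- c("speaker", "seconds")

# Interleave X and Y for points 1 to 42: X1, Y1, X2, Y2, ..., X42, Y42
coord_columns <- paste0(c("X", "Y"), rep(1:42, each = 2))

# Full vector of names that could be passed to col_names
all_columns <- c(named_columns, coord_columns)

head(coord_columns, 4)
length(coord_columns)
```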