Wrangling data with R

Learn the basics of wrangling data with R

Author

Stefano Coretta

Published

July 15, 2024

Prerequisites

Statistical variables

1 Data wrangling

Data wrangling encompasses three main steps in the R pipeline:

Importing data.
Tidying data.
Transforming data.
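
As a rough illustration of the first step, importing tabular data might look like the sketch below. It assumes the readr package from the tidyverse is installed; the file contents, column names and object names are made up for the example and are written to a temporary file only so that the code is self-contained.

```r
library(readr)

# Hypothetical example data, written to a temporary CSV file
# so that the sketch does not depend on any external file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "speaker,vowel,duration",
    "s01,a,112",
    "s01,i,95",
    "s02,a,120"
  ),
  csv_path
)

# read_csv() reads the file into a tibble and guesses the column types.
durations <- read_csv(csv_path, show_col_types = FALSE)
durations
```

With your own data you would pass the path to your file instead of `csv_path`.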

2 Tidying data

Data tidying is about reshaping the data so that they are in a tidy format.

The concept of tidy data was introduced by Wickham (2014). The following illustrations by Allison Horst explain what is meant by tidy data.

If you have control over the layout of your tabular data, I recommend using a tidy format (columns are variables and rows are observations). However, we sometimes have to work with pre-existing data that might be messy, or we need to reshape our own tidy data for a specific purpose (like making a plot that requires a different layout).

The nice thing about tidy data is that once they are tidy you can reshape them easily in whichever way you want.

The tidyverse package tidyr provides several functions for tidying up messy data. (It’s called the “tidyverse” because all of its packages are designed to work with tidy data!)

The most important procedure for tidying up data is pivoting. You can learn more about pivoting in Pivoting.
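
To give a flavour of pivoting, here is a minimal sketch using tidyr's `pivot_longer()` and `pivot_wider()`. The data and the names `wide`, `long` and `wide_again` are invented for illustration: the wide table has one column per vowel, which is not tidy if "vowel" is meant to be a variable.

```r
library(tibble)
library(tidyr)

# Hypothetical messy (wide) data: one row per speaker, one column per vowel.
wide <- tibble(
  speaker = c("s01", "s02"),
  a = c(112, 120),
  i = c(95, 101)
)

# Pivot to a tidy (long) layout: one row per speaker-vowel observation.
long <- pivot_longer(
  wide,
  cols = c(a, i),
  names_to = "vowel",
  values_to = "duration"
)

# Once the data are tidy, pivoting back to the wide layout is just as easy.
wide_again <- pivot_wider(long, names_from = vowel, values_from = duration)
```

Here `long` has the columns `speaker`, `vowel` and `duration`, so every column is a variable and every row is an observation.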

3 Transforming data

Transforming data encompasses several operations, each of which can be achieved with specific tidyverse functions.

Filter rows based on criteria applied to specific columns with filter().
Mutate columns or create new ones based on existing columns with mutate().
Summarise data with summarise().
Join data using the mutating join functions. See the Joins chapter of the R for Data Science book.
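
The operations listed above can be sketched with dplyr as follows. This is only an illustration: the two tibbles and all column and object names are hypothetical, and `left_join()` is used as one example of a mutating join.

```r
library(dplyr)
library(tibble)

# Hypothetical data: vowel durations (in milliseconds) and speaker information.
durations <- tibble(
  speaker = c("s01", "s01", "s02", "s02"),
  vowel = c("a", "i", "a", "i"),
  duration = c(112, 95, 120, 101)
)
speakers <- tibble(
  speaker = c("s01", "s02"),
  age = c(34, 27)
)

# filter(): keep only the rows where duration is above 100 ms.
long_vowels <- filter(durations, duration > 100)

# mutate(): create a new column based on an existing one (ms to seconds).
durations_s <- mutate(durations, duration_s = duration / 1000)

# summarise(): mean duration for each vowel, after grouping by vowel.
by_vowel <- group_by(durations, vowel)
vowel_means <- summarise(by_vowel, mean_duration = mean(duration))

# left_join(): a mutating join that adds the speaker columns to the durations.
joined <- left_join(durations, speakers, by = "speaker")
```

Each function takes a data frame as its first argument and returns a new one, leaving the original untouched, which is why the results are assigned to new names here.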