Filter data

Learn how to filter data with the tidyverse

Author

Stefano Coretta

Published

July 15, 2024

1 Data transformation

Data transformation is a fundamental aspect of data analysis.

Once the data you need is imported into R, you will usually have to filter rows, create new columns, or join data frames, among many other transformation operations.

In this tutorial we will learn how to filter() the data and how to mutate() existing columns or create new ones. In Week 6 (after Flexible Learning week) you will learn how to obtain summary measures and how to count occurrences using the summarise(), group_by() and count() functions.

2 Filter

Filtering data based on specific criteria couldn’t be easier with filter(), from the dplyr package (one of the tidyverse core packages).

Let’s work with the coretta2022/glot_status data frame. It’s an .rds file, so you need to use the readRDS() function. Go ahead and read the data into glot_status.
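As a minimal sketch of how readRDS() works, the example below saves a small made-up data frame to a temporary .rds file and reads it back. The toy data frame, its Name column and the temporary path are assumptions for illustration only; for the tutorial, pass the path to the glot_status file in your own project to readRDS() instead.

```r
# A made-up data frame standing in for real data (illustration only)
toy <- data.frame(
  Name = c("Language A", "Language B"),
  status = c("extinct", "threatened")
)

# Save it to a temporary .rds file, then read it back with readRDS()
tmp_path <- tempfile(fileext = ".rds")
saveRDS(toy, tmp_path)

toy_read <- readRDS(tmp_path)

# Check that the object read from disk matches the one that was saved
identical(toy, toy_read)
```

Note that readRDS() returns the object rather than creating it for you, so you have to assign the result to a name with <-.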

The glot_status data frame contains the endangerment status for 7,845 languages from Glottolog. There are thousands of languages in the world, but most of them are losing speakers, and some are already no longer spoken. The endangerment status of a language in the data is on a scale from not endangered (languages with large populations of speakers) through threatened, shifting, moribund and nearly extinct, to extinct (languages that have no living speakers left).

glot_status

Before we can move on to filtering data, we first need to learn about logical operators.

2.1 Logical operators

There are four main logical operators:

  • x == y: x equals y.

  • x != y: x is not equal to y.

  • x > y: x is greater than y.

  • x < y: x is smaller than y.

Logical operators return TRUE or FALSE depending on whether the statement they convey is true or false. Remember, TRUE and FALSE are logical values.

Try these out in the Console:

# This will return FALSE
1 == 2
[1] FALSE
# FALSE
"apples" == "oranges"
[1] FALSE
# TRUE
10 > 5
[1] TRUE
# FALSE
10 > 15
[1] FALSE
# TRUE
3 < 4
[1] TRUE
Logical operators

Logical operators are symbols that compare two objects and return either TRUE or FALSE.

The most common logical operators are ==, !=, >, and <.

Quiz 2
  1. Which of the following does not contain a logical operator?
  2. Which of the following returns c(FALSE, TRUE)?

Hint 2a.

Check for errors in the logical operators.

Hint 2b.

Run them in the console to see the output.

Explanation 2a.

The logical operator == has TWO equal signs. A single equal sign = is an alternative way of writing the assignment operator <-, so that a = 1 and a <- 1 are equivalent.

Explanation 2b.

Logical operators are “vectorised” (you will learn more about this below), i.e. they are applied element by element, comparing the elements on the two sides in pairs. If the number of elements on one side does not match that of the other side, the elements on the shorter side are recycled (reused from the start) until the lengths match.
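A short sketch of element-wise comparison and recycling may help here. The vectors are made up for illustration and use only base R.

```r
# The single value 2 is recycled, so each element on the left is compared with 2
bigger_than_two <- c(1, 5, 3) > 2
bigger_than_two

# Two vectors of the same length are compared pair by pair
same_letters <- c("a", "b", "c") == c("a", "x", "c")
same_letters

# The shorter vector c(1, 4) is recycled to c(1, 4, 1, 4) before comparing
recycled <- c(1, 2, 3, 4) == c(1, 4)
recycled

# A single = assigns a value, a double == compares two values
a = 1
a == 1
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth checking the lengths of the two sides when a comparison gives surprising results.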

Now let’s see how these work with filter()!

2.2 The filter() function

Filtering in R with the tidyverse is straightforward. You can use the filter() function.

filter() takes one or more statements with logical operators.

Let’s try this out. The following code filters the data on the status column so that only the rows whose status is extinct are included in the new data frame extinct.

You’ll notice we are using the pipe |> to pass the data into the filter() function; the output of filter() is then assigned with <- to extinct. The flow might seem a bit counter-intuitive, but you will soon get used to thinking like this when writing R code!

extinct <- glot_status |>
  filter(status == "extinct")

extinct

Neat! What if we want to include all statuses except extinct? Easy, we use the not-equal operator !=.

not_extinct <- glot_status |>
  filter(status != "extinct")

not_extinct

And if we want only non-extinct languages from South America? We can include multiple statements separated by a comma!

south_america <- glot_status |>
  filter(status != "extinct", Macroarea == "South America")

south_america

Combining statements like this will give you only those rows where both conditions apply. You can add as many statements as you need.
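To see that statements separated by a comma behave like a logical "and", here is a small sketch on a made-up tibble. The rows and the Name column are assumptions for illustration; only the status and Macroarea column names follow glot_status.

```r
library(dplyr)

# Made-up data for illustration; not taken from glot_status
toy <- tibble(
  Name = c("A", "B", "C", "D"),
  status = c("extinct", "shifting", "threatened", "extinct"),
  Macroarea = c("Eurasia", "South America", "Africa", "South America")
)

# Two statements separated by a comma...
with_comma <- toy |>
  filter(status != "extinct", Macroarea == "South America")

# ...keep the same rows as one statement joined with & ("and")
with_and <- toy |>
  filter(status != "extinct" & Macroarea == "South America")

with_comma

# Check that both versions return the same rows
all.equal(with_comma, with_and)
```

Only the row that satisfies both conditions is kept: a row that is extinct, or that is outside South America, is dropped.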

Now try to filter the data so that you include only not_endangered languages from all macro-areas except Eurasia. This time, don’t save the output to a new data frame. What happens? Where is the output shown?

glot_status |>
  filter(...)

This is all great, but what if we want to include more than one status or macro-area?

To do that we need another operator: %in%.

2.3 The %in% operator

%in%

The %in% operator is a special logical operator that returns TRUE if the value to the left of the operator is one of the values in the vector to its right, and FALSE if not.

Try these in the Console:

# TRUE
5 %in% c(1, 2, 5, 7)
[1] TRUE
# FALSE
"apples" %in% c("oranges", "bananas")
[1] FALSE

But %in% is even more powerful because the value on the left does not have to be a single value: it can also be a vector, in which case each of its elements is checked in turn. We say %in% is vectorised because it can work with vectors (most functions and operators in R are vectorised).

# TRUE, TRUE
c(1, 5) %in% c(4, 1, 7, 5, 8)
[1] TRUE TRUE

stocked <- c("durian", "bananas", "grapes")
needed <- c("durian", "apples")

# TRUE, FALSE
needed %in% stocked
[1]  TRUE FALSE

Try to understand what is going on in the code above before moving on.

2.4 Now filter the data

Now we can filter glot_status to include only the macro-areas of the Global South and only languages that are either “threatened”, “shifting”, “moribund” or “nearly_extinct”. I have started the code for you; you just need to write the line for filtering status.

global_south <- glot_status |>
  filter(
    Macroarea %in% c("Africa", "Australia", "Papunesia", "South America"),
    ...
  )

This should not look too alien! The first statement, Macroarea %in% c("Africa", "Australia", "Papunesia", "South America") looks at the Macroarea column and, for each row, it returns TRUE if the current row value is in c("Africa", "Australia", "Papunesia", "South America"), and FALSE if not.
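The row-by-row check described above can be sketched on a plain vector. The values below are made up and simply stand in for a few rows of the Macroarea column.

```r
# Made-up macro-area values standing in for the Macroarea column
macroarea <- c("Africa", "Eurasia", "Papunesia", "North America")

# One TRUE or FALSE per element, just as the statement is evaluated once per row
keep <- macroarea %in% c("Africa", "Australia", "Papunesia", "South America")
keep

# filter() keeps the rows where the result is TRUE;
# base R indexing does the same for a plain vector
macroarea[keep]
```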