library(tidyverse)
glot_status <- readRDS("data/coretta2022/glot_status.rds")
Mutate data
1 Mutate
To change existing columns or create new columns, we can use the mutate()
function from the dplyr package.
To learn how to use mutate(), we will re-create the status column (let’s call it Status this time) from the Code_ID column in glot_status.
The Code_ID column contains the status of each language in the form aes-STATUS, where STATUS is one of not_endangered, threatened, shifting, moribund, nearly_extinct and extinct.
[1] "aes-shifting" "aes-extinct" "aes-moribund"
[4] "aes-nearly_extinct" "aes-threatened" "aes-not_endangered"
We want to create a new column called Status which has only the STATUS label (without the aes- part). To remove aes- from the Code_ID column we can use the str_remove() function from the stringr package. Check the documentation (?str_remove) to learn which arguments it uses.
glot_status <- glot_status |>
  mutate(
    Status = str_remove(Code_ID, "aes-")
  )
If you check glot_status now you will find that a new column, Status, has been added. This column is a character column (chr).
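To see the same step in isolation, here is a minimal sketch on a toy tibble. The object name toy and its values are made up for illustration and are not taken from glot_status; only mutate() and str_remove() are the functions discussed above.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical values, not taken from glot_status)
toy <- tibble(
  Name = c("Language A", "Language B", "Language C"),
  Code_ID = c("aes-shifting", "aes-extinct", "aes-not_endangered")
)

# str_remove() removes the first match of the pattern from each string;
# mutate() stores the result in a new column called Status
toy <- toy |>
  mutate(
    Status = str_remove(Code_ID, "aes-")
  )

# Printing the tibble shows Status as a character (chr) column
toy
```

Note that str_remove() only removes the first match in each string, which is enough here because the prefix occurs once per value.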
Let’s reproduce the bar chart from above but with all the data from glot_status, now using the Status column.
glot_status |>
  ggplot(aes(x = Status)) +
  geom_bar()
But something is not quite right… The order of the levels of Status does not match the order that makes sense (from least to most endangered)! Why?
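This is because status (the pre-existing column) is a factor column, rather than a simple character column. What is a factor vector/column? A factor is a vector whose values come from a fixed set of categories, called levels, which are stored in a specific order. When plotting, the values of a character column are arranged alphabetically, while the values of a factor column follow the order of its levels.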
A vector/column can be converted into a factor column with the as.factor() function. In the following code, we change the existing column Status: in other words, we overwrite it (this happens automatically because the Status column already exists, so it is replaced).
glot_status <- glot_status |>
  mutate(
    Status = as.factor(Status)
  )
# read below for an explanation of the dollar sign $ syntax
levels(glot_status$Status)
[1] "extinct" "moribund" "nearly_extinct" "not_endangered"
[5] "shifting" "threatened"
The levels() function returns the levels of a factor column in the order they are stored in the factor: by default the order is alphabetical. But wait, what is that $ in glot_status$Status?
The dollar sign $ is a base R way of extracting a single column (in this case Status) from a data frame (glot_status).
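As a small illustration of $, the sketch below uses a toy data frame (the name toy and its values are hypothetical). It also shows that double square brackets with the column name, another base R syntax, return the same vector.

```r
# Toy data frame (hypothetical values, not taken from glot_status)
toy <- data.frame(
  Name = c("Language A", "Language B"),
  Status = c("shifting", "extinct")
)

# Extract the Status column as a plain vector with $
status_dollar <- toy$Status

# Double square brackets with the column name give the same vector
status_brackets <- toy[["Status"]]

# Check that both ways of extracting the column agree
identical(status_dollar, status_brackets)
```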
What if we want the levels of Status to be ordered in a more logical manner: not_endangered, threatened, shifting, moribund, nearly_extinct and extinct? Easy! We can use the factor() function instead of as.factor() and specify the levels and their order.
glot_status <- glot_status |>
  mutate(
    Status = factor(
      Status,
      levels = c(
        "not_endangered", "threatened", "shifting",
        "moribund", "nearly_extinct", "extinct"
      )
    )
  )
levels(glot_status$Status)
[1] "not_endangered" "threatened" "shifting" "moribund"
[5] "nearly_extinct" "extinct"
You see that now the order of the levels returned by levels() is the one we specified.
Transforming character columns to factor columns is helpful to specify a particular order of the levels, which can then be used when plotting.
glot_status |>
  ggplot(aes(x = Status)) +
  geom_bar()
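The effect of level order can also be checked without plotting. The following is a minimal sketch on a toy character vector (the object names and values are hypothetical): it compares the default alphabetical levels with levels given explicitly, and shows that counts from table() follow the level order in the same way the bars do.

```r
# Toy character vector (hypothetical values, not taken from glot_status)
status_chr <- c("shifting", "extinct", "threatened", "extinct")

# Without the levels argument, the levels are sorted alphabetically
status_default <- factor(status_chr)
levels(status_default)

# With the levels argument, the levels keep the order we give
status_ordered <- factor(
  status_chr,
  levels = c("threatened", "shifting", "extinct")
)
levels(status_ordered)

# Counts are reported in level order, not alphabetical order
table(status_ordered)
```

Any value not listed in levels would become NA, so it is worth checking that the spelling of each level matches the data exactly.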