Bar charts

Learn how to plot bar charts with ggplot2

Published

July 15, 2024

1 Bar charts

Bar charts

Bar charts are useful when you are counting things. For example:

  • Number of verbs vs nouns vs adjectives in a corpus.

  • Number of languages by geographic area.

  • Number of correct vs incorrect responses.

The bar chart geometry is geom_bar().

We will first create a plot with counts of the number of languages in global_south by their endangerment status and then a plot where we also split the counts by macro-area.

1.1 Number of languages of the Global South by status

To create a bar chart, you can use the geom_bar() geometry.

Bar chart axes

In a simple bar chart, you only need to specify one axis, the x-axis, in the aesthetics aes().

This is because the counts that are placed on the y-axis are calculated by the geom_bar() function under the hood.

This quirk is something that confuses many new learners, so make sure you internalise this.
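
To make the counting concrete, here is a minimal sketch on a small made-up data frame. The name toy_langs and its values are hypothetical and are not part of global_south. Only x is mapped in aes(). The layer_data() function from ggplot2 then returns the values that geom_bar() computed for the bars, which should match a plain tally of the column.

```r
library(ggplot2)

# Hypothetical toy data (an assumption for illustration): one row per language.
toy_langs <- data.frame(
  status = c("threatened", "threatened", "shifting", "extinct", "threatened")
)

# Only the x-axis is mapped; geom_bar() counts the rows for each status value.
p <- ggplot(toy_langs, aes(x = status)) +
  geom_bar()

# Inspect the counts computed for the bars.
bar_data <- layer_data(p)
bar_data$count

# The same tally done by hand, for comparison.
table(toy_langs$status)
```

The y-axis never appears in aes() here: the bar heights come entirely from the counts computed by geom_bar().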

Go ahead and complete the following code to create a bar chart.

global_south |>
  ggplot(aes(x = status)) +
  ...

Note how we’re using |> to pipe the global_south data frame into the ggplot() function. This works because ggplot()’s first argument is the data, and piping is a different way of providing the first argument to a function.
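
A minimal sketch of what the pipe does, using a plain numeric vector rather than a data frame. The name values is made up for illustration, and the native pipe |> assumes R 4.1 or later.

```r
# Hypothetical example vector (an assumption for illustration).
values <- c(1, 4, 9)

# The pipe passes the left-hand side as the first argument of sqrt() ...
piped <- values |> sqrt()

# ... so it is equivalent to calling sqrt() directly on the vector.
direct <- sqrt(values)

identical(piped, direct)  # TRUE
```

In the same way, piping a data frame into ggplot(aes(...)) supplies that data frame as the data argument.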

As mentioned above, the counting for the y-axis is done automatically. R looks in the status column and counts how many times each value in the column occurs in the data frame.

If you did things correctly, you should get the following plot.

Figure 1: Number of languages by endangerment status.

The x-axis is now status and the y-axis corresponds to the number of languages by status (count). As mentioned above, count is calculated under the hood for you (you will learn how to count levels with count() later in the course).

You could write a description of the plot that goes like this:

The number of languages in the Global South by endangerment status is shown as a bar chart in Figure 1. Among the languages that are endangered, the majority are threatened or shifting.

What if we want to show the number of languages by endangerment status within each of the macro-areas that make up the Global South? Easy! You can make a stacked bar chart. This requires we mutate the data first.

2 Stacked bar charts

A special type of bar chart is the so-called stacked bar chart.

Stacked bar chart

A stacked bar chart is a bar chart in which each bar contains a “stack” of shorter bars, each indicating the count of a sub-group.

This type of plot is useful to show how counts of something vary depending on some other grouping (in other words, when you want to count the occurrences of a categorical variable based on another categorical variable). For example:

  • Number of languages by endangerment status, grouped by geographic area.

  • Number of infants by head-turning preference, grouped by first language.

  • Number of past vs non-past verbs, grouped by verb class.

To create a stacked bar chart, you just need to add a new aesthetic mapping to aes(): fill. The fill aesthetic lets you fill bars or areas with different colours depending on the values of a specified column.
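
As a small sketch of the fill aesthetic, the following uses a made-up data frame based on the verb example above. The name toy_verbs, its columns and its values are hypothetical. Mapping a second categorical column to fill splits each bar into one segment per value of that column.

```r
library(ggplot2)

# Hypothetical toy data (an assumption for illustration): one row per verb.
toy_verbs <- data.frame(
  verb_class = c("strong", "strong", "strong", "weak", "weak"),
  tense      = c("past", "non-past", "past", "past", "non-past")
)

# x sets the bars; fill splits each bar into coloured segments by tense.
p <- ggplot(toy_verbs, aes(x = verb_class, fill = tense)) +
  geom_bar()

# One row per segment: 2 verb classes x 2 tenses.
segments <- layer_data(p)
segments[, c("x", "fill", "count")]
```

The counts of the segments within a bar add up to the height of that bar in the corresponding simple bar chart.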

Let’s make a plot on language endangerment by macro-area.

Complete the following code by specifying that fill should be based on status.

global_south |>
  ggplot(aes(x = Macroarea, ...)) +
  geom_bar()

You should get the following.

Figure 2: Number of languages by macro-area and endangerment status.

A write-up example:

Figure 2 shows the number of languages by geographic macro-area, subdivided by endangerment status. Africa, Eurasia and Papunesia have substantially more languages than the other areas.

Quiz 4

What is wrong in the following code?

global_south |>
  ggplot(aes(x = status), fill = Macroarea) +
  geom_bar()

3 Filled stacked bar charts

In the plot above it is difficult to assess whether different macro-areas have different proportions of endangerment. This is because the overall number of languages per area differs between areas.

A solution to this is to plot proportions instead of raw counts.

You could calculate the proportions yourself, but there is a quicker way: using the position argument in geom_bar().

You can plot proportions instead of counts by setting position = "fill" inside geom_bar(), like so:

global_south |>
  ggplot(aes(x = Macroarea, fill = status)) +
  geom_bar(position = "fill")

Figure 3: Proportion of languages by macro-area and endangerment status.

The plot now shows proportions of languages by endangerment status for each area separately.
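
For comparison, here is a rough sketch of the calculation that position = "fill" spares you, done by hand with base R on a made-up data frame. The name toy_areas and its values are hypothetical. Each count is divided by the total for its area, so the proportions within an area sum to 1.

```r
# Hypothetical toy data (an assumption for illustration): one row per language.
toy_areas <- data.frame(
  area   = c("A", "A", "A", "A", "B", "B"),
  status = c("extinct", "threatened", "threatened", "threatened",
             "extinct", "threatened")
)

# Raw counts of each status within each area.
counts <- table(toy_areas$area, toy_areas$status)

# Divide each row (area) by its total to get within-area proportions.
props <- prop.table(counts, margin = 1)
props
# Area A: extinct 0.25, threatened 0.75
# Area B: extinct 0.50, threatened 0.50
```

Area A has more languages than area B, but once each row is rescaled to sum to 1 the two areas can be compared directly, which is what the filled bar chart shows.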

Note that the y-axis label is still “count” but should be “proportion”. Use labs() to change the axis labels and the legend name.

global_south |>
  ggplot(aes(x = Macroarea, fill = status)) +
  geom_bar(position = "fill") +
  labs(
    ...
  )

To change the name of a colour legend, you use the colour argument in labs(). Can you guess which argument you should use for a fill legend?
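
As an illustration of the colour case only, here is a small sketch on a made-up scatter plot. The name toy_points, its columns and its values are hypothetical. Each argument of labs() is named after the aesthetic whose label it replaces.

```r
library(ggplot2)

# Hypothetical toy data (an assumption for illustration).
toy_points <- data.frame(
  duration = c(80, 95, 110, 120),
  pitch    = c(180, 175, 210, 205),
  speaker  = c("s1", "s1", "s2", "s2")
)

# colour is mapped in aes(), so the colour argument in labs() renames its legend.
p <- ggplot(toy_points, aes(x = duration, y = pitch, colour = speaker)) +
  geom_point() +
  labs(
    x = "Duration (ms)",
    y = "Pitch (Hz)",
    colour = "Speaker"
  )

p
```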

The completed bar chart code should produce the following plot.

Figure 4: Proportion of languages by macro-area and endangerment status.

With this plot it is easier to see that different areas have different proportions of endangerment. In writing:

Figure 4 shows proportions of languages by endangerment status for each macro-area. Australia, South America and North America have a substantially higher proportion of extinct languages than the other areas. These areas also have a higher proportion of near extinct languages. On the other hand, Africa has the greatest proportion of non-endangered languages, followed by Papunesia and Eurasia, while North and South America are among the areas with the lowest proportions, together with Australia, which has the lowest of all.