Statistical summaries

Learn about descriptive summary measures
Author

Stefano Coretta

Published

September 13, 2023

Pre-requisites

1 Summary measures: overview

We can summarise variables using summary measures. There are two types of summary measures.

  • Measures of central tendency indicate the typical or central value of a sample.

  • Measures of dispersion indicate the spread or dispersion of the sample values around the central tendency value.

Always report a measure of central tendency together with its measure of dispersion!

Measures of central tendency

Mean

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + ... + x_n}{n}\]

Median

\[\text{if } n \text{ is odd, } x_\frac{n+1}{2}\]

\[\text{if } n \text{ is even, } \frac{x_\frac{n}{2} + x_{\frac{n}{2}+1}}{2}\]

Mode

The most common value.

Measures of dispersion

Minimum and maximum values

Range

\[\max(x) - \min(x)\]

The difference between the largest and smallest value.

Standard deviation

\[\text{SD} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{(x_1 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n-1}}\]
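
As a quick check, the formulas above can be computed step by step and compared with R's built-in functions. This is only an illustrative sketch: the vector `x` and the `*_by_hand` names are made up for the example.

```r
# Made-up sample (an assumption, for illustration only)
x <- c(4, 6, 3, 9, 7, 15)
n <- length(x)

# Mean: sum of the values divided by the number of values
mean_by_hand <- sum(x) / n

# Median: n is even here, so average the two central sorted values
x_sorted <- sort(x)
median_by_hand <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2

# Range: largest value minus smallest value
range_by_hand <- max(x) - min(x)

# Standard deviation: square root of the summed squared deviations
# from the mean, divided by n - 1
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

# Each comparison should return TRUE
all.equal(mean_by_hand, mean(x))
all.equal(median_by_hand, median(x))
all.equal(sd_by_hand, sd(x))
```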

2 Mean

Use the mean with numeric continuous variables, if:

  • The variable can take on any positive and negative number, including 0.
mean(c(-1.12, 0.95, 0.41, -2.1, 0.09))
[1] -0.354
  • The variable can take on any positive number only.
mean(c(0.32, 2.58, 1.5, 0.12, 1.09))
[1] 1.122
Warning

Don’t take the mean of proportions and percentages!

Better to calculate the proportion/percentage across the entire data, rather than take the mean of individual proportions/percentages: see this blog post. If you really really have to, use the median.
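
The difference between the two approaches can be seen in a small sketch. The counts below are invented for illustration: two hypothetical participants, one with 2 trials and one with 100.

```r
# Hypothetical counts of correct responses and total trials per participant
correct <- c(1, 90)
trials <- c(2, 100)

# Mean of the individual proportions: each participant weighs the same,
# even though one of them contributed far fewer trials
mean(correct / trials)

# Proportion across the entire data: each trial weighs the same
sum(correct) / sum(trials)
```

The first value is 0.7, while the second is about 0.89, so the mean of the individual proportions gives too much weight to the participant with only 2 trials.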

3 Median

Use the median with numeric (continuous and discrete) variables.

# odd N
median(c(-1.12, 0.95, 0.41, -2.1, 0.09))
[1] 0.09
# even N
even <- c(4, 6, 3, 9, 7, 15)
median(even)
[1] 6.5
# the median is the mean of the two "central" numbers
sort(even)
[1]  3  4  6  7  9 15
mean(c(6, 7))
[1] 6.5
Warning
  • The mean is very sensitive to outliers.

  • The median is not.

The following list of numbers does not have obvious outliers. The mean and median are not too different.

# no outliers
median(c(4, 6, 3, 9, 7, 15))
[1] 6.5
mean(c(4, 6, 3, 9, 7, 15))
[1] 7.333333

In the following case, there is quite a clear outlier, 40. Look how the mean is higher than the median. This is because the outlier 40 pulls the mean towards it.

# one outlier
median(c(4, 6, 3, 9, 7, 40))
[1] 6.5
mean(c(4, 6, 3, 9, 7, 40))
[1] 11.5

4 Mode

Use the mode with categorical (discrete) variables. Unfortunately, the mode() function in R does not compute the statistical mode: it returns the R object type instead.

You can use the table() function to “table” out the number of occurrences of elements in a vector.

table(c("red", "red", "blue", "yellow", "blue", "green", "red", "yellow"))

  blue  green    red yellow 
     2      1      3      2 

The mode is the most frequent value: here it is red, with 3 occurrences.
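
If you prefer not to read the mode off the table by eye, one possible way to extract it is sketched below. The names `colours`, `counts` and `stat_mode` are made up for this example; base R has no built-in function for the statistical mode.

```r
colours <- c("red", "red", "blue", "yellow", "blue", "green", "red", "yellow")
counts <- table(colours)

# Keep the name(s) of the value(s) with the highest count;
# if there is a tie, more than one value is returned
stat_mode <- names(counts)[counts == max(counts)]
stat_mode
```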

Warning

Likert scales are ordinal (categorical) variables, so the mean and median are not appropriate!

You should use the mode. (You can use the median with Likert scales if you really really need to…)

5 Minimum and maximum

You can report minimum and maximum values for any numeric variable.

x_1 <- c(-1.12, 0.95, 0.41, -2.1, 0.09)

min(x_1)
[1] -2.1
max(x_1)
[1] 0.95
range(x_1)
[1] -2.10  0.95

Note that the range() function does not return the statistical range (see next section), but simply returns a vector with both the minimum and the maximum.

6 Range

Use the range with any numeric variable.

x_1 <- c(-1.12, 0.95, 0.41, -2.1, 0.09)
max(x_1) - min(x_1)
[1] 3.05
x_2 <- c(0.32, 2.58, 1.5, 0.12, 1.09)
max(x_2) - min(x_2)
[1] 2.46
x_3 <- c(4, 6, 3, 9, 7, 15)
max(x_3) - min(x_3)
[1] 12
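
Since range() returns the minimum and the maximum, the statistical range can also be obtained by passing its result to diff(). A small sketch, reusing the values of x_1:

```r
x_1 <- c(-1.12, 0.95, 0.41, -2.1, 0.09)

# range() returns c(min, max); diff() subtracts the first from the second
diff(range(x_1))

# Same result as the explicit subtraction
max(x_1) - min(x_1)
```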

7 Standard deviation

Use the standard deviation with numeric continuous variables, if:

  • The variable can take on any positive and negative number, including 0.
sd(c(-1.12, 0.95, 0.41, -2.1, 0.09))
[1] 1.23658
  • The variable can take on any positive number only.
sd(c(0.32, 2.58, 1.5, 0.12, 1.09))
[1] 0.9895555
Warning

Standard deviations are relative and depend on the measurement unit/scale!

Don’t use the standard deviation with proportions and percentages!
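
To see how the standard deviation depends on the measurement unit, here is a minimal sketch with made-up reaction times, expressed first in milliseconds and then in seconds. The values and the names `rt_ms` and `rt_s` are assumptions for illustration only.

```r
# Hypothetical reaction times in milliseconds
rt_ms <- c(650, 820, 910, 1040, 1230)

# The same values expressed in seconds
rt_s <- rt_ms / 1000

# The data are the same, but the two SDs are on different scales
sd(rt_ms)
sd(rt_s)

# They differ only by the conversion factor (should return TRUE)
all.equal(sd(rt_ms), sd(rt_s) * 1000)
```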

8 Summarise data in R

When you work with data, you always want to get summary measures for most of the variables in the data.

Data reports usually include summary measures. It is also important to understand which summary measure is appropriate for which type of variable.

We have covered this in the lecture, so we won’t go over it again here. Instead, you will learn how to obtain summary measures using the summarise() function from the dplyr tidyverse package.

summarise() takes at least two arguments:

  • The data frame to summarise.

  • One or more summary functions.

For example, let’s get the mean of the reaction time column RT. Easy! (First attach the tidyverse and read the song2020/shallow.csv file into a variable called shallow.)

summarise(shallow, RT_mean = mean(RT))

Great! The mean reaction time of the entire sample is 867.3592 ms.

You can round numbers with the round() function. For example:

num <- 867.3592
round(num)
[1] 867
round(num, 1)
[1] 867.4
round(num, 2)
[1] 867.36

The second argument sets the number of decimals to round to (by default, it is 0, so the number is rounded to the nearest integer, that is, to the nearest whole number with no decimal values).

Let’s recalculate the mean by rounding it this time.

summarise(shallow, RT_mean = round(mean(RT)))

What if we also want the standard deviation? Easy: we use the sd() function. (Round the mean and SD with the round() function in your code.)

# exercise: round the mean and SD with round()
summarise(shallow, RT_mean = mean(RT), RT_sd = sd(RT))

Now we know that reaction times are on average 867 ms long and have a standard deviation of about 293 ms (rounded to the nearest integer).

Let’s go all the way and also get the minimum and maximum RT values with the min() and max() functions (round all the summary measures).

# exercise: replace each ... with the appropriate function call
summarise(
  shallow,
  RT_mean = mean(RT), RT_sd = sd(RT),
  RT_min = ..., RT_max = ...
)

Fab! When writing a data report, you could write something like this.

Reaction times are on average 867 ms long (SD = 293 ms), with values ranging from 0 to 1994 ms.

We won’t go into the details of what standard deviations are, but you can just think of them as a relative measure of how dispersed the data are around the mean: the higher the SD, the greater the dispersion around the mean, i.e. the greater the variability in the data.

When required, you can use the median() function to calculate the median, instead of the mean(). Go ahead and calculate the median reaction times in the data. Is it similar to the mean?

8.1 NAs

Most base R functions behave unexpectedly if the vector they are used on contains NA values.

NA is a special object in R that indicates that a value is Not Available, meaning that the observation does not have a value.

For example, in the following numeric vector, there are 5 elements:

a <- c(3, 5, 3, NA, 4)

Four are numbers and one is NA.
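
You can check this yourself with the base R function is.na(), which returns TRUE for each missing value. A small sketch on the same vector:

```r
a <- c(3, 5, 3, NA, 4)

# TRUE where the value is missing
is.na(a)

# Number of missing values (each TRUE counts as 1)
sum(is.na(a))

# length() counts all elements, including the NA
length(a)
```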

If you calculate the mean of a with mean() something strange happens.

mean(a)
[1] NA

The function returns NA.

This is because, by default, when even one value in the vector is NA, operations on the vector return NA.

mean(a)
[1] NA
sum(a)
[1] NA
sd(a)
[1] NA

If you want to discard the NA values when operating on a vector that contains them, you have to set the na.rm (for “NA remove”) argument to TRUE.

mean(a, na.rm = TRUE)
[1] 3.75
sum(a, na.rm = TRUE)
[1] 15
sd(a, na.rm = TRUE)
[1] 0.9574271
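
The same argument works inside summarise(). The following is a minimal sketch, assuming the dplyr package is installed; the data frame `toy` and its values are made up for illustration and are not the shallow data.

```r
library(dplyr)

# Made-up data frame with one missing reaction time
toy <- data.frame(RT = c(700, 850, NA, 920))

toy_summary <- summarise(
  toy,
  # Without na.rm = TRUE these two would be NA
  RT_mean = mean(RT, na.rm = TRUE),
  RT_sd = sd(RT, na.rm = TRUE),
  # Report how many values were discarded
  n_missing = sum(is.na(RT))
)
toy_summary
```

Counting the missing values alongside the summary measures makes it clear how many observations were left out.
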
Quiz 1
  1. What does the na.rm argument of mean() do?
  2. What is the mean of c(4, 23, NA, 5) when na.rm has the default value?

Check the documentation of ?mean.

Summary table of summary measures