Data summaries
University of Edinburgh
We can summarise variables using summary measures.
There are two types of summary measures.
Measures of central tendency
Measures of dispersion
Always report a measure of central tendency together with its measure of dispersion!
Mean
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + ... + x_n}{n}\]
Median
\[\text{if } n \text{ is odd, } x_{\frac{n+1}{2}}\]
\[\text{if } n \text{ is even, } \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}\]
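A minimal sketch of the median rule in Python, assuming the values are first sorted in ascending order. The function name `median_by_formula` is hypothetical; the two samples are the sorted values that appear in the outputs of these slides.

```python
import statistics

def median_by_formula(values):
    """Median of a sample, following the odd/even rule on the sorted values."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # odd n: the ((n + 1) / 2)-th value; subtract 1 for 0-based indexing
        return xs[(n + 1) // 2 - 1]
    # even n: average of the (n / 2)-th and (n / 2 + 1)-th values
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

odd_sample = [-2.10, -1.12, 0.09, 0.41, 0.95]
even_sample = [3, 4, 6, 7, 9, 15]

print(median_by_formula(odd_sample))   # 0.09
print(median_by_formula(even_sample))  # 6.5

# The standard library gives the same answers
print(statistics.median(odd_sample), statistics.median(even_sample))
```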
Mode
The most common value.
Minimum and maximum values
Range
\[\max(x) - \min(x)\]
The difference between the largest and smallest value.
Standard deviation
\[\text{SD} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{(x_1 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n-1}}\]
Use the mean with numeric continuous variables, if:
[1] -0.354
[1] 1.122
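The two means above can be reproduced with a short Python sketch. It assumes the samples are the two five-value samples whose sorted values appear in the median outputs of these slides; the names `sample_a`, `sample_b` and `mean_by_formula` are hypothetical.

```python
import statistics

# Assumed samples: the sorted values shown in the median outputs of the slides
sample_a = [-2.10, -1.12, 0.09, 0.41, 0.95]
sample_b = [0.12, 0.32, 1.09, 1.50, 2.58]

def mean_by_formula(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

print(round(mean_by_formula(sample_a), 3))  # -0.354
print(round(mean_by_formula(sample_b), 3))  # 1.122

# statistics.mean() from the standard library agrees
print(round(statistics.mean(sample_a), 3), round(statistics.mean(sample_b), 3))
```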
Don’t take the mean of proportions and percentages!
Better to calculate the proportion/percentage across the entire data, rather than take the mean of individual proportions/percentages: see this blog post. If you really really have to, use the median.
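A small sketch of why the two approaches can disagree. The counts are hypothetical (not from the slides): one participant answered 1 of 2 trials correctly, another 90 of 100.

```python
# Hypothetical (correct, total) counts per participant; invented for illustration
counts = [(1, 2), (90, 100)]

# Mean of the individual proportions: every participant weighs the same,
# no matter how many trials they contributed
individual = [correct / total for correct, total in counts]
mean_of_proportions = sum(individual) / len(individual)

# Proportion across the entire data: every trial weighs the same
overall_proportion = sum(c for c, _ in counts) / sum(t for _, t in counts)

print(round(mean_of_proportions, 3))  # 0.7
print(round(overall_proportion, 3))   # 0.892
```

The mean of proportions (0.7) lets the participant with only 2 trials pull the summary down as strongly as the one with 100 trials, while the overall proportion (91 of 102 trials) does not.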
Use the median with numeric (continuous and discrete) variables.
[1] 0.09
[1] -2.10 -1.12 0.09 0.41 0.95
[1] 1.09
[1] 0.12 0.32 1.09 1.50 2.58
[1] 6.5
[1] 3 4 6 7 9 15
[1] 6.5
[1] 7.333333
[1] 6.5
[1] 11.5
The mean is very sensitive to outliers.
The median is not.
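A sketch of this sensitivity in Python. The first sample is the one shown above. The second is an assumption: the slides do not show the modified data, so the largest value is replaced here with 40, chosen because it yields the mean of 11.5 shown above.

```python
import statistics

values = [3, 4, 6, 7, 9, 15]
# Assumption: 15 replaced by an outlier of 40 (not shown in the slides)
with_outlier = [3, 4, 6, 7, 9, 40]

print(statistics.median(values))          # 6.5
print(round(statistics.mean(values), 3))  # 7.333

print(statistics.median(with_outlier))    # 6.5
print(statistics.mean(with_outlier))      # 11.5
```

The median depends only on the middle values, so changing the largest value leaves it at 6.5, while the mean moves from about 7.33 to 11.5.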
Use the mode with categorical (discrete) variables.
  blue  green    red yellow
     2      1      3      2
The mode is the most frequent value: red.
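A minimal sketch of finding the mode by counting. The list `colours` is an assumption: it is built to match the frequencies in the table above, and the order of the observations is invented.

```python
from collections import Counter

# Assumed observations matching the table: blue 2, green 1, red 3, yellow 2
colours = ["red", "blue", "yellow", "red", "green", "blue", "red", "yellow"]

counts = Counter(colours)
print(dict(sorted(counts.items())))
# {'blue': 2, 'green': 1, 'red': 3, 'yellow': 2}

# Keep every value that reaches the highest count, in case of ties
highest = max(counts.values())
modes = sorted(value for value, k in counts.items() if k == highest)
print(modes)  # ['red']
```

Returning a list rather than a single value makes ties visible: a variable can have more than one mode.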
Likert scales are ordinal (categorical) variables, so the mean and median are not appropriate!
You should use the mode. (You can use the median with Likert scales if you really, really need to…)
Report minimum and maximum values for any numeric variable.
[1] -2.1
[1] 0.95
[1] -2.10 0.95
Use the range with any numeric variable.
[1] 3.05
[1] 2.46
[1] 12
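The minimum, maximum and range outputs above can be reproduced with the following sketch. It assumes the three samples are those whose sorted values appear in the median outputs of these slides; the function name `value_range` is hypothetical.

```python
# Assumed samples: the sorted values shown earlier in the slides
sample_a = [-2.10, -1.12, 0.09, 0.41, 0.95]
sample_b = [0.12, 0.32, 1.09, 1.50, 2.58]
sample_c = [3, 4, 6, 7, 9, 15]

def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)

print(min(sample_a), max(sample_a))     # -2.1 0.95

# Rounding hides tiny floating-point errors in the subtraction
print(round(value_range(sample_a), 2))  # 3.05
print(round(value_range(sample_b), 2))  # 2.46
print(value_range(sample_c))            # 12
```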
Use the standard deviation with numeric continuous variables, if:
[1] 1.23658
[1] 0.9895555
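A sketch of the standard deviation computed step by step with the n - 1 denominator. It assumes the two outputs above belong to the two five-value samples whose sorted values appear earlier in these slides; the function name `sample_sd` is hypothetical.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sum(values) / n
    # Sum of squared deviations from the mean
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

# Assumed samples: the sorted values shown earlier in the slides
sample_a = [-2.10, -1.12, 0.09, 0.41, 0.95]
sample_b = [0.12, 0.32, 1.09, 1.50, 2.58]

print(round(sample_sd(sample_a), 5))  # 1.23658
print(round(sample_sd(sample_b), 5))  # 0.98956

# statistics.stdev() also uses the n - 1 denominator
print(round(statistics.stdev(sample_a), 5), round(statistics.stdev(sample_b), 5))
```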
Standard deviations are relative and depend on the measurement unit/scale!
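A quick illustration of the dependence on the measurement unit, using hypothetical heights (not from the slides) expressed first in metres and then in centimetres.

```python
import statistics

# Hypothetical heights in metres; invented for illustration
heights_m = [1.60, 1.72, 1.81, 1.68]
# The same heights in centimetres
heights_cm = [h * 100 for h in heights_m]

sd_m = statistics.stdev(heights_m)
sd_cm = statistics.stdev(heights_cm)

print(sd_m, sd_cm)
# The spread of the data is identical, but the SD is 100 times larger in cm
print(round(sd_cm / sd_m, 6))  # 100.0
```

An SD of 8 is therefore small or large only in relation to the unit and scale of the variable it describes.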
Don’t use the standard deviation with proportions and percentages!
The sample \(y\) is generated by a (random) variable \(Y\).
A (statistical) variable is any characteristic, number, or quantity that can be measured or counted.
Variables can be numeric or categorical.
We operationalise a measure/observation as a numeric or a categorical variable.
We summarise variables using summary measures: