Regression models: a cheat-sheet
1 One model to rule them all
Regression models (aka linear regression models, linear models, generalised linear models) are a group of statistical models based on the simple idea that we can predict an outcome variable \(Y\) based on a function \(f(X)\).
The “simplest” linear model is the formula of a line:
\[ y = \alpha + \beta x \]
where \(\alpha\) is the intercept of the line and \(\beta\) the slope (aka gradient).
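To make the link with R concrete, here is a minimal sketch with simulated data (all names and numbers are made up for illustration) showing that the intercept and slope estimated by lm() correspond to \(\alpha\) and \(\beta\):

```r
# Simulate data from y = alpha + beta * x, with alpha = 2 and beta = 0.5,
# plus some Gaussian noise, then recover intercept and slope with lm().
set.seed(42)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.1)
coef(lm(y ~ x))  # the estimates should be close to 2 and 0.5
```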
The principles behind this formula can be extended to represent virtually any other model, independent of the nature of the outcome variable(s) (\(y\)), the predictor(s), the type of relationship between outcome and predictor, and so on. This means that if you master the principles of regression models, you can fit virtually any kind of data with them. You can bid farewell to ANOVAs, \(t\)-tests, \(\chi^2\)-tests, and whatnot. In fact, these can all be thought of as specific cases of regression models. It just so happens that they got themselves a specific name. But the underlying mechanics are the same.
The same goes for “linear models”, “logistic regression”, “generalised linear models”, “mixed-effects regression” and so on. These are all regression models, so they all follow the same principles. And again, the fact that they got specific names is a historical “accident”. Understanding that these named models are in fact all regression models gives you superpowers you can use on data (Sauron would be so jealous):
One model to rule them all, one model to fit them,
One model to shrink them all, and in probability bind them;
In the Land of Inference where the distributions lie.
Ehm… perhaps this is not gonna win a poetry contest, but… the message is that with a single tool, i.e. regression models, you can go a long way!
Each of the following sections asks you about the nature of your data and/or experimental design. By answering each, you will find out which “pieces” you need to add to your model structure.
2 Step 0: Number of outcome variables
We will get back to this step at the end of this post, since it makes things a bit more complex.
3 Step 1: Choose a distribution for your outcome variable
The first step towards building a regression model is to choose the family of distributions you believe the outcome variable belongs to. You can start by answering the following question.
Depending on the answer, check out Section 3.1 or Section 3.2.
3.1 Continuous outcome variable
The variable can take on any positive and negative real number, including 0: Gaussian (aka normal) distribution.
There are very few truly Gaussian variables, although in some cases one can speak of “approximate” or “assumed” normality.
This family is fitted by default in lm(), lme4::lmer() and brms::brm(). You can explicitly select the family with family = gaussian.
The variable can take on any positive real number only: Log-normal distribution.
Duration of segments, words, pauses, etc, are known to be log-normally distributed.
Reaction times can be modelled with a log-normal distribution.
Measurements taken in Hz (like f0, formants, centre of gravity, …) could be considered to be log-normal.
There are other families that could potentially be used depending on the nature of the variable: exponential-Gaussian (reaction times), gamma, …
Fit a log-normal model with brms::brm(..., family = lognormal).
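As a sketch of what this looks like in practice (the data frame and column names here are made up for illustration), a log-normal model of segment durations could be fitted like this:

```r
library(brms)

# `durations` is a hypothetical data frame with columns duration (in ms, positive
# only) and speech_rate; the log-normal family keeps predictions strictly positive.
fit_dur <- brm(
  duration ~ speech_rate,
  family = lognormal,
  data = durations
)
```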
The variable can take on any real number between 0 and 1, but not 0 nor 1: Beta distribution.
- Proportions fall into this category (for example proportion of voicing within closure), although 0 and 1 are not allowed in the beta distribution.
- Fit a beta model with brms::brm(..., family = Beta).
- Check this tutorial: Coretta and Bürkner (n.d.).
The variable can take on any real number between 0 and 1, including 0, or including both 0 and 1: Zero-inflated or Zero/one-inflated beta (ZOIB) distribution.
- If the proportion data includes many 0s and 1s, then this is the ideal distribution to use. ZOIB distributions are somewhat more difficult to fit than a simple beta distribution, so a common practice is to transform the data so that it doesn’t include 0s or 1s (this can be achieved using different techniques, some better than others; one common option is sketched below).
- Fit a ZOIB model with brms::brm(..., family = zero_one_inflated_beta).
- Check this tutorial: Heiss (2021).
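For reference, here is a sketch of one common transformation used to “squeeze” proportions away from exact 0s and 1s so that a plain beta model can be fitted (the vector name is made up; the formula is the one proposed by Smithson and Verkuilen 2006):

```r
# Squeeze a vector of proportions y into the open interval (0, 1):
# exact 0s and 1s are moved slightly inwards, everything else barely changes.
squeeze <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}

# Example with a hypothetical vector that contains 0s and 1s.
voi_prop <- c(0, 0.2, 0.85, 1)
squeeze(voi_prop)
```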
3.2 Discrete outcome variable
The variable is dichotomous, i.e. it can take one of two levels: Bernoulli distribution.
Categorical outcome variables like yes/no, correct/incorrect, voiced/voiceless, follow this distribution.
This family is fitted by default when you run glm(..., family = binomial), aka “logistic regression” or “binomial regression”, or with brms::brm(..., family = bernoulli).
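A minimal sketch of such a model (the data frame and column names are made up for illustration):

```r
# `responses` is a hypothetical data frame with a binary column correct (0/1 or a
# two-level factor) and a predictor condition.
fit_log <- glm(correct ~ condition, family = binomial, data = responses)
summary(fit_log)

# The equivalent Bayesian model with brms:
# brms::brm(correct ~ condition, family = bernoulli, data = responses)
```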
The variable is counts: Poisson distribution.
- Counts of words, segments, gestures, f0 peaks, …
- Check out this tutorial: Winter and Bürkner (2021).
- Fit a Poisson model with brms::brm(..., family = poisson).
- Sometimes a negative binomial distribution is preferable, if the count data is over-dispersed. Fit this model with brms::brm(..., family = negbinomial).
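As a sketch (the data frame and column names are made up), a Poisson model of pause counts could look like this:

```r
library(brms)

# `narratives` is a hypothetical data frame with a count column n_pauses and a
# predictor group.
fit_pois <- brm(n_pauses ~ group, family = poisson, data = narratives)

# If the counts are over-dispersed, swap the family:
# brm(n_pauses ~ group, family = negbinomial, data = narratives)
```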
The variable is an ordinal scale (for example, Likert ratings): ordinal regression model.
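As a sketch (the data frame and column names are made up), one common option for ordinal outcomes in brms is the cumulative family:

```r
library(brms)

# `ratings` is a hypothetical data frame with an ordered factor rating
# (e.g. a 1-5 Likert scale) and a predictor condition.
fit_ord <- brm(rating ~ condition, family = cumulative, data = ratings)
```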
The variable has more than two levels, but it is not ordered: categorical (multinomial) model.
Fit categorical (multinomial) models with brms::brm(..., family = categorical). As far as I know, there isn’t a tutorial for this family, but Lorson, Cummins, and Rohde (2021) uses categorical models. The research compendium (with data and code) of the paper can be found here: https://osf.io/e5av9/.
Just a quick note: if you have an outcome variable with 3 levels (A, B and C) and you fit a categorical (multinomial) model, then \(P(A) = \frac{e^0}{e^0 + e^{B} + e^{C}}\), where the superscripts \(B\) and \(C\) are the estimated log-odds of the B and C levels relative to the A level (these are the two intercepts in the model). This makes sense because, if each level is equally probable, then \(B = C = 0\) and \(\frac{e^{0}}{e^{0} + e^{0} + e^{0}} = 0.333\) for all levels.
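Here is a quick numeric sketch of that calculation (the intercept values are made up):

```r
# Two hypothetical intercepts: the log-odds of B and of C relative to the
# reference level A.
b_B <- 0.5
b_C <- -1.2

denom <- exp(0) + exp(b_B) + exp(b_C)
c(A = exp(0) / denom, B = exp(b_B) / denom, C = exp(b_C) / denom)
# With b_B = b_C = 0 all three probabilities are 1/3, as in the text.
```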
4 Step 2: Are there hierarchical groupings and/or repeated measures?
The second step is to ensure that, if the data is structured hierarchically or repeated measures were taken, this is taken into account in the model. Here is where so-called varying terms (aka random effects, group-level effects/terms) come in (Gelman 2005). Models that include random effects/group-level terms are called: random-effects models, mixed-effects models, hierarchical models, nested models, multilevel models. These terms are for all intents and purposes equivalent (it just happens that different traditions use different terms).
As an example, let’s assume you asked a number of participants to read a list of words and each word was repeated 5 times by each participant. You then took f0 measurements from the stressed vowel of each word, of each repetition. Now, the data has a “hierarchical” structure to it:
- First, observations are grouped by participant (some observations belong to one participant and others to another and so on).
- Second, observations are grouped by word (some observations belong to one word and others to another and so on).
- Third, within the observations of each word, some belong to the same participant (or, from a different perspective, within the observations of each participant, some belong to the same word).
The presence of “levels” within the data (whether they come from natural groupings like participant or word, or from repeated measures) breaks one of the assumptions of regression models: that each observation is independent. This is why you must include varying terms in the regression model, to account for this structure (and now you see why they are called hierarchical and multilevel models). If you don’t include any varying terms, your model will treat each observation as independent and hence it will underestimate the uncertainty of the estimates and return unreliable results.
In the toy example of f0 measurements, you will want to include varying terms for participant and word. These will let the model know about the structure of the data described above. If you have other predictors in the model, you should also add them as (varying) slopes in the varying terms. For example: (question | participant) + (question | word) (where question = statement vs question).
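Putting the toy example together, a sketch of the full model could look like this (the data frame and column names are made up; the log-normal family follows the suggestion for f0 from Step 1):

```r
library(brms)

# `f0_data` is a hypothetical data frame with columns f0 (in Hz), question
# (statement vs question), participant and word.
fit_f0 <- brm(
  f0 ~ question +
    (question | participant) +  # by-participant varying intercepts and slopes
    (question | word),          # by-word varying intercepts and slopes
  family = lognormal,
  data = f0_data
)
```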
Here are some tutorials: Winter (2013), DeBruine and Barr (2021), Kirby and Sonderegger (2018), Bürkner (2018), Veenman, Stefan, and Haaf (2023), Pedersen et al. (2019).
5 Step 3: Are there non-linear effects?
A typical use-case of non-linear terms is when you are dealing with time-series data or spatial data (i.e. geographic coordinates). Generalised Additive Models (GAMs) allow you to fit non-linear effects using so-called “smooth” (or “smoother”) terms. You can fit a regression model with smooth terms with brms::brm(y ~ s(x)) or with mgcv::gam(y ~ s(x)), among others. See Simpson (2018), Sóskuthy (2017), Sóskuthy (2021), Wieling (2018), Pedersen et al. (2019).
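As a sketch (the data frame and column names are made up), a GAM for an f0 trajectory over time could be fitted like this:

```r
library(mgcv)

# `trajectories` is a hypothetical data frame with columns f0 and time.
fit_gam <- gam(f0 ~ s(time), data = trajectories, method = "REML")
summary(fit_gam)

# The brms equivalent with a smooth term:
# brms::brm(f0 ~ s(time), data = trajectories)
```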
6 Step 0-bis: Number of outcome variables
If you want to model just one outcome variable, you are already covered if you went through steps 1-3. If instead your design has two or more outcome variables (for example F1 and F2, or duration of the stressed and unstressed vowel of a word) which you want to model together, then you want to fit a multivariate model (i.e. a model with multiple outcome variables). The same steps we went through before can be applied to multiple outcome variables. In some cases, you will want to use the same model structure for all the outcome variables, while in others you might want to use a different model structure for each.
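As a sketch of what a multivariate model looks like in brms (the data frame and column names are made up), F1 and F2 can be modelled together with mvbind():

```r
library(brms)

# `vowels` is a hypothetical data frame with columns F1, F2, vowel and speaker.
fit_mv <- brm(
  mvbind(F1, F2) ~ vowel + (1 | speaker),
  data = vowels
)
```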
To learn more about multivariate models, I really recommend Bürkner (2024).