Introduction to regression models (Part III): include categorical predictors

Learn how to include categorical predictors in your regression models
Author

Stefano Coretta

Published

July 28, 2024

1 Categorical predictors

Strictly speaking, regression models only support numeric predictors. Imagine trying to calculate \(y = 3 + 2x\) where \(x\) is height treated as a categorical variable, with levels “high” and “low”. That doesn’t make any sense, right?

A categorical variable doesn’t work as is: it needs to be transformed into numbers. In the regression world, this transformation is called coding of the categorical variable.

When fitting a regression model, R does this transformation for you automatically under the hood. You don’t have to do it manually, but learning exactly what the transformation entails is necessary to understand how to interpret the summary of the model.

There are a few ways of coding categorical predictors as numbers. These ways are called contrasts in R. The default coding system is the treatment contrasts system.

Let’s see what treatment contrasts are.

Note

As with many other things in statistics, the naming of coding systems is not standardised: the same coding system can go by different names and, vice versa, the same name can refer to different coding systems.

For an excellent overview, see https://debruine.github.io/faux/articles/contrasts.html.

2 Treatment coding

Treatment coding (“treatment contrasts” in R) uses 0 and 1 to code categorical predictors.

For example, the variable attitude from the polite data (winter2012/polite.csv) has two levels: inf and pol.
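
You can peek at this 0/1 coding with contr.treatment(), the base R function behind the default treatment contrasts. A minimal sketch, using the two level names of attitude:

# Treatment contrast matrix for a two-level predictor:
# the first level is coded 0, the second level 1.
contr.treatment(c("inf", "pol"))

This returns a one-column matrix in which inf is coded as 0 and pol as 1, which is exactly the coding described below.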

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
polite <- read_csv("data/winter2012/polite.csv")
Rows: 224 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): subject, gender, birthplace, musicstudent, task, attitude
dbl (21): months_ger, scenario, total_duration, articulation_rate, f0mn, f0s...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
polite
unique(polite$attitude)
[1] "inf" "pol"

We can code the levels of attitude using 0 for inf and 1 for pol. Let’s call this new coded variable is_polite: [is_polite = 0] corresponds to [attitude = informal] and [is_polite = 1] to [attitude = polite].

attitude     is_polite
inf(ormal)   0
pol(ite)     1

With treatment coding, the first level in the predictor is called the reference level. Based on how we have coded attitude, the reference level is informal.

By default, the order of levels in a categorical predictor is based on alphabetical order. But you can specify the order manually using the factor() function (you will see how below).

polite %>%
  mutate(
    # Manually code attitude as 0/1: "inf" becomes 0, "pol" becomes 1.
    is_polite = ifelse(attitude == "inf", 0, 1)
  ) %>%
  select(HNRmn, attitude, is_polite)
Warning

Remember that R does this automatically under the hood for you, so you never have to do it by hand!
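
As for the order of the levels mentioned above, here is a minimal sketch of how you could use factor() to make pol the reference level instead (the result is not saved back into polite, so the rest of this tutorial keeps inf as the reference level):

polite %>%
  # Make attitude a factor and list "pol" first so that it becomes the reference level.
  mutate(attitude = factor(attitude, levels = c("pol", "inf"))) %>%
  # Inspect the contrasts R would use for this factor in a model.
  pull(attitude) %>%
  contrasts()

With this ordering, pol would be coded as 0 and inf as 1.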

We will use attitude in the model.

2.1 Modelling Harmonics-to-Noise Ratio (HNR) by attitude

Harmonics-to-Noise Ratio (HNR) is an acoustic measure of the ratio between the periodic and the noise components of a sound. Higher HNR indicates more modal voicing, while lower HNR indicates more breathy or creaky voicing. Let’s assume that the HNR values come from a Gaussian distribution.1 In notation:

\[ \text{HNRmn} \sim Gaussian(\mu, \sigma) \]

Now, we want to estimate \(\mu\) while taking into consideration whether the attitude is “informal” or “polite”. In other words, we want to model HNR as a function of attitude, in R code HNRmn ~ attitude.

attitude is a categorical variable (a chr character column in polite) that will be coded automatically in the model using the default treatment coding system as we have seen above (this is done under the hood by R for you!): [attitude = informal] is 0 and [attitude = polite] is 1.

Now we allow \(\mu\) to vary depending on attitude. In the formula, we include the coded variable is_polite rather than the original variable.

\[ \mu = \beta_0 + \beta_1 \cdot \text{is\_polite} \]

Let’s unpack the equation. There’s a bit of algebra in what follows, so take your time if this is a bit out of your comfort zone. But it’s worth it—familiarity with basic algebra will also help you become more comfortable working with linear models.

  • \(\beta_0\) is the mean HNR \(\mu\) when [attitude = informal]. That’s because the coded variable \(\text{is\_polite}\) takes on the value 0 when [attitude = informal]. Multiplying \(\beta_1\) by 0 means that \(\beta_1\) vanishes from the equation, and all that’s left is \(\beta_0\).

\[ \begin{aligned} \mu &= \beta_0 + \beta_1 \cdot \text{is\_polite} \\ \mu_{attitude = inf} &= \beta_0 + \beta_1 \cdot 0 \\ \mu_{attitude = inf} &= \beta_0 \\ \end{aligned} \]

  • \(\beta_1\) is the difference between the mean HNR \(\mu\) when [attitude = polite] and the mean HNR when [attitude = informal]. That’s because \(\text{is\_polite}\) takes on the value 1 when [attitude = polite], and multiplying \(\beta_1\) by 1 leaves it unchanged in the equation.

\[ \begin{aligned} \mu &= \beta_0 + \beta_1 \cdot \text{is\_polite}\\ \mu_{attitude = pol} &= \beta_0 + \beta_1 \cdot 1 \\ \mu_{attitude = pol} &= \beta_0 + \beta_1 \\ \end{aligned} \]

How do we know that \(\beta_1\) represents the difference between the mean HNR when [attitude = polite] and the mean HNR when [attitude = informal]? Remember that \(\mu_{attitude = inf} = \beta_0\), as we said in the first point above. We can substitute this into the equation from the last point, and then isolate \(\beta_1\) step by step as follows:

\[ \begin{aligned} \beta_0 & = \mu_{attitude=inf} \\ \mu_{attitude=pol} &= \beta_0 + \beta_1 \\ \mu_{attitude=pol} &= \mu_{attitude=inf} + \beta_1 \\ \mu_{attitude=pol} - \mu_{attitude=inf} &= \mu_{attitude=inf} + \beta_1 - \mu_{attitude=inf} \\ \mu_{attitude=pol} - \mu_{attitude=inf} &= \beta_1 \end{aligned} \]

So, \(\beta_1\) really is the difference between the mean HNR when attitude is polite and the mean HNR when it is informal.
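
You can sanity-check this against the raw data (a quick sketch; the estimates of the Bayesian model below will be close to, though not identical to, these sample means):

polite %>%
  # Mean HNR by attitude: the "inf" mean approximates beta_0,
  # and the "pol" minus "inf" difference approximates beta_1.
  group_by(attitude) %>%
  summarise(mean_hnr = mean(HNRmn, na.rm = TRUE))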

Important

This is always a point of confusion for beginners.

While the intercept (\(\beta_0\)) corresponds to the mean of the outcome variable when the categorical predictor is at its reference level, \(\beta_1\) corresponds to the difference in the mean outcome between the other level of the categorical predictor and its reference level.

Here’s the equation of the model we will be fitting.

\[ \begin{align} \text{HNRmn} & \sim Gaussian(\mu, \sigma) \\ \mu & = \beta_0 + \beta_1 \cdot \text{is\_polite} \\ \end{align} \]

Quiz 1
Since informal speech in the data seems to have on average lower HNR than polite speech, will \(\beta_1\) be positive or negative?

Be careful: \(\beta_1\) is the difference in HNRmn between polite and informal attitude (polite - informal), rather than the difference between informal and polite attitude (informal - polite). This is because inf (informal) is the reference level of attitude.
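
If you want to check this pattern in the raw data yourself, a quick density plot will do (a sketch using ggplot2, loaded above with the tidyverse):

polite %>%
  # Compare the distribution of mean HNR in informal vs polite speech.
  ggplot(aes(HNRmn, fill = attitude)) +
  geom_density(alpha = 0.5) +
  labs(x = "Mean HNR (dB)", fill = "Attitude")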

2.2 Run the model

Now that we know what the model will do, we can run it.

library(brms)
Loading required package: Rcpp
Loading 'brms' package (version 2.21.0). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').

Attaching package: 'brms'
The following object is masked from 'package:stats':

    ar
hnr_bm <- brm(
  # Model mean HNR as a function of attitude.
  HNRmn ~ attitude,
  # Use a Gaussian distribution as the outcome distribution family.
  family = gaussian(),
  data = polite,
  # Run the MCMC chains in parallel on 4 cores.
  cores = 4,
  # Save the fitted model to file so it is not refitted every time.
  file = "cache/hnr_bm",
  # Set the seed for reproducibility.
  seed = 2913
)

When reporting the model specification, you can write the following:

We fitted a Bayesian model with mean HNR as the outcome variable and attitude (informal vs polite) as the only predictor. A Gaussian family was used as the outcome distribution family. The attitude predictor was coded with the default treatment contrasts (with “informal” as the reference level).

2.3 Interpret the model output

Let’s inspect the model summary.

summary(hnr_bm)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: HNRmn ~ attitude 
   Data: polite (Number of observations: 224) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      16.28      0.15    15.97    16.58 1.00     4239     2929
attitudepol     1.25      0.22     0.80     1.69 1.00     3854     2983

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.69      0.08     1.54     1.85 1.00     3898     2885

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Look at the Regression Coefficients table.

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      16.28      0.15    15.97    16.58 1.00     4239     2929
attitudepol     1.25      0.22     0.80     1.69 1.00     3854     2983

We get two coefficients:

  • Intercept: this is our \(\beta_0\).

  • attitudepol: this is our \(\beta_1\).

Each coefficient has an Estimate and an Est.Error (estimate error). As we have seen in previous tutorials, these are the mean and SD of the posterior probability distribution of the coefficients.

This means that:

  • The probability distribution of the mean HNR when [attitude = informal] has mean = 16.28 and SD = 0.15.

  • The probability distribution of the difference between the mean HNR of [attitude = polite] and that of [attitude = informal] has mean = 1.25 and SD = 0.22.
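
If you prefer to extract these numbers from the model object rather than read them off the summary, fixef() from brms returns the population-level coefficients with their posterior means, errors and 95% CrIs:

# Population-level ("fixed") coefficients of the model.
fixef(hnr_bm)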

Quiz 2

What is the mean HNR of polite speech?

Just use the formula of \(\mu\) and replace the \(\beta\) coefficients with the mean of the respective distributions.

Now let’s look at the CrIs (CI in the model summary) of the coefficients. Based on the reported 95% CrIs, we can say that:

  • We can be 95% confident that (you can also say, “there is a 95% probability that”), based on the data and the model, the mean HNR of informal speech is between 16 and 16.6 dB.

  • There is a 95% probability that the difference in mean HNR between polite and informal speech is between 0.8 and 1.7 dB. In other words, HNR increases by 0.8 to 1.7 dB in polite speech relative to informal speech.

You could report these results in the following way:

According to the model, the mean HNR of informal speech is 16.28 dB (SD = 0.15), with a 95% probability that it is between 16 and 16.6 dB. At 95% confidence, the difference in HNR between polite and informal speech is between 0.8 and 1.7 dB (\(\beta\) = 1.25, SD = 0.22).

But what about the probability distribution of the HNR of polite speech? This information is not in the summary per se.

To obtain that, we need to do a bit of data wrangling with the model draws. You can learn about it in Working with MCMC draws.
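
As a preview, here is a minimal sketch of that wrangling, using as_draws_df() (the coefficient columns are named b_Intercept and b_attitudepol, following brms’s naming conventions):

# Extract the posterior draws as a data frame.
draws <- as_draws_df(hnr_bm)

# The posterior of the mean HNR of polite speech is beta_0 + beta_1,
# computed draw by draw.
hnr_pol <- draws$b_Intercept + draws$b_attitudepol

# Posterior mean and 95% CrI of the mean HNR of polite speech.
mean(hnr_pol)
quantile(hnr_pol, probs = c(0.025, 0.975))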

Footnotes

  1. Note that, in real-life research contexts, you should decide which distribution to use for different outcome variables by drawing on previous research that has assessed the possible distribution families for those variables, on domain-specific expertise, or on common sense. You will see that in most cases with linguistic phenomena we know very little about the nature of the variables we work with, hence it can be difficult to know which family of distributions to use.↩︎