Regression: Indexing of categorical predictors

Learn about indexing of categorical predictors in regression models
Author

Stefano Coretta

Published

June 25, 2024

1 Categorical predictors: contrasts vs indexing

Categorical predictors in regression models are coded by default using the so-called “treatment” contrasts, a type of contrast coding.

This means that the intercept of the model corresponds to the predicted value when the categorical predictor is at its reference level, and each of the other coefficients is the difference between the predicted value at one of the other levels and the intercept.
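To make this concrete, here is a quick sketch with a hypothetical two-level factor (not the data analysed below), showing R's default treatment coding:

# Default treatment coding of a hypothetical two-level factor:
# the reference level ("voiced") is coded 0, the other level 1.
phonation <- factor(c("voiced", "voiceless"))
contrasts(phonation)
#>           voiceless
#> voiced            0
#> voiceless         1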

This way of coding categorical predictors makes it more difficult to set up priors and can at times make it less straightforward to interpret the model summary.

In this post, you will learn about a different method of coding categorical predictors, called indexing.

With this method, the model estimates the predicted value for each level of the predictor, rather than for just the reference level. This means that you don’t get coefficients that are differences between predicted values at different levels.

Let’s see an example of both methods using data from Coretta (2018).

2 The data

# Install the coretta2018itapol package first if needed.
# remotes::install_github("stefanocoretta/coretta2018itapol")
library(coretta2018itapol)
data("token_measures")

Here’s a preview of the data.

# Let's remove NAs from the v1_duration column
library(tidyverse)
token_measures <- token_measures |> 
  drop_na(v1_duration)

token_measures

3 Treatment contrasts

library(brms)
get_prior(
  v1_duration ~ c2_phonation,
  family = lognormal,
  data = token_measures
)

We get an Intercept prior (the intercept is the expected vowel duration when the following consonant is voiced) and a class = b prior for the coefficient c2_phonationvoiceless, which is the difference between the expected vowel duration when the following consonant is voiceless and the expected vowel duration when it is voiced.

\[
\begin{align}
vdur & \sim LogNormal(\mu, \sigma) \\
\mu & = \beta_0 + \beta_1 \cdot \text{voiceless}
\end{align}
\]

where \(\text{voiceless}\) is an “indicator” (aka dummy) variable that “turns on” the \(\beta_1\) coefficient in the formula of \(\mu\) when the following consonant is voiceless.

This is why \(\beta_1\) is a difference between expected values (the value when C2 is voiceless minus the value when C2 is voiced).
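If you want to see this indicator variable, you can inspect the design matrix that R builds from the treatment-coded formula (just the first few rows):

# The (Intercept) column is all 1s, while the c2_phonationvoiceless
# column is the 0/1 indicator variable that switches beta_1 on and off.
head(model.matrix(~ c2_phonation, data = token_measures))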

Setting priors on \(\beta_0\) and \(\beta_1\) means we need to think about:

  • What we expect the mean vowel duration to be when C2 is voiced, and
  • What we expect the mean difference in vowel duration to be between vowels followed by a voiceless vs a voiced consonant.

It is much easier to set priors on the mean vowel duration when C2 is voiced and on the mean vowel duration when C2 is voiceless. We can use indexing to achieve that.

4 Indexing

With indexing, each level of a categorical predictor gets its own coefficient and the model estimates the predicted value of the outcome at each level.

Unlike contrast coding, which is set with contrasts()<-, indexing is applied by suppressing the intercept of the model with 0 + in the model formula.
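To see what 0 + does, compare the design matrix of the indexed formula with the treatment-coded one shown earlier:

# With the intercept suppressed there is no (Intercept) column:
# instead, each level of c2_phonation gets its own indicator column.
head(model.matrix(~ 0 + c2_phonation, data = token_measures))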

get_prior(
  v1_duration ~ 0 + c2_phonation,
  family = lognormal,
  data = token_measures
)

You can see that there is no class = Intercept in the output of get_prior() because we “suppressed” the intercept of the model.

Now, c2_phonationvoiced is the mean vowel duration when the following consonant is voiced and c2_phonationvoiceless is the mean vowel duration when the following consonant is voiceless.
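This makes setting priors much more intuitive: we can put a prior on the mean (log) vowel duration of each level directly. Here is a minimal sketch (the values are hypothetical and only for illustration; they are on the log scale because of the lognormal family):

# Hypothetical priors for the two level coefficients (log scale).
priors <- c(
  prior(normal(4.5, 1), class = b, coef = c2_phonationvoiced),
  prior(normal(4.5, 1), class = b, coef = c2_phonationvoiceless)
)

These can then be passed to brm() through its prior argument (prior = priors). The model below is fitted with the default priors.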

5 Fitting a model with indexing

\[
\begin{align}
vdur_i & \sim LogNormal(\mu_i, \sigma) \\
\mu_i & = \alpha_{\text{Phon}[i]}
\end{align}
\]

where \(\text{Phon}[i]\) is the phonation level (voiced or voiceless) of observation \(i\), so that each level gets its own coefficient \(\alpha\).

m_i <- brm(
  # 0 + suppresses the intercept so that each level of c2_phonation
  # gets its own coefficient (indexing).
  v1_duration ~ 0 + c2_phonation,
  family = lognormal,
  data = token_measures,
  seed = 1425,
  # Save the fitted model to file so it is not refitted on re-render.
  file = "cache/regression-indexing-m_i"
)
summary(m_i)
 Family: lognormal 
  Links: mu = identity; sigma = identity 
Formula: v1_duration ~ 0 + c2_phonation 
   Data: token_measures (Number of observations: 1342) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
c2_phonationvoiced        4.65      0.01     4.62     4.67 1.00     4532
c2_phonationvoiceless     4.54      0.01     4.52     4.57 1.00     4229
                      Tail_ESS
c2_phonationvoiced        2899
c2_phonationvoiceless     2967

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.32      0.01     0.31     0.34 1.00     4024     2942

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
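Remember that, because of the lognormal family, these estimates are on the log scale. As a quick sanity check, we can back-transform the point estimates from the summary (keeping in mind that exp(mu) is the median, not the mean, of a lognormal distribution):

# Back-transform the point estimates from the log scale to milliseconds.
exp(4.65)              # about 105 ms when C2 is voiced
exp(4.54)              # about 94 ms when C2 is voiceless
exp(4.54) - exp(4.65)  # about -11 ms (voiceless minus voiced)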

6 Get the difference between levels

We can now quickly calculate the difference in vowel duration between the voiceless and voiced levels with the avg_comparisons() function from the marginaleffects package.

library(marginaleffects)
# you will also have to install the collapse package

avg_comparisons(m_i, variables = "c2_phonation")

 Estimate 2.5 % 97.5 %
    -10.9 -14.4  -7.43

Term: c2_phonation
Type:  response 
Comparison: mean(voiceless) - mean(voiced)
Columns: term, contrast, estimate, conf.low, conf.high, predicted_lo, predicted_hi, predicted, tmp_idx 

So based on the model and data, vowels followed by voiceless consonants are 7.4-14.4 ms shorter than vowels followed by voiced consonants, at 95% probability.
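If you prefer to work with the posterior draws directly, you can get a very similar estimate by hand: extract the draws, back-transform the two level coefficients, and take their difference. This is a sketch of that calculation (it returns the difference between the back-transformed level estimates, i.e. the lognormal medians, which here closely matches the avg_comparisons() output):

library(posterior)

draws <- as_draws_df(m_i)
# Difference between voiceless and voiced vowel duration (ms), per draw.
diff_ms <- exp(draws$b_c2_phonationvoiceless) - exp(draws$b_c2_phonationvoiced)
# Median and 95% quantile interval of the difference.
quantile(diff_ms, probs = c(0.025, 0.5, 0.975))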