# Install the coretta2018itapol package first if needed.
# remotes::install_github("stefanocoretta/coretta2018itapol")
library(coretta2018itapol)
data("token_measures")
Regression: Indexing of categorical predictors
1 Categorical predictors: contrasts vs indexing
Categorical predictors in regression models are coded by default using the so-called “treatment” contrasts, a type of contrast coding.
This means that the intercept of the model will correspond to the predicted value when the categorical predictor is at the reference level, and the other coefficients will be the difference between the predicted value when the categorical predictor is at the other levels and the intercept.
This way of coding categorical predictors makes it more difficult to set up priors and can at times make it less straightforward to interpret the model summary.
In this post, you will learn about a different method of coding categorical predictors, called indexing.
With this method, the model estimates the predicted value for each level of the predictor, rather than for just the reference level. This means that you don’t get coefficients that are differences between predicted values at different levels.
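As a concrete illustration of the default behaviour (using a toy factor, not the data analysed below), R's treatment coding can be inspected with contrasts():

```r
# By default, R codes factors with treatment contrasts: the first level
# is the reference and every other level gets a 0/1 indicator column.
phonation <- factor(c("voiced", "voiceless"))
contrasts(phonation)
# The "voiceless" column is 0 for the reference level and 1 otherwise.
```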
Let’s see an example of both methods using data from Coretta (2018).
2 The data
Here’s a preview of the data.
# Let's remove NAs from the v1_duration column
library(tidyverse)
token_measures <- token_measures |>
  drop_na(v1_duration)

token_measures
3 Treatment contrasts
library(brms)
get_prior(
  v1_duration ~ c2_phonation,
  family = lognormal,
  data = token_measures
)
We get an Intercept (that’s the expected vowel duration when the following consonant is voiced) and a class = b prior for the coefficient c2_phonationvoiceless. The coefficient c2_phonationvoiceless is the difference between the vowel duration when the following consonant is voiceless and the vowel duration when the following consonant is voiced.
\[ \begin{align} vdur & \sim LogNormal(\mu, \sigma) \\ \mu & = \beta_0 + \beta_1 \cdot \text{voiceless} \end{align} \]
where \(\text{voiceless}\) is an “indicator” (aka dummy) variable that “turns on” the \(\beta_1\) coefficient in the formula of \(\mu\) when the following consonant is voiceless.
This is why \(\beta_1\) is a difference between expected values (the value when C2 is voiceless minus the value when C2 is voiced).
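The indicator variable can be made explicit with model.matrix(), which shows the design matrix R builds under treatment contrasts (toy data, for illustration only):

```r
# Design matrix under treatment contrasts: an all-1s intercept column and
# an indicator column that "turns on" beta_1 only in the voiceless rows.
phonation <- factor(c("voiced", "voiceless", "voiced"))
model.matrix(~ phonation)
```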
Setting priors on \(\beta_0\) and \(\beta_1\) means we need to think about:
- What we expect the mean vowel duration to be when C2 is voiced, and
- What we expect the difference in mean duration between vowels followed by a voiceless vs a voiced consonant.
It is much easier to set priors directly on the mean vowel duration when C2 is voiced and when C2 is voiceless. We can use indexing to achieve that.
4 Indexing
With indexing, each level of a categorical predictor gets its own coefficient and the model estimates the predicted value of the outcome at each level.
Unlike setting contrasts with contrasts()<-, indexing is applied by suppressing the intercept of the model with 0 +.
get_prior(
  v1_duration ~ 0 + c2_phonation,
  family = lognormal,
  data = token_measures
)
You can see that there is no class = Intercept in the output of get_prior() because we “suppressed” the intercept of the model. Now, c2_phonationvoiced is the mean vowel duration when the following consonant is voiced and c2_phonationvoiceless is the mean vowel duration when the following consonant is voiceless.
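This also makes prior specification straightforward: both coefficients are mean log-durations, so we can give each level the same prior. Here is a sketch with brms’s prior() helper; the values are illustrative assumptions, not the priors used in this post:

```r
library(brms)

# Both coefficients are means on the log(ms) scale, so one weakly
# informative prior can be reused for each level (illustrative values).
priors <- c(
  prior(normal(4.5, 1), class = b, coef = c2_phonationvoiced),
  prior(normal(4.5, 1), class = b, coef = c2_phonationvoiceless)
)
```

These could then be passed to brm() via its prior argument.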
5 Fitting a model with indexing
\[ \begin{align} vdur_i & \sim LogNormal(\mu_i, \sigma) \\ \mu_i & = \alpha_{\text{Phon}[i]} \\ \end{align} \]
m_i <- brm(
  v1_duration ~ 0 + c2_phonation,
  family = lognormal,
  data = token_measures,
  seed = 1425,
  file = "cache/regression-indexing-m_i"
)
Compiling Stan program...
Start sampling
summary(m_i)
Family: lognormal
Links: mu = identity; sigma = identity
Formula: v1_duration ~ 0 + c2_phonation
Data: token_measures (Number of observations: 1342)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
c2_phonationvoiced 4.65 0.01 4.62 4.67 1.00 4532
c2_phonationvoiceless 4.54 0.01 4.52 4.57 1.00 4229
Tail_ESS
c2_phonationvoiced 2899
c2_phonationvoiceless 2967
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.32 0.01 0.31 0.34 1.00 4024 2942
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
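Since the model uses the lognormal family, the estimates above are on the log scale; back-transforming with exp() gives durations in milliseconds:

```r
# Point estimates from the summary above, back-transformed to ms.
exp(4.65)  # voiced: about 105 ms
exp(4.54)  # voiceless: about 94 ms
```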
6 Get the difference between levels
We can now quickly calculate the difference between c2_phonationvoiced and c2_phonationvoiceless with the avg_comparisons() function from the marginaleffects package.
library(marginaleffects)
# you will also have to install the collapse package
avg_comparisons(m_i, variables = "c2_phonation")
Estimate 2.5 % 97.5 %
-10.9 -14.4 -7.43
Term: c2_phonation
Type: response
Comparison: mean(voiceless) - mean(voiced)
Columns: term, contrast, estimate, conf.low, conf.high, predicted_lo, predicted_hi, predicted, tmp_idx
So, based on the model and data, vowels followed by voiceless consonants are about 7 to 14 ms shorter than vowels followed by voiced consonants, at 95% probability.
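The same comparison can also be done by hand from the posterior draws (a sketch; it assumes the fitted model m_i from above):

```r
library(brms)

# Extract the posterior draws, back-transform each coefficient to ms,
# and summarise the distribution of the difference.
draws <- as_draws_df(m_i)
diff_ms <- exp(draws$b_c2_phonationvoiceless) - exp(draws$b_c2_phonationvoiced)
quantile(diff_ms, probs = c(0.025, 0.5, 0.975))
```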