QML - Week 5

Regression models: the basics

Stefano Coretta

Word frequency and reaction times

What is the relationship between a word’s lexical frequency and reaction times in a lexical decision task in Croatian?

Group activity

The data

library(tidyverse)  # provides read_csv() and select()
croat <- read_csv("data/vlasicek2024/croatian-lexdes.csv")
croat |> select(word_string, rt_milliseconds_mean, word_frequency)
# A tibble: 2,612 × 3
   word_string  rt_milliseconds_mean word_frequency
   <chr>                       <dbl>          <dbl>
 1 ribnjak                      575.           5755
 2 dostupnost                   624.          15564
 3 deformacija                  706.           6016
 4 antioksidans                 914.           7646
 5 zapis                        623.          44327
 6 kvadrat                      549.          36394
 7 presedan                     987.           7786
 8 zapremnina                  1035.           1434
 9 prerada                      750.          18759
10 general                      625.          96640
# ℹ 2,602 more rows

Reaction times

Word frequency

Word frequency: logged
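
The model fitted later uses columns named rt and log_freq, while the data preview shows rt_milliseconds_mean and word_frequency. The source does not show how the two are related, so the following is only a minimal sketch. It assumes rt is the mean RT in milliseconds and log_freq is the natural log of word_frequency. The three rows are copied from the printed tibble (RTs as rounded there), and croat_head is a hypothetical name.

```r
library(tibble)
library(dplyr)

# First three rows of the printed data preview (RTs rounded as printed)
croat_head <- tibble(
  word_string = c("ribnjak", "dostupnost", "deformacija"),
  rt_milliseconds_mean = c(575, 624, 706),
  word_frequency = c(5755, 15564, 6016)
)

croat_head <- croat_head |>
  mutate(
    rt = rt_milliseconds_mean,       # assumption: rt is the mean RT in ms
    log_freq = log(word_frequency)   # assumption: natural log (R's default)
  )

croat_head |> select(word_string, rt, log_freq)
```

Logging compresses the very large frequency values onto a much narrower scale, which is what the logged-frequency slide refers to.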

Word frequency and RTs

Gaussian model of RT

\[ RT \sim Gaussian(\mu, \sigma) \]

But we want to know what happens to RTs depending on the value of lexical frequency…

Then we let the mean \(\mu\) vary by lexical frequency!

\[ \begin{align} RT & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 \cdot logf \end{align} \]

But what are those \(\beta_0\) and \(\beta_1\)?

The equation of a line

\[ y = a + b \cdot x \]

  • \(a\) is the line intercept: the \(y\) value when \(x\) is 0.

  • \(b\) is the line slope: the change in \(y\) for each unit-increase of \(x\).
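
A quick worked example may help. The values \(a = 2\) and \(b = 3\) are made up purely for illustration:

\[ \begin{align} x = 0 &: \quad y = 2 + 3 \cdot 0 = 2\\ x = 1 &: \quad y = 2 + 3 \cdot 1 = 5\\ x = 2 &: \quad y = 2 + 3 \cdot 2 = 8 \end{align} \]

At \(x = 0\), \(y\) equals the intercept \(a\); each unit increase of \(x\) adds the slope \(b = 3\) to \(y\).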

Intercept

Slope

The equation of a line (bis)

\[ y = a + b \cdot x \]

  • \(a\) is the line intercept: the \(y\) value when \(x\) is 0.

  • \(b\) is the line slope: the change in \(y\) for each unit-increase of \(x\).

\[ y = \beta_0 + \beta_1 \cdot x \]

[alternative notation]

Regression model

\[ \begin{align} RT & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 \cdot logf & \text{[Regression equation]} \end{align} \]

  • A regression model is a model based on the equation of a line.

  • The model estimates \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) from the data (i.e. the observed \(RT\) and \(logf\) values).

  • \(\beta_0\), intercept

    • Mean RT value when logged frequency is 0 (i.e. when word frequency is 1, since exp(0) = 1).
  • \(\beta_1\), slope

    • Change in mean RT for each unit increase of log-frequency (when log-frequency goes from \(x\) to \(x + 1\)).
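
To see what "estimating \(\beta_0\) and \(\beta_1\) from the data" means, the sketch below simulates RTs from the regression model using made-up parameter values and then recovers them. It uses base R's lm() (ordinary least squares) as a quick stand-in, not the Bayesian fit used in these slides. All numbers are hypothetical.

```r
set.seed(42)

# Made-up "true" values, chosen only for illustration
beta_0 <- 1100   # intercept: mean RT (ms) when log-frequency is 0
beta_1 <- -45    # slope: change in mean RT per unit of log-frequency
sigma  <- 100    # standard deviation of RTs around the mean

n    <- 2000
logf <- runif(n, min = 5, max = 12)     # hypothetical log-frequencies
mu   <- beta_0 + beta_1 * logf          # regression equation
rt   <- rnorm(n, mean = mu, sd = sigma) # RT ~ Gaussian(mu, sigma)

# Least-squares fit as a stand-in for the Bayesian model
fit <- lm(rt ~ 1 + logf)
coef(fit)
```

The estimated coefficients should land close to the values used to generate the data, which is the sense in which the model "learns" the intercept and slope from observed RT and log-frequency values.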

Fit the model with brms

library(brms)

rt_bm <- brm(
  # Mean RT = intercept (1) + slope * log-frequency
  rt ~ 1 + log_freq,
  family = gaussian,
  data = croat
)

Model summary

summary(rt_bm)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: rt ~ 1 + log_freq 
   Data: croat (Number of observations: 2612) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept  1106.79     11.25  1084.44  1128.79 1.00     4140     2687
log_freq    -45.56      1.14   -47.81   -43.33 1.00     4163     2633

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   104.06      1.45   101.28   107.02 1.00     4397     3032

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Regression coefficients: intercept

            Estimate Est.Error       Q2.5      Q97.5
Intercept 1106.78704 11.249044 1084.43507 1128.78687
log_freq   -45.56409  1.142898  -47.81221  -43.33064

Intercept: the Intercept row

  • Estimate = posterior mean: 1107 ms.

  • Est.Error = posterior SD: 11 ms.

  • Q2.5, Q97.5 = posterior 95% CrI (credible interval): [1084, 1129] ms.
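
The three summaries above are simply statistics of the posterior draws. The sketch below illustrates this with simulated stand-in draws rather than the real ones, so the numbers will only approximately match the table. With a fitted model, the draws could be extracted with something like as_draws_df(rt_bm), where the column name b_Intercept is an assumption here.

```r
set.seed(1)

# Stand-in for 4000 posterior draws of the intercept (simulated, not real draws).
# With a fitted model they could come from as_draws_df(rt_bm)$b_Intercept
# (assumed column name).
draws <- rnorm(4000, mean = 1106.79, sd = 11.25)

post_mean <- mean(draws)                               # "Estimate"
post_sd   <- sd(draws)                                 # "Est.Error"
post_cri  <- quantile(draws, probs = c(0.025, 0.975))  # "Q2.5" and "Q97.5"

round(c(mean = post_mean, sd = post_sd, post_cri))
```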

Posterior distribution: intercept

Regression coefficients: slope

            Estimate Est.Error       Q2.5      Q97.5
Intercept 1106.78704 11.249044 1084.43507 1128.78687
log_freq   -45.56409  1.142898  -47.81221  -43.33064

Slope: the log_freq row

  • Estimate = posterior mean: -46 ms per unit of log-frequency.

  • Est.Error = posterior SD: 1 ms.

  • Q2.5, Q97.5 = posterior 95% CrI: [-48, -43] ms.

Posterior distribution: slope

Posterior predictions

Word frequency and reaction times (bis)

What is the relationship between a word’s lexical frequency and reaction times in a lexical decision task in Croatian?

  • When log-frequency is 0, the mean RT is between 1084 and 1129 ms, at 95% confidence.

  • For each unit increase of log-frequency, the mean RT decreases by 43 to 48 ms, at 95% confidence.
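
The two estimates can be combined through the regression equation to get a rough point prediction of mean RT for a given word. The sketch below plugs the posterior means from the model summary into \(\mu = \beta_0 + \beta_1 \cdot logf\) for two words from the data preview. It assumes log_freq is the natural log of the raw frequency, and it ignores the uncertainty around both coefficients.

```r
# Posterior means copied from the model summary
b_intercept <- 1106.79
b_log_freq  <- -45.56

# Raw frequencies of two words from the data preview
word_frequency <- c(ribnjak = 5755, general = 96640)

# Point prediction of mean RT (ms); assumes log_freq is the natural log
pred_rt <- b_intercept + b_log_freq * log(word_frequency)
round(pred_rt)

# One unit of natural-log frequency corresponds to multiplying
# the raw frequency by e (about 2.72)
exp(1)
```

Under these assumptions the predicted mean RTs come out at roughly 712 ms and 584 ms. Note that a "unit increase of log-frequency" is not one extra occurrence of a word: on the natural-log scale it means the raw frequency is multiplied by about 2.72.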

Be careful!

Correlation is NOT causation

  • Correlation between two variables: they co-vary (when one changes, the other systematically changes too).

  • Spurious correlations: two variables look correlated because of a confounder.

Note

  1. Vocabulary for snow vs. latitude
    • The farther north a community lives, the more words their language has for snow.
    • Confounder: subsistence ecology.
  2. Number of plant names in a language vs. biodiversity of the region
    • Languages in biodiverse regions have more words for plants.
    • Confounder: cultural reliance on plants.
  3. Metaphors involving animals vs. level of industrialization
    • Less industrialized societies use more animal-based metaphors.
    • Confounder: daily exposure to animals.

Be careful!

But it is if you use causal inference

  • Correlation can be interpreted causally if you adopt a causal inference approach.

  • We won’t cover causal inference in this course due to time constraints, but you can learn about it in McElreath’s textbook Statistical Rethinking.