Data analysis workflow: Part 2


Stefano Coretta

1 Research process cycle: Overview

2 Research questions and hypotheses

Two case studies:

  • (Week 8) Descriptive and exploratory: Do British infants use a different number of gestures depending on their cultural background?

  • (This week) Explanatory and corroboratory: vowel duration and gesture distance.

3 Intrinsic vowel duration: Future study

High vowels tend to be shorter than low vowels.

4 Distance to target

5 Source of intrinsic vowel duration

Figure 1: Distance to target (Turk et al. 1994)
Figure 2: Phonologisation (Toivonen et al. 2015)

6 Variables of interest

Outcome                Predictors
Vowel duration (ms)    Vowel (quality)
                       Distance to vocalic target (mm)
                       Speech rate (syl/s)

7 Models of intrinsic vowel duration

8 Research Hypotheses

A. Vowel duration is entirely determined by the distance to vowel target.

B. Vowel duration is entirely determined by stored durational targets.

C. Vowel duration is not determined by distance to vowel target alone: stored durational targets are also necessary.

9 Research context overview

Topic                  Intrinsic vowel duration
Research problem       There are competing models of gestural duration.
Goal                   Assess these models against empirical data by comparing their fit.
Research question      Which of the three possible models of gestural duration is corroborated?
Research hypotheses

10 Causal inference

  • Causal inference approach: think about the causal relationship of variables to determine statistical models.

  • Directed Acyclic Graphs, or DAGs.

  • Demo on DAGitty.
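As a rough illustration of the approach, a DAG can also be written down in R with the dagitty package, the R counterpart of the DAGitty web tool. The graph below is an assumption made for this sketch, not the DAG from the demo: vowel quality affects both distance and duration, while distance and speech rate affect duration.

```r
library(dagitty)

# Hypothetical DAG, for illustration only (not taken from the demo).
vdur_dag <- dagitty("dag {
  vowel -> distance
  vowel -> duration
  distance -> duration
  speech_rate -> duration
}")

# Which variables need adjusting for to estimate the effect of
# distance on duration under this graph?
adj_sets <- adjustmentSets(vdur_dag, exposure = "distance", outcome = "duration")
print(adj_sets)

# Conditional independencies implied by the graph; these can be
# checked against data as a sanity check on the DAG itself.
print(impliedConditionalIndependencies(vdur_dag))
```

Under this assumed graph, vowel is a common cause of distance and duration, which is one reason to include both vowel and distance as predictors in the same model.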

To learn more about Causal Inference:

11 Regression model

Inference                  Bayesian
Model                      Regression model
Outcome variable           Vowel duration
Distribution of outcome    Log-normal
Predictors                 vowel, distance, speech rate
                             • distance: logged and standardised
                             • speech rate: logged and standardised
Effect types               non-linear effects for distance and speech rate
Coding                     indexing
Varying effects              • by-speaker distance, by vowel
                             • by-speaker speech rate
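"Logged and standardised" means taking the natural logarithm and then converting to z-scores. A minimal base R sketch, with distance values that are made up for illustration:

```r
# Hypothetical distances in mm, made up for illustration.
distance <- c(2.1, 1.9, 4.4, 6.2, 8.3, 7.9)

# Log-transform, then centre on the mean and scale by the standard deviation.
distance_log <- log(distance)
distance_logz <- (distance_log - mean(distance_log)) / sd(distance_log)

# After standardising, the mean is 0 (up to rounding error) and the SD is 1.
print(round(mean(distance_logz), 10))
print(sd(distance_logz))
```

After this transformation, one unit of the predictor corresponds to one standard deviation of log-distance, which puts distance and speech rate on a comparable scale.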

12 Regression model: code

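library(brms)

# Model specification only: `data` and the sampler arguments are supplied when fitting.
bm <- brm(
  vowel_duration ~
    # indexing: one coefficient per vowel, no intercept
    0 + vowel +
    # non-linear effect of distance, plus by-speaker varying smooths for each vowel
    s(distance_logz) +
    s(distance_logz, speaker, by = vowel, bs = "fs", m = 1) +
    # non-linear effect of speech rate, plus by-speaker varying smooths
    s(speech_rate_logz, k = 5) +
    s(speech_rate_logz, speaker, bs = "fs", m = 1),
  family = lognormal
)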
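The `0 + vowel` term suppresses the intercept, so that each vowel gets its own coefficient (indexing) rather than a contrast against a reference level. A minimal base R sketch with `model.matrix()`, using a made-up factor, shows the difference in the resulting design matrix:

```r
# Toy factor with the five vowels, for illustration only.
vowel <- factor(c("a", "e", "i", "o", "u", "a"))

# Default treatment coding: an intercept plus contrasts against "a".
X_treatment <- model.matrix(~ vowel)
print(colnames(X_treatment))

# Indexing: no intercept, one indicator column per vowel.
X_index <- model.matrix(~ 0 + vowel)
print(colnames(X_index))
```

With indexing, each coefficient is directly the estimate for one vowel (on the log scale, given the log-normal family), which also makes it easier to set the same prior on every vowel.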

13 Statistical hypotheses

A. There is no robust effect of vowel but there is a robust (non-linear) effect of distance.

B. There is a robust effect of vowel but there is no robust effect of distance.

C. There is a robust effect of both vowel and distance.

14 Simulate data

Check the faux package, although it might not be actively developed anymore.


library(dplyr)  # mutate(), case_when()
library(faux)   # add_random(), add_within(), add_ranef()

s_n <- 20
rep_n <- 10
vdur_m <- 75
vdur_a <- 20; vdur_e <- 10; vdur_i <- -15; vdur_o <- 10; vdur_u <- -22
dist_a <- 8; dist_e <- 4.5; dist_i <- 2; dist_o <- 6; dist_u <- 2
dist_sd <- 0.2
b_dist <- 0.8
b_sr <- 0.01
sr_m <- 4
# varying duration
dur_spk_sd <- 5
dur_rep_sd <- 2
# varying distance
dist_spk_sd <- 0.05
dist_rep_sd <- 0.05
# varying sr
sr_spk_sd <- 0.005
sr_rep_sd <- 0.005

sigma_sd <- 20

sim_data <- add_random(speaker = s_n) |> 
  add_within("speaker", vowel = c("a", "e", "i", "o", "u")) |> 
  add_within("vowel", rep = 1:rep_n) |> 
  add_ranef("speaker", dur_spk = dur_spk_sd) |> 
  add_ranef("rep", dur_rep = dur_rep_sd) |> 
  add_ranef("speaker", dist_spk = dist_spk_sd) |> 
  add_ranef("rep", dist_rep = dist_rep_sd) |>
  add_ranef("speaker", sr_spk = sr_spk_sd) |> 
  add_ranef("rep", sr_rep = sr_rep_sd) |> 
  add_ranef(sigma = sigma_sd) |> 
  add_ranef(sigma_dist = dist_sd) |> 
  mutate(
    distance = case_when(
      vowel == "a" ~ dist_a + dist_spk + dist_rep + sigma_dist,
      vowel == "e" ~ dist_e + dist_spk + dist_rep + sigma_dist,
      vowel == "i" ~ dist_i + dist_spk + dist_rep + sigma_dist,
      vowel == "o" ~ dist_o + dist_spk + dist_rep + sigma_dist,
      vowel == "u" ~ dist_u + dist_spk + dist_rep + sigma_dist
    ),
    speech_rate = sr_m + sr_spk + sr_rep
  ) |> 
  mutate(
    # `sigma` is the per-observation error column created by add_ranef(sigma = sigma_sd)
    # above; adding the constant `sigma_sd` instead would shift every value by 20.
    vowel_duration = case_when(
      vowel == "a" ~ vdur_m + vdur_a + dur_spk + dur_rep + (b_dist * distance) + (b_sr * speech_rate) + sigma,
      vowel == "e" ~ vdur_m + vdur_e + dur_spk + dur_rep + (b_dist * distance) + (b_sr * speech_rate) + sigma,
      vowel == "i" ~ vdur_m + vdur_i + dur_spk + dur_rep + (b_dist * distance) + (b_sr * speech_rate) + sigma,
      vowel == "o" ~ vdur_m + vdur_o + dur_spk + dur_rep + (b_dist * distance) + (b_sr * speech_rate) + sigma,
      vowel == "u" ~ vdur_m + vdur_u + dur_spk + dur_rep + (b_dist * distance) + (b_sr * speech_rate) + sigma
    ),
    # log and standardise the continuous predictors
    distance_log = log(distance),
    distance_logz = (distance_log - mean(distance_log)) / sd(distance_log),
    speech_rate_log = log(speech_rate),
    speech_rate_logz = (speech_rate_log - mean(speech_rate_log)) / sd(speech_rate_log)
  )

15 Simulate data: plot

16 Simulate data: model

bm <- brm(
  vowel_duration ~
    0 + vowel +
    s(distance_logz) +
    s(distance_logz, speaker, by = vowel, bs = "fs", m = 1) +
    s(speech_rate_logz, k = 5) +
    s(speech_rate_logz, speaker, bs = "fs", m = 1),
  family = lognormal,
  data = sim_data,
  cores = 4,
  seed = 7162,
  file = "cache/data-analysis-workflow-2_bm"
)

This took 40 minutes to run and ended with 218 divergent transitions. You would normally also specify priors and try to fix the divergent transitions (and, with non-linear effects, you might also want to estimate the number of knots before fitting).
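A minimal sketch of what specifying priors and addressing divergent transitions could look like in brms. The prior values, the simplified formula (without the by-speaker smooths) and the wrapper function `fit_bm()` are placeholders introduced for this illustration, not the settings of the actual analysis.

```r
library(brms)

# Hypothetical priors on the log scale; the values are placeholders.
# Class `b` covers all population-level coefficients (the vowel means and
# the linear parts of the smooths), so a wide prior is used here.
bm_priors <- c(
  prior(normal(0, 5), class = b),
  prior(exponential(1), class = sds),   # SDs of the smooth terms
  prior(exponential(1), class = sigma)  # residual SD on the log scale
)
print(bm_priors)

# Wrapper so that the (slow) model fit only runs when called with data.
fit_bm <- function(data) {
  brm(
    vowel_duration ~ 0 + vowel + s(distance_logz) + s(speech_rate_logz, k = 5),
    family = lognormal,
    data = data,
    prior = bm_priors,
    # A higher adapt_delta makes the sampler take smaller steps, which
    # often reduces divergent transitions at the cost of speed.
    control = list(adapt_delta = 0.99),
    cores = 4
  )
}
```

Calling `fit_bm(sim_data)` would then fit the simplified model. Raising `adapt_delta` is only one option; more informative priors or a simpler varying-effects structure can also help.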

17 Simulate data: model summary

conditional_effects(bm, "distance_logz:vowel")

18 Summary

  • Research context.

  • Research questions and hypotheses.

  • Plan study design and analysis in detail (here we just scratched the surface).

  • Simulate data and/or use previous/pilot data to check that the analysis works as intended.

  • Pre-register the analysis or write a Registered Report.
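The logic of checking an analysis on simulated data can be sketched with a deliberately simple stand-in: simulate from known parameter values, fit the planned model, and see whether the estimates recover those values. The example uses `lm()` instead of a Bayesian model to keep it fast, and all values are chosen for illustration.

```r
set.seed(1234)

# Known ("true") parameter values chosen for this toy example.
n <- 1000
b0_true <- 75   # intercept
b1_true <- 0.8  # slope
x <- rnorm(n, mean = 0, sd = 1)
y <- b0_true + b1_true * x + rnorm(n, mean = 0, sd = 1)

# Fit the planned model to the simulated data.
fit <- lm(y ~ x)

# Compare the estimates and their 95% intervals with the true values.
est <- coef(fit)
ci <- confint(fit)
print(cbind(true = c(b0_true, b1_true), estimate = est, ci))
```

If the estimates land far from the true values, or the intervals are much wider than expected, that points to a problem with the model or the design before any real data are collected.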