Data analysis workflow

Stefano Coretta

Research process cycle: Overview

Research questions and hypotheses

  • Research questions are testable questions.

  • Research hypotheses are falsifiable statements.

  • Check this entry again.

Two case studies:

  • Descriptive and exploratory: Do British infants use a different number of gestures depending on their cultural background?

  • Explanatory and corroboratory: vowel duration and gesture distance.

British infants’ gestures: Study design

  • Published paper: https://www.doi.org/10.1111/cdev.13406

  • Three cultural backgrounds: English, Bangladeshi and Cantonese.

  • At least 20 infants per background.

  • Laboratory setting, three tasks.

Analysis plan (gesture count by background)

  • Summaries with median and range.

  • Visualisation with strip charts.

  • Regression models

    • Outcome variable: gesture count → Poisson family.

    • Predictors: cultural background.
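Written out, the planned model might look as follows. The symbols ($y_i$ for the gesture count of infant $i$, $\lambda_i$ for the expected count, $\beta$ for the coefficients) are notation introduced here for illustration. The sketch assumes the log link that brms uses by default for the Poisson family, and one coefficient per background with no separate intercept. Priors are left unspecified.

```latex
\begin{aligned}
y_i &\sim \mathrm{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \beta_{\text{background}[i]}
\end{aligned}
```

Under this formulation, each $\beta$ is the log of the expected number of gestures for one cultural background, so $\exp(\beta)$ gives the expected count itself.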

British infants’ gestures: Data simulation

library(tidyverse)

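# Number of infants per cultural background
N <- 20
background <- c("English", "Bangladeshi", "Cantonese")
# Simulate gesture counts from Poisson distributions with assumed mean rates
count_en <- rpois(N, 4)
count_ba <- rpois(N, 1)
count_ca <- rpois(N, 1.5)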

gestures_sim <- tibble(
  background = rep(background, each = N),
  count = c(count_en, count_ba, count_ca)
)

gestures_sim
# A tibble: 60 × 2
   background count
   <chr>      <int>
 1 English        2
 2 English        5
 3 English        4
 4 English        3
 5 English        3
 6 English        5
 7 English        3
 8 English        6
 9 English        3
10 English        5
# ℹ 50 more rows

British infants’ gestures: Data simulation (summarise)

gestures_sim |> 
  group_by(background) |> 
  summarise(median(count), min(count), max(count))
# A tibble: 3 × 4
  background  `median(count)` `min(count)` `max(count)`
  <chr>                 <dbl>        <int>        <int>
1 Bangladeshi               1            0            2
2 Cantonese                 1            0            6
3 English                   4            1            9

British infants’ gestures: Data simulation (plotting)

gestures_sim |> 
  ggplot(aes(background, count)) +
  geom_jitter(width = 0.1)

British infants’ gestures: Data simulation (model)

library(brms)

gestures_sim_bm <- brm(
  count ~ 0 + background,
  family = poisson,
  data = gestures_sim,
  cores = 4,
  seed = 9527,
  file = "cache/gestures_sim_bm"
)

British infants’ gestures: Data simulation (model summary)

summary(gestures_sim_bm, prob = 0.8)
 Family: poisson 
  Links: mu = log 
Formula: count ~ 0 + background 
   Data: gestures_sim (Number of observations: 60) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                      Estimate Est.Error l-80% CI u-80% CI Rhat Bulk_ESS
backgroundBangladeshi     0.20      0.20    -0.05     0.45 1.00     3695
backgroundCantonese       0.39      0.18     0.16     0.62 1.00     3934
backgroundEnglish         1.35      0.11     1.21     1.49 1.00     3935
                      Tail_ESS
backgroundBangladeshi     2480
backgroundCantonese       2837
backgroundEnglish         2974

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
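
One way to read this summary is to check whether the model recovers the rates used in the simulation. The coefficients are on the log scale (`Links: mu = log`), so the simulated Poisson rates need to be log-transformed before comparing. Below is a minimal sketch in base R, with the estimates copied by hand from the summary above. The object names are illustrative.

```r
# Poisson rates used to simulate the data (from the rpois() calls)
true_rate <- c(Bangladeshi = 1, Cantonese = 1.5, English = 4)

# Posterior means and 80% CrI bounds copied from the model summary (log scale)
estimate <- c(Bangladeshi = 0.20, Cantonese = 0.39, English = 1.35)
lower    <- c(Bangladeshi = -0.05, Cantonese = 0.16, English = 1.21)
upper    <- c(Bangladeshi = 0.45, Cantonese = 0.62, English = 1.49)

# Put the true rates on the same (log) scale as the coefficients
true_log <- log(true_rate)

# Does each 80% CrI contain the true value?
recovered <- true_log >= lower & true_log <= upper

data.frame(true_log = round(true_log, 2), estimate, lower, upper, recovered)
```

With these numbers, all three intervals contain the log of the simulated rate. With only 20 simulated infants per group, some deviation of the point estimates from the true values is expected.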

British infants’ gestures: Data simulation (conditional posteriors)

conditional_effects(gestures_sim_bm, prob = 0.8)

British infants’ gestures: Data acquisition

gestures <- read_csv("data/cameron2020/gestures.csv") |> 
  group_by(background, dyad) |> 
  summarise(
    count_sum = sum(count, na.rm = TRUE)
  ) |> 
  drop_na(count_sum)
gestures
# A tibble: 60 × 3
# Groups:   background [3]
   background dyad  count_sum
   <chr>      <chr>     <dbl>
 1 Bengali    b01          17
 2 Bengali    b02          36
 3 Bengali    b03          43
 4 Bengali    b04          50
 5 Bengali    b05          33
 6 Bengali    b06          24
 7 Bengali    b07          16
 8 Bengali    b08          37
 9 Bengali    b09          26
10 Bengali    b10          13
# ℹ 50 more rows

British infants’ gestures: Summarise

gestures |> 
  group_by(background) |> 
  summarise(median(count_sum), min(count_sum), max(count_sum))
# A tibble: 3 × 4
  background `median(count_sum)` `min(count_sum)` `max(count_sum)`
  <chr>                    <dbl>            <dbl>            <dbl>
1 Bengali                   36.5                5               81
2 Chinese                   32.5                1              108
3 English                   20                  1              164

British infants’ gestures: Plot

gestures |> 
  ggplot(aes(background, count_sum)) +
  geom_jitter(width = 0.1)

British infants’ gestures: Model

gestures_bm <- brm(
  count_sum ~ 0 + background,
  family = poisson,
  data = gestures,
  cores = 4,
  seed = 9527,
  file = "cache/gestures_bm"
)

British infants’ gestures: Model summary

summary(gestures_bm, prob = 0.8)
 Family: poisson 
  Links: mu = log 
Formula: count_sum ~ 0 + background 
   Data: gestures (Number of observations: 60) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                  Estimate Est.Error l-80% CI u-80% CI Rhat Bulk_ESS Tail_ESS
backgroundBengali     3.63      0.04     3.58     3.68 1.00     4227     3072
backgroundChinese     3.60      0.04     3.55     3.65 1.00     4388     3001
backgroundEnglish     3.38      0.04     3.32     3.43 1.00     3644     2852

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

British infants’ gestures: Conditional posteriors

conditional_effects(gestures_bm, prob = 0.8)

British infants’ gestures: Conditional predictions

library(marginaleffects)

avg_predictions(gestures_bm, variables = "background", conf_level = 0.8)

 background Estimate 10.0 % 90.0 %
    Bengali     37.7   36.0   39.5
    Chinese     36.6   34.9   38.4
    English     29.3   27.8   30.8

Type:  response 
Columns: background, estimate, conf.low, conf.high 
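
The predictions above are on the response scale (gesture counts), while the coefficients in the model summary are on the log scale. As a rough check, exponentiating the posterior means from the summary should give values close to these predictions. Below is a sketch with the values copied by hand. The vector names are illustrative.

```r
# Posterior means on the log scale, copied from summary(gestures_bm)
log_estimate <- c(Bengali = 3.63, Chinese = 3.60, English = 3.38)

# Back-transform to the count scale
count_estimate <- exp(log_estimate)
round(count_estimate, 1)
```

Small discrepancies from the table are expected: the summary values are rounded, and the exponential of a posterior mean is not the same as the mean of the exponentiated draws.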

British infants’ gestures: Comparisons

avg_comparisons(gestures_bm, variables = list(background = "pairwise"), conf_level = 0.8)

                      Contrast Estimate 10.0 % 90.0 %
 mean(Chinese) - mean(Bengali)    -1.07  -3.43   1.44
 mean(English) - mean(Bengali)    -8.37 -10.78  -6.07
 mean(English) - mean(Chinese)    -7.31  -9.72  -5.06

Term: background
Type:  response 
Columns: term, contrast, estimate, conf.low, conf.high, predicted_lo, predicted_hi, predicted, tmp_idx 
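
To see where these contrasts come from, it may help to compute one difference by hand: each posterior draw of the log-scale coefficients is exponentiated and the two counts are subtracted, giving a posterior distribution of the difference. The sketch below does not use the fitted model. It assumes, purely for illustration, that the posteriors are independent normal distributions with the means and errors from the model summary, so the numbers only approximate the output above. With the actual model, the draws could instead be extracted with `as_draws_df(gestures_bm)`.

```r
set.seed(9527)
n_draws <- 4000

# Simulated stand-ins for posterior draws (log scale); normality and
# independence are assumptions made for this illustration only
draws_bengali <- rnorm(n_draws, mean = 3.63, sd = 0.04)
draws_chinese <- rnorm(n_draws, mean = 3.60, sd = 0.04)

# Difference in predicted counts, computed draw by draw
diff_count <- exp(draws_chinese) - exp(draws_bengali)

# Median and 80% interval of the difference (Chinese - Bengali)
diff_ci <- quantile(diff_count, probs = c(0.1, 0.5, 0.9))
diff_ci
```

Working draw by draw matters because the difference is taken on the count scale, after back-transforming, rather than on the log scale of the coefficients.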

British infants’ gestures: Interpretation

  • The 80% CrIs of the predicted number of gestures are:

    • Bengali: 36–40.

    • Chinese: 35–38.

    • English: 28–31.

  • English infants produced fewer gestures (5 to 11 fewer than Chinese and Bengali infants, at 80% probability).

  • Chinese and Bengali infants have very similar predicted numbers of gestures. Their difference (Chinese minus Bengali) is between -3.5 and +1.5 gestures at 80% probability.

British infants’ gestures: Reporting

We fitted a Bayesian regression model to the number of gestures produced by each infant, using a Poisson distribution. As the only predictor, we included cultural background (Bengali, Chinese, English). This predictor was coded using indexing (by suppressing the intercept with 0 + in the model formula).

Based on the model and data, the predicted number of gestures by cultural background is 36–40 for Bengali infants, 35–38 for Chinese infants, and 28–31 for English infants, at 80% probability. When comparing cultural backgrounds, there is an 80% probability that the difference in the number of gestures is between -3.5 and 1.5 for Chinese vs Bengali, -11 and -6 for English vs Bengali, and -10 and -5 for English vs Chinese.

British infants’ gestures: Research question

Do British infants use a different number of gestures depending on their cultural background?

Based on the model and data, while Bengali and Chinese infants have a very similar number of gestures, English infants have a somewhat lower count relative to the other two groups.

In the next lecture…

Does the distance the tongue has to travel to produce a vowel determine the duration of the vowel?