
.title[
# Quantitative Methods for LEL
]
.subtitle[
## Week 4
]
.author[
### Dr Stefano Coretta
]
.institute[
### University of Edinburgh
]
.date[
### 2023/10/10
]

---

---

## Albanian VOT

---

```r
alb_vot_vl
```

```
## # A tibble: 24 × 7
## speaker file label release voi_onset consonant vot
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 s01 014-pata p 0.705 0.712 p 7.46
## 2 s01 020-tapa t 0.825 0.833 t 8.00
## 3 s01 055-pata p 0.823 0.838 p 15.0 
## 4 s01 061-tapa t 0.944 0.953 t 8.62
## 5 s01 096-pata p 1.10 1.10 p 6.24
## 6 s01 102-tapa t 0.951 0.964 t 13.1 
## 7 s02 011-tapa t 0.752 0.766 t 13.9 
## 8 s02 034-pata p 0.724 0.735 p 11.1 
## 9 s02 052-tapa t 0.700 0.711 t 11.0 
## 10 s02 075-pata p 0.704 0.716 p 11.7 
## # ℹ 14 more rows
```

---

---

Based on the sample (N = 24): mean VOT = 11.6 ms, with SD = 2.8 ms.

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
**Sample of 24 VOT values of Albanian voiceless stops**.

- Sample mean = 11.6 ms.
- Sample SD = 2.8 ms.
]

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[
**Does this mean that the population mean and SD are 11.6 and 2.8 ms?**
]

---

---

## Albanian VOT

---

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[
**Does this mean that the population mean and SD are 11.6 and 2.8 ms?**
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
Not necessarily, because of *uncertainty and variability* in the sample.
]

---

Let's simulate some data:

- 10 samples of 24 values each.

- True mean = 12 ms and true SD = 3 ms

```r
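# Set the seed so that the simulation is reproducible.
set.seed(9899)
# Empty list to store the samples.
vot_l <- list()

# Draw 10 samples of 24 values each from a Gaussian with mean 12 and SD 3.
for (i in 1:10) {
  vot_l[[i]] <- rnorm(n = 24, mean = 12, sd = 3)
}

vot_l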
```

```
## [[1]]
##  [1] 12.547001 13.482570 11.235440  9.201638 13.029873 10.839973 10.072633 17.073684  9.590262
## [10]  8.736326 12.386988 10.330078 13.158883 17.370238  9.683748 14.626872 11.190529  8.588188
## [19]  8.185715 10.275703 10.163497 12.426824 17.258395 15.261082
## 
## [[2]]
##  [1] 11.199776  9.191559 10.536169 17.302444 14.968004 16.022714 11.811979 19.270296 14.663989
## [10]  9.111117 13.010169  8.768621 11.697593 14.240513 17.030579 17.591694  8.482335 15.371721
## [19] 11.392395 11.802774 10.686668 12.423605  9.899234 13.359118
## 
## [[3]]
##  [1] 15.469305 10.986465 11.399610 11.860328 11.098915 11.530128 14.366997 10.161520 15.062421
## [10] 13.377994 15.954005 13.734389 10.263805 12.089615  9.669913  9.185615 13.560065 11.238286
## [19] 13.664887  6.975807 10.327550 20.671285 12.913196 13.257993
## 
## [[4]]
##  [1]  9.401134  7.456256 16.953828 16.120669 10.186811 10.350754 12.577068 13.471340 11.155870
## [10]  7.995914  7.949878 12.424529 11.554214  5.741984 12.798154 10.616624 16.138792 12.083009
## [19] 13.043298 15.108454 11.605118 16.034281  9.240238  8.840297
## 
## [[5]]
##  [1] 10.144873 15.334926 14.176038 10.662095  9.950884  8.383486 12.333795 13.896069  6.045514
## [10] 10.709774 13.176737  9.854207 12.123742  9.828057 11.712854  7.562980 11.687145 15.004526
## [19] 13.647729  7.974551 13.657776 12.176325 11.572602 11.184775
## 
## [[6]]
##  [1] 11.415683  9.757701 10.656652 10.031233 12.744657 11.948005 12.396909  5.154551 13.516409
## [10] 13.189509  8.408094 17.574825 13.992077  9.590580 12.376580 15.127537 10.291946 12.809615
## [19] 12.433415  9.395907 15.605659  7.580856 13.663154 13.067048
## 
## [[7]]
##  [1] 18.211002  8.933161 10.701238 16.188371 10.716130 14.646898  7.344686 12.473941 16.796543
## [10] 14.943436 13.935726  9.903555 10.191064 20.577365  7.977717 11.526709 10.052314 11.901520
## [19] 15.382617 15.358084 16.477702  9.253993 19.780746 12.212425
## 
## [[8]]
##  [1]  8.545139  8.679016 16.443490 10.509814 15.204425 11.127125 15.945478 13.032889 19.209223
## [10] 10.762655 12.164008 13.206491 18.714305 16.573439 11.184109 14.180738 17.117758  5.179143
## [19] 13.101306 15.547665 13.272145  9.865320 12.410822 14.382655
## 
## [[9]]
##  [1] 15.853645 11.518232 12.418059 13.661118  7.552751  7.908694 15.185743 14.283423 14.782544
## [10]  7.655945 10.170148  9.175660 14.735301 10.458225  8.644414 10.282799 13.057164 11.617377
## [19] 11.036878 12.396111 15.290408 16.706777  6.665289  9.849574
## 
## [[10]]
##  [1] 12.856541  4.445483  6.389982  9.720114 17.289793 12.480332 14.668295 11.876737  8.672622
## [10] 11.174312 12.255304 10.978557 12.204603 13.443606  9.094550 15.752857 12.540005  9.498565
## [19] 11.818078  9.450323 13.851222  9.385503 10.549956 14.326689
```

---

```
## Mean:
##  11.9 12.9 12.5 11.6 11.4 11.8 13.1 13.2 11.7 11.4
```

```
## SD:
##  2.8 3.1 2.8 3 2.4 2.7 3.7 3.4 2.9 2.9
```
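
The means and SDs above can be obtained from the simulated samples along these lines. This is a minimal sketch: it recreates the samples with the same seed and settings as the simulation, and the names `vot_means` and `vot_sds` are introduced here only for illustration.

```r
# Recreate the 10 simulated samples (same seed and settings as the simulation).
set.seed(9899)
vot_l <- list()
for (i in 1:10) {
  vot_l[[i]] <- rnorm(n = 24, mean = 12, sd = 3)
}

# Mean and SD of each sample, rounded to one decimal place.
vot_means <- round(sapply(vot_l, mean), 1)
vot_sds <- round(sapply(vot_l, sd), 1)

vot_means
vot_sds
```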

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[
Some of these sample means and SDs are off from the true values (12 ms and 3 ms), because our samples are *random*.

**So any of the sample values we obtain when randomly sampling the population might not be the population values.**
]
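
To get a feel for how much sample means can vary, we can repeat the sampling many more times. The sketch below assumes the same true mean (12 ms) and SD (3 ms) as the simulation; the number of repetitions (1000) and the name `sample_means` are arbitrary choices made for illustration.

```r
set.seed(9899)

# Draw 1000 samples of 24 values each and keep only the mean of each sample.
sample_means <- replicate(1000, mean(rnorm(n = 24, mean = 12, sd = 3)))

# The sample means scatter around the true mean of 12 ms.
range(sample_means)
mean(sample_means)

# How much the sample means vary from sample to sample.
sd(sample_means)
```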

---

## Inference: from sample to population

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
**We want to know the mean and SD of Albanian voiceless VOT.**

- In other words, we want to **estimate** the mean and SD.

- We collect a **sample** from the population of Albanian voiceless VOT values.

- Because of the random sampling, the estimates of the mean and SD are **uncertain**.

]

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[
We can **quantify the uncertainty** of an estimate by specifying the **probabilities** of different values of that estimate.
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Probabilities are at the very core of statistics**, because the ultimate aim of statistics is to quantify WHAT?
]

---

## Probabilities

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
**Probability**

- Probability of an event occurring or probability of an estimate being some numeric value.

- Probabilities can only be **between 0 and 1**.

  - ⛔️ 0 means **impossible**.
  - 🤷 0.5 means **it is just as likely to happen as not to happen**.
  - ✅ 1 means **certain**.
]

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
**Probability**

- Probability of an event occurring: 0 to 100% probability.

- **Probability of an estimate being some numeric value**: a bit more complicated...
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**We need probability distributions!**
]

---

## Probability distributions

---

![](../../img/grubabilities.png)

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
A **probability distribution** is a mathematical function that describes *how the probabilities are distributed over the values* that a variable can take on.
]

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
Two types of probability distributions

- **Discrete probability distributions.**

- **Continuous probability distributions.**
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**We talked about discrete and continuous variables in Week 2!**

Discrete variables (numeric or categorical) follow discrete probability distributions and continuous variables follow continuous probability distributions.
]

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
We can visualise probability distributions:

- Using the **probability mass function** for discrete probability distributions.

- Using the **probability density function** for continuous probability distributions.
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
You don't need to understand the math behind this, but you are free to learn about it through the internet search engine of your choice!
]

---

**Probability Mass Function**
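
A probability mass function can be computed and plotted in base R. The sketch below uses a Poisson distribution for a count variable; the choice of distribution, the rate of 3 and the names `counts` and `pmf` are assumptions made purely for illustration.

```r
# Hypothetical count variable: possible values 0, 1, 2, ..., 15.
counts <- 0:15

# Probability of each count under a Poisson distribution with rate 3.
pmf <- dpois(counts, lambda = 3)

# Each possible value has its own probability...
round(pmf[1:5], 3)

# ...and the probabilities add up to (almost exactly) 1.
# The tiny remainder belongs to counts above 15.
sum(pmf)

# One bar per possible value.
barplot(pmf, names.arg = counts, xlab = "Count", ylab = "Probability")
```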

---

**Probability Density Function**
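
For a continuous variable, probabilities correspond to areas under the density curve. The sketch below uses a Gaussian with mean 12 and SD 3 (the values from the simulation); the names `vot_values`, `dens`, `area_total` and `area_9_15` are illustrative.

```r
# Density of a Gaussian with mean 12 and SD 3 over a grid of values.
vot_values <- seq(0, 24, by = 0.1)
dens <- dnorm(vot_values, mean = 12, sd = 3)

plot(vot_values, dens, type = "l", xlab = "VOT (ms)", ylab = "Density")

# The total area under the curve is 1.
area_total <- integrate(dnorm, lower = -Inf, upper = Inf, mean = 12, sd = 3)$value
area_total

# The probability of a value between 9 and 15 ms is the area between those limits.
area_9_15 <- integrate(dnorm, lower = 9, upper = 15, mean = 12, sd = 3)$value
area_9_15
```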

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
**Probability distributions can be expressed by a set of parameters.**

- We summarise a probability distribution with a **set of parameters**.
  
  - Different (sub-)types of probability distributions have a different number of parameters and different parameters.
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
The **Gaussian probability distribution** is a continuous probability distribution and it has two parameters:

- The mean `$\mu$`.
- The standard deviation `$\sigma$`.

]
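
The effect of the two parameters can be seen by comparing a few Gaussian curves: the mean moves the curve along the axis and the SD makes it narrower or wider. The parameter values below are arbitrary and chosen only for illustration.

```r
x <- seq(0, 30, by = 0.1)

# Same mean, different SDs: the curve gets wider as the SD grows.
narrow <- dnorm(x, mean = 12, sd = 1.5)
wide <- dnorm(x, mean = 12, sd = 3)

# Same SD, different mean: the curve is shifted along the x-axis.
shifted <- dnorm(x, mean = 18, sd = 3)

plot(x, narrow, type = "l", xlab = "Value", ylab = "Density")
lines(x, wide, lty = 2)
lines(x, shifted, lty = 3)

# The peak of each curve sits at its mean.
peak_narrow <- x[which.max(narrow)]
peak_shifted <- x[which.max(shifted)]

# Probability of a value falling within 2 SDs of the mean (about 0.95).
within_2sd <- pnorm(12 + 2 * 3, mean = 12, sd = 3) - pnorm(12 - 2 * 3, mean = 12, sd = 3)
within_2sd
```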

Go to **[Seeing Theory](https://seeing-theory.brown.edu/probability-distributions/index.html#section2)**.

???

Seeing Theory was created by Daniel Kunin while an undergraduate at Brown University. The goal of this website is to make statistics more accessible through interactive visualizations (designed using Mike Bostock’s JavaScript library D3.js).

<https://seeing-theory.brown.edu/index.html#3rdPage>

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
In research, you try to **estimate** the probability distribution of the variable of interest (VOT, number of telic verbs, informativity score, acceptability ratings, ...).

- In other words you are trying to **estimate the parameters** of the probability distribution.
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
Now, let's talk a bit about ontology...
]

---

.bg-washed-yellow.b--orange.ba.bw2.br3.shadow-5.ph4.mt2[
**Frequentist view of probabilities**

- The parameters (like `$\mu$` and `$\sigma$`) are **fixed** (they are *unknown but certain*).

- They take on a specific value.
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Bayesian view of probabilities**

- The parameters are **variables** themselves (they are *unknown and uncertain*).

- **We describe each parameter of a probability distribution as another probability distribution.**

- And each parameter's probability distribution is described by a set of parameters (called *hyper-parameters*).
]

---

background-image: url(https://media.giphy.com/media/443pAv9m6Ti8KiCoAi/giphy.gif)

---

---

## Probability distributions

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
**Sample of 24 VOT values of Albanian voiceless stops**.

- Sample mean = 11.6 ms.
- Sample SD = 2.8 ms.
]

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
Let's assume VOT values are **distributed according to a Gaussian distribution**.

- The VOT values we sampled are generated by a Gaussian distribution with mean `$\mu$` and standard deviation `$\sigma$`.
]

---

**Read as**: VOT values (`$vot$`) are distributed according to (`$\sim$`) a Gaussian distribution (`$Gaussian()$`) with mean `$\mu$` and standard deviation `$\sigma$`.

---

**Parameters**: mean `$\mu$` and SD `$\sigma$`.

---

**Parameters**: mean `$\mu$` and SD `$\sigma$`.

Let's pretend that the population mean and SD are the sample mean and SD...

---

---

**BUT**, this does not take into account the **uncertainty and variability** of the sampling procedure.

---

**Parameters**: mean `$\mu$` and SD `$\sigma$`.

**Hyperparameters**: mean `$\mu_1$` and SD `$\sigma_1$`.

Standard deviations are always positive! So we need a truncated Gaussian distribution (only the positive half!).
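
A Gaussian truncated at 0 can be sketched by setting the density to 0 for non-positive values and rescaling the rest so that the area is still 1. The helper `dtrunc_gauss()` below is a hypothetical function written for this illustration (it is not part of base R or brms), and the parameter values are arbitrary.

```r
# Density of a Gaussian truncated at 0: only positive values are possible.
# The ordinary density is divided by the probability of being above 0,
# so that the area under the truncated curve is 1 again.
dtrunc_gauss <- function(x, mean, sd) {
  ifelse(x > 0, dnorm(x, mean, sd) / (1 - pnorm(0, mean, sd)), 0)
}

# Arbitrary values for illustration only.
x <- seq(-2, 8, by = 0.05)
plot(x, dtrunc_gauss(x, mean = 1, sd = 2), type = "l",
     xlab = "Value", ylab = "Density")

# Negative values get density 0...
dtrunc_gauss(-1, mean = 1, sd = 2)

# ...and the area over the positive values is 1.
area_trunc <- integrate(dtrunc_gauss, lower = 0, upper = Inf, mean = 1, sd = 2)$value
area_trunc
```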

---

**Parameters**: mean `$\mu$` and SD `$\sigma$`.

**Hyperparameters**: mean `$\mu_1$` and SD `$\sigma_1$`.

**Hyperparameters**: mean `$\mu_2$` and SD `$\sigma_2$`.

---

## Estimating probability distributions

---

.pull-left[
<img src="index_files/figure-html/norm-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="index_files/figure-html/hnorm-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[

$$
`\begin{align}
\text{vot} & \sim Gaussian(\mu, \sigma) \\
\mu        & \sim Gaussian(\mu_1, \sigma_1) \\
\sigma     & \sim TruncGaussian(\mu_2, \sigma_2)
\end{align}`
$$

- We need to estimate:

  - `$\mu_1$` and `$\sigma_1$` for the Gaussian probability distribution of `$\mu$`.

  - `$\mu_2$` and `$\sigma_2$` for the truncated Gaussian probability distribution of `$\sigma$`.
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
With our sample of N = 24 we want to make inferences about the population of VOT values of Albanian voiceless stops, by estimating the four parameters `$\mu_1$`, `$\sigma_1$`, `$\mu_2$` and `$\sigma_2$`.
]

---

```r
# Attach the brms package
library(brms)

# Run a Bayesian model
vot_bm <- brm(
 # This is the formula of the model.
 vot ~ 1,
 # This is the probability distribution family.
 family = gaussian(),
 # And the data.
 data = alb_vot_vl
)
```

---

.f2.center[
`vot ~ 1`
]

**Read as**: Model VOT values (`vot`) as a function of (`~`) the overall mean (`1`).

.f7[In other words, estimate the mean VOT. We will see later that `1` is also called the *Intercept*.]

.f2.center[
`family = gaussian()`
]

**Read as**: using a Gaussian probability distribution. Besides the mean, the Gaussian distribution has a second parameter, the SD, which the model also estimates.

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Altogether**: Model VOT values as a function of the mean and standard deviation of a Gaussian probability distribution.
]

---

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: vot ~ 1 
##    Data: alb_vot_vl (Number of observations: 24) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    11.62      0.59    10.48    12.78 1.00     2559     2158
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     2.88      0.44     2.18     3.88 1.00     2327     2013
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

---

```
##  Family: gaussian 
## Formula: vot ~ 1 
##    Data: alb_vot_vl (Number of observations: 24) 
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    11.62      0.59    10.48    12.78 1.00     2559     2158
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     2.88      0.44     2.18     3.88 1.00     2327     2013
```

---

## Estimating the mean

---

```
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    11.62      0.59    10.48    12.78 1.00     2559     2158
```

$$
`\begin{align}
vot & \sim Gaussian(\mu, \sigma) \\
\mu & \sim Gaussian(\mu_1, \sigma_1) \\
\sigma & \sim TruncGaussian(\mu_2, \sigma_2)
\end{align}`
$$

- **Intercept**: the mean `$\mu$`.

- **Estimate**: `$\mu_1 = 11.62$` ms.

- **Est.Error**: `$\sigma_1 = 0.59$` ms.

---

---

```
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    11.62      0.59    10.48    12.78 1.00     2559     2158
```

- `l-95% CI`: LOWER boundary of the 95% Credible Interval.

- `u-95% CI`: UPPER boundary of the 95% Credible Interval.

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
A 95% Credible Interval (CrI) contains the real value with 95% probability.

OR, there is a 95% probability that the real value is within the 95% CrI.
]

???

Confidence Intervals are different.

<https://rpsychologist.com/d3/ci/>

---

There is a 95% probability that mean VOT is between 10.48 and 12.78 ms.
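
As a rough check, we can treat the distribution of the mean as a Gaussian with mean 11.62 and SD 0.59 (the `Estimate` and `Est.Error` from the summary) and ask which values enclose the central 95% of it. This is only an approximation: the interval in the summary is calculated from the model's draws, so the numbers need not match exactly. The names `cri_mean` and `prob_in_cri` are illustrative.

```r
# Central 95% of a Gaussian with mean 11.62 and SD 0.59:
# cut off 2.5% of the probability on each side.
cri_mean <- qnorm(c(0.025, 0.975), mean = 11.62, sd = 0.59)
round(cri_mean, 2)

# The other way round: probability that the mean lies between the
# reported boundaries (10.48 and 12.78 ms) under the same Gaussian.
prob_in_cri <- pnorm(12.78, mean = 11.62, sd = 0.59) - pnorm(10.48, mean = 11.62, sd = 0.59)
prob_in_cri
```

The resulting boundaries, roughly 10.46 and 12.78 ms, are close to those reported in the summary.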

---

## Estimating the standard deviation

---

```
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     2.88      0.44     2.18     3.88 1.00     2327     2013
```

$$
`\begin{align}
vot & \sim Gaussian(\mu, \sigma) \\
\mu & \sim Gaussian(\mu_1, \sigma_1) \\
\sigma & \sim TruncGaussian(\mu_2, \sigma_2)
\end{align}`
$$

- **sigma**: the SD `$\sigma$`.

- **Estimate**: `$\mu_2 = 2.88$` ms.

- **Est.Error**: `$\sigma_2 = 0.44$` ms.

---

---

```
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     2.88      0.44     2.18     3.88 1.00     2327     2013
```

---

There is a 95% probability that VOT standard deviation is between 2.18 and 3.88 ms.

---

## Putting it all together

$$
`\begin{align}
vot & \sim Gaussian(\mu, \sigma) \\
\mu & \sim Gaussian(11.62, 0.59) \\
\sigma & \sim TruncGaussian(2.88, 0.44)
\end{align}`
$$

> According to a Bayesian model of Albanian voiceless VOT with a Gaussian distribution as the distribution family, the VOT mean is between 10.48 and 12.78 ms and the VOT SD is between 2.18 and 3.88 ms, at 95% probability.
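
The three lines of the model can also be read as a recipe for simulating VOT values. The sketch below is not how brms works internally: it assumes that `$\mu$` and `$\sigma$` can be drawn independently of each other, and the number of draws (4000) and all variable names are illustrative choices.

```r
set.seed(9899)
n_draws <- 4000

# Plausible values of the mean: Gaussian(11.62, 0.59).
mu_draws <- rnorm(n_draws, mean = 11.62, sd = 0.59)

# Plausible values of the SD: Gaussian(2.88, 0.44) truncated at 0.
# Uniform probabilities above P(value <= 0) are turned into values
# with the Gaussian quantile function, so all draws are positive.
p_zero <- pnorm(0, mean = 2.88, sd = 0.44)
sigma_draws <- qnorm(runif(n_draws, min = p_zero, max = 1), mean = 2.88, sd = 0.44)

# For each pair of mean and SD, simulate one VOT value.
vot_sim <- rnorm(n_draws, mean = mu_draws, sd = sigma_draws)

# Range of VOT values (in ms) that the model considers plausible.
quantile(vot_sim, probs = c(0.025, 0.5, 0.975))
```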

---

## Summary

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
- A random variable `$Y$` is a variable whose value is unknown and is generated by a random event.

- A **probability distribution** is a mathematical function that describes *how the probabilities are distributed over the values* that a random variable can take on.

  - **Discrete probability distributions.**
  - **Continuous probability distributions.**

- The Gaussian distribution has two parameters: mean `$\mu$` and SD `$\sigma$`.

- We can describe `$\mu$` and `$\sigma$` as probability distributions and estimate the (hyper-)parameters of those probability distributions.

- R package [brms](https://paul-buerkner.github.io/brms/), function `brm()`.
]