Data analysis
1 Overview
There are several resources you can use to teach yourself quantitative data analysis skills, depending on your level. Don’t forget to also check the Measures and Skills pages for more general research-related skills.
2 Beginners
The following resources are suitable for beginners who want to learn quantitative data analysis from scratch.
2.1 Data wrangling and visualisation
The simple graph has brought more information to the data analyst’s mind than any other device.
—John Tukey
Data wrangling is about getting your data into a useful format for visualisation and modelling.
Python and R are two of the most common programming languages used for data analysis.
Python is a general-purpose programming language, while R was developed specifically for statistical analysis and visualisation. Most academic research uses R for data analysis, although Python is also employed, especially for data processing.
If you want to teach yourself R, the following resources are an excellent place to start:
The R for Data Science (R4DS) free online book is an excellent introduction to R and quantitative data analysis.
The Data Visualisation Catalogue is a project developed by Severino Ribecca to create a (non-code-based) library of different information visualisation types. The website serves as a learning and inspirational resource for those working with data visualisation.
The workshop intRo: Data Analysis with R introduces absolute beginners from the Humanities to R, quantitative data analysis and visualisation. Check out the videos on YouTube, and the materials and slides online.
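If you would like a taste of what wrangling and visualisation look like in practice, here is a minimal sketch using the tidyverse (the file name and column names are hypothetical, for illustration only):

```r
# A minimal sketch with a hypothetical file `vowels.csv`,
# with columns `speaker`, `vowel` and `f1`.
library(tidyverse)

vowels <- read_csv("vowels.csv")

# Wrangling: mean F1 per speaker and vowel.
vowel_means <- vowels |>
  group_by(speaker, vowel) |>
  summarise(mean_f1 = mean(f1), .groups = "drop")

# Visualisation: distribution of speaker means per vowel.
ggplot(vowel_means, aes(vowel, mean_f1)) +
  geom_boxplot()
```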
2.2 Statistical modelling
The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.
—Nate Silver, The Signal and the Noise
Statistics and statistical modelling are about finding meaning in the patterns that can be observed in the data.
Statistics for Linguists: An introduction using R by Bodo Winter is ideal both for absolute beginners and experienced researchers. It is packed with everything you need to successfully and effectively conduct statistical analyses.
Statistical (Re)thinking by Richard McElreath is an excellent introduction for absolute beginners, which covers a wide variety of linear models. It focusses on Bayesian inference and how this framework can help us directly answer research questions, assess evidence for different hypotheses, and quantify uncertainty. If you are familiar with the tidyverse, the code from the Statistical (Re)thinking book has been translated into tidyverse by Solomon Kurz, and it can be accessed here: Statistical rethinking with brms, ggplot2, and the tidyverse.
The Art of Statistics: Learning from data by David Spiegelhalter uses real-world examples to explain principles of data visualisation and analysis. It touches upon a wide range of topics and disciplines, from communicating proportions, to probability and Bayesian inference, making it a great complement to the other books and resources mentioned here. If you just wish to dip your toes in statistics without committing (yet) to learning how to do statistics, this book is for you.
Linear models and linear mixed effects models in R with linguistic applications by Bodo Winter is a short and intense tutorial on linear models. The first part introduces you to Gaussian linear models and the second part to Gaussian linear models that include random effects (variously called mixed, hierarchical, or nested models). Note that Gaussian linear models are not appropriate for most linguistic data, so you will also have to learn and use other types of linear models.
2.2.1 Linear models
One model to rule them all, one model to fit them,
One model to shrink them all, and in probability bind them;
In the Land of Inference where the distributions lie.
—The Statistical Hobbit
Linear models are a very flexible and relatively straightforward way to model and analyse quantitative data. They have gained momentum and are increasingly being adopted across the entire field of linguistics.
The main perk of learning linear models is that they can be applied to many different types of data: once you learn the approach, you can use it in a wide range of data analysis scenarios.
The resources mentioned above all focus on linear modelling, so whether you are just starting your statistical journey or you are an experienced researcher who wants to consolidate their understanding of linear modelling, those resources are right for you.
After you have learnt the basics, the Linear models cheat-sheet can guide you through the process of choosing the appropriate type of linear model depending on the nature of your data. The post also lists tutorials on linear models that use other less common probability distribution families, like the beta, Poisson and ordinal.
Confused about all the model names? Check out this post on how we don’t really need to use all of those names: they are all linear models!
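To see the point in code: in R, the models traditionally given different names are all fitted with the same linear-model machinery, just with different distribution families. A minimal sketch, with a hypothetical data frame d:

```r
# All of the "different" models below are linear models with different families.
lm(rt ~ condition, data = d)                              # Gaussian linear model
glm(rt ~ condition, data = d, family = gaussian())        # the same model, via glm()
glm(correct ~ condition, data = d, family = binomial())   # aka "logistic regression"
glm(count ~ condition, data = d, family = poisson())      # aka "Poisson regression"
```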
3 Intermediate
If you already have a basic understanding of quantitative data analysis, statistics and R, the following resources can help you develop your skills further.
3.1 Likert scales
Likert-scale data are quite common in many fields of linguistics. They are especially common in sociolinguistic work that investigates attitudes, for example with a 5-point scale such as “disagree, somewhat disagree, neutral, somewhat agree, agree”.
Likert-scale data are special because they are categorical and ordered.
Ordered data must be modelled using the appropriate distribution, namely the cumulative distribution. Ordinal linear models are an extension of linear models that use the cumulative distribution to model ordinal data, like Likert-scale data.
For an excellent tutorial on how to fit ordinal linear models using brms, see Analysis of rating scales: A pervasive problem in bilingualism research and a solution with Bayesian ordinal models by João Veríssimo.
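For a flavour of the syntax, here is a minimal sketch of an ordinal model in brms (the data frame and variable names are hypothetical; the outcome should be an ordered factor):

```r
library(brms)

# Hypothetical data: `rating` is an ordered factor with the Likert
# responses, `condition` is a predictor of interest.
fit_ord <- brm(
  rating ~ condition + (1 | participant),
  family = cumulative("logit"),  # cumulative distribution for ordinal outcomes
  data = likert_data
)
```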
3.2 Count data (and corpus data)
Count data, like the number of occurrences of a particular construction in a corpus, the number of interjections in a conversation, or the number of infant gestures, should be modelled using a Poisson distribution.
See Poisson regression for linguists: A tutorial introduction to modelling count data with brms by Bodo Winter for a fantastic tutorial.
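A minimal sketch of a Poisson model in brms (the data frame and variable names are hypothetical):

```r
library(brms)

# Hypothetical data: `n_gestures` is the number of gestures produced
# by each infant in each session.
fit_pois <- brm(
  n_gestures ~ age_months + (1 | infant),
  family = poisson(),  # Poisson distribution for count outcomes
  data = gestures
)
```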
3.3 Dimensionality reduction
If your data are high-dimensional, i.e. you have a lot of different variables (some of which might even be correlated with each other), you can employ dimensionality reduction techniques to “synthesise” all the variables into fewer variables that represent components, dimensions or clusters in the data.
These techniques can be used both (a) to find patterns or groupings in the data and to obtain measures that capture these patterns and groupings, and (b) to simplify analyses from a set of 15-20 variables to 2-3 components or dimensions.
Note that once you have reduced your data to a few variables (components or dimensions), these can still be further analysed with the other techniques mentioned on this page.
A common reduction technique is Principal Component Analysis (PCA). This method combines all of your variables into a limited set of numeric principal components. The scores of the principal components capture variation in the data and can be used for further analysis. You can learn how to carry out a PCA with this tutorial. Also check out Functional Principal Component Analysis below.
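In R, a PCA can be run with the base function prcomp(). A minimal sketch, assuming d is a data frame containing numeric variables only:

```r
# Scaling the variables first is usually recommended.
pca <- prcomp(d, scale. = TRUE)

summary(pca)   # proportion of variance explained by each component
pca$x[, 1:2]   # scores of the first two components, usable in further analyses
```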
Multiple Correspondence Analysis (MCA) is the discrete equivalent of PCA, i.e. it can be used with discrete/categorical variables. See this tutorial for an introduction.
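A minimal MCA sketch using the FactoMineR package, assuming d_cat is a data frame containing factor variables only:

```r
library(FactoMineR)

mca <- MCA(d_cat, graph = FALSE)

mca$eig              # variance explained by each dimension
head(mca$ind$coord)  # coordinates of the observations on the dimensions
```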
Another dimensionality reduction technique is Cluster Analysis (CA, aka hierarchical clustering). This tutorial guides you through a CA in R.
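A minimal clustering sketch with base R, again assuming numeric variables only:

```r
d_scaled <- scale(d)          # standardise the variables
hc <- hclust(dist(d_scaled))  # hierarchical clustering on Euclidean distances

plot(hc)             # dendrogram of the clustering
cutree(hc, k = 3)    # cut the tree into 3 clusters
```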
3.4 Time series and coordinates
Generalised Additive (Mixed) Models (GAMMs) are a flexible extension of linear models that allows us to fit non-linear effects. They are particularly useful with data that come from time series (e.g. f0 and formants, corpus occurrences across time, longitudinal data, etc.) and they can be employed with any kind of data that can be thought of as being represented on a coordinate space (e.g., geolocations, electroencephalographic (EEG) data, 3D tongue imaging, etc).
The tutorial Generalised additive mixed models for dynamic analysis in linguistics: a practical introduction by Márton Sóskuthy is an excellent introduction to GAMMs.
Another excellent resource is Hierarchical generalized additive models in ecology: an introduction with mgcv by Pedersen and colleagues. In particular, Figure 4 is a beautiful visual summary of how different types of trends and groupings can be modelled with GAMs.
The paper Generalized Additive Mixed Models for intra-speaker variation by Tamminga and colleagues advocates for the adoption of GAMMs to advance the use of naturalistic data for studying psycholinguistic questions about intra-speaker variation.
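For a flavour of the syntax, here is a minimal GAMM sketch using the mgcv package (the data frame and variable names are hypothetical):

```r
library(mgcv)

# Hypothetical f0 contour data: `f0` measured along normalised `time`,
# from several speakers (`speaker` must be a factor).
fit_gamm <- bam(
  f0 ~ s(time) +                  # overall non-linear effect of time
    s(time, speaker, bs = "fs"),  # speaker-specific random smooths
  data = contours
)
```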
Functional Principal Component Analysis (FPCA) is another approach to modelling time-series data.
Functional Data Analysis for Speech Research by Michele Gubian is a collection of workshop materials on Functional Data Analysis with a focus on speech research data.
3.5 Bayesian inference
The overview by Etz et al., How to become a Bayesian in eight easy steps: An annotated reading list, is a good place to start if you want to learn more about Bayesian statistics and inference.
For a more practice-oriented introduction, you should read Statistical (Re)thinking (see above).
The learnB4SS workshop is an introduction to Bayesian analysis for the Speech Sciences. It requires familiarity with linear models and Null Hypothesis Significance Testing.
4 Advanced
4.1 Power analysis
Power analysis is a fundamental, although often neglected, step in Null Hypothesis Significance Testing (the statistical framework that returns p-values). A power analysis is a method to estimate the minimum sample size necessary to detect a particular effect. The statistical power of a test is the probability that the test correctly detects an effect when the effect indeed exists. The recommended statistical power is 80% or greater.
Power analyses with linear models can become quite complex, especially if random effects are included. Simulation is a way to simplify the calculations necessary to find the minimum sample size. The R package simr provides users with a set of functions to perform a power analysis with linear models using simulations. You can find a tutorial here.
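For a flavour of the simulation workflow, here is a minimal sketch with lme4 and simr (the pilot data and variable names are hypothetical):

```r
library(lme4)
library(simr)

# Hypothetical pilot data: reaction times by condition,
# with by-participant random intercepts.
fit <- lmer(rt ~ condition + (1 | participant), data = pilot)

# Estimate power for the fixed effect via simulation.
powerSim(fit, nsim = 200)

# Extend the design to 40 participants and estimate power along sample size.
fit_40 <- extend(fit, along = "participant", n = 40)
powerCurve(fit_40, along = "participant", nsim = 200)
```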
If you are running Bayesian linear models, you can check out this post on Bayesian CrI-width power analysis.
4.2 Multivariate linear models
Estimating Multivariate Models with brms by Paul Bürkner explains how to fit linear models with two or more outcome variables (i.e. multivariate models) using brms.
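A minimal sketch of the multivariate syntax (the data frame and variable names are hypothetical):

```r
library(brms)

# Hypothetical data: two outcome variables, `f1` and `f2`,
# modelled jointly as a function of `vowel`.
fit_mv <- brm(
  bf(mvbind(f1, f2) ~ vowel) + set_rescor(TRUE),  # estimate residual correlation
  data = vowels
)
```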