Data analysis
1 Overview
There are several resources you can use to teach yourself quantitative data analysis skills, depending on your level. Don’t forget to also check Measures and the Skills page for more general research-related skills.
You should also check out the Statistical Training Workshop (STeW) series: most workshops have been recorded and the link to the video is available on that page.
2 Beginners
The following resources are suitable for beginners who want to learn quantitative data analysis from scratch.
2.1 Data wrangling and visualisation
The simple graph has brought more information to the data analyst’s mind than any other device.
—John Tukey
Data wrangling is about getting your data into a useful format for visualisation and modelling.
The programming languages Python and R are two very common languages used for data analysis.
Python is a general-purpose programming language, while R was developed specifically for statistical analysis and visualisation. Most academic research uses R for data analysis, although Python is also employed, especially for data processing.
If you want to teach yourself R, the following resources are an excellent place to start from:
The R for Data Science (R4DS) free online book is an excellent introduction to R and quantitative data analysis.
The Data Visualisation Catalogue is a project developed by Severino Ribecca to create a (non-code-based) library of different information visualisation types. The website serves as a learning and inspirational resource for those working with data visualisation.
The workshop intRo: Data Analysis with R introduces absolute beginners from the Humanities to R, quantitative data analysis and visualisation. The workshop videos are available on YouTube, and the materials and slides are available online.
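If you want a taste of what wrangling and visualisation look like in practice, here is a minimal sketch using dplyr and ggplot2 (two of the tidyverse packages covered in R4DS); the built-in iris dataset stands in for your own data.

```r
library(dplyr)
library(ggplot2)

# Wrangling: compute the mean petal length for each species.
iris |>
  group_by(Species) |>
  summarise(mean_petal_length = mean(Petal.Length))

# Visualisation: the distribution of petal length by species.
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()
```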
2.2 Statistical modelling
The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.
—Nate Silver, The Signal and the Noise
Statistics and statistical modelling are about finding meaning in the patterns that can be observed in the data.
Statistics for Linguists: An introduction using R by Bodo Winter (Winter 2020) is ideal both for absolute beginners and experienced researchers. It is packed with everything you need to successfully and effectively conduct statistical analyses.
Statistical Rethinking by Richard McElreath (McElreath 2020) is an excellent introduction for absolute beginners which covers a wide variety of linear models. It focusses on Bayesian inference and how this framework can help us directly answer research questions, assess evidence for different hypotheses, and quantify uncertainty. If you are familiar with the tidyverse, the code from the Statistical Rethinking book has been translated into tidyverse code by Solomon Kurz, and it can be accessed here: Statistical rethinking with brms, ggplot2, and the tidyverse.
The Art of Statistics: Learning from data by David Spiegelhalter (Spiegelhalter 2020) uses real-world examples to explain principles of data visualisation and analysis. It touches upon a wide range of topics and disciplines, from communicating proportions, to probability and Bayesian inference, making it a great complement to the other books and resources mentioned here. If you just wish to dip your toes in statistics without committing (yet) to learning how to do statistics, this book is for you.
Linear models and linear mixed effects models in R with linguistic applications by Bodo Winter is a short and intense tutorial on linear models (Winter 2013). The first part introduces you to Gaussian linear models and the second part to Gaussian linear models that include random effects (variously called mixed, hierarchical, or nested models). Note that Gaussian linear models are not appropriate for the data of most linguistic research and you will have to learn and use other types of linear models.
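To make the distinction concrete, here is a minimal sketch of the two model types covered in the tutorial, fitted with lm() and lme4::lmer(). The data are simulated and the variable names (pitch, condition, subject) are made up for illustration.

```r
library(lme4)

# Simulate a toy dataset: pitch measurements for six subjects
# in two conditions (illustrative variables only).
set.seed(42)
d <- data.frame(
  subject   = rep(paste0("s", 1:6), each = 10),
  condition = rep(c("informal", "polite"), times = 30),
  pitch     = rnorm(60, mean = 200, sd = 20)
)

# Gaussian linear model: pitch as a function of condition.
m_lm <- lm(pitch ~ condition, data = d)

# Gaussian linear mixed model: the same, plus a by-subject
# random intercept.
m_lmm <- lmer(pitch ~ condition + (1 | subject), data = d)
summary(m_lmm)
```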
2.2.1 Regression models
One model to rule them all, one model to fit them,
One model to shrink them all, and in probability bind them;
In the Land of Inference where the distributions lie.
—The Statistical Hobbit
Regression models are a very flexible and relatively straightforward way to model and analyse quantitative data. They have gained momentum and are increasingly being adopted across the entire field of linguistics.
The main perk of learning regression models is that they can be applied to many different types of data: once you learn the approach, you will be able to apply it to a wide range of data analysis scenarios.
The resources mentioned above all focus on regression modelling, so whether you are just starting your statistical journey or you are an experienced researcher who wants to consolidate their understanding of regression modelling, those resources are right for you.
After you have learnt the basics, the Regression models cheat-sheet can guide you through the process of choosing the appropriate type of model depending on the nature of your data. The post also lists tutorials on regression models that use other, less common probability distribution families, like the beta, Poisson and ordinal families.
Confused about all the model names? Check out this post on how we don’t really need to use all of those names: they are all regression models!
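As a quick illustration of this point, here is a sketch with base R’s glm(): the models often called “linear regression”, “logistic regression” and “Poisson regression” differ only in the family argument. The data and variable names are simulated for illustration.

```r
# Simulate toy data (illustrative variables only).
set.seed(1)
d <- data.frame(
  frequency = rnorm(100),
  register  = sample(c("spoken", "written"), 100, replace = TRUE)
)
d$rt       <- 500 - 30 * d$frequency + rnorm(100, sd = 50)
d$correct  <- rbinom(100, 1, plogis(0.5 * d$frequency))
d$n_tokens <- rpois(100, ifelse(d$register == "spoken", 4, 2))

# "Linear regression": a regression model with a Gaussian family.
glm(rt ~ frequency, family = gaussian(), data = d)

# "Logistic regression": the same, with a binomial family.
glm(correct ~ frequency, family = binomial(), data = d)

# "Poisson regression": the same, with a Poisson family.
glm(n_tokens ~ register, family = poisson(), data = d)
```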
2.3 Modelling mindset
We tend to think of statistics as a monolith, with prescribed recipes and well-established procedures. Nothing could be further from the truth: Molnar (2022) introduces beginners to different approaches to statistics (more precisely, to modelling data). Some approaches even use the same tools, but with very different aims.
2.4 Causal inference
“Correlation is not causation!”, you might have heard. Well, if you adopt a causal inference approach, then correlation can be causation. Causal inference allows researchers to select variables needed to estimate particular causal effects. One approach to causal inference is based on do-calculus and Directed Acyclic Graphs (DAGs). Causal inference is explained in McElreath (2020), especially chapters 5 and 6. For a brief conceptual introduction, see McElreath (2021).
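If you want to see what this looks like in code, here is a minimal sketch using the dagitty R package with a toy DAG in which Z confounds the effect of X on Y (the variables are made up for illustration).

```r
library(dagitty)

# A toy DAG: Z is a common cause of both X and Y.
dag <- dagitty("dag { X -> Y; Z -> X; Z -> Y }")

# Which variables do we need to adjust for to estimate the
# causal effect of X on Y? (Here: Z.)
adjustmentSets(dag, exposure = "X", outcome = "Y")
```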
3 Intermediate
If you already have a basic understanding of quantitative data analysis, statistics and R, the following resources can help you develop your skills further.
3.1 Likert scales
Likert-scale data are quite common in many fields of linguistics. Likert scales are especially common in sociolinguistic work, for example in work that investigates attitudes using a scale such as the 5-point scale “disagree, somewhat disagree, neutral, somewhat agree, agree”.
Likert-scale data are special because they are categorical and ordered.
Ordered data must be modelled using the appropriate distribution, namely the cumulative distribution. Ordinal linear models are an extension of linear models that use the cumulative distribution to model ordinal data, like Likert-scale data.
For an excellent tutorial on how to fit ordinal linear models to linguistic data using brms, see Verissimo (2021). You should also check out the more general tutorial by Bürkner and Vuorre (2019). A recording of the STeW workshop on ordinal regression models is also available here.
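A minimal sketch of an ordinal regression model fitted with brms follows; the Likert responses are simulated and the variable names (rating, group) are illustrative.

```r
library(brms)

# Simulate toy Likert responses on a 5-point scale.
set.seed(10)
d <- data.frame(
  group  = rep(c("L1", "L2"), each = 50),
  rating = factor(sample(1:5, 100, replace = TRUE),
                  levels = 1:5, ordered = TRUE)
)

# Ordinal regression: the cumulative family with a logit link.
m_ord <- brm(rating ~ group, family = cumulative("logit"), data = d)
summary(m_ord)
```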
3.2 Count data (and corpus data)
Count data, like the number of occurrences of a particular construction in a corpus, the number of interjections in a conversation, the number of infant gestures, and so on, should be modelled using a Poisson distribution.
See the tutorial by Winter and Bürkner (2021).
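Here is a minimal sketch of a Poisson regression fitted with brms, with an offset to account for exposure (e.g. conversations of different lengths); the data and variable names are simulated for illustration.

```r
library(brms)

# Simulate toy counts of interjections in conversations of
# varying length (illustrative variables only).
set.seed(20)
d <- data.frame(
  n_interjections = rpois(40, lambda = 5),
  n_words         = rpois(40, lambda = 1000),
  age             = runif(40, 20, 70)
)

# Poisson regression with an exposure offset: we model the rate
# of interjections per word, not the raw counts.
m_pois <- brm(
  n_interjections ~ age + offset(log(n_words)),
  family = poisson(),
  data = d
)
summary(m_pois)
```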
3.3 Categorical (multinomial) data with more than two levels
When you are working with outcome variables that have more than two levels, and the levels are not ordered, you should fit a multinomial model using the categorical distribution.
See the Regression cheat sheet.
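As a minimal sketch, a multinomial model can be fitted in brms with the categorical family; the data and variable names (variant, style) are simulated for illustration.

```r
library(brms)

# Simulate a toy unordered outcome with three levels.
set.seed(30)
d <- data.frame(
  style   = rep(c("casual", "careful"), each = 50),
  variant = factor(sample(c("a", "b", "c"), 100, replace = TRUE))
)

# Multinomial regression: the categorical family.
m_cat <- brm(variant ~ style, family = categorical(), data = d)
summary(m_cat)
```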
3.4 Dimensionality reduction
If your data are high-dimensional, i.e. you have a lot of different variables (some of which might even be correlated with each other), you can employ dimensionality reduction techniques to “synthesise” all the variables into fewer variables that represent components, dimensions or clusters in the data.
These techniques can be used both (a) to find patterns or groupings in the data and to obtain measures that capture these patterns and groupings, and (b) to simplify analyses, for example from a set of 15-20 variables to 2-3 components or dimensions.
Note that once you have reduced your data to a few variables (components or dimensions), these can still be further analysed with the other techniques mentioned on this page.
A common reduction technique is Principal Component Analysis (PCA). This method combines all of your variables into a limited set of numeric principal components. The scores of the principal components capture variation in the data and can be used for further analysis. You can learn how to carry out a PCA with this tutorial. Also check out Functional Principal Component Analysis below.
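As a quick illustration, here is a minimal PCA sketch using base R’s prcomp() on the built-in iris measurements.

```r
# Run a PCA on the four iris measurements, scaling the variables.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance captured by each principal component.
summary(pca)

# Scores of the first two components, which can be used as
# variables in further analyses.
head(pca$x[, 1:2])
```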
Multiple Correspondence Analysis (MCA) is the discrete equivalent of PCA, i.e. it can be used with discrete/categorical variables. See this tutorial for an introduction.
Another dimensionality reduction technique is Cluster Analysis (CA, aka hierarchical clustering). This tutorial guides you through a CA in R.
3.5 Time series and coordinates
Generalised Additive (Mixed) Models (GAMMs) are a flexible extension of linear models that allows us to fit non-linear effects. They are particularly useful with data that come from time series (e.g. f0 and formants, corpus occurrences across time, longitudinal data, etc.) and they can be employed with any kind of data that can be thought of as being represented on a coordinate space (e.g., geolocations, electroencephalographic (EEG) data, 3D tongue imaging, etc).
The tutorial by Sóskuthy (2017) is an excellent introduction to Generalised Additive Mixed Models (GAMMs). Also see Sóskuthy (2021).
Another excellent resource is Pedersen et al. (2019). In particular, Figure 4 is a beautiful visual summary of how different types of trends and groupings can be modelled with GAMs.
The paper by Tamminga, Ahern, and Ecay (2016) advocates for the adoption of GAMMs to advance the use of naturalistic data for studying psycholinguistic questions about intra-speaker variation.
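To give you a feel for the syntax, here is a minimal GAM sketch using the mgcv package, with a smooth over time and by-speaker random (“factor”) smooths; the trajectory data are simulated and the variable names are illustrative.

```r
library(mgcv)

# Simulate toy F2 trajectories for eight speakers.
set.seed(40)
d <- data.frame(
  speaker = factor(rep(paste0("s", 1:8), each = 20)),
  time    = rep(seq(0, 1, length.out = 20), times = 8)
)
d$f2 <- 1500 + 200 * sin(pi * d$time) + rnorm(160, sd = 30)

# A smooth of time plus by-speaker factor smooths (random smooths).
m_gam <- bam(
  f2 ~ s(time) + s(time, speaker, bs = "fs", m = 1),
  data = d
)
summary(m_gam)
```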
Functional Principal Component Analysis (FPCA) is another approach to modelling time-series data.
- Functional Data Analysis for Speech Research by Michele Gubian is a collection of workshop materials on Functional Data Analysis with a focus on speech research data.
3.6 Bayesian inference
The overview by Etz et al. (2018), How to become a Bayesian in eight easy steps: An annotated reading list, is a good place to start from if you want to learn more about Bayesian statistics and inference.
For a more practice-oriented introduction, you should read McElreath (2020).
The learnB4SS workshop is an introduction to Bayesian analysis for the Speech Sciences. It requires familiarity with linear models and Null Hypothesis Significance Testing.
4 Advanced
4.1 Power analysis
Power analysis is a fundamental, although often neglected, step in Null Hypothesis Significance Testing (the statistical framework that returns p-values). A power analysis is a method to estimate the minimum sample size necessary to detect a particular effect. The statistical power of a test is the probability that the test correctly detects an effect when the effect indeed exists. The recommended statistical power is 80% or greater.
Power analyses with linear models can become quite complex, especially if random effects are included. Simulation is a way to simplify the calculations necessary to find the minimum sample size. The R package simr provides users with a set of functions to perform a power analysis with linear models using simulations. You can find a tutorial here.
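Here is a minimal sketch of a simulation-based power analysis with simr; the pilot data are simulated and the variable names are illustrative.

```r
library(lme4)
library(simr)

# Simulate toy pilot data (illustrative variables only).
set.seed(50)
pilot <- data.frame(
  subject   = rep(paste0("s", 1:10), each = 8),
  condition = rep(c("a", "b"), times = 40)
)
pilot$rt <- 500 + 20 * (pilot$condition == "b") + rnorm(80, sd = 40)

# Fit a mixed model to the pilot data.
m <- lmer(rt ~ condition + (1 | subject), data = pilot)

# Estimate power for the effect of condition via simulation.
powerSim(m, nsim = 100)

# Extend the design to 40 subjects and re-estimate power.
m_40 <- extend(m, along = "subject", n = 40)
powerSim(m_40, nsim = 100)
```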
If you are running Bayesian linear models, you can check out this post on Bayesian CrI-width power analysis.
4.2 Multivariate linear models
Bürkner (2024) explains how to fit linear models with two or more outcome variables (i.e. multivariate models) using brms.
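A minimal sketch of a multivariate model in brms follows; the data and variable names (f1, f2, vowel) are simulated for illustration.

```r
library(brms)

# Simulate toy formant data for three vowels.
set.seed(60)
d <- data.frame(
  vowel = factor(rep(c("i", "a", "u"), each = 30)),
  f1    = rnorm(90, mean = 500, sd = 50),
  f2    = rnorm(90, mean = 1500, sd = 100)
)

# Two outcome variables modelled jointly: mvbind() builds the
# multivariate formula.
m_mv <- brm(mvbind(f1, f2) ~ vowel, data = d)
summary(m_mv)
```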