Data analysis

1 Overview


This page provides resources on how to analyse measurements (e.g. formant values, word counts, collexeme association metrics, semantic distance, etc.) you have obtained from your data.

For methods to obtain specialised measurements and metrics (e.g. formant values, word counts, collexeme association metrics, semantic distance, etc.), please check the (work-in-progress) Measure page.


This is not intended to be an exhaustive list, but rather a compendium of approaches and techniques that are currently gaining momentum across linguistic research. For specialised approaches, we recommend consulting methods books written with specific audiences in mind. Note that the methods presented here are general enough to be applied to a diverse range of data types.

There are several resources you can use to teach yourself quantitative data analysis skills, depending on your level. Don’t forget to also check Measures and the Skills page, for more general research-related skills.

2 Beginners


Terminology in applied statistics tends to be unruly: the same thing can have different names in different disciplines and, vice versa, the same name can mean different things. We have tried to use terminology that is as “neutral” as possible, although we recognise the limitations of this approach. Where relevant, we note alternative terms and their uses.

The following resources are suitable for beginners who want to learn quantitative data analysis from scratch.

2.1 Data wrangling and visualisation

The simple graph has brought more information to the data analyst’s mind than any other device.

—John Tukey

Data wrangling is about getting your data into a useful format for visualisation and modelling.

Python and R are two of the most common programming languages for data analysis.

Python is a general-purpose programming language, while R was developed specifically for statistical analysis and visualisation. Much academic research uses R for data analysis, although Python is also employed, especially for data processing.

If you want to teach yourself R, the following resources are excellent places to start:

  • The R for Data Science (R4DS) free online book is an excellent introduction to R and quantitative data analysis.

  • The Data Visualisation Catalogue is a project developed by Severino Ribecca to create a (non-code-based) library of different information visualisation types. The website serves as a learning and inspirational resource for those working with data visualisation.

  • The workshop intRo: Data Analysis with R introduces absolute beginners from the Humanities to R, quantitative data analysis and visualisation. Check out the videos on YouTube and the materials and slides.

Most data wrangling problems can be solved with the following sets of R tidyverse functions (of course if you use Python or other languages, feel free to use their equivalents): mutating joins and pivoting.

  • Mutating joins allow you to join two or more tibbles together so that you can include information from one tibble into another (for example, if you have participant info in one tibble and you want to join that with the main experiment results tibble). See here for a visual representation of join operations.

  • Pivoting makes it easy to transform a data table that has all the information you need but not in the right format (for example, you have two columns with the participant’s scores at time points 1 and 2, but you want one column that has the time point and one that has the score). See here for a visual representation of pivoting.
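As a rough illustration of the two operations described above, here is a minimal sketch in plain Python (standard library only, so it runs without installing anything). The tables, the column names (`participant`, `age`, `score_t1`, `score_t2`) and the two helper functions are invented for this example. In practice you would more likely reach for `left_join()` and `pivot_longer()` in the tidyverse, or `merge()` and `melt()` in pandas.

```python
# Toy tables as lists of dictionaries (one dictionary per row).
# All names and values are invented for illustration.
participants = [
    {"participant": "p01", "age": 24},
    {"participant": "p02", "age": 31},
]
results = [
    {"participant": "p01", "score_t1": 10, "score_t2": 14},
    {"participant": "p02", "score_t1": 8, "score_t2": 9},
    {"participant": "p03", "score_t1": 12, "score_t2": 11},
]


def left_join(left, right, key):
    """Add the columns of `right` to each row of `left`, matching on `key`.

    Assumes `key` is unique in `right`. Rows of `left` without a match
    get None for the columns that come from `right`.
    """
    lookup = {row[key]: row for row in right}
    extra_columns = [col for col in right[0] if col != key]
    joined = []
    for row in left:
        match = lookup.get(row[key], {})
        new_row = dict(row)
        for col in extra_columns:
            new_row[col] = match.get(col)
        joined.append(new_row)
    return joined


def pivot_longer(rows, id_columns, value_columns, names_to, values_to):
    """Turn several value columns into one name column and one value column."""
    long_rows = []
    for row in rows:
        for col in value_columns:
            new_row = {c: row[c] for c in id_columns}
            new_row[names_to] = col
            new_row[values_to] = row[col]
            long_rows.append(new_row)
    return long_rows


# 1. Mutating join: bring the participant info into the results table.
joined = left_join(results, participants, key="participant")

# 2. Pivoting: one row per participant and time point.
long_table = pivot_longer(
    joined,
    id_columns=["participant", "age"],
    value_columns=["score_t1", "score_t2"],
    names_to="time_point",
    values_to="score",
)

for row in long_table:
    print(row)
```

Participant `p03` has no entry in the participant table, so the join keeps the row and fills `age` with `None`, which is how a left join treats unmatched rows.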

2.2 Statistical modelling

The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.

—Nate Silver, The Signal and the Noise

Statistics and statistical modelling are about finding meaning in the patterns that can be observed in the data.

  • Statistics for Linguists: An introduction using R by Bodo Winter is ideal both for absolute beginners and experienced researchers. It is packed with everything you need to successfully and effectively conduct statistical analyses.

  • Statistical (Re)thinking by Richard McElreath is an excellent introduction for absolute beginners, which covers a wide variety of linear models. It focusses on Bayesian inference and how this framework can help us directly answer research questions, assess evidence for different hypotheses, and quantify uncertainty. If you are familiar with the tidyverse, the code from the Statistical (Re)thinking book has been translated into tidyverse by Solomon Kurz, and it can be accessed here: Statistical rethinking with brms, ggplot2, and the tidyverse.

  • The Art of Statistics: Learning from data by David Spiegelhalter uses real-world examples to explain principles of data visualisation and analysis. It touches upon a wide range of topics and disciplines, from communicating proportions, to probability and Bayesian inference, making it a great complement to the other books and resources mentioned here. If you just wish to dip your toes in statistics without committing (yet) to learning how to do statistics, this book is for you.

  • Linear models and linear mixed effects models in R with linguistic applications by Bodo Winter is a short and intense tutorial on linear models. The first part introduces you to Gaussian linear models and the second part to Gaussian linear models that include random effects (variously called mixed, hierarchical or nested models). Note that Gaussian linear models are not appropriate for the data of most linguistic research, so you will have to learn and use other types of linear models.

2.2.1 Linear models

One model to rule them all, one model to fit them,
One model to shrink them all, and in probability bind them;
In the Land of Inference where the distributions lie.

—The Statistical Hobbit.

Linear models are a very flexible and relatively straightforward way to model and analyse quantitative data. They have gained momentum and are increasingly being adopted across the entire field of linguistics.

The main perk of learning linear models is that they can be applied to many different types of data, so that once you learn this approach you will be able to apply it to a lot of data analysis scenarios.
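To make this flexibility a little more concrete, here is a minimal sketch of the simplest Gaussian linear model, a straight line fitted by least squares, applied first to a numeric predictor and then to a two-group comparison coded as 0/1. The data and variable names are invented, and the function `fit_line` is a hypothetical helper written for this example. In R you would normally use `lm()` or a modelling package instead.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Numeric predictor (invented data): vowel duration in ms by speech rate.
speech_rate = [3.0, 4.0, 5.0, 6.0, 7.0]
vowel_duration = [152.0, 139.0, 131.0, 118.0, 110.0]
intercept, slope = fit_line(speech_rate, vowel_duration)
print(f"duration = {intercept:.1f} + {slope:.1f} * rate")

# Categorical predictor (invented data): group coded as 0 or 1.
# The intercept is then the mean of group 0 and the slope is the
# difference between the two group means.
group = [0, 0, 1, 1]
score = [10.0, 12.0, 15.0, 17.0]
group_intercept, group_difference = fit_line(group, score)
print(f"group 0 mean = {group_intercept:.1f}, difference = {group_difference:.1f}")
```

The same function handles both cases, which is the sense in which one modelling approach covers many analysis scenarios.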

The resources mentioned above all focus on linear modelling, so whether you are just starting your statistical journey or you are an experienced researcher who wants to consolidate their understanding of linear modelling, those resources are right for you.

After you have learnt the basics, the Linear models cheat-sheet can guide you through the process of choosing the appropriate type of linear model depending on the nature of your data. The post also lists tutorials on linear models that use other, less common probability distribution families, like the beta, Poisson and ordinal.

Confused about all the model names? Check out this post on how we don’t really need to use all of those names: they are all linear models!

3 Intermediate

If you already have a basic understanding of quantitative data analysis, statistics and R, the following resources can help you develop your skills further.

3.1 Likert scales

Likert-scale data are quite common in many fields of linguistics, especially in sociolinguistic work that investigates attitudes: e.g. a 5-point scale “disagree, somewhat disagree, neutral, somewhat agree, agree”.

Likert-scale data are special because they are categorical and ordered.

Ordered categorical data should be modelled with an appropriate distribution, typically the cumulative distribution. Ordinal linear models are an extension of linear models that use the cumulative distribution to model ordinal data such as Likert-scale data.

For an excellent tutorial on how to fit ordinal linear models using brms, see Analysis of rating scales: A pervasive problem in bilingualism research and a solution with Bayesian ordinal models by João Veríssimo.
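To give an idea of what a cumulative model does, the sketch below turns a set of thresholds and a linear predictor into the probability of each response category. It assumes a logit link and the parameterisation P(Y ≤ k) = logistic(threshold_k − eta). The threshold values, the effect size and the function names are invented for illustration, and none of this is fitted to data.

```python
import math


def logistic(z):
    """Inverse logit: maps any real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(thresholds, eta=0.0):
    """Probabilities of each ordered category in a cumulative logit model.

    thresholds: increasing cut-points (one fewer than the number of categories).
    eta: value of the linear predictor (e.g. the effect of a condition).
    """
    # Cumulative probabilities P(Y <= k); the last category closes at 1.
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        # Each category probability is the difference between neighbours.
        probabilities.append(c - previous)
        previous = c
    return probabilities


labels = ["disagree", "somewhat disagree", "neutral", "somewhat agree", "agree"]
thresholds = [-2.0, -0.5, 0.5, 2.0]  # hypothetical values for a 5-point scale

baseline = category_probabilities(thresholds, eta=0.0)
shifted = category_probabilities(thresholds, eta=1.0)  # hypothetical positive effect

for label, p0, p1 in zip(labels, baseline, shifted):
    print(f"{label:18s} baseline={p0:.3f} shifted={p1:.3f}")
```

A positive `eta` moves probability mass towards the higher categories without ever treating the responses as numbers on an interval scale.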

3.2 Count data (and corpus data)

Count data, like the number of occurrences of a particular construction in a corpus, the number of interjections in a conversation, the number of infant gestures, etc., should be modelled using a Poisson distribution.

See Poisson regression for linguists: A tutorial introduction to modelling count data with brms by Bodo Winter for a fantastic tutorial.
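As a small worked example of what a Poisson model estimates, the sketch below uses invented counts of a construction in texts from two registers. It assumes that all texts are the same length (so no offset for exposure is needed) and that register is the only predictor. In that special case, the maximum-likelihood estimates of a Poisson model with a log link reduce to the logs of the group means, so no fitting routine is required.

```python
import math


def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the expected count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Invented counts per text (texts assumed to be of equal length).
spoken = [4, 7, 5, 6, 3, 5]
written = [1, 2, 0, 3, 2, 1]

rate_spoken = sum(spoken) / len(spoken)
rate_written = sum(written) / len(written)

# Poisson model on the log scale, with `written` as the reference level:
# log(expected count) = intercept + effect_spoken * is_spoken
intercept = math.log(rate_written)
effect_spoken = math.log(rate_spoken) - math.log(rate_written)

print(f"intercept (log rate, written): {intercept:.3f}")
print(f"effect of spoken (log rate ratio): {effect_spoken:.3f}")
print(f"rate ratio spoken/written: {math.exp(effect_spoken):.2f}")

# The estimated rate implies a probability for every possible count.
for k in range(4):
    print(f"P(count = {k} | written) = {poisson_pmf(k, rate_written):.3f}")
```

Exponentiating the coefficients brings them back to the count scale, which is why effects in Poisson models are usually read as rate ratios.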

3.3 Dimensionality reduction

If your data is high-dimensional, i.e. you have a lot of different variables (some of which might even be correlated with each other), you can employ dimensionality reduction techniques to “synthesise” all the variables into fewer variables that represent components, dimensions or clusters in the data.

These techniques can be used both (a) to find patterns or groupings in the data and to obtain measures that capture these patterns and groupings and (b) to simplify analyses from a set of 15 to 20 variables to 2 or 3 components or dimensions.

Note that once you have reduced your data to a few variables (components or dimensions), these can still be further analysed with the other techniques mentioned on this page.

  • A common reduction technique is Principal Component Analysis (PCA). This method combines all of your variables into a limited set of numeric principal components. The scores of the principal components capture variation in the data and can be used for further analysis. You can learn how to carry out a PCA with this tutorial. Also check out Functional Principal Component Analysis below.

  • Multiple Correspondence Analysis (MCA) is the discrete equivalent of PCA, i.e. it can be used with discrete/categorical variables. See this tutorial for an introduction.

  • Another technique is Cluster Analysis (CA), of which hierarchical clustering is a common type. This tutorial guides you through a CA in R.
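To show what "combining variables into a principal component" means in practice, here is a minimal PCA sketch in plain Python. It standardises three invented, strongly correlated measurements, computes their covariance matrix, and finds the first principal component by power iteration. The data and function names are hypothetical, and power iteration is a simple stand-in chosen so the example needs no extra packages. Real analyses would normally use a dedicated routine such as `prcomp()` in R.

```python
import math


def standardise(data):
    """Centre each column on 0 and scale it to a standard deviation of 1."""
    n = len(data)
    columns = list(zip(*data))
    means = [sum(col) / n for col in columns]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]


def covariance_matrix(rows):
    """Covariance matrix of already-centred data (rows are observations)."""
    n = len(rows)
    p = len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of `cov`, found by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    return v, eigenvalue


# Invented data: six speakers, three correlated measurements each.
measurements = [
    [2.0, 1.9, 5.1],
    [3.1, 3.0, 4.2],
    [4.2, 3.8, 3.9],
    [5.0, 5.2, 3.1],
    [6.1, 5.9, 2.2],
    [7.0, 7.1, 0.9],
]

z = standardise(measurements)
cov = covariance_matrix(z)
weights, eigenvalue = first_component(cov)

# One score per speaker: the three variables collapsed into one.
scores = [sum(value * w for value, w in zip(row, weights)) for row in z]
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print("weights:", [round(w, 3) for w in weights])
print("scores:", [round(s, 3) for s in scores])
print(f"proportion of variance explained: {explained:.3f}")
```

The `scores` list is the kind of derived variable that can then be passed on to a linear model in place of the original measurements.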

3.4 Time series and coordinates

Generalised Additive (Mixed) Models (GAMMs) are a flexible extension of linear models that allows us to fit non-linear effects. They are particularly useful with data that come from time series (e.g. f0 and formants, corpus occurrences across time, longitudinal data, etc.) and they can be employed with any kind of data that can be thought of as being represented on a coordinate space (e.g. geolocations, electroencephalographic (EEG) data, 3D tongue imaging, etc.).

Functional Principal Component Analysis (FPCA) is another approach to modelling time-series data.

3.5 Bayesian inference

4 Advanced

4.1 Power analysis

Power analysis is a fundamental, although often neglected, step in Null Hypothesis Significance Testing (the statistical framework that returns p-values). A power analysis estimates the minimum sample size necessary to detect an effect of a particular size with a given statistical power. The statistical power of a test is the probability that it detects an effect when the effect really exists, i.e. the percentage of tests, over repeated samples, that correctly detect it. A commonly recommended statistical power is 80% or greater.

Power analyses with linear models can become quite complex, especially if random effects are included. Simulation is a way to simplify the calculations necessary to find the minimum sample size. The R package simr provides users with a set of functions to perform a power analysis with linear models using simulations. You can find a tutorial here.
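The logic of a simulation-based power analysis can be sketched with a much simpler design than the mixed models that simr handles. The example below is not simr's interface. It simulates a plain two-group comparison many times, tests each simulated data set with a two-sided z test (a large-sample approximation to the t test), and reports the proportion of significant results as the estimated power. The effect size, standard deviation, candidate sample sizes and function names are all assumptions made for this illustration.

```python
import random
from statistics import NormalDist


def mean_and_variance(values):
    """Sample mean and (n - 1) variance of a list of numbers."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, var


def simulate_power(n_per_group, effect, sd=1.0, alpha=0.05, n_sims=1000, seed=1):
    """Estimate power as the share of simulated studies with p < alpha."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    detected = 0
    for _ in range(n_sims):
        # Simulate one study in which the true difference is `effect`.
        group_a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        mean_a, var_a = mean_and_variance(group_a)
        mean_b, var_b = mean_and_variance(group_b)
        standard_error = (var_a / n_per_group + var_b / n_per_group) ** 0.5
        if abs((mean_b - mean_a) / standard_error) > critical:
            detected += 1
    return detected / n_sims


def minimum_sample_size(effect, target_power=0.8, candidates=range(10, 201, 10)):
    """Smallest candidate group size whose estimated power reaches the target."""
    for n in candidates:
        if simulate_power(n, effect) >= target_power:
            return n
    return None  # target not reached within the candidate sizes


# Hypothetical scenario: a true difference of half a standard deviation.
n_min = minimum_sample_size(effect=0.5)
print(f"Estimated minimum participants per group: {n_min}")
```

Because the estimate comes from random simulations, it varies slightly with the seed and the number of simulations. Increasing `n_sims` makes it more stable at the cost of a longer run time.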

If you are running Bayesian linear models, you can check out this post on Bayesian CrI-width power analysis.

4.2 Multivariate linear models

Estimating Multivariate Models with brms by Paul Bürkner explains how to fit linear models with two or more outcome variables (i.e. multivariate models) using brms.