Data types and reading data in R

Learn about types of data and how to import data in R

Author

Stefano Coretta

Published

June 28, 2023

Pre-requisites

R packages

1 Tabular data

Important

When working through the Notebook entries, always make sure you are in the course Quarto Project you created earlier.

You know you are in a Quarto Project because you can see the name of the Project in the top-right corner of RStudio, next to the light-blue cube icon.

If you see Project (none) in the top-right corner, that means you are not in a Quarto Project.

To make sure you are in the right Quarto project, you can open the project by going to the project folder in File Explorer (Win) or Finder (macOS) and double click on the .Rproj file.

Tabular data

Tabular data is data that has the form of a table: i.e. values structured in columns and rows.

Most of the data we will be using in this course will be tabular and the files will be in the .csv format.

The comma separated values format (.csv) is the best format to save data in because it is basically a plain text file, it’s quick to parse, and can be opened and edited with any software (plus, it’s not a proprietary format like .docx or .xlsx—these formats are specific to particular software).

This is what a .csv file looks like when you open it in a text editor (showing only the first few lines).

Group,ID,List,Target,ACC,RT,logRT,Critical_Filler,Word_Nonword,Relation_type,Branching
L1,L1_01,A,banoshment,1,423,6.0474,Filler,Nonword,Phonological,NA
L1,L1_01,A,unawareness,1,603,6.4019,Critical,Word,Unrelated,Left
L1,L1_01,A,unholiness,1,739,6.6053,Critical,Word,Constituent,Left
L1,L1_01,A,bictimize,1,510,6.2344,Filler,Nonword,Phonological,NA

The file contains tabular data (data that is structured as columns and rows, like a spreadsheet).

A .csv file uses a comma , (hence the name “comma separated values”) to separate the values of each column in every row.

The first line of the file indicates the names of the columns of the table:

Group,ID,List,Target,ACC,RT,logRT,Critical_Filler,Word_Nonword,Relation_type,Branching

There are 11 columns. The remaining rows are the data, i.e. the values of each column separated by commas.

L1,L1_01,A,banoshment,1,423,6.0474,Filler,Nonword,Phonological,NA
L1,L1_01,A,unawareness,1,603,6.4019,Critical,Word,Unrelated,Left
L1,L1_01,A,unholiness,1,739,6.6053,Critical,Word,Constituent,Left
L1,L1_01,A,bictimize,1,510,6.2344,Filler,Nonword,Phonological,NA

This might look a bit confusing, but you will see later that, after importing this type of file, you can view it as a nice spreadsheet (as you would in Excel).

Another common type of tabular data file is the spreadsheet, like those created by Microsoft Excel or Apple Numbers. These are all proprietary formats that require you to have the software they were created with if you want to modify them.

Portability and openness are important aspects of conducting ethical research: using open and non-proprietary file types makes your research more accessible and doesn’t privilege those who have access to specific software (remember, R is free!).

There are also variations of the comma separated values type, like tab separated values files (.tsv, which uses tab characters instead of commas) and fixed-width files (usually .txt, where columns are separated by as many white spaces as needed so that the columns align).
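As a rough sketch of how these variants are handled in practice, readr provides sibling functions of read_csv() (introduced in section 3 below), such as read_tsv() for tab separated files. The toy data here is made up for illustration, and the sketch assumes readr 2.0 or later, where I() marks a string as literal data rather than a file path.

```r
library(readr)

# Made-up toy data: the same two rows written with commas and with tabs.
csv_text <- "Target,RT\nunawareness,603\nunholiness,739\n"
tsv_text <- "Target\tRT\nunawareness\t603\nunholiness\t739\n"

# I() tells readr that the string is the data itself, not a file path.
from_csv <- read_csv(I(csv_text), show_col_types = FALSE)
from_tsv <- read_tsv(I(tsv_text), show_col_types = FALSE)

# The two formats differ only in the separator, so the values match.
identical(from_csv$Target, from_tsv$Target)
identical(from_csv$RT, from_tsv$RT)
```

With a real file you would pass the file path instead of I(...), exactly as with read_csv().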

1.1 Non-tabular data

Of course, R can also import data that is not tabular, like map data and complex hierarchical data.

We will dip our toes into map data at the end of the course, but virtually all of the data we will use will be tabular, just because that’s the format you need to do data visualisation and analyses.

1.2 `.rds` files

R has a special way of saving data: .rds files.

.rds files allow you to save an R object to a file on your computer, so that you can read that file in when you need it.

A common use for .rds files is to save tabular data that you have processed so that it can be readily used in many different scripts or even by other people.

In the following sections you will learn how to import (aka read) three types of data: .csv, Excel and .rds files.

2 Download the data files

Throughout the course we will be using data files that come from linguistic research. You should now download the data files from the QML Data website according to the following instructions.

Please follow these instructions carefully.

1. Download the zip archive with all the data by right-clicking on the following link and downloading the file: data.zip.
2. Unzip the zip file to extract the contents. (If you don’t know how to do this, search online for instructions for your operating system!)
3. Create a folder called data/ (the slash is there just to remind you that it’s a folder, but you don’t have to include it in the name) in the Quarto project you are using for the course.
   1. To create a folder, go to the Files tab of the bottom-right panel in RStudio.
   2. Make sure you are viewing the project’s main folder.
   3. Click on the New Folder button, enter “data” in the text box and click OK.
4. Move the contents of the data.zip archive into the data/ folder.
   1. Open a Finder or File Explorer window.
   2. Navigate to the folder where you have extracted the zip file (it will very likely be the Downloads/ folder).
   3. Copy the contents of the zip file.
   4. In Finder or File Explorer, navigate to the Quarto project folder, then the data/ folder, and paste the contents in there. (You can also drag and drop if you prefer.)

The rest of the tutorial will assume that you have created a folder called data/ in the Quarto project folder and that the files you downloaded are in that folder. The data folder should look something like this:

data/
├── cameron2020/
│   └── gestures.csv
├── coretta2018/
│   ├── formants.csv
│   └── token-measures.csv
└── ...
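If you want to double-check the folder from within R, something like the following sketch could be run in the Console. It assumes the working directory is the project folder (explained in section 3.3); if the data/ folder is not found, the listing is simply empty and the check returns FALSE.

```r
# List everything inside data/, including files in sub-folders.
# The path is relative to the current working directory.
data_files <- list.files("./data", recursive = TRUE)
data_files

# Check for one specific file shown in the folder tree above.
has_gestures <- file.exists("./data/cameron2020/gestures.csv")
has_gestures
```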

I recommend that you start being very organised with your files in all your projects from now on, whether it’s for this course, your dissertation or anything else. I also suggest avoiding overly nested structures: for example, avoid having one folder for each week of this course and instead save all data files in the data/ folder.

Organising your files

The Open Science Framework has the following recommendations that apply very well to any type of research project.

Use one folder per project. This will also be your RStudio project folder.
Separate raw data from derived data.
Separate code from data.
Make raw data read-only.

To learn more about this, check the OSF page Organising files.

In brief, these recommendations mean that you want one folder for your research project, course or other work, and inside it two folders: one for data and one for code.

The data/ folder could further contain raw/ for raw data (data that should not be lost or changed, for example collected data or annotations) and derived/ for data that derives from the raw data, for example through automated data processing.

I usually also have a separate folder called figs/ or img/ where I save plots. Of course, which folders you have is ultimately up to you, and needs will vary depending on the project and field!
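The layout described above could be created from R with a few lines like these. This is only a sketch: the project location and the name my_project are hypothetical, and a temporary folder is used so the code can be tried without touching your real files.

```r
# Hypothetical project location: a temporary folder, so this sketch is
# safe to run. In practice this would be your own project folder.
project_dir <- file.path(tempdir(), "my_project")

# Folders following the recommendations: raw and derived data kept apart,
# code separate from data, and a folder for figures.
folders <- c("data/raw", "data/derived", "code", "figs")

for (folder in folders) {
  # recursive = TRUE also creates any missing parent folders (e.g. data/).
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Show the resulting folder structure.
list.dirs(project_dir, full.names = FALSE, recursive = TRUE)
```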

3 Import `.csv` files

Let’s start with data from this paper: Song et al. 2020. Second language users exhibit shallow morphological processing. DOI: 10.1017/S0272263120000170.

The study consisted of a lexical decision task in which participants were first shown a prime, followed by a target word for which they had to indicate whether it was a real word or a nonce word.

The prime word belonged to one of three possible groups (Relation_type in the data) each of which refers to the morphological relation of the prime and the target word:

Unrelated: for example, prolong (assuming unkindness as target, [[un-kind]-ness]).
Constituent: unkind.
NonConstituent: kindness.

3.1 The tidyverse packages

Importing .csv files is very easy. To import data in R we will use the read_csv() function from the readr package, one of a collection of R packages known as the tidyverse.

If you followed the Setup instructions at the beginning of the course, you are all set. The tidyverse packages should already be installed. You can check in the Packages tab in the bottom right panel of RStudio: if tidyverse is listed in there, then all the tidyverse packages are installed.

If not, installing the tidyverse packages is easy: you just need to [install](packages.qmd#install-packages) the tidyverse package and that will take care of installing the most important packages in the collection (called the “core” tidyverse packages).
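As an alternative to looking in the Packages tab, a check along these lines can be run in the Console. It is a small sketch using base R's requireNamespace(); the installation command is left as a comment so that nothing gets installed by accident.

```r
# TRUE if the tidyverse package is installed, FALSE otherwise.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  message("The tidyverse is installed.")
} else {
  message("The tidyverse is not installed yet.")
  # To install it, run this once in the Console:
  # install.packages("tidyverse")
}
```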

3.2 `read_csv()`

Did you open the Quarto project?

Before moving on, make sure that you have opened the RStudio Quarto project correctly (see warning at the top of the tutorial).

The read_csv() function from the readr package only requires you to specify the file path as a string (remember, strings are quoted between " ", for example "year_data.txt"). On my computer, the file path of song2020/shallow.csv is /Users/ste/qml/data/song2020/shallow.csv, but on your computer the file path will be different, of course.

Also, note that it is not enough to use the read_csv() function. You also must assign the output of the read_csv() function (i.e. the data we are reading) to a variable, using the assignment arrow <-, just like we were assigning values to variables in the previous weeks.

And since read_csv() is a function from the tidyverse, you first need to attach the tidyverse packages with library(tidyverse) (remember, you need to attach packages only once per session). This will attach the core tidyverse packages, including readr. Of course, you can also attach the individual packages directly: library(readr). If you use library(tidyverse) there is no need to attach individual tidyverse packages.

Run the following lines in the Console.

library(tidyverse)

shallow <- read_csv("./data/song2020/shallow.csv")

Rows: 6500 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Group, ID, List, Target, Critical_Filler, Word_Nonword, Relation_ty...
dbl (3): ACC, RT, logRT

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you look at the Environment tab, you will see shallow under Data.

Data frames and tibbles

In R, a data table is called a data frame.

Tibbles are special data frames created with the read functions from the tidyverse. If you are curious about the difference, check this page.

In this course’s tutorials, “data frame” and “tibble” will be used interchangeably (since we are using the read functions from the tidyverse, all resulting data frames will be tibbles).
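To see the relationship between the two terms for yourself, a quick check like the one below can help. The two-row table is made up for illustration, and the sketch assumes readr 2.0 or later for I(), which marks the string as literal data.

```r
library(readr)

# Made-up two-row table read from literal text instead of a file.
toy <- read_csv(I("Group,RT\nL1,423\nL2,603\n"), show_col_types = FALSE)

# A tibble is still a data frame...
is.data.frame(toy)

# ...with the extra class "tbl_df" that marks it as a tibble.
inherits(toy, "tbl_df")
class(toy)
```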

But wait, what is that "./data/song2020/shallow.csv"? That’s a relative path. Let’s understand the concept of relative paths now. You will be able to view the data soon.

3.3 Relative paths

Relative path

A relative path is a file path that is relative to a folder. The folder the path starts at is represented by ./.

When you are using Quarto projects, ./ stands for the project folder, so paths starting with ./ are relative to the project folder! This is true whatever the name of the folder/project and wherever it is located on your computer.

For example, if your project is called awesome_proj and it’s in Downloads/stuff/, then when you write ./data/results.csv you really mean Downloads/stuff/awesome_proj/data/results.csv!

How does R know the path is relative to the project folder?

That is because when working with Quarto projects, all relative paths are relative to the project folder (i.e. the folder with the .Rproj file)!

The folder which relative paths are relative to is called the working directory (directory is just another way of saying folder).

Working directory

The working directory is the folder which relative paths are relative to.

When using Quarto projects, the working directory is the project folder.

The code read_csv("./data/song2020/shallow.csv") above will work because you are using a Quarto project and inside the project folder there is a folder called data/ and in it there’s the song2020/shallow.csv file.

So from now on I encourage you to use Quarto projects and relative paths always! You will also learn about [R scripts](scripts.qmd) and [Quarto documents](intro-quarto.qmd) later, which will make things even easier.

The benefit of Quarto projects and relative paths is that, if you move your project or rename it, or if you share the project with somebody, all the paths will just work because they are relative!

Get the working directory

You can get the current working directory with the getwd() command.

Run it now in the Console! Is the returned path the project folder path?

If not, it might be that you are not working from a Quarto project. Check the top-right corner of RStudio: is the project name in there or do you see Project (none)?

If it’s the latter, you are not in a Quarto project, but you are running R from somewhere else (meaning, the working directory is somewhere else). If so, close RStudio and open the project.
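The link between the working directory and a relative path can be sketched as follows. The code only builds and prints paths, so it is safe to run anywhere; whether the file is actually found depends on you being in the course project with the data in place.

```r
# The working directory: the folder that relative paths start from.
project_dir <- getwd()

# The relative path used in read_csv() above...
relative_path <- "./data/song2020/shallow.csv"

# ...points to the same place as this full path built from the working directory.
full_path <- file.path(project_dir, "data", "song2020", "shallow.csv")
full_path

# TRUE only if the file exists at that location.
file.exists(relative_path)
```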

3.4 View the data

Now we can finally view the data.

The easiest way is to click on the name of the data listed in the Environment tab, in the top-right panel of RStudio.

You will see a nicely formatted table, as you would in a programme like Excel.

Data tables in R (i.e. tabular, spread-sheet like data) are called data frames or tibbles.¹

The shallow data frame contains 11 columns (called variables in the Environment tab). The 11 columns are the following:

Group: L1 vs L2 speakers of English.
ID: Subject unique ID.
List: Word list (A to F).
Target: Target word in the lexical decision trial.
ACC: Lexical decision response accuracy (0 incorrect response, 1 correct response).
RT: Reaction times of response in milliseconds.
logRT: Logged reaction times.
Critical_Filler: Whether the trial was a filler or critical.
Word_Nonword: Whether the Target was a real Word or a Nonword.
Relation_type: The type of relation between prime and target word (Unrelated, NonConstituent, Constituent, Phonological).
Branching: Constituent syntactic branching, Left and Right (shout out to Charlie Puth).
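Besides clicking on the data in the Environment tab, you can inspect a data frame from the Console. The sketch below uses only the header and the first two rows of the file shown at the top of this tutorial, stored in a hypothetical object called shallow_head, so that it runs without the data file; with the real data you would call the same functions on shallow. It assumes readr 2.0 or later for I().

```r
library(readr)

# Header and first two rows of shallow.csv, as shown earlier in the tutorial.
csv_text <- "Group,ID,List,Target,ACC,RT,logRT,Critical_Filler,Word_Nonword,Relation_type,Branching
L1,L1_01,A,banoshment,1,423,6.0474,Filler,Nonword,Phonological,NA
L1,L1_01,A,unawareness,1,603,6.4019,Critical,Word,Unrelated,Left
"

# shallow_head is a hypothetical name for this small excerpt.
shallow_head <- read_csv(I(csv_text), show_col_types = FALSE)

nrow(shallow_head)   # number of rows
ncol(shallow_head)   # number of columns
names(shallow_head)  # column names

# By default, read_csv() treats the text NA as a missing value.
is.na(shallow_head$Branching)
```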

Quiz 3

How many rows does shallow have?

Options: 11, 650, 6500.

4 Import Excel sheets

To read an Excel file we first need to attach the readxl package. It should already be installed, because it comes with the tidyverse. If not, then install it.

library(readxl)

Then we can use the read_excel() function. Let’s read the file.

relatives <- read_excel("./data/los2023/relatives.xlsx")

Now you can view the tibble relatives.

Note that if the Excel file has more than one sheet, you can specify the sheet number when reading the file (the default is sheet = 1).

relatives_2 <- read_excel("./data/los2023/relatives.xlsx", sheet = 2)

The second sheet in los2023/relatives.xlsx contains the description of the columns in the first sheet.
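If you are not sure which sheets a workbook contains, readxl's excel_sheets() function lists their names, and a sheet can then be selected by name as well as by number. The sketch below uses an example workbook that ships with the readxl package (via readxl_example()), not the course file, so it can be run without the data.

```r
library(readxl)

# Path to an example workbook bundled with readxl (not the course data).
path <- readxl_example("datasets.xlsx")

# List the names of all sheets in the workbook.
sheet_names <- excel_sheets(path)
sheet_names

# Read a sheet by position...
second_sheet <- read_excel(path, sheet = 2)

# ...or by name, which gives the same sheet here.
second_sheet_by_name <- read_excel(path, sheet = sheet_names[2])
```

The same calls should work with "./data/los2023/relatives.xlsx" in place of path.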

5 Import `.rds` files

Another useful type of data file is a format specifically designed for R: .rds files.

Usually, each .rds file contains one R object, like one tibble.

You can read .rds files with the readRDS() function.

glot_status <- readRDS("./data/coretta2022/glot_status.rds")

As always, you need to assign the output of the function to a variable, here glot_status.

.rds files

.rds files are a type of R file which can store any R object and save it on disk.

R objects can be saved to an .rds file with the saveRDS() function and they can be read with the readRDS() function.

View the glot_status tibble now.

It is also very easy to save a tibble to an .rds file with the saveRDS() function.

For example:

saveRDS(shallow, "./data/song2020/shallow.rds")

The first argument is the name of the tibble object and the second argument is the file path to save the object to.
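A save-and-read round trip can be sketched like this. The small data frame is made up for illustration and the file is written to a temporary location, so nothing in your project is touched.

```r
# Made-up data frame standing in for processed data.
toy <- data.frame(
  Target = c("unawareness", "unholiness"),
  RT = c(603, 739)
)

# Temporary file path ending in .rds.
rds_path <- tempfile(fileext = ".rds")

# Save the object to disk, then read it back into a new variable.
saveRDS(toy, rds_path)
toy_again <- readRDS(rds_path)

# The object read back is the same as the one that was saved.
identical(toy, toy_again)
```

Unlike a .csv file, an .rds file keeps the column types exactly as they were in R, so there is nothing to re-parse when reading it back.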

6 Practice

Practice 1

Read the following files in R, making sure you use the right read_*() function.

koppensteiner2016/takete_maluma.txt (a tab separated file)
pankratz2021/si.csv
Go to https://datashare.ed.ac.uk/handle/10283/4006 and download the file conflict_data_.xlsx. Read both sheets (“conflict_data2” and “demographics”). Any issues? (I suggest looking at the spreadsheet in Excel if it helps.)

7 Summary

You can import tabular data in R with the read_*() functions from the tidyverse package readr.
You can view data in RStudio as spreadsheets.

Footnotes

¹ A tibble is a special data frame. We will learn more about tibbles in the following weeks.
