DAL tutorial - Week 3

Read data

Important

When working through these tutorials, always make sure you are in the course’s RStudio Quarto Project you created.

You know you are in an RStudio Project because you can see the name of the Project in the top-right corner of RStudio, next to the light blue cube icon.

If you see Project (none) in the top-right corner, that means your are not in an RStudio Project.

To make sure you are in the RStudio project, to open the project go to the project folder in File Explorer or Finder and double click on the .Rproj file.

1 Some computer basics

In the tutorial last week you’ve been playing around with R, RStudio and R scripts.

But what if you want to import data in R?

Easy! You can use the read_*() functions (* is just a place holder for specific types of files to read, like read_csv() or read_excel()) to read your files into R. But before we dive in, let’s first talk about some computer basics. (You can skip this section if it’s too basic for you.)

Enable all extensions

Before moving on, we recommend you enable the option to show all file extensions in the File Explorer/Finder.

Follow the instructions here:

1.1 Files, folder and file extensions

Files saved on your computer live in a specific place. For example, if you download a file from a browser (like Google Chrome, Safari or Firefox), the file is normally saved in the Download folder.

But where does the Download folder live? Usually, in your user folder! The user folder normally is the name of your account or a name you picked when you created your computer account. In my case, my user folder is simply called ste.

User folder

The user folder is the folder with the name of your account.

On macOS

  • Go to Finder > Preferences/Settings.

  • Go to Sidebar.

  • The name next to the house icon is the name of your home folder.

On Windows

  • Right-click an empty area on the navigation panel in File Explorer.

  • From the context menu, select the ‘Show all folders’ and your user profile will be added as a location in the navigation bar.

So, let’s assume I download a file, let’s say big_data.csv, in the Download folder of my user folder.

Now we can represent the location of the big_data.csv file like so:

ste/
└── Downloads/
    └── big_data.csv

To mark that ste and Downloads are folders, we add a final forward slash /. That simply means “hey! I am a folder!”. big_data.csv is a file, so it doesn’t have a final /.

Instead, the file name big_data.csv has a file extension. The file extension is .csv. A file extension marks the type of file: in this the big_data file is a .csv file, a comma separated value file (we will see an example of what that looks like later).

Different file type have different file extensions:

  • Excel files: .xlsx.
  • Plain text files: .txt.
  • Images: .png, .jpg, .gif.
  • Audio: .mp3, .wav.
  • Video: .mp4, .mov, .avi.
  • Etc…
File extension

A file extension is a sequence of letters that indicates the type of a file and it’s separated with a . from the file name.

1.2 File paths

Now, we can use an alternative, more succinct way, to represent the location of the big_data.csv:

ste/Downloads/big_data.csv

This is called a file path! It’s the path through folders that lead you to the file. Folders are separated by / and the file is marked with the extension .csv.

File path

A file path indicates the location of a file on a computer as a path through folders that lead you to the file.

Now the million pound question: where does ste/ live on my computer???

User folders are located in different places depending on the operating system you are using:

  • On macOS: the user folder is in /Users/.

    • You will notice that there is a forward slash also before the name of the folder. That is because the /Users/ folder is a top folder, i.e. there are no folders further up in the hierarchy of folders.
    • This means that the full path for the big_data.csv file on a computer running macOS would be: /Users/ste/Downloads/big_data.csv.
  • On Windows: the user folder is in C:/Users/

    • You will notice that C is followed by a colon :. That is because C is a drive, which contains files and folders. C: is not contained by any other folder, i.e. there are no other folders above C: in the hierarchy of folders.
    • This means that the full path for the big_data.csv file on a Windows computer would be: C:/Users/ste/Downloads/big_data.csv.

When a file path starts from a top-most folder, we call that path the absolute file path.

Absolute path

An absolute path is a file path that starts with a top-most folder.

There is another type of file paths, called relative paths. A relative path is a partial file path, relative to a specific folder. You will learn how to use relative paths below, when we will go through importing files in R using R scripts below.

Importing files in R is very easy with the tidyverse packages. You just need to know the file type (very often the file extension helps) and the location of the file (i.e. the file path).

The next sections will teach you how to import data in R!

Quiz 1
Which of the following is an absolute path?

2 Data types

2.1 Tabular data

Tabular data

Tabular data is data that has a form of a table: i.e. values structured in columns and rows.

Most of the data we will be using in this course will be tabular and the files will be in the .csv format.

The comma separated values format (.csv) is the best format to save data in because it is basically a plain text file, it’s quick to parse, and can be opened and edited with any software (plus, it’s not a proprietary format like .docx or .xlsx—these formats are specific to particular software).

This is what a .csv file looks like when you open it in a text editor (showing only the first few lines).

Group,ID,List,Target,ACC,RT,logRT,Critical_Filler,Word_Nonword,Relation_type,Branching
L1,L1_01,A,banoshment,1,423,6.0474,Filler,Nonword,Phonological,NA
L1,L1_01,A,unawareness,1,603,6.4019,Critical,Word,Unrelated,Left
L1,L1_01,A,unholiness,1,739,6.6053,Critical,Word,Constituent,Left
L1,L1_01,A,bictimize,1,510,6.2344,Filler,Nonword,Phonological,NA

The file contains tabular data (data that is structured as columns and rows, like a spreadsheet).

To separate the values of each column, a .csv file uses a comma , (hence the name “comma separated values”) to separate the values in every row.

The first line of the file indicates the names of the columns of the table:

Group,ID,List,Target,ACC,RT,logRT,Critical_Filler,Word_Nonword,Relation_type,Branching

There are 11 columns. The rest of the rows is the data, i.e. the values of each column separated by commas.

L1,L1_01,A,banoshment,1,423,6.0474,Filler,Nonword,Phonological,NA
L1,L1_01,A,unawareness,1,603,6.4019,Critical,Word,Unrelated,Left
L1,L1_01,A,unholiness,1,739,6.6053,Critical,Word,Constituent,Left
L1,L1_01,A,bictimize,1,510,6.2344,Filler,Nonword,Phonological,NA

This might look a bit confusing, but you will see later that, after importing this type of file, you can view it as a nice spreadsheet (as you would in Excel).

Another common type of tabular data file is spreadsheets, like spreadsheets created by Microsoft Excel or Apple Numbers. These are all proprietary formats that require you to have the software that were created with if you want to modify them.

Portability and openness are important aspects of conducting ethical research, so that using open and non-proprietary file types makes your research more accessible and doesn’t privilege those who have access to specific software (remember, R is free!).

There are also variations of the comma separated values type, like tab separated values files (.tsv, which uses tab characters instead of commas) and fixed-width files (usually .txt, where columns are separated by as many white spaces as needed so that the columns align).

2.2 Non-tabular data

Of course, R can import also data that is not tabular, like map data and complex hierarchical data.

We will dip our toes into map data at the end of course, but virtually all of the data we will use will be tabular, just because that’s the format you need to do data visualisation and analyses.

2.3 .rds files

R has a special way of saving data: .rds files.

.rds files allow you to save an R object to a file on your computer, so that you can read that file in when you need it.

A common use for .rds files is to save tabular data that you have processed so that it can be readily used in many different scripts or even by other people.

In the following sections you will learn how to import (aka read) three types of data: .csv, Excel and .rds files.

3 Download the data files

Throughout the course we will be using data files that come from linguistic research. You should download now the data files according to the following instructions

Please, follow these instructions carefully.

  1. Download the zip archive with all the data by right-clicking on the following link and download the file: data.zip.

  2. Unzip the zip file to extract the contents. (If you don’t know how to do this, ask one of the tutors to help you!)

  3. Create a folder called data/ (the slash is there just to remind you that it’s a folder, but you don’t have to include it in the name) in the Quarto project you are using for the course.

    1. To create a folder, go to the Files tab of the bottom-right panel in RStudio.

    2. Make sure you are viewing the project’s main folder.

    3. Click on the New Folder button, enter “data” in the text box and click OK

  4. Move the contents of the data.zip archive into the data/ folder.

    1. Open a Finder or File Explorer window.

    2. Navigate to the folder where you have extracted the zip file (it will very likely be the Downloads/ folder).

    3. Copy the contents of the zip file.

    4. In Finder or File Explorer, navigate to the Quarto project folder, then the data/ folder, and paste the contents in there. (You can also drag and drop if you prefer.)

The rest of the tutorial will assume that you have created a folder called data/ in the Quarto project folder and that the files you downloaded are in that folder. The data folder should like something like this:

data/
└── cameron2020/
    └── gestures.csv
└── coretta2018/
    └── formants.csv
    └── token-measures.csv
└── ...

I recommend that you start being very organised with your files in other projects from now on, whether it’s for this course or your dissertation or else. I also suggest to avoid overly nested structures (for example, avoid having one folder for each week for this course. Rather, save all data files in the data/ folder).

The Open Science Framework has the following recommendations that apply very well to any type of research project.

  • Use one folder per project. This will also be your RStudio project folder.

  • Separate raw data from derived data.

  • Separate code from data.

  • Make raw data read-only.

To learn more about this, check the OSF page Organising files.

In brief, what these recommendations mean is that you want a folder for your research project/course/else, and inside the folder two folders: one for data and one for code.

The data/ folder could further contain raw/ for raw data (data that should not be lost or changed, for example collected data or annotations) and derived/ for data that derives from the raw data, for example through automated data processing.

I usually also have a separate folder called figs/ or img/ where I save plots. Of course which folders you will have it’s ultimately up to you and needs will vary depending on the project and field!

4 Import .csv files

Let’s start with data from this paper: Song et al. 2020. Second language users exhibit shallow morphological processing. DOI: 10.1017/S0272263120000170.

The study consisted of a lexical decision task in which participants were first shown a prime, followed by a target word for which they had to indicate whether it was a real word or a nonce word.

The prime word belonged to one of three possible groups (Relation_type in the data) each of which refers to the morphological relation of the prime and the target word:

  • Unrelated: for example, prolong (assuming unkindness as target, [[un-kind]-ness]).

  • Constituent: unkind.

  • NonConstituent: kindness.

4.1 The tidyverse packages

Importing .csv files is very easy. You can use the read_csv() function from a collection of R packages known as the tidyverse.

To import data in R we will use the read_csv() function from the readr package, one of the tidyverse packages.

Installing the tidyverse packages is easy: you just need to install the tidyverse package and that will take care of installing the most important packages in the collection (called the “core” tidyverse packages).

Go ahead and install the tidyverse from the Packages tab.1

4.2 read_csv()

Did you open the RStudio Quarto project?

Before moving on, make sure that you have opened the RStudio Quarto project correctly (see warning at the top of the tutorial)

The read_csv() function from the readr package only requires you to specify the file path as a string (remember, strings are quoted between " ", for example "year_data.txt"). On my computer, the file path of song2020/shallow.csv is /Users/ste/dal/data/song2020/shallow.csv, but on your computer the file path will be different, of course.

Also, note that it is not enough to use the read_csv() function. You also must assign the output of the read_csv() function (i.e. the data we are reading) to a variable, using the assignment arrow <-, just like we were assigning values to variables in the previous weeks.

And since the read_csv() is a function from the tidyverse, you first need to attach the tidyverse packages with library(tidyverse) (remember, you need to attach packages only once per session). This will attach the core tidyverse packages, including readr. Of course, you can also attach the individual packages directly: library(readr). If you use library(tidyverse) there is no need to attach individual tidyverse packages.

Before reading the data, create a new R script named tutorial_w03.R and save it in the code/ folder of your Quarto project.

Generally, you start the script with calls to library() to load all the packages you need for the script.

Now we only need one package, tidyverse, but in most cases you will need more than one! The best practice is to attach all of packages first, in the top of your script. Please, get in the habit of doing this from now, so that you can keep your scripts tidy and pretty!

Warning

Please, don’t include install.packages() in your R scripts!

Remember, you only have to install a package once, and you can just type it in the Console.

But DO include library() in your scripts.

At the top of tutorial_w03.R, write the following lines of code. Then run the code.

library(tidyverse)

shallow <- read_csv("./data/song2020/shallow.csv")
Rows: 6500 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Group, ID, List, Target, Critical_Filler, Word_Nonword, Relation_ty...
dbl (3): ACC, RT, logRT

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you look at the Environment tab, you will see song2020 under Data.

Data frames and tibbles

In R, a data table is called a data frame.

Tibbles are special data frame created with the read functions from the tidyverse. If you are curious about the difference, check this page.

In this course’s tutorials, “data frame” and “tibble” will be used interchangeably (since we are using the read functions from the tidyverse, all resulting data frames will be tibbles).

But wait, what is that "./data/song2020/shallow.csv"? That’s a relative path. We briefly mentioned relative paths above, but let’s understand the details now. You will be able to view the data soon.

4.3 Relative paths

Relative path

A relative path is a file path that is relative to a folder. The folder the path starts at is represented by ./.

When you are using R scripts in Quarto projects, the ./ folder paths are relative to is the project folder! This is true whichever the name of the folder/project and whichever it’s location on your computer.

For example, if your project it’s called awesome_proj and it’s in Downloads/stuff/, then if you write ./data/results.csv you really mean Downloads/stuff/awesome_proj/data/results.csv!

How does R know the path is relative to the project folder?

That is because when working with Quarto projects, all relative paths are relative to the project folder (i.e. the folder with the .Rproj file)!

The folder which relative paths are relative to is called the working directory (directory is just another way of saying folder).

Working directory

The working directory is the folder which relative paths are relative to.

When using Quarto projects, the working directory is the project folder.

The code read_csv("./data/song2020/shallow.csv") above will work because you are using a Quarto project and inside the project folder there is a folder called data/ and in it there’s the song2020/shallow.csv file.

So from now on I encourage you to use Quarto projects, R scripts and relative paths always!

The benefit of doing so is that, if you move your project or rename it, or if you share the project with somebody, all the paths will just work because they are relative!

Get the working directory

You can get the current working directory with the getwd() command.

Run it now in the Console! Is the returned path the project folder path?

If not, it might be that you are not working from a Quarto project. Check the top-right corner of RStudio: is the project name in there or do you see Project (none)?

If it’s the latter, you are not in a Quarto project, but you are running R from somewhere else (meaning, the working directory is somewhere else). If so, close RStudio and open the project.

4.4 View the data

Now we can finally view the data.

The easiest way is to click on the name of the data listed in the Environment tab, in the top-right panel of RStudio.

You will see a nicely formatted table, as you would in a programme like Excel.

Data tables in R (i.e. tabular, spread-sheet like data) are called data frames or tibbles.2

The shallow data frame contains 11 columns (called variables in the Environment tab). The 11 columns are the following:

  • Group: L1 vs L2 speakers of English.
  • ID: Subject unique ID.
  • List: Word list (A to F).
  • Target: Target word in the lexical decision trial.
  • ACC: Lexical decision response accuracy (0 incorrect response, 1 correct response).
  • RT: Reaction times of response in milliseconds.
  • logRT: Logged reaction times.
  • Critical_Filler: Whether the trial was a filler or critical.
  • Word_Nonword: Whether the Target was a real Word or a Nonword.
  • Relation_type: The type of relation between prime and target word (Unrelated, NonCostituent, Constituent, Phonological).
  • Branching: Constituent syntactic branching, Left and Right (shout out to Charlie Puth).
Quiz 3
How many rows does shallow have?

5 Import Excel sheets

To read an Excel file we need first to attach the readxl package. It should already be installed, because it comes with the tidyverse. If not, then install it.

library(readxl)

Then we can use the read_excel() function. Let’s read the file.

relatives <- read_excel("./data/los2023/relatives.xlsx")

Now you can view the tibble los2023.

Note that if the Excel file has more than one sheet, you can specify the sheet number when reading the file (the default is sheet = 1).

relatives_2 <- read_excel("./data/los2023/relatives.xlsx", sheet = 2)

The second sheet in los2023/relatives.xlx contains the description of the columns in the first sheet.

6 Import .rds files

Another useful type of data files is a file type specifically designed for r: .rds files.

Usually, each .rds file contains one R object, like one tibble.

You can read .rds files with the readRDS() function.

glot_status <- readRDS("./data/coretta2022/glot_status.rds")

As always, you need to assign the output of the function to a variable, here glot_status.

.rds files

.rds files are a type of R file which can store any R object and save it on disk.

R objects can be saved to an .rds file with the saveRDS() function and they can be read with the readRDS() function.

View the glot_status tibble now.

It is also very easy to save a tibble to an .rds file with the saveRDS() function.

For example:

saveRDS(shallow, "./data/song2020/shallow.rds")

The first argument is the name of the tibble object and the second argument is the file path to save the object to.

7 Practice

Read the following files in R, making sure you use the right read_*() function.

8 Summary

  • You have learnt about directories, file extensions and file paths.

  • You can import tabular data in R with the read_*() functions from the tidyverse package readr.

  • You can view data in RStudio as spreadsheets.

Footnotes

  1. Lab PCs should already have the tidyverse packages installed.↩︎

  2. A tibble is a special data frame. We will learn more about tibbles in the following weeks.↩︎