3  Data in R

In the previous section we learned how to store, extract, and manipulate numerical vectors. But working with real data requires using multiple types of data with different shapes and sizes. In this section we will learn how R
handles different data types and structures, and how to use them to study and summarize data.

3.1 Data types

Data types are classifications of data that help R conform to our intuition. For example, multiplying \(2\) by \(7\) feels right, but multiplying “pineapple” by “stalactite” does not. There are six data types in R: double, integer, logical, character, complex, and raw, each with its own rules for storing and handling. Learning these rules will allow us to analyze data later with less effort and fewer mistakes.

Data type complex is for imaginary numbers, and type raw is for raw bytes of data. It is unlikely that you will ever need these data types, so I will not explain them in these notes.

Doubles are regular numbers with a decimal value (which may be zero). In general, R will save any number that we type as a double.

my_double <- 5
typeof(my_double)
[1] "double"

Integers are numbers that have no decimal component. To create an integer you must type a number followed by an L:

my_integer <- 5L
typeof(my_integer)
[1] "integer"

As data scientists we rarely use integers because we can save them as doubles. But they can help us do complicated operations because R stores them more precisely and with less memory than doubles.

Logicals are truth values TRUE and FALSE, which are useful when we compare numbers or objects:

my_comparison <- -3 < 1
typeof(my_comparison)
[1] "logical"
Write TRUE and FALSE explicitly

At the beginning of every session, R saves T and F as shortcuts for TRUE and FALSE. But T and F are not reserved words; they are regular objects that we can modify and even delete inadvertently. An accidental misuse of T or F will cost you more time and effort than whatever you may save by typing a single letter instead of a full word. So, I strongly suggest you always write the complete words.

There is also a special type of logical value called NA, which denotes a missing value.

missing_value <- NA
typeof(missing_value)
[1] "logical"

Treating NA as a logical value allows R to handle missing data in intuitive ways. Consider, for example, that if a value is missing, we can not know whether it is bigger than a number, or whether it is equal to a word. R reflects this intuition by forcing any comparison involving NA to result in another NA.

NA > 5
[1] NA
NA == "Sancho"
[1] NA

Characters are text like “hello”, “Elvis”, or “Somewhere in La Mancha”; or symbols that we want to handle as text, like “size 45” or “mail/u”. We can create a character object by typing a character or string of characters surrounded by quotes:

my_character <- "Somewhere in La Mancha"
typeof(my_character)
[1] "character"

It is easy to confuse character strings with objects because both look like chunks of text in the code. For example, the_thing and "the_thing" look alike. However, the_thing is the name of an object that can contain any type of data, while "the_thing" is a character string, i.e., a piece of data that we can assign to any name. If we forget the quotation marks when writing a name, R will look for an object that likely doesn’t exist, so we will likely get an error.

noquotes
Error in eval(expr, envir, enclos): object 'noquotes' not found

Also, notice that R will be treat anything surrounded by quotes as a character string—regardless of what is between the quotes.

nums_as_chars <- "-5 + 71"
typeof(nums_as_chars)
[1] "character"

Still, we can identify strings because R always shows them surrounded by quotation marks and in a different colors from other data types.

typeof("9")
[1] "character"
typeof(9)
[1] "double"

A special type of character string is a factor, which stores categorical information like color or level of agreement. Factors can only have certain values called “levels”, which may be ordered (e.g., “agree”, “neutral”, “disagree”) or disordered (e.g., “red” or “green”). To create a factor, we can start with a vector of strings and use function as.factor() to convert it to a factor.

my_factor <- as.factor("red")
my_factor
[1] red
Levels: red

They may seem pointless now, but factors will matter more after we understand the different ways that R can group and organize data.

3.2 Data structures

Data structures are ways of organizing data that make it easier for us to manipulate and understand it. I will explain five different data structures, each with its own advantages and limitations: atomic vectors, matrices, arrays, lists, and data frames.

3.2.1 Atomic vectors

Atomic vectors are the most basic type of data structure. They are one-dimensional groups of data where all values must be of the same type. There is only one exception: any vector can include NA as a value regardless of the type of the other values. This internal consistency of vectors makes it easy for us to store values that are supposed to describe the same property.

To create an atomic vector, we can group values using the combine function c():

quijote_characters <- c("Don Quijote", "Sancho Panza", NA)

Atomic vectors can have almost1 as many elements as you want (including zero).

# length() counts the number of elements in the vector
length(quijote_characters)
[1] 3
length(c())
[1] 0
Coercion

Adding different data types to the same atomic vector does not produce an error. Instead, R automatically follows specific rules to coerce everything inside the vector to be of the same type. If a character string is present in an atomic vector, R will convert all other values to character strings. If a vector only contains logicals and numbers, R will convert the logicals to numbers; every TRUE becomes a 1, and every FALSE becomes a 0. The only values that are not coerced are NAs.

Following these rules helps preserve information. It is easy, for example, to recognize the original types of strings "TRUE" and "3.14", or to transform a vector of 1s and 0s back to TRUEs and FALSEs.

3.2.2 Matrices

Matrices are two-dimensional “sheets” of data. To create a matrix, first give matrix() an atomic vector to reorganize into a matrix. Then, define the number of rows using the nrow argument, or the number of columns using the ncol argument. matrix() will reshape your vector into a matrix with the specified number of rows (or columns).

scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3)
scores_mat
     [,1] [,2] [,3]
[1,]  -27   14    8
[2,]    2  -28   13
[3,]    2   35    4
# Equivalently
scores_mat <- matrix(data = scores_vec, ncol = 3)
scores_mat
     [,1] [,2] [,3]
[1,]  -27   14    8
[2,]    2  -28   13
[3,]    2   35    4

Like atomic vectors, matrices can have any data type, but only one (or NA).

character_mat <- matrix(
    data = c("Mario", "Peach", "Luigi", "Yoshi"), 
    ncol = 2
)
character_mat
     [,1]    [,2]   
[1,] "Mario" "Luigi"
[2,] "Peach" "Yoshi"

By default matrix() will fill up the matrix column by column. But we can fill the matrix row by row if we include the argument byrow = TRUE:

scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3, byrow = TRUE)
scores_mat
     [,1] [,2] [,3]
[1,]  -27    2    2
[2,]   14  -28   35
[3,]    8   13    4

When showing a matrix, R shows expressions with square brackets (e.g., [,1]). The numbers inside the square brackets are positional indices that denote the “coordinates” of the matrix. Two-dimensional objects like matrices have two indices. The first index always refers to the row, and the second always refers to the column. So, as with vectors, we can use square bracket notation [ ] to extract values from matrices.

scores_mat[c(1, 3), 2] # Rows 1 and 3 in column 2
[1]  2 13

We can even extract entire rows or columns.

scores_mat[2,] # Extract entire second row
[1]  14 -28  35
scores_mat[, 1] # Extract entire first column
[1] -27  14   8
Matrices are fancy vectors

Deep down, R thinks of a matrix as a vector folded to look like a square. That means, among other things, that you can reference an element of a vector with a single positional index. Check what happens if you run scores_mat[5].

We can define names for the rows and the columns of a matrix using the rownames() and colnames() functions.

rownames(scores_mat) <- c("Andrew", "Booker", "Comstock")
colnames(scores_mat) <- c("Columbia", "Rapture", "Atlantic")
scores_mat
         Columbia Rapture Atlantic
Andrew        -27       2        2
Booker         14     -28       35
Comstock        8      13        4

Now we can use these names to extract values:

scores_mat["Andrew", c("Rapture", "Columbia")]
 Rapture Columbia 
       2      -27 

There are several useful functions to do matrix operations. For example:

t(scores_mat) # Transpose the matrix
diag(scores_mat) # Extract values in diagonal
scores_mat + scores_mat # Matrix addition
scores_mat * scores_mat # Element-wise multiplication
scores_mat %*% scores_mat # Matrix multiplication

3.2.3 Arrays

An array is a multidimensional object that stacks groups of data. Using 1 dimension in an array forms a column of data with multiple values; using 2 dimensions is like a sheet of paper with several columns of data; 3 dimensions are like a book with several sheets; 4 dimensions are like a box with several books, and so on. Note that layers of an array have consistent sizes. All books have the same number of sheets, and all sheets have the same number of rows and columns.

We can use array() to create an n-dimensional array. The first argument in array() must be a vector with the values that we want to store in the array. The second argument must be a vector where the length denotes the number of dimensions, and the values denote the size of each dimension.

# This is an array with 3 matrices, each with 2 rows and 2 columns
array(c(25:28, 35:38, 45:48), dim = c(2, 2, 3))
, , 1

     [,1] [,2]
[1,]   25   27
[2,]   26   28

, , 2

     [,1] [,2]
[1,]   35   37
[2,]   36   38

, , 3

     [,1] [,2]
[1,]   45   47
[2,]   46   48

The dim argument works from the inside out. The first value is the number of elements in each column, the second value is the number of columns, the third element is the number of matrices, and so on.

Note that the total number of elements in the array is equal to multiplying the sizes of all dimensions. If the vector we use to build the array has a different number of elements, R will discard or recycle values from the vector. Check it yourself.

Applied inception

Try to make an array with 4 dimensions. Following the metaphor from above, try to make an array with 3 books, each of which has 4 sheets with 2 columns and 2 rows each. See a quick solution below.

Code
array(c(1:48), dim = c(2, 2, 4, 3))

Vectors, matrices, and arrays need all of its values to be of the same type. This requirement seems severe, but it allows the computer to store large sets of numbers in a simple and efficient way; and it accelerates computations because it tells R that it can manipulate all values in the object in the same way.

However, we often need to store different types of data in a single place—maybe because all the data belong to the same underlying subject of study. For example, we can describe a dog based on its height, weight, and age (numerical values), and on its color and breed (character strings or factors). R can keep all this information in a single place.

3.2.4 Lists

Lists can group R objects like vectors, arrays, and other lists in a set of one dimension. To create a list, use the function list() and separate each of its elements with a comma.

all_in_one_list <- list(c(3.1, 10), "El Zorro", list(character_mat))
all_in_one_list
[[1]]
[1]  3.1 10.0

[[2]]
[1] "El Zorro"

[[3]]
[[3]][[1]]
     [,1]    [,2]   
[1,] "Mario" "Luigi"
[2,] "Peach" "Yoshi"

The double-bracketed indices tell you which element of the list is displayed. The single-bracketed indices tell you which subelement of an element is displayed. For example, 3.1 is the first subelement of the first element in the list, and "El Zorro" is the first subelement of the second element. This double notation helps us recognize which level of the stacking we are in, regardless of what we stack inside the list.

There are two ways to access an element from a list, depending on what we want to do with the output. We can use single-bracket notation [ ] to get a new list with elements from the original list:

new_list <- all_in_one_list[1]
new_list
[[1]]
[1]  3.1 10.0
typeof(new_list)
[1] "list"

Or we can use double-bracket notation [[ ]] to get only the subelements of an element from the original list (we can not extract multiple elements this way).

new_item <- all_in_one_list[[1]]
new_item
[1]  3.1 10.0
typeof(new_item)
[1] "double"

We can name the elements of a list when we first create it:

countries_info <- list(
    countries = c("Japan", "Egypt", "Mexico", "Finland"),
    temperatures_fahrenheit = c(50, 90, 65, -10),
    speak_spanish = c(FALSE, FALSE, TRUE, FALSE)
)

Or we can name them after creating the list using names():

names(all_in_one_list) <- c("weight", "hero", "game")
all_in_one_list
$weight
[1]  3.1 10.0

$hero
[1] "El Zorro"

$game
$game[[1]]
     [,1]    [,2]   
[1,] "Mario" "Luigi"
[2,] "Peach" "Yoshi"

With a named list, we can also use dollar sign notation $ to extract elements. This notation produces the same result as using the double bracket notation [[ ]]

countries_info$speak_spanish
[1] FALSE FALSE  TRUE FALSE
countries_info[["speak_spanish"]]
[1] FALSE FALSE  TRUE FALSE

3.2.5 Data frames

Data frames are the most common storage structure for data analysis. We can think of them as a group of atomic vectors (columns) of the same length. Usually, each row of a data frame represents an individual observation and each column represents a different measurement or variable of that observation.

Different vectors in a data frame can have different data types, but they must all have the same length. If we use vectors of different lengths, R will recycle values of the shorter vectors to ensure that the data frame has a square shape.

We can create a data frame using the data.frame() function. Give data.frame() any number of vectors of equal length, each separated with a comma. Each vector should be set equal to a name that describes the vector. data.frame() will turn each vector into a column of the new data frame:

aliens_df <- data.frame(
    name = c("Bender", "Fry", "Nibbler", "Zoidberg"),
    species = c("Robot", "Human", "Nibblonian", "Decapodian"), 
    fingers_per_hand = c(3, 4, 3, NA)
)
aliens_df
      name    species fingers_per_hand
1   Bender      Robot                3
2      Fry      Human                4
3  Nibbler Nibblonian                3
4 Zoidberg Decapodian               NA

We can use the function dim() to get the size of each dimension of the data frame, and the function str() to get a compact summary:

dim(aliens_df)
[1] 4 3
str(aliens_df)
'data.frame':   4 obs. of  3 variables:
 $ name            : chr  "Bender" "Fry" "Nibbler" "Zoidberg"
 $ species         : chr  "Robot" "Human" "Nibblonian" "Decapodian"
 $ fingers_per_hand: num  3 4 3 NA

As you can see, str() gives us the data frame dimensions, reminds us that aliens_df is of type data.frame, lists the names and data types of all of the variables (columns) contained in the data frame, and prints the first five values.

To us, a data frame resembles a matrix, but to R it is a list with an attribute “class” set to “data.frame”

typeof(aliens_df)
[1] "list"
class(aliens_df)
[1] "data.frame"

This means that we can extract extract values from data frames the same way we extract from lists. We can use single brackets [] to get a new data frame:

aliens_df[2]
     species
1      Robot
2      Human
3 Nibblonian
4 Decapodian
# Equivalently
aliens_df["species"]
     species
1      Robot
2      Human
3 Nibblonian
4 Decapodian

Or we can use double brackets [[ ]] or dollar sign notation $ to get atomic vectors:

aliens_df$name
[1] "Bender"   "Fry"      "Nibbler"  "Zoidberg"

Also, as with matrices, we can use a single bracket with two indices (note that this produces an atomic vector):

aliens_df[c(1,2), "fingers_per_hand"]
[1] 3 4

Creating data frames from scratch is cumbersome and prone to errors. In the next section, we will see how to import data from different sources into R, as well as basic ways to prepare it for analysis.

3.3 References

Most of this section is based on “Hands-On Programming with R”, by Garret Grolemund; and on “An Introduction to R”, by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.


  1. The maximum length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9. See help(.Machine) for more information.↩︎