3 Data in R

There are multiple types of data that can come in different shapes and sizes. In the previous section we learned how to store, extract, and manipulate numerical vectors. In this section we will learn how R handles different data types and structures, and how to use them to study and summarize data.

3.1 Data types

Data types are classifications of data that help R conform to our intuition: multiplying $2$ by $7$ feels right; multiplying “pineapple” by “stalactite” does not. R has data types double, integer, logical, character, complex, and raw, each with its own rules for storing and handling. Learning these rules now will allow us to analyze data later with less effort and fewer mistakes.

Data type complex is for imaginary numbers, and type raw is for raw bytes of data. It is unlikely that you will ever need these data types, so I will not explain them in these notes.

Doubles are regular numbers with a decimal value (which may be zero). In general, R will save any number that we type as a double.

my_double <- 5
typeof(my_double)

[1] "double"

Integers are numbers that have no decimal component. To create an integer you must type a number followed by an L:

my_integer <- 5L
typeof(my_integer)

[1] "integer"

Many data scientists ignore integers because we can save them as doubles. But R stores integers more precisely and with less memory than doubles. This extra efficiency is helpful when the computations are lengthy or imprecise, or when computer memory is limited.

Logicals are truth values TRUE and FALSE, which are useful when we compare numbers or objects:

my_comparison <- -3 < 1
typeof(my_comparison)

[1] "logical"

Write TRUE and FALSE explicitly

At the beginning of every session, R saves T and F as shortcuts for TRUE and FALSE. But T and F are not reserved words; they are regular objects that we can modify and even delete inadvertently. Fixing an accidental misuse of T or F will cost you more time and effort than whatever you may save by typing a single letter instead of a full word. So, I strongly suggest you always write TRUE and FALSE explicitly.

There is also a special type of logical value called NA, which denotes a missing value.

missing_value <- NA
typeof(missing_value)

[1] "logical"

Treating NA as a logical value allows R to handle missing data in intuitive ways. Consider, for example, that if a value is missing, we cannot know whether it is bigger than a number, or whether it is equal to a word. R reflects this intuition by forcing any comparison involving NA to result in another NA.

NA > 5

[1] NA

NA == "Sancho"

[1] NA

Characters are text like “hello”, “Elvis”, or “Somewhere in La Mancha”; or symbols that we want to handle as text, like “size 45” or “mail/u”. We can create a character object by typing a character or string of characters surrounded by quotes:

my_character <- "Somewhere in La Mancha"
typeof(my_character)

[1] "character"

It is easy to confuse character strings with objects because both look like chunks of text in the code. the_thing and "the_thing", for example, look alike. But the_thing is the name of an object that can contain any type of data, while "the_thing" is a piece of data that we can assign to any name. If we forget the quotation marks when writing a name, R will look for an object that likely doesn’t exist and will return an error.

noquotes

Error: object 'noquotes' not found

Also, notice that R will treat anything surrounded by quotation marks as a character string—regardless of what is between the quotes.

nums_as_chars <- "-5 + 71"
typeof(nums_as_chars)

[1] "character"

Still, we can identify strings because R always shows them surrounded by quotation marks and in a color different from other data types.

typeof("9")

[1] "character"

typeof(9)

[1] "double"

A special type of character string is a factor, which stores categorical information like color or degree of agreement. Factors can only have certain values called “levels”, which may be ordered (e.g., “agree”, “neutral”, “disagree”) or disordered (e.g., “red” or “green”). To create a factor, we can start with a vector of strings and use function as.factor() to convert it to a factor.

my_factor <- as.factor("red")
my_factor

[1] red
Levels: red

Factors look like a simple character string to us. But R stores factors more efficiently than regular characters. In rough terms, R invisibly replaces the words in the factor vector with integers, and associates each integer with a level of the original vector. This allows R to handle the factor internally as if it was a vector of integers, which is fast. Then, whenever we need to look at the data, R swaps the internal integers for their corresponding character values.

3.2 Data structures

A data structure is a way of organizing data that makes it easier for us to manipulate and understand them. I will explain five different data structures, each with its own advantages and limitations: atomic vectors, matrices, arrays, lists, and data frames.

3.2.1 Atomic vectors

Atomic vectors are the most basic type of data structure. They are one-dimensional groups of data where all values must be of the same type. There is only one exception: any vector can include NA as a value regardless of the type of the other values. This internal consistency of vectors makes it easy for us to store values that are supposed to describe the same property.

To create an atomic vector, we can group values using the combine function c():

quijote_characters <- c("Don Quijote", "Sancho Panza", NA)

Atomic vectors can have almost¹ as many elements as you want (including zero).

# length() counts the number of elements in the vector
length(quijote_characters)

[1] 3

length(c())

[1] 0

Coercion

Adding different data types to the same atomic vector does not produce an error. Instead, R automatically follows specific rules to coerce everything inside the vector to be of the same type. If a character string is present in an atomic vector, R will convert all other values to character strings. If a vector only contains logical values and numbers, R will convert the logicals to numbers; every TRUE becomes a 1, and every FALSE becomes a 0. The only values that are not coerced are NAs.

Following these rules helps preserve information. It is easy, for example, to recognize the original types of strings "TRUE" and "3.14", or to transform a vector of 1s and 0s back to TRUEs and FALSEs.

3.2.2 Matrices

Matrices are two-dimensional “sheets” of data. To create a matrix, first give matrix() an atomic vector to reorganize into a matrix. Then, define the number of rows using the nrow argument, or the number of columns using the ncol argument. matrix() will reshape your vector into a matrix with the specified number of rows (or columns).

scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3)
scores_mat

     [,1] [,2] [,3]
[1,]  -27   14    8
[2,]    2  -28   13
[3,]    2   35    4

# Equivalently
scores_mat <- matrix(data = scores_vec, ncol = 3)
scores_mat

     [,1] [,2] [,3]
[1,]  -27   14    8
[2,]    2  -28   13
[3,]    2   35    4

Like atomic vectors, matrices can have any data type, but only one (or NA).

character_mat <- matrix(
    data = c("Mario", "Peach", "Luigi", "Yoshi"), 
    ncol = 2
)
character_mat

     [,1]    [,2]   
[1,] "Mario" "Luigi"
[2,] "Peach" "Yoshi"

By default matrix() will fill up the matrix column by column. But we can fill the matrix row by row if we include the argument byrow = TRUE:

scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3, byrow = TRUE)
scores_mat

     [,1] [,2] [,3]
[1,]  -27    2    2
[2,]   14  -28   35
[3,]    8   13    4

When showing a matrix, R shows expressions with square brackets (e.g., [,1]). The numbers inside the square brackets are positional indices that denote the “coordinates” of the matrix. Two-dimensional objects like matrices have two indices. The first index always refers to the row, and the second always refers to the column. So, as with vectors, we can use square bracket notation [ ] to extract values from matrices.

scores_mat[c(1, 3), 2] # Rows 1 and 3 in column 2

[1]  2 13

We can even extract entire rows or columns.

scores_mat[2,] # Extract entire second row

[1]  14 -28  35

scores_mat[, 1] # Extract entire first column

[1] -27  14   8

Matrices are fancy vectors

Deep down, R thinks of a matrix as a vector folded to look like a square. That means, among other things, that you can reference an element of a vector with a single positional index. Check what happens if you run scores_mat[5].

We can define names for the rows and the columns of a matrix using the rownames() and colnames() functions.

rownames(scores_mat) <- c("Andrew", "Booker", "Comstock")
colnames(scores_mat) <- c("Columbia", "Rapture", "Atlantic")
scores_mat

         Columbia Rapture Atlantic
Andrew        -27       2        2
Booker         14     -28       35
Comstock        8      13        4

Now we can use these names to extract values:

scores_mat["Andrew", c("Rapture", "Columbia")]

 Rapture Columbia 
       2      -27

There are several useful functions to do matrix operations. For example:

t(scores_mat) # Transpose the matrix
diag(scores_mat) # Extract values in diagonal
scores_mat + scores_mat # Matrix addition
scores_mat * scores_mat # Element-wise multiplication
scores_mat %*% scores_mat # Matrix multiplication

3.2.3 Arrays

An array is a multidimensional object that stacks groups of data. Using 1 dimension in an array forms a column of data with multiple values; using 2 dimensions is like a sheet of paper with several columns of data; 3 dimensions are like a book with several sheets; 4 dimensions are like a box with several books, and so on. Note that layers of an array have consistent sizes. All books have the same number of sheets, and all sheets have the same number of rows and columns.

We can use array() to create an n-dimensional array. The first argument in array() must be a vector with the values that we want to store in the array. The second argument must be a vector where the length denotes the number of dimensions, and the values denote the size of each dimension.

# This is an array with 3 matrices, each with 2 rows and 2 columns
array(c(25:28, 35:38, 45:48), dim = c(2, 2, 3))

, , 1

     [,1] [,2]
[1,]   25   27
[2,]   26   28

, , 2

     [,1] [,2]
[1,]   35   37
[2,]   36   38

, , 3

     [,1] [,2]
[1,]   45   47
[2,]   46   48

The dim argument works from the inside out. The first value is the number of elements in each column, the second value is the number of columns, the third element is the number of matrices, and so on.

Note that the total number of elements in the array is equal to multiplying the sizes of all dimensions. If the vector we use to build the array has a different number of elements, R will discard or recycle values from the vector. Check it yourself.

Applied inception

Try to make an array with 4 dimensions. Following the metaphor from above, try to make an array with 3 books, each of which has 4 sheets with 2 columns and 2 rows each. See a quick solution below.

Code

array(c(1:48), dim = c(2, 2, 4, 3))

Vectors, matrices, and arrays need all of its values to be of the same type. This requirement seems severe, but it allows the computer to store large sets of numbers in a simple and efficient way; and it accelerates computations because it tells R that it can manipulate all values in the object in the same way.

However, we often need to store different types of data in a single place—maybe because all the data belong to the same underlying subject of study. For example, we can describe a dog based on its height, weight, and age (numerical values), and on its color and breed (character strings or factors). R can keep all this information in a single place.

3.2.4 Lists

Lists can group R objects like vectors, arrays, and other lists in a set of one dimension. To create a list, use the function list() and separate each of its elements with a comma.

all_in_one_list <- list(c(3.1, 10), "El Zorro", list(character_mat))
all_in_one_list

[[1]]
[1]  3.1 10.0

[[2]]
[1] "El Zorro"

[[3]]
[[3]][[1]]
     [,1]    [,2]   
[1,] "Mario" "Luigi"
[2,] "Peach" "Yoshi"

The double-bracketed indices tell you which element of the list is displayed. The single-bracketed indices tell you which subelement of an element is displayed. For example, 3.1 is the first subelement of the first element in the list, and "El Zorro" is the first subelement of the second element. This double notation helps us recognize which level of the stacking we are in, regardless of what we stack inside the list.

There are two ways to access an element from a list, depending on what we want to do with the output. We can use single-bracket notation [ ] to get a new list with elements from the original list:

new_list <- all_in_one_list[1]
new_list

[[1]]
[1]  3.1 10.0

typeof(new_list)

[1] "list"

Or we can use double-bracket notation [[ ]] to get only the subelements of an element from the original list (we cannot extract multiple elements this way).

new_item <- all_in_one_list[[1]]
new_item

[1]  3.1 10.0

typeof(new_item)

[1] "double"

We can name the elements of a list when we first create it:

countries_info <- list(
    countries = c("Japan", "Egypt", "Mexico", "Finland"),
    temperatures_fahrenheit = c(50, 90, 65, -10),
    speak_spanish = c(FALSE, FALSE, TRUE, FALSE)
)

Or we can name them after creating the list using names():

names(all_in_one_list) <- c("weight", "hero", "game")
all_in_one_list

$weight
[1]  3.1 10.0

$hero
[1] "El Zorro"

$game
$game[[1]]
     [,1]    [,2]   
[1,] "Mario" "Luigi"
[2,] "Peach" "Yoshi"

With a named list, we can also use dollar sign notation $ to extract elements. This notation produces the same result as using the double bracket notation [[ ]]

countries_info$speak_spanish

[1] FALSE FALSE  TRUE FALSE

countries_info[["speak_spanish"]]

[1] FALSE FALSE  TRUE FALSE

3.2.5 Data frames

Data frames are the most common storage structure for data analysis. We can think of them as a group of atomic vectors (columns) of the same length. Usually, each row of a data frame represents an individual observation and each column represents a different measurement or variable of that observation.

Different vectors in a data frame can have different data types, but they must all have the same length. If we use vectors of different lengths, R will recycle values of the shorter vectors to ensure that the data frame has a square shape.

We can create a data frame using the data.frame() function. Give data.frame() any number of vectors of equal length, each separated with a comma. It is usually convenient to set each vector equal to a name that describes the vector. data.frame() will turn each vector into a column of the new data frame:

aliens_df <- data.frame(
    name = c("Bender", "Fry", "Nibbler", "Zoidberg"),
    species = c("Robot", "Human", "Nibblonian", "Decapodian"), 
    fingers_per_hand = c(3, 4, 3, NA)
)
aliens_df

      name    species fingers_per_hand
1   Bender      Robot                3
2      Fry      Human                4
3  Nibbler Nibblonian                3
4 Zoidberg Decapodian               NA

We can use the function dim() to get the size of each dimension of the data frame, and the function str() to get a compact summary:

dim(aliens_df)

[1] 4 3

str(aliens_df)

'data.frame':   4 obs. of  3 variables:
 $ name            : chr  "Bender" "Fry" "Nibbler" "Zoidberg"
 $ species         : chr  "Robot" "Human" "Nibblonian" "Decapodian"
 $ fingers_per_hand: num  3 4 3 NA

As you can see, str() gives us the data frame dimensions, reminds us that aliens_df is of type data.frame, lists the names and data types of all of the variables (columns) contained in the data frame, and prints the first five values.

To us, a data frame resembles a matrix, but to R it is a list with an attribute “class” set to “data.frame”

typeof(aliens_df)

[1] "list"

class(aliens_df)

[1] "data.frame"

This means that we can extract extract values from data frames the same way we extract from lists. We can use single brackets [] to get a new data frame:

aliens_df[2]

     species
1      Robot
2      Human
3 Nibblonian
4 Decapodian

# Equivalently
aliens_df["species"]

     species
1      Robot
2      Human
3 Nibblonian
4 Decapodian

Or we can use double brackets [[ ]] or dollar sign notation $ to get atomic vectors:

aliens_df$name

[1] "Bender"   "Fry"      "Nibbler"  "Zoidberg"

Also, as with matrices, we can use a single bracket with two indices (note that this produces an atomic vector):

aliens_df[c(1,2), "fingers_per_hand"]

[1] 3 4

Creating data frames from scratch is cumbersome and prone to errors. In the next section, we will see how to import data from different sources into R, as well as basic ways to prepare it for analysis.

3.3 References

Most of this section is based on “Hands-On Programming with R”, by Garret Grolemund; and on “An Introduction to R”, by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.

The maximum length (number of elements) of a vector is $2^{31} - 1 \approx 2*10^9$. Run help(.Machine) in your R console to learn more about this.↩︎