4 Basic data processing

Now we can apply our understanding of R to work with pre-made files of data. The first step is to locate our working directory, which is the default location where R will look for files we want to load and where it will put any files we save. This directory is different on each computer, but we can find it by running:

getwd()

[1] "C:/Users/user_name/workshop_folder/learning_r/code"

We can move our working directory to any folder on our computer by writing a new file path inside the function setwd(). I prefer to set my working directory to a folder dedicated exclusively to whichever project I am currently working on. This way, every file related to my project is in the same place. For example:

setwd("C:/Users/user_name/workshop_folder/learning_r/code")

We can also change our working directory by clicking on Session > Set Working Directory > Choose Directory in the RStudio menu bar. The graphical user interfaces in Windows and Mac have similar options. If we start R from a UNIX command line (as on Linux machines), the working directory will be whichever directory we were in when we called R.

list.files() will show us what files are in our working directory. If the file that we want to open is in our working directory, then we are ready to proceed.

4.1 Loading data

Once we can locate files in our computer, we can load them into R. Note, however, that we need specific ways to open different file formats.

4.1.1 Plain text files

A plain-text file stores a table of data in a text document. Each row of the table is saved in its own line, and a simple symbol separates the cells within a row. This symbol is most often a comma, and sometimes a tab or a pipe delimiter |, but it can also be any other character. Each file uses only one symbol to separate cells, which minimizes confusion.

Plain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau and the Social Security Administration) publish their data as plain-text files.

We will work with data from this ¹ plain text file. Use Ctrl+Shift+s to download the file. I will save it in a folder called data_files inside my working directory under the name flower.csv. You can save it wherever you want as long as you can keep track of it.

We can use read.table() to load plain-text files into R. The first argument of read.table() is a string with the name of our file (if it is in your working directory), or with the file path to our file (if it is not in our working directory).

flower_df <- read.table(
    "data_files/flower.csv",
    header = TRUE,
    sep = ",",
    encoding = "UTF-8"
)

In the code above, I added arguments header, sep, and encoding. header tells R whether the first line of the file contains variable names instead of values; this will help us identify the variables in the data frame. sep tells R the symbol that the file uses to separate the cells; this will help us preserve the correct location of the data cells. encoding tells R the type of machine-speak that it needs to understand the symbols in the file; this can prevent having weird, mistaken symbols in our loaded data.

Other useful arguments are skip and nrow. skip tells R to omit a specific number of lines before it starts reading values from the file. This argument is helpful when the file starts with text that is not part of the data set, or when we want to read only part of a data set. nrow tells R to only read a certain number of lines, starting from the top. Keep in mind that nrow does not count the header in the number of rows it reads.

flower_df_chunk <- read.table(
    "data_files/flower.csv", 
    header = TRUE, 
    sep = ",", 
    skip = 0, 
    nrow = 3
)
flower_df_chunk

  treat nitrogen block height weight leafarea shootarea flowers
1   tip   medium     1    7.5   7.62     11.7      31.9       1
2   tip   medium     1   10.7  12.14     14.1      46.0      10
3   tip   medium     1   11.2  12.76      7.1      66.7      10

R has shortcut functions that call read.table() in the background with different default values for popular types of files:

read.table is the general purpose read function.
read.csv reads comma-separated values (.csv) files.
read.delim reads tab-delimited files.
read.csv2 reads .csv files with European decimal format.
read.delim2 reads tab-delimited files with European decimal format.

read.table() and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Make sure to use the web address that links directly to the file, not to a web page that has a link to the file.

There is a special type of plain-text file called fixed-width file (.fwf). Instead of a symbol, a fixed-width file uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.

If our flowers data came in a fixed-width file, the first few lines would look like this:

treat  nitrogen block  height  weight  leafarea  shootarea  flowers
tip    medium   1      7.5     7.62    11.7      31.9       1
tip    medium   1      10.7    12.14   14.1      46.0       10
tip    medium   1      11.2    12.76   7.1       66.7       10
tip    medium   1      10.4    8.78    11.9      20.3       1
tip    medium   1      10.4    13.58   14.5      26.9       4
tip    medium   1      9.8     10.08   12.2      72.7       9

Fixed-width files may be visually intuitive, but computers find them difficult to work with. This may explain why R has a function for reading fixed-width files, but not for creating or saving them.

We can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.

4.1.2 Excel files

The best way to load data from Excel files (.xlsx) is to first save these files as .csv or .txt files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated features that make it difficult for R to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.

Still, it is possible to load Excel files directly if we really need to. R has no native way of loading these files, but we can use the package readxl, which works on Windows, OS X, and Linux. We install it using install.packages("readxl") and then load it using library(readxl). Once we load the package, we can use the function read_excel() to load files of the type .xls and .xlsx (see help("read_excel") for more information).

4.1.3 Files from other programs

As with Excel files, I suggest that you first try to transform files from other programs to plain-text files. This transformation is usually the best way to verify that our data are transcribed properly.

Still, sometimes we can’t transform the file to a plain-text format—maybe because we can’t access the program that created the file (e.g., SAS or SPSS). In these cases, we can resort to one of several libraries:

haven, for reading files from SAS, SPSS, and Stata.
R.matlab for reading files for versions MAT 4 and MAT 5.
foreign for reading minitab and Systat file formats. This library can also read files from SAS, SPSS, and Stata, but I prefer to use haven in these cases.

4.2 Cleaning data

Once we load our data files as data.frames, we should verify that all of the information has an appropriate format. The process of identifying, removing, and correcting inaccurate information is often referred to as data cleaning. We will practice data cleaning using a messy version of the flower data that we loaded above. You can get this messy version from here. Again, you can use Ctrl+Shift+s to download the file.

Since this is a .csv file, we can load it using:

flower_messy_df = read.csv(
    file = "data_files/flower_messy.csv",
    header = TRUE,
    encoding = "UTF-8"
)

First, we should ensure the column names follow the rules we saw in section 1. This will facilitate accessing the data in the columns later. We can check these column names using the colnames() function:

colnames(flower_messy_df)

[1] "Treat"      "Nitrogen"   "block"      "Height"     "Weight"    
[6] "leaf.area"  "shoot.area" "FLOWERS"

If we opened the data file using something like Excel or Notepad, we would see that the names for columns 6 and 7 had blank spaces inside them. When loading the data, read.csv() automatically substitutes these blank spaces with periods ., so that the names conform to R’s conventions. read.csv() is pretty good at checking column names and other things, but it’s not perfect. So, it’s always a good idea to double-check everything ourselves.

The column names of flower_messy_df are legible, but we don’t want to struggle with their mix of upper and lower-case letters. Let’s rewrite all the names in lower case, which is quick and easy if we use tolower().

new_colnames <- tolower(colnames(flower_messy_df)) # Modify column names
new_colnames

[1] "treat"      "nitrogen"   "block"      "height"     "weight"    
[6] "leaf.area"  "shoot.area" "flowers"

These new column names are better, but we still need to change them inside flower_messy_df. Before moving on, let’s create a new data set called flower_clean_df.

flower_clean_df <- flower_messy_df

Using a copy of the original data set makes it easier to track our changes because we can always look back at the original version. It also makes it faster to mend mistakes because it avoids reloading our original data, which can take a long time with large files.

Now we can use our improved column names.

colnames(flower_clean_df) <- new_colnames # Replace column names in data frame
colnames(flower_clean_df) # Verify replacement

[1] "treat"      "nitrogen"   "block"      "height"     "weight"    
[6] "leaf.area"  "shoot.area" "flowers"

The last change to these column names will be to substitute the periods with underscores. In R, this is purely out of personal preference, but it’s a good excuse to meet gsub(), which substitutes patterns of strings:

colnames(flower_clean_df) <- gsub(
    pattern = "\\.", # What we want to remove
    replacement = "_", # What we want to have instead
    x = colnames(flower_clean_df) # The object we want to modify
)
colnames(flower_clean_df)

[1] "treat"      "nitrogen"   "block"      "height"     "weight"    
[6] "leaf_area"  "shoot_area" "flowers"

Note that I had to use "\\." instead of simply "." to match the period. The reason is that gsub() interprets "." as saying “match any character”. This may sound silly but it helps when working with regular expressions—a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to explain here, but if you expect to work with text data regularly, I encourage you to learn more about them.

With our improved column names it will be easier to focus on giving every column an appropriate format: numbers should be of type “double” or “integer”, and text should be of type “character” or “factor”. Let’s check the types of the columns in our current data set.

str(flower_clean_df)

'data.frame':   96 obs. of  8 variables:
 $ treat     : chr  "tip" "tip" "tip" "tip" ...
 $ nitrogen  : chr  "medium" "medium" "medium" "Medium" ...
 $ block     : int  1 1 1 1 1 1 1 1 2 2 ...
 $ height    : num  7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...
 $ weight    : num  7.62 12.14 12.76 8.78 13.58 ...
 $ leaf_area : num  11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...
 $ shoot_area: num  31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...
 $ flowers   : chr  "\"1\"" "10" "10" "1" ...

Column “flowers” seems to contain numbers but is classified as type “character”. The reason is that there are quotes around the first value in this column:

head(flower_clean_df[["flowers"]])

[1] "\"1\"" "10"    "10"    "1"     "4"     "9"

R recognizes that the value itself has quotes, so it adds a backslash \ to differentiate them from the quotes it uses to print strings. Let’s remove these confusing quotes.

flower_clean_df["flowers"] <- gsub(
    pattern = "\"", # \" the backlash tells R to match quotes
    replacement = "", # This is how we write "nothing"
    x = flower_clean_df$flowers # x needs to be a vector, so use
                                # double brackets or dollar sign
)
head(flower_clean_df$flowers)

[1] "1"  "10" "10" "1"  "4"  "9"

Now we can coerce the column “flowers” to be of type “double”.

flower_clean_df["flowers"] <- as.numeric(flower_clean_df$flowers)
typeof(flower_clean_df$flowers)

[1] "double"

head(flower_clean_df$flowers)

[1]  1 10 10  1  4  9

Columns “treat” and “nitrogen” are of type character. This is not wrong, but they will be easier to handle if we convert them to factors.

flower_clean_df["treat"] <- factor(flower_clean_df$treat)
flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
str(flower_clean_df)

'data.frame':   96 obs. of  8 variables:
 $ treat     : Factor w/ 2 levels "notip","tip": 2 2 2 2 2 2 2 2 2 2 ...
 $ nitrogen  : Factor w/ 8 levels "high","High",..: 7 7 7 8 7 7 8 7 7 7 ...
 $ block     : int  1 1 1 1 1 1 1 1 2 2 ...
 $ height    : num  7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...
 $ weight    : num  7.62 12.14 12.76 8.78 13.58 ...
 $ leaf_area : num  11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...
 $ shoot_area: num  31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...
 $ flowers   : num  1 10 10 1 4 9 7 6 5 8 ...

Column “treat” looks fine, but column “nitrogen” looks suspicious. It is supposed to have only three levels (“low”, “medium”, and “high”), but its description counts eight. Let’s examine these levels:

levels(flower_clean_df$nitrogen)

[1] "high"   "High"   "HIGH"   "low"    "lOw"    "Low"    "medium" "Medium"

Remember that R is case sensitive, so it interprets each of spelling “high” and “low” as a different value. We can fix this using tolower() once more. Note that this will convert the “nitrogen” column back to a simple character type, so we have to reconvert it to factor.

flower_clean_df["nitrogen"] <- tolower(flower_clean_df$nitrogen)
flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
levels(flower_clean_df$nitrogen)

[1] "high"   "low"    "medium"

Unless I have a good reason not to, I usually transform all character columns to have only lower case letters.

4.3 Data summaries and visualizations

Now that our data are clean, we can get a complete summary to understand them better. Function summary() recognizes the type of each column and displays some notable features:

summary(flower_clean_df)

   treat      nitrogen      block         height           weight      
 notip:48   high  :32   Min.   :1.0   Min.   : 1.200   Min.   : 5.790  
 tip  :48   low   :32   1st Qu.:1.0   1st Qu.: 4.475   1st Qu.: 9.027  
            medium:32   Median :1.5   Median : 6.450   Median :11.395  
                        Mean   :1.5   Mean   : 6.840   Mean   :12.155  
                        3rd Qu.:2.0   3rd Qu.: 9.025   3rd Qu.:14.537  
                        Max.   :2.0   Max.   :17.200   Max.   :23.890  
   leaf_area       shoot_area        flowers      
 Min.   : 5.80   Min.   :  5.80   Min.   : 1.000  
 1st Qu.:11.07   1st Qu.: 39.05   1st Qu.: 4.000  
 Median :13.45   Median : 70.05   Median : 6.000  
 Mean   :14.05   Mean   : 79.78   Mean   : 7.062  
 3rd Qu.:16.45   3rd Qu.:113.28   3rd Qu.: 9.000  
 Max.   :49.20   Max.   :189.60   Max.   :17.000

Now let’s imagine we want to study the distribution of values for weight. We can use a histogram to check the shape of this distribution.

hist(
    flower_clean_df$weight, 
    breaks = 15,
    xlim = c(5, 25),
    xlab = "Weight",
    main = "Few weights are above 20"
)

Or we can get a simpler description using a box plot.

boxplot(
    flower_clean_df$weight, 
    ylab = "Weight", 
    col = "darkgreen",
    main = "Most weights are between 9 and 15"
)

A single box plot is less descriptive than a histogram. But it is easier to compare box plots to look for big differences between distributions. Let’s compare the distributions of height by nitrogen level:

boxplot(
    height ~ nitrogen,
    data = flower_clean_df, 
    col = c("yellow", "blue", "pink"),
    main = "No clear association between height and nitrogen"
)

Now let’s investigate the relationship between shoot area and leaf area. And let’s check whether this relationship changes depending on the value of treat. We can use a scatter plot with shoot area and leaf area, and we can color each point by their treat value.

plot(
    x = flower_clean_df$leaf_area,
    y = flower_clean_df$shoot_area, 
    col = flower_clean_df$treat,
    main = "Shoot area seems proportional to leaf area",
    xlab = "Leaf area",
    ylab = "Shoot area"
)
# Add a legend to the plot
legend(
    x = "bottomright", 
    legend = levels(flower_clean_df$treat), 
    col = 1:length(levels(flower_clean_df$treat)), 
    pch = 16
)

Now let’s see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13. We can use a mosaic plot for this.

nitrogen_by_treat_table = xtabs(
    formula = ~ nitrogen + treat,
    data = flower_clean_df[which(flower_clean_df$leaf_area > 13),]
)
nitrogen_by_treat_table

        treat
nitrogen notip tip
  high      14  12
  low        7   3
  medium    10   7

mosaicplot(nitrogen_by_treat_table, main = "Nitrogen by treat table")

4.4 Success!

Dear reader, you are now a capable useR. I hope this humble introduction helps you learn more about many different topics in R. Be curious, be bold, and, above all, be patient. Rome wasn’t built in a day. Best of luck!

4.5 References

The subsection on loading data is based on “Hands-On Programming with R”, by Garret Grolemund; and on “An Introduction to R”, by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.

You can find the original file here, courtesy of Douglas et al. (see references).↩︎