Introduction to R and spatstat

Installing R and Downloading Packages

RStudio is a widely used programming environment for the R language. You must install R before installing RStudio. To download R, visit r-project.org. Once R is installed, visit rstudio.com to download RStudio. For a more detailed set of download instructions, see the “To Install R” and “To Install RStudio” sections for your OS at this website. Note that the OS you use doesn’t really matter for R. There are a few differences between Linux and Windows based systems once you get into parallel programming, but they are relatively easy to work around.

Once you have R and RStudio installed on your machine, we will install a few packages that we commonly use for data analysis. First, in the R console (the command line inside RStudio), run:

install.packages('spatstat')

to download the spatstat package. Then, run

install.packages('devtools')
library(devtools)
install_github('aproudian2/rapt')

to download the rapt (R atom probe tomography) package. R is a language built around open-source packages that contain functions to do all sorts of different things. Before you use a package, you first need to install it using the command install.packages('package_name_here'). This command will download the package files to your computer. We downloaded the rapt package directly from GitHub because we haven’t published it as an official CRAN package yet. To access the functions in a package you have installed, you need to run the command library(package_name_here) either in the console or at the beginning of your script. So, to make both rapt and spatstat available to us for this session, run the commands

library(rapt)
library(spatstat)

To explore available packages and their documentation, check out https://www.rdocumentation.org/.

Creating Scripts and Projects

Thus far, we have only used the command line (or console) to execute commands. This usually isn’t the best way to use R, as commands you execute here are not saved in a file that you can edit and rerun. Most of the time, we write code in scripts. These are files in which you can write and save lines of code to run or edit again later. Once you start writing code for a project, it is a good idea to keep all the scripts for that project in one place. Conveniently, RStudio has something called a project for that specific purpose. A project is a directory that holds a group of related files. These directories can even be set up with git so that multiple people can work on them at the same time.

Because the book sections for introducing scripts and projects are relatively short, I will point you to R4DS Section 6 - Workflow: scripts and R4DS Section 8 - Workflow: projects to get familiar with these concepts. If you are interested in setting up a project with git, check out this tutorial.

Let’s go ahead and set up a project and create a new script that we can use for this tutorial:

Create a new project called “R_tutorial” or something similar
Open this project and create a new script called “introduction” or whatever you want to call it

A quick tip about naming things in R is that you want to avoid spaces in names as much as possible. Try to use underscores or dashes instead. This will make reading files into our analysis much easier later.

Once you have created and opened your introduction script, you should see it open in a pane in the top left section of the RStudio window. This pane is called the editor, and it is where you edit and run scripts. Perhaps the most helpful shortcut for coding in the editor is Ctrl+Enter. This key combo will execute whatever line of code your cursor is on at the moment. This shortcut will likely become second nature to you as you get more experience writing scripts and testing them line by line. If you want to run the entire script from beginning to end, click the Source button in the top right of the editor pane. Note that the Run button right next to the Source button will also execute the line your cursor is on, but I recommend using Ctrl+Enter because it is faster.

Here is a screenshot showing about what your RStudio environment should look like with labels for some of the things we have gone over, and more:

If you are curious about how to change the color of your RStudio environment, go to Tools $\rightarrow$ Global Options $\rightarrow$ Appearance and play around with the RStudio theme and Editor theme.

Coding Basics with R

Before we get into coding, I would suggest reading R4DS Section 4 - Workflow: basics to get more comfortable with how code looks in R, how to assign variables, and how to call functions. One thing that isn’t mentioned in this section is commenting. Commenting your code is an important part of the programming process (both for other people looking at your code and for you trying to figure out what the heck you were trying to do when you come back to a script a few weeks later). To comment out a line of code, begin the line with a #. For example:

# This is a comment!

# This is
# multiple lines
# of comments!

Use comments frequently; don’t ever be afraid to explain something in the scripts!

The comparison and logical operators in R, all of which return boolean (TRUE or FALSE) values, are:

symbol	meaning
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	equal to
!=	not equal to
x | y	x OR y
x & y	x AND y
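
As a quick illustration of the operators in the table, here is a minimal sketch. The vector x and the name in_range are made up for this example. Each operator works element by element on a vector and returns TRUE or FALSE for every entry:

```r
x <- c(1, 5, 10)   # example values, chosen only for illustration

x > 4              # FALSE TRUE TRUE
x == 5             # FALSE TRUE FALSE
x != 5             # TRUE FALSE TRUE
x > 4 & x < 8      # AND: FALSE TRUE FALSE
x < 2 | x > 8      # OR:  TRUE FALSE TRUE

# Store a result so it can be reused later (for indexing, for example)
in_range <- x >= 1 & x <= 5
in_range           # TRUE TRUE FALSE
```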

Let’s now get into some data structures in R. The important object classes for base R are:

Vectors
Matrices
Data Frames
Lists

We’ll go over the bare bone basics of each of these data structures, but if you are curious or get stuck working with one of them, a quick google search will result in loads of detailed tutorials and tricks that you can do with each type. After we go through these, we’ll get into some other basic coding tools.

Vectors

A vector is a one-dimensional group of values of the same type (numeric, boolean, character, etc.). To create a vector, you use the c() function. For example, to initialize a vector nums that contains the integers from one to five, you can run:

nums <- c(1, 2, 3, 4, 5)

There are also shortcuts in R that will produce a vector with the same values:

nums <- 1:5
nums <- seq(from = 1, to = 5, by = 1)
nums <- seq(from = 1, to = 5, length.out = 5)

Each of these produces a vector holding the values 1 through 5. To access data in a vector, you need the index of the value you are trying to access. Indexing in R starts at 1. To get the second element of our vector, we can run:

nums[2]

## [1] 2

or, if you want to access a subset of the vector, you can use another vector to index nums:

nums[3:5]

## [1] 3 4 5
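
Two other indexing tricks come up often enough that a small sketch may help. This is only an illustration, and the variable names big, dropped, and n are made up for it. A boolean vector keeps the entries where it is TRUE, and a negative index drops an entry:

```r
nums <- 1:5

big <- nums[nums > 2]     # logical indexing: keep entries greater than 2
big                       # 3 4 5

dropped <- nums[-1]       # negative index: everything except the first entry
dropped                   # 2 3 4 5

n <- length(nums)         # number of entries in the vector
nums[n]                   # last entry: 5
```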

Matrices

A matrix is a two-dimensional version of a vector. It can only hold one data type, but it arranges data in rows and columns instead of in just one dimension. To create a matrix, you use the matrix() function. The first argument data is a vector of all values you want in the matrix, the second argument nrow is the number of rows in the matrix, the third argument ncol is the number of columns in the matrix, and the fourth argument byrow determines the direction in which the matrix is filled with your vector (either by row or by column). For example, if we wanted to create a matrix called mat with two rows and three columns that holds the integers 1-6, we would run:

mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = FALSE)

This fills the matrix column by column, which is the default. To fill the matrix row by row instead, we would run:

mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)

To get an entry in a matrix, you need both the row and column index. To access the first row, second column entry, you would run:

mat[1,2]

## [1] 2

Similar to a vector, you can get a range of entries using vectors in place of single indices. For example:

mat[1:2,2:3]

##      [,1] [,2]
## [1,]    2    3
## [2,]    5    6

If you want to access an entire row or column of the matrix, you just leave that index position blank in your indexing. For example, to get the entire first row of mat, you run:

mat[1, ]

## [1] 1 2 3

You can do all sorts of math with matrices: Adding, subtracting, multiplying, or dividing each element by a scalar a can be achieved by running

a <- 3
mat + a # Add a scalar to each element 
mat - a # Subtract a scalar from each element
mat * a # Multiply each element by a scalar
mat / a # Divide each element by a scalar

You can also do element-wise addition and multiplication if you have two matrices that are the exact same size:

mat2 <- matrix(c(10, 11, 12, 13, 14, 15), nrow = 2, ncol = 3, byrow = TRUE)

mat + mat2 # Element-wise addition 
mat * mat2 # Element-wise multiplication

You can also do matrix multiplication if you have two compatible matrices. Note that the function t() will take the transpose of a matrix:

mat %*% t(mat2) # Matrix multiplication
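
If you are unsure whether two matrices are compatible for matrix multiplication, checking their dimensions with dim() is a quick way to find out. The sketch below repeats the definitions of mat and mat2 so that it runs on its own; prod_mat is a name made up for this example:

```r
mat  <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
mat2 <- matrix(c(10, 11, 12, 13, 14, 15), nrow = 2, ncol = 3, byrow = TRUE)

dim(mat)                    # 2 3  (rows, columns)
dim(t(mat2))                # 3 2  (the transpose swaps rows and columns)

prod_mat <- mat %*% t(mat2) # (2 x 3) times (3 x 2) gives a 2 x 2 matrix
dim(prod_mat)               # 2 2
prod_mat[1, 1]              # 1*10 + 2*11 + 3*12 = 68
```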

Data Frames

A data frame is a more dynamic type of 2D data structure that is designed so that each column can have a different data type (but each entry in a single column is the same type). The idea is that each row of a data frame holds a variety of data about an instance of some process. Because of this, each column in a data frame must contain the same number of entries. Each column of a data frame has a string name, which should be unique. To create a data frame, you use the data.frame() function. Within the function, you assign the column names and data. For example, we could create an employee data frame where each row corresponds to a single person:

employees <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Bob", "Katie", "June", "Mark"),
  salary = c(40000,50000,60000,70000),
  full_time = c(TRUE, FALSE, FALSE, TRUE))

To access the columns of a data frame, you can either use indexing, or you can use the name of the column paired with the $ operator. If we wanted to get a numeric vector containing all of the employee salaries, we could run:

employees[,3]

## [1] 40000 50000 60000 70000

or alternatively

employees$salary

## [1] 40000 50000 60000 70000

Then, to access an individual entry, we need to index by row as well. To get the second entry in the salary field, we can run:

employees[2,3]

## [1] 50000

or alternatively we could run

employees$salary[2]

## [1] 50000

Figuring out which type of indexing to use is very application dependent, and you will figure out which is easier for different uses fairly quickly with experience.

You can easily add another column to a data frame after its creation. You just assign a vector to a column name that doesn’t yet exist. For example, if we wanted to add a column holding each employee’s age, we would run:

employees$age <- c(23, 54, 34, 41)

and the new column is automatically added to the data frame. You can do all sorts of cool extractions of data from data frames. For example, if we want to see the names of all the employees who are older than 30 years old, we can run:

employees$name[employees$age > 30]

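## [1] Katie June  Mark 
## Levels: Bob June Katie Mark

The Levels line appears because versions of R before 4.0 convert strings to factors by default when building a data frame. In R 4.0 and later, the name column stays a character vector, so the same command prints the three names in quotes with no Levels line.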
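
The same kind of boolean indexing can pull out whole rows instead of a single column. Here is a minimal sketch that rebuilds the employees data frame so it runs on its own. The names older and ft_older are made up for this example:

```r
employees <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Bob", "Katie", "June", "Mark"),
  salary = c(40000, 50000, 60000, 70000),
  full_time = c(TRUE, FALSE, FALSE, TRUE))
employees$age <- c(23, 54, 34, 41)

# Keep whole rows (all columns) for employees older than 30
older <- employees[employees$age > 30, ]
nrow(older)                     # 3

# Combine conditions with & and pull out a single column
ft_older <- employees$salary[employees$full_time & employees$age > 30]
ft_older                        # 70000

str(employees)                  # compact summary of column names and types
```

Note the comma in employees[employees$age > 30, ]. The condition goes in the row position, and leaving the column position blank keeps every column.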

Lists

A list is an indexed collection of whatever types of objects you want. You could have a list where the first entry is a number, the second entry is a vector, the third is a data frame, the fourth is a matrix, and the fifth is another list. This data structure is quite helpful for organizing large amounts of data. Say you have a data set with a bunch of different points, and at each point you want to collect data that needs to be stored in a matrix. A good way to keep track of the data for the entire data set would be to make a list out of the matrices.

To make a list, you use the list() function, similar to how you use the data.frame() function. Lists can have named entries, but they don’t need to. To create a list lst of numbers, we would run:

lst <- list(1, 2, 3, 4)

or we could make the same list but name each entry:

lst <- list(number1 = 1, number2 = 2, number3 = 3, number4 = 4)

To access the elements of a list, you need either the name or index of the element. You then access the element by running:

lst$number2

## [1] 2

or alternatively

lst[[2]] # note the double brackets for indexing lists

## [1] 2

Or, if you want to get a subset of the list, still as a list, you can run:

lst[1:3] # note the single brackets for returning a list subset

## $number1
## [1] 1
## 
## $number2
## [1] 2
## 
## $number3
## [1] 3

You can also name the elements of a list after you have created it:

lst <- list(1, 2, 3, c("a","b","c"))
names(lst) <- c('number1', 'number2', 'number3', 'first_3_letters')

Functions in R will often return lists, and they are one of the most useful data structures in R, so it is important to get comfortable with them.
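
To make the "list of matrices" idea from the start of this section concrete, here is a minimal sketch. The data and the names point_data, point1, and so on are hypothetical and only there for illustration:

```r
# Hypothetical data: one 2 x 2 matrix of measurements per point
point_data <- list(
  point1 = matrix(1:4, nrow = 2),
  point2 = matrix(5:8, nrow = 2),
  point3 = matrix(9:12, nrow = 2))

length(point_data)          # number of entries: 3
names(point_data)           # "point1" "point2" "point3"

point_data$point2[1, 2]     # row 1, column 2 of the second matrix: 7
point_data[["point3"]]      # double brackets also accept a name

# Add a new entry after creation by assigning to an unused name
point_data$point4 <- matrix(13:16, nrow = 2)
length(point_data)          # now 4
```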

Factors

One more class of data structure in R is called a factor. We don’t use this type of structure too much in our spatial analysis, but it does occasionally come up. Factors are a way to store categorical data in R. Say you have a vector with a bunch of strings that are either 'blue' or 'red'. If you store this data as a factor, R records the distinct categories (called levels) and stores each entry as an integer code pointing to its level. This makes it easy to count or group entries by category, which can be helpful in some data analysis applications. Because factors are used so infrequently in most of our analysis, I will leave it there for now. However, if you want to learn more about factors, go read R4DS Section 15 - Factors.
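
If you just want to see what a factor looks like, here is a minimal sketch using the blue/red example. The data and the names paint, paint_factor, and counts are made up for this illustration:

```r
paint <- c("blue", "red", "red", "blue", "red")   # made-up example data
paint_factor <- factor(paint)

levels(paint_factor)          # the categories: "blue" "red"
as.integer(paint_factor)      # integer code for each entry: 1 2 2 1 2

counts <- table(paint_factor) # number of occurrences of each category
as.integer(counts["red"])     # 3
```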

Loops

Loops are some of the most important constructs in any coding language. If you are unfamiliar with the concept of a for loop or while loop, I would suggest looking at this tutorial. An example of the syntax of a for loop in R is:

for(i in 1:3){
  print(i)
}

## [1] 1
## [1] 2
## [1] 3

This prints the integers from 1 to 3. The key to for loops in R is giving the loop some sensible vector after the in keyword. The index variable (i in this case) takes the value of each entry in the vector in turn.

An example of the syntax for a while loop in R is:

i <- 1
while(i <= 3){
  print(i)
  i <- i + 1
}

## [1] 1
## [1] 2
## [1] 3

This loop will also print the integers between 1 and 3. The key for while loops in R is to give a sensible boolean stopping condition that will eventually be met to avoid an infinite loop.
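
The vector after the in keyword does not have to hold numbers. The sketch below is only an illustration, and names_vec and name_lengths are made-up names. It loops directly over a character vector, then loops over positions with seq_along() to fill in a result vector:

```r
names_vec <- c("Bob", "Katie", "June")   # made-up example data

# Loop directly over the entries of a character vector
for(name in names_vec){
  print(name)
}

# Loop over positions and fill a pre-allocated result vector
name_lengths <- numeric(length(names_vec))
for(i in seq_along(names_vec)){
  name_lengths[i] <- nchar(names_vec[i])   # number of characters in each name
}
name_lengths                             # 3 5 4
```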

Functions

Functions are a way to call complex sets of operations with a simple command in R. You give a function some input parameters, and it outputs some return value (or plot). A basic example of a function in R is the sum() function, which is built into the base version of R. This function takes in a vector of numbers, and returns their sum as a single numerical value. You call a function by placing the input parameters inside the parentheses after the function name. So, for example, to call the sum function:

nums <- c(1, 2, 3, 4, 5, 6)
a <- sum(nums)
a

## [1] 21

So, the sum of the integers between 1 and 6 is now stored in a variable a, which we can now use to do other things. Loading different packages in R will give you access to the functions in those packages. If you ever want to see the documentation for a function in R, you can run the command ?function_name_here, and the documentation will pop up in the lower right window of RStudio.

Say we wanted to create our own function that does that exact same thing as the built in sum() function. One possible syntax for this would be:

my_sum <- function(x){
  total <- 0
  for(i in seq_along(x)){ # unlike 1:length(x), this also works when x is empty
    total <- total + x[i]
  }
  return(total)
}

If we evaluated this function definition (hit Ctrl+Enter with the cursor somewhere in it), we could then use my_sum() as a function to add the numbers in a vector. Most of the time, the built-in R functions will be faster than any home-brewed functions you write because they often do their computations in C, which is much faster than R. Because of this, try to use built-in R functions as much as possible.
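
To check that a home-made function behaves as expected, it helps to compare it against the built-in version. The sketch below repeats the my_sum() definition (written with seq_along()) so that it runs on its own. It then shows a second function with a default argument; scaled_sum and its argument scale are made up for this example:

```r
my_sum <- function(x){
  total <- 0
  for(i in seq_along(x)){
    total <- total + x[i]
  }
  return(total)
}

nums <- c(1, 2, 3, 4, 5, 6)
my_sum(nums)                 # 21, same as sum(nums)
my_sum(nums) == sum(nums)    # TRUE

# A function whose second argument has a default value
scaled_sum <- function(x, scale = 1){
  return(scale * sum(x))
}
scaled_sum(nums)             # uses the default scale = 1: 21
scaled_sum(nums, scale = 2)  # 42
```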

The `apply` Family of Functions

The apply family of functions in R is a group of powerful functions that can replace for and while loops in many situations. They are among the most frequently used functions in R, and you should definitely learn how to use them. They are a bit complicated to explain, so I won’t here, but I would recommend reading [this tutorial](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family).
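
For a first taste before reading the tutorial, here is a minimal sketch of three members of the family: sapply(), lapply(), and apply(). It is not a full explanation, and the variable names are made up for this example:

```r
nums <- c(1, 2, 3, 4)

# sapply(): call a function on each entry and collect the results in a vector
squares <- sapply(nums, function(x) x^2)
squares                      # 1 4 9 16

# lapply(): same idea, but the results come back as a list
squares_list <- lapply(nums, function(x) x^2)
squares_list[[3]]            # 9

# apply(): call a function on each row (MARGIN = 1) or column (MARGIN = 2) of a matrix
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
row_sums <- apply(mat, 1, sum)
col_sums <- apply(mat, 2, sum)
row_sums                     # 6 15
col_sums                     # 5 7 9
```

Each of these does the work of a for loop that fills in a result one entry at a time, but in a single line.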

Coding Basics with `spatstat`

The spatstat package is one of our most heavily used packages in the Zimmerman group. This package provides the means to perform all sorts of interesting analysis of spatial data sets. The documentation for this package can be found here. There is a book called Spatial Point Patterns: Methodology and Applications with R that was written as a tutorial/textbook to be paired with the spatstat package. This book is an incredible reference for all things related to spatial statistics of point patterns (it is in fact what I am referencing heavily to make these tutorials). Dr. Zimmerman has multiple copies of this book floating around, so it shouldn’t be too hard to get your hands on one.

The large majority of the spatial data sets we analyze in the Zimmerman group are 3-dimensional (3D). Unfortunately, while there is some development in the spatstat package for 3D data sets, it mostly focuses on 2D data. As a result, we have extended many of the 2D functions available in spatstat to our own package: rapt. I will mention the rapt package sporadically through these tutorials, but this section will be mostly an introduction to spatstat.

Point Patterns and `ppp`, `pp3` Objects

The most important object classes in spatstat are ppp and pp3. You can find the R documentation on these classes here (ppp) and here (pp3). These are the classes where point positions for 2D point patterns (ppp) and 3D point patterns (pp3) are stored. A point pattern is a collection of point locations in space. In order to use spatstat functions to analyze a point pattern, the pattern will need to be stored in an object of one of these two classes.

In order to create a ppp or pp3 object, you use the ppp() or pp3() functions, respectively. Say you have the coordinates of a point pattern stored in a data frame named positions with columns x, y, and z, and this data came from a domain of $x, y, z \in [0,5]$. If you want to create a pp3 object out of this data, you can run:

# First, make up the positions data frame (this is good practice for working with data frames!)
positions <- data.frame(x = c(1.1, 4.0, 3.2), y = c(4.7, 3.2, 1.1), z = c(4.3, 1.2, 1.7))

# Now, create the pp3 object
my_pp3 <- pp3(positions$x, positions$y, positions$z, xrange = c(0,5), yrange = c(0,5), zrange = c(0,5))

Similarly, if you wanted to create a ppp object out of just the x and y coordinates of positions, you would run:

my_ppp <- ppp(positions$x, positions$y, xrange = c(0,5), yrange = c(0,5))

We now have our point pattern stored in an object class where we can apply lots of spatstat functions. For example, to plot our point patterns, we can use the plot() function. This is a base function in R, but when you pass a ppp or pp3 object into it, R checks the object’s class and calls the plotting method that spatstat provides for that class. Go ahead and try it out:

plot(my_pp3)

plot(my_ppp)

To plot the pp3 object in an interactive plot where you can drag around to see different angles, you can use the plot3d.pp3() function from the rapt package. Try:

plot3d.pp3(my_pp3)

You should be able to click and drag the plot. This can be very useful for checking out 3D point patterns.

Each ppp or pp3 object is stored as a list whose components hold the coordinates of the points in the pattern, as well as information about the domain of the point pattern. To access the domain of our ppp object my_ppp, we can run

my_ppp$window

## window: rectangle = [0, 5] x [0, 5] units

which returns an owin object containing the 2D window that the point pattern lives in. To access the domain of our pp3 object my_pp3, we can run

my_pp3$domain

## Box: [0, 5] x [0, 5] x [0, 5] units

which returns a box3 object containing the 3D domain that the point pattern lives in. Knowing how to access this domain info will be useful for future analysis.

Helpful Functions for `ppp` and `pp3` Objects

Developing skills with spatstat will come with experience, but there are a few functions I recommend committing to memory right away:

- npoints(): Takes an argument of either a ppp or pp3 object and returns the number of points in the point pattern.

- coords(): Takes an argument of either a ppp or pp3 object and returns a data frame with the coordinates of the points in the pattern.

- volume(): Takes an argument of a box3 object and returns the volume (usually used like: volume(my_pp3$domain)).

- area(): Takes an argument of an owin object and returns the area (usually used like: area(my_ppp$window)).
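
To see these four functions in action, here is a minimal sketch that rebuilds the small example patterns from earlier so it runs on its own. It assumes the spatstat package is installed. The last line, which divides the number of points by the volume, is an extra illustration that I have added here:

```r
library(spatstat)

# Rebuild the example patterns from the previous section
positions <- data.frame(x = c(1.1, 4.0, 3.2), y = c(4.7, 3.2, 1.1), z = c(4.3, 1.2, 1.7))
my_pp3 <- pp3(positions$x, positions$y, positions$z, xrange = c(0,5), yrange = c(0,5), zrange = c(0,5))
my_ppp <- ppp(positions$x, positions$y, xrange = c(0,5), yrange = c(0,5))

npoints(my_pp3)             # number of points: 3
coords(my_pp3)              # data frame with columns x, y, and z
volume(my_pp3$domain)       # 5 * 5 * 5 = 125
area(my_ppp$window)         # 5 * 5 = 25

# Average number of points per unit volume
npoints(my_pp3) / volume(my_pp3$domain)
```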