In this document, we will walk through the different coding procedures used to obtain results for the paper Three Dimensional Cluster Analysis for Atom Probe Tomography Using Ripley’s K function and Machine Learning by Vincent, Proudian, and Zimmerman. Note that most of the computational results presented in the paper were performed on a high performance computing system. The summary below will outline the procedure at a scale that can be run on a personal machine. All referenced files for this page are available here.

Install Neccesary Packages

Install the rapt package (R for Atom Probe Tomography) with all package dependencies. Available on GitHub: https://github.com/aproudian2/rapt. We will also need the rgl, parallel, dplyr, and RColorBrewer packages.

install.packages('devtools')
library(devtools)
install_github('aproudian2/rapt')
library(rapt)
library(rgl)
library(parallel)
library(dplyr)
library(RColorBrewer)

Background and Methods

Simulation of Clusters

The clustersim() function within the rapt package can be used to simulate clustered point patterns following the procedure described in the main body of the paper. Documentation for this function can be found by running the command ?clustersim. Below, we will walk through the process of simulating and visualizing some different clustered point patterns.

First, we need to upload the RCP patterns generated using code by Desmond and Weeks (2009). This code produces two output files: a “FinalConfig” file containing the xyz positions of the center of each sphere in the pattern and a “system” file containing all of the parameters used in the simulation. We can upload these files into a pp3 object using the read.rcp() function.

rcp.upload <- read.rcp('FinalConfig1', 'system1', scaleUp = TRUE, newRadius = 0.5)

We then use the stitch.size() function in rapt to tile the pattern and create a larger volume of RCP points. This tiling is possible due to the periodic boundary conditions imposed on the RCP simulation. We create a \(60 \times 60 \times 60\) volume of RCP points:

rcp.stitched <- stitch.size(rcp.upload, boxSize = c(60,60,60))

Now, for the simplicity of this example, we can use this pattern as both the underlying and cluster centroid patterns for our cluster simulation (typically we would use different RCP pattern realizations). Below we simulate a clustered point pattern using the clustersim() function with parameters \(\eta = 0.1\), \(\mu_R = 4\), \(\rho_1 = 1\), \(\rho_2 = 0\), \(\beta = 0\), and \(\xi = 0\). Note that the plot below is interactive; feel free to drag it around with your mouse and use the scroll wheel to zoom in and out.

cluster.simple <- clustersim(rcp.stitched, rcp.stitched, 0.5,
                              pcp = 0.1, cr = 4, 
                              rho1 = 1, rho2 = 0, 
                              rb = 0, pb = 0,
                              toplot = T)

Below is an example with a slightly more “noisy”" set of cluster parameters: \(\eta = 0.1\), \(\mu_R = 4\), \(\rho_1 = 0.4\), \(\rho_2 = 0.03\), \(\beta = 0.2\), and \(\xi = 0.1\).

cluster.noisy <- clustersim(rcp.stitched, rcp.stitched, 0.5,
                              pcp = 0.1, cr = 4, 
                              rho1 = 0.4, rho2 = 0.03, 
                              rb = 0.2, pb = 0.1,
                              toplot = T)

T(r)-Metric Extraction

Next we will get into the process of calculating \(T(r)\) for a point pattern. The first step in this process is to calculate \(K(r)\) for the point pattern of interest. To do this in R, use the K3est() function from the spatstat library. This function accepts a pp3 object as its main argument; for our simulated point patterns above, this pp3 object is the first entry of the returned lists cluster.simple and cluster.noisy. Another important argument to the K3est() function is the correction parameter, which specifies the form of the edge correction for \(K(r)\). We will always use the “translation” correction for our work. For more information about the different types of edge corrections, we suggest the book Spatial Point Patterns by Baddeley, Rubak, and Turner (2015). The other arguments to K3est() determine the maximum \(r\) value that \(K(r)\) is measured to, and the number of \(r\) values it is measured at along the way.

So, let’s compare the \(K(r)\) signal from the simple and noisy point patterns simulated above:

k.simple <- K3est(cluster.simple[[1]], rmax = 17, nrval = 100, correction = 'translation')
k.noisy <- K3est(cluster.noisy[[1]], rmax = 17, nrval = 100, correction = 'translation')

plot(k.simple$r, k.simple$trans, type = 'l', xlab = 'r', ylab = 'K(r)', lwd = 2, col = 'blue')
lines(k.noisy$r, k.noisy$trans, lwd = 2, col = 'red')
legend(0, 20000, legend = c('Simple', 'Noisy'), col = c('blue', 'red'), lwd = 2)

Now, looking at this plot, we see that the “Simple” signal appears to oscillate up and down with larger amplitude than the “Noisy” signal. Intuitively this should make sense; If we think of variation in the K-function as a measure of clustering, the simple signal should be stronger than the noisy signal, as there is stronger clustering occurring in the simple point pattern.

The two lines shown in the plot above are not the easiest to compare to one another, however, because the magnitude of the differences is relatively small compared to the overall trend of the functions themselves (increasing polynomially). This is where \(T(r)\) becomes useful. But, before we can calculate \(T(r)\), we need to perform random relabeling of a fraction \(\eta = 0.1\) of the UPP to figure out the “expected random signal” of this UPP. We do so in the following code snippet, which utilizes the the pK3est() function and parallel computing to speed up the process (note that we would typically use somewhere in the range of 25000 random relabelings, but for this vignette we will only perform 100). This code may take a few minutes to run.

rrl.data <- pK3est(perc = 0.1, pattern = rcp.stitched, 
                   nEvals = 100, nrval = 100, rmax = 17, 
                   correction="trans", anom=TRUE, sorted=FALSE,
                   cores = 8)

Now, the square root of the expected random signal is contained in the second element of the list returned from pK3est(). Let’s look at how this signal compares to the clustered signals:

plot(k.simple$r, k.simple$trans, type = 'l', xlab = 'r', ylab = 'K(r)', lwd = 2, col = 'blue')
lines(k.noisy$r, k.noisy$trans, lwd = 2, col = 'red')
lines(k.noisy$r, rrl.data[[2]]^2, lwd = 2, col = 'black')
legend(0, 20000, legend = c('Simple', 'Noisy', 'Expected Random'), col = c('blue', 'red', 'black'), lwd = 2)

To get a better comparison, we can instead plot \(T(r) = \sqrt{K_\text{A-obs}(r)} - \sqrt{K_{(N/2)}(r)}\) for each of these signals. Note that \(T(r) = 0\) for the expected random signal by definition.

t.simple <- data.frame(r = k.simple$r, trans = sqrt(k.simple$trans) - rrl.data[[2]])
t.noisy <- data.frame(r = k.noisy$r, trans = sqrt(k.noisy$trans) - rrl.data[[2]])

plot(k.simple$r, t.simple$trans, type = 'l', xlab = 'r', ylab = 'T(r)', lwd = 2, col = 'blue')
lines(k.noisy$r, t.noisy$trans, lwd = 2, col = 'red')
lines(k.noisy$r, rep(0, 100), lwd = 2, col = 'black')
legend(0, -5, legend = c('Simple', 'Noisy', 'Expected Random'), col = c('blue', 'red', 'black'), lwd = 2)

And now we can really see the advantage of using \(T(r)\) as opposed to \(K(r)\); the deviations from the random signal are put on main display with the \(T(r)\) function. These \(T(r)\) signals could have instead been directly calculated using the anomK3est() function from the rapt library. For the simple cluster pattern, this would be done by running:

simple.alt <- anomK3est(pattern = cluster.simple[[1]], toSub = rrl.data[[2]], rmax = 17, nrval = 100)

We can now use the k3metrics() function to extract the \(T(r)\) metrics from the two \(T(r)\) signals shown above.

mets.simple <- k3metrics(t.simple$r[10:100], t.simple$trans[10:100], toplot = T)

Using the toplot = TRUE argument, we get a plot of \(T(r)\) (black) and its first three derivatives (red, purple, and blue correspond to first, second, and third derivative respectively). The points associated with each of the five \(T(r)\) metrics are also highlighted. The actual object returned from this function is a list containing the values of each metric, where the order is [[1]] Tmax, [[2]] Rmax, [[3]] Rdmin, [[4]]Rd3max, [[5]] Tdmin.

mets.simple

## [[1]]
## [1] 20.41701
## 
## [[2]]
## [1] 4.979798
## 
## [[3]]
## [1] 8.070707
## 
## [[4]]
## [1] 7.727273
## 
## [[5]]
## [1] -8.15958

We can calculate the same values for the noisy clustered data:

mets.noisy <- k3metrics(t.noisy$r[10:100], t.noisy$trans[10:100], toplot = T, ylim = c(-2, 6))

mets.noisy

## [[1]]
## [1] 5.820987
## 
## [[2]]
## [1] 4.636364
## 
## [[3]]
## [1] 7.383838
## 
## [[4]]
## [1] 7.040404
## 
## [[5]]
## [1] -1.699889

We have now done enough background to get at the purpose of this paper; to use these \(T(r)\) metrics to make inference about the global properties of the clustered point pattern that lead to them. By inspection, we see that there are differences between the metrics for these two patterns, but how these differences connect to cluster properties is much less clear.

Simulation Results

Generate Training and Testing Sets

The code below contains a scaled-down example workflow for generating the training and testing data sets. In this example, the training set will contain 100 total data points (10 iterations at each of 10 different parameter combinations) and the testing set will contain 25 total data points (one iteration at each of 25 different parameter combinations).

First we set some of the important parameters:

cluster.size <- c(60,60,60) # x, y, z dimensions of the simulated point patterns
nrand <- 10 # Number of distinct parameter combinations
nclust <- 10 # Number of iterations to perform at each parameter combination
s <- 100 # Seed for random number generators
maxr <- 17 # Max r value to calculate K(r) to
nr <- 100 # Number of r values to calculate K(r) at
cores2use <- 5 # Number of cores to use in parallel computation

Now we generate the 100 different parameter combinations and store them in the list params:

set.seed(s) # Set the random seed
params <- list() 
cr <- runif(nrand, min = 2, max = 6.5) # mu_R values
rho1 <- runif(nrand, min = 0.2, max = 1) # rho_1 values
rho2 <- runif(nrand, min = 0, max = 0.03) # tho_2 values
rb <- runif(nrand, min = 0, max = 0.5) # radius blur (aka beta) values
pb <- runif(nrand, min = 0, max = 0.2) # position blur (aka xi) values
for(i in 1:nrand){
  params[[i]] <- c(cr[i], rho1[i], rho2[i], rb[i], pb[i]) # create one big list of parameter combinations
}

params.train <- params # Will use this later

We need the square root of the expected random \(K(r)\) signal, which we can store in a vector named rand.signal. Remember that we calculated this signal in the T(r) and Metric Extraction section. It is stored as the second element of the rrl.data list:

rand.signal <- rrl.data[[2]]

We will now upload all the RCP files needed to generate different simulation iterations at each parameter combination (files are on GitHub):

under.list <- list() # Create list for the UPPs
over.list <- list() # Create list for the patterns used to initialize cluster centroids

for(i in 1:nclust){
  under <- read.rcp(paste('FinalConfig', toString(i), sep=''), 
                    paste('system', toString(i), sep=''),
                    scaleUp = TRUE, newRadius = 0.5)
  over <- under
  under.list[[i]] <- stitch.size(under, boxSize = c(60,60,60)) 
  over.list[[i]] <- stitch.size(over, boxSize = c(60,60,60))
}

# Will use these later
under.list.train <- under.list
over.list.train <- over.list

We next write a function that allows us to easily simulate clustered point patterns and extract \(T(r)\) metrics based on what we have set up thus far

kmet_extract <- function(i, nper, maxr, nr, toSub, pcp = 0.05114235, tol = 0.005){
  res <- matrix(NA, nrow = nper, ncol = 5)
  colnames(res) <- c('Tm', 'Rm', 'Rdm', 'Rddm', 'Tdm')
  set.seed(i)
  seeds <- round(runif(nper, 1, 1e8))
  for(j in 1:nper){
    # Simulate a cluster according to the ith parameter combination
    cluster <- clustersim(under.list[[j]], over.list[[j]], 0.5, 
                          pcp = pcp,
                          cr = params[[i]][1],
                          rho1 = params[[i]][2],
                          rho2 = params[[i]][3],
                          rb = params[[i]][4],
                          pb = params[[i]][5],
                          tol = tol,
                          s = seeds[j])
    if(is.numeric(cluster)){
      res[j,] <- c(NA, NA, NA, NA, NA)
      next
    }
    # Measure T(r) on the pattern
    result <- anomK3est(cluster[[1]], toSub, maxr, nr)
    rvals <- result$r
    tvals <- result$trans
    rvals.new <- rvals[10:length(rvals)]
    tvals.new <- tvals[10:length(rvals)]
    
    # Extract T(r) metrics
    metrics <- k3metrics(rvals.new, tvals.new, F)
    res[j,] <- c(metrics[[1]], metrics[[2]], metrics[[3]], metrics[[4]], metrics[[5]])
    
    # Clear memory
    rm(cluster, result, rvals, tvals, rvals.new, tvals.new)
    gc()
  }
  return(res)
}

Next we initialize a parallel computing cluster (Google “parallel computing in R” if you are unfamiliar).

cl <- makePSOCKcluster(cores2use) # Create a PSOCK cluster containing the number of cores earlier specified
clusterEvalQ(cl, library(rapt)) # Load the rapt and zoo libraries into the parallel computing environment
clusterEvalQ(cl, library(zoo))
clusterExport(cl, c('over.list', 'under.list', 'params')) # Make these objects available in the parallel environment

Now that we have everything set up so nicely, we only need to run one command to get out our results! This might take a few minutes to run.

train.data <- parLapply(cl, 1:nrand, kmet_extract, nclust, maxr, nr, rand.signal, pcp = 0.05114235, tol = 0.005)
stopCluster(cl)

The object train.data is now a list of 10 matrices, where the \(i^\text{th}\) matrix corresponds to the \(i^\text{th}\) parameter combination in the params.train list. Each matrix contains results from the the 10 iterations performed for the associated cluster parameter set. For example, we can look at the first parameter combination set and the 10 data points corresponding to it:

params.train[[1]]

## [1] 3.38494750 0.69999718 0.01607433 0.24415300 0.06613211

train.data[[1]]

##             Tm       Rm      Rdm     Rddm       Tdm
##  [1,] 20.45882 5.323232 8.757576 7.040404 -4.383947
##  [2,] 19.43588 5.151515 8.585859 6.868687 -4.476304
##  [3,] 19.21325 4.979798 8.757576 7.555556 -4.772993
##  [4,] 21.45222 5.494949 9.444444 7.383838 -4.633885
##  [5,] 22.68797 5.838384 9.616162 9.787879 -4.488128
##  [6,] 18.59852 4.979798 8.070707 6.868687 -4.599503
##  [7,] 19.75045 5.151515 9.101010 7.383838 -4.651663
##  [8,] 18.84017 4.979798 8.585859 7.040404 -4.795017
##  [9,] 20.10392 5.151515 8.585859 7.040404 -4.406182
## [10,] 20.68984 5.494949 9.444444 7.555556 -4.440929

The above matrix shown the \(T(r)\) metrics measured on each of the ten clustered point patterns simulated with these parameters. Notice that they are all similar, but not exactly the same. This is why we do multiple simulations at each parameter combination set; to capture this variation and incorporate it into our models.

We also need to generate our testing data set, which is 25 simulations each at a different combination of parameters. We follow a very similar procedure to the one outlined above for generating the test set. Once again, this may take a few minutes to run:

nrand <- 25 # Number of distinct parameter combinations
nclust <- 1 # Number of iterations to perform at each parameter combination
s <- 101 # Seed for random number generators

set.seed(s) 
params <- list() 
cr <- runif(nrand, min = 2, max = 6.5) # mu_R values
rho1 <- runif(nrand, min = 0.2, max = 1) # rho_1 values
rho2 <- runif(nrand, min = 0, max = 0.03) # tho_2 values
rb <- runif(nrand, min = 0, max = 0.5) # radius blur (aka beta) values
pb <- runif(nrand, min = 0, max = 0.2) # position blur (aka xi) values
for(i in 1:nrand){
  params[[i]] <- c(cr[i], rho1[i], rho2[i], rb[i], pb[i]) # create one big list of parameter combinations
}

params.test <- params

over.inds <- sample(1:10, nrand, replace = TRUE)
under.inds <- sample(1:10, nrand, replace = TRUE)

over.list <- over.list.train[over.inds]
under.list <- under.list.train[under.inds]

kmet_extract_test <- function(i, maxr, nr, toSub, pcp = 0.05114235, tol = 0.005){
  res <- t(as.matrix(rep(NA, 5)))
  colnames(res) <- c('Tm', 'Rm', 'Rdm', 'Rddm', 'Tdm')
  
  cluster <- clustersim(under.list[[i]], over.list[[i]], 0.5, 
                        pcp = pcp,
                        cr = params[[i]][1],
                        rho1 = params[[i]][2],
                        rho2 = params[[i]][3],
                        rb = params[[i]][4],
                        pb = params[[i]][5],
                        tol = tol,
                        s = i)
  if(is.numeric(cluster)){
    res <- c(NA, NA, NA, NA, NA)
    return(res)
  }
  
  # Measure T(r) on the pattern
  result <- anomK3est(cluster[[1]], toSub, maxr, nr)
  rvals <- result$r
  tvals <- result$trans
  rvals.new <- rvals[10:length(rvals)]
  tvals.new <- tvals[10:length(rvals)]
  
  # Extract T(r) metrics
  metrics <- k3metrics(rvals.new, tvals.new, F)
  res <- c(metrics[[1]], metrics[[2]], metrics[[3]], metrics[[4]], metrics[[5]])
  
  # Clear memory
  rm(cluster, result, rvals, tvals, rvals.new, tvals.new)
  gc()
  
  return(res)
}

# Set up parallel cluster
cl <- makePSOCKcluster(cores2use) 
clusterEvalQ(cl, library(rapt)) 
clusterEvalQ(cl, library(zoo))
clusterExport(cl, c('over.list', 'under.list', 'params')) 

# Perform simulations and metric extraction in parallel
sim.res.test <- parLapply(cl, 1:nrand, kmet_extract_test, maxr, nr, rand.signal, pcp = 0.05114235, tol = 0.005)
test.data <- matrix(unlist(sim.res.test), nrow = 25, byrow = T)
colnames(test.data) <- c('Tm', 'Rm', 'Rdm', 'Rddm', 'Tdm')
stopCluster(cl)

Now we have the testing data, which we will use to evaluate our models, stored in a matrix named test.data:

head(test.data)

##             Tm       Rm       Rdm      Rddm       Tdm
## [1,] 29.959066 5.323232  9.616162  7.212121 -8.384301
## [2,]  2.648401 2.404040  3.949495  2.747475 -2.375117
## [3,] 20.660073 6.525253 10.989899  9.444444 -4.414132
## [4,] 23.194387 7.383838 13.050505  9.272727 -3.637428
## [5,] 35.773989 7.555556 12.020202 10.989899 -4.200462
## [6,] 15.327732 5.666667  9.272727  8.757576 -2.937881

Training and Evaluating Models

Now that we have our training (train.data) and testing (test.data) data sets generated, we can get into actually training and evaluating our different models.

First, there are a few packages that we need to utilize for training of these models. Specifically, the caret package, which provides access to a wide variety of machine learning tools, and the doParallel package, which we will use to parallelize the training of a random forest model. Make sure to install these packages if you don’t already have them.

library(caret)
library(doParallel)

First, we create a big data frame for the training data, where each row contains a single paring of the input simulation parameters from params.train and corresponding \(T(r)\)-metrics from train.data. We call this data frame train.df.

train.df <- matrix(NA, nrow = length(train.data)*nrow(train.data[[1]]), ncol = 10)

cnt <- 1
for(i in 1:length(train.data)){
  for(j in 1:nrow(train.data[[1]])){
    train.df[cnt, 1:5] <- params.train[[i]]
    train.df[cnt, 6:10] <- train.data[[i]][j,]
    cnt <- cnt + 1
  }
}

train.df <- as.data.frame(train.df)
names(train.df) <- c("cr","rho1","rho2","rb","pb","Tm","Rm","Rdm","Rddm","Tdm")

Next, we need to add a column to this data frame containing the weighted radius \(R_w\) of the pattern, which is calculated from \(\mu_R\) and \(\sigma_R\) as follows:

\[R_w = \frac{\mu_R^4 + 6 \mu_R^2 \sigma_R^2 + 3 \sigma_R^4}{\mu_R^3 + 3 \mu_R \sigma_R^2}\]

train.df$sigma <- train.df$cr*train.df$rb
train.df$rw <- (train.df$cr^4 + 6*train.df$cr^2*train.df$sigma^2 + 3*train.df$sigma^4)/(train.df$cr^3 + 3*train.df$cr*train.df$sigma^2)

Next, we get rid of any “incomplete cases” in the training data; that is, we remove any rows of train.df that have NA values (cluster simulations occasionally return NA values if the global type-\(A\) point percentage isn’t within the specific tolerance level). We can’t use these data points to help us train the models, so we just remove them.

train.df.complete <- train.df[complete.cases(train.df),]

Next, we use the preProcess() function from the caret package to perform principal component analysis on our \(T(r)\)-metrics. We add the principal components as additional columns to the train.df.complete data frame.

mets <- c("Tm", "Rm" ,"Rdm", "Rddm", "Tdm")
pp.train <- preProcess(train.df.complete[,mets], method = c("scale", "center", "pca", "BoxCox"), thresh = 1)
data.train.pca <- predict(pp.train, train.df.complete[,mets])
train.df.complete <- data.frame(train.df.complete, data.train.pca)

Let’s look at a few rows of the train.df.complete data frame to see the information we have collected about each data point in our training set:

head(train.df.complete)

##         cr      rho1       rho2       rb         pb       Tm       Rm      Rdm
## 1 3.384947 0.6999972 0.01607433 0.244153 0.06613211 20.45882 5.323232 8.757576
## 2 3.384947 0.6999972 0.01607433 0.244153 0.06613211 19.43588 5.151515 8.585859
## 3 3.384947 0.6999972 0.01607433 0.244153 0.06613211 19.21325 4.979798 8.757576
## 4 3.384947 0.6999972 0.01607433 0.244153 0.06613211 21.45222 5.494949 9.444444
## 5 3.384947 0.6999972 0.01607433 0.244153 0.06613211 22.68797 5.838384 9.616162
## 6 3.384947 0.6999972 0.01607433 0.244153 0.06613211 18.59852 4.979798 8.070707
##       Rddm       Tdm     sigma       rw         PC1       PC2         PC3
## 1 7.040404 -4.383947 0.8264451 3.929064 1.028896473 0.6288903 -0.38345086
## 2 6.868687 -4.476304 0.8264451 3.929064 1.195579772 0.7039168 -0.31723017
## 3 7.555556 -4.772993 0.8264451 3.929064 1.041780552 0.7946626 -0.03237741
## 4 7.383838 -4.633885 0.8264451 3.929064 0.669275933 0.6695120 -0.35523918
## 5 9.787879 -4.488128 0.8264451 3.929064 0.001140357 0.2907124  0.20045686
## 6 6.868687 -4.599503 0.8264451 3.929064 1.388553260 0.8146326 -0.17092912
##           PC4          PC5
## 1 -0.10314279 -0.009007408
## 2 -0.05403028 -0.030043902
## 3 -0.08461954 -0.192988479
## 4  0.02224586 -0.108415379
## 5 -0.42225523 -0.154576207
## 6 -0.12105714  0.021722692

Now that we have all this information collected, we can use it to actually train our models! We will start by training three different regression models for \(R_w\): a generalized linear model (GLM), a Bayesian regularized neural network model (BRNN), and a random forest model (RF). We do so using methods from the caret package. We utilize 10-fold cross-validation to avoid over fitting of the training set.

out.param <- 'rw'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
rw.models <- list()

rw.models[[1]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "glm", trControl = trainControl("cv", number = 10))

rw.models[[2]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "brnn", trControl = trainControl("cv", number = 10), tuneGrid = data.frame('neurons' = c(5, 7, 9)))

ncores <- 5
cl <- makeCluster(ncores)
registerDoParallel(cl)
rw.models[[3]] <- train(train.df.complete[,predictors],train.df.complete[,out.param],method = "parRF", trControl = trainControl("cv", number = 10))
stopCluster(cl)
registerDoSEQ()

names(rw.models) <- c("glm","brnn","rf")

Next, we evaluate the different \(R_w\) models using root mean square error of prediction (RMSEP), defined as

\[RMSEP = \sqrt{\frac{1}{n}\sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}\,\, ,\] where \(n\) is the size of the testing data set, \(y_i\) is the true value of the desired response variable (\(R_w\) in this case), and \(\hat{y}_i\) is the corresponding model estimate. Smaller values of RMSEP means that the model does a better job at estimating the true value of the response.

To calculate RMSEP for each \(R_w\) model, we first need to create a data frame with the complete cases of test data similar to how we did with the training data, including all of the true parameters, \(T(r)\) metrics, and principal components.

param.matrix <- matrix(unlist(params.test), nrow = 25, byrow = T)
test.df <- cbind(param.matrix, test.data) # Combine parameters and T(r)-metrics
test.df <- as.data.frame(test.df)
names(test.df) <- c("cr","rho1","rho2","rb","pb","Tm","Rm","Rdm","Rddm","Tdm")

test.df.complete <- test.df[complete.cases(test.df),] # Only keep complete cases

# Calculate R_w for each cluster simulation
test.df.complete$sigma <- test.df.complete$cr * test.df.complete$rb
test.df.complete$rw <- (test.df.complete$cr^4 + 6* test.df.complete$cr^2 * test.df.complete$sigma^2 + 3*test.df.complete$sigma^4)/
  (test.df.complete$cr^3 + 3*test.df.complete$cr*test.df.complete$sigma^2)

test.pca <- predict(pp.train, test.df.complete[,mets]) # Perform PCA in same way as performed on training data
test.df.complete <- data.frame(test.df.complete, test.pca) # Smoosh everything together into one big data frame

Now that we have the testing set data frame all set up, we can calculate RMSEP from the models contained in the list rw.models and select the smallest one:

predictors <- c("PC1","PC2","PC3","PC4","PC5")
out.param <- 'rw'

toRound <- 6
model.RMSEP <- lapply(rw.models, function(x){
  a <- predict(x, newdata = test.df.complete[,predictors])
  RMSEP <- sqrt(mean((a-test.df.complete[,out.param])^2))
  print(x$method)
  print(paste('RMSEP: ', toString(round(RMSEP,toRound)), sep = ''))
})

## [1] "glm"
## [1] "RMSEP: 0.604321"
## [1] "brnn"
## [1] "RMSEP: 1.240015"
## [1] "parRF"
## [1] "RMSEP: 0.820562"

So, based on our small example, the GLM does best at estimating \(R_w\). As the size of the training data set is increased, the BRNN and RF models will begin to perform better than the GLM, and the BRNN will slightly edge out the RF as the better model (as seen in the main paper results, which use a training set containing 100,000 data points instead of the 100 used here).

Now that we have a \(R_w\) model selected, we can run very similar code to train and evaluate models for \(\rho_1\) and \(\rho_2\):

out.param <- 'rho1'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
rho1.models <- list()

rho1.models[[1]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "glm", trControl = trainControl("cv", number = 10))

rho1.models[[2]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "brnn", trControl = trainControl("cv", number = 10), tuneGrid = data.frame('neurons' = c(5, 7, 9)))

ncores <- 5
cl <- makeCluster(ncores)
registerDoParallel(cl)
rho1.models[[3]] <- train(train.df.complete[,predictors],train.df.complete[,out.param],method = "parRF", trControl = trainControl("cv", number = 10))
stopCluster(cl)
registerDoSEQ()

names(rho1.models) <- c("glm","brnn","rf")

predictors <- c("PC1","PC2","PC3","PC4","PC5")
out.param <- 'rho1'
toRound <- 6
model.RMSEP <- lapply(rw.models, function(x){
  a <- predict(x, newdata = test.df.complete[,predictors])
  RMSEP <- sqrt(mean((a-test.df.complete[,out.param])^2))
  print(x$method)
  print(paste('RMSEP: ', toString(round(RMSEP,toRound)), sep = ''))
})

## [1] "glm"
## [1] "RMSEP: 4.619748"
## [1] "brnn"
## [1] "RMSEP: 4.773576"
## [1] "parRF"
## [1] "RMSEP: 4.482717"

For \(\rho_1\), the RF model has smallest RMSEP.

We do the same for \(\rho_2\).

out.param <- 'rho2'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
rho2.models <- list()

rho2.models[[1]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "glm", trControl = trainControl("cv", number = 10))

rho2.models[[2]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "brnn", trControl = trainControl("cv", number = 10), tuneGrid = data.frame('neurons' = c(5, 7, 9)))

ncores <- 5
cl <- makeCluster(ncores)
registerDoParallel(cl)
rho2.models[[3]] <- train(train.df.complete[,predictors],train.df.complete[,out.param],method = "parRF", trControl = trainControl("cv", number = 10))
stopCluster(cl)
registerDoSEQ()

names(rho2.models) <- c("glm","brnn","rf")

predictors <- c("PC1","PC2","PC3","PC4","PC5")
out.param <- 'rho2'
toRound <- 6
model.RMSEP <- lapply(rw.models, function(x){
  a <- predict(x, newdata = test.df.complete[,predictors])
  RMSEP <- sqrt(mean((a-test.df.complete[,out.param])^2))
  print(x$method)
  print(paste('RMSEP: ', toString(round(RMSEP,toRound)), sep = ''))
})

## [1] "glm"
## [1] "RMSEP: 5.079752"
## [1] "brnn"
## [1] "RMSEP: 5.25549"
## [1] "parRF"
## [1] "RMSEP: 4.965455"

For \(\rho_2\), the RF model has smallest RMSEP.

Visualizing Results

We have selected the best model for each cluster parameter, but we have yet to truly evaluate whether or not these models perform well. We can do this by visualizing the model estimate errors.

We will first visualize the results for the \(R_w\) GLM model, as this was the \(R_w\) model with smallest RMSEP in the previous section. To start, we need to calculate a vector of the model residuals (difference between model estimates and true values) from the test data set, which we will store in a vector named diff. We can then calculate the percent error associated with each residual, which we store in a vector named diff.perc.

out.param <- 'rw'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
pred.full <- extractPrediction(rw.models[1], testX = test.df.complete[,predictors], testY = test.df.complete[,out.param]) # Extract predictions from the GLM model 
pred <- pred.full[pred.full$dataType == 'Test',] # Select only the test data predictions
diff <- pred$obs - pred$pred # Calculate difference between true values and estimates
diff.perc <- diff*100/pred$obs # Calculate percent error between true values and estimates

Next we will plot the estimated \(R_w\) values from the test set versus the corresponding true simulation input values (i.e. the estimates versus the true values). We will color code the plot according to percent error percentile:

est_plot <- function(diff.perc, percs, cols, ...){
  sorted <- data.frame('obs' = pred$obs[order(abs(diff.perc))], 'pred' = pred$pred[order(abs(diff.perc))])
  n <- nrow(sorted)
  
  plot(sorted$obs, sorted$pred, col = 'black', pch = 16, type = 'n', ...)
  
  vals.subed <- list('all' = sorted)
  ind <- rep(0, length(percs)+1)
  for(i in 1:length(percs)){
    ind[i+1] <- round(n*percs[i]/100)
    vals.subed[[i+1]] <- sorted[(ind[i]+1):ind[i+1],]
    n.i <- nrow(vals.subed[[i+1]])
    points(vals.subed[[i+1]]$obs, vals.subed[[i+1]]$pred, pch = 16, col = cols[i])
  }
  ind.last <- (tail(ind, n = 1)+1):n
  points(vals.subed$all$obs[ind.last], vals.subed$all$pred[ind.last], col = 'black', pch = 16)
  
  abline(0, 1, col = 'black', lty = 2, lwd = 1.5)
}

est_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"), 
         xlab = 'Simulation Input R_w', ylab = 'Estimated R_w',
         main = 'Weighted Radius Model Estimates vs True Values')
legend(2.5, 6.5, legend = sapply(c(50, 75, 90), function(x){paste(toString(x), '%',sep = '')}), 
         pch = 16, col = brewer.pal(3, "Set1"), title = "Error Percentile")

From this plot, we see which estimates from our test data set fall close to their true values (an exact match would fall on the black dotted line, which is the line \(y = x\)).

Next, we can visualize the percent errors of each estimate from the test data set for the \(R_w\) model:

error_plot <- function(diff.perc, percs, cols, ...){
  n <- length(diff.perc)
  diff.perc.sorted <- sort(abs(diff.perc))
  xs <- (1:n)*100/n
  
  plot(xs, diff.perc.sorted, xlab = 'Error Percentile', type = 'n', ...)
  vals <- list('all' = data.frame('perc' = xs, 'error' = diff.perc.sorted))
  
  ind <- rep(0, length(percs)+1)
  for(i in 1:length(percs)){
    ind[i+1] <- round(n*percs[i]/100)
    vals[[i+1]] <- vals$all[(ind[i]+1):ind[i+1],]
    n.i <- nrow(vals[[i+1]])
    points(vals[[i+1]]$perc, vals[[i+1]]$error, pch = 16, col = cols[i], cex = 1)
    segments(-10, tail(vals[[i+1]]$error, n = 1), tail(vals[[i+1]]$perc, n = 1), tail(vals[[i+1]]$error, n = 1),
             col = cols[i], lwd = 2, lty = 2)
    segments(tail(vals[[i+1]]$perc, n = 1), -10, tail(vals[[i+1]]$perc, n = 1), tail(vals[[i+1]]$error, n = 1),
             col = cols[i], lwd = 2, lty = 2)
  }
  ind.last <- (tail(ind, n = 1 )+1):n
  points(vals$all$perc[ind.last], vals$all$error[ind.last], col = 'black', pch = 16, cex = 1)
  legend(5, max(vals$all$error)*0.98, legend = sapply(percs, function(x){paste(toString(x), '%',sep = '')}), 
         pch = 16, col = cols, title = "Error Percentile")
}

par(mar = c(4, 4, 2, 2), mgp = c(2.3, 1, 0))
error_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"), yaxt = 'n', 
           main = 'Weighted Radius Model Estimate Percent Errors',
           ylab = 'Percent Error')
axis(side = 2, at = seq(0, 80, 5), labels = T)

From this plot, we see that 50% of our estimates fall within 7% error of their corresponding true values, 75% of estimates fall within 10% error, and 90% of estimates fall within 20% error. This gives us a more holistic idea about the performance of our \(R_w\) model.

We can make similar plots for both the \(\rho_1\) and \(\rho_2\) models:

out.param <- 'rho1'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
pred.full <- extractPrediction(rho1.models[3], testX = test.df.complete[,predictors], testY = test.df.complete[,out.param]) # Predictions from the RF model
pred <- pred.full[pred.full$dataType == 'Test',] 
diff <- pred$obs - pred$pred 
diff.perc <- diff*100/pred$obs

est_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"), 
         xlab = 'Simulation Input rho_1', ylab = 'Estimated rho_1',
         main = 'Rho_1 Model Estimates vs True Values')
legend(0.21, 0.8, legend = sapply(c(50, 75, 90), function(x){paste(toString(x), '%',sep = '')}), 
         pch = 16, col = brewer.pal(3, "Set1"), title = "Error Percentile")

error_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"), yaxt = 'n', 
           main = 'Rho_1 Model Estimate Percent Errors',
           ylab = 'Percent Error')
axis(side = 2, at = seq(0, 105, 5), labels = T)

For the \(\rho_1\) model, we see that 50% of estimates fall within 15% error, 75% of estimates fall within about 30% error, and 90% of estimates fall within 65% error.

Lastly, we can look at plots from the \(\rho_2\) model. In these plots, we use absolute error in place of percent error, because values of \(\rho_2 \approx 0\) make percent error blow up to extremely large numbers:

out.param <- 'rho2'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
pred.full <- extractPrediction(rho2.models[3], testX = test.df.complete[,predictors], testY = test.df.complete[,out.param]) # Predictions from the RF model
pred <- pred.full[pred.full$dataType == 'Test',] 
diff <- abs(pred$obs - pred$pred)

est_plot(diff, c(50, 75, 90), brewer.pal(3, "Set1"), 
         xlab = 'Simulation Input rho_2', ylab = 'Estimated rho_2',
         main = 'Rho_1 Model Estimates vs True Values')
legend(0.002, 0.021, legend = sapply(c(50, 75, 90), function(x){paste(toString(x), '%',sep = '')}), 
         pch = 16, col = brewer.pal(3, "Set1"), title = "Error Percentile")

error_plot(diff, c(50, 75, 90), brewer.pal(3, "Set1"), yaxt = 'n',
           main = 'Rho_2 Model Estimates vs True Values',
           ylab = 'Absolute Error')
axis(side = 2, at = seq(0, 0.01, 0.001), labels = T)

Experimental Data

Here we will work through the process of performing the \(K(r)\)-based parameter estimation method on a real APT reconstruction. We will work with a ‘.pos’ file and associated ‘.rrng’ file, both of which are generated in and exported from Cameca’s IVAS software. We upload these files into R using the readPOS() and readRRNG() functions in rapt.

pos <- readPOS('example.pos')
rng <- readRRNG('example.rrng')

Next we can use the rngPOS() function to subset our pos object down to only those points whose mass are within the ranges provided in the rng object. We then use the rngCount() function to count the number of points of each species type within our ranged reconstruction:

pos.r <- rngPOS(pos, rng)
cnts <- rngCount(pos.r, rng)
cnts.totals <- group_by(cnts, name) %>%
               summarize(counts = sum(counts), fraction = sum(fraction))

cnts.totals

## Registered S3 method overwritten by 'cli':
##   method     from    
##   print.boxx spatstat

## # A tibble: 3 x 3
##   name   counts fraction
##   <fct>   <int>    <dbl>
## 1 Al1   1125200   0.949 
## 2 Mg1     27643   0.0233
## 3 Zn1     33004   0.0278

The table above shows the different elements specified within the range file, along with their counts and atomic percentages within the ranged reconstruction. As in the paper, we will treat Mg and Zn as a single clustering species. We can compute this combined species total atomic fraction in the sample:

frac <- sum(cnts.totals$fraction[2:3])
frac

## [1] 0.05114235

And we see that 5.11% of the atoms in our sample are of the clustering species. Next, we use the createSpat() function to turn our ranged position file into a pp3 object. We create a marked pp3 object for the entire pattern, as well as an un-marked pp3 object containing only points of the clustering species.

# Define a domain for the pp3 object
win.r <- box3(c(min(pos.r$x), max(pos.r$x)), 
            c(min(pos.r$y), max(pos.r$y)), 
            c(min(pos.r$z), max(pos.r$z)))

# Full marked pattern
pp3.full <- createSpat(pos.r, win = win.r)
marks(pp3.full) <- pos.r$mark

# Only clustering species points
pp3.aggregates <- subset(pp3.full, marks == 'Mg1' | marks == 'Zn1' )

Next, we scale our reconstruction by a factor of \(\alpha = 2.832\), for reasons discussed in the main paper. Also for consistency with the simulated data, we translate the coordinates so that the domain of our sample is bound within the first quadrant:

alpha <- 2.831518

# Scale coordinates
coo <- coords(pp3.full)
coo.scaled <- coo*alpha

# Translate coordinates
shifts <- c((domain(pp3.full)$xrange*alpha)[1], 
          (domain(pp3.full)$yrange*alpha)[1], 
          (domain(pp3.full)$zrange*alpha)[1])
coo.scaled$x <- coo.scaled$x - shifts[1]
coo.scaled$y <- coo.scaled$y - shifts[2]
coo.scaled$z <- coo.scaled$z - shifts[3]

# Translate domain
win.scaled <- box3((domain(pp3.full)$xrange*alpha) - shifts[1], 
                   (domain(pp3.full)$yrange*alpha) - shifts[2], 
                   (domain(pp3.full)$zrange*alpha) - shifts[3])

# Re-create scaled and shifted pp3 objects
pp3.full.scaled <- pp3(coo.scaled$x, coo.scaled$y, coo.scaled$z, win.scaled)
marks(pp3.full.scaled) <- pos.r$mark

pp3.agg.scaled <- subset(pp3.full.scaled, marks == 'Zn1' | marks == 'Mg1')

Now that our APT data is in the correct format, we are almost at the point where we can measure \(T(r)\) on the clustering species. However, we first need to measure the expected random signal for this UPP (because the UPP is no longer the RCP pattern used for simulations, we need to re-calculate what a random signal looks like). We will do this using 100 random relabelings of 5.11% of the new experimental UPP. The pK3est() function allows us to do this easily, as we saw at the beginning of this document. This may take a few minutes to compute.

rrl.exp.data <- pK3est(perc = frac, pattern = pp3.full.scaled, 
                       nEvals = 100, rmax = 17, nrval = 100, 
                       correction="trans", anom=TRUE, sorted=FALSE,
                       cores = 8)

We can now calculate \(T(r)\) for our clustering species using this expected random signal:

t.agg <- anomK3est(pattern = pp3.agg.scaled, toSub = rrl.exp.data[[2]], rmax = 17, nrval = 100)

And then measure the \(T(r)\) metrics from the sample:

met.calc <- k3metrics(t.agg$r, t.agg$trans, toplot = TRUE, ylim = c(-0.5,3.5))

met.agg <- as.data.frame(t(unlist(met.calc)))
names(met.agg) <- c("Tm","Rm","Rdm","Rddm","Tdm")
met.agg

##         Tm       Rm      Rdm     Rddm        Tdm
## 1 3.315087 5.838384 9.616162 7.383838 -0.4958785

And printed above we see the five \(T(r)\)-metrics from the APT reconstruction. We can now combine these metrics into the five principal components found from the training data, and plug these PCs into our models to obtain estimates for the weighted radius (\(R_w\)) and intra-cluster concentration (\(\rho_1\)) of clusters in the reconstruction.

reconstruction.pca <- predict(pp.train, met.agg) # Turn metrics into PCs
rw.estimate <- predict(rw.models[[1]], newdata = reconstruction.pca) # Get R_w estimate from best R_w model
rho1.estimate <- predict(rho1.models[[3]], newdata = reconstruction.pca) # Get rho_1 estimate from best rho_1 model

We now report the final estimates for \(R_w\) and \(\rho_1\).

\(R_w\) Estimate:

##        1 
## 6.408299

\(\rho_1\) Estimate:

##         1 
## 0.4591814

Given enough testing data, one could then calculate confidence bounds on these estimates. However, because in this vignette our testing data set was so small, it will not be possible here. Constructing these bounds is reliant on the idea that there are multiple data points within the testing set with similar parameters to those estimated from our experimental data set. We can use the variation in the testing data of these points with similar parameters as an estimate of the variance of the estimates for our data.

Remember that this vignette uses very small training and testing data sets to make it computationally feasible without a supercomputer, which is why the estimates presented here do not match the ones shown in the paper. Also remember that the \(R_w\) estimate here is scaled, and can be divided by \(\alpha\) to get back to units of the original data set.

Maximum Separation Algorithm

Introduction

The last part of our paper involves comparing the \(T(r)\)-metric based inference method to estimates made from the maximum separation algorithm (MSA). The msa() function within the rapt package allows for easy computation of the MSA on a marked pp3 object within the R environment. Find documentation for this function by running ?msa.

We can do a quick example of how this function works using the cluster.noisy simulation from earlier in this document. The second element of this lists (the second element of the list returned by the clustersim() function) contains the marked pp3 object that we will pass into the msa() function. Guidance for selection of the dmax, Nmin, denv, and der parameters is given by Vaumousse, Cerezo, and Warren (2003).

msa.noisy <- msa(cluster.noisy[[2]], dmax = 1.3, Nmin = 10, denv = 1.3, der = 1.3, clust.mark = c('A','B'))

We can then look at the clusters identified by the maximum separation algorithm compared to the true clusters present in the point pattern (we know the truth in this case because we simulated it). The red points in the plot below are the MSA found clusters, while the black points are the true clusters.

mfrow3d(nr = 1, nc = 2, sharedMouse = TRUE)
data.noisy <- as.data.frame(cluster.noisy[[2]]$data)
data.noisy <- data.noisy[marks(cluster.noisy[[2]]) == 'A',]
plot3d(data.noisy)
msa.data.noisy <- as.data.frame(cluster.noisy[[2]]$data)
msa.data.noisy <- msa.data.noisy[unlist(msa.noisy$A),]
plot3d(msa.data.noisy, col = 'red')
rglwidget(elementId = 'msanoisy')

From this plot, we can see that MSA does a relatively good job in this case of identifying the points that reside in each cluster. We know that the true parameters used were \(\mu_R = 4\), \(\rho_1 = 0.4\), and \(\rho_2 = 0.03\). We can compare these values to the values estimated by MSA:

mean(msa.noisy$radius)

## [1] 3.64859

mean(msa.noisy$den)

## [1] 0.6228271

msa.noisy$bgnd.den

## [1] 0.03080937

We see that the estimates are generally in the correct neighborhood. Care should be taken in trusting these MSA estimates in real data, as they are highly dependent on the parameters dmax and Nmin. This point is discussed with more detail in the main paper.

Applied to Experimental Data

Now that we know how the msa() function works, we can apply it to the experimental APT reconstruction from the previous section. This analysis will show how dependent the output from the MSA is on dmax and Nmin.

We will run the MSA on the marked points pattern contained in the pp3.full.scaled object, which contains the scaled version of the experimental reconstruction. We will do this at multiple combinations of dmax and Nmin. This may take a few minutes to run:

nmins <- c(10, 15, 20)
dmaxs <- c(1.4, 1.7, 2.0)
params <- expand.grid(nmins, dmaxs)

msa.sweep <- lapply(1:nrow(params), function(i){
  return(msa(pp3.full.scaled, 
             params[i,2], 
             params[i,1], 
             params[i,2], 
             params[i,2], 
             clust.mark = c('Mg1', 'Zn1')))
})

Now, we can plot the average cluster radius and average intra-cluster concentration identified by MSA as a function of both Nmin and dmax:

mfrow3d(nr = 1, nc = 2, sharedMouse = FALSE)
plot3d(msa.data$Nmin, msa.data$dmax, msa.data$rad, type = 's', col = 'blue',
       xlab = 'Nmin', ylab = 'dmax', zlab = 'Average Cluster Radius')
plot3d(msa.data$Nmin, msa.data$dmax, msa.data$rho1, type = 's', col  = 'red',
       xlab = 'Nmin', ylab = 'dmax', zlab = 'Average Intra-Cluster Concentration')
rglwidget()

And we see that these summary statistics of what the MSA identifies as clusters vary significantly based on Nmin and dmax.

Instead of looking at these summaries, we could instead just look at the clusters themselves. The plots blow show the cluster species points identified as being in a cluster for each combination of Nmin and dmax.

We see from these plots that the clusters identified by the MSA on our experimental data set change significantly as a function of the user-defined initialization parameters. This motivates the use of a characterization method like the \(K(r)\)-based one outlined in the paper, which does not rely on the user to select any parameters to obtain results.

References

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. Chapman; Hall/CRC.

Desmond, Kenneth W, and Eric R Weeks. 2009. “Random Close Packing of Disks and Spheres in Confined Geometries.” Physical Review E 80 (5). APS: 051305.

Vaumousse, D, A Cerezo, and PJ Warren. 2003. “A Procedure for Quantification of Precipitate Microstructures from Three-Dimensional Atom Probe Data.” Ultramicroscopy 95. Elsevier: 215–21.

Three Dimensional Cluster Analysis for Atom Probe Tomography Using Ripley’s K function and Machine Learning – Code Procedure Summary

Galen B. Vincent

April 2020