In this document, we will walk through the different coding procedures used to obtain results for the paper Three Dimensional Cluster Analysis for Atom Probe Tomography Using Ripley’s K function and Machine Learning by Vincent, Proudian, and Zimmerman. Note that most of the computational results presented in the paper were performed on a high performance computing system. The summary below will outline the procedure at a scale that can be run on a personal machine. All referenced files for this page are available here.
Install the rapt
package (R for Atom Probe Tomography) with all package dependencies. Available on GitHub: https://github.com/aproudian2/rapt. We will also need the rgl
, parallel
, dplyr
, and RColorBrewer
packages.
install.packages('devtools')
library(devtools)
install_github('aproudian2/rapt')
library(rapt)
library(rgl)
library(parallel)
library(dplyr)
library(RColorBrewer)
The clustersim()
function within the rapt
package can be used to simulate clustered point patterns following the procedure described in the main body of the paper. Documentation for this function can be found by running the command ?clustersim
. Below, we will walk through the process of simulating and visualizing some different clustered point patterns.
First, we need to upload the RCP patterns generated using code by Desmond and Weeks (2009). This code produces two output files: a “FinalConfig” file containing the xyz positions of the center of each sphere in the pattern and a “system” file containing all of the parameters used in the simulation. We can upload these files into a pp3
object using the read.rcp()
function.
rcp.upload <- read.rcp('FinalConfig1', 'system1', scaleUp = TRUE, newRadius = 0.5)
We then use the stitch.size()
function in rapt
to tile the pattern and create a larger volume of RCP points. This tiling is possible due to the periodic boundary conditions imposed on the RCP simulation. We create a \(60 \times 60 \times 60\) volume of RCP points:
rcp.stitched <- stitch.size(rcp.upload, boxSize = c(60,60,60))
Now, for the simplicity of this example, we can use this pattern as both the underlying and cluster centroid patterns for our cluster simulation (typically we would use different RCP pattern realizations). Below we simulate a clustered point pattern using the clustersim()
function with parameters \(\eta = 0.1\), \(\mu_R = 4\), \(\rho_1 = 1\), \(\rho_2 = 0\), \(\beta = 0\), and \(\xi = 0\). Note that the plot below is interactive; feel free to drag it around with your mouse and use the scroll wheel to zoom in and out.
cluster.simple <- clustersim(rcp.stitched, rcp.stitched, 0.5,
pcp = 0.1, cr = 4,
rho1 = 1, rho2 = 0,
rb = 0, pb = 0,
toplot = T)
Below is an example with a slightly more “noisy”" set of cluster parameters: \(\eta = 0.1\), \(\mu_R = 4\), \(\rho_1 = 0.4\), \(\rho_2 = 0.03\), \(\beta = 0.2\), and \(\xi = 0.1\).
cluster.noisy <- clustersim(rcp.stitched, rcp.stitched, 0.5,
pcp = 0.1, cr = 4,
rho1 = 0.4, rho2 = 0.03,
rb = 0.2, pb = 0.1,
toplot = T)
Next we will get into the process of calculating \(T(r)\) for a point pattern. The first step in this process is to calculate \(K(r)\) for the point pattern of interest. To do this in R, use the K3est()
function from the spatstat
library. This function accepts a pp3
object as its main argument; for our simulated point patterns above, this pp3
object is the first entry of the returned lists cluster.simple
and cluster.noisy
. Another important argument to the K3est()
function is the correction
parameter, which specifies the form of the edge correction for \(K(r)\). We will always use the “translation” correction for our work. For more information about the different types of edge corrections, we suggest the book Spatial Point Patterns by Baddeley, Rubak, and Turner (2015). The other arguments to K3est()
determine the maximum \(r\) value that \(K(r)\) is measured to, and the number of \(r\) values it is measured at along the way.
So, let’s compare the \(K(r)\) signal from the simple and noisy point patterns simulated above:
k.simple <- K3est(cluster.simple[[1]], rmax = 17, nrval = 100, correction = 'translation')
k.noisy <- K3est(cluster.noisy[[1]], rmax = 17, nrval = 100, correction = 'translation')
plot(k.simple$r, k.simple$trans, type = 'l', xlab = 'r', ylab = 'K(r)', lwd = 2, col = 'blue')
lines(k.noisy$r, k.noisy$trans, lwd = 2, col = 'red')
legend(0, 20000, legend = c('Simple', 'Noisy'), col = c('blue', 'red'), lwd = 2)
Now, looking at this plot, we see that the “Simple” signal appears to oscillate up and down with larger amplitude than the “Noisy” signal. Intuitively this should make sense; If we think of variation in the K-function as a measure of clustering, the simple signal should be stronger than the noisy signal, as there is stronger clustering occurring in the simple point pattern.
The two lines shown in the plot above are not the easiest to compare to one another, however, because the magnitude of the differences is relatively small compared to the overall trend of the functions themselves (increasing polynomially). This is where \(T(r)\) becomes useful. But, before we can calculate \(T(r)\), we need to perform random relabeling of a fraction \(\eta = 0.1\) of the UPP to figure out the “expected random signal” of this UPP. We do so in the following code snippet, which utilizes the the pK3est()
function and parallel computing to speed up the process (note that we would typically use somewhere in the range of 25000 random relabelings, but for this vignette we will only perform 100). This code may take a few minutes to run.
rrl.data <- pK3est(perc = 0.1, pattern = rcp.stitched,
nEvals = 100, nrval = 100, rmax = 17,
correction="trans", anom=TRUE, sorted=FALSE,
cores = 8)
Now, the square root of the expected random signal is contained in the second element of the list returned from pK3est()
. Let’s look at how this signal compares to the clustered signals:
plot(k.simple$r, k.simple$trans, type = 'l', xlab = 'r', ylab = 'K(r)', lwd = 2, col = 'blue')
lines(k.noisy$r, k.noisy$trans, lwd = 2, col = 'red')
lines(k.noisy$r, rrl.data[[2]]^2, lwd = 2, col = 'black')
legend(0, 20000, legend = c('Simple', 'Noisy', 'Expected Random'), col = c('blue', 'red', 'black'), lwd = 2)
To get a better comparison, we can instead plot \(T(r) = \sqrt{K_\text{A-obs}(r)} - \sqrt{K_{(N/2)}(r)}\) for each of these signals. Note that \(T(r) = 0\) for the expected random signal by definition.
t.simple <- data.frame(r = k.simple$r, trans = sqrt(k.simple$trans) - rrl.data[[2]])
t.noisy <- data.frame(r = k.noisy$r, trans = sqrt(k.noisy$trans) - rrl.data[[2]])
plot(k.simple$r, t.simple$trans, type = 'l', xlab = 'r', ylab = 'T(r)', lwd = 2, col = 'blue')
lines(k.noisy$r, t.noisy$trans, lwd = 2, col = 'red')
lines(k.noisy$r, rep(0, 100), lwd = 2, col = 'black')
legend(0, -5, legend = c('Simple', 'Noisy', 'Expected Random'), col = c('blue', 'red', 'black'), lwd = 2)
And now we can really see the advantage of using \(T(r)\) as opposed to \(K(r)\); the deviations from the random signal are put on main display with the \(T(r)\) function. These \(T(r)\) signals could have instead been directly calculated using the anomK3est()
function from the rapt
library. For the simple cluster pattern, this would be done by running:
simple.alt <- anomK3est(pattern = cluster.simple[[1]], toSub = rrl.data[[2]], rmax = 17, nrval = 100)
We can now use the k3metrics()
function to extract the \(T(r)\) metrics from the two \(T(r)\) signals shown above.
mets.simple <- k3metrics(t.simple$r[10:100], t.simple$trans[10:100], toplot = T)
Using the toplot = TRUE
argument, we get a plot of \(T(r)\) (black) and its first three derivatives (red, purple, and blue correspond to first, second, and third derivative respectively). The points associated with each of the five \(T(r)\) metrics are also highlighted. The actual object returned from this function is a list containing the values of each metric, where the order is [[1]] Tmax, [[2]] Rmax, [[3]] Rdmin, [[4]]Rd3max, [[5]] Tdmin.
mets.simple
## [[1]]
## [1] 20.41701
##
## [[2]]
## [1] 4.979798
##
## [[3]]
## [1] 8.070707
##
## [[4]]
## [1] 7.727273
##
## [[5]]
## [1] -8.15958
We can calculate the same values for the noisy clustered data:
mets.noisy <- k3metrics(t.noisy$r[10:100], t.noisy$trans[10:100], toplot = T, ylim = c(-2, 6))
mets.noisy
## [[1]]
## [1] 5.820987
##
## [[2]]
## [1] 4.636364
##
## [[3]]
## [1] 7.383838
##
## [[4]]
## [1] 7.040404
##
## [[5]]
## [1] -1.699889
We have now done enough background to get at the purpose of this paper; to use these \(T(r)\) metrics to make inference about the global properties of the clustered point pattern that lead to them. By inspection, we see that there are differences between the metrics for these two patterns, but how these differences connect to cluster properties is much less clear.
The code below contains a scaled-down example workflow for generating the training and testing data sets. In this example, the training set will contain 100 total data points (10 iterations at each of 10 different parameter combinations) and the testing set will contain 25 total data points (one iteration at each of 25 different parameter combinations).
First we set some of the important parameters:
cluster.size <- c(60,60,60) # x, y, z dimensions of the simulated point patterns
nrand <- 10 # Number of distinct parameter combinations
nclust <- 10 # Number of iterations to perform at each parameter combination
s <- 100 # Seed for random number generators
maxr <- 17 # Max r value to calculate K(r) to
nr <- 100 # Number of r values to calculate K(r) at
cores2use <- 5 # Number of cores to use in parallel computation
Now we generate the 100 different parameter combinations and store them in the list params
:
set.seed(s) # Set the random seed
params <- list()
cr <- runif(nrand, min = 2, max = 6.5) # mu_R values
rho1 <- runif(nrand, min = 0.2, max = 1) # rho_1 values
rho2 <- runif(nrand, min = 0, max = 0.03) # tho_2 values
rb <- runif(nrand, min = 0, max = 0.5) # radius blur (aka beta) values
pb <- runif(nrand, min = 0, max = 0.2) # position blur (aka xi) values
for(i in 1:nrand){
params[[i]] <- c(cr[i], rho1[i], rho2[i], rb[i], pb[i]) # create one big list of parameter combinations
}
params.train <- params # Will use this later
We need the square root of the expected random \(K(r)\) signal, which we can store in a vector named rand.signal
. Remember that we calculated this signal in the T(r) and Metric Extraction section. It is stored as the second element of the rrl.data
list:
rand.signal <- rrl.data[[2]]
We will now upload all the RCP files needed to generate different simulation iterations at each parameter combination (files are on GitHub):
under.list <- list() # Create list for the UPPs
over.list <- list() # Create list for the patterns used to initialize cluster centroids
for(i in 1:nclust){
under <- read.rcp(paste('FinalConfig', toString(i), sep=''),
paste('system', toString(i), sep=''),
scaleUp = TRUE, newRadius = 0.5)
over <- under
under.list[[i]] <- stitch.size(under, boxSize = c(60,60,60))
over.list[[i]] <- stitch.size(over, boxSize = c(60,60,60))
}
# Will use these later
under.list.train <- under.list
over.list.train <- over.list
We next write a function that allows us to easily simulate clustered point patterns and extract \(T(r)\) metrics based on what we have set up thus far
kmet_extract <- function(i, nper, maxr, nr, toSub, pcp = 0.05114235, tol = 0.005){
res <- matrix(NA, nrow = nper, ncol = 5)
colnames(res) <- c('Tm', 'Rm', 'Rdm', 'Rddm', 'Tdm')
set.seed(i)
seeds <- round(runif(nper, 1, 1e8))
for(j in 1:nper){
# Simulate a cluster according to the ith parameter combination
cluster <- clustersim(under.list[[j]], over.list[[j]], 0.5,
pcp = pcp,
cr = params[[i]][1],
rho1 = params[[i]][2],
rho2 = params[[i]][3],
rb = params[[i]][4],
pb = params[[i]][5],
tol = tol,
s = seeds[j])
if(is.numeric(cluster)){
res[j,] <- c(NA, NA, NA, NA, NA)
next
}
# Measure T(r) on the pattern
result <- anomK3est(cluster[[1]], toSub, maxr, nr)
rvals <- result$r
tvals <- result$trans
rvals.new <- rvals[10:length(rvals)]
tvals.new <- tvals[10:length(rvals)]
# Extract T(r) metrics
metrics <- k3metrics(rvals.new, tvals.new, F)
res[j,] <- c(metrics[[1]], metrics[[2]], metrics[[3]], metrics[[4]], metrics[[5]])
# Clear memory
rm(cluster, result, rvals, tvals, rvals.new, tvals.new)
gc()
}
return(res)
}
Next we initialize a parallel computing cluster (Google “parallel computing in R” if you are unfamiliar).
cl <- makePSOCKcluster(cores2use) # Create a PSOCK cluster containing the number of cores earlier specified
clusterEvalQ(cl, library(rapt)) # Load the rapt and zoo libraries into the parallel computing environment
clusterEvalQ(cl, library(zoo))
clusterExport(cl, c('over.list', 'under.list', 'params')) # Make these objects available in the parallel environment
Now that we have everything set up so nicely, we only need to run one command to get out our results! This might take a few minutes to run.
train.data <- parLapply(cl, 1:nrand, kmet_extract, nclust, maxr, nr, rand.signal, pcp = 0.05114235, tol = 0.005)
stopCluster(cl)
The object train.data
is now a list of 10 matrices, where the \(i^\text{th}\) matrix corresponds to the \(i^\text{th}\) parameter combination in the params.train
list. Each matrix contains results from the the 10 iterations performed for the associated cluster parameter set. For example, we can look at the first parameter combination set and the 10 data points corresponding to it:
params.train[[1]]
## [1] 3.38494750 0.69999718 0.01607433 0.24415300 0.06613211
train.data[[1]]
## Tm Rm Rdm Rddm Tdm
## [1,] 20.45882 5.323232 8.757576 7.040404 -4.383947
## [2,] 19.43588 5.151515 8.585859 6.868687 -4.476304
## [3,] 19.21325 4.979798 8.757576 7.555556 -4.772993
## [4,] 21.45222 5.494949 9.444444 7.383838 -4.633885
## [5,] 22.68797 5.838384 9.616162 9.787879 -4.488128
## [6,] 18.59852 4.979798 8.070707 6.868687 -4.599503
## [7,] 19.75045 5.151515 9.101010 7.383838 -4.651663
## [8,] 18.84017 4.979798 8.585859 7.040404 -4.795017
## [9,] 20.10392 5.151515 8.585859 7.040404 -4.406182
## [10,] 20.68984 5.494949 9.444444 7.555556 -4.440929
The above matrix shown the \(T(r)\) metrics measured on each of the ten clustered point patterns simulated with these parameters. Notice that they are all similar, but not exactly the same. This is why we do multiple simulations at each parameter combination set; to capture this variation and incorporate it into our models.
We also need to generate our testing data set, which is 25 simulations each at a different combination of parameters. We follow a very similar procedure to the one outlined above for generating the test set. Once again, this may take a few minutes to run:
nrand <- 25 # Number of distinct parameter combinations
nclust <- 1 # Number of iterations to perform at each parameter combination
s <- 101 # Seed for random number generators
set.seed(s)
params <- list()
cr <- runif(nrand, min = 2, max = 6.5) # mu_R values
rho1 <- runif(nrand, min = 0.2, max = 1) # rho_1 values
rho2 <- runif(nrand, min = 0, max = 0.03) # tho_2 values
rb <- runif(nrand, min = 0, max = 0.5) # radius blur (aka beta) values
pb <- runif(nrand, min = 0, max = 0.2) # position blur (aka xi) values
for(i in 1:nrand){
params[[i]] <- c(cr[i], rho1[i], rho2[i], rb[i], pb[i]) # create one big list of parameter combinations
}
params.test <- params
over.inds <- sample(1:10, nrand, replace = TRUE)
under.inds <- sample(1:10, nrand, replace = TRUE)
over.list <- over.list.train[over.inds]
under.list <- under.list.train[under.inds]
kmet_extract_test <- function(i, maxr, nr, toSub, pcp = 0.05114235, tol = 0.005){
res <- t(as.matrix(rep(NA, 5)))
colnames(res) <- c('Tm', 'Rm', 'Rdm', 'Rddm', 'Tdm')
cluster <- clustersim(under.list[[i]], over.list[[i]], 0.5,
pcp = pcp,
cr = params[[i]][1],
rho1 = params[[i]][2],
rho2 = params[[i]][3],
rb = params[[i]][4],
pb = params[[i]][5],
tol = tol,
s = i)
if(is.numeric(cluster)){
res <- c(NA, NA, NA, NA, NA)
return(res)
}
# Measure T(r) on the pattern
result <- anomK3est(cluster[[1]], toSub, maxr, nr)
rvals <- result$r
tvals <- result$trans
rvals.new <- rvals[10:length(rvals)]
tvals.new <- tvals[10:length(rvals)]
# Extract T(r) metrics
metrics <- k3metrics(rvals.new, tvals.new, F)
res <- c(metrics[[1]], metrics[[2]], metrics[[3]], metrics[[4]], metrics[[5]])
# Clear memory
rm(cluster, result, rvals, tvals, rvals.new, tvals.new)
gc()
return(res)
}
# Set up parallel cluster
cl <- makePSOCKcluster(cores2use)
clusterEvalQ(cl, library(rapt))
clusterEvalQ(cl, library(zoo))
clusterExport(cl, c('over.list', 'under.list', 'params'))
# Perform simulations and metric extraction in parallel
sim.res.test <- parLapply(cl, 1:nrand, kmet_extract_test, maxr, nr, rand.signal, pcp = 0.05114235, tol = 0.005)
test.data <- matrix(unlist(sim.res.test), nrow = 25, byrow = T)
colnames(test.data) <- c('Tm', 'Rm', 'Rdm', 'Rddm', 'Tdm')
stopCluster(cl)
Now we have the testing data, which we will use to evaluate our models, stored in a matrix named test.data
:
head(test.data)
## Tm Rm Rdm Rddm Tdm
## [1,] 29.959066 5.323232 9.616162 7.212121 -8.384301
## [2,] 2.648401 2.404040 3.949495 2.747475 -2.375117
## [3,] 20.660073 6.525253 10.989899 9.444444 -4.414132
## [4,] 23.194387 7.383838 13.050505 9.272727 -3.637428
## [5,] 35.773989 7.555556 12.020202 10.989899 -4.200462
## [6,] 15.327732 5.666667 9.272727 8.757576 -2.937881
Now that we have our training (train.data
) and testing (test.data
) data sets generated, we can get into actually training and evaluating our different models.
First, there are a few packages that we need to utilize for training of these models. Specifically, the caret
package, which provides access to a wide variety of machine learning tools, and the doParallel
package, which we will use to parallelize the training of a random forest model. Make sure to install these packages if you don’t already have them.
library(caret)
library(doParallel)
First, we create a big data frame for the training data, where each row contains a single paring of the input simulation parameters from params.train
and corresponding \(T(r)\)-metrics from train.data
. We call this data frame train.df
.
train.df <- matrix(NA, nrow = length(train.data)*nrow(train.data[[1]]), ncol = 10)
cnt <- 1
for(i in 1:length(train.data)){
for(j in 1:nrow(train.data[[1]])){
train.df[cnt, 1:5] <- params.train[[i]]
train.df[cnt, 6:10] <- train.data[[i]][j,]
cnt <- cnt + 1
}
}
train.df <- as.data.frame(train.df)
names(train.df) <- c("cr","rho1","rho2","rb","pb","Tm","Rm","Rdm","Rddm","Tdm")
Next, we need to add a column to this data frame containing the weighted radius \(R_w\) of the pattern, which is calculated from \(\mu_R\) and \(\sigma_R\) as follows:
\[R_w = \frac{\mu_R^4 + 6 \mu_R^2 \sigma_R^2 + 3 \sigma_R^4}{\mu_R^3 + 3 \mu_R \sigma_R^2}\]
train.df$sigma <- train.df$cr*train.df$rb
train.df$rw <- (train.df$cr^4 + 6*train.df$cr^2*train.df$sigma^2 + 3*train.df$sigma^4)/(train.df$cr^3 + 3*train.df$cr*train.df$sigma^2)
Next, we get rid of any “incomplete cases” in the training data; that is, we remove any rows of train.df
that have NA
values (cluster simulations occasionally return NA
values if the global type-\(A\) point percentage isn’t within the specific tolerance level). We can’t use these data points to help us train the models, so we just remove them.
train.df.complete <- train.df[complete.cases(train.df),]
Next, we use the preProcess()
function from the caret
package to perform principal component analysis on our \(T(r)\)-metrics. We add the principal components as additional columns to the train.df.complete
data frame.
mets <- c("Tm", "Rm" ,"Rdm", "Rddm", "Tdm")
pp.train <- preProcess(train.df.complete[,mets], method = c("scale", "center", "pca", "BoxCox"), thresh = 1)
data.train.pca <- predict(pp.train, train.df.complete[,mets])
train.df.complete <- data.frame(train.df.complete, data.train.pca)
Let’s look at a few rows of the train.df.complete
data frame to see the information we have collected about each data point in our training set:
head(train.df.complete)
## cr rho1 rho2 rb pb Tm Rm Rdm
## 1 3.384947 0.6999972 0.01607433 0.244153 0.06613211 20.45882 5.323232 8.757576
## 2 3.384947 0.6999972 0.01607433 0.244153 0.06613211 19.43588 5.151515 8.585859
## 3 3.384947 0.6999972 0.01607433 0.244153 0.06613211 19.21325 4.979798 8.757576
## 4 3.384947 0.6999972 0.01607433 0.244153 0.06613211 21.45222 5.494949 9.444444
## 5 3.384947 0.6999972 0.01607433 0.244153 0.06613211 22.68797 5.838384 9.616162
## 6 3.384947 0.6999972 0.01607433 0.244153 0.06613211 18.59852 4.979798 8.070707
## Rddm Tdm sigma rw PC1 PC2 PC3
## 1 7.040404 -4.383947 0.8264451 3.929064 1.028896473 0.6288903 -0.38345086
## 2 6.868687 -4.476304 0.8264451 3.929064 1.195579772 0.7039168 -0.31723017
## 3 7.555556 -4.772993 0.8264451 3.929064 1.041780552 0.7946626 -0.03237741
## 4 7.383838 -4.633885 0.8264451 3.929064 0.669275933 0.6695120 -0.35523918
## 5 9.787879 -4.488128 0.8264451 3.929064 0.001140357 0.2907124 0.20045686
## 6 6.868687 -4.599503 0.8264451 3.929064 1.388553260 0.8146326 -0.17092912
## PC4 PC5
## 1 -0.10314279 -0.009007408
## 2 -0.05403028 -0.030043902
## 3 -0.08461954 -0.192988479
## 4 0.02224586 -0.108415379
## 5 -0.42225523 -0.154576207
## 6 -0.12105714 0.021722692
Now that we have all this information collected, we can use it to actually train our models! We will start by training three different regression models for \(R_w\): a generalized linear model (GLM), a Bayesian regularized neural network model (BRNN), and a random forest model (RF). We do so using methods from the caret
package. We utilize 10-fold cross-validation to avoid over fitting of the training set.
out.param <- 'rw'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
rw.models <- list()
rw.models[[1]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "glm", trControl = trainControl("cv", number = 10))
rw.models[[2]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "brnn", trControl = trainControl("cv", number = 10), tuneGrid = data.frame('neurons' = c(5, 7, 9)))
ncores <- 5
cl <- makeCluster(ncores)
registerDoParallel(cl)
rw.models[[3]] <- train(train.df.complete[,predictors],train.df.complete[,out.param],method = "parRF", trControl = trainControl("cv", number = 10))
stopCluster(cl)
registerDoSEQ()
names(rw.models) <- c("glm","brnn","rf")
Next, we evaluate the different \(R_w\) models using root mean square error of prediction (RMSEP), defined as
\[RMSEP = \sqrt{\frac{1}{n}\sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}\,\, ,\] where \(n\) is the size of the testing data set, \(y_i\) is the true value of the desired response variable (\(R_w\) in this case), and \(\hat{y}_i\) is the corresponding model estimate. Smaller values of RMSEP means that the model does a better job at estimating the true value of the response.
To calculate RMSEP for each \(R_w\) model, we first need to create a data frame with the complete cases of test data similar to how we did with the training data, including all of the true parameters, \(T(r)\) metrics, and principal components.
param.matrix <- matrix(unlist(params.test), nrow = 25, byrow = T)
test.df <- cbind(param.matrix, test.data) # Combine parameters and T(r)-metrics
test.df <- as.data.frame(test.df)
names(test.df) <- c("cr","rho1","rho2","rb","pb","Tm","Rm","Rdm","Rddm","Tdm")
test.df.complete <- test.df[complete.cases(test.df),] # Only keep complete cases
# Calculate R_w for each cluster simulation
test.df.complete$sigma <- test.df.complete$cr * test.df.complete$rb
test.df.complete$rw <- (test.df.complete$cr^4 + 6* test.df.complete$cr^2 * test.df.complete$sigma^2 + 3*test.df.complete$sigma^4)/
(test.df.complete$cr^3 + 3*test.df.complete$cr*test.df.complete$sigma^2)
test.pca <- predict(pp.train, test.df.complete[,mets]) # Perform PCA in same way as performed on training data
test.df.complete <- data.frame(test.df.complete, test.pca) # Smoosh everything together into one big data frame
Now that we have the testing set data frame all set up, we can calculate RMSEP from the models contained in the list rw.models
and select the smallest one:
predictors <- c("PC1","PC2","PC3","PC4","PC5")
out.param <- 'rw'
toRound <- 6
model.RMSEP <- lapply(rw.models, function(x){
a <- predict(x, newdata = test.df.complete[,predictors])
RMSEP <- sqrt(mean((a-test.df.complete[,out.param])^2))
print(x$method)
print(paste('RMSEP: ', toString(round(RMSEP,toRound)), sep = ''))
})
## [1] "glm"
## [1] "RMSEP: 0.604321"
## [1] "brnn"
## [1] "RMSEP: 1.240015"
## [1] "parRF"
## [1] "RMSEP: 0.820562"
So, based on our small example, the GLM does best at estimating \(R_w\). As the size of the training data set is increased, the BRNN and RF models will begin to perform better than the GLM, and the BRNN will slightly edge out the RF as the better model (as seen in the main paper results, which use a training set containing 100,000 data points instead of the 100 used here).
Now that we have a \(R_w\) model selected, we can run very similar code to train and evaluate models for \(\rho_1\) and \(\rho_2\):
out.param <- 'rho1'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
rho1.models <- list()
rho1.models[[1]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "glm", trControl = trainControl("cv", number = 10))
rho1.models[[2]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "brnn", trControl = trainControl("cv", number = 10), tuneGrid = data.frame('neurons' = c(5, 7, 9)))
ncores <- 5
cl <- makeCluster(ncores)
registerDoParallel(cl)
rho1.models[[3]] <- train(train.df.complete[,predictors],train.df.complete[,out.param],method = "parRF", trControl = trainControl("cv", number = 10))
stopCluster(cl)
registerDoSEQ()
names(rho1.models) <- c("glm","brnn","rf")
predictors <- c("PC1","PC2","PC3","PC4","PC5")
out.param <- 'rho1'
toRound <- 6
model.RMSEP <- lapply(rw.models, function(x){
a <- predict(x, newdata = test.df.complete[,predictors])
RMSEP <- sqrt(mean((a-test.df.complete[,out.param])^2))
print(x$method)
print(paste('RMSEP: ', toString(round(RMSEP,toRound)), sep = ''))
})
## [1] "glm"
## [1] "RMSEP: 4.619748"
## [1] "brnn"
## [1] "RMSEP: 4.773576"
## [1] "parRF"
## [1] "RMSEP: 4.482717"
For \(\rho_1\), the RF model has smallest RMSEP.
We do the same for \(\rho_2\).
out.param <- 'rho2'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
rho2.models <- list()
rho2.models[[1]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "glm", trControl = trainControl("cv", number = 10))
rho2.models[[2]] <- train(train.df.complete[,predictors], train.df.complete[,out.param], method = "brnn", trControl = trainControl("cv", number = 10), tuneGrid = data.frame('neurons' = c(5, 7, 9)))
ncores <- 5
cl <- makeCluster(ncores)
registerDoParallel(cl)
rho2.models[[3]] <- train(train.df.complete[,predictors],train.df.complete[,out.param],method = "parRF", trControl = trainControl("cv", number = 10))
stopCluster(cl)
registerDoSEQ()
names(rho2.models) <- c("glm","brnn","rf")
predictors <- c("PC1","PC2","PC3","PC4","PC5")
out.param <- 'rho2'
toRound <- 6
model.RMSEP <- lapply(rw.models, function(x){
a <- predict(x, newdata = test.df.complete[,predictors])
RMSEP <- sqrt(mean((a-test.df.complete[,out.param])^2))
print(x$method)
print(paste('RMSEP: ', toString(round(RMSEP,toRound)), sep = ''))
})
## [1] "glm"
## [1] "RMSEP: 5.079752"
## [1] "brnn"
## [1] "RMSEP: 5.25549"
## [1] "parRF"
## [1] "RMSEP: 4.965455"
For \(\rho_2\), the RF model has smallest RMSEP.
We have selected the best model for each cluster parameter, but we have yet to truly evaluate whether or not these models perform well. We can do this by visualizing the model estimate errors.
We will first visualize the results for the \(R_w\) GLM model, as this was the \(R_w\) model with smallest RMSEP in the previous section. To start, we need to calculate a vector of the model residuals (difference between model estimates and true values) from the test data set, which we will store in a vector named diff
. We can then calculate the percent error associated with each residual, which we store in a vector named diff.perc
.
out.param <- 'rw'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
pred.full <- extractPrediction(rw.models[1], testX = test.df.complete[,predictors], testY = test.df.complete[,out.param]) # Extract predictions from the GLM model
pred <- pred.full[pred.full$dataType == 'Test',] # Select only the test data predictions
diff <- pred$obs - pred$pred # Calculate difference between true values and estimates
diff.perc <- diff*100/pred$obs # Calculate percent error between true values and estimates
Next we will plot the estimated \(R_w\) values from the test set versus the corresponding true simulation input values (i.e. the estimates versus the true values). We will color code the plot according to percent error percentile:
est_plot <- function(diff.perc, percs, cols, ...){
sorted <- data.frame('obs' = pred$obs[order(abs(diff.perc))], 'pred' = pred$pred[order(abs(diff.perc))])
n <- nrow(sorted)
plot(sorted$obs, sorted$pred, col = 'black', pch = 16, type = 'n', ...)
vals.subed <- list('all' = sorted)
ind <- rep(0, length(percs)+1)
for(i in 1:length(percs)){
ind[i+1] <- round(n*percs[i]/100)
vals.subed[[i+1]] <- sorted[(ind[i]+1):ind[i+1],]
n.i <- nrow(vals.subed[[i+1]])
points(vals.subed[[i+1]]$obs, vals.subed[[i+1]]$pred, pch = 16, col = cols[i])
}
ind.last <- (tail(ind, n = 1)+1):n
points(vals.subed$all$obs[ind.last], vals.subed$all$pred[ind.last], col = 'black', pch = 16)
abline(0, 1, col = 'black', lty = 2, lwd = 1.5)
}
est_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"),
xlab = 'Simulation Input R_w', ylab = 'Estimated R_w',
main = 'Weighted Radius Model Estimates vs True Values')
legend(2.5, 6.5, legend = sapply(c(50, 75, 90), function(x){paste(toString(x), '%',sep = '')}),
pch = 16, col = brewer.pal(3, "Set1"), title = "Error Percentile")
From this plot, we see which estimates from our test data set fall close to their true values (an exact match would fall on the black dotted line, which is the line \(y = x\)).
Next, we can visualize the percent errors of each estimate from the test data set for the \(R_w\) model:
error_plot <- function(diff.perc, percs, cols, ...){
n <- length(diff.perc)
diff.perc.sorted <- sort(abs(diff.perc))
xs <- (1:n)*100/n
plot(xs, diff.perc.sorted, xlab = 'Error Percentile', type = 'n', ...)
vals <- list('all' = data.frame('perc' = xs, 'error' = diff.perc.sorted))
ind <- rep(0, length(percs)+1)
for(i in 1:length(percs)){
ind[i+1] <- round(n*percs[i]/100)
vals[[i+1]] <- vals$all[(ind[i]+1):ind[i+1],]
n.i <- nrow(vals[[i+1]])
points(vals[[i+1]]$perc, vals[[i+1]]$error, pch = 16, col = cols[i], cex = 1)
segments(-10, tail(vals[[i+1]]$error, n = 1), tail(vals[[i+1]]$perc, n = 1), tail(vals[[i+1]]$error, n = 1),
col = cols[i], lwd = 2, lty = 2)
segments(tail(vals[[i+1]]$perc, n = 1), -10, tail(vals[[i+1]]$perc, n = 1), tail(vals[[i+1]]$error, n = 1),
col = cols[i], lwd = 2, lty = 2)
}
ind.last <- (tail(ind, n = 1 )+1):n
points(vals$all$perc[ind.last], vals$all$error[ind.last], col = 'black', pch = 16, cex = 1)
legend(5, max(vals$all$error)*0.98, legend = sapply(percs, function(x){paste(toString(x), '%',sep = '')}),
pch = 16, col = cols, title = "Error Percentile")
}
par(mar = c(4, 4, 2, 2), mgp = c(2.3, 1, 0))
error_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"), yaxt = 'n',
main = 'Weighted Radius Model Estimate Percent Errors',
ylab = 'Percent Error')
axis(side = 2, at = seq(0, 80, 5), labels = T)
From this plot, we see that 50% of our estimates fall within 7% error of their corresponding true values, 75% of estimates fall within 10% error, and 90% of estimates fall within 20% error. This gives us a more holistic idea about the performance of our \(R_w\) model.
We can make similar plots for both the \(\rho_1\) and \(\rho_2\) models:
out.param <- 'rho1'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
pred.full <- extractPrediction(rho1.models[3], testX = test.df.complete[,predictors], testY = test.df.complete[,out.param]) # Predictions from the RF model
pred <- pred.full[pred.full$dataType == 'Test',]
diff <- pred$obs - pred$pred
diff.perc <- diff*100/pred$obs
est_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"),
xlab = 'Simulation Input rho_1', ylab = 'Estimated rho_1',
main = 'Rho_1 Model Estimates vs True Values')
legend(0.21, 0.8, legend = sapply(c(50, 75, 90), function(x){paste(toString(x), '%',sep = '')}),
pch = 16, col = brewer.pal(3, "Set1"), title = "Error Percentile")
error_plot(diff.perc, c(50, 75, 90), brewer.pal(3, "Set1"), yaxt = 'n',
main = 'Rho_1 Model Estimate Percent Errors',
ylab = 'Percent Error')
axis(side = 2, at = seq(0, 105, 5), labels = T)
For the \(\rho_1\) model, we see that 50% of estimates fall within 15% error, 75% of estimates fall within about 30% error, and 90% of estimates fall within 65% error.
Lastly, we can look at plots from the \(\rho_2\) model. In these plots, we use absolute error in place of percent error, because values of \(\rho_2 \approx 0\) make percent error blow up to extremely large numbers:
out.param <- 'rho2'
predictors <- c("PC1","PC2","PC3","PC4","PC5")
pred.full <- extractPrediction(rho2.models[3], testX = test.df.complete[,predictors], testY = test.df.complete[,out.param]) # Predictions from the RF model
pred <- pred.full[pred.full$dataType == 'Test',]
diff <- abs(pred$obs - pred$pred)
est_plot(diff, c(50, 75, 90), brewer.pal(3, "Set1"),
xlab = 'Simulation Input rho_2', ylab = 'Estimated rho_2',
main = 'Rho_1 Model Estimates vs True Values')
legend(0.002, 0.021, legend = sapply(c(50, 75, 90), function(x){paste(toString(x), '%',sep = '')}),
pch = 16, col = brewer.pal(3, "Set1"), title = "Error Percentile")
error_plot(diff, c(50, 75, 90), brewer.pal(3, "Set1"), yaxt = 'n',
main = 'Rho_2 Model Estimates vs True Values',
ylab = 'Absolute Error')
axis(side = 2, at = seq(0, 0.01, 0.001), labels = T)
Here we will work through the process of performing the \(K(r)\)-based parameter estimation method on a real APT reconstruction. We will work with a ‘.pos’ file and associated ‘.rrng’ file, both of which are generated in and exported from Cameca’s IVAS software. We upload these files into R using the readPOS()
and readRRNG()
functions in rapt
.
pos <- readPOS('example.pos')
rng <- readRRNG('example.rrng')
Next we can use the rngPOS()
function to subset our pos
object down to only those points whose mass are within the ranges provided in the rng
object. We then use the rngCount()
function to count the number of points of each species type within our ranged reconstruction:
pos.r <- rngPOS(pos, rng)
cnts <- rngCount(pos.r, rng)
cnts.totals <- group_by(cnts, name) %>%
summarize(counts = sum(counts), fraction = sum(fraction))
cnts.totals
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat
## # A tibble: 3 x 3
## name counts fraction
## <fct> <int> <dbl>
## 1 Al1 1125200 0.949
## 2 Mg1 27643 0.0233
## 3 Zn1 33004 0.0278
The table above shows the different elements specified within the range file, along with their counts and atomic percentages within the ranged reconstruction. As in the paper, we will treat Mg and Zn as a single clustering species. We can compute this combined species total atomic fraction in the sample:
frac <- sum(cnts.totals$fraction[2:3])
frac
## [1] 0.05114235
And we see that 5.11% of the atoms in our sample are of the clustering species. Next, we use the createSpat()
function to turn our ranged position file into a pp3
object. We create a marked pp3
object for the entire pattern, as well as an un-marked pp3
object containing only points of the clustering species.
# Define a domain for the pp3 object
win.r <- box3(c(min(pos.r$x), max(pos.r$x)),
c(min(pos.r$y), max(pos.r$y)),
c(min(pos.r$z), max(pos.r$z)))
# Full marked pattern
pp3.full <- createSpat(pos.r, win = win.r)
marks(pp3.full) <- pos.r$mark
# Only clustering species points
pp3.aggregates <- subset(pp3.full, marks == 'Mg1' | marks == 'Zn1' )
Next, we scale our reconstruction by a factor of \(\alpha = 2.832\), for reasons discussed in the main paper. Also for consistency with the simulated data, we translate the coordinates so that the domain of our sample is bound within the first quadrant:
alpha <- 2.831518
# Scale coordinates
coo <- coords(pp3.full)
coo.scaled <- coo*alpha
# Translate coordinates
shifts <- c((domain(pp3.full)$xrange*alpha)[1],
(domain(pp3.full)$yrange*alpha)[1],
(domain(pp3.full)$zrange*alpha)[1])
coo.scaled$x <- coo.scaled$x - shifts[1]
coo.scaled$y <- coo.scaled$y - shifts[2]
coo.scaled$z <- coo.scaled$z - shifts[3]
# Translate domain
win.scaled <- box3((domain(pp3.full)$xrange*alpha) - shifts[1],
(domain(pp3.full)$yrange*alpha) - shifts[2],
(domain(pp3.full)$zrange*alpha) - shifts[3])
# Re-create scaled and shifted pp3 objects
pp3.full.scaled <- pp3(coo.scaled$x, coo.scaled$y, coo.scaled$z, win.scaled)
marks(pp3.full.scaled) <- pos.r$mark
pp3.agg.scaled <- subset(pp3.full.scaled, marks == 'Zn1' | marks == 'Mg1')
Now that our APT data is in the correct format, we are almost at the point where we can measure \(T(r)\) on the clustering species. However, we first need to measure the expected random signal for this UPP (because the UPP is no longer the RCP pattern used for simulations, we need to re-calculate what a random signal looks like). We will do this using 100 random relabelings of 5.11% of the new experimental UPP. The pK3est()
function allows us to do this easily, as we saw at the beginning of this document. This may take a few minutes to compute.
rrl.exp.data <- pK3est(perc = frac, pattern = pp3.full.scaled,
nEvals = 100, rmax = 17, nrval = 100,
correction="trans", anom=TRUE, sorted=FALSE,
cores = 8)
We can now calculate \(T(r)\) for our clustering species using this expected random signal:
t.agg <- anomK3est(pattern = pp3.agg.scaled, toSub = rrl.exp.data[[2]], rmax = 17, nrval = 100)
And then measure the \(T(r)\) metrics from the sample:
met.calc <- k3metrics(t.agg$r, t.agg$trans, toplot = TRUE, ylim = c(-0.5,3.5))
met.agg <- as.data.frame(t(unlist(met.calc)))
names(met.agg) <- c("Tm","Rm","Rdm","Rddm","Tdm")
met.agg
## Tm Rm Rdm Rddm Tdm
## 1 3.315087 5.838384 9.616162 7.383838 -0.4958785
And printed above we see the five \(T(r)\)-metrics from the APT reconstruction. We can now combine these metrics into the five principal components found from the training data, and plug these PCs into our models to obtain estimates for the weighted radius (\(R_w\)) and intra-cluster concentration (\(\rho_1\)) of clusters in the reconstruction.
reconstruction.pca <- predict(pp.train, met.agg) # Turn metrics into PCs
rw.estimate <- predict(rw.models[[1]], newdata = reconstruction.pca) # Get R_w estimate from best R_w model
rho1.estimate <- predict(rho1.models[[3]], newdata = reconstruction.pca) # Get rho_1 estimate from best rho_1 model
We now report the final estimates for \(R_w\) and \(\rho_1\).
\(R_w\) Estimate:
## 1
## 6.408299
\(\rho_1\) Estimate:
## 1
## 0.4591814
Given enough testing data, one could then calculate confidence bounds on these estimates. However, because in this vignette our testing data set was so small, it will not be possible here. Constructing these bounds is reliant on the idea that there are multiple data points within the testing set with similar parameters to those estimated from our experimental data set. We can use the variation in the testing data of these points with similar parameters as an estimate of the variance of the estimates for our data.
Remember that this vignette uses very small training and testing data sets to make it computationally feasible without a supercomputer, which is why the estimates presented here do not match the ones shown in the paper. Also remember that the \(R_w\) estimate here is scaled, and can be divided by \(\alpha\) to get back to units of the original data set.
The last part of our paper involves comparing the \(T(r)\)-metric based inference method to estimates made from the maximum separation algorithm (MSA). The msa()
function within the rapt
package allows for easy computation of the MSA on a marked pp3
object within the R environment. Find documentation for this function by running ?msa
.
We can do a quick example of how this function works using the cluster.noisy
simulation from earlier in this document. The second element of this lists (the second element of the list returned by the clustersim()
function) contains the marked pp3
object that we will pass into the msa()
function. Guidance for selection of the dmax
, Nmin
, denv
, and der
parameters is given by Vaumousse, Cerezo, and Warren (2003).
msa.noisy <- msa(cluster.noisy[[2]], dmax = 1.3, Nmin = 10, denv = 1.3, der = 1.3, clust.mark = c('A','B'))
We can then look at the clusters identified by the maximum separation algorithm compared to the true clusters present in the point pattern (we know the truth in this case because we simulated it). The red points in the plot below are the MSA found clusters, while the black points are the true clusters.
mfrow3d(nr = 1, nc = 2, sharedMouse = TRUE)
data.noisy <- as.data.frame(cluster.noisy[[2]]$data)
data.noisy <- data.noisy[marks(cluster.noisy[[2]]) == 'A',]
plot3d(data.noisy)
msa.data.noisy <- as.data.frame(cluster.noisy[[2]]$data)
msa.data.noisy <- msa.data.noisy[unlist(msa.noisy$A),]
plot3d(msa.data.noisy, col = 'red')
rglwidget(elementId = 'msanoisy')
From this plot, we can see that MSA does a relatively good job in this case of identifying the points that reside in each cluster. We know that the true parameters used were \(\mu_R = 4\), \(\rho_1 = 0.4\), and \(\rho_2 = 0.03\). We can compare these values to the values estimated by MSA:
mean(msa.noisy$radius)
## [1] 3.64859
mean(msa.noisy$den)
## [1] 0.6228271
msa.noisy$bgnd.den
## [1] 0.03080937
We see that the estimates are generally in the correct neighborhood. Care should be taken in trusting these MSA estimates in real data, as they are highly dependent on the parameters dmax
and Nmin
. This point is discussed with more detail in the main paper.
Now that we know how the msa()
function works, we can apply it to the experimental APT reconstruction from the previous section. This analysis will show how dependent the output from the MSA is on dmax
and Nmin
.
We will run the MSA on the marked points pattern contained in the pp3.full.scaled
object, which contains the scaled version of the experimental reconstruction. We will do this at multiple combinations of dmax
and Nmin
. This may take a few minutes to run:
nmins <- c(10, 15, 20)
dmaxs <- c(1.4, 1.7, 2.0)
params <- expand.grid(nmins, dmaxs)
msa.sweep <- lapply(1:nrow(params), function(i){
return(msa(pp3.full.scaled,
params[i,2],
params[i,1],
params[i,2],
params[i,2],
clust.mark = c('Mg1', 'Zn1')))
})
Now, we can plot the average cluster radius and average intra-cluster concentration identified by MSA as a function of both Nmin
and dmax
:
mfrow3d(nr = 1, nc = 2, sharedMouse = FALSE)
plot3d(msa.data$Nmin, msa.data$dmax, msa.data$rad, type = 's', col = 'blue',
xlab = 'Nmin', ylab = 'dmax', zlab = 'Average Cluster Radius')
plot3d(msa.data$Nmin, msa.data$dmax, msa.data$rho1, type = 's', col = 'red',
xlab = 'Nmin', ylab = 'dmax', zlab = 'Average Intra-Cluster Concentration')
rglwidget()
And we see that these summary statistics of what the MSA identifies as clusters vary significantly based on Nmin
and dmax
.
Instead of looking at these summaries, we could instead just look at the clusters themselves. The plots blow show the cluster species points identified as being in a cluster for each combination of Nmin
and dmax
.