Trace Plots for NCI Multivar MCMC

Amer Moosa

3/12/2024

When performing an analysis with the NCI method, it is important to assess the measurement error model to ensure that it properly represents the distribution of the true parameters. The NCI method uses a Markov Chain Monte Carlo (MCMC) method to fit the measurement error model, so it worthwhile to become familiar with common methods used to assess MCMC models. These methods can be used to diagnose issues in the model and can help guide refinements to it.

This vignette will demonstrate how to use trace plots to assess the models fit with the nci_multivar_mcmc() function. Plots for a good model as well as models with some common problems will be shown to build familiarity with the use of trace plots in error diagnosis.

library(ncimultivar)

The dataset used for this analysis is derived from the 2005-2010 NHANES data. A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.

Intakes of total grain (G_TOTAL) and refined grain (G_REFINED) will be modeled to demonstrate trace plots for different scenarios.

The covariates of interest are smoking status (SMK_REC), age (RIDAGEYR), and sex (RIAGENDR). The nuisance covariates will be whether the recall was on a weekend (Weekend) or whether the recall was on day 2 (Day2).

The base weight variable (WTDRD1) will be used for all models.

For more information on the specifics of the data processing, please see the food and nutrient analysis vignette.

Subjects with missing values are removed, and categorical variables are transformed into binary indicators.

#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]

#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)

#remove subjects that are missing any covariates or variables
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) | 
                      is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$G_TOTAL) | is.na(input.dataset$G_REFINED)

input.dataset <- input.dataset[!missing.covariates & !missing.variables,]

#break down smoking status into binary indicators
input.dataset$Current.Smoker <- as.numeric(input.dataset$SMK_REC == 1)
input.dataset$Former.Smoker <- as.numeric(input.dataset$SMK_REC == 2)
input.dataset$Never.Smoker <- as.numeric(input.dataset$SMK_REC == 3)

#rename variables for readability
input.dataset$Total.Grain <- input.dataset$G_TOTAL
input.dataset$Refined.Grain <- input.dataset$G_REFINED

The variables will now be transformed and standardized for use in the MCMC models.

#Winsorize extreme values
wins.total.grain <- boxcox_survey(input.data=input.dataset,
                                  row.subset=(input.dataset$Day2 == 0),
                                  variable="Total.Grain",
                                  weight="WTDRD1",
                                  do.winsorization=TRUE,
                                  id="SEQN",
                                  repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Total.Grain
#>   SEQN DAY Total.Grain Total.Grain.winsorized
#>  40879   1       55.92               51.05915

wins.refined.grain <- boxcox_survey(input.data=input.dataset,
                                    row.subset=(input.dataset$Day2 == 0),
                                    variable="Refined.Grain",
                                    weight="WTDRD1",
                                    do.winsorization=TRUE,
                                    id="SEQN",
                                    repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Refined.Grain
#> [1] SEQN                     DAY                      Refined.Grain           
#> [4] Refined.Grain.winsorized
#> <0 rows> (or 0-length row.names)

input.dataset$Total.Grain <- pmin(input.dataset$Total.Grain, 51.05915)


#Find best Box-Cox lambdas for each variable
boxcox.total.grain <- boxcox_survey(input.data=input.dataset,
                                    row.subset=(input.dataset$Day2 == 0),
                                    variable="Total.Grain",
                                    covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
                                    weight="WTDRD1")

boxcox.refined.grain <- boxcox_survey(input.data=input.dataset,
                                      row.subset=(input.dataset$Day2 == 0),
                                      variable="Refined.Grain",
                                      covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
                                      weight="WTDRD1")

boxcox.lambda.data <- rbind(boxcox.total.grain, boxcox.refined.grain)


#calculate minimum amounts
minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
                                                row.subset=(input.dataset$Day2 == 0),
                                                daily.variables=c("Total.Grain", "Refined.Grain"))

#run pre-processor to get MCMC input data
pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
                                           daily.variables=c("Total.Grain", "Refined.Grain"),
                                           continuous.covariates="RIDAGEYR",
                                           boxcox.lambda.data=boxcox.lambda.data,
                                           minimum.amount.data=minimum.amount.data)

The first example will be of a model that has mixed well. The trace_plots() function will be used to create trace plots for the MCMC model that can be used to assess it.

The trace plots of the values for beta, sigma-e, and sigma-u have a vertical red line part of the way through the x-axis. This represents the burn-in period where the MCMC is expected to be unstable. Any iterations before the burn-in period are discarded. The stability of the parameters should be assessed after the burn-in.

An ideal trace plot should look like random movement around a flat horizontal line. The y-value of the line represents the mean of the distribution and the amplitude of the random movement represents the variance. The values of the mean and variance do not matter for assessing the plot. As long as the mean and variance appear stable over time, the model is said to mix well.

good.model <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
                                id="SEQN",
                                repeat.obs="DAY",
                                weight="WTDRD1",
                                daily.variables="Total.Grain",
                                default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                                num.mcmc.iterations=3000,
                                num.burn=1000,
                                num.thin=2,
                                mcmc.seed=9999)

trace_plots(multivar.mcmc.model=good.model)

#> NULL

The following example demonstrates a model with too few burn-in iterations.

Many of the parameters do not have stable estimates with the reduced number of burn-in iterations. The trace plots show that the center of the random movement (i.e., the mean) and the intensity of the fluctuations (i.e., the variance) does not stay constant over time.

low.burn.in <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
                                 id="SEQN",
                                 repeat.obs="DAY",
                                 weight="WTDRD1",
                                 daily.variables="Total.Grain",
                                 default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                                 num.mcmc.iterations=300,
                                 num.burn=100,
                                 num.thin=2,
                                 mcmc.seed=9999)

trace_plots(multivar.mcmc.model=low.burn.in)

#> NULL

Another potential issue is an insufficient number of subjects with more than one recall.

The effect of the low sample size can be seen in all of the trace plots, but it is especially apparent in the variance components (Sigma-e and Sigma-u). While there isn’t a clear trend up or down, the center of the trace appears “choppy” which shows that the mean is not able to settle on a value even well after the burn-in period.

Note that the number of subjects with one recall does not have any bearing on model convergence. The limiting factor is the number of subjects that have two or more observations. As a general rule, at least 50 subjects should have two or more observations for proper mixing. More complicated relationships in the data (such as high correlation between different variance components) will require more subjects to fit.

#subset input dataset so that only the first 10 subjects have a second recall
first.recall <- input.dataset[input.dataset$Day2 == 0,]
second.recall <- input.dataset[input.dataset$Day2 == 1,][1:10,]
input.subset <- rbind(first.recall, second.recall)

pre.mcmc.subset <- nci_multivar_preprocessor(input.data=input.subset,
                                             daily.variables="Total.Grain",
                                             continuous.covariates="RIDAGEYR",
                                             boxcox.lambda.data=boxcox.lambda.data,
                                             minimum.amount.data=minimum.amount.data) 

small.sample <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.subset,
                                  id="SEQN",
                                  repeat.obs="DAY",
                                  weight="WTDRD1",
                                  daily.variables="Total.Grain",
                                  default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                                  num.mcmc.iterations=3000,
                                  num.burn=1000,
                                  num.thin=2,
                                  mcmc.seed=9999)

trace_plots(multivar.mcmc.model=small.sample)

#> NULL

Variables that are highly correlated will affect the model fit significantly. For example, a variable that is a subset of another variable will cause problems with convergence. This can result from not properly disaggregating variables during pre-processing.

As with the case of low effective sample size, the value trace plots do not have a clear trend but have a “choppy” appearance and do not settle around a value. Additionally, the burn-in period appears much longer. If the number of subjects with two or more observations is high, then lack of convergence without a clear trend could indicate that one or more variables are related to each other.

correlated.model <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
                                      id="SEQN",
                                      repeat.obs="DAY",
                                      weight="WTDRD1",
                                      daily.variables=c("Total.Grain", "Refined.Grain"),
                                      default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                                      num.mcmc.iterations=3000,
                                      num.burn=1000,
                                      num.thin=2,
                                      mcmc.seed=9999)

trace_plots(multivar.mcmc.model=correlated.model)

#> NULL

The graphical methods of model assessment shown above are a good way to both check for convergence and also to diagnose errors. However, sometimes a more rigorous assessment of convergence is required. In this case, other methods such as the Gelman-Rubin test can be used. The Gelman-Rubin test works by creating multiple MCMC chains with different random seeds and calculating the within-chain and between-chain variation (Gelman and Rubin, 1992). If the model parameters converge, there should be little to no difference between different chains which will cause the between-chain variance to fall to zero.

The Gelman-Rubin statistic is the square root of the ratio between the total variance and the within-chain variance of a parameter. Convergence is indicated by the Gelman-Rubin statistics for all of the parameters being close to 1. There is no exact rule for how close to 1 the statistics must be, though an upper bound of 1.1 for convergence is suggested by Gelman, et al. (2004). In general, iteration parameters should be chosen that get the Gelman-Rubin statistics as close to 1 as possible while balancing computation time.

The advantage to the Gelman-Rubin test is that it is a more rigorous assessment of convergence since it actually analyzes the variances of the parameters as opposed to a subjective assessment of a graph. The trade off for this is the need for multiple chains to be created which makes this method much more computationally intensive and less suited for a quick diagnostic test on an already-fitted model. Additionally, different patterns in trace plots can give more insight into why a model is not converging as demonstrated above. The best practice is to use both methods for a more thorough assessment.

The gelman_rubin() function computes Gelman-Rubin statistics for each MCMC parameter. The number of MCMC chains to run for the test is selected with the num.chains parameter. The initial random seed is specified by initial.mcmc.seed. This seed will be changed for each MCMC chain.

The Gelman-Rubin test can be used to show that the “good model” shown earlier in the vignette has converged.

gr.statistics <- gelman_rubin(num.chains=5,
                              pre.mcmc.data=pre.mcmc.data,
                              id="SEQN",
                              repeat.obs="DAY",
                              weight="WTDRD1",
                              daily.variables="Total.Grain",
                              default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                              num.mcmc.iterations=3000,
                              num.burn=1000,
                              num.thin=2,
                              initial.mcmc.seed=9999)

gr.statistics
#>          beta.amt.Total.Grain.intercept     beta.amt.Total.Grain.Current.Smoker 
#>                                1.000844                                1.001135 
#>      beta.amt.Total.Grain.Former.Smoker       beta.amt.Total.Grain.std.RIDAGEYR 
#>                                1.001358                                1.001057 
#>           beta.amt.Total.Grain.RIAGENDR               beta.amt.Total.Grain.Day2 
#>                                1.000885                                1.001208 
#>            beta.amt.Total.Grain.Weekend sigma.e.amt.Total.Grain.amt.Total.Grain 
#>                                1.000697                                1.000683 
#> sigma.u.amt.Total.Grain.amt.Total.Grain 
#>                                1.000695