/*
Title: Trace Plots for NCI Multivar MCMC
Author: NCI/Information Management Services
Date: 6/9/2025
*/

/*
When performing an analysis with the NCI method, it is important to assess the measurement error model to ensure that it properly represents the distribution of the true parameters. 
The NCI method uses a Markov Chain Monte Carlo (MCMC) method to fit the measurement error model, so it is worthwhile to become familiar with common methods used to assess MCMC models. 
These methods can be used to diagnose issues in the model and can help guide refinements to it.

This vignette will demonstrate how to use trace plots to assess the models fit with the %nci_multivar_mcmc() macro. 
Plots for a good model as well as models with some common problems will be shown to build familiarity with the use of trace plots in error diagnosis.
*/

libname indata "../data";

%include "../macros/ncimultivar.sas";

/*
The dataset used for this analysis is derived from the 2005-2010 NHANES data.
A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.

Intakes of total grain (G_TOTAL) and refined grain (G_REFINED) will be modeled to demonstrate trace plots for different scenarios.

The covariates of interest are smoking status (SMK_REC), age (RIDAGEYR), and sex (RIAGENDR).
The nuisance covariates will be whether the recall was on a weekend (Weekend) and whether the recall was on day 2 (Day2).

The base weight variable (WTDRD1) will be used for all models.
 
For more information on the specifics of the data processing, please see the food and nutrient analysis vignette (food_and_nutrient_analysis.sas). 

Subjects with missing values are removed, and categorical variables are transformed into binary indicators. 
*/

**subset data;
data input_dataset;
	set indata.nhcvd;
	if SDMVSTRA in (48, 54, 60, 66, 72, 78);
	
	**Define indicator for Day 2;
	Day2 = (DAY = 2);
run;

data input_dataset;
	set input_dataset;
	
	**remove subjects that are missing any covariates or variables;
	if not missing(SMK_REC) and
		 not missing(RIDAGEYR) and
		 not missing(RIAGENDR) and
		 not missing(Weekend) and
		 not missing(Day2) and
		 not missing(G_TOTAL) and
		 not missing(G_REFINED);
		 
	**break down smoking status into binary indicators;
	Current_Smoker = (SMK_REC = 1);
	Former_Smoker = (SMK_REC = 2);
	Never_Smoker = (SMK_REC = 3);
	
	**rename variables for readability;
	Total_Grain = G_TOTAL;
	Refined_Grain = G_REFINED;
run;

/*
The variables will now be transformed and standardized for use in the MCMC models.
*/

**Winsorize extreme values;
%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Total_Grain,
							 weight=WTDRD1,
							 do_winsorization=Y,
							 id=SEQN,
							 repeat_obs=DAY);
							 
%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Refined_Grain,
							 weight=WTDRD1,
							 do_winsorization=Y,
							 id=SEQN,
							 repeat_obs=DAY);
							 
data input_dataset;
	set input_dataset;
	
	**cap Total_Grain at a hard-coded upper Winsorization limit;
	Total_Grain = min(Total_Grain, 51.05915);
run;

**Find best Box-Cox lambdas for each variable;
%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Total_Grain,
							 covariates=Current_Smoker Former_Smoker RIDAGEYR RIAGENDR Weekend,
							 weight=WTDRD1);
							 
%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Refined_Grain,
							 covariates=Current_Smoker Former_Smoker RIDAGEYR RIAGENDR Weekend,
							 weight=WTDRD1);
							 
data boxcox_lambda_data;
	set bc_Total_Grain
			bc_Refined_Grain;
run;

**calculate minimum amounts;
%calculate_minimum_amount(input_data=input_dataset,
													row_subset=%quote(Day2 = 0),
													daily_variables=Total_Grain Refined_Grain);
													
**run pre-processor to get MCMC input data;
%nci_multivar_preprocessor(input_data=input_dataset,
													 daily_variables=Total_Grain Refined_Grain,
													 continuous_covariates=RIDAGEYR,
													 boxcox_lambda_data=boxcox_lambda_data,
													 minimum_amount_data=minimum_amount_data,
													 outname=model);
													 
/*
The first example will be of a model that has mixed well. 
The %trace_plots() macro will be used to create trace plots for assessing the MCMC model.

The trace plots of the values for beta, sigma-e, and sigma-u have a vertical red line part of the way along the x-axis. 
This line marks the end of the burn-in period, during which the MCMC is expected to be unstable. 
Iterations before the line (i.e., within the burn-in period) are discarded. 
The stability of the parameters should be assessed after the burn-in. 

An ideal trace plot should look like random movement around a flat horizontal line. 
The y-value of the line represents the mean of the distribution and the amplitude of the random movement represents the variance. 
The values of the mean and variance do not matter for assessing the plot. 
As long as the mean and variance appear stable over time, the model is said to mix well.
*/

%nci_multivar_mcmc(pre_mcmc_data=model,
									 id=SEQN,
									 repeat_obs=DAY,
									 weight=WTDRD1,
									 daily_variables=Total_Grain,
									 default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
									 num_mcmc_iterations=3000,
									 num_burn=1000,
									 num_thin=2,
									 mcmc_seed=9999,
									 outname=good);

ods listing close;

ods pdf file="good_model.pdf";								 
	%trace_plots(multivar_mcmc_model=good);
ods pdf close;

ods listing;

/*
The following example demonstrates a model with too few burn-in iterations.

Many of the parameters do not have stable estimates with the reduced number of burn-in iterations. 
The trace plots show that the center of the random movement (i.e., the mean) and the intensity of the fluctuations (i.e., the variance) do not stay constant over time.
*/

%nci_multivar_mcmc(pre_mcmc_data=model,
									 id=SEQN,
									 repeat_obs=DAY,
									 weight=WTDRD1,
									 daily_variables=Total_Grain,
									 default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
									 num_mcmc_iterations=300,
									 num_burn=100,
									 num_thin=2,
									 mcmc_seed=9999,
									 outname=low_burn);

ods listing close;

ods pdf file="low_burn_in.pdf";								 
	%trace_plots(multivar_mcmc_model=low_burn);
ods pdf close;

ods listing;

/*
Another potential issue is an insufficient number of subjects with more than one recall.

The effect of the low sample size can be seen in all of the trace plots, but it is especially apparent in the variance components (Sigma-e and Sigma-u). 
While there isn't a clear trend up or down, the center of the trace appears choppy, which shows that the mean is not able to settle on a value even well after the burn-in period.

Note that the number of subjects with one recall does not have any bearing on model convergence. 
The limiting factor is the number of subjects that have two or more observations. 
As a general rule, at least 50 subjects should have two or more observations for proper mixing. 
More complicated relationships in the data (such as high correlation between different variance components) will require more subjects to fit. 
*/

**subset input dataset so that only the first 10 subjects have a second recall;
data first_recall
		 second_recall;
	set input_dataset;
	
	if Day2 = 0 then output first_recall;
	else if Day2 = 1 then output second_recall;
run;

data second_recall;
	set second_recall;
	if _N_ <= 10;
run;

data input_subset;
	set first_recall
			second_recall;
run;
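
/*
As a quick check before fitting, the number of subjects with two or more recalls can be counted directly.
This is a minimal sketch that assumes each row of input_subset is one recall and that SEQN identifies the subject, as in the macro calls in this vignette. 
The column name num_repeat_subjects is only illustrative.
For this subset the count should be well below the suggested minimum of 50.
*/

proc sql;
	**count subjects that contribute at least two recalls;
	select count(*) as num_repeat_subjects
	from (select SEQN
				from input_subset
				group by SEQN
				having count(*) >= 2);
quit;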

%nci_multivar_preprocessor(input_data=input_subset,
													 daily_variables=Total_Grain,
													 continuous_covariates=RIDAGEYR,
													 boxcox_lambda_data=boxcox_lambda_data,
													 minimum_amount_data=minimum_amount_data,
													 outname=subset);
													 
%nci_multivar_mcmc(pre_mcmc_data=subset,
									 id=SEQN,
									 repeat_obs=DAY,
									 weight=WTDRD1,
									 daily_variables=Total_Grain,
									 default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
									 num_mcmc_iterations=3000,
									 num_burn=1000,
									 num_thin=2,
									 mcmc_seed=9999,
									 outname=subset);
									 
ods listing close;

ods pdf file="small_sample.pdf";									 
	%trace_plots(multivar_mcmc_model=subset);
ods pdf close;

ods listing;

/*
Variables that are highly correlated will affect the model fit significantly. 
For example, a variable that is a subset of another variable will cause problems with convergence. 
This can result from not properly disaggregating variables during pre-processing.

As with the case of low effective sample size, the value trace plots do not have a clear trend but have a choppy appearance and do not settle around a value. 
Additionally, the burn-in period appears much longer. 
If the number of subjects with two or more observations is high, then lack of convergence without a clear trend could indicate that one or more variables are related to each other.
*/

%nci_multivar_mcmc(pre_mcmc_data=model,
									 id=SEQN,
									 repeat_obs=DAY,
									 weight=WTDRD1,
									 daily_variables=Total_Grain Refined_Grain,
									 default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
									 num_mcmc_iterations=3000,
									 num_burn=1000,
									 num_thin=2,
									 mcmc_seed=9999,
									 outname=corr);
									 
ods listing close;

ods pdf file="correlated_model.pdf";									 
	%trace_plots(multivar_mcmc_model=corr);
ods pdf close;

ods listing;
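
/*
One way to screen for this problem before fitting is to look at the correlation between the daily variables.
The sketch below is only a rough check, assuming that a weighted Pearson correlation on the first recall is an adequate summary. 
It is not part of the NCI method macros, and no particular cutoff is implied.
*/

proc corr data=input_dataset(where=(Day2 = 0)) pearson;
	var Total_Grain Refined_Grain;
	weight WTDRD1;
run;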

/*
The graphical methods of model assessment shown above are a good way both to check for convergence and to diagnose errors. 
However, sometimes a more rigorous assessment of convergence is required. 
In this case, other methods such as the Gelman-Rubin test can be used. 
The Gelman-Rubin test works by creating multiple MCMC chains with different random seeds and calculating the within-chain and between-chain variation (Gelman and Rubin, 1992). 
If the model parameters converge, there should be little to no difference between the chains, which will cause the between-chain variance to fall toward zero. 

The Gelman-Rubin statistic is the square root of the ratio between the total variance and the within-chain variance of a parameter. 
Convergence is indicated by the Gelman-Rubin statistics for all of the parameters being close to 1. 
There is no exact rule for how close to 1 the statistics must be, though an upper bound of 1.1 for convergence is suggested by Gelman et al. (2004).
In general, iteration parameters should be chosen that get the Gelman-Rubin statistics as close to 1 as possible while balancing computation time.

The advantage of the Gelman-Rubin test is that it is a more rigorous assessment of convergence, since it analyzes the variances of the parameters rather than relying on a subjective assessment of a graph. 
The trade-off is that multiple chains must be created, which makes this method much more computationally intensive and less suited to a quick diagnostic test on an already-fitted model. 
Additionally, different patterns in trace plots can give more insight into why a model is not converging as demonstrated above. 
The best practice is to use both methods for a more thorough assessment.

The %gelman_rubin() macro computes Gelman-Rubin statistics for each MCMC parameter.
The number of MCMC chains to run for the test is selected with the num_chains parameter.
The initial random seed is specified by initial_mcmc_seed. This seed will be changed for each MCMC chain.

The Gelman-Rubin test can be used to show that the good model shown earlier in the vignette has converged. 
*/

%gelman_rubin(num_chains=5,
							pre_mcmc_data=model,
							id=SEQN,
							repeat_obs=DAY,
							weight=WTDRD1,
							daily_variables=Total_Grain,
							default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
							num_mcmc_iterations=3000,
							num_burn=1000,
							num_thin=2,
							initial_mcmc_seed=9999,
							outname=model_gelman_rubin);
							
proc print data=model_gelman_rubin; run;
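
/*
To make the definition of the statistic concrete, the sketch below computes it by hand for a single toy parameter.
It uses the classic formulation, in which the total variance is estimated as ((n-1)/n)*W + B/n, where W is the mean of the within-chain variances, B/n is the variance of the chain means, and n is the number of iterations per chain.
The toy_chains data and all variable names are made up for illustration, and the exact computation inside %gelman_rubin() may differ.
Because the three toy chains are drawn from the same distribution, the result should be close to 1.
*/

**simulate three independent chains from the same distribution;
data toy_chains;
	call streaminit(1234);
	do chain = 1 to 3;
		do iteration = 1 to 500;
			value = rand("normal", 0, 1);
			output;
		end;
	end;
run;

**mean, variance, and length of each chain;
proc means data=toy_chains noprint nway;
	class chain;
	var value;
	output out=chain_stats mean=chain_mean var=chain_var n=chain_n;
run;

**square root of the ratio of total variance to within-chain variance;
proc sql;
	select sqrt((((mean(chain_n) - 1)/mean(chain_n))*mean(chain_var) + var(chain_mean))/mean(chain_var)) as gelman_rubin
	from chain_stats;
quit;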
