/*
Title: Univariate Distribution
Author: NCI/Information Management Services
Date: 3/24/2025
*/

/*
This example demonstrates creating a univariate distribution using the NCI method.
*/

libname indata "./ncimultivar/data";

%include "./ncimultivar/macros/ncimultivar.sas";

/*
The first step is to look at the dataset used for the analysis, which is derived from the 2005-2010 NHANES data.
A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.

A univariate distribution will be created for sodium (TSODI).

Individual subjects are identified by SEQN. Each subject has up to two dietary recalls which are identified by DAY.

The covariates being examined are smoking status (SMK_REC), age (RIDAGEYR), and sex (RIAGENDR).
Two nuisance covariates are also accounted for: whether the recall was on a weekend (`Weekend`) and whether the recall was on day 2 (`Day2`).
They are considered nuisance covariates because they are not necessarily of interest to study but must be accounted for to get an accurate distribution.

The WTDRD1 variable is the survey weighting of each observation.

The functions in this package expect data in a tall format where each row represents one observation of a subject.
In some datasets, not all subjects will have the same number of observations. This is permitted as long as there are some subjects that have more than one observation. 
A good rule is to have at least 50 subjects with more than one observation with non-zero consumption.
*/

**subset data;
data input_dataset;
	set indata.nhcvd;
	if SDMVSTRA in (48 54 60 66 72 78);
	
	**Define indicator for Day 2;
	Day2 = (DAY = 2);
run;

proc print data=input_dataset (obs=10);

	var SEQN DAY TSODI SMK_REC RIDAGEYR RIAGENDR Day2 Weekend WTDRD1;
run;
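
/*
A minimal sketch, not part of the ncimultivar package: before modeling, it can be useful to check the rule of thumb
of at least 50 subjects with more than one observation with non-zero consumption.
The query below counts subjects in input_dataset that have two or more recalls with TSODI greater than zero.
The column name n_repeat_nonzero is chosen for illustration only.
*/

proc sql;
	select count(*) as n_repeat_nonzero
	from (select SEQN
				from input_dataset
				where TSODI > 0
				group by SEQN
				having count(*) >= 2);
quit;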

/*
Additionally, all subjects used in the analysis must be complete cases (i.e., each subject must have all of the covariates, nutrients, and foods that are being analyzed).
For this example, subjects who are missing one or more variables being analyzed will be removed from the dataset.

Categorical covariates (such as smoking status) need to be broken down into binary indicator (dummy) variables.
Since there are three categories, three binary indicators will be created. 
However, the reference group will not be included as a covariate in the model. 
This example will use never-smokers (SMK_REC = 3) as the reference group.
*/

data input_dataset;
	set input_dataset;
	
	**remove subjects that are missing any covariates or sodium intake;
	if not missing(SMK_REC)  and 
		 not missing(RIDAGEYR) and 
		 not missing(RIAGENDR) and
		 not missing(Weekend)  and
		 not missing(Day2)     and
		 not missing(TSODI);
	
	**break down smoking status into binary indicators;
	Current_Smoker = (SMK_REC = 1);
	Former_Smoker = (SMK_REC = 2);
	Never_Smoker = (SMK_REC = 3);
	
	**rename sodium variable for readability;
	Sodium = TSODI;
run;
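
/*
A quick check, offered as a sketch using base SAS only: cross-tabulating SMK_REC against the three indicators
should show exactly one indicator equal to 1 for each smoking category and no missing values.
*/

proc freq data=input_dataset;
	tables SMK_REC*Current_Smoker*Former_Smoker*Never_Smoker / list missing;
run;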

/*
The next step is to Winsorize extreme values of the sodium variable. 
The %boxcox_survey() macro has a helpful utility to suggest cutoffs for Winsorization.

By default, extreme values are those lying more than three times the interquartile range (IQR) below the 25th percentile or above the 75th percentile on the transformed scale.
This can be changed using the iqr_multiple parameter, which sets how many multiples of the IQR a value can lie beyond the 25th or 75th percentile before it is considered an outlier.
Lower IQR multiples lead to stricter Winsorization of the data since more values are identified as outliers.
For this example analysis, a cutoff of twice the IQR is used to illustrate the structure of the Winsorization report. 
For an actual analysis, the ideal Winsorization cutoff should be the least strict value that still removes problematic outliers.
This will have to be found through experimentation, though the default value of three times the IQR will usually suffice.


Note that the suggested Winsorization cutoffs are found using values in the first recall but are applied to the entire dataset.
Additionally, only positive values can be used since the Box-Cox transformation is invalid for zero and negative values. 
The %boxcox_survey() macro will only use positive values to find the best lambda. 
Since sodium is a daily consumed nutrient, values are expected to be above zero for all subjects.

For Winsorization, the id and repeat_obs parameters are needed to specify the subject and observation of each row of the dataset.
This allows the algorithm to identify which rows are in need of Winsorization and produce a report.
*/

**run Box-Cox survey on sodium variable to get suggested Winsorization cutoffs;
%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Sodium,
							 weight=WTDRD1,
							 do_winsorization=Y,
							 iqr_multiple=2,
							 id=SEQN,
							 repeat_obs=DAY);
							 
/*
The Winsorization report has identified extreme values of Sodium as being below 387.79 mg or above 13598.56 mg. 
Using these values, it is now possible to Winsorize the Sodium variable.
*/

data input_dataset;
	set input_dataset;
	
	Sodium = max(Sodium, 387.7924);
	Sodium = min(Sodium, 13598.5605);
run;
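
/*
As a sanity check (a sketch using base SAS only), the minimum and maximum of Sodium should now lie within the cutoffs applied above.
TSODI is shown alongside for comparison since it still holds the original, un-Winsorized values.
*/

proc means data=input_dataset n min max;
	var TSODI Sodium;
run;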

/*
The NCI method assumes that food and nutrient amounts have a normal distribution. 
The Sodium variable can be normalized using a Box-Cox transformation. 
A suitable transformation parameter can be found using the same %boxcox_survey() macro used to find the Winsorization cutoffs. 

Unlike with Winsorization, the macro will be run using the covariates that will be used in the model.
The covariates are specified using a space-delimited list.
*/

**run Box-Cox survey with covariates to get transformation parameter;
%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Sodium,
							 covariates=Current_Smoker Former_Smoker RIDAGEYR RIAGENDR Weekend,
							 weight=WTDRD1);
							 
/*
Next, the minimum amount for sodium is calculated using the %calculate_minimum_amount() macro. 
The minimum amount is calculated as half the smallest non-zero intake in the dataset.
This value becomes the minimum allowed usual intake when simulating usual intakes later in the workflow.
As with the Box-Cox lambda, the minimum amount is found using the first recall.
*/

%calculate_minimum_amount(input_data=input_dataset,
													row_subset=%quote(Day2 = 0),
													daily_variables=Sodium);
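
/*
It can help to inspect the Box-Cox lambda and the minimum amount before moving on.
This sketch assumes that the two macro calls above wrote their results to the datasets bc_Sodium and minimum_amount_data,
which are the names passed to the preprocessor in the next step. If different output names are in use, substitute them here.
*/

proc print data=bc_Sodium; run;
proc print data=minimum_amount_data; run;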
													
/*
Now, the %nci_multivar_preprocessor() macro can be used to generate the input dataset for the MCMC model. 
The preprocessor macro applies the Box-Cox transformation found by %boxcox_survey() to the sodium variable.
The transformed variable is then standardized to a mean of zero and a variance of 2 as required by the MCMC algorithm. 

Continuous covariates (such as age) should be standardized to a mean of zero and a variance of 1 for best results in the MCMC algorithm. 
The covariates to be standardized should be given in continuous_covariates as a space-delimited list.
The standardized covariates created by %nci_multivar_preprocessor() have the prefix std_.

The name of the model is specified by the outname parameter. This name is used as a prefix for all of the output datasets.
For example, the pre-processed dataset will be called model_mcmc_in since outname is set to model.
This name will be used as the input to the next macro.
*/

%nci_multivar_preprocessor(input_data=input_dataset,
													 daily_variables=Sodium,
													 continuous_covariates=RIDAGEYR,
													 boxcox_lambda_data=bc_Sodium,
													 minimum_amount_data=minimum_amount_data,
													 outname=model);
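
/*
A sketch for checking the standardized covariate in the pre-processed dataset model_mcmc_in.
The mean of std_RIDAGEYR should be close to 0 and its variance close to 1.
Small departures are possible if the macro standardizes using weights or a subset of rows; that detail is not examined here.
*/

proc means data=model_mcmc_in n mean var;
	var std_RIDAGEYR;
run;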
													 
/*
The next step is to fit the MCMC model.

Covariates to be used for every variable in the model are specified in default_covariates as a space-delimited list.
Note that the standardized version of the age covariate generated by the preprocessor (std_RIDAGEYR) is used in the MCMC model.

The number of iterations to run the MCMC must also be selected.
The iteration parameters are:

	num_mcmc_iterations: The total number of iterations to run the MCMC including burn-in
	num_burn: The number of burn-in iterations that will be discarded when calculating posterior means
	num_thin: The spacing between the iterations that are kept to calculate posterior means, which reduces autocorrelation between the retained iterations

The random seed for the MCMC is given in mcmc_seed.
Re-running the MCMC model using the same seed will produce the same results each time.
*/

%nci_multivar_mcmc(pre_mcmc_data=model,
									 id=SEQN,
									 repeat_obs=DAY,
									 weight=WTDRD1,
									 daily_variables=Sodium,
									 default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
									 num_mcmc_iterations=3000,
									 num_burn=1000,
									 num_thin=2,
									 mcmc_seed=9999,
									 outname=model);
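
/*
A quick arithmetic check on the iteration settings used above.
The number of thinned iterations kept for the posterior means is (num_mcmc_iterations - num_burn)/num_thin.
The chk_ macro variables are illustrative copies of the settings and are not read by the package.
*/

%let chk_iterations = 3000;
%let chk_burn = 1000;
%let chk_thin = 2;
%let chk_thinned = %sysevalf((&chk_iterations - &chk_burn) / &chk_thin);
%put NOTE: Number of thinned iterations = &chk_thinned;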
									 
/*
It is important to choose iteration parameters so that there are enough thinned samples to obtain stable estimates of the posterior means.
A good starting point is to make sure that the number of thinned iterations, calculated as (num_mcmc_iterations - num_burn)/num_thin, is at least 400.

Another consideration is whether the MCMC model has converged with the variables, covariates, and iteration parameters given.
Fortunately, there are some tests that can be used to assess the convergence of an MCMC model.

Trace plots are a quick, visual method of checking MCMC convergence.
A trace plot is a graph of the value of an MCMC parameter over each iteration.
An ideal trace plot fluctuates randomly around a stable level after burn-in, with no visible trend or drift.

Trace plots can be generated for a multivar MCMC model with the %trace_plots() utility.
The red line in the produced trace plots represents the end of the burn-in period.
*/

%trace_plots(multivar_mcmc_model=model);

/*
Sometimes, a more rigorous statistical assessment of convergence may be required.
The Gelman-Rubin test runs an MCMC model multiple times with different random seeds and compares the variation within and between the MCMC chains (Gelman and Rubin, 1992).
If an MCMC model has converged, then there should be little to no difference between the chains, so the between-chain variance should approach zero.

The Gelman-Rubin statistic is the square root of the ratio between the total variance and the within-chain variance of a parameter. 
Convergence is indicated by the Gelman-Rubin statistics for all of the parameters being close to 1. 
There is no exact rule for how close to 1 the statistics must be, though an upper bound of 1.1 for convergence is suggested by Gelman et al. (2004).
In general, iteration parameters should be chosen that get the Gelman-Rubin statistics as close to 1 as possible while balancing computation time.

The %gelman_rubin() macro computes Gelman-Rubin statistics for each MCMC parameter.
The number of MCMC chains to run for the test is selected with the num_chains parameter.
The initial random seed is specified by initial_mcmc_seed. This seed will be changed for each MCMC chain.
*/

%gelman_rubin(num_chains=5,
							pre_mcmc_data=model,
							id=SEQN,
							repeat_obs=DAY,
							weight=WTDRD1,
							daily_variables=Sodium,
							default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
							num_mcmc_iterations=3000,
							num_burn=1000,
							num_thin=2,
							initial_mcmc_seed=9999,
							outname=model_gelman_rubin);
							
proc print data=model_gelman_rubin; run;
									 
/*
The next step is to use the %nci_multivar_distrib() macro to simulate a dataset of usual intakes to be used to represent the distribution of true usual intakes. 
Note that this is not the same as predicting or calculating true usual intakes for each subject. 
Summary statistics can then be calculated for the distribution of simulated usual intakes.

A population-based dataset must be created that contains all of the subjects and additional variables that will be used for simulating usual intakes and downstream analysis. 
For this example, the population base will be the first instance of each subject in the MCMC input dataset.
*/

**get first instance of each subject;
proc sort data=model_mcmc_in; by SEQN; run;
data distrib_pop;
	set model_mcmc_in;
	by SEQN;
	if first.SEQN; 
run;
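
/*
A simple check, sketched with base SAS: at this point the population base should contain exactly one row per subject,
so the row count and the number of distinct SEQN values should match.
*/

proc sql;
	select count(*) as n_rows,
				 count(distinct SEQN) as n_subjects
	from distrib_pop;
quit;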

/*
Once the initial population base is created, nuisance variables must be accounted for.
To factor out the effect of the recall being conducted on day 2, the Day2 variable will be set to zero for all subjects.
*/

data distrib_pop;
	set distrib_pop;
	
	Day2 = 0;
run;

/*
To account for weekend vs weekday consumption, the simulated usual intakes for weekends and weekdays will be weighted and averaged for each subject.
To accomplish this, a repeat of each subject will be created in the population base. 
The first instance of each subject corresponds to weekday consumption (Weekend = 0) and the second instance corresponds to weekend consumption (Weekend = 1).
Since weekends are defined as three days in this dataset (Friday, Saturday, and Sunday), weekend consumption will be given a weight of 3 while weekday consumption will be given a weight of 4.
*/

**create a repeat of each subject and set weekend indicators and weights;
data distrib_pop;
	set distrib_pop;
	
	Weekend = 0;
	Weekend_Weight = 4;
	output;
	
	Weekend = 1;
	Weekend_Weight = 3;
	output;
run;
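
/*
To illustrate how the weights combine weekday and weekend intakes, here is a worked example with hypothetical values.
The two intake values are made up for illustration and do not come from the data.
The weighted average is (4*weekday + 3*weekend)/7, which matches the four weekdays and three weekend days defined above.
*/

data _null_;
	weekday_intake = 3000; **hypothetical weekday usual intake in mg;
	weekend_intake = 3500; **hypothetical weekend usual intake in mg;
	
	weighted_intake = (4*weekday_intake + 3*weekend_intake) / (4 + 3);
	put weighted_intake= 8.2;
run;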

/*
The population base and MCMC multivar parameters can now be used to create a distribution of simulated usual intakes. 
In this analysis, each subject will have 200 simulated usual intakes.

The id and weight parameters specify the subject identifier and weight from the population base to include in the output.
For population bases with multiple nuisance variable levels (such as Weekend), nuisance_weight specifies the weight for each level.
The number of usual intakes to simulate for each population base subject is num_simulated_u.

The random seed for usual intake simulation is given by distrib_seed.
Re-simulating the usual intakes with the same seed and the same MCMC model will produce the same simulated values.
*/

**create dataset with 200 simulated usual intakes for each subject;
%nci_multivar_distrib(multivar_mcmc_model=model,
											distrib_population=distrib_pop,
											id=SEQN,
											weight=WTDRD1,
											nuisance_weight=Weekend_Weight,
											num_simulated_u=200,
											distrib_seed=99999,
											outname=model_distrib_out);
											
/*
The distribution of simulated usual intakes can now be used for further analysis.

For this example, summary statistics will be calculated using the %nci_multivar_summary() macro.

The proportions are specified by creating datasets for the lower and upper intake thresholds.
Each dataset has two columns:

	variable: Name(s) of the variable(s) to calculate a proportion for
	threshold: The lower or upper limit(s) of intake
	
The lower_thresholds parameter is used for calculating the proportions of intakes above a lower limit.
Likewise, upper_thresholds is used for calculating the proportions of intakes below an upper limit.
*/

**datasets for lower and upper thresholds for proportions;
data lower;

	variable = "usl_Sodium";
	threshold = 2200;
run;

data upper;

	variable = "usl_Sodium";
	threshold = 3600;
run;

**compute means, quantiles, and proportions for simulated sodium intakes;
%nci_multivar_summary(input_data=model_distrib_out,
											variables=usl_Sodium,
											weight=WTDRD1,
											do_means=Y,
											do_quantiles=Y,
											quantiles=5 25 50 75 95,
											do_proportions=Y,
											lower_thresholds=lower,
											upper_thresholds=upper,
											outname=summary_stats);
											
proc print data=summary_stats; 

	title "Usual Intake Distribution of Sodium";
run;

/*
Note that the summary statistics dataset is structured so that all of the statistic values are in a single column. 
This will become important when using replicates to estimate standard errors.
*/