/*
Title: NCI Method Food and Nutrient
Author: NCI/Information Management Services
Date: 6/9/2025
*/

/*
The NCI method can be used to model multiple variables at once, including both episodically consumed foods (referred to as foods) and daily consumed nutrients (referred to as nutrients).
This vignette demonstrates how to perform a simple multivariate analysis using one food and one nutrient. 

This vignette assumes basic familiarity with the NCI method and the ncimultivar package. 
A more in-depth introduction to the basic NCI method procedures can be found in the daily nutrient vignette (daily_nutrient_analysis.sas).

The workflow for an analysis involving episodic foods is as follows:

1. Prepare the input dataset variables for analysis and Winsorize using %boxcox_survey().
2. Find Box-Cox transformation parameters for the food and nutrient using %boxcox_survey().
3. Transform, standardize, and create indicators for the variables to be analyzed using %nci_multivar_preprocessor().
4. Fit a measurement error model using %nci_multivar_mcmc().
5. Create a dataset of simulated usual intakes with %nci_multivar_distrib().
6. Compute summary statistics such as means, quantiles, and proportions using %nci_multivar_summary().

Some of the steps are modified to accommodate non-consumption days for episodically consumed foods.
Standard errors and confidence intervals for the calculated statistics will be generated by resampling with balanced repeated replication (BRR).
*/

libname indata "../data";

%include "../macros/ncimultivar.sas";

/*
The dataset used for this analysis is derived from the 2005-2010 NHANES data.
A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.

The subject identifier is SEQN and each observation for a subject is identified by DAY.

The food being analyzed in this example is whole grains (G_WHOLE). The nutrient being analyzed is energy (TKCAL).

The covariates being examined are smoking status (SMK_REC), age (RIDAGEYR), and sex (RIAGENDR). 
Two nuisance covariates will be factored in as well: whether the recall was on a weekend (Weekend) and whether the recall was on day 2 (Day2).

The WTDRD1 variable is the base weight for each observation.

Subjects with missing values are removed, and categorical variables are transformed into binary indicators. 
*/

**subset data;
data input_dataset;
	set indata.nhcvd;
	if SDMVSTRA in (48, 54, 60, 66, 72, 78);
	
	**Define indicator for Day 2;
	Day2 = (DAY = 2);
run;

data input_dataset;
	set input_dataset;
	
	**remove subjects that are missing any covariates or variables;
	if not missing(SMK_REC) and
		 not missing(RIDAGEYR) and
		 not missing(RIAGENDR) and
		 not missing(Weekend) and
		 not missing(Day2) and
		 not missing(G_WHOLE) and
		 not missing(TKCAL);
		 
	**break down smoking status into binary indicators;
	Current_Smoker = (SMK_REC = 1);
	Former_Smoker = (SMK_REC = 2);
	Never_Smoker = (SMK_REC = 3);
	
	**rename whole grain and energy variables for readability;
	Whole_Grain = G_WHOLE;
	Energy = TKCAL;
run;
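
/*
As an optional check, the number of recalls per subject and the share of recalls with reported whole grain consumption can be tabulated.
This is a minimal, unweighted sketch that is not part of the NCI method workflow.
The dataset names recall_counts and consumption_check and the variable Consumed are illustrative assumptions.
*/

**count recalls per subject;
proc freq data=input_dataset noprint;
	tables SEQN / out=recall_counts (keep = SEQN COUNT);
run;

proc freq data=recall_counts;
	tables COUNT;
run;

**flag recalls with non-zero whole grain intake;
data consumption_check;
	set input_dataset;
	
	Consumed = (Whole_Grain > 0);
run;

proc freq data=consumption_check;
	tables DAY*Consumed;
run;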

/*
BRR weights will be added to the dataset.
*/

%let fay_factor = 0.7;

%brr_weights(input_data=input_dataset,
						 id=SEQN,
						 strata=SDMVSTRA,
						 psu=SDMVPSU,
						 cell=PSCELL,
						 weight=WTDRD1,
						 fay_factor=&fay_factor.,
						 outname=input_dataset);
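
/*
The replicate weights can be inspected before they are used.
This sketch assumes, based on how the weights are referenced in the later steps, that %brr_weights() names them with the RepWt_ prefix (RepWt_0 is the weight used below for the point estimate).
*/

**summarize every variable whose name starts with RepWt_;
proc means data=input_dataset n nmiss min mean max;
	var RepWt_:;
run;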

/*					 
Next, extreme observations will be identified for Winsorization. 
The %boxcox_survey() macro must be run for both variables. 
The non-zero values of the first recall will be used. 

Since whole grain is an episodic variable, Winsorization for low outliers is done slightly differently. 
Instead of changing values that are too small to a minimum threshold, they are set to zero and treated as a non-consumption observation. 
This behavior is toggled by setting the is_episodic parameter to Y.
*/

%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Whole_Grain,
							 is_episodic=Y,
							 weight=RepWt_0,
							 do_winsorization=Y,
							 iqr_multiple=2,
							 id=SEQN,
							 repeat_obs=DAY);

%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Energy,
							 weight=RepWt_0,
							 do_winsorization=Y,
							 iqr_multiple=2,
							 id=SEQN,
							 repeat_obs=DAY);
							 
data input_dataset;
	set input_dataset;
	
	**Winsorize whole grain;
	Whole_Grain = min(Whole_Grain, 10.71163);
	
	**Winsorize energy;
	Energy = max(Energy, 269.0701);
	Energy = min(Energy, 8026.0436);
run;
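
/*
A quick way to confirm that the Winsorization was applied is to check the range of each variable.
In this sketch, the maximum of Whole_Grain should be no larger than 10.71163, and Energy should fall between 269.0701 and 8026.0436, the thresholds used in the data step above.
*/

proc means data=input_dataset n nmiss min max;
	var Whole_Grain Energy;
run;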

/*
Next, the best Box-Cox lambda parameter for each variable can be found using the Winsorized data in the presence of covariates. 
A Box-Cox survey must be run for each variable used in the model, and the output datasets are concatenated.
*/

%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Whole_Grain,
							 covariates=Current_Smoker Former_Smoker RIDAGEYR RIAGENDR Weekend,
							 weight=RepWt_0);
							 
%boxcox_survey(input_data=input_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Energy,
							 covariates=Current_Smoker Former_Smoker RIDAGEYR RIAGENDR Weekend,
							 weight=RepWt_0);
							 
data boxcox_lambda_data;
	set bc_Whole_Grain
			bc_Energy;
run;
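
/*
The concatenated dataset can be printed to confirm that it has one set of Box-Cox parameters for each variable.
This is an optional check; all columns are printed because the exact layout is determined by %boxcox_survey().
*/

proc print data=boxcox_lambda_data; run;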

/*
Next, the minimum amounts for whole grain and energy can be calculated using the first recall. 

The %calculate_minimum_amount() macro can handle multiple variables at the same time.
Episodically consumed foods are specified with episodic_variables while daily consumed nutrients are specified with daily_variables.
If there are multiple daily and/or episodic variables, they should be specified as space-delimited lists in their respective parameters.
*/

%calculate_minimum_amount(input_data=input_dataset,
													row_subset=%quote(Day2 = 0),
													episodic_variables=Whole_Grain,
													daily_variables=Energy);
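
/*
The calculated minimum amounts can be reviewed before preprocessing.
This sketch assumes the macro writes its results to a dataset named minimum_amount_data, which is the name passed to the preprocessor in the next step.
*/

proc print data=minimum_amount_data; run;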
													
/*
The %nci_multivar_preprocessor() macro can now generate the MCMC input data. 
Whole grains should be input in the episodic_variables parameter to tell the macro to generate a consumption indicator variable in addition to an amount variable.

The episodic_variables and daily_variables parameters are space-delimited lists of episodically consumed foods and daily consumed nutrients, respectively.

Continuous covariates to be standardized are specified with continuous_covariates, just as in the daily nutrient vignette (daily_nutrient_analysis.sas).
*/

%nci_multivar_preprocessor(input_data=input_dataset,
													 episodic_variables=Whole_Grain,
													 daily_variables=Energy,
													 continuous_covariates=RIDAGEYR,
													 boxcox_lambda_data=boxcox_lambda_data,
													 minimum_amount_data=minimum_amount_data,
													 outname=model);

/*
The preprocessor creates the indicator variable ind_Whole_Grain and the amount variable amt_Whole_Grain for Whole_Grain.
The indicator variable shows whether or not the subject reported consuming whole grains on a particular day. 
The amount is always non-zero when the indicator is 1 and missing when the indicator is 0.

The first 10 observations demonstrate the relationship between indicators and amounts.
*/

proc print data=model_mcmc_in (obs = 10);
	
	var SEQN DAY ind_Whole_Grain amt_Whole_Grain;
run;
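
/*
The same relationship can be checked across the whole dataset rather than the first 10 observations.
In this sketch, the amount should be entirely missing (N = 0) when ind_Whole_Grain is 0 and never missing (NMISS = 0) when it is 1.
*/

proc means data=model_mcmc_in n nmiss min max;
	class ind_Whole_Grain;
	var amt_Whole_Grain;
run;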

/*
The MCMC model can now be fit using both whole grains and energy for all of the BRR replicates.

Episodic and daily variables are specified in episodic_variables and daily_variables as in the previous steps.

Note that the numbers of iterations and burn-in iterations are higher than for a single daily consumed nutrient.
This is because models with episodic variables and multivariate models take more iterations to converge.
*/

%let num_brr = 8;

%macro mcmc_brr(num_brr=);

	%do brr_rep = 0 %to &num_brr.;
	
		%put Starting Iteration &brr_rep.;
		
		%nci_multivar_mcmc(pre_mcmc_data=model,
											 id=SEQN,
											 repeat_obs=DAY,
											 weight=RepWt_&brr_rep.,
											 episodic_variables=Whole_Grain,
											 daily_variables=Energy,
											 default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
											 num_mcmc_iterations=4000,
											 num_burn=2000,
											 num_thin=5,
											 mcmc_seed=%eval(9999 + &brr_rep.),
											 outname=model_brr&brr_rep.);
	%end;
%mend mcmc_brr;

%mcmc_brr(num_brr=&num_brr.);

/*
The %nci_multivar_distrib() macro will be used to simulate a dataset of usual intakes that represents the distribution of true usual intakes. 
This procedure does not calculate or predict true usual intakes for each subject.

A population-based dataset must be constructed as with the daily nutrient vignette (daily_nutrient_analysis.sas).
*/

proc sort data=model_mcmc_in; by SEQN; run;
data distrib_pop;
	set model_mcmc_in;
	by SEQN;
	
	**get first instance of each subject;
	if first.SEQN then do;
	
		**Set Day 2 to zero to factor out the effect of Day 2 recalls;
		Day2 = 0;
	
		**create repeats of each subject for weekday and weekend consumption;
		Weekend = 0;
		Weekend_Weight = 4;
		output;
	
		Weekend = 1;
		Weekend_Weight = 3;
		output;
	end;
run;
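
/*
The structure of the population dataset can be verified with a cross-tabulation.
In this sketch, each subject should contribute one row with Weekend = 0 and Weekend_Weight = 4 and one row with Weekend = 1 and Weekend_Weight = 3, so the two cells should have equal counts and the weights sum to the 7 days of the week.
*/

proc freq data=distrib_pop;
	tables Weekend*Weekend_Weight / norow nocol nopercent;
run;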

/*
The %nci_multivar_distrib() macro can now be used to simulate 200 usual intakes for each subject for each BRR replicate.
The resulting distribution dataset contains simulated usual intakes for both whole grains and energy.

The %nci_multivar_summary() macro is used to calculate means and quantiles. 
Multiple variables can be summarized at the same time by specifying them as a space-delimited list in the variables parameter.
*/

%macro summary_brr(num_brr=);

	%do brr_rep = 0 %to &num_brr.;
	
		%put Starting Iteration &brr_rep.;
		
		**create dataset with 200 simulated usual intakes for each subject;
		%nci_multivar_distrib(multivar_mcmc_model=model_brr&brr_rep.,
													distrib_population=distrib_pop,
													id=SEQN,
													weight=RepWt_&brr_rep.,
													nuisance_weight=Weekend_Weight,
													num_simulated_u=200,
													distrib_seed=%eval(99999 + &brr_rep.),
													outname=model_distrib_out);
													
		**compute means and quantiles for simulated whole grain and energy intakes;
		%nci_multivar_summary(input_data=model_distrib_out,
													variables=usl_Whole_Grain usl_Energy,
													weight=RepWt_&brr_rep.,
													do_means=Y,
													do_quantiles=Y,
													quantiles=5 25 50 75 95,
													outname=summary_brr&brr_rep.);
	%end;
	
	**extract point estimate and BRR replicates;
	data summary_brr_data;
		set summary_brr0;
		%do brr_rep = 1 %to &num_brr.;
			set summary_brr&brr_rep. (keep = value rename=(value = brr&brr_rep.));
		%end;
	run;
%mend summary_brr;

%summary_brr(num_brr=&num_brr.);

/*
The BRR standard errors and confidence intervals can now be calculated. 
To properly calculate the BRR standard error, it is important to divide by the square of the Fay factor that was used to generate the BRR replicate weights. 
For this dataset, the replicate weights were generated with a Fay factor of 0.7, but this can vary for different datasets. 
For ordinary bootstrapping, the same formula can be used with the Fay factor set to 1.
*/
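
/*
The standard error formula can be illustrated on its own with a small worked example.
The point estimate (10), the four replicate estimates, and the dataset name brr_se_example are made-up values for illustration only and are not results from this analysis.
With these numbers, the sum of squared differences is 1.48 and the standard error is sqrt(1.48/(4*0.7**2)), or about 0.87.
*/

data brr_se_example (drop = i);

	**hypothetical replicate estimates;
	array example_reps{4} _temporary_ (10.5 9.5 10.7 9.3);
	
	**hypothetical point estimate and Fay factor;
	value = 10;
	fay = 0.7;
	
	sum_sq_diff = 0;
	do i = 1 to 4;
	
		sum_sq_diff = sum_sq_diff + (example_reps{i} - value)**2;
	end;
	
	**divide by the number of replicates times the squared Fay factor;
	std_error = sqrt(sum_sq_diff/(4*fay**2));
	
	put std_error=;
run;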

**calculate degrees of freedom;
proc sort data=input_dataset; by SDMVSTRA; run;

data _NULL_;
	set input_dataset end=last_obs;
	by SDMVSTRA;
	
	retain num_strata 0;
	
	if first.SDMVSTRA then num_strata = num_strata + 1;
	
	if last_obs = 1 then call symputx("df", num_strata);
run;

**create summary report;
data summary_report (keep = population variable statistic value std_error confidence_lower confidence_upper);
	set summary_brr_data;
	
	array reps{&num_brr.} brr1-brr&num_brr.;
	
	**calculate BRR standard error;
	sum_sq_diff = 0;
	do i = 1 to &num_brr.;
	
		sum_sq_diff = sum_sq_diff + (reps{i} - value)**2;
	end;
	
	std_error = sqrt(sum_sq_diff/(&num_brr.*&fay_factor.**2));
	
	**95% confidence intervals;
	confidence_lower = value + tinv(0.025, &df.)*std_error;
	confidence_upper = value + tinv(0.975, &df.)*std_error;
run;

proc print data=summary_report; run;

