/*
Title: Working with Complex Surveys
Author: NCI/Information Management Services
Date: 8/11/2025
*/

/*
This example demonstrates calculating standard errors using bootstrapping and balanced repeated replication (BRR).

NOTE: THIS EXAMPLE MAY HAVE LONG COMPUTATION TIMES; IT IS RECOMMENDED TO RUN THIS PRIOR TO THE BREAKOUT SESSION.
*/

libname indata "./ncimultivar/data";

%include "./ncimultivar/macros/ncimultivar.sas";

/*
Resampling methods can be used to estimate variances and standard errors for summary statistics of distributions produced by the NCI method.
These standard errors can be used to calculate confidence intervals, which can be reported alongside the estimated statistics.
Bootstrap and Balanced Repeated Replication (BRR) are two common resampling methods used to generate standard errors for means, quantiles, and other statistics.
The most appropriate method depends on the data and how it is structured.


The bootstrap example will estimate the mean and quantiles of whole grain (G_WHOLE) intake from synthetic data.
The synthetic data has two design strata (STRATA), defined by sex.

The covariates will be age (AGE), sex (SEX), and BMI (LOG_BMI).
The BMI covariate is log-transformed so that it has a more normal distribution.
No nuisance covariates will be adjusted for.
Each subject has an equal weight in the dataset.

Subjects with missing values are removed.
*/

data boot_dataset;
	set indata.sim_g_whole;
	
	**remove subjects missing any covariates or whole grain intake;
	if not missing(AGE) and
		 not missing(SEX) and
		 not missing(LOG_BMI) and
		 not missing(G_WHOLE);
		 
	**binary indicator for Day 2;
	Day2 = (DAY = 2);
	
	**rename whole grain variable for readability;
	Whole_Grain = G_WHOLE;
run;

/*
The BRR example will estimate the mean and quantiles of whole grain (G_WHOLE) intake from 2005-2010 NHANES data.
A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.
The NHANES dataset strata are masked pseudo-strata that hide the confidential information used to make the true design strata.
These pseudo-strata are designed to produce the same variance estimates as the true strata.

The covariates will be age (RIDAGEYR), sex (RIAGENDR), and smoking status (SMK_REC).
The nuisance covariates to adjust for are the recall being on day 2 (Day2) or on a weekend (Weekend).
Each subject's survey weight is given by WTDRD1.

Subjects with missing values are removed, and categorical variables are transformed into binary indicators.
*/

**subset data;
data brr_dataset;
	set indata.nhcvd;
	if SDMVSTRA in (48 54 60 66 72 78);
	
	**binary indicator for Day 2;
	Day2 = (DAY = 2);
run;

data brr_dataset;
	set brr_dataset;
	
	**remove subjects missing any covariates or whole grain intake;
	if not missing(SMK_REC)  and
		 not missing(RIDAGEYR) and
		 not missing(RIAGENDR) and
		 not missing(Day2)     and
		 not missing(Weekend)  and
		 not missing(G_WHOLE);
		 
	**break down smoking status into binary indicators;
	Current_Smoker = (SMK_REC = 1);
	Former_Smoker  = (SMK_REC = 2);
	Never_Smoker   = (SMK_REC = 3);
	
	**rename whole grain variable for readability;
	Whole_Grain = G_WHOLE;
run;

/*
To calculate standard errors using bootstrapping, sets of bootstrap replicate weights must be generated for the dataset.
Each bootstrap weight set is equivalent to resampling the Primary Sampling Units (PSUs) in the original dataset with replacement.
Often, as in this example, each subject is treated as its own PSU when bootstrapping.

Bootstrap weights are generated using the %boot_weights() macro.
The %boot_weights() macro takes in the strata (STRATA) and PSU (ID).
Note that the subject identifier is used as the PSU so that each subject is treated as its own PSU.
The subjects are equally weighted, so no weight variable is specified and the %boot_weights() macro treats each subject as having a weight of 1.

The output of the macro will contain the base weight variable (RepWt_0) and a RepWt_ variable for each bootstrap weight set.
The output weights are integerized by rounding them to the nearest integer and distributing the remainders to minimize rounding error.

Many bootstrap replicates must be performed in order to obtain accurate standard errors.
In practice, 200-500 replicates are used, depending on the dataset.
Some data sources will have specific recommendations for the number of replicates to use.
For this example, 50 replicates will be performed to save on computation time.
*/

%boot_weights(input_data=boot_dataset,
							id=ID,
							strata=STRATA,
							psu=ID,
							num_reps=50,
							boot_seed=99,
							outname=boot_dataset);
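
/*
The idea behind a single set of bootstrap replicate weights can be sketched on toy data.
This is only an illustration of resampling with replacement, not the implementation of %boot_weights(), which may also rescale and integerize the weights.
The dataset _toy_boot_wts and the variable toy_RepWt_1 are hypothetical names used only in this sketch.
Within each of 2 toy strata, 4 PSUs are drawn with replacement, and the toy replicate weight of a PSU is the number of times it was drawn.
The toy weights in each stratum therefore sum to 4, the number of PSUs in the stratum.
*/

data _toy_boot_wts (keep = STRATA PSU toy_RepWt_1);
	array hits{4} _temporary_;
	
	call streaminit(2025);
	
	do STRATA = 1 to 2;
	
		**reset the hit counts for this stratum;
		do PSU = 1 to 4;
			hits{PSU} = 0;
		end;
		
		**draw 4 PSUs with replacement within the stratum;
		do draw = 1 to 4;
			pick = ceil(4*rand("uniform"));
			hits{pick} = hits{pick} + 1;
		end;
		
		**the toy replicate weight is the number of times each PSU was drawn;
		do PSU = 1 to 4;
			toy_RepWt_1 = hits{PSU};
			output;
		end;
	end;
run;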
							
/*
As with bootstrap, sets of BRR replicate weights must be generated.
BRR is similar to bootstrap, but it is performed using a structured set of weights generated using a Hadamard matrix instead of random sampling.
For standard BRR, half of the sample has double weight and the other half has zero weight for each set of weights. 

The number of weight sets is determined by the dimension of the Hadamard matrix, which must be 1, 2, or a multiple of 4 and must be greater than the number of strata.
Since this example uses a subset of the NHANES data with 6 strata, a Hadamard matrix of dimension 8 is used, and 8 BRR replicates are required.
If the full dataset were used, there would be 46 strata, so 48 BRR replicates would be needed.
Since ordinary bootstrap usually requires 200 or more replicates, BRR is often far more efficient for datasets where it can be applied.

Post-stratification adjustment will also be performed so that the sum of the weights in each post-stratification cell is the same as for the base weights. 

To use BRR, each stratum in the dataset must have exactly two primary sampling units (PSUs).
Most NHANES data, including the dataset used in this example, meet this criterion.
When NHANES cycles, or other data, contain unusual strata that do not have exactly two PSUs, these strata can be corrected using the %fix_nhanes_strata() utility.
The macro call is demonstrated below even though all strata in the dataset used in this example have 2 PSUs.
*/

%fix_nhanes_strata(input_data=brr_dataset,
									 outname=brr_dataset);
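
/*
The balanced structure that a Hadamard matrix provides can be illustrated with a toy matrix of dimension 8.
This sketch uses the Sylvester construction, which only produces dimensions that are powers of 2, and it is not necessarily the matrix that %brr_weights() uses internally.
The dataset _toy_hadamard and its columns h1-h8 are hypothetical names used only here.
Every entry is +1 or -1. Under one common convention, rows index replicates, columns index strata, and the sign decides which of the two PSUs in a stratum is upweighted.
The first column is all +1 and every other column sums to 0, so each of those columns upweights each PSU in exactly half of the replicates.
*/

data _toy_hadamard (keep = row h1-h8);
	array h{8} h1-h8;
	
	do row = 1 to 8;
		do col = 1 to 8;
		
			**bits that are set in both zero-based indices;
			shared = band(row - 1, col - 1);
			
			**parity of the number of shared bits (3 bits are enough for dimension 8);
			parity = mod(mod(shared, 2) + mod(floor(shared/2), 2) + mod(floor(shared/4), 2), 2);
			
			h{col} = 1 - 2*parity;
		end;
		output;
	end;
run;

**column sums are 8 for h1 and 0 for h2 through h8;
proc means data=_toy_hadamard sum;
	var h1-h8;
run;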
									 
/*
BRR replicate weight sets can be generated using the %brr_weights() macro. 
The %brr_weights() macro takes in the strata (SDMVSTRA), PSU (SDMVPSU), base weight (WTDRD1), and post-stratification cell (PSCELL) variables. 
Similar to the bootstrap weights, the output dataset has an integerized base weight (RepWt_0) and a RepWt_ variable for each BRR weight set.

In this tutorial, the Fay method of BRR will be used. 
This variation of BRR adjusts the weights using a specified value called the Fay factor (f), which must be between 0 and 1.
The weights in half of the sample are multiplied by 1 + f and the weights in the other half are multiplied by 1 - f for every replicate weight set.
Using a Fay factor below 1 ensures that every observation will be used in every weight set.
A Fay factor of 1 is the same as standard BRR.

For this example, a Fay factor of 0.7 will be used. 
This multiplies the weights in half of the sample by 1.7 (1 + 0.7) and multiplies the weights in the other half by 0.3 (1 - 0.7) for every replicate weight set.
Record the Fay factor used to generate the weights, because it is needed later when calculating the BRR variance and standard error.
*/

%let fay_factor = 0.7;

%brr_weights(input_data=brr_dataset,
						 id=SEQN,
						 strata=SDMVSTRA,
						 psu=SDMVPSU,
						 cell=PSCELL,
						 weight=WTDRD1,
						 fay_factor=&fay_factor.,
						 outname=brr_dataset);
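
/*
The effect of the Fay factor on one stratum can be sketched with toy numbers.
This is an illustration only: it ignores the post-stratification adjustment and integerization that %brr_weights() also applies.
The dataset _toy_fay and its variables are hypothetical names used only here.
A toy stratum has two PSUs, each with a base weight of 100, and PSU 1 is the upweighted PSU in this toy replicate.
With a Fay factor of 0.7 the toy weights are approximately 170 and 30, so both PSUs still contribute.
With a Fay factor of 1 (standard BRR) the toy weights are 200 and 0, so PSU 2 is dropped from the replicate.
*/

data _toy_fay (keep = f PSU multiplier toy_weight);
	base_weight = 100;
	
	do f = 0.7, 1;
		do PSU = 1 to 2;
		
			**PSU 1 is upweighted and PSU 2 is downweighted in this toy replicate;
			if PSU = 1 then multiplier = 1 + f;
			else multiplier = 1 - f;
			
			toy_weight = base_weight*multiplier;
			output;
		end;
	end;
run;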
						 
/*
The variables in both datasets are transformed and standardized for use in the MCMC algorithm.
For the bootstrap dataset, no weight variable is needed since all subjects have a weight of 1.
For the BRR dataset, the weight is given as RepWt_0 which corresponds to the integerized base weight.
*/

**Check for extreme values;
%boxcox_survey(input_data=boot_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Whole_Grain,
							 is_episodic=Y,
							 do_winsorization=Y,
							 id=ID,
							 repeat_obs=DAY);
							 
%boxcox_survey(input_data=brr_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Whole_Grain,
							 is_episodic=Y,
							 weight=RepWt_0,
							 do_winsorization=Y,
							 id=SEQN,
							 repeat_obs=DAY);
							 
**Find best Box-Cox survey lambdas;
%boxcox_survey(input_data=boot_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Whole_Grain,
							 covariates=AGE SEX LOG_BMI);
							 
data boxcox_boot;
	set bc_Whole_Grain;
run;

%boxcox_survey(input_data=brr_dataset,
							 row_subset=%quote(Day2 = 0),
							 variable=Whole_Grain,
							 weight=RepWt_0,
							 covariates=Current_Smoker Former_Smoker RIDAGEYR RIAGENDR Weekend);
							 
data boxcox_brr;
	set bc_Whole_Grain;
run;

**Calculate minimum consumption amounts;
%calculate_minimum_amount(input_data=boot_dataset,
													row_subset=%quote(Day2 = 0),
													episodic_variables=Whole_Grain);
													
data minimum_amount_boot;
	set minimum_amount_data;
run;

%calculate_minimum_amount(input_data=brr_dataset,
													row_subset=%quote(Day2 = 0),
													episodic_variables=Whole_Grain);
													
data minimum_amount_brr;
	set minimum_amount_data;
run;

**Transform and standardize variables;
%nci_multivar_preprocessor(input_data=boot_dataset,
													 episodic_variables=Whole_Grain,
													 continuous_covariates=AGE LOG_BMI,
													 boxcox_lambda_data=boxcox_boot,
													 minimum_amount_data=minimum_amount_boot,
													 outname=boot);
													 
%nci_multivar_preprocessor(input_data=brr_dataset,
													 episodic_variables=Whole_Grain,
													 continuous_covariates=RIDAGEYR,
													 boxcox_lambda_data=boxcox_brr,
													 minimum_amount_data=minimum_amount_brr,
													 outname=brr);
													 
/*
An MCMC model must be fit using the base weights and each set of bootstrap replicate weights.
Occasionally, a model may not converge with one or more replicate weight sets.
In that case, the replicate that did not converge should be omitted from the variance calculations.
It is good practice to check convergence of replicates using methods such as trace plots or the Gelman-Rubin test.

The random seed should be changed each time the MCMC is run within the analysis to avoid biasing the results.
*/

%let num_boot = 50;

%macro mcmc_boot(num_boot=);

	%do boot_rep = 0 %to &num_boot.;
	
		%nci_multivar_mcmc(pre_mcmc_data=boot,
											 id=ID,
											 repeat_obs=DAY,
											 weight=RepWt_&boot_rep.,
											 episodic_variables=Whole_Grain,
											 default_covariates=std_AGE SEX std_LOG_BMI,
											 num_mcmc_iterations=4000,
											 num_burn=2000,
											 num_thin=2,
											 mcmc_seed=%eval(999 + &boot_rep.),
											 outname=boot&boot_rep.);
	%end;
%mend mcmc_boot;

%mcmc_boot(num_boot=&num_boot.);

/*
The MCMC replicates for BRR are implemented the same way as for bootstrap.

For BRR to work properly, nearly all BRR replicates should be used in the variance calculation.
As with bootstrap, it is good practice to check to see if all replicates have converged.
Replicates that did not converge should be removed from the variance calculation.
*/

%let num_brr = 8;

%macro mcmc_brr(num_brr=);

	%do brr_rep = 0 %to &num_brr.;
	
		%nci_multivar_mcmc(pre_mcmc_data=brr,
											 id=SEQN,
											 repeat_obs=DAY,
											 weight=RepWt_&brr_rep.,
											 episodic_variables=Whole_Grain,
											 default_covariates=Current_Smoker Former_Smoker std_RIDAGEYR RIAGENDR Day2 Weekend,
											 num_mcmc_iterations=4000,
											 num_burn=2000,
											 num_thin=2,
											 mcmc_seed=%eval(9999 + &brr_rep.),
											 outname=brr&brr_rep.);
	%end;
%mend mcmc_brr;

%mcmc_brr(num_brr=&num_brr.);

/*
Population bases are created for each dataset to simulate usual intakes.
For the bootstrap dataset, there are no nuisance covariates so the population base is simply the first instance of each subject in the data.
*/

proc sort data=boot_mcmc_in; by ID; run;

data distrib_pop_boot;
	set boot_mcmc_in;
	by ID;
	if first.ID;
run;

/*
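The population base for BRR must be adjusted for the Day2 and Weekend nuisance covariates.
Day2 is set to 0, and each subject contributes a weekday row (Weekend = 0) with weight 4 and a weekend row (Weekend = 1) with weight 3.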
*/

proc sort data=brr_mcmc_in; by SEQN; run;

data distrib_pop_brr;
	set brr_mcmc_in;
	by SEQN;
	
	if first.SEQN then do;
	
		Day2 = 0;
		
		Weekend = 0;
		Weekend_Weight = 4;
		output;
		
		Weekend = 1;
		Weekend_Weight = 3;
		output;
	end;
run;
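
/*
The role of the 4 and 3 nuisance weights can be illustrated with toy numbers.
This sketch assumes that the nuisance weights act as averaging weights across the rows of a subject, which should be confirmed against the %nci_multivar_distrib() documentation.
The dataset _toy_weekend and the variable toy_intake are hypothetical names used only here.
A toy subject has an intake of 1.0 on the weekday row and 1.7 on the weekend row.
The weighted mean is (4*1.0 + 3*1.7)/7 = 1.3.
*/

data _toy_weekend;
	input Weekend Weekend_Weight toy_intake;
	datalines;
0 4 1.0
1 3 1.7
;
run;

**weighted mean of the weekday and weekend rows;
proc means data=_toy_weekend mean;
	var toy_intake;
	weight Weekend_Weight;
run;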

/*
The %nci_multivar_distrib() macro is used to simulate usual intakes from each MCMC bootstrap replicate.
Means and quantiles are then calculated from each simulated distribution using %nci_multivar_summary().

Since the simulated intake dataset is very large, it is more memory-efficient to calculate the summary statistics in the same loop as simulating usual intakes when doing replicates.
This way, the distribution dataset can be overwritten each time instead of saving a large dataset for every replicate.

As with the MCMC, the seed should be changed for each usual intake simulation in the analysis to avoid biasing the results.
*/

%macro summary_boot(num_boot=);

	%do boot_rep = 0 %to &num_boot.;
	
		**Simulate usual intakes;
		%nci_multivar_distrib(multivar_mcmc_model=boot&boot_rep.,
													distrib_population=distrib_pop_boot,
													id=ID,
													weight=RepWt_&boot_rep.,
													num_simulated_u=200,
													distrib_seed=%eval(99999 + &boot_rep.),
													outname=distrib_out);
													
		**Calculate means and quantiles;
		%nci_multivar_summary(input_data=distrib_out,
													variables=usl_Whole_Grain,
													weight=RepWt_&boot_rep.,
													do_means=Y,
													do_quantiles=Y,
													quantiles=5 25 50 75 95,
													do_proportions=N,
													outname=summary_boot&boot_rep.);
	%end;
	
	**extract point estimate and bootstrap replicates;
	data summary_boot_data;
		set summary_boot0;
		%do boot_rep = 1 %to &num_boot.;
			set summary_boot&boot_rep. (keep = value rename=(value = boot&boot_rep.));
		%end;
	run;
%mend summary_boot;

%summary_boot(num_boot=&num_boot.);

/*
The process of simulating usual intakes and calculating summary statistics is the same for BRR as it is for bootstrap.
*/

%macro summary_brr(num_brr=);

	%do brr_rep = 0 %to &num_brr.;
	
		**Simulate usual intakes;
		%nci_multivar_distrib(multivar_mcmc_model=brr&brr_rep.,
													distrib_population=distrib_pop_brr,
													id=SEQN,
													weight=RepWt_&brr_rep.,
													nuisance_weight=Weekend_Weight,
													num_simulated_u=200,
													distrib_seed=%eval(999999 + &brr_rep.),
													outname=distrib_out);
													
		**Calculate means and quantiles;
		%nci_multivar_summary(input_data=distrib_out,
													variables=usl_Whole_Grain,
													weight=RepWt_&brr_rep.,
													do_means=Y,
													do_quantiles=Y,
													quantiles=5 25 50 75 95,
													do_proportions=N,
													outname=summary_brr&brr_rep.);
	%end;
	
	**extract point estimate and BRR replicates;
	data summary_brr_data;
		set summary_brr0;
		%do brr_rep = 1 %to &num_brr.;
			set summary_brr&brr_rep. (keep = value rename=(value = brr&brr_rep.));
		%end;
	run;
%mend summary_brr;

%summary_brr(num_brr=&num_brr.);

/*
With a point estimate and bootstrap replicates for every summary statistic, standard errors and confidence intervals can now be calculated. 
With the data set up as one column per replicate, this can be done in a single DATA step using an array over the replicate columns.

When calculating confidence intervals, it is important to use the correct number of degrees of freedom. 
This is equal to the total number of PSUs across all strata minus the number of strata.
Since each subject was used as its own PSU, the number of PSUs is the number of subjects.
*/

**calculate degrees of freedom (PSUs - Strata);
proc sort data=boot_dataset; by STRATA ID; run;

data _NULL_;
	set boot_dataset end=last;
	by STRATA ID;
	
	retain num_strata num_psu 0;
	
	if first.STRATA then num_strata = num_strata + 1;
	if first.ID then num_psu = num_psu + 1;
	
	if last = 1 then call symputx("df_boot", num_psu - num_strata);
run;



**create summary report;
data summary_report_boot (keep = population variable statistic value std_error confidence_lower confidence_upper);
	set summary_boot_data;
	
	array reps{&num_boot.} boot1-boot&num_boot.;
	
	**Calculate bootstrap standard error;
	sum_sq_diff = 0;
	
	do i = 1 to &num_boot.;
	
		sum_sq_diff = sum_sq_diff + (reps{i} - value)**2;
	end;
	
	std_error = sqrt(sum_sq_diff/&num_boot.);
	
	**95% confidence intervals;
	confidence_lower = value + tinv(0.025, &df_boot.)*std_error;
	confidence_upper = value + tinv(0.975, &df_boot.)*std_error;
run;

proc print data=summary_report_boot;

	title "Whole Grain Distribution with Bootstrap Standard Errors";
run;
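
/*
The bootstrap standard error and confidence interval formulas used in the report can be checked by hand on toy numbers.
The dataset _toy_boot_se, the replicate values, and the 6 degrees of freedom are made up for illustration and are not results from this analysis.
The toy point estimate is 10 and the four toy replicates are 9, 11, 9, and 11.
The sum of squared differences from the point estimate is 4, so the standard error is sqrt(4/4) = 1.
With 6 degrees of freedom the t quantile is about 2.447, giving a 95% confidence interval of roughly 7.55 to 12.45.
*/

data _toy_boot_se;
	value = 10;
	array reps{4} toy_boot1-toy_boot4 (9 11 9 11);
	
	**squared differences are taken around the point estimate;
	sum_sq_diff = 0;
	
	do i = 1 to 4;
	
		sum_sq_diff = sum_sq_diff + (reps{i} - value)**2;
	end;
	
	std_error = sqrt(sum_sq_diff/4);
	
	**95% confidence interval with 6 toy degrees of freedom;
	confidence_lower = value + tinv(0.025, 6)*std_error;
	confidence_upper = value + tinv(0.975, 6)*std_error;
	
	put std_error= confidence_lower= confidence_upper=;
run;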

/*
Calculating standard errors for BRR is done similarly to bootstrap.
However, if using the Fay method, the variance must be divided by the square of the Fay factor used to generate the weights.
The BRR replicate weights in this dataset used a Fay factor of 0.7, so this must be accounted for in calculating the variance. 
Other datasets may use different Fay factors when generating BRR replicate weights so it is important to verify the Fay factor before calculating variances. 

As with bootstrap, the number of degrees of freedom is the number of PSUs minus the number of strata.
Since BRR requires exactly two PSUs per stratum, the number of degrees of freedom is simply the number of strata.
*/

**calculate degrees of freedom;
proc sort data=brr_mcmc_in; by SDMVSTRA; run;

data _NULL_;
	set brr_mcmc_in end=last;
	by SDMVSTRA;
	
	retain num_strata 0;
	
	if first.SDMVSTRA then num_strata = num_strata + 1;
	
	if last = 1 then call symputx("df_brr", num_strata);
run;

**create summary report;
data summary_report_brr (keep = population variable statistic value std_error confidence_lower confidence_upper);
	set summary_brr_data;
	
	array reps{&num_brr.} brr1-brr&num_brr.;
	
	**Calculate BRR standard error;
	sum_sq_diff = 0;
	
	do i = 1 to &num_brr.;
	
		sum_sq_diff = sum_sq_diff + (reps{i} - value)**2;
	end;
	
	std_error = sqrt(sum_sq_diff/(&num_brr.*&fay_factor.**2));
	
	**95% confidence intervals;
	confidence_lower = value + tinv(0.025, &df_brr.)*std_error;
	confidence_upper = value + tinv(0.975, &df_brr.)*std_error;
run;

proc print data=summary_report_brr;

	title "Whole Grain Distribution with BRR Standard Errors";
run;
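
/*
The Fay correction in the BRR standard error can be checked by hand on toy numbers.
The dataset _toy_brr_se and the replicate values are made up for illustration and are not results from this analysis.
The toy point estimate is 2 and the eight toy replicates alternate between 2.07 and 1.93, so every squared difference is 0.0049 and the sum is 0.0392.
Dividing by 8 replicates and by the squared Fay factor (0.7**2 = 0.49) gives a variance of 0.01 and a standard error of about 0.1.
Omitting the Fay correction would give about 0.07, which understates the standard error by a factor equal to the Fay factor.
*/

data _toy_brr_se;
	value = 2;
	toy_fay = 0.7;
	array reps{8} toy_brr1-toy_brr8 (2.07 1.93 2.07 1.93 2.07 1.93 2.07 1.93);
	
	sum_sq_diff = 0;
	
	do i = 1 to 8;
	
		sum_sq_diff = sum_sq_diff + (reps{i} - value)**2;
	end;
	
	**standard error with the Fay correction;
	std_error_fay = sqrt(sum_sq_diff/(8*toy_fay**2));
	
	**standard error without the correction, shown only for comparison;
	std_error_uncorrected = sqrt(sum_sq_diff/8);
	
	put std_error_fay= std_error_uncorrected=;
run;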
							 
