NCI Method Daily Nutrient

Amer Moosa

12/29/2023

The ncimultivar package contains utilities for performing analysis with the multivariate NCI method.

The different functions offer a multitude of options, so it is recommended to become familiar with them before beginning an analysis. This example will cover the basic workflow of an NCI method analysis with a daily consumed nutrient:

  1. Prepare the input dataset variables for analysis and Winsorize extreme values using boxcox_survey().
  2. Find a Box-Cox transformation parameter for the nutrient using boxcox_survey().
  3. Transform and standardize the variable to be analyzed using nci_multivar_preprocessor().
  4. Fit a measurement error model using nci_multivar_mcmc().
  5. Create a dataset of simulated usual intakes with nci_multivar_distrib().
  6. Compute summary statistics such as means, quantiles, and proportions using nci_multivar_summary().

Standard errors and confidence intervals for the calculated statistics will be generated by resampling with balanced repeated replication (BRR).

library(ncimultivar)

The dataset used for this analysis is derived from the 2005-2010 NHANES data. A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run quickly. The NHANES strata are masked pseudo-strata that conceal the confidential information used to construct the true design strata. These pseudo-strata are designed to produce the same variance estimates as the true strata.

Individual subjects are identified by SEQN. Each subject has up to two dietary recalls, which are identified by DAY.

The nutrient being analyzed is sodium (TSODI).

The covariates being examined are smoking status (SMK_REC), age (RIDAGEYR), and sex (RIAGENDR). Two nuisance covariates are also accounted for: whether the recall was on a weekend (Weekend) and whether the recall was on day 2 (Day2). They are considered nuisance covariates because they are not necessarily of interest to the study but must be accounted for to obtain an accurate distribution.

The WTDRD1 variable is the survey weight for each observation.

The data is structured in a “tall” format where each row represents one observation of a subject.

In some datasets, not all subjects will have the same number of observations. This is permitted as long as some subjects have more than one observation. A good rule of thumb is to have at least 50 subjects with more than one non-zero consumption day.

Looking at the first 10 observations shows the structure of the data.

#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]

#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)

#First 10 observations
head(input.dataset[,c("SEQN", "DAY", "TSODI", "SMK_REC", "RIDAGEYR", "RIAGENDR", "Day2", "Weekend", "WTDRD1")], 10)
#>     SEQN DAY TSODI SMK_REC RIDAGEYR RIAGENDR  Day2 Weekend    WTDRD1
#> 7  31155   1  5720       3       38        0 FALSE       1 22335.658
#> 8  31155   2  4552       3       38        0  TRUE       0 22335.658
#> 16 31186   1  3453       1       46        1 FALSE       1  2969.506
#> 17 31186   2  4350       1       46        1  TRUE       0  2969.506
#> 18 31187   1  3260       3       22        1 FALSE       1 13055.354
#> 19 31187   2  4026       3       22        1  TRUE       0 13055.354
#> 34 31214   1   282       2       46        1 FALSE       1 21388.889
#> 35 31214   2  3801       2       46        1  TRUE       0 21388.889
#> 42 31228   1  2654       2       34        0 FALSE       1  5018.875
#> 43 31228   2   441       2       34        0  TRUE       1  5018.875
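
As a quick check of the rule of thumb above, the number of subjects with more than one recall can be counted directly. This is a simple sketch; for a daily consumed nutrient such as sodium, every recorded recall is assumed to be a non-zero consumption day.

#count subjects in the subset with more than one recall
recalls.per.subject <- table(input.dataset$SEQN)
sum(recalls.per.subject > 1)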

The data should now be prepared for analysis.

All subjects used in the analysis must be complete cases (i.e., each subject must have all of the covariates, nutrients, and foods being analyzed). For this example, subjects missing one or more of the analyzed variables will be removed from the dataset.

Categorical covariates (such as smoking status) need to be broken down into binary indicator (“dummy”) variables. Since there are three categories, three binary indicators will be created; however, the reference group will not be included as a covariate in the model. This example uses never-smokers (SMK_REC = 3) as the reference group.

#remove subjects that are missing any covariates or variables
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) | 
                      is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$TSODI)

input.dataset <- input.dataset[!missing.covariates & !missing.variables,]

#break down smoking status into binary indicators
input.dataset$Current.Smoker <- as.numeric(input.dataset$SMK_REC == 1)
input.dataset$Former.Smoker <- as.numeric(input.dataset$SMK_REC == 2)
input.dataset$Never.Smoker <- as.numeric(input.dataset$SMK_REC == 3)

#rename sodium variable for readability
input.dataset$Sodium <- input.dataset$TSODI

Since standard errors will be calculated using balanced repeated replication (BRR), a set of BRR weights must be generated from the base survey weights. BRR is similar to the bootstrap, but it uses a structured collection of weight sets generated from a Hadamard matrix. In each replicate of standard BRR, half of the sample receives double weight and the other half receives zero weight.

The number of weight sets is determined by the dimension of the Hadamard matrix, which must be 1, 2, or a multiple of 4, and must exceed the number of strata. Since this example uses a subset of the NHANES data with 6 strata, a Hadamard matrix of dimension 8 is used, and 8 BRR replicates are required. If the full dataset were used, its 46 strata would require 48 BRR replicates. Since ordinary bootstrap usually requires 200 or more replicates, BRR is often far more efficient for datasets where it can be applied.

To use BRR, each stratum in the dataset must have exactly two primary sampling units (PSUs). Most NHANES data, including the dataset used in this example, meets this criterion. Some NHANES cycles contain unusual strata that do not have exactly two PSUs; these strata can be corrected with the fix_nhanes_strata() utility.

BRR replicate weight sets can be generated using the brr_weights() function. Post-stratification adjustment will also be performed so that the sum of the weights in each post-stratification cell is the same as for the base weights. The brr_weights() function takes in the strata (SDMVSTRA), PSU (SDMVPSU), base weight (WTDRD1), and post-stratification cell (PSCELL) variables.

The output of the function will contain the base weight variable (RepWt_0) and a RepWt_ variable for each BRR weight set. The output weights are integerized by rounding them to the nearest integer and distributing the remainders to minimize rounding error.

In this tutorial, the Fay method of BRR will be used. This variation of BRR adjusts the weights using a specified value called the Fay factor (f), which must be between 0 and 1. The weights in half of the sample are multiplied by 1 + f and the weights in the other half are multiplied by 1 - f for every replicate weight set. Using a Fay factor below 1 ensures that every observation will be used in every weight set. A Fay factor of 1 is the same as standard BRR.

For this example, a Fay factor of 0.7 will be used. This multiplies the weights in half of the sample by 1.7 (1 + 0.7) and the weights in the other half by 0.3 (1 - 0.7) for every replicate weight set. It is important to record the Fay factor used to generate the weights, since it is needed later when calculating the BRR variance and standard error.
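
As a conceptual illustration of where these multipliers come from (a sketch only, not the brr_weights() implementation), the replicate multipliers for a single stratum can be derived from a Hadamard matrix:

#sketch: build an 8x8 Hadamard matrix via the Sylvester (Kronecker) construction
H2 <- matrix(c(1, 1, 1, -1), nrow=2)
H8 <- H2 %x% H2 %x% H2

#for a stratum assigned to column h, replicate r multiplies the first PSU's
#weight by 1 + f*H8[r, h] and the second PSU's weight by 1 - f*H8[r, h]
f <- 0.7
h <- 2   #the all-ones first column is typically skipped
cbind(first.psu=1 + f*H8[, h], second.psu=1 - f*H8[, h])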

fay.factor <- 0.7

input.dataset <- brr_weights(input.data=input.dataset,
                             id="SEQN",
                             strata="SDMVSTRA",
                             psu="SDMVPSU",
                             cell="PSCELL",
                             weight="WTDRD1",
                             fay.factor=fay.factor)

The next step is to Winsorize extreme values of the sodium variable. The boxcox_survey() function has a helpful utility to suggest cutoffs for Winsorization.

By default, extreme values are defined as those more than three times the interquartile range (IQR) below the 25th percentile or above the 75th percentile on the transformed scale. This can be changed with the iqr.multiple parameter, which sets how many IQRs beyond the 25th and 75th percentiles a value may fall before being considered an outlier. Lower IQR multiples lead to stricter Winsorization of the data since more values are identified as outliers.

For this example analysis, a cutoff of twice the IQR is used to illustrate the structure of the Winsorization report. For an actual analysis, the ideal Winsorization cutoff is the least strict value that still removes problematic outliers. This will have to be found through experimentation, though the default value of three times the IQR will usually suffice.
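
The cutoff rule itself is simple to sketch. The snippet below is illustrative only: a log transformation stands in for the Box-Cox transformation, and boxcox_survey() computes its quantiles with the survey weights, so the exact values will differ.

#illustrative outlier rule: values outside [Q25 - m*IQR, Q75 + m*IQR] on the transformed scale
transformed <- log(input.dataset$Sodium[input.dataset$DAY == 1])
q <- unname(quantile(transformed, c(0.25, 0.75), na.rm=TRUE))
m <- 2
c(lower=q[1] - m*(q[2] - q[1]), upper=q[2] + m*(q[2] - q[1]))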

Note that the suggested Winsorized values are found using values in the first recall but are applied to the entire dataset. Additionally, only positive values can be used since the Box-Cox transformation is invalid for zero and negative values. The boxcox_survey() function will only use positive values to find the best lambda. Since sodium is a daily consumed nutrient, values are expected to be above zero for all subjects.

For Winsorization, the id and repeat.obs parameters are needed to specify the subject and observation of each row of the dataset. This allows the algorithm to identify which rows are in need of Winsorization and produce a report.

#run Box-Cox survey on sodium variable to get suggested Winsorization cutoffs using the weighting for the point estimate
wins.sodium <- boxcox_survey(input.data=input.dataset,
                             row.subset=(input.dataset$Day2 == 0),
                             variable="Sodium",
                             weight="RepWt_0",
                             do.winsorization=TRUE,
                             iqr.multiple=2,
                             id="SEQN",
                             repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Sodium
#>   SEQN DAY Sodium Sodium.winsorized
#>  31214   1    282          387.7924
#>  31403   2     55          387.7924
#>  31435   1    338          387.7924
#>  35957   1    335          387.7924
#>  37945   2    151          387.7924
#>  40879   1  18053        13598.5605
#>  42997   1  20165        13598.5605
#>  43528   1    308          387.7924
#>  44183   1  17687        13598.5605
#>  46795   1  21248        13598.5605
#>  47865   1    315          387.7924
#>  50316   1  15337        13598.5605
#>  51510   2    115          387.7924
#>  57179   2    163          387.7924
#>  57963   2     60          387.7924
#>  59838   1    207          387.7924

The Winsorization report has identified extreme values of Sodium as being below 387.79 mg or above 13598.56 mg. Using these values, it is now possible to Winsorize the Sodium variable.

#apply the suggested cutoffs to the entire dataset
input.dataset$Sodium <- pmax(input.dataset$Sodium, 387.7924)
input.dataset$Sodium <- pmin(input.dataset$Sodium, 13598.5605)

The NCI method assumes that food and nutrient amounts are normally distributed on a transformed scale. The Sodium variable can be brought closer to normality using a Box-Cox transformation. A suitable transformation parameter can be found using the same boxcox_survey() function used to find the Winsorization cutoffs.
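
For reference, the standard Box-Cox family indexed by lambda can be written as a small function (a textbook definition, not the package's internal code):

#the Box-Cox transformation: (y^lambda - 1)/lambda for lambda != 0, log(y) for lambda == 0
boxcox.transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1)/lambda
}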

Unlike the Winsorization run, this call includes the covariates that will be used in the model. The covariates are specified as a vector.

#run Box-Cox survey with covariates to get transformation parameter
boxcox.lambda.data <- boxcox_survey(input.data=input.dataset,
                                    row.subset=(input.dataset$Day2 == 0),
                                    variable="Sodium",
                                    covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
                                    weight="RepWt_0")

Next, the minimum intake for sodium is calculated using the calculate_minimum_amount() function. The minimum amount is calculated as half the smallest non-zero intake in the dataset. This value becomes the minimum allowed usual intake when simulating usual intakes later in the workflow. As with the Box-Cox lambda, the minimum amount is found using the first recall.

minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
                                                row.subset=(input.dataset$Day2 == 0),
                                                daily.variables="Sodium")
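
As a sanity check, the same quantity can be computed by hand from the Day-1 rows. This sketch assumes the definition above with no additional adjustments:

#half the smallest non-zero Day-1 sodium intake
day1.sodium <- input.dataset$Sodium[input.dataset$Day2 == 0]
min(day1.sodium[day1.sodium > 0])/2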

Now, the nci_multivar_preprocessor() function can be used to generate the input dataset for the MCMC model. The preprocessor function will apply the Box-Cox transformation from boxcox_survey() to the sodium variable. The transformed variable is then standardized to a mean of zero and a variance of 2 as required by the MCMC algorithm.

Continuous covariates (such as age) should be standardized to a mean of zero and a variance of 1 for best results in the MCMC algorithm. The covariates to be standardized are given in continuous.covariates as a vector. The standardized covariates created by nci_multivar_preprocessor() are prefixed with std. (for example, std.RIDAGEYR).
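
A minimal sketch of that standardization, assuming simple unweighted moments (the package may compute these differently):

#illustrative standardization of a continuous covariate to mean 0, variance 1
age <- input.dataset$RIDAGEYR
std.age.illustrative <- (age - mean(age))/sd(age)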

pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
                                           daily.variables="Sodium",
                                           continuous.covariates="RIDAGEYR",
                                           boxcox.lambda.data=boxcox.lambda.data,
                                           minimum.amount.data=minimum.amount.data)

Before fitting the MCMC model, it is worthwhile to explore the parts of the MCMC input.

The mcmc.input element is the input dataset with the transformed and standardized variables and covariates added.

The backtransformation element contains the data used to transform the sodium variable. This data is necessary to interpret the MCMC results and will be used later to return the simulated usual intakes to the original scale.

names(pre.mcmc.data)
#> [1] "mcmc.input"         "backtransformation"

It is also important to be familiar with the backtransformation data columns:

  - tran_lambda: The Box-Cox lambda used to transform the variable.
  - minamount: The minimum allowed usual intake, defined as half of the smallest non-zero intake in the observed data.
  - tran_center: The mean of the Box-Cox transformed variable before standardization.
  - tran_scale: The standard deviation of the Box-Cox transformed variable before standardization, divided by sqrt(2).
  - biomarker: Logical flag indicating whether the variable is a biomarker assumed to be unbiased on the transformed scale. If FALSE, a bias correction factor will be added and a 9-point approximation will be used for backtransformation. If TRUE, an exact backtransformation will be used with no correction.

pre.mcmc.data$backtransformation
#>   variable tran_lambda minamount tran_center tran_scale biomarker
#> 1   Sodium        0.31  193.8962    35.57676   4.594267     FALSE
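
To make these columns concrete, a simplified version of the exact backtransformation (the biomarker = TRUE case) might look like the sketch below. The 9-point approximation and bias correction applied when biomarker is FALSE are not shown, so this is illustrative rather than the package's actual calculation.

#sketch: undo the standardization, then invert the Box-Cox transformation
bt <- pre.mcmc.data$backtransformation
backtransform.sketch <- function(z, bt) {
  t <- bt$tran_center + z*bt$tran_scale            #return to the transformed scale
  y <- (bt$tran_lambda*t + 1)^(1/bt$tran_lambda)   #invert the Box-Cox transformation
  pmax(y, bt$minamount)                            #enforce the minimum allowed intake
}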

Models must now be fit for the base weight and each set of replicate weights.

For BRR to work properly, nearly all of the BRR replicates should be usable in the variance calculation. It is good practice to verify that every BRR replicate has converged using techniques such as trace plots and the Gelman-Rubin diagnostic. Replicates that did not converge should not be included in the variance calculation.

Covariates to be used for every variable in the model are specified in default.covariates as a vector. Note that the standardized version of the age covariate generated by the preprocessor (std.RIDAGEYR) is used in the MCMC model.

The number of iterations to run the MCMC must also be selected. Often, this is selected by experimentation. More complex models with more variables usually need more iterations to run. The iteration parameters are:

  - num.mcmc.iterations: The total number of iterations to run the MCMC, including burn-in.
  - num.burn: The number of burn-in iterations discarded when calculating posterior means.
  - num.thin: The spacing between the iterations used to calculate posterior means, chosen to reduce autocorrelation between retained iterations.

It is important to choose iteration parameters so that there are enough thinned samples to obtain stable estimates of the posterior means. A good starting point is to make sure that the number of thinned iterations, calculated as (num.mcmc.iterations - num.burn)/num.thin, is at least 400.
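
For the settings used below, a quick check of the arithmetic:

#thinned iterations: (num.mcmc.iterations - num.burn)/num.thin
(3000 - 1000)/2   #1000 thinned iterations, comfortably above 400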

The random seed for the MCMC is given in mcmc.seed. The seed should be changed each time the MCMC is run within the analysis to avoid biasing the results. Re-running the MCMC models using the same sequence of seeds will produce the same results each time.

num.brr <- 8

mcmc.brr <- vector(mode="list", length=num.brr + 1)

for(brr.rep in 0:num.brr) {
  
  print(paste0("Starting Iteration ", brr.rep))
  
  mcmc.brr[[brr.rep + 1]] <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
                                               id="SEQN",
                                               repeat.obs="DAY",
                                               weight=paste0("RepWt_", brr.rep),
                                               daily.variables="Sodium",
                                               default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                                               num.mcmc.iterations=3000,
                                               num.burn=1000,
                                               num.thin=2,
                                               mcmc.seed=(9999 + brr.rep))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"

The next step is to use the nci_multivar_distrib() function to simulate a dataset of usual intakes to be used to represent the distribution of true usual intakes. Note that this is not the same as predicting or calculating true usual intakes for each subject. Summary statistics can then be calculated for the simulated usual intakes. This process must be done for each BRR replicate.

A population-based dataset must be created that contains all of the subjects and additional variables that will be used for simulating usual intakes and downstream analysis. For this example, the population base will be the first instance of each subject in the MCMC input dataset. Using the original data is not required and other workflows may use a different population base.

#get first instance of each subject
mcmc.input.data <- pre.mcmc.data$mcmc.input
distrib.pop <- mcmc.input.data[!duplicated(mcmc.input.data$SEQN),]
num.subjects <- nrow(distrib.pop)

Once the initial population base is created, nuisance variables must be accounted for. To factor out the effect of the recall being conducted on day 2, the Day2 variable will be set to zero for all subjects.

distrib.pop$Day2 <- 0

To account for weekend vs weekday consumption, the simulated usual intakes for weekends and weekdays will be weighted and averaged for each subject. To accomplish this, a repeat of each subject will be created in the population base. The first instance of each subject corresponds to weekday consumption (Weekend = 0) and the second instance corresponds to weekend consumption (Weekend = 1). Since weekends are defined as three days in this dataset (Friday, Saturday, and Sunday), weekend consumption will be given a weight of 3 while weekday consumption will be given a weight of 4.

#create a repeat of each subject
distrib.pop <- distrib.pop[rep(seq_len(num.subjects), each=2),]

#set weekend indicators and weights for the two instances of each subject
distrib.pop$Weekend <- rep(c(0,1), num.subjects)
distrib.pop$Weekend.Weight <- rep(c(4,3), num.subjects)

The population base and MCMC multivar parameters can now be used to create a distribution of simulated usual intakes for each BRR replicate. In this analysis, each subject will have 200 simulated usual intakes.

The id and weight parameters specify the subject identifier and weight from the population base to include in the output. For population bases with multiple nuisance variable levels (such as Weekend), nuisance.weight specifies the weight for each level. The number of usual intakes to simulate for each population base subject is num.simulated.u.

The random seed for usual intake simulation is given by distrib.seed. As with the MCMC, the seed should be changed for each call that simulates usual intakes to avoid biasing the results. Re-simulating the usual intakes with the same seeds and the same MCMC models will produce the same simulated values.

Summary statistics can then be calculated from the simulated usual intakes using the nci_multivar_summary() function. This is done in the same loop as the usual intake simulation so that the usual intake distribution dataset can be overwritten each time to save memory.

The proportions are specified as named lists for the lower and upper intake thresholds. The name(s) of the list represent the variable(s) to calculate proportions for while the value(s) represent the lower or upper intake limits.

The lower.thresholds parameter is used for calculating the proportions of intakes above a lower limit. Likewise, upper.thresholds is used for calculating the proportions of intakes below an upper limit.
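
For intuition, a proportion above a lower threshold is just the survey-weighted fraction of simulated usual intakes exceeding it. The sketch below assumes a distrib.data dataset like the one created in the loop that follows; it illustrates the definition, not the nci_multivar_summary() implementation.

#weighted proportion of simulated usual sodium intakes above 2200 mg (illustrative)
with(distrib.data, weighted.mean(usual.intake.Sodium > 2200, RepWt_0))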

summary.brr <- vector(mode="list", length=num.brr + 1)

for(brr.rep in 0:num.brr) {
  
  #create dataset with 200 simulated usual intakes for each subject
  distrib.data <- nci_multivar_distrib(multivar.mcmc.model=mcmc.brr[[brr.rep + 1]],
                                       distrib.population=distrib.pop,
                                       id="SEQN",
                                       weight=paste0("RepWt_", brr.rep),
                                       nuisance.weight="Weekend.Weight",
                                       num.simulated.u=200,
                                       distrib.seed=(99999 + brr.rep))
  
  #compute means, quantiles, and proportions for simulated sodium intakes
  summary.brr[[brr.rep + 1]] <- nci_multivar_summary(input.data=distrib.data,
                                                     variables="usual.intake.Sodium",
                                                     weight=paste0("RepWt_", brr.rep),
                                                     do.means=TRUE,
                                                     do.quantiles=TRUE,
                                                     quantiles=c(0.05, 0.25, 0.5, 0.75, 0.95),
                                                     do.proportions=TRUE,
                                                     lower.thresholds=list(usual.intake.Sodium=2200),
                                                     upper.thresholds=list(usual.intake.Sodium=3600))
}

With a point estimate and BRR replicates for every summary statistic, standard errors and confidence intervals can now be calculated. With the data set up as one column per replicate, this can be done easily and efficiently using vectorized code.

The BRR replicate weights in this dataset used a Fay factor of 0.7, so this must be accounted for in calculating the variance. Other datasets may use different Fay factors when generating BRR replicate weights, so it is important to verify it before calculating variances. If ordinary bootstrapping is used instead of BRR, the same standard error formula can be used with the Fay factor set to 1.

When calculating confidence intervals, it is important to use the correct number of degrees of freedom. This equals the total number of PSUs across all strata minus the number of strata. Since BRR uses exactly two PSUs per stratum, the degrees of freedom is simply the number of strata.
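
Written out, the Fay-adjusted BRR variance estimate of a statistic is

  Var = (1/(R*f^2)) * sum over replicates r of (theta_r - theta_0)^2

where R is the number of replicates, f is the Fay factor, theta_0 is the point estimate from the base weights, and theta_r is the estimate from replicate weight set r. The standard error is the square root of this variance, which is what the code below computes.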

#calculate degrees of freedom
df <- length(unique(input.dataset$SDMVSTRA))

#extract point estimate and BRR replicates
point <- summary.brr[[1]]$value
reps <- vapply(summary.brr[-1], function(brr.i) brr.i$value, point)

#calculate BRR standard error
std.error <- sqrt(rowSums((reps - point)^2)/(num.brr*fay.factor^2))

#95% confidence intervals
confidence.lower <- point + qt(0.025, df)*std.error
confidence.upper <- point + qt(0.975, df)*std.error

#create summary report
summary.report <- data.frame(summary.brr[[1]], 
                             std.error=std.error,
                             confidence.lower=confidence.lower,
                             confidence.upper=confidence.upper)

summary.report
#>   population                   variable  statistic        value    std.error
#> 1        All        usual.intake.Sodium       Mean 3596.3190633 9.531772e+01
#> 2        All        usual.intake.Sodium         5% 1979.7619246 4.228440e+01
#> 3        All        usual.intake.Sodium        25% 2773.9421088 6.337429e+01
#> 4        All        usual.intake.Sodium        50% 3477.6814193 8.972452e+01
#> 5        All        usual.intake.Sodium        75% 4290.7204553 1.206570e+02
#> 6        All        usual.intake.Sodium        95% 5617.5462303 1.987009e+02
#> 7        All usual.intake.Sodium > 2200 Proportion    0.9109695 9.321863e-03
#> 8        All usual.intake.Sodium < 3600 Proportion    0.5430692 3.317736e-02
#>   confidence.lower confidence.upper
#> 1     3363.0850124     3829.5531141
#> 2     1876.2957362     2083.2281130
#> 3     2618.8708106     2929.0134071
#> 4     3258.1334342     3697.2294044
#> 5     3995.4834388     4585.9574719
#> 6     5131.3426775     6103.7497831
#> 7        0.8881598        0.9337793
#> 8        0.4618871        0.6242513