NCI Method Food and Nutrient

Amer Moosa

2/15/2024

The NCI method can be used to model multiple variables at once, including both episodically consumed foods (referred to as ‘foods’) and daily consumed nutrients (referred to as ‘nutrients’). This vignette demonstrates how to perform a simple multivariate analysis using one food and one nutrient.

This vignette assumes basic familiarity with the NCI method and the ncimultivar package. A more in-depth introduction to the basic NCI method procedures can be found in the daily nutrient vignette.

The workflow for an analysis involving episodic foods is as follows:

  1. Prepare the input dataset variables for analysis and winsorize using boxcox_survey().
  2. Find Box-Cox transformation parameters for the food and nutrient using boxcox_survey().
  3. Transform, standardize, and create indicators for the variables to be analyzed using nci_multivar_preprocessor().
  4. Fit a measurement error model using nci_multivar_mcmc().
  5. Create a dataset of simulated usual intakes with nci_multivar_distrib().
  6. Compute summary statistics such as means, quantiles, and proportions using nci_multivar_summary().

Some of the steps are modified to accommodate non-consumption days for episodically consumed foods. Standard errors and confidence intervals for the calculated statistics will be generated by resampling with balanced repeated replication (BRR).

library(ncimultivar)

The dataset used for this analysis is derived from the 2005-2010 NHANES data. A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.

The subject identifier is SEQN and each observation for a subject is identified by DAY.

The food being analyzed in this example is whole grains (G_WHOLE). The nutrient being analyzed is energy (TKCAL).

The covariates being examined are smoking status (SMK_REC), age (RIDAGEYR), and sex (RIAGENDR). Two nuisance covariates will be factored in as well: whether the recall was on a weekend (Weekend) and and whether the recall is on day 2 (Day2).

The WTDRD1 variable is the base weighting for each observation.

Subjects with missing values are removed, and categorical variables are transformed into binary indicators.

#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]

#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)

#remove subjects that are missing any covariates or variables
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) | 
                      is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$G_WHOLE) | is.na(input.dataset$TKCAL)

input.dataset <- input.dataset[!missing.covariates & !missing.variables,]

#break down smoking status into binary indicators
input.dataset$Current.Smoker <- as.numeric(input.dataset$SMK_REC == 1)
input.dataset$Former.Smoker <- as.numeric(input.dataset$SMK_REC == 2)
input.dataset$Never.Smoker <- as.numeric(input.dataset$SMK_REC == 3)

#rename whole grain and energy variables for readability
input.dataset$Whole.Grain <- input.dataset$G_WHOLE
input.dataset$Energy <- input.dataset$TKCAL

BRR weights will be added to the dataset.

fay.factor <- 0.7

input.dataset <- brr_weights(input.data=input.dataset,
                             id="SEQN",
                             strata="SDMVSTRA",
                             psu="SDMVPSU",
                             cell="PSCELL",
                             weight="WTDRD1",
                             fay.factor=fay.factor)

Next, extreme observations will be identified for Winsorization. The boxcox_survey() function must be run for both variables. The non-zero values of the first recall will be used.

Since whole grain is an episodic variable, Winsorization for low outliers is done slightly differently. Instead of changing values that are too small to a minimum threshold, they are set to zero and treated as a non-consumption observation. This behavior is toggled by setting the is.episodic parameter to TRUE.

wins.whole.grain <- boxcox_survey(input.data=input.dataset,
                                  row.subset=(input.dataset$Day2 == 0),
                                  variable="Whole.Grain",
                                  is.episodic=TRUE,
                                  weight="RepWt_0",
                                  do.winsorization=TRUE,
                                  iqr.multiple=2,
                                  id="SEQN",
                                  repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Whole.Grain
#>   SEQN DAY Whole.Grain Whole.Grain.winsorized
#>  35580   1       11.04               10.71163
#>  55905   2       12.19               10.71163

wins.energy <- boxcox_survey(input.data=input.dataset,
                             row.subset=(input.dataset$Day2 == 0),
                             variable="Energy",
                             weight="RepWt_0",
                             do.winsorization=TRUE,
                             iqr.multiple=2,
                             id="SEQN",
                             repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Energy
#>   SEQN DAY Energy Energy.winsorized
#>  31435   1    267          269.0701
#>  33646   2    197          269.0701
#>  40879   1   8550         8026.0436
#>  42997   1  12823         8026.0436
#>  44183   1  13398         8026.0436
#>  46795   1  13509         8026.0436
#>  51510   2    192          269.0701
#>  52708   2    214          269.0701
#>  57267   1   9165         8026.0436
#>  57920   2    224          269.0701
#>  57963   2      0          269.0701

#Winsorize whole grain
input.dataset$Whole.Grain <- pmin(input.dataset$Whole.Grain, 10.71163)

#Winsorize energy
input.dataset$Energy <- pmax(input.dataset$Energy, 269.0701)
input.dataset$Energy <- pmin(input.dataset$Energy, 8026.0436)

Next, the best Box-Cox lambda parameter for each variable can be found using the Winsorized data in the presence of covariates. A Box-Cox survey must be run for each variable used in the model, and the ootput datasets are concatenated.

boxcox.whole.grain <- boxcox_survey(input.data=input.dataset,
                                    row.subset=(input.dataset$Day2 == 0),
                                    variable="Whole.Grain",
                                    covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
                                    weight="RepWt_0")

boxcox.energy <- boxcox_survey(input.data=input.dataset,
                               row.subset=(input.dataset$Day2 == 0),
                               variable="Energy",
                               covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
                               weight="RepWt_0")

boxcox.lambda.data <- rbind(boxcox.whole.grain, boxcox.energy)

Next, the minimum amounts for whole grain and energy can be calculated using the first recall.

The calculate_minimum_amount() function can handle multiple variables at the same time. Episodically consumed foods are specified with episodic.variables while daily consumed nutrients are specified with daily.variables. If there are multiple daily and/or episodic variables, they should be specified as vectors in their respective parameters.

minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
                                                row.subset=(input.dataset$Day2 == 0),
                                                episodic.variables="Whole.Grain",
                                                daily.variables="Energy")

The nci_multivar_preprocessor() function can now generate the MCMC input data. Whole grains should be input in the episodic.variables parameter to tell the function to generate a consumption indicator variable in addition to an amount variable.

The episodic.variables and daily.variables parameters are vectors of episodically consumed foods and daily consumed nutrients, respectively.

Continuous covariates to be standardized are specified with continuous.covariates just like the daily nutrient analysis.

pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
                                           episodic.variables="Whole.Grain",
                                           daily.variables="Energy",
                                           continuous.covariates="RIDAGEYR",
                                           boxcox.lambda.data=boxcox.lambda.data,
                                           minimum.amount.data=minimum.amount.data)

The preprocessor creates the indicator variable ind.Whole.Grain for and amount variable amt.Whole.Grain for Whole.Grain. The indicator variable shows whether or not the subject reported consuming whole grains on a particular day. The amount is always non-zero when the indicator is TRUE and missing when the indicator is FALSE.

The first 10 observations demonstrate the relationship between indicators and amounts.

head(pre.mcmc.data$mcmc.input[,c("SEQN", "DAY", "ind.Whole.Grain", "amt.Whole.Grain")], 10)
#>     SEQN DAY ind.Whole.Grain amt.Whole.Grain
#> 1  31155   1           FALSE              NA
#> 2  31155   2           FALSE              NA
#> 3  31186   1           FALSE              NA
#> 4  31186   2            TRUE      -0.3457101
#> 5  31187   1            TRUE      -0.6091954
#> 6  31187   2            TRUE      -1.9316578
#> 7  31214   1           FALSE              NA
#> 8  31214   2           FALSE              NA
#> 9  31228   1           FALSE              NA
#> 10 31228   2            TRUE       0.7615192

The MCMC model can now be fit using both whole grains and energy for all of the BRR replicates.

Episodic and daily variables are specified in episodic.variables and daily.variables as in the previous steps.

Note that the number of iterations and burn-in is higher than for a single daily consumed nutrient. This is because models with episodic variables and multivariate models take more iterations to converge.

num.brr <- 8

mcmc.brr <- vector(mode="list", length=num.brr + 1)

for(brr.rep in 0:num.brr) {
  
  print(paste0("Starting Iteration ", brr.rep))
  
  mcmc.brr[[brr.rep + 1]] <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
                                               id="SEQN",
                                               repeat.obs="DAY",
                                               weight=paste0("RepWt_", brr.rep),
                                               episodic.variables="Whole.Grain",
                                               daily.variables="Energy",
                                               default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                                               num.mcmc.iterations=4000,
                                               num.burn=2000,
                                               num.thin=2,
                                               mcmc.seed=(9999 + brr.rep))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"

The nci_multivar_distrib() function will be used to simulate a dataset of usual intakes to be used to represent the distribution of true usual intakes. This procedure does not calculate or predict true usual intakes for each subject.

A population-based dataset must be constructed as with the daily nutrient analysis.

#get first instance of each subject
mcmc.input.data <- pre.mcmc.data$mcmc.input
distrib.pop <- mcmc.input.data[!duplicated(mcmc.input.data$SEQN),]
num.subjects <- nrow(distrib.pop)

#Set Day 2 to zero to factor out the effect of Day 2 recalls
distrib.pop$Day2 <- 0

#create repeats of each subject for weekday and weekend consumption
distrib.pop <- distrib.pop[rep(seq_len(num.subjects), each=2),]
distrib.pop$Weekend <- rep(c(0,1), num.subjects)
distrib.pop$Weekend.Weight <- rep(c(4,3), num.subjects)

The nci_multivar_distrib() function can now be used to simulate 200 usual intakes for each subject for each BRR replicate. The resulting distribution dataset contains simulated usual intakes for both whole grains and energy.

The nci_multivar_summary() function is used to calculate means and quantiles. Multiple variables can be summarized at the same time by specifying them as a vector in the variables parameter.

summary.brr <- vector(mode="list", length=num.brr + 1)

for(brr.rep in 0:num.brr) {
  
  print(paste0("Starting Iteration ", brr.rep))
  
  #create dataset with 200 simulated usual intakes for each subject
  distrib.data <- nci_multivar_distrib(multivar.mcmc.model=mcmc.brr[[brr.rep + 1]],
                                       distrib.population=distrib.pop,
                                       id="SEQN",
                                       weight=paste0("RepWt_", brr.rep),
                                       nuisance.weight="Weekend.Weight",
                                       num.simulated.u=200,
                                       distrib.seed=(99999 + brr.rep))
  
  #compute means, quantiles, and proportions for simulated whole grain and energy intakes
  summary.brr[[brr.rep + 1]] <- nci_multivar_summary(input.data=distrib.data,
                                                     variables=c("usual.intake.Whole.Grain", "usual.intake.Energy"),
                                                     weight=paste0("RepWt_", brr.rep),
                                                     do.means=TRUE,
                                                     do.quantiles=TRUE,
                                                     quantiles=c(0.05, 0.25, 0.5, 0.75, 0.95))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"

The BRR standard errors and confidence intervals can now be calculated. To properly calculate the BRR standard error, it is important to divide by the square of the Fay factor that was used to generate the BRR replicate weights. For this dataset, the replicate weights were generated with a Fay factor of 0.7, but this can vary for different datasets. For ordinary bootstrapping, the same formula can be used with the Fay factor set to 1.

#calculate degrees of freedom
df <- length(unique(input.dataset$SDMVSTRA))

#extract point estimate and BRR replicates
point <- summary.brr[[1]]$value
reps <- vapply(summary.brr[-1], function(brr.i) brr.i$value, point)

#calculate BRR standard error
std.error <- sqrt(rowSums((reps - point)^2)/(num.brr*fay.factor^2))

#95% confidence intervals
confidence.lower <- point + qt(0.025, df)*std.error
confidence.upper <- point + qt(0.975, df)*std.error

#create summary report
summary.report <- data.frame(summary.brr[[1]], 
                             std.error=std.error,
                             confidence.lower=confidence.lower,
                             confidence.upper=confidence.upper)

summary.report
#>    population                 variable statistic        value   std.error
#> 1         All usual.intake.Whole.Grain      Mean 8.407250e-01  0.11187065
#> 2         All      usual.intake.Energy      Mean 2.243106e+03 55.36184419
#> 3         All usual.intake.Whole.Grain        5% 2.900628e-02  0.01666859
#> 4         All usual.intake.Whole.Grain       25% 2.333891e-01  0.06565393
#> 5         All usual.intake.Whole.Grain       50% 6.190778e-01  0.11533789
#> 6         All usual.intake.Whole.Grain       75% 1.217915e+00  0.17572342
#> 7         All usual.intake.Whole.Grain       95% 2.413512e+00  0.29579432
#> 8         All      usual.intake.Energy        5% 1.238884e+03 62.15913933
#> 9         All      usual.intake.Energy       25% 1.718265e+03 62.29657087
#> 10        All      usual.intake.Energy       50% 2.160088e+03 59.37482236
#> 11        All      usual.intake.Energy       75% 2.680624e+03 56.49799558
#> 12        All      usual.intake.Energy       95% 3.527010e+03 67.08672144
#>    confidence.lower confidence.upper
#> 1        0.56698734     1.114463e+00
#> 2     2107.63999571     2.378571e+03
#> 3       -0.01178028     6.979285e-02
#> 4        0.07273972     3.940385e-01
#> 5        0.33685613     9.012994e-01
#> 6        0.78793572     1.647895e+00
#> 7        1.68972890     3.137294e+00
#> 8     1086.78572143     1.390982e+03
#> 9     1565.83093817     1.870699e+03
#> 10    2014.80334530     2.305373e+03
#> 11    2542.37847758     2.818870e+03
#> 12    3362.85478898     3.691165e+03