The NCI method can be used to model multiple variables at once, including both episodically consumed foods (referred to as ‘foods’) and daily consumed nutrients (referred to as ‘nutrients’). This vignette demonstrates how to perform a simple multivariate analysis using one food and one nutrient.
This vignette assumes basic familiarity with the NCI method and the ncimultivar package. A more in-depth introduction to the basic NCI method procedures can be found in the daily nutrient vignette.
The workflow for an analysis involving episodic foods is as follows:
boxcox_survey().boxcox_survey().nci_multivar_preprocessor().nci_multivar_mcmc().nci_multivar_distrib().nci_multivar_summary().Some of the steps are modified to accommodate non-consumption days for episodically consumed foods. Standard errors and confidence intervals for the calculated statistics will be generated by resampling with balanced repeated replication (BRR).
The dataset used for this analysis is derived from the 2005-2010
NHANES data. A subset of six strata (SDMVSTRA) will be used
to reduce computation time and allow this example to run in real
time.
The subject identifier is SEQN and each observation for
a subject is identified by DAY.
The food being analyzed in this example is whole grains
(G_WHOLE). The nutrient being analyzed is energy
(TKCAL).
The covariates being examined are smoking status
(SMK_REC), age (RIDAGEYR), and sex
(RIAGENDR). Two nuisance covariates will be factored in as
well: whether the recall was on a weekend (Weekend) and and
whether the recall is on day 2 (Day2).
The WTDRD1 variable is the base weighting for each
observation.
Subjects with missing values are removed, and categorical variables are transformed into binary indicators.
#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]
#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)
#remove subjects that are missing any covariates or variables
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) |
is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$G_WHOLE) | is.na(input.dataset$TKCAL)
input.dataset <- input.dataset[!missing.covariates & !missing.variables,]
#break down smoking status into binary indicators
input.dataset$Current.Smoker <- as.numeric(input.dataset$SMK_REC == 1)
input.dataset$Former.Smoker <- as.numeric(input.dataset$SMK_REC == 2)
input.dataset$Never.Smoker <- as.numeric(input.dataset$SMK_REC == 3)
#rename whole grain and energy variables for readability
input.dataset$Whole.Grain <- input.dataset$G_WHOLE
input.dataset$Energy <- input.dataset$TKCALBRR weights will be added to the dataset.
fay.factor <- 0.7
input.dataset <- brr_weights(input.data=input.dataset,
id="SEQN",
strata="SDMVSTRA",
psu="SDMVPSU",
cell="PSCELL",
weight="WTDRD1",
fay.factor=fay.factor)Next, extreme observations will be identified for Winsorization. The
boxcox_survey() function must be run for both variables.
The non-zero values of the first recall will be used.
Since whole grain is an episodic variable, Winsorization for low
outliers is done slightly differently. Instead of changing values that
are too small to a minimum threshold, they are set to zero and treated
as a non-consumption observation. This behavior is toggled by setting
the is.episodic parameter to TRUE.
wins.whole.grain <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Whole.Grain",
is.episodic=TRUE,
weight="RepWt_0",
do.winsorization=TRUE,
iqr.multiple=2,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Whole.Grain
#> SEQN DAY Whole.Grain Whole.Grain.winsorized
#> 35580 1 11.04 10.71163
#> 55905 2 12.19 10.71163
wins.energy <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Energy",
weight="RepWt_0",
do.winsorization=TRUE,
iqr.multiple=2,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Energy
#> SEQN DAY Energy Energy.winsorized
#> 31435 1 267 269.0701
#> 33646 2 197 269.0701
#> 40879 1 8550 8026.0436
#> 42997 1 12823 8026.0436
#> 44183 1 13398 8026.0436
#> 46795 1 13509 8026.0436
#> 51510 2 192 269.0701
#> 52708 2 214 269.0701
#> 57267 1 9165 8026.0436
#> 57920 2 224 269.0701
#> 57963 2 0 269.0701
#Winsorize whole grain
input.dataset$Whole.Grain <- pmin(input.dataset$Whole.Grain, 10.71163)
#Winsorize energy
input.dataset$Energy <- pmax(input.dataset$Energy, 269.0701)
input.dataset$Energy <- pmin(input.dataset$Energy, 8026.0436)Next, the best Box-Cox lambda parameter for each variable can be found using the Winsorized data in the presence of covariates. A Box-Cox survey must be run for each variable used in the model, and the ootput datasets are concatenated.
boxcox.whole.grain <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Whole.Grain",
covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
weight="RepWt_0")
boxcox.energy <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Energy",
covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
weight="RepWt_0")
boxcox.lambda.data <- rbind(boxcox.whole.grain, boxcox.energy)Next, the minimum amounts for whole grain and energy can be calculated using the first recall.
The calculate_minimum_amount() function can handle
multiple variables at the same time. Episodically consumed foods are
specified with episodic.variables while daily consumed
nutrients are specified with daily.variables. If there are
multiple daily and/or episodic variables, they should be specified as
vectors in their respective parameters.
minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
episodic.variables="Whole.Grain",
daily.variables="Energy")The nci_multivar_preprocessor() function can now
generate the MCMC input data. Whole grains should be input in the
episodic.variables parameter to tell the function to
generate a consumption indicator variable in addition to an amount
variable.
The episodic.variables and daily.variables
parameters are vectors of episodically consumed foods and daily consumed
nutrients, respectively.
Continuous covariates to be standardized are specified with
continuous.covariates just like the daily nutrient analysis.
pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
episodic.variables="Whole.Grain",
daily.variables="Energy",
continuous.covariates="RIDAGEYR",
boxcox.lambda.data=boxcox.lambda.data,
minimum.amount.data=minimum.amount.data)The preprocessor creates the indicator variable
ind.Whole.Grain for and amount variable
amt.Whole.Grain for Whole.Grain. The indicator
variable shows whether or not the subject reported consuming whole
grains on a particular day. The amount is always non-zero when the
indicator is TRUE and missing when the indicator is
FALSE.
The first 10 observations demonstrate the relationship between indicators and amounts.
head(pre.mcmc.data$mcmc.input[,c("SEQN", "DAY", "ind.Whole.Grain", "amt.Whole.Grain")], 10)
#> SEQN DAY ind.Whole.Grain amt.Whole.Grain
#> 1 31155 1 FALSE NA
#> 2 31155 2 FALSE NA
#> 3 31186 1 FALSE NA
#> 4 31186 2 TRUE -0.3457101
#> 5 31187 1 TRUE -0.6091954
#> 6 31187 2 TRUE -1.9316578
#> 7 31214 1 FALSE NA
#> 8 31214 2 FALSE NA
#> 9 31228 1 FALSE NA
#> 10 31228 2 TRUE 0.7615192The MCMC model can now be fit using both whole grains and energy for all of the BRR replicates.
Episodic and daily variables are specified in
episodic.variables and daily.variables as in
the previous steps.
Note that the number of iterations and burn-in is higher than for a single daily consumed nutrient. This is because models with episodic variables and multivariate models take more iterations to converge.
num.brr <- 8
mcmc.brr <- vector(mode="list", length=num.brr + 1)
for(brr.rep in 0:num.brr) {
print(paste0("Starting Iteration ", brr.rep))
mcmc.brr[[brr.rep + 1]] <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight=paste0("RepWt_", brr.rep),
episodic.variables="Whole.Grain",
daily.variables="Energy",
default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
num.mcmc.iterations=4000,
num.burn=2000,
num.thin=2,
mcmc.seed=(9999 + brr.rep))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"The nci_multivar_distrib() function will be used to
simulate a dataset of usual intakes to be used to represent the
distribution of true usual intakes. This procedure does not calculate or
predict true usual intakes for each subject.
A population-based dataset must be constructed as with the daily nutrient analysis.
#get first instance of each subject
mcmc.input.data <- pre.mcmc.data$mcmc.input
distrib.pop <- mcmc.input.data[!duplicated(mcmc.input.data$SEQN),]
num.subjects <- nrow(distrib.pop)
#Set Day 2 to zero to factor out the effect of Day 2 recalls
distrib.pop$Day2 <- 0
#create repeats of each subject for weekday and weekend consumption
distrib.pop <- distrib.pop[rep(seq_len(num.subjects), each=2),]
distrib.pop$Weekend <- rep(c(0,1), num.subjects)
distrib.pop$Weekend.Weight <- rep(c(4,3), num.subjects)The nci_multivar_distrib() function can now be used to
simulate 200 usual intakes for each subject for each BRR replicate. The
resulting distribution dataset contains simulated usual intakes for both
whole grains and energy.
The nci_multivar_summary() function is used to calculate
means and quantiles. Multiple variables can be summarized at the same
time by specifying them as a vector in the variables
parameter.
summary.brr <- vector(mode="list", length=num.brr + 1)
for(brr.rep in 0:num.brr) {
print(paste0("Starting Iteration ", brr.rep))
#create dataset with 200 simulated usual intakes for each subject
distrib.data <- nci_multivar_distrib(multivar.mcmc.model=mcmc.brr[[brr.rep + 1]],
distrib.population=distrib.pop,
id="SEQN",
weight=paste0("RepWt_", brr.rep),
nuisance.weight="Weekend.Weight",
num.simulated.u=200,
distrib.seed=(99999 + brr.rep))
#compute means, quantiles, and proportions for simulated whole grain and energy intakes
summary.brr[[brr.rep + 1]] <- nci_multivar_summary(input.data=distrib.data,
variables=c("usual.intake.Whole.Grain", "usual.intake.Energy"),
weight=paste0("RepWt_", brr.rep),
do.means=TRUE,
do.quantiles=TRUE,
quantiles=c(0.05, 0.25, 0.5, 0.75, 0.95))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"The BRR standard errors and confidence intervals can now be calculated. To properly calculate the BRR standard error, it is important to divide by the square of the Fay factor that was used to generate the BRR replicate weights. For this dataset, the replicate weights were generated with a Fay factor of 0.7, but this can vary for different datasets. For ordinary bootstrapping, the same formula can be used with the Fay factor set to 1.
#calculate degrees of freedom
df <- length(unique(input.dataset$SDMVSTRA))
#extract point estimate and BRR replicates
point <- summary.brr[[1]]$value
reps <- vapply(summary.brr[-1], function(brr.i) brr.i$value, point)
#calculate BRR standard error
std.error <- sqrt(rowSums((reps - point)^2)/(num.brr*fay.factor^2))
#95% confidence intervals
confidence.lower <- point + qt(0.025, df)*std.error
confidence.upper <- point + qt(0.975, df)*std.error
#create summary report
summary.report <- data.frame(summary.brr[[1]],
std.error=std.error,
confidence.lower=confidence.lower,
confidence.upper=confidence.upper)
summary.report
#> population variable statistic value std.error
#> 1 All usual.intake.Whole.Grain Mean 8.407250e-01 0.11187065
#> 2 All usual.intake.Energy Mean 2.243106e+03 55.36184419
#> 3 All usual.intake.Whole.Grain 5% 2.900628e-02 0.01666859
#> 4 All usual.intake.Whole.Grain 25% 2.333891e-01 0.06565393
#> 5 All usual.intake.Whole.Grain 50% 6.190778e-01 0.11533789
#> 6 All usual.intake.Whole.Grain 75% 1.217915e+00 0.17572342
#> 7 All usual.intake.Whole.Grain 95% 2.413512e+00 0.29579432
#> 8 All usual.intake.Energy 5% 1.238884e+03 62.15913933
#> 9 All usual.intake.Energy 25% 1.718265e+03 62.29657087
#> 10 All usual.intake.Energy 50% 2.160088e+03 59.37482236
#> 11 All usual.intake.Energy 75% 2.680624e+03 56.49799558
#> 12 All usual.intake.Energy 95% 3.527010e+03 67.08672144
#> confidence.lower confidence.upper
#> 1 0.56698734 1.114463e+00
#> 2 2107.63999571 2.378571e+03
#> 3 -0.01178028 6.979285e-02
#> 4 0.07273972 3.940385e-01
#> 5 0.33685613 9.012994e-01
#> 6 0.78793572 1.647895e+00
#> 7 1.68972890 3.137294e+00
#> 8 1086.78572143 1.390982e+03
#> 9 1565.83093817 1.870699e+03
#> 10 2014.80334530 2.305373e+03
#> 11 2542.37847758 2.818870e+03
#> 12 3362.85478898 3.691165e+03