The output of the NCI method can be used to perform statistical procedures for measurement error correction such as regression calibration.
This vignette assumes some familiarity with the NCI method and the ncimultivar package. A more in-depth overview of the basic NCI method procedures can be found in the daily nutrient vignette.
This example will demonstrate the workflow of a basic regression calibration analysis:
The standard errors for the model parameters will be calculated using balanced repeated replication (BRR).
While this example demonstrates regression using base R, it is recommended to use a dedicated function for survey data such as from the ‘survey’ R package.
The dataset used for this analysis is derived from the 2005-2010
NHANES data. A subset of six strata (SDMVSTRA) will be used
to reduce computation time and allow this example to run in real
time.
The nutrient being analyzed is potassium (TPOTA).
The covariates being examined are smoking status
(SMK_REC), age (RIDAGEYR), and sex
(RIAGENDR). The two nuisance covariates are whether the
recall was on a weekend (Weekend) and and whether the
recall is on day 2 (Day2).
The WTDRD1 variable is the base weighting for each
observation.
The outcome variables are systolic blood pressure
(BPSY_AVG) and hypertension (HTN_BIN). A
linear model will be fit for systolic blood pressure and a logistic
model will be fit for hypertension.
Subjects with missing values are removed, and categorical variables are transformed into binary indicators.
#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]
#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)
#remove subjects that are missing any covariates, variables, or outcomes
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) |
is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$TPOTA)
missing.outcomes <- is.na(input.dataset$BPSY_AVG) | is.na(input.dataset$HTN_BIN)
input.dataset <- input.dataset[!missing.covariates & !missing.variables & !missing.outcomes,]
#break down smoking status into binary indicators
input.dataset$Current.Smoker <- as.numeric(input.dataset$SMK_REC == 1)
input.dataset$Former.Smoker <- as.numeric(input.dataset$SMK_REC == 2)
input.dataset$Never.Smoker <- as.numeric(input.dataset$SMK_REC == 3)
#rename potassium variable for readability
input.dataset$Potassium <- input.dataset$TPOTABRR weights will be added to the dataset.
#generate BRR weights
fay.factor <- 0.7
input.dataset <- brr_weights(input.data=input.dataset,
id="SEQN",
strata="SDMVSTRA",
psu="SDMVPSU",
cell="PSCELL",
weight="WTDRD1",
fay.factor=fay.factor)The potassium variable can now be transformed and standardized for use in the MCMC algorithm.
#Winsorize extreme values of potassium intake
wins.potassium <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Potassium",
weight="RepWt_0",
do.winsorization=TRUE,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Potassium
#> SEQN DAY Potassium Potassium.winsorized
#> 57963 2 0 42.45263
input.dataset$Potassium <- pmax(input.dataset$Potassium, 42.45263)
#run Box-Cox survey and create Box-Cox lambda data using the first recall
boxcox.lambda.data <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Potassium",
covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
weight="RepWt_0")
#Calculate minimum amount of potassium intake in the first recall
minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
daily.variables="Potassium")
#Run MCMC pre-preprocessor
pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
daily.variables="Potassium",
continuous.covariates="RIDAGEYR",
boxcox.lambda.data=boxcox.lambda.data,
minimum.amount.data=minimum.amount.data)The MCMC measurement error model can now be fit for all BRR replicates. The outcome variable is not involved in the measurement error model.
To perform regression calibration, draws of the U matrix conditional
on the mean MCMC parameters must be taken. This is accomplished using
the num.post parameter in the
nci_multivar_mcmc() function. Each conditional U matrix
draw will be used in place of a simulated U matrix in
nci_multivar_distrib() to simulate a usual intake for each
subject. To ensure that enough data is generated to produce good
estimates of the model parameters, 500 conditional U matrices will be
drawn in order to simulate 500 usual intakes for each subject.
num.brr <- 8
mcmc.brr <- vector(mode="list", length=num.brr + 1)
for(brr.rep in 0:num.brr) {
print(paste0("Starting Iteration ", brr.rep))
mcmc.brr[[brr.rep + 1]] <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight=paste0("RepWt_", brr.rep),
daily.variables="Potassium",
default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
num.mcmc.iterations=3000,
num.burn=1000,
num.post=500,
num.thin=2,
mcmc.seed=(9999 + brr.rep))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"A dataset with simulated usual intakes for each subject can now be
created using nci_multivar_distrib(). This dataset
represents the conditional expectation of usual intake given the
observed data for each subject to be used in the regression calibration
procedure. It is not a prediction or calculation of the true usual
intake for individuals.
The population-based dataset is created the same way as for the daily nutrient analysis.
#get first instance of each subject
mcmc.input.data <- pre.mcmc.data$mcmc.input
distrib.pop <- mcmc.input.data[!duplicated(mcmc.input.data$SEQN),]
num.subjects <- nrow(distrib.pop)
#Set Day 2 to zero to factor out the effect of Day 2 recalls
distrib.pop$Day2 <- 0
#create repeats of each subject for weekday and weekend consumption
distrib.pop <- distrib.pop[rep(seq_len(num.subjects), each=2),]
distrib.pop$Weekend <- rep(c(0,1), num.subjects)
distrib.pop$Weekend.Weight <- rep(c(4,3), num.subjects)Each subject will have 500 simulated usual intakes from the 500
conditional U matrix draws taken from the output of
nci_multivar_mcmc(). In order to tell
nci_multivar_distrib() to use the conditional U matrix
draws instead of simulating U matrices, the
use.mcmc.u.matrices parameter must be set to
TRUE.
The additional.output parameter is used to include the
outcome variable and covariates in the distribution dataset to use in
the regression. The variables supplied in additional.output
will be passed through from the population base dataset.
To perform regression calibration, the 500 simulated intakes for each subject need to be averaged. The mean simulated intakes are then used as a covariate in the regression models. The effect per milligram of potassium is very small, so the potassium usual intake will be converted to grams so that the coefficient is more readable.
The coefficients of the models are saved for each BRR replicate.
summary.brr <- vector(mode="list", length=num.brr + 1)
for(brr.rep in 0:num.brr) {
print(paste0("Starting Iteration ", brr.rep))
#simulate usual intakes for each subject using MCMC U matrices
distrib.data <- nci_multivar_distrib(multivar.mcmc.model=mcmc.brr[[brr.rep + 1]],
distrib.population=distrib.pop,
id="SEQN",
weight=paste0("RepWt_", brr.rep),
nuisance.weight="Weekend.Weight",
additional.output=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "BPSY_AVG", "HTN_BIN"),
use.mcmc.u.matrices=TRUE,
distrib.seed=99999+brr.rep)
#take average usual intake per subject
regression.data <- aggregate(distrib.data[,"usual.intake.Potassium",drop=FALSE],
by=distrib.data[,c("SEQN", paste0("RepWt_", brr.rep), "Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR",
"BPSY_AVG", "HTN_BIN")],
mean)
#scale down simulated potassium intake by 1000 to show the effect per 1,000 mg of potassium
regression.data$usual.intake.Potassium <- regression.data$usual.intake.Potassium/1000
#rescale weights so that models can converge
regression.data[,paste0("RepWt_", brr.rep)] <- regression.data[,paste0("RepWt_", brr.rep)]/mean(regression.data[,paste0("RepWt_", brr.rep)])
#fit linear model of systolic blood pressure against the average simulated potassium intakes
bp.model <- lm(BPSY_AVG ~ usual.intake.Potassium + Current.Smoker + Former.Smoker + RIDAGEYR + RIAGENDR,
data=regression.data,
weights=regression.data[,paste0("RepWt_", brr.rep)])
bp.parameters <- summary_coefficients(model=bp.model)
#fit logistic model of diagnosed hypertension against the average simulated potassium intakes
#family is specified as quasibinomial to handle non-integer weights
htn.model <- glm(HTN_BIN ~ usual.intake.Potassium + Current.Smoker + Former.Smoker + RIDAGEYR + RIAGENDR,
data=regression.data,
weights=regression.data[,paste0("RepWt_", brr.rep)],
family=quasibinomial(link="logit"))
htn.parameters <- summary_coefficients(model=htn.model)
summary.brr[[brr.rep + 1]] <- rbind(bp.parameters,
htn.parameters)
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"The BRR standard error and confidence intervals can be calculated as usual. The BRR replicate weights in this dataset used a Fay factor of 0.7, so this must be accounted for in calculating the variance. Other datasets may use different Fay factors when generating BRR replicate weights, so it is important to verify it before calculating variances.
#calculate degrees of freedom
df <- length(unique(input.dataset$SDMVSTRA))
#extract point estimate and BRR replicates
point <- summary.brr[[1]]$value
reps <- vapply(summary.brr[-1], function(brr.i) brr.i$value, point)
#calculate BRR standard error
std.error <- sqrt(rowSums((reps - point)^2)/(num.brr*fay.factor^2))
#95% confidence intervals
confidence.lower <- point + qt(0.025, df)*std.error
confidence.upper <- point + qt(0.975, df)*std.error
#create summary report
summary.report <- data.frame(summary.brr[[1]],
std.error=std.error,
confidence.lower=confidence.lower,
confidence.upper=confidence.upper)
summary.report
#> population variable statistic value
#> 1 All BPSY_AVG Coefficient for (Intercept) 113.05108719
#> 2 All BPSY_AVG Coefficient for usual.intake.Potassium -2.35735666
#> 3 All BPSY_AVG Coefficient for Current.Smoker -0.59002768
#> 4 All BPSY_AVG Coefficient for Former.Smoker 0.28047403
#> 5 All BPSY_AVG Coefficient for RIDAGEYR 0.35397544
#> 6 All BPSY_AVG Coefficient for RIAGENDR -6.51452720
#> 7 All HTN_BIN Coefficient for (Intercept) -2.88269021
#> 8 All HTN_BIN Coefficient for usual.intake.Potassium -0.22628765
#> 9 All HTN_BIN Coefficient for Current.Smoker 0.60912325
#> 10 All HTN_BIN Coefficient for Former.Smoker 0.24351942
#> 11 All HTN_BIN Coefficient for RIDAGEYR 0.04354076
#> 12 All HTN_BIN Coefficient for RIAGENDR -0.36279833
#> std.error confidence.lower confidence.upper
#> 1 4.732255653 101.47167475 124.63049963
#> 2 1.221397873 -5.34600959 0.63129627
#> 3 0.990866015 -3.01458947 1.83453412
#> 4 0.539864757 -1.04052744 1.60147550
#> 5 0.016996676 0.31238607 0.39556481
#> 6 1.117872687 -9.24986313 -3.77919127
#> 7 0.465241367 -4.02109483 -1.74428560
#> 8 0.131607955 -0.54832072 0.09574542
#> 9 0.231068624 0.04371870 1.17452781
#> 10 0.230022847 -0.31932621 0.80636505
#> 11 0.004818188 0.03175108 0.05533045
#> 12 0.218058347 -0.89636788 0.17077122